This pull request implements memory-mapped access to (larger) reference files in `bwa mem`. The `mmap`s are created as private (copy-on-write), read-only mappings and are locked in memory. This approach has two main advantages over allocating space in memory and `fread`ing the data into it.

The first advantage is that the reference data is automatically shared by all bwa processes using it: by memory-mapping the file, the reference is in effect loaded into a shared memory space.
The second advantage is that the OS will not evict the reference data from memory as soon as the process exits. Instead, it will be kept around until the memory is needed by the system. This means that for multiple invocations of bwa with the same reference, only the first has to wait for the reference to be loaded; subsequent ones re-use the same data and proceed directly to aligning reads. This of course only holds if the system isn't under heavy memory pressure from other processes in the meantime.
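As a rough illustration of the approach described above, here is a minimal sketch of how a reference file could be mapped this way (this is not the code in the pull request; the function name and error handling are placeholders):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a reference file read-only as a private (copy-on-write) region
 * instead of malloc()+fread(). The pages are backed by the page cache,
 * so they are shared between processes and survive process exit. */
static void *map_reference(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return NULL; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }
    *len = (size_t)st.st_size;

    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping keeps the file referenced */
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    return p;
}
```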
To enable memory-mapped reference access, call `bwa mem` with the new `-z` option.

Because the reference is locked in memory, this approach needs more memory on the system than the standard `malloc` & `fread` strategy; if the system can't comfortably fit the entire reference in memory, the OS will refuse to map the file into memory. In that case `bwa mem` will exit with an error message.
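The failure path mentioned above could look roughly like the following sketch of the locking step (again hypothetical, not the actual code in the pull request):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Lock the mapped reference so the OS cannot evict it. If the system
 * can't accommodate the whole reference (or RLIMIT_MEMLOCK is too low),
 * mlock() fails and we exit with an error. */
static void lock_reference_or_die(void *addr, size_t len)
{
    if (mlock(addr, len) != 0) {
        fprintf(stderr, "failed to lock reference in memory: %s\n",
                strerror(errno));
        exit(EXIT_FAILURE);
    }
}
```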
## Comparison to SHM

I originally wrote this code against version 0.7.8, then got sidetracked before I managed to put together the pull request. I see that in the meantime there has been some work towards using POSIX shared memory with a similar objective in mind (sharing references in memory across multiple processes).
While both strategies avoid loading multiple copies of the same reference in memory, I would argue that the one presented in this pull request should be preferred -- or at least co-exist with SHM -- especially for its simplicity. A user need not do anything special to take advantage of the feature other than passing the `-z` option, whereas using SHM objects requires that the user pre-load the references with `bwa shm` before running `bwa mem`, and that they remember to delete them afterwards to free the resources on the system.

Explicitly having to create and delete the shared reference is also problematic and less effective for anyone launching parallel alignment jobs, perhaps through a batch queueing system, where the nodes on which the `bwa` instances will end up running are not easily predicted. In such a scenario, an alignment task should clean up after itself when it finishes, but in doing so it may try to delete a shared memory object that is still being used by other concurrent bwa jobs (not so bad; it should stick around until the last process releases it). Moreover, by deleting the shared memory object when it finishes, subsequent alignment runs using the same reference will not find it in memory and thus will not benefit from caching; they will instead have to spend time reloading it.

## Testing
I'm not sure whether there's an official test suite for contributions. I tested by mapping 200k pairs to the human reference with both SHM and `-z` memory mapping. I compared the outputs by md5 checksum, after removing the `@PG` line, and got identical results.