mmap'ed access to reference files #40

Open

wants to merge 13 commits into base: master
Conversation

ilveroluca

This pull request implements memory-mapped access to the (large) reference files in bwa mem. The mappings are created private (copy-on-write) and read-only, and are locked in memory. This approach has two main advantages over allocating space and fread-ing the data into it.
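As a rough illustration of what such a mapping involves (this is not the code from the patch; the `map_reference` helper and its names are made up for the example), a private, read-only mapping of a reference file can be created along these lines, with the locking step and its failure handling sketched further below:

```c
/* Minimal sketch of a private, read-only mapping of a reference file.
 * Not the code from this patch; the helper name is illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_reference(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return NULL; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }
    *len = (size_t)st.st_size;

    /* PROT_READ + MAP_PRIVATE gives a read-only, copy-on-write view;
     * the pages are served from the page cache, so every process
     * mapping the same file shares the same physical memory. */
    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the fd is closed */
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    return p;
}
```

A private mapping is enough here because the reference is never written through the mapping, so no copy-on-write copies are actually created.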

The first advantage is that the reference data is automatically shared by all bwa processes using it, since memory-mapping the file loads the reference into memory that is shared across processes.

The second advantage is that the OS will not evict the reference data from memory as soon as the process exits; instead, it is kept around until the memory is needed by the system. This means that, for multiple invocations of BWA with the same reference, only the first has to wait for the reference to be loaded; subsequent ones re-use the cached data and proceed directly to aligning reads. This, of course, only holds if the system isn't under intense memory pressure from other processes in the meantime.

To enable memory-mapped reference access, call bwa mem with the new "-z" option.

Because we're locking the reference in memory, this approach needs more memory on the system than the standard malloc & fread strategy; if the system can't comfortably fit the entire reference in memory, the OS will refuse to map the file into memory. In that case, bwa mem will exit with an error message.
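To make the locking and failure behaviour concrete, here is a hypothetical continuation of the sketch above (again not the patch itself; it simply pins the mapping with mlock and exits with a message on failure, as described above):

```c
/* Continuation of the earlier sketch: pin the mapping in RAM and exit
 * with an error message if the whole reference cannot be locked. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

void *map_reference(const char *path, size_t *len); /* from the sketch above */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <reference-file>\n", argv[0]);
        return 1;
    }

    size_t len;
    void *ref = map_reference(argv[1], &len);
    if (ref == NULL) return 1;

    if (mlock(ref, len) != 0) {
        /* mlock fails (ENOMEM/EAGAIN/EPERM) when there isn't enough
         * lockable memory, e.g. the system is short on RAM or
         * RLIMIT_MEMLOCK is too low; this mirrors the behaviour
         * described above, where bwa mem exits with an error message. */
        perror("mlock");
        fprintf(stderr, "could not lock the reference in memory\n");
        munmap(ref, len);
        return 1;
    }

    /* ... align reads against the locked, shared reference ... */

    munmap(ref, len);
    return 0;
}
```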

Comparison to SHM

I originally wrote this code against version 0.7.8, then got sidetracked before I managed to put together the pull request. In the meantime, I see that there has been some work towards using POSIX shared memory with a similar objective in mind (sharing references in memory across multiple processes).

While both strategies avoid loading multiple copies of the same reference into memory, I would argue that the one presented in this pull request should be preferred -- or at least co-exist -- especially for its simplicity. A user need not do anything special to take advantage of the feature other than passing the "-z" option, whereas using SHM objects requires the user to pre-load the references with bwa shm before running bwa mem, and to remember to delete them afterwards to free the resources on the system.

Explicitly having to create and delete the shared reference is also problematic and less effective for anyone launching parallel alignment jobs, perhaps through a batch queueing system, where the nodes on which the bwa instances will end up running are not easily predicted. In such a scenario, an alignment task should clean up after itself when it finishes, but in doing so it may try to delete a shared memory object that is still being used by other concurrent bwa jobs (not so bad; it sticks around until the last process releases it). Moreover, by deleting the shared memory object when it finishes, subsequent alignment runs using the same reference will not find it in memory and thus will not benefit from caching effects; they will instead have to spend time reloading it.

Testing

I'm not sure whether there's an official test suite for contributions. I tested by mapping 200k read pairs to the human reference with both SHM and -z memory mapping. I compared the outputs by md5 checksum after removing the @PG line and got identical results.

@luispedro

+1 for this patch.

For many use cases, it has all the advantages of bwa shm without any of the drawbacks (bwa shm requires manual administration of memory, which limits its usefulness).
