WIP: Make BAM readers distributable #4

Open
wants to merge 7 commits into base: master

bicycle1885
Member

This makes the BAM reader type distributable for parallel computing.

@TransGirlCodes
Member

So this adds a reader state that is shared among processes?

@bicycle1885
Member Author

No, it makes a reader stateless so that it does not share its state among processes. Sharing a reading position among processes is error-prone and will result in many bugs. My idea is to make all reader types stateless and create a reader's state only when it is needed (e.g. the moment when a reader starts to read a file).
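The idea can be illustrated with a minimal sketch in plain Julia. It is not the BAM reader from this pull request: `StatelessReader` and `eachrecord` are hypothetical names, and a text file with one record per line stands in for a BAM file. The reader stores only the path, and the file handle (and therefore the reading position) exists only while a read is in progress.

```julia
# Hypothetical sketch, not the BioAlignments API.
# The reader holds only the location of the data, so copying it to
# another process copies no reading position.
struct StatelessReader
    path::String
end

# State (the open handle and its position) is created here and is
# discarded when the `open` block finishes.
function eachrecord(f, reader::StatelessReader)
    open(reader.path) do io
        for line in eachline(io)
            f(line)
        end
    end
end

# Stand-in data: a temporary text file with one "record" per line.
path, io = mktemp()
write(io, "r1\nr2\nr3\n")
close(io)

reader = StatelessReader(path)
first_pass = String[]
second_pass = String[]
eachrecord(line -> push!(first_pass, line), reader)
eachrecord(line -> push!(second_pass, line), reader)
```

Because no position is stored in `reader`, the second pass starts from the top of the file again instead of continuing where the first pass stopped.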

@TransGirlCodes
Member

Does this mean that if a reader gets copied to several processes, they will all start from the same place and read the same things? Or, since I know pmap distributes parts of a computation to the available workers, will each reader have its own start and end points on its process?

@bicycle1885
Member Author

I'm going to support both cases in different APIs. Since the BAM reader is stateless, all copies distributed to workers will start reading from the same position. But you often have some intervals you're interested in and will use eachoverlap to do random access. In such a case, for example, the following reader will read its allocated intervals:

# Note that the BAM reader does not open the file yet.
reader = BAM.Reader("somefile.bam")
# Distribute jobs to multiple processes.
pmap(intervals) do interval
    # `eachoverlap` opens the BAM file and returns a stateful iterator.
    for record in eachoverlap(reader, interval)
        # do some work...
    end
end

If you'd like to distribute a job that reads all records from top to bottom, you can use some function that logically splits a BAM file into chunks (say split(reader)). This would be used like this:

reader = BAM.Reader("somefile.bam")
pmap(split(reader)) do reader_part
    # A part of the BAM file will be assigned to a `reader_part` reader.
    for record in reader_part
        # do some work
    end
end
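
As a rough illustration of what "logically splitting" could mean, the sketch below partitions `n` record positions into contiguous, non-overlapping ranges. `chunk_ranges` is a hypothetical helper and not part of this pull request. It ignores the on-disk layout entirely, whereas a real split of a BAM file would presumably have to cut at boundaries of the compressed blocks.

```julia
# Hypothetical helper: split positions 1:n into `nparts` contiguous
# ranges whose sizes differ by at most one.
function chunk_ranges(n::Integer, nparts::Integer)
    bounds = [div(i * n, nparts) for i in 0:nparts]
    return [(bounds[i] + 1):bounds[i + 1] for i in 1:nparts]
end

ranges = chunk_ranges(10, 3)

# Each range could be handed to one worker; `map` stands in for `pmap`
# here so that the sketch runs in a single process.
chunk_sizes = map(length, ranges)
```

Each position falls in exactly one range, so workers that each read only their own range would neither skip nor duplicate records.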

I haven't decided on the exact interfaces yet, but I'll make it easier to use in parallel computing.

@TransGirlCodes
Member

That is really cool. In the first example, I'd have a look at @code_warntype: I found in a recent script that in Julia 0.6, variables captured in closures get boxed even if the type is known and predictable (it's a current Julia bug).
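
A small sketch of the pattern being described, with hypothetical function names and assuming the boxing is triggered by a captured variable that is assigned more than once. Rebinding the variable in a `let` block is a commonly used workaround; whether the compiler actually boxes the variable in the first function should be checked with `@code_warntype` on the Julia version in use.

```julia
# `scale` is assigned twice and then captured by the closure, which is
# the situation in which the captured variable may be boxed.
function sum_scaled_captured(xs)
    scale = 1.0
    scale = 2.0
    return sum(x -> scale * x, xs)
end

# Workaround: `let` introduces a new binding that is assigned once,
# so the closure captures a variable that never changes.
function sum_scaled_let(xs)
    scale = 1.0
    scale = 2.0
    return let scale = scale
        sum(x -> scale * x, xs)
    end
end

captured_result = sum_scaled_captured([1.0, 2.0, 3.0])
let_result = sum_scaled_let([1.0, 2.0, 3.0])

# To inspect the inferred types in the REPL:
#   @code_warntype sum_scaled_captured([1.0, 2.0, 3.0])
```

Both functions return the same value; the difference, if any, shows up only in the inferred types and in performance.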

2 participants