Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accommodating pipes in programs.py #387

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cmarkello
Copy link

@cmarkello cmarkello commented Jul 27, 2016

Adding in support for pipes in docker_call function of 'programs.py' module. This is needed to accommodate more streamlined and efficient runs of bash piped commands that use docker containers.
This is done in one of two ways:

  • If the user provides a list of strings in the 'tool' argument each defining a docker container that is used in each section of the pipe command, then the function will generate the appropriate command that streams docker output through the subsequent steps of the pipe.
  • If the user provides a single string in the 'tool' argument and the command contains a pipe character, then the function will run the entire piped command within a single docker container.

This change is backwards compatible with old usage of docker_call.

@cmarkello
Copy link
Author

@hannes-ucsc — Ready for review

@hannes-ucsc
Copy link
Contributor

Can you describe a use case for the pipe-in-single-container mode? It seems to me that the pipe-of-containers mode is more useful. If we can eliminate the former mode, I might have a suggestion on how to assemble the pipe without passing shell=True.

Note that @jvivian, @arkal and @jpdna had a very difficult time piping out of a container. They had to disable a particular logging configuration in Docker to make it work.

@cmarkello
Copy link
Author

I believe, and I consulted with @jvivian that running a pipe in a container would be more efficient if the pipe only required that one containers set of binaries to run. For instance a lot of my vg pipeline has lots of steps with piped commands that contain only vg calls, it is simply faster to run a pipe command through a vg container than it is to run one part of the pipe through the vg container, pipe the output into the input of another vg container run, and so on and so forth.

Both options are supported by my addition to docker_call regardless, you'd just need to supply a list of docker-container names into the 'tool' parameter to use the pipe-of-containers functionality which is a little redundant if it's just the same docker-container.

I don't know the exact context that @jvivian, @arkal, or @jpdna faced when they had trouble piping out of a container. I believe I've resolved that here as I've tested the functionality on my vg pipeline and it works.

@hannes-ucsc
Copy link
Contributor

Fair enough. We need tests for both modes. The tests should pipe considerable amounts data (tens of GiB) since that is where we saw the issues in the pipe-of-containers mode.

@cmarkello
Copy link
Author

Does the machine that runs the unit tests support tens of GBs of memory? I'm guessing my unit test would need to generate a file of that size during run-time.

@hannes-ucsc
Copy link
Contributor

Does the machine that runs the unit tests support tens of GBs of memory?

No, but the very point of pipes is that their memory overhead is O(1) not O(n), with n being the number of bytes going through them. As we'll see there is a problem with pipes coming out of a docker container, but we'll cross that bridge when we get there.

You also don't need to pre-generate a file since that, again, contradicts the point of pipes. You can read infinite amounts of random data from /dev/urandom. Read up on that and dd, tee and md5sum or md5. You will need to prove in your tests that the data going into a pipe is the same as the data coming out of it, especially if the pipe crosses a container boundary.

Here's an idea:

dd if=/dev/urandom bs=1m count=100 | tee >(md5 1>&2) | gzip | gunzip | md5 1>&2

@cmarkello cmarkello force-pushed the docker_call_pipes branch from 2026a1c to bb7ab62 Compare July 28, 2016 22:48
@cmarkello
Copy link
Author

@hannes-ucsc — Ready for review


Running 'pipe-of-containers' mode for command
'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? A four element array?

It would also be helpful if there was less duplication so the difference stands out more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The - isn't necessary for gzip and gunzip, I think. It also is inconsistent with the example above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The - is necessary in order to have the piping between docker containers
work. The 4 array list is correct; one tool for each sub command in the
pipe. The main use case for it would be to run the pipe within that
container, busy just as an example it shows how the pipe between containers
works.

On Jul 28, 2016 6:38 PM, "Hannes Schmidt" [email protected] wrote:

In src/toil_scripts/lib/programs.py
#387 (comment)
:

 :param bool mock: Whether to run in mock mode. If this variable is unset, its value will be determined by
                   the environment variable.
  • Piping docker command examples:
  •    Running 'pipe-in-single-container' mode for command
    
  •    'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
    
  •        command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
    
  •        work_dir=curr_work_dir
    
  •        docker_tools=['ubuntu']
    
  •        stdout = docker_call(work_dir=docker_work_dir, parameters=command, tool=docker_tools, check_output=True)
    
  •    Running 'pipe-of-containers' mode for command
    
  •    'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
    
  •        command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
    

The - isn't necessary for gzip and gunzip, I think. It also is inconsistent
with the example above.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/BD2KGenomics/toil-scripts/pull/387/files/bb7ab62af239974049ec09a2f835cf348411efc7#r72731513,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AJpTjFjNY3CYDKlRwgB8FqMs4Z8yWYV7ks5qaVmfgaJpZM4JVuTF
.

@hannes-ucsc
Copy link
Contributor

I added and then deleted comments so disregard the emails you may have gotten and instead look at GitHub directly.

Instead of overloading tool as typed list|str we should add a tools= kwarg and then do this:

if bool(tools) == bool(tool):
    raise …
if not tools:
    tools = [ tool ]

It's important that we get the function signature right because it will be hard to change later.

@cmarkello cmarkello force-pushed the docker_call_pipes branch 2 times, most recently from fb79203 to 887292f Compare July 30, 2016 21:33
@cmarkello
Copy link
Author

@hannes-ucsc — Ready for review

@cmarkello
Copy link
Author

@fnothaft -- ready for review

@@ -15,7 +15,8 @@ def mock_mode():
return True if int(os.environ.get('TOIL_SCRIPTS_MOCK_MODE', '0')) else False


def docker_call(tool,
def docker_call(tool='',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should never use mutable types as default value for keyword arguments.

It's also unnecessary, tool and tools should default to None and since bool(None) is False so you should be good for the tests below.

_log.debug("Calling docker with %s." % " ".join(base_docker_call + [tool] + parameters))

docker_call = base_docker_call + [tool] + parameters
if bool(tools) == bool(tool):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use require(). Search the code base or ask @jvivian about it.

try:
if outfile:
subprocess.check_call(docker_call, stdout=outfile)
subprocess.check_call(docker_call, stdout=outfile, shell=shell_flag)
elif check_output and return_stderr:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be rewritten with nested ifs to only examine each flag once.

@hannes-ucsc
Copy link
Contributor

Still -1

return_stderr=True should have the following meaning: The return value is a string that includes the stderr from all the programs in the pipe given via parameters=.

return_stdout=True (currently check_output) should have the following semantics: The return value is a string that includes the stdout of the last program in the pipe given via tools=.

I thought we can make tool='foo' 100% equivalent to tools=['foo'] but I don't think that's the case because the former should cause the container entry point to be overridden. You tried to follow my misguided proposal but that just made it worse, since you had to introduce another flag. Don't map tool onto tools[ tool ] and eliminate that extra flag.

Whatever we do, the signature is going to be ugly which is why I hesitate to merge this PR, even after the other issues are resolved.

There is also the issue that ' '.join()-ing the commands naively will prevent this function from being able to support command arguments with spaces in them.

@fnothaft is the merge master now. He will have to make that decision.

@cmarkello
Copy link
Author

@fnothaft -- ready for review

@fnothaft
Copy link
Contributor

fnothaft commented Aug 3, 2016

I'm in -1 agreement with @hannes-ucsc, for several reasons, primarily:

Whatever we do, the signature is going to be ugly which is why I hesitate to merge this PR, even after the other issues are resolved.

There is also the issue that ' '.join()-ing the commands naively will prevent this function from being able to support command arguments with spaces in them.

RE:

I believe, and I consulted with @jvivian that running a pipe in a container would be more efficient if the pipe only required that one containers set of binaries to run. For instance a lot of my vg pipeline has lots of steps with piped commands that contain only vg calls, it is simply faster to run a pipe command through a vg container than it is to run one part of the pipe through the vg container, pipe the output into the input of another vg container run, and so on and so forth.

This is true, but I don't think that overloading docker_run is the correct approach. Honestly, I would rather:

  • Write the command with pipes that I am running to a temp file (as a bash script, let's say)
  • Mount that file inside the container
  • And then override the entrypoint and run the command via /bin/bash <path_to_file>

This proposed approach is just clunky.

@cmarkello
Copy link
Author

The problem with running commands contained within files that are passed to a mounted directory within a single container is that you can't run a pipeline command that uses multiple tools that are contained in different Docker containers. A way to work around that would be to create a new Docker container that contains all of the necessary tools, but that wouldn't be granular or practical.

@fnothaft
Copy link
Contributor

fnothaft commented Aug 5, 2016

The problem with running commands contained within files that are passed to a mounted directory within a single container is that you can't run a pipeline command that uses multiple tools that are contained in different Docker containers.

Do we currently have a use case for this? I thought the use case we currently had was running multiple vg commands through a pipe. I agree with your point but I prefer to not solve problems before they exist.

@hannes-ucsc
Copy link
Contributor

hannes-ucsc commented Aug 30, 2016

Charlie, right now, you need just one of the two modes, the one where the entire pipeline runs inside a single container, correct? Do you want to take a stab at implementing just that and see if if it is less invasive?

In the long run we need to clean up docker_call. Maybe we can rewrite it in a way that allows for composing multiple calls to docker_call into a pipeline rather than overloading the signature to accommodate the three modes (no pipe, pipe-inside-container, pipe-between-containers). I think this can be done by making docker_call a factory for subprocess.Popen objects.

@cmarkello
Copy link
Author

@hannes-ucsc Yeah I can take a stab at it. It shouldn't be hard since I'm basically just removing the pipe-of-containers functionality.
The one thing though that is nice about the pipe-of-containers is that I don't have to inefficiently create intermediate files in one of my pipe commands which uses multiple docker containers. I do have one use-case in my vg toil-scripts pipeline which needs that functionality for convenience.

@jvivian
Copy link
Collaborator

jvivian commented Aug 30, 2016

Any PRs targeting common lib, like this one, are going to be complicated by DataBiosphere/toil#922

@cmarkello
Copy link
Author

@jvivian Should future PRs like this one be moved to toil-lib instead?

@jvivian
Copy link
Collaborator

jvivian commented Aug 30, 2016

Ya, soon common lib won't exist in toil-scripts and instead will list toil-lib as a dependency. That should happen soon, I just have a couple things blocking me that I'm trying to get resolved.

@hannes-ucsc
Copy link
Contributor

As a hobby, I've been dabbling with a composite approach of constructing a docker client invocation, also using the Builder pattern. Not quite happy with it yet, though.

Adding in docker pipes unit test

Added docs to docker_call and fixed docker_call unit test

Changed datasize for docker_call pipes unit test to 1GB

Polished docker_call functionality

Added stderr to file handle support in docker_call

Factored out pipe_of_containers in docker_call
@cmarkello
Copy link
Author

@hannes-ucsc Ready for review.

@hannes-ucsc
Copy link
Contributor

hannes-ucsc commented Sep 2, 2016

I pushed a preliminary version of my efforts towards an improved Docker wrapper to the new toil-lib project. Take a look at the doctests for pipe_to() as an example of how to pipe containers together without the use of a shell.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants