- Why use streamparse?
- Is streamparse compatible with Python 3?
- How can I contribute to streamparse?
- How do I trigger some code before or after I submit my topology?
- How should I install streamparse on the cluster nodes?
- Should I install Clojure?
- How do I deploy into a VPC?
- How do I override SSH settings?
- How do I dynamically generate the worker list?

Why use streamparse?
--------------------

To lay your Python code out in topologies which can be automatically
parallelized in a Storm cluster of machines. This lets you scale your
computation horizontally and avoid issues related to Python's GIL. See
:ref:`parallelism`.
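
To make that concrete, here is a minimal sketch of a word-count topology,
assuming the Python topology DSL from streamparse 3.x; ``WordSpout``,
``WordCountBolt``, and ``WordCount`` are made-up example names, and ``par``
is the parallelism hint for each component.

.. code-block:: python

    from streamparse import Bolt, Grouping, Spout, Topology


    class WordSpout(Spout):
        outputs = ["word"]

        def next_tuple(self):
            self.emit(["hello"])


    class WordCountBolt(Bolt):
        outputs = ["word", "count"]

        def process(self, tup):
            # A toy "count" that just re-emits each word with a count of 1.
            self.emit([tup.values[0], 1])


    class WordCount(Topology):
        # par is the parallelism hint: how many executors Storm runs for
        # each component, which is how the computation scales horizontally.
        word_spout = WordSpout.spec(par=2)
        count_bolt = WordCountBolt.spec(
            inputs={word_spout: Grouping.fields("word")}, par=4
        )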

Is streamparse compatible with Python 3?
----------------------------------------

Yes, streamparse is fully compatible with Python 3, starting with Python 3.3,
which we use in our unit tests.

How can I contribute to streamparse?
------------------------------------

Please see the CONTRIBUTING document in the streamparse repository on GitHub.

How do I trigger some code before or after I submit my topology?
-----------------------------------------------------------------

After you create a streamparse project using ``sparse quickstart``, you'll
have a ``fabfile.py`` in that directory. In that file, you can specify two
functions (``pre_submit`` and ``post_submit``) which are expected to accept
four arguments:

- ``topology_name``: the name of the topology being submitted
- ``env_name``: the name of the environment where the topology is being
  submitted (e.g. ``"prod"``)
- ``env_config``: the relevant config portion from the ``config.json`` file
  for the environment you are submitting the topology to
- ``options``: the fully resolved Storm options

Here is a sample ``fabfile.py`` file that sends a message to IRC after a
topology is successfully submitted to prod.

.. code-block:: python

    # my_project/fabfile.py
    from __future__ import absolute_import, print_function, unicode_literals

    from my_project import write_to_irc


    def post_submit(topo_name, env_name, env_config, options):
        if env_name == "prod":
            write_to_irc("Deployed {} to {}".format(topo_name, env_name))

How should I install streamparse on the cluster nodes?
-------------------------------------------------------

streamparse assumes your Storm servers have Python, ``pip``, and
``virtualenv`` installed. After that, the installation of all required
dependencies (including streamparse itself) is taken care of via the
``config.json`` file for the streamparse project and the ``sparse submit``
command.
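
If you want to double-check those prerequisites, a small throwaway script
along these lines (not part of streamparse) will report whether the expected
tools are on a server's ``PATH``:

.. code-block:: python

    # Hypothetical one-off check, not part of streamparse: report whether
    # the tools streamparse relies on are available on this server.
    import shutil

    for tool in ("python", "pip", "virtualenv"):
        path = shutil.which(tool)
        print("{}: {}".format(tool, path or "MISSING"))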

Should I install Clojure?
-------------------------

No, the Java requirements for streamparse are identical to those of Storm
itself. Storm requires Java and bundles Clojure as a dependency, so you do
not need to do any separate installation of Clojure. You just need Java on
all Storm servers.

How do I deploy into a VPC?
---------------------------

Update your ``~/.ssh/config`` to use a bastion host inside your VPC for your
commands::

    Host *.internal.example.com
        ProxyCommand ssh bastion.example.com exec nc %h %p

If you don't have a common subdomain, you'll have to list all of the hosts
individually::

    Host host1.example.com
        ProxyCommand ssh bastion.example.com exec nc %h %p

    ...

Then set up your streamparse config to use all of the hosts normally (without
the bastion host).

How do I override SSH settings?
-------------------------------

It is highly recommended that you just modify your ``~/.ssh/config`` file if
you need to tweak settings for setting up the SSH tunnel to your Nimbus
server, but you can also set your SSH password or port in your
``config.json`` by setting the ``ssh_password`` or ``ssh_port`` environment
settings.

.. code-block:: json

    {
        "topology_specs": "topologies/",
        "virtualenv_specs": "virtualenvs/",
        "envs": {
            "prod": {
                "user": "somebody",
                "ssh_password": "THIS IS A REALLY BAD IDEA",
                "ssh_port": 52,
                "nimbus": "streamparse-box",
                "workers": [
                    "streamparse-box"
                ],
                "virtualenv_root": "/data/virtualenvs"
            }
        }
    }

How do I dynamically generate the worker list?
----------------------------------------------

In a small cluster it's sufficient to specify the list of workers in
``config.json``. However, if you have a large or complex environment where
workers are numerous or short-lived, streamparse supports querying the Nimbus
server for a list of hosts. An undefined ``workers`` list (empty or ``None``)
will trigger the lookup; explicitly defined hosts are preferred over a
lookup. Lookups are configured on a per-environment basis, so the ``prod``
environment below uses the dynamic lookup, while ``beta`` will not.

.. code-block:: json

    {
        "topology_specs": "topologies/",
        "virtualenv_specs": "virtualenvs/",
        "envs": {
            "prod": {
                "nimbus": "streamparse-prod",
                "virtualenv_root": "/data/virtualenvs"
            },
            "beta": {
                "nimbus": "streamparse-beta",
                "workers": [
                    "streamparse-beta"
                ],
                "virtualenv_root": "/data/virtualenvs"
            }
        }
    }