Added study distribution #144
base: main
Conversation
☂️ Python Coverage: no covered modified files.
Force-pushed from 1fe6c8e to 92c0439.
This Dockerfile lives where it does so it can correctly access the contents of the shared folder. I'm hoping this will be the only one we need to use in this way.
# Force setup of some initial matplotlib configuration artifacts
RUN mkdir /tmp/matlplotlib
ENV MPLCONFIGDIR=/tmp/matlplotlib
RUN cumulus-library version
This was stupid enough that it almost caused me to pull matplotlib from the library and move the PSM charting to a different library. I may still do this.
Otherwise - matplotlib freaks out over not having home access and tries to find a new place to put things, and this often blows our 10 second window for initializing the lambda.
ImageConfig:
  Command: ["post_distribute.distribute_handler"]
This (and the contents of the meta fields below) are the bits that switch the lambda from the normal execution environment to a container-based execution.
Type: AWS::SNS::Topic
Properties:
  ContentBasedDeduplication: True
  FifoTopic: True
This addition does two things:
- ensures that a message won't be delivered twice
- ensures delivery order to all consumers
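For the deduplication half of that: with `ContentBasedDeduplication` enabled, SNS derives the FIFO deduplication ID from a SHA-256 hash of the message body, so identical publishes within the deduplication window collapse to a single delivery. A small sketch of that derivation (the function name is ours, not an AWS API):

```python
import hashlib


def content_dedup_id(message_body: str) -> str:
    # With ContentBasedDeduplication on, SNS hashes the message body
    # (SHA-256) to produce the deduplication ID, so two byte-identical
    # publishes inside the dedup window deliver only once.
    return hashlib.sha256(message_body.encode("utf-8")).hexdigest()
```

Two distinct payloads hash differently and both get delivered; a retry of the same payload does not.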
@@ -46,7 +47,8 @@ test = [
     "moto[s3,athena,sns] == 4.1.5",
     "pytest",
     "pytest-cov",
-    "pytest-mock"
+    "pytest-mock",
+    "responses"
nit: toml arrays can have trailing commas, FYI
something something ruff but for toml
        200,
    ),
    ("invalid_study", "https://github.com/smart-on-fhir/invalid_study", 500),
    ("non_cumulus_repo", "https://github.com/user/non_cumulus_repo", 500),
This line is trying to test that a ValueError is thrown when no smart-on-fhir string is detected, yeah? Can this test code distinguish that from the case where the code does try to hit that URL but no responses mock has been set up for it?
probably with some work.
this would be easier if we could use something other than the generic error handler, but security has requested these endpoints be extremely vague in their error reporting until the dashboard is deployed inside AWS.
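One way to make the distinction testable even with a vague error handler is to inject the network call and assert it was never reached. A sketch with `unittest.mock`, using a hypothetical stripped-down version of the handler's validation (function names are ours):

```python
from unittest import mock


def get_study(url: str, fetch):
    # Hypothetical simplified validation: reject before any network
    # traffic if the expected org string is missing from the URL.
    if "smart-on-fhir" not in url:
        raise ValueError(f"{url} is not a valid repository")
    return fetch(url)


def rejected_before_fetch(url: str) -> bool:
    fetch = mock.Mock()
    try:
        get_study(url, fetch)
    except ValueError:
        pass
    # If the mock was never called, the failure came from validation,
    # not from an unmocked (or failing) HTTP request.
    return fetch.call_count == 0
```

The generic 500 stays opaque to callers, but the test can still assert internally that no request left the function.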
import boto3
import requests
from cumulus_library import cli
This does elevate cli.main to "API we have to support", whereas calling the process would rely on an already-supported interface (the shell CLI).
Granted, it's a small API - but maybe worth a comment next time you're in the Library like "this is not official public API, so we don't expose it in __init__.py, but the aggregator uses it, so don't change or move this". Or switch to calling subprocess.run.
good point - i'll switch to the latter.
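A sketch of what that switch might look like. The `-s`/`--prepare` flags mirror the arguments passed to `cli.main` elsewhere in this PR; the `build` subcommand name is an assumption here, not confirmed by the diff:

```python
import subprocess


def build_prepare_command(study_dir: str, prepared_dir: str) -> list[str]:
    # Flags mirror the cli.main call in this PR; "build" as the
    # subcommand is an assumption for this sketch.
    return [
        "cumulus-library",
        "build",
        "-s",
        study_dir,
        "--prepare",
        prepared_dir,
    ]


def run_prepare(study_dir: str, prepared_dir: str) -> None:
    # check=True raises CalledProcessError on a non-zero exit, roughly
    # matching the failure signal an exception out of cli.main would give.
    subprocess.run(build_prepare_command(study_dir, prepared_dir), check=True)
```

Splitting command construction from execution keeps the invocation testable without actually shelling out.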
# lambda performance tuning - moving these outside mean that they will not
# contribute to the lambda init window
clarification: "these" is referring to initializing the boto3 SNS client and "outside" means outside a function, at the global/module level?
yes - let me reword/add a link to the docs
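The pattern under discussion, sketched with a stand-in for the boto3 client so the reuse is visible (names are illustrative, not from the PR):

```python
INIT_COUNT = 0


def make_client():
    """Stand-in for boto3.client("sns"); counting calls makes reuse visible."""
    global INIT_COUNT
    INIT_COUNT += 1
    return object()


# Module scope: runs once, during the Lambda init phase (cold start),
# rather than on every invocation inside the handler body.
SNS_CLIENT = make_client()


def distribute_handler(event, context=None):
    # Each invoke reuses the already-constructed module-level client.
    return SNS_CLIENT
```

Every warm invocation then skips client construction entirely, which is the "lambda performance tuning" the comment refers to.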
@@ -0,0 +1,89 @@
import json
Maybe there deserves to be a file-level docstring that explains what this handler is about / why it exists / explain some design choices. I don't see that kind of "explainer for dummies / Mike" elsewhere in function docstrings or a markdown file.
Like, I think I get it, just saying. My rough mental summary: this is a lambda that takes an indicated study, finds it in github, runs its prepare step to generate the study SQL, then sends that SQL to an endpoint on the site side. But also some words about why we even do this setup might be nice.
I can add this, but this is still very POC-y, so expect it to not be final
def prepare_study(body: dict):
    write_path = pathlib.Path(f"{BASE_DIR}/prepared")
    write_path.mkdir(parents=True, exist_ok=True)
nit: might be nice if the Library did this for you.
yeah... won't fix now, but i added an issue for it.
        "-s",
        f"{BASE_DIR}/studies",
        "--prepare",
        f"{BASE_DIR}/prepared",
nit: should use the write_path variable here and below
def get_study_from_github(url):
    if "smart-on-fhir" not in url:
nit: this is a weak check if malice is a concern - could put that string in the repo name, not the org name.
This also doesn't validate that we are hitting github.com (which might be good to do even if you aren't worried about malice but just misconfiguration/typos).
Or whether URL params exist already.
Maybe a regex check? Or parse the URL with urllib.parse and check more of the fields?
Yeah good point - lemme take a robustness swing at this.
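One possible shape for that robustness swing, using `urllib.parse` as suggested: pin the scheme and host, require smart-on-fhir as the org path segment rather than a substring anywhere, and reject query strings and fragments. Entirely hypothetical code, not the PR's implementation:

```python
from urllib.parse import urlparse


def validate_study_url(url: str) -> str:
    """Hypothetical stricter check; returns the repo name on success."""
    parts = urlparse(url)
    if parts.scheme != "https" or parts.netloc != "github.com":
        raise ValueError(f"{url} is not a github.com URL")
    # Require smart-on-fhir as the *org* segment, so a repo named
    # "smart-on-fhir" under another org does not slip through.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "smart-on-fhir":
        raise ValueError(f"{url} is not a smart-on-fhir repository")
    if parts.query or parts.fragment:
        raise ValueError(f"{url} has unexpected URL parameters")
    return segments[1]
```

This closes the repo-name loophole and also catches the misconfiguration cases (wrong host, stray params) mentioned above.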
    if res.status_code != 200:
        raise ValueError(f"{url} is not a valid git repository")
    files = res.json()["tree"]
    for file in files:
Could you just replace most of this function with a call to "git clone {url}"? And then not have to hit different hardcoded github APIs and write files manually. Now that we have more space via docker images, there is probably room for git?
We certainly can squeeze git in here now.
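A sketch of the clone-based replacement, assuming the git binary is baked into the image (the injectable `run` parameter is just there so the command can be exercised offline; it is not part of the PR):

```python
import subprocess


def clone_study(url: str, dest: str, run=subprocess.run) -> list[str]:
    # A shallow clone of the default branch replaces the hand-rolled
    # GitHub tree/contents API walk and per-file writes. Requires git
    # in the image, which the container-based lambda now makes plausible.
    cmd = ["git", "clone", "--depth", "1", url, dest]
    run(cmd, check=True)
    return cmd
```

`--depth 1` keeps the download to just the tip of the repo, which matches what the current API-walking code fetches anyway.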
def get_study_from_github(url):
    if "smart-on-fhir" not in url:
Separately from the validation side of things - this is an interesting distribution mechanism. Not bad, but maybe some discussion of the tradeoffs here (and/or maybe a comment about them).
Other options brainstorming:
- A pip-installable package name (pros: easier for python folks to understand, keeps us honest about having the pip package in working order, allows us to version releases, cons: more hassle for authors, harder to check ownership is ours)
- Allow pointing at any git URL, not just github (pros: less logic here, more flexibility in future, can point to unofficial studies, can point to URLs with params/odd paths like versioned tags, cons: no ownership checking)
- we can still get some of these benefits if we don't require the URL to be in a strict format - i.e. if we just pass the URL to git clone, after we do ownership validation
- Allow pointing at a zip file URL (which github does generate for all repos) (pros: not tied to git 🤷 which only really matters if we want ownership flexibility, can do ownership validation by looking at URL, allows for versioning)
I'm especially interested in the options/choices around versioning, because currently this is just gulping from the tip of the git repo, yeah? That maybe has some downstream effects on our dev process.
Also interested whether ownership checking was for malice reasons or just something you were doing to check for misconfiguration/typos. Cause that affects the above pro/con weighting obviously.
So here are the future use cases for this:
- This is a second tier distribution mechanism. The primary one, parsing workflows/query builder output, is more complex. This is a fallback for 'we need something that workflow hasn't implemented yet and you/me/Andy set it up manually'.
- This is basically only something that we are allowed to do. If we had a third party repo, we could fork it and release it that way, but the contract with distributed sites is 'any studies that are distributed in this fashion are ones we have authored/approved to prevent PHI/LDS leakage'.
I can see a case for versioning here, but for simplicity, I'd probably argue for 'get the most recent tag' or 'explicitly get a tag of some name'. We could do it via pip? I'm mildly agnostic to it, but I think that means more intervention from the engineering side of things when a study author wants to push one of these things out. That may be unavoidable.
This PR adds a mechanism for shipping prepared studies from github directly to sites, by adding a new lambda and an SNS topic with cross-account subscription permissions.