
Added study distribution #144

Open · wants to merge 1 commit into main

Conversation

dogversioning
Contributor

This PR adds a mechanism for shipping prepared studies from github directly to sites, by adding a new lambda and an SNS queue with cross-account subscription permissions.
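
For context, "cross-account subscription permissions" boils down to a resource policy on the topic that lets each site account create its own subscription. A rough boto3 sketch of the idea, independent of how the template actually wires it up (the account IDs and ARN below are placeholders, not the real values):

```python
import json

import boto3

sns = boto3.client("sns")

# Placeholder ARN/account ID for illustration only
topic_arn = "arn:aws:sns:us-east-1:111111111111:study-distribution.fifo"
site_account_id = "222222222222"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSiteSubscribe",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{site_account_id}:root"},
            "Action": ["sns:Subscribe"],
            "Resource": topic_arn,
        }
    ],
}

# Attach the policy so the site account can subscribe its own lambda/queue
sns.set_topic_attributes(
    TopicArn=topic_arn,
    AttributeName="Policy",
    AttributeValue=json.dumps(policy),
)
```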


github-actions bot commented Dec 17, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| ----- | ------- | -------- | --------- | ------ |
| 840   | 820     | 98%      | 90%       | 🟢     |

New Files

| File | Coverage | Status |
| ---- | -------- | ------ |
| src/dashboard/post_distribute/__init__.py | 100% | 🟢 |
| src/dashboard/post_distribute/post_distribute.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

No covered modified files...

updated for commit: 92c0439 by action🐍

Contributor Author

This Dockerfile lives where it does so that it can correctly access the contents of the shared folder. I'm hoping this will be the only one we need to use in this way.

Comment on lines +9 to +12
# Force setup of some initial matplotlib configuration artifacts
RUN mkdir /tmp/matlplotlib
ENV MPLCONFIGDIR=/tmp/matlplotlib
RUN cumulus-library version
Contributor Author

This was stupid enough that it almost caused me to pull matplotlib from the library and move the PSM charting to a different library. I may still do this.

Otherwise - matplotlib freaks out over not having home access and tries to find a new place to put things, and this often blows our 10 second window for initializing the lambda.

Comment on lines +752 to +753
ImageConfig:
Command: ["post_distribute.distribute_handler"]
Contributor Author

This (and the contents of the meta fields below) are the bits that switch the lambda from the normal execution environment to a container-based execution.

Type: AWS::SNS::Topic
Properties:
ContentBasedDeduplication: True
FifoTopic: True
Contributor Author

This addition does two things:

  • ensures that a message won't be delivered twice
  • ensures delivery order to all consumers (see the publish sketch below)
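
On the publish side, these settings mean every `sns.publish` needs a `MessageGroupId`, while the dedup ID is derived by hashing the message body. A rough boto3 sketch (topic ARN and group ID are placeholders):

```python
import json

import boto3

sns = boto3.client("sns")

sns.publish(
    # Placeholder FIFO topic ARN for illustration
    TopicArn="arn:aws:sns:us-east-1:111111111111:study-distribution.fifo",
    Message=json.dumps({"study": "example_study"}),
    # Required for FIFO topics; ordering is preserved within a group
    MessageGroupId="study-distribution",
    # No MessageDeduplicationId needed: ContentBasedDeduplication hashes the body
)
```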

@@ -46,7 +47,8 @@ test = [
"moto[s3,athena,sns] == 4.1.5",
"pytest",
"pytest-cov",
"pytest-mock"
"pytest-mock",
"responses"
Contributor

nit: toml can have trailing commas, FYI

Contributor Author

something something ruff but for toml

200,
),
("invalid_study", "https://github.com/smart-on-fhir/invalid_study", 500),
("non_cumulus_repo", "https://github.com/user/non_cumulus_repo", 500),
Contributor

This line is trying to test that a ValueError is thrown when no smart-on-fhir string is detected, yeah? Can this test code distinguish from the case where the code does try to hit that URL but no responses mock has been set up for it?

Contributor Author

probably with some work.

this would be easier if we could use something other than the generic error handler, but security has requested these endpoints be extremely vague in their error reporting until the dashboard is deployed inside AWS.
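
If it turns out to be worth the effort later: `responses` records every call it intercepts, so the test could assert both the 500 and that no HTTP request was ever attempted. A rough sketch (the import path, event shape, and handler invocation here are assumptions, not the actual test code):

```python
import json

import responses

from src.dashboard.post_distribute import post_distribute


@responses.activate
def test_rejected_url_never_hits_github():
    # Nothing registered via responses.add(), so a real fetch attempt would
    # raise a ConnectionError rather than coming back as our generic 500.
    event = {
        "body": json.dumps(
            {"study": "non_cumulus_repo", "url": "https://github.com/user/non_cumulus_repo"}
        )
    }
    res = post_distribute.distribute_handler(event, {})
    assert res["statusCode"] == 500
    # Distinguish "validation failed" from "fetch failed": no calls recorded
    assert len(responses.calls) == 0
```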


import boto3
import requests
from cumulus_library import cli
Contributor

This does elevate cli.main to "API we have to support" whereas calling the process would rely on an already-supported interface (the shell CLI).

Granted, it's a small API - but maybe worth a comment next time you're in the Library like "this is not official public API, so we don't expose it in __init__.py, but the aggregator uses it, so don't change or move this". Or switch to calling subprocess.run.

Contributor Author

good point - i'll switch to the latter.
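
For reference, the subprocess version might look roughly like this (the `build` subcommand is an assumption; the flags mirror the args already passed to `cli.main` in this PR):

```python
import subprocess

# Call the already-supported shell CLI instead of cumulus_library.cli.main()
subprocess.run(
    [
        "cumulus-library",
        "build",
        "-s", f"{BASE_DIR}/studies",
        "--prepare", f"{BASE_DIR}/prepared",
    ],
    check=True,  # surface a non-zero exit code as an exception in the lambda
)
```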

Comment on lines +10 to +11
# lambda performance tuning - moving these outside mean that they will not
# contribute to the lambda init window
Contributor

clarification: "these" is referring to initializing the boto3 SNS client and "outside" means outside a function, at the global/module level?

Contributor Author

yes - let me reword/add a link to the docs
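
i.e. roughly this pattern, where the client is created once at import time and reused across warm invocations (a generic sketch, not the exact code in this PR; the env var name is hypothetical):

```python
import os

import boto3

# Module scope: runs once per lambda execution environment (cold start),
# then is reused by every warm invocation instead of being rebuilt per call.
sns_client = boto3.client("sns")
TOPIC_SNS_ARN = os.environ.get("TOPIC_SNS_ARN", "")


def distribute_handler(event, context):
    # ... validate the request and prepare the study, then publish
    sns_client.publish(
        TopicArn=TOPIC_SNS_ARN,
        Message="...",
        MessageGroupId="study-distribution",
    )
```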

@@ -0,0 +1,89 @@
import json
Contributor

Maybe there deserves to be a file-level docstring that explains what this handler is about / why it exists / explain some design choices. I don't see that kind of "explainer for dummies / Mike" elsewhere in function docstrings or a markdown file.

Like, I think I get it, just saying. My rough mental summary: this is a lambda that takes an indicated study, finds it in github, runs its prepare step to generate the study SQL, then sends that SQL to an endpoint on the site side. But also some words about why we even do this setup might be nice.

Contributor Author

I can add this, but this is still very POC-y, so expect it to not be final
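
Something as short as a module docstring along these lines would probably cover it (wording is just a sketch based on the summary above, not final text):

```python
"""Handler for distributing prepared studies to remote sites.

Flow: receive a study name and GitHub URL from the dashboard API, fetch the
study from GitHub, run cumulus-library's prepare step to generate the study
SQL, and publish the result to an SNS topic that site accounts subscribe to.

This is a secondary distribution path, reserved for studies we author or
approve ourselves; the primary path is the workflow/query builder output.
"""
```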


def prepare_study(body: dict):
write_path = pathlib.Path(f"{BASE_DIR}/prepared")
write_path.mkdir(parents=True, exist_ok=True)
Contributor

nit: might be nice if the Library did this for you.

Contributor Author

yeah... won't fix now, but i added an issue for it.

"-s",
f"{BASE_DIR}/studies",
"--prepare",
f"{BASE_DIR}/prepared",
Contributor

nit: should use the write_path variable here and below
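
i.e. roughly (sketch only; `study_path` is a hypothetical companion to `write_path` for the studies directory):

```python
import pathlib

# Build both locations as pathlib objects and reuse them in the CLI args
study_path = pathlib.Path(f"{BASE_DIR}/studies")
write_path = pathlib.Path(f"{BASE_DIR}/prepared")

args = [
    "-s", str(study_path),
    "--prepare", str(write_path),
]
```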



def get_study_from_github(url):
if "smart-on-fhir" not in url:
Contributor

nit: this is a weak check if malice is a concern - could put that string in the repo name, not the org name.

This also doesn't validate that we are hitting github.com (which might be good to do even if you aren't worried about malice but just misconfiguration/typos).

Or whether URL params exist already.

Maybe a regex check? Or parse the URL with urllib.parse and check more of the fields?

Contributor Author

Yeah good point - lemme take a robustness swing at this.
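
For instance, something along these lines (a sketch of the stricter check with `urllib.parse`; the function name is hypothetical, not the final implementation):

```python
from urllib.parse import urlparse


def validate_study_url(url: str) -> None:
    """Reject anything that isn't a plain https://github.com/smart-on-fhir/<repo> URL."""
    parsed = urlparse(url)
    path_parts = [part for part in parsed.path.split("/") if part]
    if (
        parsed.scheme != "https"
        or parsed.netloc != "github.com"
        or parsed.query
        or parsed.fragment
        or len(path_parts) < 2
        or path_parts[0] != "smart-on-fhir"
    ):
        raise ValueError(f"{url} is not an approved study repository")
```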

if res.status_code != 200:
raise ValueError(f"{url} is not a valid git repository")
files = res.json()["tree"]
for file in files:
Contributor

Could you just replace most of this function with a call to "git clone {url}"? And then not have to hit different hardcoded github APIs and write files manually.

Now that we have more space via docker images, there is probably room for git?

Contributor Author

We certainly can squeeze git in here now.
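
With git available in the image, the fetch could collapse to roughly this (sketch; `study_name` and the destination path are illustrative):

```python
import subprocess

# Shallow clone of the repository tip into the lambda's writable space,
# replacing the hardcoded GitHub API calls and manual file writes.
subprocess.run(
    ["git", "clone", "--depth", "1", url, f"{BASE_DIR}/studies/{study_name}"],
    check=True,
)
```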



def get_study_from_github(url):
if "smart-on-fhir" not in url:
Contributor

Separately from the validation side of things - this is an interesting distribution mechanism. Not bad, but maybe some discussion of the tradeoffs here (and/or maybe a comment about them).

Other options brainstorming:

  • A pip-installable package name (pros: easier for python folks to understand, keeps us honest about having the pip package in working order, allows us to version releases, cons: more hassle for authors, harder to check ownership is ours)
  • Allow pointing at any git URL, not just github (pros: less logic here, more flexibility in future, can point to unofficial studies, can point to URLs with params/odd paths like versioned tags, cons: no ownership checking)
    • we can still get some of these benefits if we don't require the URL to be in a strict format - i.e. if we just pass the URL to git clone, after we do ownership validation
  • Allow pointing at a zip file URL (which github does generate for all repos) (pros: not tied to git 🤷 which only really matters if we want ownership flexibility, can do ownership validation by looking at URL, allows for versioning)

I'm especially interested in the options/choices around versioning, because currently this is just gulping from the tip of the git repo, yeah? That maybe has some downstream effects on our dev process.

Also interested whether ownership checking was for malice reasons or just something you were doing to check for misconfiguration/typos. Cause that affects the above pro/con weighting obviously.

Contributor Author

So here are the future use cases for this:

  • This is a second tier distribution mechanism. The primary one, parsing workflows/query builder output, is more complex. This is a fallback for 'we need something that workflow hasn't implemented yet and you/me/Andy set it up manually'.
    • This is basically only something that we are allowed to do. If we had a third party repo, we could fork it and release it that way, but the contract with distributed sites is 'any studies that are distributed in this fashion are ones we have authored/approved to prevent PHI/LDS leakage'.

I can see a case for versioning here, but for simplicity, I'd probably argue for 'get the most recent tag' or 'explicitly get a tag of some name'. We could do it via pip? I'm mildly agnostic to it, but I think that means more intervention from the engineering side of things when a study author wants to push one of these things out. That may be unavoidable.
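
If we land on the tag route, the clone sketch above extends naturally; a rough version (tag, url, and destination are placeholders):

```python
import subprocess

# Pin the distribution to a named release tag instead of the repo tip
subprocess.run(
    ["git", "clone", "--depth", "1", "--branch", tag, url, f"{BASE_DIR}/studies/{study_name}"],
    check=True,
)

# Or, to discover the available tags without cloning:
tags = subprocess.run(
    ["git", "ls-remote", "--tags", url],
    capture_output=True, text=True, check=True,
).stdout
```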
