
Added study distribution #144

Open · wants to merge 1 commit into main

Conversation

dogversioning
Contributor

This PR adds a mechanism for shipping prepared studies from github directly to sites, by adding a new lambda and an SNS queue with cross-account subscription permissions.
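
For context, "cross-account subscription permissions" boils down to a resource policy on the topic that lets each site account create its own subscription. A rough boto3 sketch of the idea, independent of how the template actually wires it up (the account IDs and ARN below are placeholders, not the real values):

```python
import json

import boto3

sns = boto3.client("sns")

# Placeholder ARN/account ID for illustration only
topic_arn = "arn:aws:sns:us-east-1:111111111111:study-distribution.fifo"
site_account_id = "222222222222"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSiteSubscribe",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{site_account_id}:root"},
            "Action": ["sns:Subscribe"],
            "Resource": topic_arn,
        }
    ],
}

# Attach the policy so the site account can subscribe its own lambda/queue
sns.set_topic_attributes(
    TopicArn=topic_arn,
    AttributeName="Policy",
    AttributeValue=json.dumps(policy),
)
```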


github-actions bot commented Dec 17, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| ----- | ------- | -------- | --------- | ------ |
| 840   | 820     | 98%      | 90%       | 🟢     |

New Files

| File | Coverage | Status |
| ---- | -------- | ------ |
| src/dashboard/post_distribute/__init__.py | 100% | 🟢 |
| src/dashboard/post_distribute/post_distribute.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

No covered modified files...

updated for commit: 92c0439 by action🐍

Contributor Author

This Dockerfile lives where it does so that it can correctly access the contents of the shared folder. I'm hoping this will be the only one we need to use in this way.

Comment on lines +9 to +12
# Force setup of some initial matplotlib configuration artifacts
RUN mkdir /tmp/matlplotlib
ENV MPLCONFIGDIR=/tmp/matlplotlib
RUN cumulus-library version
Contributor Author

This was stupid enough that it almost caused me to pull matplotlib from the library and move the PSM charting to a different library. I may still do this.

Otherwise - matplotlib freaks out over not having home access and tries to find a new place to put things, and this often blows our 10 second window for initializing the lambda.

Comment on lines +752 to +753
ImageConfig:
Command: ["post_distribute.distribute_handler"]
Contributor Author

This (and the contents of the meta fields below) are the bits that switch the lambda from the normal execution environment to a container-based execution.

Type: AWS::SNS::Topic
Properties:
ContentBasedDeduplication: True
FifoTopic: True
Contributor Author

This addition does two things:

  • ensures that a message won't be delivered twice
  • ensures delivery order to all consumers (see the publish sketch below)
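
On the publish side, these settings mean every `sns.publish` needs a `MessageGroupId`, while the dedup ID is derived by hashing the message body. A rough boto3 sketch (topic ARN and group ID are placeholders):

```python
import json

import boto3

sns = boto3.client("sns")

sns.publish(
    # Placeholder FIFO topic ARN for illustration
    TopicArn="arn:aws:sns:us-east-1:111111111111:study-distribution.fifo",
    Message=json.dumps({"study": "example_study"}),
    # Required for FIFO topics; ordering is preserved within a group
    MessageGroupId="study-distribution",
    # No MessageDeduplicationId needed: ContentBasedDeduplication hashes the body
)
```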

@@ -46,7 +47,8 @@ test = [
"moto[s3,athena,sns] == 4.1.5",
"pytest",
"pytest-cov",
"pytest-mock"
"pytest-mock",
"responses"
Contributor

nit: toml can have trailing commas, FYI

Contributor Author

something something ruff but for toml

200,
),
("invalid_study", "https://github.com/smart-on-fhir/invalid_study", 500),
("non_cumulus_repo", "https://github.com/user/non_cumulus_repo", 500),
Contributor

This line is trying to test that a ValueError is thrown when no smart-on-fhir string is detected, yeah? Can this test code distinguish from the case where the code does try to hit that URL but no responses mock has been set up for it?

Contributor Author

probably with some work.

this would be easier if we could use something other than the generic error handler, but security has requested these endpoints be extremely vague in their error reporting until the dashboard is deployed inside AWS.
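
If it turns out to be worth the effort later: `responses` records every call it intercepts, so the test could assert both the 500 and that no HTTP request was ever attempted. A rough sketch (the import path, event shape, and handler invocation here are assumptions, not the actual test code):

```python
import json

import responses

from src.dashboard.post_distribute import post_distribute


@responses.activate
def test_rejected_url_never_hits_github():
    # Nothing registered via responses.add(), so a real fetch attempt would
    # raise a ConnectionError rather than coming back as our generic 500.
    event = {
        "body": json.dumps(
            {"study": "non_cumulus_repo", "url": "https://github.com/user/non_cumulus_repo"}
        )
    }
    res = post_distribute.distribute_handler(event, {})
    assert res["statusCode"] == 500
    # Distinguish "validation failed" from "fetch failed": no calls recorded
    assert len(responses.calls) == 0
```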


import boto3
import requests
from cumulus_library import cli
Contributor

This does elevate cli.main to "API we have to support" whereas calling the process would rely on an already-supported interface (the shell CLI).

Granted, it's a small API - but maybe worth a comment next time you're in the Library like "this is not official public API, so we don't expose it in __init__.py, but the aggregator uses it, so don't change or move this". Or switch to calling subprocess.run.

Contributor Author

good point - i'll switch to the latter.
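
For reference, the subprocess version might look roughly like this (the `build` subcommand is an assumption; the flags mirror the args already passed to `cli.main` in this PR):

```python
import subprocess

# Call the already-supported shell CLI instead of cumulus_library.cli.main()
subprocess.run(
    [
        "cumulus-library",
        "build",
        "-s", f"{BASE_DIR}/studies",
        "--prepare", f"{BASE_DIR}/prepared",
    ],
    check=True,  # surface a non-zero exit code as an exception in the lambda
)
```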

Comment on lines +10 to +11
# lambda performance tuning - moving these outside mean that they will not
# contribute to the lambda init window
Contributor

clarification: "these" is referring to initializing the boto3 SNS client and "outside" means outside a function, at the global/module level?

Contributor Author

yes - let me reword/add a link to the docs
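
i.e. roughly this pattern, where the client is created once at import time and reused across warm invocations (a generic sketch, not the exact code in this PR; the env var name is hypothetical):

```python
import os

import boto3

# Module scope: runs once per lambda execution environment (cold start),
# then is reused by every warm invocation instead of being rebuilt per call.
sns_client = boto3.client("sns")
TOPIC_SNS_ARN = os.environ.get("TOPIC_SNS_ARN", "")


def distribute_handler(event, context):
    # ... validate the request and prepare the study, then publish
    sns_client.publish(
        TopicArn=TOPIC_SNS_ARN,
        Message="...",
        MessageGroupId="study-distribution",
    )
```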

@@ -0,0 +1,89 @@
import json
Contributor

Maybe there deserves to be a file-level docstring that explains what this handler is about / why it exists / explain some design choices. I don't see that kind of "explainer for dummies / Mike" elsewhere in function docstrings or a markdown file.

Like, I think I get it, just saying. My rough mental summary: this is a lambda that takes an indicated study, finds it in github, runs its prepare step to generate the study SQL, then sends that SQL to an endpoint on the site side. But also some words about why we even do this setup might be nice.

Contributor Author

I can add this, but this is still very POC-y, so expect it to not be final
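
Something as short as a module docstring along these lines would probably cover it (wording is just a sketch based on the summary above, not final text):

```python
"""Handler for distributing prepared studies to remote sites.

Flow: receive a study name and GitHub URL from the dashboard API, fetch the
study from GitHub, run cumulus-library's prepare step to generate the study
SQL, and publish the result to an SNS topic that site accounts subscribe to.

This is a secondary distribution path, reserved for studies we author or
approve ourselves; the primary path is the workflow/query builder output.
"""
```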


def prepare_study(body: dict):
write_path = pathlib.Path(f"{BASE_DIR}/prepared")
write_path.mkdir(parents=True, exist_ok=True)
Contributor

nit: might be nice if the Library did this for you.

Contributor Author

yeah... won't fix now, but i added an issue for it.

"-s",
f"{BASE_DIR}/studies",
"--prepare",
f"{BASE_DIR}/prepared",
Contributor

nit: should use the write_path variable here and below
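
i.e. roughly (sketch only; `study_path` is a hypothetical companion to `write_path` for the studies directory):

```python
import pathlib

# Build both locations as pathlib objects and reuse them in the CLI args
study_path = pathlib.Path(f"{BASE_DIR}/studies")
write_path = pathlib.Path(f"{BASE_DIR}/prepared")

args = [
    "-s", str(study_path),
    "--prepare", str(write_path),
]
```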



def get_study_from_github(url):
if "smart-on-fhir" not in url:
Contributor

nit: this is a weak check if malice is a concern - could put that string in the repo name, not the org name.

This also doesn't validate that we are hitting github.com (which might be good to do even if you aren't worried about malice but just misconfiguration/typos).

Or whether URL params exist already.

Maybe a regex check? Or parse the URL with urllib.parse and check more of the fields?

Contributor Author

Yeah good point - lemme take a robustness swing at this.
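
For instance, something along these lines (a sketch of the stricter check with `urllib.parse`; the function name is hypothetical, not the final implementation):

```python
from urllib.parse import urlparse


def validate_study_url(url: str) -> None:
    """Reject anything that isn't a plain https://github.com/smart-on-fhir/<repo> URL."""
    parsed = urlparse(url)
    path_parts = [part for part in parsed.path.split("/") if part]
    if (
        parsed.scheme != "https"
        or parsed.netloc != "github.com"
        or parsed.query
        or parsed.fragment
        or len(path_parts) < 2
        or path_parts[0] != "smart-on-fhir"
    ):
        raise ValueError(f"{url} is not an approved study repository")
```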

if res.status_code != 200:
raise ValueError(f"{url} is not a valid git repository")
files = res.json()["tree"]
for file in files:
Contributor

Could you just replace most of this function with a call to "git clone {url}"? And then not have to hit different hardcoded github APIs and write files manually.

Now that we have more space via docker images, there is probably room for git?

Contributor Author

We certainly can squeeze git in here now.
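
With git available in the image, the fetch could collapse to roughly this (sketch; `study_name` and the destination path are illustrative):

```python
import subprocess

# Shallow clone of the repository tip into the lambda's writable space,
# replacing the hardcoded GitHub API calls and manual file writes.
subprocess.run(
    ["git", "clone", "--depth", "1", url, f"{BASE_DIR}/studies/{study_name}"],
    check=True,
)
```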



def get_study_from_github(url):
if "smart-on-fhir" not in url:
Contributor

Separately from the validation side of things - this is an interesting distribution mechanism. Not bad, but maybe some discussion of the tradeoffs here (and/or maybe a comment about them).

Other options brainstorming:

  • A pip-installable package name (pros: easier for python folks to understand, keeps us honest about having the pip package in working order, allows us to version releases, cons: more hassle for authors, harder to check ownership is ours)
  • Allow pointing at any git URL, not just github (pros: less logic here, more flexibility in future, can point to unofficial studies, can point to URLs with params/odd paths like versioned tags, cons: no ownership checking)
    • we can still get some of these benefits if we don't require the URL to be in a strict format - i.e. if we just pass the URL to git clone, after we do ownership validation
  • Allow pointing at a zip file URL (which github does generate for all repos) (pros: not tied to git 🤷 which only really matters if we want ownership flexibility, can do ownership validation by looking at URL, allows for versioning)

I'm especially interested in the options/choices around versioning, because currently this is just gulping from the tip of the git repo, yeah? That maybe has some downstream effects on our dev process.

Also interested whether ownership checking was for malice reasons or just something you were doing to check for misconfiguration/typos. Cause that affects the above pro/con weighting obviously.

Contributor Author

So here are the future use cases for this:

  • This is a second tier distribution mechanism. The primary one, parsing workflows/query builder output, is more complex. This is a fallback for 'we need something that workflow hasn't implemented yet and you/me/Andy set it up manually'.
    • This is basically only something that we are allowed to do. If we had a third party repo, we could fork it and release it that way, but the contract with distributed sites is 'any studies that are distributed in this fashion are ones we have authored/approved to prevent PHI/LDS leakage'.

I can see a case for versioning here, but for simplicity, I'd probably argue for 'get the most recent tag' or 'explicitly get a tag of some name'. We could do it via pip? I'm mildly agnostic to it, but I think that means more intervention from the engineering side of things when a study author wants to push one of these things out. That may be unavoidable.
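
If we land on the tag route, the clone sketch above extends naturally; a rough version (tag, url, and destination are placeholders):

```python
import subprocess

# Pin the distribution to a named release tag instead of the repo tip
subprocess.run(
    ["git", "clone", "--depth", "1", "--branch", tag, url, f"{BASE_DIR}/studies/{study_name}"],
    check=True,
)

# Or, to discover the available tags without cloning:
tags = subprocess.run(
    ["git", "ls-remote", "--tags", url],
    capture_output=True, text=True, check=True,
).stdout
```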
