Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ability to run tasks dataproc. #948

Merged
merged 32 commits into from
Jan 6, 2025

Conversation

bpblanken
Copy link
Collaborator

@bpblanken bpblanken commented Nov 1, 2024

Note that it is not actually enabled and the feature flag isn't created here.

@bpblanken bpblanken changed the base branch from dev to benb/create_dataproc_cluster_task November 1, 2024 15:57
Base automatically changed from benb/create_dataproc_cluster_task to dev November 6, 2024 06:08
@bpblanken bpblanken changed the title Benb/run pipeline on dataproc task feat: Add ability to run tasks dataproc. Jan 2, 2025
@bpblanken bpblanken marked this pull request as ready for review January 2, 2025 17:59
@bpblanken bpblanken requested a review from a team as a code owner January 2, 2025 17:59
)
except google.api_core.exceptions.NotFound:
return False
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this else needed here? won't the following lines be executed anyway if no exception is thrown in the try block?

Copy link
Collaborator Author

@bpblanken bpblanken Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... I was getting a lint error if I had the return within the try. I don't think the rule properly accounts for the return in the expect block. I can lift them out of the "else" no problem though.

request={
'project_id': Env.GCLOUD_PROJECT,
'region': Env.GCLOUD_REGION,
'job_id': f'{self.task_name}-{self.run_id}',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but you could make job_id an instance attribute self.job_id = f'{self.task_name}-{self.run_id}' so that you do it in just 1 spot instead of 4

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -166,11 +175,13 @@ def run(self):
'cluster': get_cluster_config(self.reference_genome, self.run_id),
},
)
while True:
wait_s = 0
while wait_s < TIMEOUT_S:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also want this waiting behavior in BaseRunJobOnDataprocTask?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it out because I couldn't think of a good timeout... there's going to be wide variability depending on who's using it and how much compute they as for. Setting an extreme timout (like 48 hours) might be better though.

@bpblanken bpblanken merged commit 25db277 into dev Jan 6, 2025
3 checks passed
@bpblanken bpblanken deleted the benb/run_pipeline_on_dataproc_task branch January 6, 2025 18:57
bpblanken added a commit that referenced this pull request Jan 7, 2025
* Add service account credentialing (#997)

* Add service account credentialing

* ruff

* feat: Handle parsing empty predicted sex into Unknown (#1000)

* Add helper functions for querying `Terra Data Repository` (#998)

* Add service account credentialing

* ruff

* First pass

* tests passing

* add coverage of bigquery test

* change function names

* use generators everywhere

* bq requirement

* resolver

* Update sample id name

* Build Sex Check Table from TDR Metrics (#999)

* refactor: Move feature flags to FeatureFlag enum. (#1002)

* refactor: Move feature flags out of environment to their own dataclass

* lint: ruff

* ruff

* bugfix: exclude samples from relationship checking that are not present in the expected loadable samples (#1003)

* bugfix: exclude samples from relationship checking that are not present in the expected loadable samples

* cleanup

* feat: add remap and family loading failures as validation exceptions … (#1005)

* feat: add remap and family loading failures as validation exceptions rather than runtime errors

* move on

* Update write_remapped_and_subsetted_callset_test.py

* ruff

* feat: Add ability to run tasks dataproc. (#948)

* Support gcs dirs in rsync

* ws

* Add create dataproc cluster task

* add dataproc

* ruff

* requirements

* still struggling

* Gencode refactor to remove gcs

* bump reqs

* Run dataproc job

* lib

* running

* merge requirements

* Flip'em

* Better exception handling

* Cleaner approach if less generalizable

* write a test

* Fix tests

* lint

* Add test for success

* refactor to use a base class... better for adding support for multiple jobs

* cleanup

* ruff

* Fix missing mock

* Fix flapping test

* pr comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants