[New feature]: Concurrent Execution of N SBG end-to-end workflows #64
Comments
@LucaCinquini ping for status
This is still under development. If I don't get a resolution before going on vacation, I have asked @drewm-jpl to prioritize this after the OGC implementation is done.
Autoscaling of nodes with Karpenter seems to be working fine. We have demonstrated this by using a "busybox" Docker image and running concurrently on 10 nodes. There is an "anti-affinity" constraint on the pods, which means that the pods cannot run on the same node: so every time a task is started, a new EC2 node is provisioned (this is for demonstration purposes only; it's not a constraint that we necessarily need to use in production).
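For reference, a minimal sketch (not the actual DAG code) of how this kind of "one pod per node" anti-affinity rule can be expressed when launching the task pods from Airflow with the KubernetesPodOperator; the pod label and task details are hypothetical:

```python
# Sketch only: a pod anti-affinity rule that forces each task pod onto its own
# node. The pod label and task details are hypothetical, not the actual DAG.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

one_pod_per_node = k8s.V1Affinity(
    pod_anti_affinity=k8s.V1PodAntiAffinity(
        required_during_scheduling_ignored_during_execution=[
            k8s.V1PodAffinityTerm(
                label_selector=k8s.V1LabelSelector(
                    match_labels={"app": "sbg-e2e"}  # hypothetical pod label
                ),
                topology_key="kubernetes.io/hostname",
            )
        ]
    )
)

busybox_task = KubernetesPodOperator(
    task_id="busybox_scaling_test",
    name="busybox-scaling-test",
    image="busybox",
    cmds=["sh", "-c", "echo hello && sleep 60"],
    labels={"app": "sbg-e2e"},
    affinity=one_pod_per_node,
)
```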
Scaling of the SBG e2e workflow does not seem to work yet. When a single workflow is executed standalone, it succeeds (see the first execution in the diagram). But when 10 are executed concurrently (the last 10), tasks start failing for multiple reasons. These seem to include provisioning errors and errors interacting with the Data Services:
o Insufficient ephemeral-storage
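One thing that may help with the "Insufficient ephemeral-storage" failures is requesting local disk explicitly for each task pod, so that the scheduler and Karpenter account for it when placing pods and provisioning nodes. A minimal sketch, assuming the tasks are launched with the KubernetesPodOperator; the sizes are placeholders, not tuned values:

```python
# Sketch only: explicit resource requests, including ephemeral-storage, so the
# scheduler and Karpenter provision nodes with enough local disk.
# The numbers are placeholders, not tuned values.
from kubernetes.client import models as k8s

sbg_task_resources = k8s.V1ResourceRequirements(
    requests={"cpu": "8", "memory": "32Gi", "ephemeral-storage": "100Gi"},
    limits={"memory": "32Gi", "ephemeral-storage": "100Gi"},
)

# Passed to the operator as, e.g.:
#   KubernetesPodOperator(..., container_resources=sbg_task_resources)
```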
Also demonstrated scalability with "cwl_dag_new": a DAG that executes a generic CWL (meaning it instantiates a Pod that runs "cwl-runner", which in turn submits the CWL tasks to the local Docker engine). 10 jobs ran concurrently and successfully on 10 separate nodes - see the first 10 DAGs in this diagram.
Experimenting with a new Karpenter "high workload" NodePool, backed by a new EC2NodeClass that has 200 GB of attached EBS disk (verified with "k get nodes").
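If we only want the heavy SBG pods to land on that pool, one option (sketch only; the pool name "high-workload" is hypothetical) is a node selector on the label Karpenter applies to the nodes it provisions:

```python
# Sketch only: target the dedicated Karpenter NodePool from the task pods.
# Karpenter labels the nodes it provisions with karpenter.sh/nodepool=<name>;
# the pool name "high-workload" is hypothetical.
high_workload_selector = {"karpenter.sh/nodepool": "high-workload"}

# e.g. KubernetesPodOperator(..., node_selector=high_workload_selector)
```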
This is the relevant log part from one of them. I wonder if there is a concurrency problem when updating a DS collection, at least in the case where all jobs try to update the same collection at once?
[2024-07-15, 12:51:29 UTC] {pod_manager.py:468} INFO - [base] {"type": "FeatureCollection", "features": [{"type": "Feature", "stac_version": "1.0.0", "id": "urn:nasa:unity:unity:dev:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20240103T131936_001", "properties": {"datetime": "2024-01-03T13:19:36Z", "start_datetime": "2024-01-03T13:19:36Z", "end_datetime": "2024-01-03T13:19:48Z", "created": "2024-07-15T11:19:50.342708+00:00", "updated": "2024-07-15T11:19:50.343427Z"}, "geometry": null, "links": [{"rel": "root", "href": "./catalog.json", "type": "application/json"}, {"rel": "parent", "href": "./catalog.json", "type": "application/json"}], "assets": {"SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.hdr": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.hdr", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.met.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.met.json", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.bin": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.bin", "title": "binary file", "description": "", "roles": ["data"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.met.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.met.json", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.hdr": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.hdr", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.bin": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.bin", "title": "binary file", "description": "", "roles": ["data"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.bin": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.bin", "title": "binary file", "description": "", "roles": ["data"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.hdr": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.hdr", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.met.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.met.json", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.json", "title": "text/json file", "description": "", "roles": ["metadata"]}}, "stac_extensions": [], "collection": "urn:nasa:unity:unity:dev:SBG-L1B_PRE___1"}]}
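If concurrent updates to the same collection do turn out to be the problem, one cheap mitigation to try is retrying the update with jittered exponential backoff so that colliding writers don't all fail at once. A sketch only; update_collection() is hypothetical and stands in for whatever the stage-out step actually calls:

```python
# Sketch only: retry a collection update with jittered exponential backoff so
# that concurrent writers do not all fail on the first conflict.
# update_collection() is hypothetical.
import random
import time


def retry_with_backoff(fn, attempts=5, base_delay=2.0):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i + random.uniform(0.0, 1.0))


# e.g. retry_with_backoff(lambda: update_collection(feature_collection))
```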
@ngachung see comment above from @LucaCinquini
On a call with Nga and William, the following ideas were suggested:
o It is possible that the Python module gets stuck while waiting for the processes to finish.
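If the "stuck waiting for processes" hypothesis is right, a hard timeout on the process that runs the CWL would at least turn a hang into a visible failure. A minimal sketch, assuming the CWL is launched via a subprocess call to cwl-runner; the command line and timeout value are illustrative:

```python
# Sketch only: bound the wait on the CWL execution so a hung run fails loudly
# instead of blocking forever. Command line and timeout are illustrative.
import subprocess

try:
    subprocess.run(
        ["cwl-runner", "workflow.cwl", "inputs.yml"],
        check=True,
        timeout=4 * 3600,
    )
except subprocess.TimeoutExpired as err:
    raise RuntimeError("cwl-runner did not finish within 4 hours") from err
```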
These are the CWL arguments used when testing:
DEFAULT_CWL_WORKFLOW = "https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl"
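For context, a default like this is typically surfaced as an overridable DAG-level parameter. A minimal sketch, assuming Airflow 2.x and a hypothetical DAG id (this is not the actual cwl_dag_new code):

```python
# Sketch only: expose the default CWL workflow URL as an overridable DAG param.
# The DAG id and schedule are hypothetical, not the actual cwl_dag_new code.
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param

DEFAULT_CWL_WORKFLOW = (
    "https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl"
)

with DAG(
    dag_id="sbg_e2e_scaling_test",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    params={"cwl_workflow": Param(DEFAULT_CWL_WORKFLOW, type="string")},
) as dag:
    ...  # tasks read the URL via "{{ params.cwl_workflow }}"
```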
OK, to start: the previous versions of the stage_out CWL did not set a hard cap on the amount of parallelization, as we can see here. We may need to turn on debug logging, which will drown your logs (very, very verbose), to see what's going on there.
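For illustration of what a hard cap looks like in general (sketch only; the real stage-out code lives in the U-DS CWL/containers, and upload_file here is hypothetical):

```python
# Sketch only: bound upload parallelism with a fixed-size worker pool instead of
# spawning one worker per file. upload_file() and the file list are hypothetical.
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_UPLOADS = 4  # hard cap on concurrent uploads


def stage_out(files, upload_file):
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_UPLOADS) as pool:
        # list() forces completion and surfaces any exception from the workers
        return list(pool.map(upload_file, files))
```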
Let me (1) update the packages with the newest U-DS versions and (2) create a version that has debug logging on; maybe we can see if there are errors being encountered.
A single job finishes under the following conditions:
o Specifically request c5.9xlarge, disk=150Gi, no other requests for container resources
o Container resources: CPU=32, memory=64Gi, disk=150Gi --> Karpenter selects node c5.12xlarge (see the sketch below)
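Expressed as operator arguments, these two configurations look roughly like the sketch below (parameter names assume the KubernetesPodOperator from the CNCF Kubernetes provider; the values mirror the list above):

```python
# Sketch only: the two working single-job configurations as operator arguments.
from kubernetes.client import models as k8s

# Configuration 1: pin the instance type and request only disk.
config_pinned_node = dict(
    node_selector={"node.kubernetes.io/instance-type": "c5.9xlarge"},
    container_resources=k8s.V1ResourceRequirements(
        requests={"ephemeral-storage": "150Gi"}
    ),
)

# Configuration 2: request CPU/memory/disk and let Karpenter choose the node
# (it selected c5.12xlarge in this test).
config_karpenter_choice = dict(
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "32", "memory": "64Gi", "ephemeral-storage": "150Gi"}
    ),
)

# e.g. KubernetesPodOperator(..., **config_pinned_node)
```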
I still need to update the workflow for isofit to maximize CPU usage. |
Indeed - I was going to ask you about it... And perhaps provide some different sets of parameters that we can use to run multiple jobs at the same time. |
@mike-gangl: great analysis! It seems to me we could try 2 things right away:
|
Closing this ticket since the investigation will now be tracked by ticket #216.
Description: Tune the Airflow and DAG parameters so that a large number N of SBG end-to-end workflows can be executed successfully.
Dependency (from SE): a reliable estimate of the memory and CPU needed by each step of the SBG workflow.
Acceptance Criteria:
o Demonstrated successful execution of 10 SBG e2e workflows that are submitted at the same time.