[New feature]: Concurrent Execution of N SBG end-to-end workflows #64
Comments
@LucaCinquini ping for status
This is still under development. If I don't get a resolution before going on vacation, I have asked @drewm-jpl to prioritize this after the OGC implementation is done.
Autoscaling of nodes with Karpenter seems to be working fine. We have demonstrated this by using a "busybox" Docker image and running concurrently on 10 nodes. There is an "anti-affinity" constraint on the pods, which means that the pods cannot run on the same node: so every time a task is started, a new EC2 node is provisioned (this is for demonstration purposes only; it's not a constraint that we necessarily need to use in production).
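For reference, a minimal sketch (not the actual DAG code) of how this kind of "one pod per node" anti-affinity rule can be expressed when launching the task pods from Airflow with the KubernetesPodOperator; the pod label and task details are hypothetical:

```python
# Sketch only: a pod anti-affinity rule that forces each task pod onto its own
# node. The pod label and task details are hypothetical, not the actual DAG.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

one_pod_per_node = k8s.V1Affinity(
    pod_anti_affinity=k8s.V1PodAntiAffinity(
        required_during_scheduling_ignored_during_execution=[
            k8s.V1PodAffinityTerm(
                label_selector=k8s.V1LabelSelector(
                    match_labels={"app": "sbg-e2e"}  # hypothetical pod label
                ),
                topology_key="kubernetes.io/hostname",
            )
        ]
    )
)

busybox_task = KubernetesPodOperator(
    task_id="busybox_scaling_test",
    name="busybox-scaling-test",
    image="busybox",
    cmds=["sh", "-c", "echo hello && sleep 60"],
    labels={"app": "sbg-e2e"},
    affinity=one_pod_per_node,
)
```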
Scaling of the SBG e2e workflow does not seem to work yet. When a single workflow is executed standalone, it succeeds (see the first execution in the diagram). But when 10 are executed concurrently (the last 10), tasks start failing for multiple reasons. These seem to include provisioning errors and errors interacting with the Data Services:
o Insufficient ephemeral-storage
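One thing that may help with the "Insufficient ephemeral-storage" failures is requesting local disk explicitly for each task pod, so that the scheduler and Karpenter account for it when placing pods and provisioning nodes. A minimal sketch, assuming the tasks are launched with the KubernetesPodOperator; the sizes are placeholders, not tuned values:

```python
# Sketch only: explicit resource requests, including ephemeral-storage, so the
# scheduler and Karpenter provision nodes with enough local disk.
# The numbers are placeholders, not tuned values.
from kubernetes.client import models as k8s

sbg_task_resources = k8s.V1ResourceRequirements(
    requests={"cpu": "8", "memory": "32Gi", "ephemeral-storage": "100Gi"},
    limits={"memory": "32Gi", "ephemeral-storage": "100Gi"},
)

# Passed to the operator as, e.g.:
#   KubernetesPodOperator(..., container_resources=sbg_task_resources)
```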
Also demonstrated scalability with "cwl_dag_new": a DAG that executes a generic CWL (meaning it instantiates a Pod that runs "cwl-runner", which in turn submits the CWL tasks to the local Docker engine). 10 jobs ran concurrently and successfully on 10 separate nodes - see the first 10 DAGs in this diagram.
Experimenting with a new Karpenter "high workload" NodePool, backed by a new EC2NodeClass that has 200 GB of attached EBS disk (verified with "k get nodes").
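If we only want the heavy SBG pods to land on that pool, one option (sketch only; the pool name "high-workload" is hypothetical) is a node selector on the label Karpenter applies to the nodes it provisions:

```python
# Sketch only: target the dedicated Karpenter NodePool from the task pods.
# Karpenter labels the nodes it provisions with karpenter.sh/nodepool=<name>;
# the pool name "high-workload" is hypothetical.
high_workload_selector = {"karpenter.sh/nodepool": "high-workload"}

# e.g. KubernetesPodOperator(..., node_selector=high_workload_selector)
```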
This is the relevant log part from one of them. I wonder if there is a concurrency problem when updating a DS collection, at least in the case where all jobs try to update the same collection at once?
[2024-07-15, 12:51:29 UTC] {pod_manager.py:468} INFO - [base] {"type": "FeatureCollection", "features": [{"type": "Feature", "stac_version": "1.0.0", "id": "urn:nasa:unity:unity:dev:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20240103T131936_001", "properties": {"datetime": "2024-01-03T13:19:36Z", "start_datetime": "2024-01-03T13:19:36Z", "end_datetime": "2024-01-03T13:19:48Z", "created": "2024-07-15T11:19:50.342708+00:00", "updated": "2024-07-15T11:19:50.343427Z"}, "geometry": null, "links": [{"rel": "root", "href": "./catalog.json", "type": "application/json"}, {"rel": "parent", "href": "./catalog.json", "type": "application/json"}], "assets": {"SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.hdr": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.hdr", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.met.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.met.json", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.bin": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.bin", "title": "binary file", "description": "", "roles": ["data"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.met.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.met.json", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.hdr": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.hdr", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.bin": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_OBS.bin", "title": "binary file", "description": "", "roles": ["data"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.bin": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001_LOC.bin", "title": "binary file", "description": "", "roles": ["data"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.hdr": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.hdr", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.met.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.met.json", "title": "None file", "description": "", "roles": ["metadata"]}, "SISTER_EMIT_L1B_RDN_20240103T131936_001.json": {"href": "./SISTER_EMIT_L1B_RDN_20240103T131936_001.json", "title": "text/json file", "description": "", "roles": ["metadata"]}}, "stac_extensions": [], "collection": "urn:nasa:unity:unity:dev:SBG-L1B_PRE___1"}]}
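If concurrent updates to the same collection do turn out to be the problem, one cheap mitigation to try is retrying the update with jittered exponential backoff so that colliding writers don't all fail at once. A sketch only; update_collection() is hypothetical and stands in for whatever the stage-out step actually calls:

```python
# Sketch only: retry a collection update with jittered exponential backoff so
# that concurrent writers do not all fail on the first conflict.
# update_collection() is hypothetical.
import random
import time


def retry_with_backoff(fn, attempts=5, base_delay=2.0):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i + random.uniform(0.0, 1.0))


# e.g. retry_with_backoff(lambda: update_collection(feature_collection))
```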
@ngachung see comment above from @LucaCinquini
On a call with Nga and William, the following ideas were suggested:
o It is possible that the Python module gets stuck while waiting for the processes to finish.
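If the "stuck waiting for processes" hypothesis is right, a hard timeout on the process that runs the CWL would at least turn a hang into a visible failure. A minimal sketch, assuming the CWL is launched via a subprocess call to cwl-runner; the command line and timeout value are illustrative:

```python
# Sketch only: bound the wait on the CWL execution so a hung run fails loudly
# instead of blocking forever. Command line and timeout are illustrative.
import subprocess

try:
    subprocess.run(
        ["cwl-runner", "workflow.cwl", "inputs.yml"],
        check=True,
        timeout=4 * 3600,
    )
except subprocess.TimeoutExpired as err:
    raise RuntimeError("cwl-runner did not finish within 4 hours") from err
```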
These are the CWL arguments used when testing:
DEFAULT_CWL_WORKFLOW = "https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl"
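For context, a default like this is typically surfaced as an overridable DAG-level parameter. A minimal sketch, assuming Airflow 2.x and a hypothetical DAG id (this is not the actual cwl_dag_new code):

```python
# Sketch only: expose the default CWL workflow URL as an overridable DAG param.
# The DAG id and schedule are hypothetical, not the actual cwl_dag_new code.
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param

DEFAULT_CWL_WORKFLOW = (
    "https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl"
)

with DAG(
    dag_id="sbg_e2e_scaling_test",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    params={"cwl_workflow": Param(DEFAULT_CWL_WORKFLOW, type="string")},
) as dag:
    ...  # tasks read the URL via "{{ params.cwl_workflow }}"
```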
OK, to start: the previous versions of the stage_out CWL did not set a hard cap on the amount of parallelization, as we can see here. We may need to turn on debug logging, which will drown your logs (very, very verbose), to see what's going on there.
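For illustration of what a hard cap looks like in general (sketch only; the real stage-out code lives in the U-DS CWL/containers, and upload_file here is hypothetical):

```python
# Sketch only: bound upload parallelism with a fixed-size worker pool instead of
# spawning one worker per file. upload_file() and the file list are hypothetical.
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_UPLOADS = 4  # hard cap on concurrent uploads


def stage_out(files, upload_file):
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_UPLOADS) as pool:
        # list() forces completion and surfaces any exception from the workers
        return list(pool.map(upload_file, files))
```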
Let me (1) update the packages with the newest U-DS versions and (2) create a version that has debug logging on; maybe we can see if there are errors being encountered.
A single job finishes under the following conditions:
o Specifically request c5.9xlarge, disk=150Gi, no other requests for container resources
o Container resources: CPU=32, memory=64Gi, disk=150Gi --> Karpenter selects node c5.12xlarge (see the sketch below)
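Expressed as operator arguments, these two configurations look roughly like the sketch below (parameter names assume the KubernetesPodOperator from the CNCF Kubernetes provider; the values mirror the list above):

```python
# Sketch only: the two working single-job configurations as operator arguments.
from kubernetes.client import models as k8s

# Configuration 1: pin the instance type and request only disk.
config_pinned_node = dict(
    node_selector={"node.kubernetes.io/instance-type": "c5.9xlarge"},
    container_resources=k8s.V1ResourceRequirements(
        requests={"ephemeral-storage": "150Gi"}
    ),
)

# Configuration 2: request CPU/memory/disk and let Karpenter choose the node
# (it selected c5.12xlarge in this test).
config_karpenter_choice = dict(
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "32", "memory": "64Gi", "ephemeral-storage": "150Gi"}
    ),
)

# e.g. KubernetesPodOperator(..., **config_pinned_node)
```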
I still need to update the workflow for isofit to maximize CPU usage. |
Indeed - I was going to ask you about it... And perhaps provide some different sets of parameters that we can use to run multiple jobs at the same time. |
@mike-gangl: great analysis! It seems to me we could try 2 things right away:
|
Closing this ticket since the investigation will now be tracked by ticket #216.
Description: Tune the Airflow and DAG parameters so that a large number N of SBG end-to-end workflows can be executed successfully.
Dependency (from SE): a reliable estimate of the memory and CPU needed by each step of the SBG workflow.
Acceptance Criteria:
o Demonstrated successful execution of 10 SBG e2e workflows that are submitted at the same time.