[New Feature]: Experiment with using EC2 types with attached storage for better performance #339
Comments
Wouldn’t this be fixed if we have a DAG per App Pack/Algorithm? So it’s just all pre-built and ready to go?
I don't think so... The Docker container that encapsulates the algorithm is totally separate from the EC2 node on which it will run.
I did some investigating and think I have a better understanding of NVMe SSD instance store volumes for EC2. This link contains some helpful general information: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html

I launched an EC2 instance with attached NVMe instance storage and ran `lsblk`:

```
[ec2-user@ip-xx-xx-xx-xx ~]$ lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1       259:0    0 139.7G  0 disk
nvme0n1       259:1    0    30G  0 disk
├─nvme0n1p1   259:2    0    30G  0 part /
└─nvme0n1p128 259:3    0     1M  0 part
```

I think this means that I can just try launching a worker on an instance type with attached NVMe storage. Here is another example: https://github.com/aws/karpenter-provider-aws/blob/main/examples/v1/instance-store-ephemeral-storage.yaml

It does look like this may require the NVMe devices to be configured as a RAID0, but I am not sure when this is necessary. Maybe when an instance is requested that has more than one attached SSD? For now I am going to test running the DAG on the m5ad.xlarge instance type.
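For reference, the linked Karpenter example turns instance-store NVMe into node ephemeral storage through the EC2NodeClass `instanceStorePolicy` field. A minimal sketch of that setup is below; the metadata name, IAM role, and discovery tags are hypothetical placeholders, not values from our deployment:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: nvme-instance-store            # hypothetical name
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-example      # hypothetical IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster   # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
  # RAID0 tells Karpenter to assemble the local NVMe instance-store devices
  # into a RAID0 array and use it for the node's ephemeral storage
  # (container images, emptyDir volumes, logs).
  instanceStorePolicy: RAID0
```

If I read the example correctly, the same policy should apply whether the instance has one local SSD or several (a single disk just becomes a one-device array), but that is worth verifying.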
Thanks @nikki-t, let's try your last suggestion first. Once the m5ad worker is up and running, maybe you can ssh into it and verify that the "/data" partition is created on the "nvme1n1" block store.
I am able to run this CWL, which lists the block devices and mounted drives inside the worker Pod. Here are the logs from a run:

```
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] List block devices
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] nvme1n1       259:0    0 139.7G  0 disk
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] nvme0n1       259:1    0    30G  0 disk
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] |-nvme0n1p1   259:2    0    30G  0 part /etc/hosts
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] |                                        /etc/hostname
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] |                                        /etc/resolv.conf
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] `-nvme0n1p128 259:3    0     1M  0 part
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] List mounted drives
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] Filesystem     Type     Size  Used Avail Use% Mounted on
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] overlay        overlay   30G  5.7G   25G  19% /
```

It looks like the Pod only uses the 30 GB EBS root volume; the 139.7 GB NVMe instance store device is visible but not mounted. Running `sudo nvme list-subsys` shows:
```
nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:1d0f1d0fvol0715b6caa297bbed5Amazon Elastic Block Store
\
 +- nvme0 pcie 0000:00:04.0 live
nvme-subsys1 - NQN=nqn.2014.08.org.nvmexpress:1d0f0000AWS22813CFDD6743CD6AAmazon EC2 NVMe Instance Storage
\
 +- nvme1 pcie 0000:00:1f.0 live
```

So I need to mount the NVMe instance store device (`nvme1n1`) so that the Pod can actually use it. I also added the following requirement so that provisioned nodes have local NVMe storage:

```
{
  "key": "karpenter.k8s.aws/instance-local-nvme",
  "operator": "Gt",
  "values": ["99"]
},
```

The logs from running with this requirement are the same as above. It seems to be best practice to run a node with NVMe attached storage and then mount that storage into the container using Kubernetes Volumes. I don't know if there is a way to run a "user-data" script on EKS Pods when they launch; this functionality seems to be documented only for nodes. Next steps might be:
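For context, here is a minimal sketch of where that requirement might sit inside a Karpenter NodePool. The resource names and the extra architecture requirement are placeholders, not our actual configuration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: airflow-worker-nvme            # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: nvme-instance-store      # hypothetical, matches the EC2NodeClass sketch above
      requirements:
        # Only provision instance types with local NVMe instance storage
        # (mirrors the JSON requirement quoted above).
        - key: karpenter.k8s.aws/instance-local-nvme
          operator: Gt
          values: ["99"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```

The requirement alone only influences which instance types get provisioned; the Pod still needs a volume that points at the local disk to actually benefit from it.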
Would it make sense to add a parameter to the (selectable) instance types that specifies whether NVMe storage is included in the type? Then we could adjust the KPO accordingly.
I guess I imagined that when you "deploy" an app to Airflow, all the parts of it get copied "into" Airflow somehow as a local copy. Then when you execute it, it wouldn't need to pull the container (etc.) from the public repository. This isn't very easy with the single-DAG approach we have now, but if we had a DAG per algorithm I could imagine storing all the algorithm parts (App Package) in "local" storage for Airflow to pull quickly into a new job.
@rtapella you're right, but this is a different issue. I'll create a ticket for it; it's a large piece of work, but it removes external Docker registries as dependencies.
@rtapella, I created a feature here: unity-sds/unity-project-management#244 for what you're describing (and more).
@mike-gangl - I think we will want to do something like this down the road, but first I need to figure out how to configure everything to use the NVMe storage.
I ran into a hurdle with the next steps and trying to mount the NVMe SSD storage on the node. I added the mounts to the Pods defined in the Airflow Helm chart and created a PV and PVC for local storage: https://kubernetes.io/docs/concepts/storage/volumes/#local. This caused Airflow to deploy but never complete the deployment, so I must have gotten the configuration wrong in some way.

So, I pivoted to trying to mount the NVMe SSD storage in the Pod rather than the node, as I think this better fits how we want to set things up. More specifically, I mounted the NVMe storage directly in the Pod. Modifications included:

I ran a test to see if this impacted the Pod's local storage but got the same results as above:

```
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] List block devices
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] nvme1n1       259:0    0 139.7G  0 disk
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] nvme0n1       259:1    0    30G  0 disk
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] |-nvme0n1p1   259:2    0    30G  0 part /etc/hosts
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] |                                        /etc/hostname
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] |                                        /etc/resolv.conf
```

I will have to dig a little deeper to see if I can figure out the Pod configuration.
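For the local-storage approach linked above, a minimal sketch of a local PersistentVolume and claim is below. The names, the mount path, and the instance-type pinning are assumptions: it presumes the node has already formatted the instance-store device and mounted it somewhere like /mnt/k8s-disks/0, and that only m5ad.xlarge nodes expose it.

```yaml
# Sketch only: the hostPath below is an assumed node-local NVMe mount point.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-local-pv                # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme       # hypothetical StorageClass name
  local:
    path: /mnt/k8s-disks/0           # assumed mount point of the instance store
  nodeAffinity:                      # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["m5ad.xlarge"]
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvme-local-pvc               # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-nvme
  resources:
    requests:
      storage: 100Gi
```

For local volumes the StorageClass would also need `volumeBindingMode: WaitForFirstConsumer` so the Pod and the PV end up on the same node.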
I have pushed my code up, but I don't have a completely functional example quite yet. Here are the details:

Solution 1
The ASIPS team has identified a bottleneck in DAG execution related to downloading the Docker images to the Pod. It is possible that performance would be improved if we used EC2 instances with attached SSD storage.
o Try to use m5ad.xlarge as the worker node. We must make sure that the Pod uses the attached storage for all I/O operations (see the sketch below).
o Also, probably less important, try to use m5ad.xlarge to host the Pods for all Airflow services ("airflow-core-components" and "celery-workers").
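A rough sketch of the first idea: pin a task Pod to an m5ad.xlarge node and give it a scratch volume backed by the node's local NVMe mount. The Pod name, image, and hostPath are placeholders, not values from our deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvme-scratch-test            # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: m5ad.xlarge
  containers:
    - name: task
      image: public.ecr.aws/docker/library/busybox:latest
      # Confirm the scratch dir sits on the large local disk, then do a quick write test.
      command: ["sh", "-c", "df -h /scratch && dd if=/dev/zero of=/scratch/testfile bs=1M count=1024"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      hostPath:
        path: /mnt/k8s-disks/0       # assumed node-local NVMe mount point
        type: DirectoryOrCreate
```

In the Airflow context, the same node selector and volume could be passed to the KubernetesPodOperator so that the task's working directory and stage-in/stage-out data sit on the local disk.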