
[New Feature]: Experiment with using EC2 types with attached storage for better performance #339

Open
LucaCinquini opened this issue Feb 27, 2025 · 12 comments
Labels: enhancement (New feature or request), U-SPS

@LucaCinquini
Collaborator

The ASIPS team has identified a bottleneck in DAG execution related to downloading the Docker images to the Pod. It is possible that performance would be improved if we use EC2 instances with attached SSD storage.

- Try to use m5ad.xlarge as the worker node. We must make sure that the Pod uses the attached storage for all I/O operations (see the sketch below).

- Also, probably less important, try to use m5ad.xlarge to host the Pods for all Airflow services ("airflow-core-components" and "celery-workers").
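
As a rough illustration of the first bullet, pinning a task Pod to an m5ad.xlarge node can be expressed with a node selector on the standard instance-type label. The snippet below is a minimal hypothetical sketch (the task id, namespace, and image are placeholders, and the import path may differ between provider versions), not the project's actual DAG code:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

cwl_task = KubernetesPodOperator(
    task_id="cwl_task_on_m5ad",          # hypothetical task id
    namespace="sps",                     # placeholder namespace
    image="ghcr.io/example/cwl-runner",  # placeholder image
    cmds=["df", "-h", "/data"],
    # Schedule onto an m5ad.xlarge node using the well-known Kubernetes node label.
    node_selector={"node.kubernetes.io/instance-type": "m5ad.xlarge"},
    get_logs=True,
)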

@rtapella

Wouldn’t this be fixed if we have a DAG per App Pack/Algorithm? So it’s just all pre-built and ready to go?

@LucaCinquini
Collaborator Author

I don't think so... The Docker container that encapsulates the algorithm is totally separate from the EC2 node on which it will run.

@nikki-t moved this to In Progress in Unity Project Board on Feb 28, 2025
@nikki-t
Collaborator

nikki-t commented Feb 28, 2025

I did some investigating and think I have a better understanding of NVMe SSD instance store volumes for EC2.

This link contains some helpful general information: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html

I launched an m5ad.xlarge instance, which has one SSD attached to it, and it looks like the NVMe device is already mounted to the root volume on the EC2 instance, so there is no need to mount it:

[ec2-user@ip-xx-xx-xx-xx ~]$ lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1       259:0    0 139.7G  0 disk 
nvme0n1       259:1    0    30G  0 disk 
├─nvme0n1p1   259:2    0    30G  0 part /
└─nvme0n1p128 259:3    0     1M  0 part 

I think this means that I can just try launching the m5ad.xlarge instance and running the CWL DAG and it will take advantage of the attached SSD. But I do want to note that Karpenter has the ability to request an instance with NVMe disks. See this link: https://karpenter.sh/docs/concepts/scheduling/#advanced-scheduling-techniques

Here is another example: https://github.com/aws/karpenter-provider-aws/blob/main/examples/v1/instance-store-ephemeral-storage.yaml

It does look like this may require the NVMe devices to be configured as RAID0, but I am not sure when this is necessary. Maybe when an instance is requested that has more than one attached SSD?

For now I am going to test running the DAG on the m5ad.xlarge instance and see how far I get.
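
If the RAID0 / instance-store-as-ephemeral-storage route from the Karpenter example above were taken, the task Pod would not need to mount anything explicitly; it could simply request enough ephemeral storage and let the scheduler pick a suitable node. A hypothetical fragment (the sizes and the container_resources wiring are assumptions, not configuration from this issue):

from kubernetes.client import models as k8s

# Request instance-store-backed ephemeral storage for the task container.
nvme_backed_storage = k8s.V1ResourceRequirements(
    requests={"ephemeral-storage": "100Gi"},
    limits={"ephemeral-storage": "130Gi"},
)
# Passed to the KubernetesPodOperator as `container_resources=nvme_backed_storage`
# (parameter name in recent versions of the cncf-kubernetes provider).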

@LucaCinquini
Collaborator Author

Thanks @nikki-t, let's try your last suggestion first. Once the m5ad worker is up and running, maybe you can ssh into it and verify that the "/data" partition is created on the "nvme1n1" block store.

@nikki-t
Collaborator

nikki-t commented Mar 3, 2025

I am able to run this CWL, which runs the lsblk and df commands to view mounts inside the Pod. (I am not sure how to SSH into the Pod; my attempts were not successful.)

Here are the logs from a run:

[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] List block devices
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] nvme1n1       259:0    0 139.7G  0 disk
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] nvme0n1       259:1    0    30G  0 disk
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] |-nvme0n1p1   259:2    0    30G  0 part /etc/hosts
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] |                                       /etc/hostname
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] |                                       /etc/resolv.conf
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] `-nvme0n1p128 259:3    0     1M  0 part
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] List mounted drives
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] Filesystem     Type     Size  Used Avail Use% Mounted on
[2025-03-03, 20:00:07 UTC] {pod_manager.py:471} INFO - [base] overlay        overlay   30G  5.7G   25G  19% /

It looks like nvme1n1 is the SSD drive attached to the underlying EC2 instance per this documentation. I did some further investigation on a manually launched EC2 instance and saw that nvme0n1 is an EBS volume while nvme1n1 is the EC2 NVMe instance storage.

sudo nvme list-subsys
nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:1d0f1d0fvol0715b6caa297bbed5Amazon Elastic Block Store              
\
 +- nvme0 pcie 0000:00:04.0 live 
nvme-subsys1 - NQN=nqn.2014.08.org.nvmexpress:1d0f0000AWS22813CFDD6743CD6AAmazon EC2 NVMe Instance Storage        
\
 +- nvme1 pcie 0000:00:1f.0 live 

So I need to mount nvme1n1 on /data, which is the working directory for the DAG entrypoint script. To that end, I tested the Karpenter NodePool requirements:

{
	"key": "karpenter.k8s.aws/instance-local-nvme",
	"operator": "Gt",
	"values": ["99"]
},

The logs from running with this requirement are the same as above. It seems to be best practice to run a node with NVMe attached storage and then mount that into the container using Kubernetes volumes.

I don't know if there is a way to run a "user-data" script on EKS Pods when they launch, as this functionality seems to be documented only for nodes.

Next steps might be:

  1. Run a node as m5ad.xlarge
  2. Mount NVME to /data
  3. Modify Airflow Terraform to define a volume and volume mount for /data
  4. Modify DAG KPO to include volume and volume mount
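
As a rough illustration of how that requirement could also be expressed on the Pod side (so that the task Pod only lands on a node with local NVMe storage), a hypothetical node-affinity snippet for the KubernetesPodOperator might look like the following; this is an assumption-based sketch, not the configuration used in this issue:

from kubernetes.client import models as k8s

# Require a node that advertises more than 99 GB of local NVMe storage,
# using the well-known Karpenter label from the NodePool requirement above.
nvme_affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=k8s.V1NodeSelector(
            node_selector_terms=[
                k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="karpenter.k8s.aws/instance-local-nvme",
                            operator="Gt",
                            values=["99"],
                        )
                    ]
                )
            ]
        )
    )
)
# Passed to the KubernetesPodOperator as `affinity=nvme_affinity`.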

@mike-gangl

Would it make sense to add a param to the instance types (selectable) that specifies whether NVMe is included in the type? Then we could adjust the KPO accordingly.

@rtapella

rtapella commented Mar 4, 2025

> I don't think so... The Docker container that encapsulates the algorithm is totally separate from the EC2 node on which it will run.

I guess I imagined that when you "deploy" an app to Airflow, all the parts of it get copied "into" Airflow somehow as a local copy. Then when you execute it, it wouldn't need to pull the container (etc.) from the public repository. This isn't very easy with the single-DAG approach we have now, but if we had a DAG per algorithm I could imagine storing all the algorithm parts (App Package) in "local" storage for Airflow to pull quickly into a new job.

@mike-gangl

@rtapella you're right, but this is a different issue. I'll create a ticket for it; it's a large piece of work, but it removes external Docker registries as dependencies.

@mike-gangl

mike-gangl commented Mar 4, 2025

@rtapella, I created a feature for what you're describing (and more) here: unity-sds/unity-project-management#244.

@nikki-t
Collaborator

nikki-t commented Mar 6, 2025

> Would it make sense to add a param to the instance types (selectable) that specifies whether NVMe is included in the type? Then we could adjust the KPO accordingly.

@mike-gangl - I think we will want to do something like this down the road, but first I need to figure out how to configure everything to use the NVMe storage.

@nikki-t
Collaborator

nikki-t commented Mar 6, 2025

I ran into a hurdle with the next steps while trying to mount the NVMe SSD storage on the node. I added the mounts to the Pods defined in the Airflow Helm chart and created a PV and PVC for local storage: https://kubernetes.io/docs/concepts/storage/volumes/#local. Airflow started to deploy but never completed the deployment, so I must have gotten the configuration wrong in some way.

So I pivoted to trying to mount the NVMe SSD storage in the Pod rather than the node, as I think this better fits how we want to set things up. More specifically, I mounted the /data directory to the Pod's emptyDir volume after formatting and mounting the NVMe SSD on the node, following these instructions: https://medium.com/@sandeep.kadyan/ephemeral-volume-emptydir-backed-on-ec2-instance-store-nvme-4cdd500a8331

Modifications included:

  1. I updated my TFVARs for the Airflow deployment to include a requirement for karpenter.k8s.aws/instance-local-nvme for the NodePools.
  2. I updated the node config userData to mount the NVMe SSD (may want to modify this later to use RAID0).
  3. I added the karpenter.k8s.aws/instance-local-nvme label to the Pod in the Airflow Helm chart. I also added a volume that references the emptyDir and a volume mount that mounts it to /data on the Pod.

I ran a test to see if this impacted the Pod's local storage but got the same results as above:

[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] List block devices
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] nvme1n1       259:0    0 139.7G  0 disk
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] nvme0n1       259:1    0    30G  0 disk
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] |-nvme0n1p1   259:2    0    30G  0 part /etc/hosts
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] |                                       /etc/hostname
[2025-03-06, 22:03:49 UTC] {pod_manager.py:471} INFO - [base] |                                       /etc/resolv.conf

I will have to dig a little deeper to see if I can figure out the Pod configuration.
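
For what it's worth, a quick way to confirm from inside the task container whether /data is actually a separate (NVMe-backed) mount is a small check like the one below. This is a hypothetical debugging snippet, not code from this issue; /data is just the expected working directory:

import shutil

TARGET = "/data"  # the directory we expect to be backed by the instance store

# A path is a mount point if it appears in the container's mount table
# (field 5 of each /proc/self/mountinfo line is the mount point).
with open("/proc/self/mountinfo") as f:
    mount_points = {line.split()[4] for line in f}
print(f"{TARGET} is a separate mount: {TARGET in mount_points}")

# Report how much space backs the directory, if it exists at all.
try:
    usage = shutil.disk_usage(TARGET)
    print(f"total={usage.total / 2**30:.1f} GiB, free={usage.free / 2**30:.1f} GiB")
except FileNotFoundError:
    print(f"{TARGET} does not exist in this container")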

@nikki-t
Collaborator

nikki-t commented Mar 10, 2025

I have pushed my code up to the 339-ssd-ec2-instance-type branch and it can be found here: https://github.com/unity-sds/unity-sps/tree/339-ssd-ec2-instance-type

I don't have a completely functional example quite yet. But here are the details:

Solution 1: emptyDir

Mount the NVMe instance store to the underlying node and then mount the emptyDir volume to the Pods.

I can get the underlying node to mount the NVMe SSD via this user-data script. I have confirmed that it mounts to /var/lib/kubelet/pods, which is the directory used by emptyDir volumes.

I thought that I could then mount the emptyDir volume to the Pod using the KubernetesPodOperator like so:

volume_mounts=[
    k8s.V1VolumeMount(name="workers-data", mount_path=WORKING_DIR)
],
volumes=[
    k8s.V1Volume(
        name="workers-data",
        empty_dir=k8s.V1EmptyDirVolumeSource(medium=""),
    )
]

But the Pod is not able to locate the /data directory.

Solution 2: hostPath

So I updated to using a volume that is mounted to the node at the /data directory in the user-data script, and then defined a hostPath volume using the KubernetesPodOperator like this:

volume_mounts=[
    k8s.V1VolumeMount(name="workers-data", mount_path=WORKING_DIR)
],
volumes=[
    k8s.V1Volume(
        name="workers-data",
        host_path=k8s.V1HostPathVolumeSource(path=WORKING_DIR, type="DirectoryOrCreate"),
    )
],

But the Pod is also not able to locate the /data directory.

It seems like the KubernetesPodOperator is not actually creating or mounting any of the volumes.

@LucaCinquini - do you have any thoughts on how to mount the node directory to the pod executed by the KubernetesPodOperator?
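
For reference, one commonly suggested way to wire this up is to combine the hostPath volume with node affinity, so that the task Pod is guaranteed to land on a node where /data has actually been prepared by the user-data script. The sketch below is a hypothetical illustration (the task id, namespace, image, and the WORKING_DIR value are placeholders), not a verified fix for the behavior described above:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

WORKING_DIR = "/data"  # placeholder: the DAG entrypoint's working directory

cwl_task = KubernetesPodOperator(
    task_id="cwl_task_nvme",             # placeholder task id
    namespace="sps",                     # placeholder namespace
    image="ghcr.io/example/cwl-runner",  # placeholder image
    cmds=["df", "-h", WORKING_DIR],
    # Mount the node's /data directory (prepared by the node's user-data script).
    volumes=[
        k8s.V1Volume(
            name="workers-data",
            host_path=k8s.V1HostPathVolumeSource(path=WORKING_DIR, type="DirectoryOrCreate"),
        )
    ],
    volume_mounts=[
        k8s.V1VolumeMount(name="workers-data", mount_path=WORKING_DIR)
    ],
    # Only schedule onto nodes that advertise local NVMe storage.
    affinity=k8s.V1Affinity(
        node_affinity=k8s.V1NodeAffinity(
            required_during_scheduling_ignored_during_execution=k8s.V1NodeSelector(
                node_selector_terms=[
                    k8s.V1NodeSelectorTerm(
                        match_expressions=[
                            k8s.V1NodeSelectorRequirement(
                                key="karpenter.k8s.aws/instance-local-nvme",
                                operator="Gt",
                                values=["99"],
                            )
                        ]
                    )
                ]
            )
        )
    ),
    get_logs=True,
)

If the mounts still do not show up, inspecting the rendered Pod spec of a running task (e.g. kubectl get pod <name> -o yaml) should show whether the volumes made it into the spec at all, which narrows the problem down to either the operator arguments or the node setup.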
