Commit 054d487

Add Self-Hosted AWS GPU Runner (#100)
* added self hosted GPU runner CI file
* should switch to micromamba
* if this doesn't work we are using micromamba
* use micromamba
* needed to give env a name
* see if switching the cudatoolkit to 11.7 works
* should be able to use nvcc from the ami
* fix some version pins
* Remove pins from environment.yml
* set HOME
* forgot how to set envars
* Add some debugging
* getting some weird activation problems
* Remove debugging output
* see if now that things are working, I can override the pins
* see if this works without activating
* Fix the build that doesn't use cuda
* Accidently kept a GPU package in base env
* keep the environment.yml in the root of the repo intact, move custom env to a folder
* missed a path
* Add caching to speed up env creation
* revert to keep PR as small as possible
* accidently checkouted wrong versions from stale fork
* forgot to use the cudatoolkit from the ami instead of Jimver/cuda-toolkit
* forgot to set home
* make sure we init the shell
* missed a reference to the build matrix
* set timeout to be 1 hr
* make it easier to update the versions
* had the env in the wrong spot
* don't run on a schedule and timeout after 25 minutes
1 parent b63fc70 commit 054d487

1 file changed: +130 -0 lines changed

@@ -0,0 +1,130 @@
+name: self-hosted-gpu-test
+on:
+  push:
+    branches:
+      - master
+  workflow_dispatch:
+
+defaults:
+  run:
+    shell: bash -l {0}
+
+jobs:
+  start-runner:
+    name: Start self-hosted EC2 runner
+    runs-on: ubuntu-latest
+    outputs:
+      label: ${{ steps.start-ec2-runner.outputs.label }}
+      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
+    steps:
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v1
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: ${{ secrets.AWS_REGION }}
+      - name: Try to start EC2 runner
+        id: start-ec2-runner
+        uses: machulav/ec2-github-runner@main
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-04d16a12bbc76ff0b
+          ec2-instance-type: g4dn.xlarge
+          subnet-id: subnet-0dee8543e12afe0cd # us-east-1a
+          security-group-id: sg-0f9809618550edb98
+          # iam-role-name: self-hosted-runner # optional, requires additional permissions
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+
+  do-the-job:
+    name: Do the job on the runner
+    needs: start-runner # required to start the main job when the runner is ready
+    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
+    timeout-minutes: 25
+    steps:
+
+
+      - name: Check out
+        uses: actions/checkout@v3
+
+      - name: Install Miniconda
+        uses: conda-incubator/setup-miniconda@v2
+        env:
+          HOME: /home/ec2-user
+
+        with:
+          activate-environment: ""
+          auto-activate-base: true
+          miniforge-variant: Mambaforge
+
+      - name: Prepare dependencies (with CUDA)
+        env:
+          cudatoolkit: "11.7.*"
+          gxx_linux-64: "10.3.*"
+          torchani: "2.2.*"
+          nvcc_linux-64: "11.7.*"
+          python: "3.10.*"
+          pytorch-gpu: "2.0.*"
+        run: |
+          sed -i -e "/cudatoolkit/c\ - cudatoolkit ${{ env.cudatoolkit }}" \
+                 -e "/gxx_linux-64/c\ - gxx_linux-64 ${{ env.gxx_linux-64 }}" \
+                 -e "/torchani/c\ - torchani ${{ env.torchani }}" \
+                 -e "/nvcc_linux-64/c\ - nvcc_linux-64 ${{ env.nvcc_linux-64 }}" \
+                 -e "/python/c\ - python ${{ env.python }}" \
+                 -e "/pytorch-gpu/c\ - pytorch-gpu ${{ env.pytorch-gpu }}" \
+                 environment.yml
+
+      - name: Show dependency file
+        run: cat environment.yml
+
+      - name: Install dependencies
+        run: |
+          mamba env create -n nnpops -f environment.yml
+          conda init
+
+      - name: List conda environment
+        run: |
+          conda activate nnpops
+          conda list
+
+      - name: Configure, compile, and install
+        run: |
+          conda activate nnpops
+          mkdir build && cd build
+          cmake .. \
+                -DENABLE_CUDA=true \
+                -DTorch_DIR=$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')/Torch \
+                -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
+          make install
+
+      - name: Test
+        run: |
+          conda activate nnpops
+          cd build
+          ctest --verbose
+
+  stop-runner:
+    name: Stop self-hosted EC2 runner
+    needs:
+      - start-runner # required to get output from the start-runner job
+      - do-the-job # required to wait when the main job is done
+    runs-on: ubuntu-latest
+    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
+    steps:
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v1
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: ${{ secrets.AWS_REGION }}
+      - name: Stop EC2 runner
+        uses: machulav/ec2-github-runner@main
+        with:
+          mode: stop
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          label: ${{ needs.start-runner.outputs.label }}
+          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
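
Note on the "Prepare dependencies (with CUDA)" step: it overrides version pins by replacing whole lines of environment.yml with the sed `c` (change line) command. The sketch below tries the same technique on a made-up file so the behaviour can be checked locally. It assumes GNU sed (the one-line `c\text` form and `-i` without a suffix are GNU extensions). The sample file contents and the names `workdir`, `python_pin`, and `cudatoolkit_pin` are hypothetical stand-ins, not taken from the repository, whose real environment.yml is not shown in this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample file standing in for the repository's environment.yml.
workdir="$(mktemp -d)"
cat > "$workdir/environment.yml" <<'EOF'
name: sample
dependencies:
  - python 3.9.*
  - cudatoolkit 11.2.*
  - pytest
EOF

# Hypothetical pins; in the workflow these come from the step's env: block.
python_pin="3.10.*"
cudatoolkit_pin="11.7.*"

# For every line matching the pattern, discard the line and write the new pin
# in its place. Lines that match no pattern are left untouched.
sed -i -e "/cudatoolkit/c\  - cudatoolkit ${cudatoolkit_pin}" \
       -e "/python/c\  - python ${python_pin}" \
       "$workdir/environment.yml"

# Print the rewritten file, as the "Show dependency file" step does.
cat "$workdir/environment.yml"
```

The patterns are unanchored substrings, so `/python/` would also replace any other line that happens to contain "python" (for example a hypothetical `python-dateutil` entry). If the real environment.yml ever gains such a line, a narrower pattern may be needed.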
