Cluster

https://docs.ray.io/en/master/cluster/quickstart.html#ref-cluster-quick-start

AWS

  1. Prerequisites

    • aws configure (it seems a named AWS profile cannot be specified)
    • An IAM policy that allows creating IAM and EC2 resources
    • VPC and subnets (you can create them with Terraform; see the aws-vpc directory)
  2. Set up a Ray cluster

    pip install -U "ray[default]" boto3
    
    ray up -y aws-config.docker.yaml
    

    aws-minimal.yaml doesn't work because no default AMI is available for the region ap-northeast-1; the ImageId has to be set explicitly in the node config (see Error 2 below).

    ray up -y aws-config.docker.yaml
    Cluster: minimal
    
    2022-06-15 09:39:52,887 INFO util.py:335 -- setting max workers for head node type to 0
    Checking AWS environment settings
    AWS config
      IAM Profile: ray-autoscaler-v1 [default]
      EC2 Key pair (all available node types): ray-autoscaler_ap-northeast-1 [default]
      VPC Subnets (all available node types): subnet-0c99b9f8cc6d2d982, subnet-001594b9fef9558b8, subnet-0b0fe6e3b97b2d2e6 [default]
      EC2 Security groups (all available node types): sg-00de431beac12f62f [default]
      EC2 AMI (all available node types): ami-088da9557aae42f39
    
    No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
    
    Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]:
    Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
    
    Acquiring an up-to-date head node
      Launched 1 nodes [subnet_id=subnet-0c99b9f8cc6d2d982]
        Launched instance i-0d2f905979d8127f7 [state=pending, info=pending]
      Launched a new head node
      Fetching the new head node
    
    <1/1> Setting up head node
      Prepared bootstrap config
      New status: waiting-for-ssh
      [1/7] Waiting for SSH to become available
        Running `uptime` as a test.
        Waiting for IP
          Not yet available, retrying in 5 seconds
          Received: 3.115.12.55
    ssh: connect to host 3.115.12.55 port 22: Operation timed out
        SSH still not available (SSH command failed.), retrying in 5 seconds.
    Warning: Permanently added '3.115.12.55' (ED25519) to the list of known hosts.
    To run a command as administrator (user "root"), use "sudo <command>".
    See "man sudo_root" for details.
    
     00:41:01 up 0 min,  1 user,  load average: 0.45, 0.10, 0.03
    Shared connection to 3.115.12.55 closed.
        Success.
      Updating cluster configuration. [hash=4885ab5452cbfb61bc0ed1c45d02531c1893ad6d]
      New status: syncing-files
      [2/7] Processing file mounts
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
      [3/7] No worker file mounts to sync
      New status: setting-up
      [4/7] Running initialization commands
    Warning: Permanently added '3.115.12.55' (ED25519) to the list of known hosts.
    To run a command as administrator (user "root"), use "sudo <command>".
    See "man sudo_root" for details.
    
    Connection to 3.115.12.55 closed.
    Warning: Permanently added '3.115.12.55' (ED25519) to the list of known hosts.
    To run a command as administrator (user "root"), use "sudo <command>".
    See "man sudo_root" for details.
    
    # Executing docker install script, commit: b2e29ef7a9a89840d2333637f7d1900a83e7153f
    + sh -c apt-get update -qq >/dev/null
    + sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null
    + sh -c mkdir -p /etc/apt/keyrings && chmod -R 0755 /etc/apt/keyrings
    + sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" | gpg --dearmor --yes -o /etc/apt/keyrings/docker.gpg
    + sh -c chmod a+r /etc/apt/keyrings/docker.gpg
    + sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu focal stable" > /etc/apt/sources.list.d/docker.list
    + sh -c apt-get update -qq >/dev/null
    + sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq --no-install-recommends docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-scan-plugin >/dev/null
    + version_gte 20.10
    + [ -z  ]
    + return 0
    + sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq docker-ce-rootless-extras >/dev/null
    + sh -c docker version
    Client: Docker Engine - Community
     Version:           20.10.17
     API version:       1.41
     Go version:        go1.17.11
     Git commit:        100c701
     Built:             Mon Jun  6 23:02:57 2022
     OS/Arch:           linux/amd64
     Context:           default
     Experimental:      true
    
    Server: Docker Engine - Community
     Engine:
      Version:          20.10.17
      API version:      1.41 (minimum version 1.12)
      Go version:       go1.17.11
      Git commit:       a89b842
      Built:            Mon Jun  6 23:01:03 2022
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          1.6.6
      GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
     runc:
      Version:          1.1.2
      GitCommit:        v1.1.2-0-ga916309
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0
    
    ================================================================================
    
    To run Docker as a non-privileged user, consider setting up the
    Docker daemon in rootless mode for your user:
    
        dockerd-rootless-setuptool.sh install
    
    Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.
    
    
    To run the Docker daemon as a fully privileged service, but granting non-root
    users access, refer to https://docs.docker.com/go/daemon-access/
    
    WARNING: Access to the remote API on a privileged Docker daemon is equivalent
             to root access on the host. Refer to the 'Docker daemon attack surface'
             documentation for details: https://docs.docker.com/go/attack-surface/
    
    ================================================================================
    
    Connection to 3.115.12.55 closed.
    Warning: Permanently added '3.115.12.55' (ED25519) to the list of known hosts.
    Connection to 3.115.12.55 closed.
    Warning: Permanently added '3.115.12.55' (ED25519) to the list of known hosts.
    Connection to 3.115.12.55 closed.
      [5/7] Initalizing command runner
    Warning: Permanently added '3.115.12.55' (ED25519) to the list of known hosts.
    Shared connection to 3.115.12.55 closed.
    latest: Pulling from rayproject/ray
    d5fd17ec1767: Pull complete
    341eeba1871f: Pull complete
    913d7f86391e: Pull complete
    2107148d3c4b: Pull complete
    d0a777325523: Pull complete
    1dc2e272e6b0: Pull complete
    df3e0297f017: Pull complete
    da37cde4b300: Pull complete
    9d617ad85b1c: Pull complete
    Digest: sha256:7831c923acc3b761f52ac9cfa8d449d3b8ad02d611cf4615795c08982bc41c9d
    Status: Downloaded newer image for rayproject/ray:latest
    docker.io/rayproject/ray:latest
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
    4efe0306408fccb7f2df1515ca361997d494decb32715132b6101fda8243930c
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
    sending incremental file list
    ray_bootstrap_config.yaml
    
    sent 1,026 bytes  received 35 bytes  2,122.00 bytes/sec
    total size is 2,123  speedup is 2.00
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
    sending incremental file list
    ray_bootstrap_key.pem
    
    sent 1,401 bytes  received 35 bytes  957.33 bytes/sec
    total size is 1,674  speedup is 1.17
    Shared connection to 3.115.12.55 closed.
    Shared connection to 3.115.12.55 closed.
      [6/7] Running setup commands
        (0/1) pip install 'boto3>=1.4.8'
    Requirement already satisfied: boto3>=1.4.8 in ./anaconda3/lib/python3.7/site-packages (1.4.8)
    Requirement already satisfied: s3transfer<0.2.0,>=0.1.10 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (0.1.13)
    Requirement already satisfied: botocore<1.9.0,>=1.8.0 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (1.8.50)
    Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (0.10.0)
    Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in ./anaconda3/lib/python3.7/site-packages (from botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (2.8.2)
    Requirement already satisfied: docutils>=0.10 in ./anaconda3/lib/python3.7/site-packages (from botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (0.18.1)
    Requirement already satisfied: six>=1.5 in ./anaconda3/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (1.13.0)
    Shared connection to 3.115.12.55 closed.
      [7/7] Starting the Ray runtime
    Did not find any active Ray processes.
    Shared connection to 3.115.12.55 closed.
    Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
    
    Local node IP: 10.0.103.184
    2022-06-14 17:43:17,159 INFO services.py:1476 -- View the Ray dashboard at http://127.0.0.1:8265
    
    --------------------
    Ray runtime started.
    --------------------
    
    Next steps
      To connect to this Ray runtime from another node, run
        ray start --address='10.0.103.184:6379'
    
      Alternatively, use the following Python code:
        import ray
        ray.init(address='auto')
    
      To connect to this Ray runtime from outside of the cluster, for example to
      connect to a remote cluster from your laptop directly, use the following
      Python code:
        import ray
        ray.init(address='ray://<head_node_ip_address>:10001')
    
      If connection fails, check your firewall settings and network configuration.
    
      To terminate the Ray runtime, run
        ray stop
    Shared connection to 3.115.12.55 closed.
      New status: up-to-date
    
    Useful commands
      Monitor autoscaling with
        ray exec /Users/nakamasato/repos/nakamasato/ml-training/ray/03-cluster/aws-config.docker.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
      Connect to a terminal on the cluster head:
        ray attach /Users/nakamasato/repos/nakamasato/ml-training/ray/03-cluster/aws-config.docker.yaml
      Get a remote shell to the cluster manually:
        ssh -tt -o IdentitiesOnly=yes -i /Users/nakamasato/.ssh/ray-autoscaler_ap-northeast-1.pem [email protected] docker exec -it ray_container /bin/bash
    
  3. Get the IP address of the head node (e.g. with ray get-head-ip aws-config.docker.yaml).

    2022-06-15 09:45:03,967 VINFO utils.py:145 -- Creating AWS resource `ec2` in `ap-northeast-1`
    2022-06-15 09:45:04,495 VINFO utils.py:145 -- Creating AWS resource `ec2` in `ap-northeast-1`
    3.115.12.55
    
  4. Connect to a terminal on the cluster head

    ray attach aws-config.docker.yaml
    
  5. Monitor autoscaling

    ray exec aws-config.docker.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
    
  6. Submit a job.

    ray submit aws-config.docker.yaml task_pattern_tree.py
    

    While the job runs, worker instances are created and terminated continuously by the autoscaler.
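    For reference, a rough sketch of what task_pattern_tree.py might look like, inferred only from the output below (array sizes with sequential vs. distributed timings, combined in a tree of Ray tasks); the actual script in this repo may differ:

    # Hypothetical sketch, not the actual task_pattern_tree.py:
    # sort a random array sequentially, then sort chunks as parallel Ray tasks
    # and merge the partial results pairwise (tree-of-tasks pattern), timing both.
    import time

    import numpy as np
    import ray

    ray.init(address="auto")  # connect to the running cluster


    @ray.remote
    def sort_chunk(chunk):
        return np.sort(chunk)


    @ray.remote
    def merge(left, right):
        # merge two sorted chunks (simple but correct)
        merged = np.concatenate((left, right))
        merged.sort()
        return merged


    def tree_sort(array, num_chunks=8):
        # leaves: sort each chunk as a remote task
        refs = [sort_chunk.remote(c) for c in np.array_split(array, num_chunks)]
        # reduce pairwise until a single result remains
        while len(refs) > 1:
            paired = [merge.remote(refs[i], refs[i + 1]) for i in range(0, len(refs) - 1, 2)]
            if len(refs) % 2:
                paired.append(refs[-1])
            refs = paired
        return ray.get(refs[0])


    for size in [200_000, 4_000_000, 8_000_000, 10_000_000, 20_000_000]:
        data = np.random.rand(size)
        print(f"Array size: {size}")
        start = time.time()
        np.sort(data)
        print(f"Sequential execution: {time.time() - start:.3f}")
        start = time.time()
        tree_sort(data)
        print(f"Distributed execution: {time.time() - start:.3f}")
        print("-" * 20)
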

    ray submit aws-config.docker.yaml task_pattern_tree.py
    2022-05-25 05:51:39,806 INFO util.py:335 -- setting max workers for head node type to 0
    Loaded cached provider configuration
    If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
    Fetched IP: 35.77.53.85
    Shared connection to 35.77.53.85 closed.
    Shared connection to 35.77.53.85 closed.
    2022-05-25 05:51:42,919 INFO util.py:335 -- setting max workers for head node type to 0
    Fetched IP: 35.77.53.85
    Shared connection to 35.77.53.85 closed.
    Array size: 200000
    Sequential execution: 0.039
    Distributed execution: 1.131
    --------------------
    Array size: 4000000
    Sequential execution: 5.835
    (scheduler +9s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
    (scheduler +9s) Adding 1 nodes of type ray.worker.default.
    (scheduler +14s) Adding 1 nodes of type ray.worker.default.
    2022-05-24 13:52:02,353 WARNING worker.py:1382 -- WARNING: 8 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    Distributed execution: 18.052
    --------------------
    Array size: 8000000
    Sequential execution: 15.108
    2022-05-24 13:52:40,207 WARNING worker.py:1382 -- WARNING: 10 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:44,694 WARNING worker.py:1382 -- WARNING: 12 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:46,193 WARNING worker.py:1382 -- WARNING: 14 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:47,724 WARNING worker.py:1382 -- WARNING: 16 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:49,588 WARNING worker.py:1382 -- WARNING: 19 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:51,313 WARNING worker.py:1382 -- WARNING: 20 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:58,805 WARNING worker.py:1382 -- WARNING: 22 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:52:59,761 WARNING worker.py:1382 -- WARNING: 24 PYTHON worker processes have been started on node: d1b27708d013cf095512c80ddafaf0d9ad9cc8f8c31f9d0b24850579 with address: 10.0.103.208. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    Distributed execution: 45.697
    --------------------
    Array size: 10000000
    Sequential execution: 22.814
    (scheduler +2m2s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
    (scheduler +2m28s) Resized to 4 CPUs.
    Distributed execution: 50.749
    --------------------
    Array size: 20000000
    (scheduler +3m14s) Resized to 6 CPUs.
    Sequential execution: 55.681
    2022-05-24 13:55:41,083 WARNING worker.py:1382 -- WARNING: 8 PYTHON worker processes have been started on node: fbd302fdeae6c78aaa9a68efcf4966a40eb8ac252203af42b9b27b60 with address: 10.0.103.232. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:55:42,331 WARNING worker.py:1382 -- WARNING: 10 PYTHON worker processes have been started on node: fbd302fdeae6c78aaa9a68efcf4966a40eb8ac252203af42b9b27b60 with address: 10.0.103.232. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:55:42,870 WARNING worker.py:1382 -- WARNING: 8 PYTHON worker processes have been started on node: 096b57a8b91d0919ed8224be96e20497710c6fc188162079e5a3551f with address: 10.0.103.17. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:55:43,523 WARNING worker.py:1382 -- WARNING: 12 PYTHON worker processes have been started on node: fbd302fdeae6c78aaa9a68efcf4966a40eb8ac252203af42b9b27b60 with address: 10.0.103.232. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:55:53,981 WARNING worker.py:1382 -- WARNING: 10 PYTHON worker processes have been started on node: 096b57a8b91d0919ed8224be96e20497710c6fc188162079e5a3551f with address: 10.0.103.17. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    (scheduler +4m10s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
    2022-05-24 13:55:55,562 WARNING worker.py:1382 -- WARNING: 12 PYTHON worker processes have been started on node: 096b57a8b91d0919ed8224be96e20497710c6fc188162079e5a3551f with address: 10.0.103.17. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:55:56,807 WARNING worker.py:1382 -- WARNING: 14 PYTHON worker processes have been started on node: 096b57a8b91d0919ed8224be96e20497710c6fc188162079e5a3551f with address: 10.0.103.17. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    2022-05-24 13:55:58,590 WARNING worker.py:1382 -- WARNING: 16 PYTHON worker processes have been started on node: 096b57a8b91d0919ed8224be96e20497710c6fc188162079e5a3551f with address: 10.0.103.17. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    Distributed execution: 54.941
    --------------------
    Shared connection to 35.77.53.85 closed.
    

    While the job was running, I also checked the cluster status (with the autoscaling monitor command from step 5):

    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/2.0 CPU
     0.00/4.358 GiB memory
     0.00/2.179 GiB object_store_memory
    
    Demands:
     (no resource demands)
    2022-05-24 13:58:39,155 INFO autoscaler.py:330 --
    ======== Autoscaler status: 2022-05-24 13:58:39.155496 ========
    Node status
    ---------------------------------------------------------------
    Healthy:
     1 ray.head.default
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)
    
    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/2.0 CPU
     0.00/4.358 GiB memory
     0.00/2.179 GiB object_store_memory
    
    Demands:
     (no resource demands)
    (the same autoscaler status block repeats every ~5 seconds with unchanged resource usage)
    Shared connection to 35.77.53.85 closed.
    Error: Command failed:
    
      ssh -tt -i /Users/nakamasato/.ssh/ray-autoscaler_ap-northeast-1.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_a09e268f38/dc43e863c1/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'"'"'"'"'"'"'"'"''"'"' )'
    
  7. Tear down the Ray cluster (the EC2 instances are also terminated).

    ray down aws-config.docker.yaml
    
    ray down aws-config.docker.yaml
    2022-06-15 10:05:37,879 INFO util.py:335 -- setting max workers for head node type to 0
    Loaded cached provider configuration
    If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
    Destroying cluster. Confirm [y/N]: y
    2022-06-15 10:05:44,582 INFO util.py:335 -- setting max workers for head node type to 0
    Fetched IP: 3.115.12.55
    Stopped all 7 Ray processes.
    Shared connection to 3.115.12.55 closed.
    Fetched IP: 3.115.12.55
    Requested 1 nodes to shut down. [interval=1s]
    0 nodes remaining after 5 second(s).
    No nodes remaining.
    

    Clean up remaining resources:

    1. ec2:

      aws ec2 delete-key-pair --key-name ray-autoscaler_ap-northeast-1
      security_group_id=$(aws ec2 describe-security-groups --filters Name=group-name,Values=ray-autoscaler-minimal | jq -r '.SecurityGroups[].GroupId')
      echo $security_group_id
      aws ec2 delete-security-group --group-id $security_group_id
      
    2. local:

      rm ~/.ssh/ray-autoscaler_*
      
    3. iam:

      aws iam remove-role-from-instance-profile --instance-profile-name ray-autoscaler-v1  --role-name ray-autoscaler-v1
      aws iam delete-instance-profile --instance-profile-name ray-autoscaler-v1
      aws iam detach-role-policy --role-name ray-autoscaler-v1 --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
      aws iam detach-role-policy --role-name ray-autoscaler-v1 --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
      aws iam delete-role --role-name ray-autoscaler-v1
      
    4. If you created the VPC and subnets with Terraform, clean them up (this completes within a few seconds):

      cd aws-vpc
      terraform destroy
      

      If you get the following error, you probably forgot to delete the security group first.

      │ Error: error deleting EC2 VPC (vpc-0e8f61401b5cc4c96): DependencyViolation: The vpc 'vpc-0e8f61401b5cc4c96' has dependencies and cannot be deleted.
      │       status code: 400, request id: f09016c6-7c65-4e55-b467-41fdc830eb3e
      │
      

Errors

  1. Error1: No usable subnets found

    ➜  ray git:(ray-getting-started) ✗ ray down -y aws-config.docker.yaml
    2022-04-28 13:33:15,797 INFO util.py:335 -- setting max workers for head node type to 0
    2022-04-28 13:33:15,798 INFO util.py:339 -- setting max workers for ray.worker.default to 2
    Checking AWS environment settings
    No usable subnets found, try manually creating an instance in your specified region to populate the list of subnets and trying this again.
    Note that the subnet must map public IPs on instance launch unless you set `use_internal_ips: true` in the `provider` config.
    

    solution:

    cd aws-vpc
    terraform init
    terraform apply
    
  2. Error2: no default AMI is available for the region ap-northeast-1

    Node type `ray.head.default` has no ImageId in its node_config and no default AMI is available for the region `ap-northeast-1`. ImageId will need to be set manually in your cluster config.
    

    solution:

    Pick an AMI ID in the EC2 console (here ami-088da9557aae42f39) and set it as ImageId in the node_config of your cluster config.

  3. Error3: The architecture 'x86_64' of the specified instance type does not match the architecture 'arm64'

    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RunInstances operation: The architecture 'x86_64' of the specified instance type does not match the architecture 'arm64' of the specified AMI. Specify an instance type and an AMI that have matching architectures, and try again. You can use 'describe-instance-types' or 'describe-images' to discover the architecture of the instance type or AMI.
    

    solution:

    Pick an AMI whose architecture matches the instance type (here ami-088da9557aae42f39, an x86_64 AMI) from the EC2 console and set it as ImageId.

  4. Error4: pip not found

    Command 'pip' not found, but can be installed with:
    
    sudo apt install python3-pip
    
    Shared connection to 35.78.246.199 closed.
      New status: update-failed
      !!!
      SSH command failed.
      !!!
    
      Failed to setup head node.
    

    solution:

    Use Docker for the node setup by adding the following to the cluster config:

    docker:
        image: "rayproject/ray-ml:latest"
        container_name: "ray_container"
  5. Error5: Command 'docker' not found

    Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
    Cluster: minimal
    
    2022-04-28 18:43:30,627 INFO util.py:335 -- setting max workers for head node type to 0
    Checking AWS environment settings
    AWS config
      IAM Profile: ray-autoscaler-v1 [default]
      EC2 Key pair (all available node types): ray-autoscaler_1_ap-northeast-1 [default]
      VPC Subnets (all available node types): subnet-0d56536828c7b98d5, subnet-04cbf4481ef89650b, subnet-042a84abdec0ad522 [default]
      EC2 Security groups (all available node types): sg-016a1f14dce3f5504 [default]
      EC2 AMI (all available node types): ami-088da9557aae42f39
    
    Updating cluster configuration and running full setup.
    Cluster Ray runtime will be restarted. Confirm [y/N]: y [automatic, due to --yes]
    
    <1/1> Setting up head node
      Prepared bootstrap config
      New status: waiting-for-ssh
      [1/7] Waiting for SSH to become available
        Running `uptime` as a test.
        Fetched IP: 35.78.246.199
    Warning: Permanently added '35.78.246.199' (ED25519) to the list of known hosts.
    To run a command as administrator (user "root"), use "sudo <command>".
    See "man sudo_root" for details.
    
     09:43:36 up 7 min,  1 user,  load average: 0.00, 0.04, 0.02
    Shared connection to 35.78.246.199 closed.
        Success.
      Updating cluster configuration. [hash=e704226942567d8bf66edaedb68b995897fcca43]
      New status: syncing-files
      [2/7] Processing file mounts
    Shared connection to 35.78.246.199 closed.
    Shared connection to 35.78.246.199 closed.
      [3/7] No worker file mounts to sync
      New status: setting-up
      [4/7] No initialization commands to run.
      [5/7] Initalizing command runner
    Shared connection to 35.78.246.199 closed.
    2022-04-28 18:43:39,917 ERROR command_runner.py:790 -- Docker not installed. You can install Docker by adding the following commands to 'initialization_commands':
    curl -fsSL https://get.docker.com -o get-docker.sh
    sudo sh get-docker.sh
    sudo usermod -aG docker $USER
    sudo systemctl restart docker -f
    To run a command as administrator (user "root"), use "sudo <command>".
    See "man sudo_root" for details.
    
    
    Command 'docker' not found, but can be installed with:
    
    sudo apt install docker.io
    
    Shared connection to 35.78.246.199 closed.
      New status: update-failed
      !!!
      SSH command failed.
      !!!
    
    

    solution:

    initialization_commands:
        - curl -fsSL https://get.docker.com -o get-docker.sh
        - sudo sh get-docker.sh
        - sudo usermod -aG docker $USER
        - sudo systemctl restart docker -f

Implementation

  1. sdk.create_or_update_cluster calls _private.commands.create_or_update_cluster
  2. _private.commands.create_or_update_cluster calls _bootstrap_config
  3. _bootstrap_config gets importer = _NODE_PROVIDERS.get(config["provider"]["type"]), builds provider_cls = importer(config["provider"]), and calls provider_cls.bootstrap_config(config)
  4. AWSNodeProvider(NodeProvider).bootstrap_config calls bootstrap_aws.
  5. bootstrap_aws(config) fills in the AWS defaults shown under "AWS config" above: the IAM instance profile, EC2 key pair, subnets, security groups, and AMI.
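
The same entry point is available from Python, so ray up / ray down can be reproduced programmatically. A minimal sketch using ray.autoscaler.sdk (argument names from the public SDK; behavior may vary across Ray versions):

    # Rough Python equivalent of `ray up -y aws-config.docker.yaml` and
    # `ray down aws-config.docker.yaml`, using the public autoscaler SDK.
    from ray.autoscaler import sdk

    # Steps 1-5 above: parse the YAML, run bootstrap_config/bootstrap_aws to fill
    # in the IAM profile, key pair, subnets, security group and AMI, then set up
    # and start the head node.
    sdk.create_or_update_cluster(
        "aws-config.docker.yaml",
        no_restart=False,
        restart_only=False,
        no_config_cache=True,
    )

    # Counterpart of `ray down`: stop Ray and terminate the nodes.
    sdk.teardown_cluster("aws-config.docker.yaml")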