# GCN example

## 1. Installation

For all-reduce distributed training or single-node training, TensorFlow 2.x is required.

To use the parameter server strategy, TensorFlow 2.4 or later is required.

```
$ pip install spektral==0.6.2
```
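As a quick sanity check (not part of the repository; just a suggested snippet), you can verify that the installed TensorFlow version meets these requirements:

```python
# Sanity check: TF 2.x is needed for single-node / all-reduce training,
# and TF >= 2.4 for the parameter-server scripts.
import tensorflow as tf

major, minor = (int(v) for v in tf.__version__.split(".")[:2])
assert major == 2, f"TensorFlow 2.x required, found {tf.__version__}"
print(f"TF {tf.__version__} - parameter server supported: {(major, minor) >= (2, 4)}")
```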

### 1-1. Prepare the dataset

The dataset must be downloaded separately (it is not public).

1. Move the downloaded dataset to the `Data` directory.
2. `cd Data`
3. `tar -zxvf gdp_dataset.tgz`
4. Run `python preprocess_dataset_v3.py`

## 2. Training GCN

### 2-1. Train the Model on a Single Node

Run `python train_gcn_v3.py` from the repository root.
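For reference, a minimal single-node GCN in Spektral 0.6.x looks like the sketch below. This is illustrative only: the real model, feature dimensions, and dataset loading live in `train_gcn_v3.py`, and all the shapes and data here are placeholders.

```python
# Minimal GCN sketch with spektral==0.6.2 (illustrative; not train_gcn_v3.py).
import numpy as np
import tensorflow as tf
from spektral.layers import GraphConv  # renamed GCNConv in spektral >= 1.0

N, F, n_classes = 100, 16, 4                     # assumed sizes
X = np.random.rand(N, F).astype("float32")       # placeholder node features
A = np.eye(N, dtype="float32")                   # placeholder adjacency
A_hat = GraphConv.preprocess(A)                  # normalized adjacency for GCN

x_in = tf.keras.Input(shape=(F,))
a_in = tf.keras.Input(shape=(N,))
h = GraphConv(32, activation="relu")([x_in, a_in])
out = GraphConv(n_classes, activation="softmax")([h, a_in])

model = tf.keras.Model([x_in, a_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Single mode: the whole graph is one batch, so batch_size=N and no shuffling.
y = tf.keras.utils.to_categorical(np.random.randint(n_classes, size=N))
model.fit([X, A_hat], y, batch_size=N, epochs=1, shuffle=False)
```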

### 2-2. [All-reduce] Distributed Training

1. Set the nodes' IP addresses in the `dist_gcn_v3.py` file.
2. Run `dist_gcn_v3.py` on each node, as shown below: pass `0` as the argument on the chief node and consecutive integers starting from `1` on the other workers.

```
# Chief node
$ python dist_gcn_v3.py 0

# Other worker nodes
$ python dist_gcn_v3.py 1
$ python dist_gcn_v3.py 2
...
```
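Under the hood, TensorFlow's multi-worker all-reduce training is driven by the `TF_CONFIG` environment variable. The sketch below shows the usual wiring; it assumes `dist_gcn_v3.py` uses `tf.distribute.MultiWorkerMirroredStrategy`, and the IP addresses, port, and model are placeholders, not the repository's actual code.

```python
# Sketch of all-reduce setup (assumes MultiWorkerMirroredStrategy;
# IPs, port, and model are placeholders).
import json, os, sys
import tensorflow as tf

workers = ["10.0.0.1:12345", "10.0.0.2:12345", "10.0.0.3:12345"]  # edit these
task_index = int(sys.argv[1])  # 0 on the chief, 1, 2, ... on the others

# TF_CONFIG must be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": task_index},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # Model and optimizer must be built inside the strategy scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```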

### 2-3. [Parameter Server] Distributed Training

- 1 chief node, 2 worker nodes, and 1 PS node:

```
# Node 1
$ python dist_ps_gcn_v2.py 0

# Node 2
$ python dist_ps_gcn_v2.py 1

# Node 3
$ python dist_ps_gcn_v2.py 2

# Node 4
$ python dist_ps_gcn_v2.py 3
```
- 1 chief node and 3 [worker + PS] nodes:

```
# Node 1
$ python dist_ps_gcn_v3.py chief

# Node 2
$ python dist_ps_gcn_v3.py worker 1
$ python dist_ps_gcn_v3.py ps 1

# Node 3
$ python dist_ps_gcn_v3.py worker 2
$ python dist_ps_gcn_v3.py ps 2

# Node 4
$ python dist_ps_gcn_v3.py worker 3
$ python dist_ps_gcn_v3.py ps 3
```
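Parameter-server training in TF 2.4+ is also configured through `TF_CONFIG`. The sketch below shows one plausible wiring for the `chief`/`worker`/`ps` arguments above; it is an assumption about how `dist_ps_gcn_v3.py` works, with placeholder addresses and model.

```python
# Sketch of parameter-server setup (an assumption about dist_ps_gcn_v3.py;
# addresses, ports, index mapping, and the model are placeholders).
import json, os, sys
import tensorflow as tf

cluster = {
    "chief":  ["10.0.0.1:12345"],
    "worker": ["10.0.0.2:12345", "10.0.0.3:12345", "10.0.0.4:12345"],
    "ps":     ["10.0.0.2:23456", "10.0.0.3:23456", "10.0.0.4:23456"],
}
task_type = sys.argv[1]                                 # chief | worker | ps
task_index = 0 if task_type == "chief" else int(sys.argv[2]) - 1  # 0-based

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": task_type, "index": task_index},
})

if task_type in ("worker", "ps"):
    # Workers and parameter servers run a server and block until shutdown.
    server = tf.distribute.Server(
        tf.train.ClusterSpec(cluster), job_name=task_type,
        task_index=task_index, protocol="grpc", start=True)
    server.join()
else:
    # Only the chief builds the strategy and drives training (TF >= 2.4).
    resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```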

## 3. Model Checkpoint & Logging

Training results are saved under `Model_v3/[datetime]/FOLD-[CV]/` (single-node) or `Model_dist_v3/[datetime]/FOLD-[CV]/` (distributed).

The log file (`train.log`) is saved to the same path as the model.

In distributed training, only the chief node saves results.
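This directory layout can be reproduced with standard Keras callbacks; the snippet below is only a sketch of that layout (the callback choice and the datetime format are assumptions, the path pattern follows this README).

```python
# Sketch of the checkpoint/log layout described above (callback choice and
# datetime format are assumptions; the path pattern follows the README).
import datetime, os
import tensorflow as tf

cv_fold = 0  # current cross-validation fold
run_dir = os.path.join(
    "Model_v3", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"),
    f"FOLD-{cv_fold}")
os.makedirs(run_dir, exist_ok=True)

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(os.path.join(run_dir, "model.h5")),
    tf.keras.callbacks.CSVLogger(os.path.join(run_dir, "train.log")),
]
# model.fit(..., callbacks=callbacks)
```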