For all-reduce distributed training or single-node training, TensorFlow 2.x is required.
To use a parameter server, install TensorFlow 2.4 or later.

```
$ pip install spektral==0.6.2
```
- Download the dataset (it is not public).
- Move the downloaded dataset to the `Data` directory.
- `cd Data`
- `tar -zxvf gdp_dataset.tgz`
- Run `python preprocess_dataset_v3.py`.
- Run `python train_gcn_v3.py` at the root directory.
- Set the nodes' IP addresses in the `dist_gcn_v3.py` file.
- Run `dist_gcn_v3.py` on each node. The chief node uses `0` as its argument, and the other worker nodes use increasing numbers starting from `1`.
```
# Chief node
$ python dist_gcn_v3.py 0

# Other worker nodes
$ python dist_gcn_v3.py 1
$ python dist_gcn_v3.py 2
...
```
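A worker-index argument like the one above typically ends up in the `TF_CONFIG` environment variable that `tf.distribute.MultiWorkerMirroredStrategy` reads. Below is a minimal sketch of that mapping; the addresses are placeholders (the real ones are whatever you set inside `dist_gcn_v3.py`), and the helper name `make_tf_config` is hypothetical.

```python
# Sketch: turning a worker index (the argument passed to dist_gcn_v3.py)
# into TF_CONFIG for MultiWorkerMirroredStrategy.
# The addresses below are placeholders, not the repo's actual values.
import json

WORKERS = ["10.0.0.1:12345", "10.0.0.2:12345", "10.0.0.3:12345"]

def make_tf_config(task_index):
    """Build the TF_CONFIG JSON for one worker; index 0 acts as the chief."""
    return json.dumps({
        "cluster": {"worker": WORKERS},
        "task": {"type": "worker", "index": task_index},
    })

# On the chief node (argument 0) you would set, before creating the strategy:
# os.environ["TF_CONFIG"] = make_tf_config(0)
```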
- 1 Chief node & 2 Worker nodes & 1 PS node

```
# Node 1
$ python dist_ps_gcn_v2.py 0
# Node 2
$ python dist_ps_gcn_v2.py 1
# Node 3
$ python dist_ps_gcn_v2.py 2
# Node 4
$ python dist_ps_gcn_v2.py 3
```
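The single numeric argument presumably selects the node's role. One plausible rank-to-role table for the 1-chief / 2-worker / 1-PS layout above is sketched below; the actual mapping lives inside `dist_ps_gcn_v2.py`, and `role_for_rank` is a hypothetical name.

```python
# Hypothetical rank -> (task_type, task_index) mapping for the
# 1 chief / 2 worker / 1 ps layout. The real mapping is defined
# inside dist_ps_gcn_v2.py; this only illustrates the idea.
ROLES = [
    ("chief", 0),   # rank 0 -> node 1
    ("worker", 0),  # rank 1 -> node 2
    ("worker", 1),  # rank 2 -> node 3
    ("ps", 0),      # rank 3 -> node 4
]

def role_for_rank(rank):
    """Translate the script's single integer argument into a TF task."""
    return ROLES[rank]
```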
- 1 Chief node & 3 [Worker + PS] nodes

```
# Node 1
$ python dist_ps_gcn_v3.py chief
# Node 2
$ python dist_ps_gcn_v3.py worker 1
$ python dist_ps_gcn_v3.py ps 1
# Node 3
$ python dist_ps_gcn_v3.py worker 2
$ python dist_ps_gcn_v3.py ps 2
# Node 4
$ python dist_ps_gcn_v3.py worker 3
$ python dist_ps_gcn_v3.py ps 3
```
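With `tf.distribute.ParameterServerStrategy`, each process likewise advertises its role through `TF_CONFIG`. A sketch of the cluster spec implied by the chief / worker / ps arguments above follows; the addresses are placeholders, and TF task indices are 0-based, so the script's `worker 1` … `worker 3` would correspond to indices 0 … 2 if it follows that convention.

```python
# Sketch of TF_CONFIG for the 1 chief + 3 [worker + ps] layout above,
# as consumed by tf.distribute.ParameterServerStrategy.
# Addresses are placeholders; each of nodes 2-4 hosts both a worker
# task and a ps task on different ports.
import json

CLUSTER = {
    "chief": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222", "10.0.0.4:2222"],
    "ps": ["10.0.0.2:2223", "10.0.0.3:2223", "10.0.0.4:2223"],
}

def make_tf_config(task_type, task_index=0):
    """TF_CONFIG JSON for one process in the parameter-server cluster."""
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })
```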
Training results are saved to `Model_v3/[datetime]/FOLD-[CV]/` or `Model_dist_v3/[datetime]/FOLD-[CV]/`.
The logging file (`train.log`) is saved to the same path as the model.
In distributed training, only the chief node saves results.
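Chief-only saving is usually gated on the task info in `TF_CONFIG`. The helper below is a sketch of that check, not the repo's actual code; `is_chief` is a hypothetical name.

```python
# Sketch of a chief-detection helper based on TF_CONFIG. Scripts often
# gate model/log saving on a check like this so that only the chief
# writes to Model_dist_v3/... . Not the repo's actual implementation.
import json
import os

def is_chief():
    """Return True if this process should save results."""
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    if not cluster:                   # single-node training: always save
        return True
    if task.get("type") == "chief":   # parameter-server setup
        return True
    # All-reduce setup without an explicit chief task: worker 0 is chief.
    return (task.get("type") == "worker"
            and task.get("index") == 0
            and "chief" not in cluster)
```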