diff --git a/OneFlow/Recognition/insightface/rn50/README.md b/OneFlow/Recognition/insightface/rn50/README.md
new file mode 100644
index 00000000..d32b187b
--- /dev/null
+++ b/OneFlow/Recognition/insightface/rn50/README.md
@@ -0,0 +1,220 @@
+# OneFlow InsightFace (r50) Benchmark
+
+## Overview
+
+This report benchmarks the InsightFace network from the [oneflow_face](https://github.com/Oneflow-Inc/oneflow_face/tree/dev_rn_50_test) repository with the OneFlow framework, scaling from a single GPU on one machine up to multiple GPUs on multiple machines, to measure training speed and horizontal scalability under distributed training.
+
+The backbone used in this benchmark is the r50 network, tested under both data parallelism and model parallelism.
+
+## Environment
+
+### System
+
+- #### Hardware
+
+  - GPU: Tesla V100-SXM2-16GB x 8
+
+- #### Software
+
+  - Driver: NVIDIA 440.33.01
+  - OS: [Ubuntu 16.04](http://releases.ubuntu.com/16.04/)
+  - CUDA: 10.2
+  - cuDNN: 7.6.5
+  - OneFlow: 0.3.4
+  - Python: 3.7
+
+#### Feature support matrix
+
+| Feature                         | InsightFace r50 OneFlow |
+| ------------------------------- | ----------------------- |
+| Multi-GPU training              | Yes                     |
+| Automatic mixed precision (AMP) | Yes                     |
+| Model parallelism               | Yes                     |
+| Partial FC                      | Yes                     |
+
+## Quick Start
+
+### 1. 
Preparation
+
+- #### Datasets
+
+Prepare the Face Emore and Glint360k datasets in OFRecord format. Following the [Loading and preparing OFRecord datasets](https://docs.oneflow.org/extended_topics/how_to_make_ofdataset.html) guide, you can either use the Python script together with Spark shuffle and Spark partition to generate one complete OFRecord of all the data, or use the Python script alone to generate multiple OFRecord part files for the InsightFace tests.
+
+See the deepinsight [Training Data](https://github.com/deepinsight/insightface#training-data) section to download the [MS1M-ArcFace](https://pan.baidu.com/s/1S6LJZGdqcZRle1vlcMzHOQ) or [Glint360k](https://pan.baidu.com/share/init?surl=GsYqTTt7_Dn8BfxxsLFN0w) dataset from the [Dataset-Zoo](https://github.com/deepinsight/insightface/wiki/Dataset-Zoo).
+
+For the exact preparation steps, see the [Prepare the datasets](https://github.com/Oneflow-Inc/oneflow_face/blob/master/README_CH.md#%E5%87%86%E5%A4%87%E6%95%B0%E6%8D%AE%E9%9B%86) section of [InsightFace in OneFlow](https://github.com/Oneflow-Inc/oneflow_face/blob/master/README_CH.md#insightface-%E5%9C%A8-oneflow-%E4%B8%AD%E7%9A%84%E5%AE%9E%E7%8E%B0).
+
+For more about OneFlow OFRecord datasets, see [Loading and preparing OFRecord datasets](https://docs.oneflow.org/extended_topics/how_to_make_ofdataset.html) and [Converting image files to OFRecord datasets](https://docs.oneflow.org/extended_topics/how_to_convert_image_to_ofrecord.html).
+
+### 2. Run the tests
+
+The test cluster has 4 nodes:
+
+- NODE1=10.11.0.2
+- NODE2=10.11.0.3
+- NODE3=10.11.0.4
+- NODE4=10.11.0.5
+
+Each node has 8 V100 GPUs, each with 16 GB of memory.
+
+- #### Single-node tests
+
+In the container on node 1, download the oneflow_face source and this repository:
+
+```
+git clone https://github.com/Oneflow-Inc/oneflow_face.git
+git clone https://github.com/Oneflow-Inc/DLPerf.git
+```
+
+Copy the scripts folder from the DLPerf/OneFlow/Recognition/InsightFace/r50/ path of this repository into the oneflow_face directory.
+
+After configuring the dataset and other default parameters in sample_config.py, run the following to test on a single machine with 1, 4, and 8 GPUs:
+
+```
+bash scripts/run_single_node.sh
+```
+
+The defaults are batch size 128, the r50 backbone, the emore dataset, and the arcface loss.
+
+You can also edit the parameters in `run_single_node.sh` to test other options, e.g. set `model_parallel=${9:-1}` to enable model parallelism for the fully connected layer.
+
+- #### Multi-node tests
+
+Run `bash scripts/run_two_node.sh` for the 2-node test and `bash scripts/run_multi_node.sh` for the 4-node test.
+
+### 3. 
Data processing
+
+Multiple training runs are performed (5 in this test). Each run records the first 150 iterations of the first epoch, and the training speed is computed over the last 100 iterations to reduce jitter. The final speed is the median over the 5 runs, and the speedup is computed from that median.
+
+Run DLPerf/OneFlow/Recognition/InsightFace/rn50/extract_oneflow_logs_time.py to process the log data of the different test configurations:
+
+```
+python extract_oneflow_logs_time.py -ld 20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/
+```
+
+The printed result looks like:
+
+```
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/4n8g/r50_b128_fp32_1.log {1: 12281.66}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/4n8g/r50_b128_fp32_4.log {1: 12281.66, 4: 12320.24}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/4n8g/r50_b128_fp32_2.log {1: 12281.66, 4: 12320.24, 2: 12397.28}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/4n8g/r50_b128_fp32_3.log {1: 12281.66, 4: 12320.24, 2: 12397.28, 3: 12373.61}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/4n8g/r50_b128_fp32_5.log {1: 12281.66, 4: 12320.24, 2: 12397.28, 3: 12373.61, 5: 12285.6}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n8g/r50_b128_fp32_1.log {1: 3248.24}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n8g/r50_b128_fp32_4.log {1: 3248.24, 4: 3285.21}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n8g/r50_b128_fp32_2.log {1: 3248.24, 4: 3285.21, 2: 3286.14}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n8g/r50_b128_fp32_3.log {1: 3248.24, 4: 3285.21, 2: 3286.14, 3: 3278.55}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n8g/r50_b128_fp32_5.log {1: 3248.24, 4: 3285.21, 2: 3286.14, 3: 3278.55, 5: 3276.42}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n4g/r50_b128_fp32_1.log {1: 1649.55}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n4g/r50_b128_fp32_4.log {1: 1649.55, 4: 1653.18}
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n4g/r50_b128_fp32_2.log {1: 1649.55, 4: 1653.18, 2: 1652.16} 
+20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n4g/r50_b128_fp32_3.log {1: 1649.55, 4: 1653.18, 2: 1652.16, 3: 1654.45} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n4g/r50_b128_fp32_5.log {1: 1649.55, 4: 1653.18, 2: 1652.16, 3: 1654.45, 5: 1651.67} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n1g/r50_b128_fp32_1.log {1: 425.11} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n1g/r50_b128_fp32_4.log {1: 425.11, 4: 424.65} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n1g/r50_b128_fp32_2.log {1: 425.11, 4: 424.65, 2: 424.75} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n1g/r50_b128_fp32_3.log {1: 425.11, 4: 424.65, 2: 424.75, 3: 424.84} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/1n1g/r50_b128_fp32_5.log {1: 425.11, 4: 424.65, 2: 424.75, 3: 424.84, 5: 424.35} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/2n8g/r50_b128_fp32_1.log {1: 6330.97} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/2n8g/r50_b128_fp32_4.log {1: 6330.97, 4: 6343.74} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/2n8g/r50_b128_fp32_2.log {1: 6330.97, 4: 6343.74, 2: 6340.23} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/2n8g/r50_b128_fp32_3.log {1: 6330.97, 4: 6343.74, 2: 6340.23, 3: 6354.92} +20210319_r50_fp32_b128_oneflow_model_parallel_1_partial_fc_0/2n8g/r50_b128_fp32_5.log {1: 6330.97, 4: 6343.74, 2: 6340.23, 3: 6354.92, 5: 6384.45} +{'r50': {'1n1g': {'average_speed': 424.74, + 'batch_size_per_device': 128, + 'median_speed': 424.75, + 'speedup': 1.0}, + '1n4g': {'average_speed': 1652.2, + 'batch_size_per_device': 128, + 'median_speed': 1652.16, + 'speedup': 3.89}, + '1n8g': {'average_speed': 3274.91, + 'batch_size_per_device': 128, + 'median_speed': 3278.55, + 'speedup': 7.72}, + '2n8g': {'average_speed': 6350.86, + 'batch_size_per_device': 128, + 'median_speed': 6343.74, + 'speedup': 14.94}, + '4n8g': 
{'average_speed': 12331.68,
+                  'batch_size_per_device': 128,
+                  'median_speed': 12320.24,
+                  'speedup': 29.01}}}
+Saving result to ./result/_result.json
+```
+
+## Performance
+
+This section provides the performance results and full logs of the InsightFace model with the OneFlow framework in the single-node and multi-node tests.
+
+### Face Emore & R50 & FP32
+
+#### Data Parallelism
+
+**batch_size = 128**
+
+| node_num | gpu_num_per_node | batch_size_per_device | samples/s | speedup |
+| -------- | ---------------- | --------------------- | --------- | ------- |
+| 1        | 1                | 128                   | 424.57    | 1.00    |
+| 1        | 4                | 128                   | 1635.63   | 3.85    |
+| 1        | 8                | 128                   | 3266.08   | 7.69    |
+| 2        | 8                | 128                   | 5827.13   | 13.72   |
+| 4        | 8                | 128                   | 11383.94  | 26.81   |
+
+### Face Emore & R50 & FP32
+
+#### Model Parallelism
+
+**batch_size = 128**
+
+| node_num | gpu_num_per_node | batch_size_per_device | samples/s | speedup |
+| -------- | ---------------- | --------------------- | --------- | ------- |
+| 1        | 1                | 128                   | 424.75    | 1.00    |
+| 1        | 4                | 128                   | 1652.16   | 3.89    |
+| 1        | 8                | 128                   | 3278.55   | 7.72    |
+| 2        | 8                | 128                   | 6343.74   | 14.94   |
+| 4        | 8                | 128                   | 12320.24  | 29.01   |
+
+The full logs can be downloaded here: [logs-20210319.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/OneFlow/InsightFace/r50/logs-20210319.zip)
\ No newline at end of file
diff --git a/OneFlow/Recognition/insightface/rn50/extract_oneflow_logs_time.py b/OneFlow/Recognition/insightface/rn50/extract_oneflow_logs_time.py
new file mode 100644
index 00000000..a5bb7b95
--- /dev/null
+++ b/OneFlow/Recognition/insightface/rn50/extract_oneflow_logs_time.py
@@ -0,0 +1,138 @@
+import os
+import re
+import sys
+import glob
+import json
+import argparse
+import pprint
+import numpy as np
+
+pp = pprint.PrettyPrinter(indent=1)
+os.chdir(sys.path[0])
+
+parser = argparse.ArgumentParser(description="flags for InsightFace benchmark log data processing")
+parser.add_argument("-ld", "--log_dir", type=str, default="/workspace/oneflow_face/scripts/oneflow", required=True)
+parser.add_argument("-od", "--output_dir", 
type=str, default="./result", required=False)
+parser.add_argument("-wb", "--warmup_batches", type=int, default=50)
+parser.add_argument("-tb", "--train_batches", type=int, default=150)
+parser.add_argument("-bz", "--batch_size_per_device", type=int, default=128)
+
+args = parser.parse_args()
+
+
+class AutoVivification(dict):
+    """Implementation of perl's autovivification feature."""
+
+    def __getitem__(self, item):
+        try:
+            return dict.__getitem__(self, item)
+        except KeyError:
+            value = self[item] = type(self)()
+            return value
+
+
+def extract_info_from_file(log_file, result_dict, speed_dict):
+    # extract info from the file name
+    fname = os.path.basename(log_file)
+    run_case = log_file.split("/")[-2]  # eg: 1n1g
+    model = fname.split("_")[0]
+    batch_size = int(fname.split("_")[1].strip("b"))
+    precision = fname.split("_")[2]
+    test_iter = int(fname.split("_")[3].strip(".log"))
+    node_num = int(run_case[0])
+    if len(run_case) == 4:
+        card_num = int(run_case[-2])
+    elif len(run_case) == 5:
+        card_num = int(run_case[-3:-1])
+
+    total_batch_size = node_num * card_num * batch_size
+
+    tmp_dict = {
+        'average_speed': 0,
+        'batch_size_per_device': batch_size,
+    }
+
+    avg_speed = 0
+    # extract throughput from the file content
+    pt = re.compile(r"throughput: (.*)", re.S)
+    line_num = 0
+    throughput_data = []
+    with open(log_file) as f:
+        for line in f.readlines():
+            if "train: iter" in line:
+                line_num += 1
+
+                # skip the first `warmup_batches` iterations to reduce jitter
+                if line_num >= args.warmup_batches:
+                    throughput = float(re.findall(pt, line)[0])
+                    throughput_data.append(throughput)
+
+                if line_num == args.train_batches - 1:
+                    avg_speed = round(np.mean(throughput_data), 2)
+                    break
+
+    # compute the average throughput
+    tmp_dict['average_speed'] = avg_speed
+    result_dict[model][run_case]['average_speed'] = avg_speed
+
result_dict[model][run_case]['batch_size_per_device'] = tmp_dict['batch_size_per_device']
+
+    speed_dict[model][run_case][test_iter] = avg_speed
+
+    print(log_file, speed_dict[model][run_case])
+
+
+def compute_speedup(result_dict, speed_dict):
+    model_list = [key for key in result_dict]  # eg. ['vgg16', 'rn50']
+    for m in model_list:
+        run_case = [key for key in result_dict[m]]  # eg. ['4n8g', '2n8g', '1n8g', '1n4g', '1n1g']
+        for d in run_case:
+            speed_up = 1.0
+            if result_dict[m]['1n1g']['average_speed']:
+                result_dict[m][d]['average_speed'] = compute_average(speed_dict[m][d])
+                result_dict[m][d]['median_speed'] = compute_median(speed_dict[m][d])
+                speed_up = result_dict[m][d]['median_speed'] / compute_median(speed_dict[m]['1n1g'])
+            result_dict[m][d]['speedup'] = round(speed_up, 2)
+
+
+def compute_average(iter_dict):
+    return round(sum(iter_dict.values()) / len(iter_dict), 2)
+
+
+def compute_median(iter_dict):
+    return round(np.median(list(iter_dict.values())), 2)
+
+
+def extract_result():
+    result_dict = AutoVivification()
+    speed_dict = AutoVivification()
+    logs_list = glob.glob(os.path.join(args.log_dir, "*/*.log"))
+    for l in logs_list:
+        extract_info_from_file(l, result_dict, speed_dict)
+
+    # compute speedup
+    compute_speedup(result_dict, speed_dict)
+
+    # print result
+    pp.pprint(result_dict)
+
+    # write to file as JSON format
+    os.makedirs(args.output_dir, exist_ok=True)
+    framework = args.log_dir.split('/')[-1]
+    result_file_name = os.path.join(args.output_dir, framework + "_result.json")
+    print("Saving result to {}".format(result_file_name))
+    with open(result_file_name, 'w') as f:
+        json.dump(result_dict, f)
+
+
+if __name__ == "__main__":
+    extract_result()
diff --git a/OneFlow/Recognition/insightface/rn50/scripts/run_multi_node.sh b/OneFlow/Recognition/insightface/rn50/scripts/run_multi_node.sh
new file mode 100644
index 00000000..aeb246bd
--- /dev/null
+++ 
b/OneFlow/Recognition/insightface/rn50/scripts/run_multi_node.sh
@@ -0,0 +1,26 @@
+workspace=/home/leinao/lyon_test/oneflow_face
+network=${1:-"r50"}
+dataset=${2:-"emore"}
+loss=${3:-"arcface"}
+num_nodes=${4:-4}
+bz_per_device=${5:-128}
+train_unit=${6:-"batch"}
+train_iter=${7:-150}
+precision=${8:-fp32}
+model_parallel=${9:-0}
+partial_fc=${10:-0}
+test_times=${11:-5}
+sample_ratio=${12:-1.0}
+num_classes=${13:-85744}
+use_synthetic_data=${14:-False}
+
+# repeat the 4-node, 8-GPU case `test_times` times
+i=1
+while [ $i -le $test_times ]; do
+    rm -rf new_models
+    bash ${workspace}/scripts/train_insightface.sh ${workspace} ${network} ${dataset} ${loss} ${num_nodes} ${bz_per_device} ${train_unit} ${train_iter} 8 ${precision} ${model_parallel} ${partial_fc} $i ${sample_ratio} ${num_classes} ${use_synthetic_data}
+    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished Test Case ${i}! <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+    let i++
+    sleep 20
+done
diff --git a/OneFlow/Recognition/insightface/rn50/scripts/run_single_node.sh b/OneFlow/Recognition/insightface/rn50/scripts/run_single_node.sh
new file mode 100644
index 00000000..578a828e
--- /dev/null
+++ b/OneFlow/Recognition/insightface/rn50/scripts/run_single_node.sh
@@ -0,0 +1,45 @@
+workspace=/home/leinao/lyon_test/oneflow_face
+network=${1:-"r50"}
+dataset=${2:-"emore"}
+loss=${3:-"arcface"}
+num_nodes=${4:-1}
+bz_per_device=${5:-128}
+train_unit=${6:-"batch"}
+train_iter=${7:-150}
+precision=${8:-fp32}
+model_parallel=${9:-0}
+partial_fc=${10:-0}
+test_times=${11:-5}
+sample_ratio=${12:-1.0}
+num_classes=${13:-85744}
+use_synthetic_data=${14:-False}
+
+# 1 node, 1 GPU
+i=1
+while [ $i -le $test_times ]; do
+    rm -rf new_models
+    bash ${workspace}/scripts/train_insightface.sh ${workspace} ${network} ${dataset} ${loss} ${num_nodes} ${bz_per_device} ${train_unit} ${train_iter} 1 ${precision} ${model_parallel} ${partial_fc} $i ${sample_ratio} ${num_classes} ${use_synthetic_data}
+    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished Test Case ${i}! 
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+    let i++
+    sleep 20
+done
+
+# 1 node, 4 GPUs
+i=1
+while [ $i -le $test_times ]; do
+    rm -rf new_models
+    bash ${workspace}/scripts/train_insightface.sh ${workspace} ${network} ${dataset} ${loss} ${num_nodes} ${bz_per_device} ${train_unit} ${train_iter} 4 ${precision} ${model_parallel} ${partial_fc} $i ${sample_ratio} ${num_classes} ${use_synthetic_data}
+    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished Test Case ${i}! <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+    let i++
+    sleep 20
+done
+
+# 1 node, 8 GPUs
+i=1
+while [ $i -le $test_times ]; do
+    rm -rf new_models
+    bash ${workspace}/scripts/train_insightface.sh ${workspace} ${network} ${dataset} ${loss} ${num_nodes} ${bz_per_device} ${train_unit} ${train_iter} 8 ${precision} ${model_parallel} ${partial_fc} $i ${sample_ratio} ${num_classes} ${use_synthetic_data}
+    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished Test Case ${i}! <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+    let i++
+    sleep 20
+done
diff --git a/OneFlow/Recognition/insightface/rn50/scripts/run_two_node.sh b/OneFlow/Recognition/insightface/rn50/scripts/run_two_node.sh
new file mode 100644
index 00000000..f4e7fec5
--- /dev/null
+++ b/OneFlow/Recognition/insightface/rn50/scripts/run_two_node.sh
@@ -0,0 +1,26 @@
+workspace=/home/leinao/lyon_test/oneflow_face
+network=${1:-"r50"}
+dataset=${2:-"emore"}
+loss=${3:-"arcface"}
+num_nodes=${4:-2}
+bz_per_device=${5:-128}
+train_unit=${6:-"batch"}
+train_iter=${7:-150}
+precision=${8:-fp32}
+model_parallel=${9:-0}
+partial_fc=${10:-0}
+test_times=${11:-5}
+sample_ratio=${12:-1.0}
+num_classes=${13:-85744}
+use_synthetic_data=${14:-False}
+
+# repeat the 2-node, 8-GPU case `test_times` times
+i=1
+while [ $i -le $test_times ]; do
+    rm -rf new_models
+    bash ${workspace}/scripts/train_insightface.sh ${workspace} ${network} ${dataset} ${loss} ${num_nodes} ${bz_per_device} ${train_unit} ${train_iter} 8 ${precision} ${model_parallel} ${partial_fc} $i ${sample_ratio} ${num_classes} ${use_synthetic_data}
+    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished Test Case ${i}! 
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+    let i++
+    sleep 20
+done
diff --git a/OneFlow/Recognition/insightface/rn50/scripts/train_insightface.sh b/OneFlow/Recognition/insightface/rn50/scripts/train_insightface.sh
new file mode 100644
index 00000000..38a1d47f
--- /dev/null
+++ b/OneFlow/Recognition/insightface/rn50/scripts/train_insightface.sh
@@ -0,0 +1,90 @@
+# export ONEFLOW_DEBUG_MODE=True
+export PYTHONUNBUFFERED=1
+
+workspace=${1:-"/home/leinao/lyon_test/oneflow_face"}
+network=${2:-"r50"}
+dataset=${3:-"emore"}
+loss=${4:-"arcface"}
+num_nodes=${5:-1}
+batch_size_per_device=${6:-128}
+train_unit=${7:-"batch"}
+train_iter=${8:-150}
+gpu_num_per_node=${9:-1}
+precision=${10:-fp32}
+model_parallel=${11:-0}
+partial_fc=${12:-0}
+test_times=${13:-1}
+sample_ratio=${14:-1.0}
+num_classes=${15:-85744}
+use_synthetic_data=${16:-False}
+
+MODEL_SAVE_DIR=logs_${network}_${precision}_b${batch_size_per_device}_oneflow_model_parallel_${model_parallel}_partial_fc_${partial_fc}/${num_nodes}n${gpu_num_per_node}g
+LOG_DIR=$MODEL_SAVE_DIR
+
+# pick the OFRecord data part num that matches the dataset layout
+if [ $gpu_num_per_node -gt 1 ]; then
+    if [ $network = "r100" ] || [ $network = "r50" ]; then
+        data_part_num=32
+    elif [ $network = "r100_glint360k" ]; then
+        data_part_num=200
+    else
+        echo "Please set the exact data part num for network ${network} in sample_config.py!"
+        exit 1
+    fi
+else
+    data_part_num=1
+fi
+
+# patch sample_config.py to match this run's dataset layout and cluster size
+sed -i "s/${dataset}.train_data_part_num = [[:digit:]]*/${dataset}.train_data_part_num = $data_part_num/g" $workspace/sample_config.py
+sed -i "s/${dataset}.num_classes = [[:digit:]]*/${dataset}.num_classes = $num_classes/g" $workspace/sample_config.py
+sed -i "s/num_nodes = [[:digit:]]*/num_nodes = $num_nodes/g" $workspace/sample_config.py
+
+PREC=""
+if [ "$precision" = "fp16" ]; then
+    PREC=" --use_fp16=True"
+elif [ "$precision" = "fp32" ]; then
+    PREC=" --use_fp16=False"
+else
+    echo "Unknown precision: ${precision}"
+    exit 2
+fi
+
+LOG_FILE=${LOG_DIR}/${network}_b${batch_size_per_device}_${precision}_$test_times.log
+
+mkdir -p $MODEL_SAVE_DIR
+
+time=$(date "+%Y-%m-%d %H:%M:%S")
+echo $time
+
+CMD="$workspace/insightface_train.py"
+CMD+=" --network=${network}"
+CMD+=" --dataset=${dataset}"
+CMD+=" --loss=${loss}"
+CMD+=" --train_batch_size=$(expr $gpu_num_per_node '*' $batch_size_per_device)"
+CMD+=" --train_unit=${train_unit}"
+CMD+=" --train_iter=${train_iter}"
+CMD+=" --device_num_per_node=${gpu_num_per_node}"
+CMD+=" --model_parallel=${model_parallel}"
+CMD+=" --partial_fc=${partial_fc}"
+CMD+=" --sample_ratio=${sample_ratio}"
+CMD+=" --log_dir=${LOG_DIR}"
+CMD+=" $PREC"
+CMD+=" --use_synthetic_data=${use_synthetic_data}"
+CMD+=" --iter_num_in_snapshot=5000"
+CMD+=" --validation_interval=5000"
+
+CMD="/home/leinao/anaconda3/envs/oneflow-master/bin/python3 $CMD"
+
+set -x
+if [ -z "$LOG_FILE" ]; then
+    $CMD
+else
+    (
+        $CMD
+    ) |& tee $LOG_FILE
+fi
+set +x
+echo "Writing log to ${LOG_FILE}"
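As a sanity check on the reported numbers, the median-and-speedup reduction performed by extract_oneflow_logs_time.py can be reproduced in a few lines of NumPy. This is a minimal sketch, not part of the benchmark scripts: the two throughput lists are hard-coded from the 1n1g and 1n8g model-parallel log output shown in the README above.

```python
import numpy as np

# Throughputs (samples/s) of the 5 repeated runs, copied from the
# model-parallel logs in the README: 1n1g and 1n8g cases.
speeds_1n1g = [425.11, 424.65, 424.75, 424.84, 424.35]
speeds_1n8g = [3248.24, 3285.21, 3286.14, 3278.55, 3276.42]

# The benchmark reports the median over the 5 runs of each case...
median_1n1g = round(float(np.median(speeds_1n1g)), 2)
median_1n8g = round(float(np.median(speeds_1n8g)), 2)

# ...and the speedup of each case relative to the 1n1g median.
speedup_1n8g = round(median_1n8g / median_1n1g, 2)

print(median_1n1g, median_1n8g, speedup_1n8g)  # 424.75 3278.55 7.72
```

These values match the "Model Parallelism" table rows for 1x1 and 1x8 GPUs (424.75 samples/s, 3278.55 samples/s, 7.72x speedup).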