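Hello! I have been using this repository for evaluation and found that when running on multiple GPUs with --mode set to infer, the run hangs after inference on the first dataset finishes. Looking at run.py, the cause is that the rank=0 process handles evaluation at https://github.com/lihytotoro/VLMEvalKit/blob/main/run.py#L337, but because only inference is requested, it hits a continue inside that if block and jumps back to the top of the loop. The other processes skip the if and reach the dist.barrier at https://github.com/lihytotoro/VLMEvalKit/blob/main/run.py#L415, where they block. Since rank=0 never arrives at that barrier, they stay stuck until NCCL times out and exits.

I think the position of this dist.barrier could be adjusted to avoid this behavior.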
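
To make the control flow described above concrete, here is a minimal sketch of the pattern, not the actual run.py code. It assumes a per-dataset loop in which rank 0 takes a `continue` in infer mode while the other ranks go on to a barrier. `threading.Barrier` with a timeout stands in for `dist.barrier` so the sketch runs without GPUs, and the names `run`, `sync`, `worker`, and the `fixed` flag are hypothetical. The `fixed=True` branch shows one possible adjustment (rank 0 syncs before it continues); the maintainers may choose a different fix.

```python
import threading


def run(world_size, datasets, mode, fixed, timeout=0.5):
    """Simulate the per-dataset loop; return the ranks that timed out at the barrier."""
    barrier = threading.Barrier(world_size)  # stand-in for dist.barrier (assumption)
    hung = set()
    lock = threading.Lock()

    def sync(rank):
        # Returns False if not every rank reached the barrier within the timeout.
        try:
            barrier.wait(timeout=timeout)
            return True
        except threading.BrokenBarrierError:
            with lock:
                hung.add(rank)
            return False

    def worker(rank):
        for _ in datasets:
            # Inference for this dataset would happen here on every rank.
            if rank == 0:
                if mode == "infer":
                    # Original pattern: rank 0 skips the barrier below.
                    # Possible fix: sync with the other ranks before continuing.
                    if fixed and not sync(rank):
                        return
                    continue
                # Evaluation would happen here on rank 0.
            if not sync(rank):
                return

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hung


if __name__ == "__main__":
    print(run(2, ["ds_a", "ds_b"], mode="infer", fixed=False))  # rank 1 times out
    print(run(2, ["ds_a", "ds_b"], mode="infer", fixed=True))   # no rank times out
```

In the simulation the stuck rank gives up after the timeout; in the real multi-GPU job the equivalent wait lasts until the NCCL timeout, which matches the hang reported above.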
Thanks for pointing this out. We will fix it.