feat(wandb): enable logging #225
base: main
Conversation
Please fix errors following pylint instructions:

************* Module models.train_helper
examples/megatron/models/train_helper.py:58: [C0303(trailing-whitespace), ] Trailing whitespace
************* Module models.utils
examples/megatron/models/utils.py:298: [C0303(trailing-whitespace), ] Trailing whitespace

Done
LGTM. BTW, could you please show an example (e.g. environment variables, yaml configurations, or training scripts) of enabling wandb in ChatLearn?
It is very simple, because this is already supported in Megatron-LM.
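For example, a hedged sketch using Megatron-LM's wandb flags (flag names should be checked against your Megatron version; script, paths, and names below are placeholders):

```bash
# wandb logging turns on when a project is set via the Megatron args.
python pretrain_gpt.py \
    ...other megatron args... \
    --wandb-project chatlearn-demo \
    --wandb-exp-name ppo-run-1 \
    --wandb-save-dir /path/to/logs/wandb
```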
cc @adoda @charles9304 PTAL
Thanks for your contribution, that helps a lot. BTW, could you please share the necessary PR change in Megatron-LM you mentioned?
Very simple. It is in the wandb init call. Add any needed wandb arguments like this:

```python
wandb_kwargs = {
    'dir': save_dir,
    'name': args.wandb_exp_name,
    'project': args.wandb_project,
    'config': vars(args),
    # add any other arg if you need
    'id': args.wandb_run_id,  # in newer wandb versions, it is called resume_from
    'resume': 'allow',
}
```

This works if you pass a predefined run id to every process.
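For concreteness, a minimal sketch of where those kwargs would plug in on the Megatron side (the function name and `args.wandb_run_id` are assumptions based on this thread, not the actual Megatron-LM source):

```python
import os
import wandb

def init_wandb_writer(args, save_dir):
    """Hypothetical patch point for the wandb init call."""
    wandb_kwargs = {
        'dir': save_dir,
        'name': args.wandb_exp_name,
        'project': args.wandb_project,
        'config': vars(args),
        'id': args.wandb_run_id,  # assumed new arg carrying a shared run id
        'resume': 'allow',        # with a shared id, all ranks join one run
    }
    os.makedirs(save_dir, exist_ok=True)
    return wandb.init(**wandb_kwargs)
```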
Very good work! Could you make it a configurable option, allowing the choice between TensorBoard and wandb?
Did some logging tests. There are some caveats.
Perhaps we should init another wandb run for these actors to log the stats, rather than use get_wandb_writer from Megatron? I have found that wandb offers multiprocess logging (https://docs.wandb.ai/guides/runs/grouping/) to tackle the issue. In this case, wandb advises us to use a different id for each actor and group them into a single task (https://colab.research.google.com/github/wandb/examples/blob/master/colabs/Grouping_Feature.ipynb).
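A minimal sketch of that grouping approach (the project, group, and run names are placeholders):

```python
import wandb

# Each ChatLearn actor opens its own run; a shared `group` collects them
# into one experiment in the wandb UI, per the grouping docs above.
def init_actor_run(experiment_name, actor_rank, project="chatlearn"):
    return wandb.init(
        project=project,               # placeholder project name
        group=experiment_name,         # same group for every actor in the job
        name=f"actor-{actor_rank}",    # distinct name per actor
        job_type="actor",              # optional: sub-groups runs by role
        id=wandb.util.generate_id(),   # distinct id per actor, as wandb advises
    )
```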
Yeah, we can, but the design of the args will be more complicated, i.e. wandb (and tensorboard) args would need to be defined globally (under the runtime section maybe?) and then transferred to model_args if the Module is a Megatron one.
Yeah. We are working on adding a time summary to wandb. It requires starting a wandb instance from the engine. Perhaps we could define another yaml called log.yaml, and include it from the runtime yaml and the base yaml? What do you think of this solution?
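A rough sketch of what that could look like (the file layout and keys below are hypothetical illustrations, not ChatLearn's actual schema):

```yaml
# log.yaml (hypothetical): logging settings shared by the runtime yaml
# and the base yaml that would include it.
log:
  backend: wandb              # or: tensorboard
  wandb_project: chatlearn    # placeholder project name
  wandb_exp_name: ppo-run-1   # placeholder experiment name
```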
Enable wandb logging following the same design as tensorboard.

N.B.: Because ray multinode training doesn't share global variables (unless explicitly passing them through a ray.put channel), the wandb writer (initialized as a global variable in the Megatron code) would be different for each ray process. As such, logging with wandb will create N experiments (all empty except for the one where all logging happens, thanks to the is_last_rank check). To avoid this, we can pass a predefined wandb run_id with a new Megatron arg in the wandb init call. This requires a tiny PR in Megatron-LM.

If you have an idea of a better way to solve this, I am open to discussion.
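For reference, a minimal sketch of the driver-side half of this workaround (`--wandb-run-id` is the hypothetical new Megatron arg; `wandb.util.generate_id` is a real wandb API):

```python
import wandb

# Generate one run id on the driver, before any ray actor starts Megatron.
shared_run_id = wandb.util.generate_id()

# Hypothetical: forward the id to every Megatron process via the new arg
# introduced by the tiny Megatron-LM PR mentioned above.
megatron_args = [
    "--wandb-project", "chatlearn",    # placeholder project name
    "--wandb-exp-name", "ppo-run-1",   # placeholder experiment name
    "--wandb-run-id", shared_run_id,   # assumed new flag from the PR
]

# Each process then calls wandb.init(id=shared_run_id, resume='allow'),
# so every rank attaches to the same experiment instead of creating N runs.
```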