feat(wandb): enable logging #225

Open · wants to merge 2 commits into main

Conversation

@slimfrkha (Contributor)

Enable Wandb logging, following the same design as TensorBoard.

N.B.: Because Ray multinode training doesn't share global variables (unless they are passed explicitly via ray.put), the wandb writer (initialized as a global variable in the Megatron code) is different for each Ray process. As a result, logging with wandb creates N experiments, all empty except the one where all logging happens thanks to the is_last_rank check. To avoid this, we can pass a predefined Wandb run_id to the wandb init call via a new Megatron arg. This requires a tiny PR in Megatron-LM.
If you have an idea of a better way to solve this, I am open to discussion.
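
A minimal sketch of the shared-run-id idea, assuming the driver generates one id (e.g. with wandb.util.generate_id()) and hands it to every Ray worker; the project name, broadcast mechanism, and rank guard below are illustrative placeholders, not the actual ChatLearn/Megatron code:

    # Sketch: all workers attach to the same wandb run via a shared id,
    # instead of each Ray process creating its own (mostly empty) experiment.
    import ray
    import wandb

    ray.init()

    # The driver generates a single run id up front and passes it to the workers.
    shared_run_id = wandb.util.generate_id()

    @ray.remote
    def train_worker(run_id: str, rank: int):
        # Same id + resume="allow" means every process joins the same run.
        run = wandb.init(
            project="chatlearn-demo",  # placeholder project name
            id=run_id,
            resume="allow",
        )
        if rank == 0:  # mimic the is_last_rank guard: only one rank logs
            run.log({"dummy/metric": float(rank)})
        run.finish()

    ray.get([train_worker.remote(shared_run_id, r) for r in range(4)])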

@haolin-nju (Collaborator)

Please fix errors following pylint instructions:

************* Module models.train_helper
examples/megatron/models/train_helper.py:58: [C0303(trailing-whitespace), ] Trailing whitespace
************* Module models.utils
examples/megatron/models/utils.py:298: [C0303(trailing-whitespace), ] Trailing whitespace

@slimfrkha (Contributor, Author)

Done

@haolin-nju (Collaborator)

LGTM. BTW, could you please show an example usage (e.g. environment variables, yaml configurations, or training scripts) for enabling wandb in ChatLearn?

@slimfrkha (Contributor, Author)

It is very simple, because this is already supported in Megatron-LM (a short sketch follows the steps below):

  1. Set the WANDB_ENTITY and WANDB_API_KEY environment variables.
  2. Set wandb_project and wandb_exp_name in base.yaml (they are Megatron-LM args).
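
A rough end-to-end sketch of what those two steps amount to; the entity, API key, project, and experiment names are placeholders, and the yaml keys map onto the wandb.init arguments shown in the snippet later in this thread:

    # Sketch: wandb reads WANDB_ENTITY / WANDB_API_KEY from the environment,
    # and wandb_project / wandb_exp_name end up as wandb.init(project=..., name=...).
    import os
    import wandb

    os.environ["WANDB_ENTITY"] = "my-team"          # placeholder
    os.environ["WANDB_API_KEY"] = "<your-api-key>"  # placeholder

    run = wandb.init(
        project="chatlearn-demo",  # wandb_project in base.yaml
        name="rlhf-exp1",          # wandb_exp_name in base.yaml
    )
    run.log({"train/loss": 0.0})
    run.finish()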

@haolin-nju (Collaborator)

cc @adoda @charles9304 PTAL

@charles9304 (Collaborator)

Thanks for your contribution, that helps a lot.

BTW, could you please share the necessary PR code in Megatron-LM you mentioned?

@slimfrkha (Contributor, Author)

Very simple: it is in the wandb init call in Megatron-LM. Add any wandb argument you need, like this:

        wandb_kwargs = {
            'dir': save_dir,
            'name': args.wandb_exp_name,
            'project': args.wandb_project,
            'config': vars(args),
            # add any other arg if you need it
            'id': args.wandb_run_id,  # in newer wandb versions, it is called resume_from
            'resume': 'allow',
        }

If you pass wandb_run_id (a randomly generated run id) in base.yaml, the above-mentioned problem will disappear.
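
If it helps, a tiny sketch for producing such an id to paste into base.yaml (the yaml key name follows the wandb_run_id arg proposed above):

    # Sketch: generate a random wandb run id once, then set it as wandb_run_id in base.yaml.
    import wandb
    print(wandb.util.generate_id())  # prints something like "1a2b3c4d"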

@adoda (Collaborator) · Feb 21, 2025

Very good work! Could you make it a configurable option, allowing the choice between TensorBoard and wandb?

@slimfrkha (Contributor, Author)

Did some logging testing. There are some caveats: examples/megatron/models/reward_math.py and examples/megatron/models/train_helper.py used SummaryWriter for a reason. Because those two files don't have the Megatron global variables set, get_tensorboard_writer would yield None (and so would get_wandb_writer).
What do you think we should do? I tried to bypass this by injecting the Megatron global vars into those files, but no success so far.

@haolin-nju (Collaborator)

Perhaps we should init a separate wandb run for these actors to log the stats, rather than use get_wandb_writer from Megatron? I have found that wandb offers multiprocess logging (https://docs.wandb.ai/guides/runs/grouping/) to tackle the issue. In this case, wandb advises us to use a different id for each actor and group them into a single task (https://colab.research.google.com/github/wandb/examples/blob/master/colabs/Grouping_Feature.ipynb).
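
A minimal sketch of that grouping pattern, assuming each actor opens its own run but shares a group name; the project, group, and actor names are placeholders:

    # Sketch of wandb grouping: each actor gets its own run (own id),
    # while a common `group` folds them into one experiment in the UI.
    import wandb

    def init_actor_run(actor_name: str, group_name: str):
        return wandb.init(
            project="chatlearn-demo",  # placeholder project
            group=group_name,          # shared across all actors of one task
            name=actor_name,           # e.g. "reward_math", "train_helper"
            job_type=actor_name,
        )

    run = init_actor_run("reward_math", group_name="exp-2025-02-21")
    run.log({"reward/score": 1.0})
    run.finish()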

@slimfrkha (Contributor, Author)

Yeah, we can, but the design of the args will be more complicated, i.e. the wandb (and TensorBoard) args need to be defined globally (under the runtime section maybe?) and then transferred to model_args if the Module is a Megatron one.

@haolin-nju (Collaborator)

Yeah. We are working on adding a time summary to wandb. It requires starting a wandb instance from the engine. Perhaps we could define another yaml called log.yaml and include it from both the runtime yaml and base yaml? What do you think of this solution?
