feat(wandb): enable logging #225

Open · wants to merge 2 commits into main

Conversation

@slimfrkha (Contributor)

Enable Wandb logging, following the same design as TensorBoard.

N.B.: Because Ray multinode training doesn't share global variables (unless they are passed explicitly via ray.put), the wandb writer (initialized as a global variable in the Megatron code) is different for each Ray process. As a result, logging with wandb creates N experiments, all empty except the one where all logging happens thanks to the is_last_rank check. To avoid this, we can pass a predefined Wandb run_id to the wandb init call via a new Megatron arg. This requires a tiny PR in Megatron-LM.
If you have an idea of a better way to solve this, I am open to discussion.
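
A minimal sketch of the shared-run-id idea, assuming the driver generates one id (e.g. with wandb.util.generate_id()) and hands it to every Ray worker; the project name, broadcast mechanism, and rank guard below are illustrative placeholders, not the actual ChatLearn/Megatron code:

    # Sketch: all workers attach to the same wandb run via a shared id,
    # instead of each Ray process creating its own (mostly empty) experiment.
    import ray
    import wandb

    ray.init()

    # The driver generates a single run id up front and passes it to the workers.
    shared_run_id = wandb.util.generate_id()

    @ray.remote
    def train_worker(run_id: str, rank: int):
        # Same id + resume="allow" means every process joins the same run.
        run = wandb.init(
            project="chatlearn-demo",  # placeholder project name
            id=run_id,
            resume="allow",
        )
        if rank == 0:  # mimic the is_last_rank guard: only one rank logs
            run.log({"dummy/metric": float(rank)})
        run.finish()

    ray.get([train_worker.remote(shared_run_id, r) for r in range(4)])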

@haolin-nju (Collaborator)

Please fix errors following pylint instructions:

************* Module models.train_helper
examples/megatron/models/train_helper.py:58: [C0303(trailing-whitespace), ] Trailing whitespace
************* Module models.utils
examples/megatron/models/utils.py:298: [C0303(trailing-whitespace), ] Trailing whitespace

@slimfrkha (Contributor, Author)

Done

@haolin-nju (Collaborator)

LGTM. BTW, could you please show an example usage (e.g. environment variables, yaml configurations, or training scripts) for enabling wandb in ChatLearn?

@slimfrkha (Contributor, Author)

It is very simple, because this is already supported in Megatron-LM (a short sketch follows the steps below):

  1. Set the WANDB_ENTITY and WANDB_API_KEY environment variables.
  2. Set wandb_project and wandb_exp_name in base.yaml (they are Megatron-LM args).
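
A rough end-to-end sketch of what those two steps amount to; the entity, API key, project, and experiment names are placeholders, and the yaml keys map onto the wandb.init arguments shown in the snippet later in this thread:

    # Sketch: wandb reads WANDB_ENTITY / WANDB_API_KEY from the environment,
    # and wandb_project / wandb_exp_name end up as wandb.init(project=..., name=...).
    import os
    import wandb

    os.environ["WANDB_ENTITY"] = "my-team"          # placeholder
    os.environ["WANDB_API_KEY"] = "<your-api-key>"  # placeholder

    run = wandb.init(
        project="chatlearn-demo",  # wandb_project in base.yaml
        name="rlhf-exp1",          # wandb_exp_name in base.yaml
    )
    run.log({"train/loss": 0.0})
    run.finish()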

@haolin-nju (Collaborator)

cc @adoda @charles9304 PTAL

@charles9304 (Collaborator)

Thanks for your contribution, that helps a lot.

BTW, could you please share the necessary PR code in Megatron-LM you mentioned?

@slimfrkha (Contributor, Author)

Very simple: it is in the wandb init call in Megatron-LM. Add any wandb argument you need, like this:

        wandb_kwargs = {
            'dir': save_dir,
            'name': args.wandb_exp_name,
            'project': args.wandb_project,
            'config': vars(args),
            # add any other arg if you need it
            'id': args.wandb_run_id,  # in newer wandb versions, it is called resume_from
            'resume': 'allow',
        }

If you pass wandb_run_id (a randomly generated run id) in base.yaml, the above-mentioned problem will disappear.
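
If it helps, a tiny sketch for producing such an id to paste into base.yaml (the yaml key name follows the wandb_run_id arg proposed above):

    # Sketch: generate a random wandb run id once, then set it as wandb_run_id in base.yaml.
    import wandb
    print(wandb.util.generate_id())  # prints something like "1a2b3c4d"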

@adoda (Collaborator) · Feb 21, 2025

Very good work! Could you make it a configurable option, allowing the choice between TensorBoard and wandb?

@slimfrkha (Contributor, Author)

Did some logging testing. There are some caveats: examples/megatron/models/reward_math.py and examples/megatron/models/train_helper.py used SummaryWriter for a reason. Because those two files don't have the Megatron global variables set, get_tensorboard_writer would yield None (and so would get_wandb_writer).
What do you think we should do? I tried to bypass this by injecting the Megatron global vars into those files, but no success so far.

@haolin-nju (Collaborator)

Perhaps we should init a separate wandb run for these actors to log the stats, rather than use get_wandb_writer from Megatron? I have found that wandb offers multiprocess logging (https://docs.wandb.ai/guides/runs/grouping/) to tackle the issue. In this case, wandb advises us to use a different id for each actor and group them into a single task (https://colab.research.google.com/github/wandb/examples/blob/master/colabs/Grouping_Feature.ipynb).
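
A minimal sketch of that grouping pattern, assuming each actor opens its own run but shares a group name; the project, group, and actor names are placeholders:

    # Sketch of wandb grouping: each actor gets its own run (own id),
    # while a common `group` folds them into one experiment in the UI.
    import wandb

    def init_actor_run(actor_name: str, group_name: str):
        return wandb.init(
            project="chatlearn-demo",  # placeholder project
            group=group_name,          # shared across all actors of one task
            name=actor_name,           # e.g. "reward_math", "train_helper"
            job_type=actor_name,
        )

    run = init_actor_run("reward_math", group_name="exp-2025-02-21")
    run.log({"reward/score": 1.0})
    run.finish()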

@slimfrkha (Contributor, Author)

Yeah, we can, but the design of the args will be more complicated, i.e. the wandb (and TensorBoard) args need to be defined globally (under the runtime section maybe?) and then transferred to model_args if the Module is a Megatron one.

@haolin-nju (Collaborator)

Yeah. We are working on adding a time summary to wandb. It requires starting a wandb instance from the engine. Perhaps we could define another yaml called log.yaml and include it from both the runtime yaml and base yaml? What do you think of this solution?
