Ume datamodule - allow downloads of HF datasets #50

Merged: 12 commits, Mar 21, 2025

karinazad
Collaborator

No description provided.

sample: false # if false, uses RoundRobinConcatIterableDataset, else MultiplexedSamplingDataset
stopping_condition: min # min or max, used only if sample is false
weights: null # used only if sample is true, if null and sample is true, samples with weights based on dataset sizes

paths:
-  root_dir: ./runs
+  root_dir: /data/lobster/ume
Collaborator

Update this to use an environment variable.
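One way to act on this suggestion is to read the root directory from the environment and fall back to the relative default. This is a minimal sketch, not the change actually made in the PR: the variable name `UME_ROOT_DIR` and the helper `resolve_root_dir` are hypothetical.

```python
import os
from pathlib import Path


def resolve_root_dir(env_var: str = "UME_ROOT_DIR", default: str = "./runs") -> Path:
    """Return the run directory from the environment, or a default.

    `UME_ROOT_DIR` is a hypothetical variable name chosen for this sketch;
    the repository may use a different one.
    """
    return Path(os.environ.get(env_var, default))


if __name__ == "__main__":
    # With the variable unset this prints the relative default path.
    print(resolve_root_dir())
```

If the config is loaded with OmegaConf or Hydra (the `${model.max_length}` interpolation suggests this, but it is an assumption), the built-in `oc.env` resolver offers the same behavior directly in YAML, in the form `${oc.env:VAR_NAME,default}`.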

datasets: ["M320M", "Calm", "AMPLIFY", "Pinder"]
batch_size: 128
tokenizer_max_length: ${model.max_length}
pin_memory: true
shuffle_buffer_size: 1000
num_workers: 4
seed: 0
sample: false # if false, uses RoundRobinConcatIterableDataset, else MultiplexedSamplingDataset
Collaborator Author

@karinazad, Mar 21, 2025

Removed the comments, since the docstrings are in the Ume datamodule.
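For readers without the docstrings at hand, the sketch below illustrates one plausible reading of the `sample`, `stopping_condition`, and `weights` options described in the config comments. It is not the implementation of `RoundRobinConcatIterableDataset` or `MultiplexedSamplingDataset`; the functions `round_robin` and `weighted_mix` are hypothetical stand-ins written with the standard library only, and the exact exhaustion behavior of the real classes may differ.

```python
import random


def round_robin(datasets, stopping_condition="min"):
    """Yield one item from each dataset in turn (the `sample: false` case).

    "min" stops as soon as any dataset is exhausted; "max" keeps cycling
    over the remaining datasets until all are exhausted.
    """
    if stopping_condition not in ("min", "max"):
        raise ValueError("stopping_condition must be 'min' or 'max'")
    iterators = [iter(d) for d in datasets]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                if stopping_condition == "min":
                    return
                continue  # "max": drop the exhausted dataset
            yield item
            still_active.append(it)
        iterators = still_active


def weighted_mix(datasets, weights=None, seed=0):
    """Draw each item from a randomly chosen dataset (the `sample: true` case).

    If `weights` is None, weights are proportional to dataset sizes, so the
    datasets must support len() in this sketch. Stops at the first
    exhausted dataset that gets selected.
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [len(d) for d in datasets]
    iterators = [iter(d) for d in datasets]
    while True:
        (idx,) = rng.choices(range(len(iterators)), weights=weights)
        try:
            yield next(iterators[idx])
        except StopIteration:
            return


if __name__ == "__main__":
    a, b = ["a1", "a2", "a3"], ["b1"]
    print(list(round_robin([a, b], "min")))  # ['a1', 'b1', 'a2']
    print(list(round_robin([a, b], "max")))  # ['a1', 'b1', 'a2', 'a3']
```

Round-robin interleaving gives every dataset equal representation regardless of size, while weighted sampling lets larger (or explicitly up-weighted) datasets contribute proportionally more items.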

@ncfrey merged commit a69b771 into main on Mar 21, 2025
5 checks passed
@ncfrey deleted the ume-runs branch on March 21, 2025, 13:46
2 participants