load_dataset #7275

Open
santiagobp99 opened this issue Nov 4, 2024 · 0 comments


Describe the bug

I am running two operations from the Hugging Face tutorial "Fine-tune a language model". To get them to run, I have to define everything inside the mapped functions, including some library imports, because nothing defined outside the function being mapped over the dataset elements is recognized:

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=iaAJy5Hu3l_B

`- lm_datasets = tokenized_datasets.map(
      group_texts,
      batched=True,
      batch_size=batch_size,
      num_proc=4,
  )

- tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

  def tokenize_function(examples):
      model_checkpoint = 'gpt2'
      from transformers import AutoTokenizer
      tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
      return tokenizer(examples["text"])`

Steps to reproduce the bug

Currently, I handle all the imports inside the mapped function.
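
For reference, a self-contained version of the other mapped function, `group_texts`, could look roughly like the sketch below. It is modeled on the linked notebook rather than copied from my session. The `block_size` default of 128 and the assumption that the batch contains an `input_ids` column are illustrative, not values taken from this report.

```python
def group_texts(examples, block_size=128):
    # block_size is a keyword default (assumed value) so the function does not
    # depend on any name defined outside of it.
    # Concatenate the per-example token lists of every column into one list.
    concatenated = {
        key: [token for sequence in sequences for token in sequence]
        for key, sequences in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the trailing tokens that do not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Split every column into consecutive chunks of block_size tokens.
    result = {
        key: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result
```

Written this way, the function can be passed to `tokenized_datasets.map(group_texts, batched=True, ...)` as in the call above without relying on a global `block_size`.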

Expected behavior

The code should work as expected in the notebook, but currently it does not.

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=iaAJy5Hu3l_B

Environment info

print(transformers.__version__)

4.46.1
