load_dataset #7275

santiagobp99 · 2024-11-04T03:01:44Z

Describe the bug

I am running two operations from a Hugging Face tutorial (Fine-tune a language model). I have to define everything inside the mapped functions, including some library imports, because `map` does not recognize anything defined outside the function being mapped over the dataset elements:

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=iaAJy5Hu3l_B

`def tokenize_function(examples):
    # Everything is imported and defined inside the mapped function
    from transformers import AutoTokenizer
    model_checkpoint = 'gpt2'
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=batch_size,
    num_proc=4,
)`

Steps to reproduce the bug

Currently I handle all the imports inside the mapped functions.
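For reference, here is a minimal sketch of what a fully self-contained `group_texts` could look like. It is modeled on the tutorial's function, but the `block_size` parameter and the sample batch are my own assumptions for illustration, not code from the notebook. Taking `block_size` as an argument means the function does not depend on any name defined outside it:

```python
def group_texts(examples, block_size=128):
    # Assumption: `examples` is a batch dict of lists of token lists,
    # e.g. {"input_ids": [[...], [...]], "attention_mask": [[...], [...]]}.
    # Concatenate each column into one long list of tokens.
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Hypothetical toy batch: 9 tokens in total, so block_size=4 yields 2 chunks.
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With `datasets`, a function like this could presumably be passed to `map` together with `fn_kwargs={"block_size": block_size}`, so the value is handed to each worker explicitly instead of being read from a global.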

Expected behavior

The code should work as expected in the notebook, but currently it does not.

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=iaAJy5Hu3l_B

Environment info

print(transformers.__version__)

4.46.1
