How to debug #7249

Open
ShDdu opened this issue Oct 24, 2024 · 0 comments

Describe the bug

I wanted to process my data with my own loading script, so I followed the tutorial documentation and wrote the MyDatasetConfig and MyDatasetBuilder classes (the builder implements the _info, _split_generators and _generate_examples methods). With simple data the script produces the processed output, but when I tried to add more complex processing I found that I could not debug it: breakpoints are never hit, even with the simple samples. No errors are reported, and the print statements in _info, _split_generators and _generate_examples all produce output, but execution never pauses at the breakpoints.

Steps to reproduce the bug

my_dataset.py

import json
import datasets

class MyDatasetConfig(datasets.BuilderConfig):
    def __init__(self, **kwargs):
        super(MyDatasetConfig, self).__init__(**kwargs)

class MyDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        MyDatasetConfig(
            name="default",
            version=VERSION,
            description="myDATASET"
        ),
    ]

    def _info(self):
        print("info")  # breakpoint set here
        return datasets.DatasetInfo(
            description="myDATASET",
            features=datasets.Features(
                {
                    "id": datasets.Value("int32"),
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["negative", "positive"]),
                }
            ),
            supervised_keys=("text", "label"),
        )

    def _split_generators(self, dl_manager):
        print("generate")  # breakpoint set here
        data_file = "data.json"

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": data_file}
            ),
        ]

    def _generate_examples(self, filepath):
        print("example")  # breakpoint set here
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for idx, sample in enumerate(data):
                yield idx, {
                    "id": sample["id"],
                    "text": sample["text"],
                    "label": sample["label"],
                }
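
Until the breakpoint problem is resolved, the parsing logic in _generate_examples can be exercised without going through load_dataset at all. The sketch below is only an illustration and not part of the original script: `iter_examples` is a hypothetical standalone copy of the generator body, `demo` is a hypothetical driver, and the sample records are made up to match the three fields declared in _info. Since it is an ordinary function in the file being run, a breakpoint inside it should behave like any other.

```python
import json
import os
import tempfile


def iter_examples(filepath):
    """Hypothetical standalone copy of the _generate_examples body."""
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    for idx, sample in enumerate(data):
        # No dataset builder is involved here, only plain Python.
        yield idx, {
            "id": sample["id"],
            "text": sample["text"],
            "label": sample["label"],
        }


def demo():
    # Made-up records that match the "id", "text" and "label" features.
    records = [
        {"id": 0, "text": "bad movie", "label": "negative"},
        {"id": 1, "text": "great movie", "label": "positive"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f)
        # Materialize the generator before the temporary file is removed.
        return list(iter_examples(path))


if __name__ == "__main__":
    for key, example in demo():
        print(key, example)
```

Once the logic works in isolation, the body can be pasted back into the builder method unchanged.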

main.py

import os
os.environ["TRANSFORMERS_NO_MULTIPROCESSING"] = "1"

from datasets import load_dataset

dataset = load_dataset("my_dataset.py", split="train", cache_dir=None)

print(dataset[:5])
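
One possible explanation, offered as an assumption that nothing in this report confirms: when load_dataset is given a script path, the library may import a cached copy of the script and not the file that is open in the editor, so breakpoints registered against the original path never match the code that actually runs. The standard-library sketch below reproduces only that general mechanism. The `load_copy` and `demo` helpers and the directory names are hypothetical, and the code does not call datasets at all.

```python
import importlib.util
import os
import shutil
import tempfile


def load_copy(original_path, cache_dir):
    """Copy a script into cache_dir and import the copy (hypothetical helper)."""
    copied_path = os.path.join(cache_dir, os.path.basename(original_path))
    shutil.copy(original_path, copied_path)
    spec = importlib.util.spec_from_file_location("copied_script", copied_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module, copied_path


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src_dir = os.path.join(tmp, "project")
        cache_dir = os.path.join(tmp, "cache")
        os.makedirs(src_dir)
        os.makedirs(cache_dir)
        original_path = os.path.join(src_dir, "my_script.py")
        with open(original_path, "w", encoding="utf-8") as f:
            f.write("def generate():\n    return 'example'\n")
        module, copied_path = load_copy(original_path, cache_dir)
        # The code object records the file it was compiled from: the copy.
        code_file = module.generate.__code__.co_filename
        return original_path, copied_path, code_file, module.generate()


if __name__ == "__main__":
    original, copied, code_file, _ = demo()
    print("original:", original)
    print("executed:", code_file)
```

If something similar happens with the loading script, printing `__file__` at the top of my_dataset.py would show which path is actually executed, and a `breakpoint()` call written into the script would travel with the copy.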

Expected behavior

Execution should pause at the breakpoints when the script is run in debug mode.

Environment info

PyCharm
