Feature proposal: Stacking, potentially heterogeneous, datasets #7279

Open
wants to merge 5 commits into
base: main

@TimCares TimCares commented Nov 5, 2024

Introduction

Hello there,
I noticed that there are two ways to combine multiple datasets: either through datasets.concatenate_datasets or through datasets.interleave_datasets. However, to my knowledge (please correct me if I am wrong), both approaches require the combined datasets to have the same features.

I think it would be a great idea to add support for combining multiple datasets that might not follow the same schema (i.e. have different features), for example an image dataset and a text dataset. That is why I propose a third function in the datasets.combine module called stack_datasets, which can be used to combine a list of datasets with (potentially) different features. This would look as follows:

>>> from datasets import stack_datasets
>>> image_dataset = ...
>>> next(iter(image_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=555x416 at 0x313E79CD0> }
>>> text_dataset = ...
>>> next(iter(text_dataset))
{'text': "This is a test."}
>>> stacked = stack_datasets(datasets={'i_ds': image_dataset, 't_ds': text_dataset}, stopping_strategy='all_exhausted')
>>> next(iter(stacked))
{
'i_ds': {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=555x416 at 0x313E79CD0> },
't_ds': {'text': "This is a test."}
}

Motivation

Two things motivate this proposal:

A: PyTorch offers similar functionality with torch.utils.data.StackDataset.

B: In settings where one would like to, for example, train a vision-language model using an image-text dataset, an image dataset, and a text dataset, this functionality would offer a clean and intuitive way to create multimodal datasets. I am aware that this is also feasible without my proposed function, but I believe it is a nice approach that aligns with existing functionality and is provided directly within the datasets package.

API

stack_datasets has two arguments: datasets and stopping_strategy.

datasets is a dictionary of type either Dict[str, Dataset] or Dict[str, IterableDataset]; a mixture of the two is not allowed. It maps the names of the datasets (the keys) to the datasets themselves (the values) that should be stacked. Each item returned is a dictionary with one key-value pair per dataset. The keys are the dataset names as provided in the datasets argument, and the values are the respective examples from the datasets.

stopping_strategy is the same as for interleave_datasets. If it is first_exhausted, we stop as soon as the smallest dataset runs out of examples. If it is all_exhausted, we stop once every dataset has run out of examples at least once. With all_exhausted, examples from the smaller datasets may therefore be visited multiple times.
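To make the two stopping strategies concrete, here is a minimal pure-Python sketch of the semantics described above. It is not the code from this pull request: stack_sequences is a hypothetical helper that works on plain lists rather than Dataset or IterableDataset objects, and it assumes that all_exhausted restarts a smaller dataset from its beginning whenever it runs out.

```python
from typing import Any, Dict, Iterator, Sequence


def stack_sequences(
    datasets: Dict[str, Sequence[Any]],
    stopping_strategy: str = "first_exhausted",
) -> Iterator[Dict[str, Any]]:
    """Hypothetical illustration of the proposed stacking semantics."""
    if not datasets:
        raise ValueError("Expected at least one dataset.")
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError(f"Unknown stopping_strategy: {stopping_strategy!r}")

    lengths = {name: len(ds) for name, ds in datasets.items()}
    if min(lengths.values()) == 0:
        raise ValueError("Cannot stack an empty dataset.")

    # first_exhausted: stop with the shortest dataset.
    # all_exhausted: stop with the longest one, cycling the shorter ones.
    if stopping_strategy == "first_exhausted":
        num_items = min(lengths.values())
    else:
        num_items = max(lengths.values())

    for i in range(num_items):
        # One key per dataset name; the modulo restarts shorter datasets.
        yield {name: ds[i % lengths[name]] for name, ds in datasets.items()}


images = [{"image": "img0"}, {"image": "img1"}, {"image": "img2"}]
texts = [{"text": "a"}, {"text": "b"}]

first = list(stack_sequences({"i_ds": images, "t_ds": texts}, "first_exhausted"))
everything = list(stack_sequences({"i_ds": images, "t_ds": texts}, "all_exhausted"))
```

With these inputs, first holds two items and everything holds three; the third item pairs the last image with the first text again, which is the repeated visiting mentioned for all_exhausted.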

Docs

I saw that several documentation pages and guides on the Hugging Face website introduce concatenate_datasets and interleave_datasets. If this pull request is merged, I would be willing to add the new functionality at the appropriate points in the documentation (if desired).

Tests

I also added some tests to ensure correctness. Some of the tests I wrote in tests/test_iterable_dataset.py run for both Dataset and IterableDataset. Tests for Dataset technically do not belong in this script, but I found this a nice way to cover more cases with mostly the same code.

Additional information

I tried to keep the code similar to that of concatenate_datasets and interleave_datasets.
I’m open to feedback and willing to make adjustments based on your suggestions, so feel free to give me your take. :)
