
added connector folder and HF file #313

Open · wants to merge 14 commits into base: main
Empty file added: connectors/__init__.py
77 changes: 77 additions & 0 deletions connectors/huggingface_connecter.py
@@ -0,0 +1,77 @@
from datasets import load_dataset
from nomic import AtlasDataset
from ulid import ULID
import numpy as np

# Gets data from an HF dataset
def get_hfdata(dataset_identifier):
    try:
        # Loads dataset without specifying a config
        dataset = load_dataset(dataset_identifier)
    except ValueError as e:
        # Handles the "multiple configs" error by parsing the message
        error_message = str(e)
        if "Please pick one among the available configs" in error_message:
            try:
                available_configs_start = error_message.index("['") + 2
                available_configs_end = error_message.index("']")
                available_configs = error_message[available_configs_start:available_configs_end].split("', '")
                dataset = load_dataset(dataset_identifier, available_configs[0], trust_remote_code=True)
            except ValueError:
                raise ValueError("Failed to get available configurations")
        else:
            raise e


    # Processes dataset entries
    data = []
    for split in dataset.keys():
Contributor:

same question here about large datasets - are we sure this will not break?

Author:

Could I solve this with batch processing and by passing streaming=True through load_dataset? Aaron mentioned that, so it seems like that could prevent the issue.
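The batching half of that idea can be sketched without any HF dependency; the pairing with `streaming=True` in the trailing comment is hypothetical and not part of this PR:

```python
def batched(iterable, batch_size=1000):
    # Accumulates items into fixed-size lists so downstream calls
    # (e.g. add_data) never see the whole dataset at once.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch

# Hypothetical pairing with a streamed split:
#   dataset = load_dataset(dataset_identifier, split="train", streaming=True)
#   for batch in batched(dataset, 1000):
#       atlas_dataset.add_data(data=batch)
```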

        for i, example in enumerate(dataset[split]):
            # Creates a unique ULID for each example
            ulid = ULID()
            example['id'] = str(ulid)
            data.append(example)

    return data

# Creates an AtlasDataset from an HF dataset
def hf_atlasdataset(dataset_identifier):
    data = get_hfdata(dataset_identifier.strip())

    map_name = dataset_identifier.replace('/', '_')
    if not data:
        raise ValueError("No data was found for the provided dataset.")

    dataset = AtlasDataset(
        map_name,
        unique_id_field="id",
    )


    # Convert all booleans and lists to strings
    for entry in data:
Contributor:

two potential issues here:

1. This seems like it would be extremely slow (and possibly crash) for large huggingface datasets.
2. I'm a bit worried about assuming this is how people would want to handle these fields, but I guess they can edit it themselves if they want it done differently...

        for key, value in entry.items():
            if isinstance(value, bool):
                entry[key] = str(value)
            elif isinstance(value, list):
                entry[key] = ' '.join(map(str, value))
            elif isinstance(value, np.ndarray):
                entry[key] = ' '.join(map(str, value.flatten()))
            elif hasattr(value, 'tolist'):
                entry[key] = ' '.join(map(str, value.tolist()))
            else:
                entry[key] = str(value)


    dataset.add_data(data=data)
Contributor:

are we making an index here?

Author:

Not quite; that part mainly just converts anything that isn't a string, like booleans or lists, into strings, because it will error out otherwise. I didn't really like that it is just a bunch of conditional statements, but I couldn't find a better way to resolve it.
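One way the conditional chain could fold into a single helper (a sketch; `stringify` is a hypothetical name, and duck-typing on `flatten`/`tolist` stands in for the numpy checks so the example needs no imports):

```python
def stringify(value):
    # Booleans and plain scalars fall through to str(); array-likes
    # and lists are flattened to space-separated strings.
    if isinstance(value, bool):
        return str(value)
    if hasattr(value, "flatten"):   # numpy ndarrays
        return " ".join(map(str, value.flatten()))
    if hasattr(value, "tolist"):    # other array-like objects
        return " ".join(map(str, value.tolist()))
    if isinstance(value, list):
        return " ".join(map(str, value))
    return str(value)

entry = {"flag": True, "tags": ["a", "b"], "n": 3}
cleaned = {k: stringify(v) for k, v in entry.items()}
```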



    return dataset





24 changes: 24 additions & 0 deletions examples/HF_example_usage.py
@@ -0,0 +1,24 @@

from huggingface_connecter import hf_atlasdataset
import logging

if __name__ == "__main__":
    dataset_identifier = input("Enter Hugging Face dataset identifier: ").strip()
Member:

how about instead of making this an interactive script, use argparse: https://docs.python.org/3.10/library/argparse.html?highlight=argparse#module-argparse

that way it's easier to handle optional args like split and limit
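A minimal sketch of that suggestion (the `--split` and `--limit` flags are hypothetical and not wired into `hf_atlasdataset` in this PR):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Create an AtlasDataset from a Hugging Face dataset."
    )
    parser.add_argument("dataset_identifier",
                        help="e.g. 'username/dataset_name'")
    parser.add_argument("--split", default=None,
                        help="optional split to load (hypothetical)")
    parser.add_argument("--limit", type=int, default=None,
                        help="optional row cap (hypothetical)")
    return parser.parse_args(argv)
```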



    try:
        atlas_dataset = hf_atlasdataset(dataset_identifier)
        logging.info(f"AtlasDataset has been created for '{dataset_identifier}'")
    except ValueError as e:
        logging.error(f"Error creating AtlasDataset: {e}")