This repository contains scripts for running the experiments illustrated in the paper:
"Under the Surface: Tracking the Artifactuality of LLM-Generated Data"
Debarati Das†¶, Karin de Langis¶, Anna Martin-Boyle¶, Jaehyung Kim¶, Minhwa Lee¶, Zae Myung Kim¶, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Sachin Parkar, Ryan Koo, Jong Inn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang
Minnesota NLP, University of Minnesota Twin Cities
† Project Lead, ¶ Core Contribution
The project website is available at this link.
The paper is available on arXiv.
The datasets used in the paper can be downloaded from the HuggingFace Hub.
This research project collects diverse text data from large language models (LLMs), encompassing both structured "task labels" and open-ended "free-form text." This extensive dataset allows for a holistic examination of LLM outputs, offering insights into their behavior under varying degrees of structure and freedom. The research underscores the importance of responsible and ethical practices in creating and using LLM-generated data, advocating collaborative efforts to address biases, enhance diversity, and deepen the understanding of the complex human opinions reflected in LLM outputs, toward ethical and sustainable development.
The structure of the repository closely follows the stress-testing methods applied to the five data types: Task Labels, Preferences, Instructions, Simulations, and Free-Form Text.
More specifically, the stress-testing experiments are categorized as either "first-order" or "second-order" experiments. In short, the first-order experiments investigate the data "as-is": for example, examining distributional differences and correlation patterns between human- and LLM-generated data, validating and analyzing the data through manual inspection, and counting how often labels flip between the original human text and the resulting machine text. The second-order experiments involve fine-tuning LLMs on the machine-generated data and investigating whether existing artifacts or biases are amplified.
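As a rough illustration of the label-flip check described above, the sketch below computes the fraction of items whose label differs between the human-written original and its machine-generated counterpart. The function name and label values are hypothetical, not taken from this repository's code:

```python
def label_flip_rate(human_labels, llm_labels):
    """Fraction of aligned items whose label changed between human and LLM text.

    Assumes the two lists are parallel: index i in both refers to the same item.
    """
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must be aligned item-for-item")
    flips = sum(h != m for h, m in zip(human_labels, llm_labels))
    return flips / len(human_labels)

# Toy example with illustrative sentiment labels:
human = ["pos", "neg", "pos", "neu"]
llm = ["pos", "pos", "pos", "neg"]
print(label_flip_rate(human, llm))  # 2 of 4 labels flipped -> 0.5
```

A high flip rate between the original and regenerated text would signal that the LLM rewriting step is shifting the underlying label distribution.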
![image](https://private-user-images.githubusercontent.com/3746478/299826993-ae37ef70-78fe-4142-8cc5-8eb02a2c8efd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg4MzU1MzIsIm5iZiI6MTczODgzNTIzMiwicGF0aCI6Ii8zNzQ2NDc4LzI5OTgyNjk5My1hZTM3ZWY3MC03OGZlLTQxNDItOGNjNS04ZWIwMmEyYzhlZmQucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIwNiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMDZUMDk0NzExWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MDNmNzVjMjFlZGE0YzUzMDFlNjc0YjE4MGJjZjc3MzMzZTE3ODkxODg0ZjU3MDk2MmI5NzE3NzdlYTE3YWY2YSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.XFQItMrgicXhLGRK-GOAQhJfhM33Pk__xDufdAWaGuE)
The code for the corresponding first-order and second-order experiments is placed under two directories of the same names, respectively, except for the Simulations data type, for which we did not perform a second-order experiment. Under each data type's directory, a corresponding README.md file provides further details.
If you find our work helpful, please cite the paper as follows:

@misc{das2024surface,
title={Under the Surface: Tracking the Artifactuality of LLM-Generated Data},
author={Debarati Das and Karin De Langis and Anna Martin and Jaehyung Kim and Minhwa Lee and Zae Myung Kim and Shirley Hayati and Risako Owan and Bin Hu and Ritik Parkar and Ryan Koo and Jonginn Park and Aahan Tyagi and Libby Ferland and Sanjali Roy and Vincent Liu and Dongyeop Kang},
year={2024},
eprint={2401.14698},
archivePrefix={arXiv},
primaryClass={cs.CL}
}