A project that extends the self-instruct method to the automatic generation of long-context instruction-tuning datasets, based on long-context LLMs, Retrieval-Augmented Generation (RAG), and LLMs-as-Agents.
- self-instruct:
- Retrieval-Augmented Generation (RAG):
- LLMs-as-Agents:
- install the pip dependencies: `pip install -r requirements.txt`
- download `punkt` from nltk:
  - method 1: download through the API: `import nltk; nltk.download('punkt')`
  - method 2: if the API fails, go to the GitHub repo and follow the steps below:
    - step 1: download the whole `packages` directory into your conda env path, e.g. `/home/user/anaconda3/envs/myenv/`, and rename it to `nltk_data`
    - step 2: unzip the zip files inside `nltk_data`, especially those under `tokenizers/` and `taggers/`. For convenience, we also provide a function that does this automatically:

          from src.utils import unzip_nltk_data
          nltk_data_dir = "/home/user/anaconda3/envs/myenv/"
          unzip_nltk_data(nltk_data_dir, remove=True)
- install the poppler tools to make `pdf2image` work (assuming your OS is Linux; if not, check the pdf2image installation guide): `sudo apt-get install poppler-utils`
- follow the guide here and install the `LibreOffice` tool to make `unstructured.partition.doc` work
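
Once the steps above are done, a quick check can confirm that each prerequisite is visible from Python. The snippet below is only a minimal sketch and is not part of this repo: `punkt_available` and `check_prerequisites` are hypothetical helper names, and it assumes that poppler exposes the `pdftoppm` binary on `PATH` and that LibreOffice exposes `soffice` (or `libreoffice`), which may differ on your system.

```python
import importlib.util
import shutil


def punkt_available():
    """Return True if nltk is importable and can locate the punkt tokenizer."""
    if importlib.util.find_spec("nltk") is None:
        return False
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find("tokenizers/punkt")
        return True
    except LookupError:
        return False


def check_prerequisites(which=shutil.which, has_punkt=punkt_available):
    """Map each prerequisite from the list above to True (found) or False."""
    return {
        "nltk punkt": has_punkt(),
        # Assumption: pdf2image relies on poppler's pdftoppm executable.
        "poppler (pdftoppm)": which("pdftoppm") is not None,
        # Assumption: LibreOffice is reachable as soffice or libreoffice.
        "LibreOffice (soffice)": any(
            which(name) is not None for name in ("soffice", "libreoffice")
        ),
    }
```

Calling `check_prerequisites()` returns a dictionary you can print to see which tools are still missing. The `which` and `has_punkt` parameters exist only so the check can be exercised without touching the real environment.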