IPFS_Datasets_Py

endomorphosis edited this page Dec 20, 2024 · 1 revision

This module extends the Hugging Face Datasets library with a new .auto_download() method, which automatically downloads a dataset's files from the fastest available source. Retrieval is handled by the IPFS model manager from the IPFS_model_manager_Py module. The model manager can retrieve files from the local disk cache, the local IPFS node, S3, libp2p, IPFS_cluster, Hugging Face Hub, Storacha, and Filecoin Lassie, choosing among them according to the "hotness" of the files in question and the storage quotas. During retrieval, it first runs a test download of readme.md from the candidate sources, and whichever source is fastest is the one it uses to retrieve the rest of the data.
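The probe step described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the `Fetcher` type, the `pick_fastest_source` function, and the source names in the usage note are all hypothetical, and each source is assumed to be a plain callable that returns a file's bytes.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical type: a fetcher takes a file name and returns its bytes.
Fetcher = Callable[[str], bytes]


def pick_fastest_source(
    sources: Dict[str, Fetcher], probe_file: str = "readme.md"
) -> Optional[str]:
    """Time a download of probe_file from each source and return the
    name of the fastest one, or None if every source failed."""
    best_name: Optional[str] = None
    best_elapsed = float("inf")
    for name, fetch in sources.items():
        start = time.perf_counter()
        try:
            fetch(probe_file)
        except Exception:
            # A source that cannot serve the probe file is skipped.
            continue
        elapsed = time.perf_counter() - start
        if elapsed < best_elapsed:
            best_name, best_elapsed = name, elapsed
    return best_name
```

A caller would pass something like `{"local_ipfs": fetch_from_ipfs, "s3": fetch_from_s3}` (both hypothetical callables) and then use the returned name to download the remaining files. This sketch probes sources one after another for simplicity; a real implementation might probe them concurrently.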

It also provides parallel downloading functions, functions to load data by shard index, functions to load IPFS_CID indices, functions to convert Parquet files to IPLD CAR files, and other dataset manipulation utilities that help manage large amounts of data. These try to optimize both parallelization and memory consumption, using strategies such as dataset streaming and indexing.
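As a rough illustration of loading shards by index in parallel, the sketch below uses a bounded thread pool from the Python standard library. It does not reflect this module's API: `load_shards_by_index` and the `load_shard` callable are hypothetical names, and a shard is assumed to be whatever object `load_shard` returns for a given index.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def load_shards_by_index(
    indices: Iterable[int],
    load_shard: Callable[[int], Any],
    max_workers: int = 4,
) -> Dict[int, Any]:
    """Load the requested shards concurrently and return them keyed by index.

    Duplicate indices are loaded only once; max_workers caps how many
    shards are being fetched at the same time.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(indices))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        results = pool.map(load_shard, unique)
        return dict(zip(unique, results))
```

Note that this sketch keeps every loaded shard in memory at once. When the full set of shards does not fit in memory, processing each shard as it arrives, or streaming the dataset, is the more appropriate strategy.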
