The order of the dataset was changed after executing the compact_files operation.
My code is as follows:
    import lance
    from lance.dataset import DatasetOptimizer

    DB_PATH = "<path-to-my-dataset>"

    def main():
        dataset = lance.dataset(DB_PATH)
        # show the first ten elements
        print(dataset.take([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))

        # compact the dataset
        optim = DatasetOptimizer(dataset)
        optim.compact_files(num_threads=8)

        # the first ten elements have changed
        print(dataset.take([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
        return

    if __name__ == "__main__":
        main()
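Printing ten rows shows that something moved, but not where the order first diverges or whether any rows were lost. A minimal sketch of a stricter check, assuming the values of the id column are unique and fit in memory as Python lists; `first_mismatch` and `same_rows_any_order` are hypothetical helpers, not part of Lance:

```python
from itertools import zip_longest

# Sentinel used to pad the shorter sequence so a length difference counts as a mismatch.
_MISSING = object()


def first_mismatch(before, after):
    """Return the first index where the two sequences differ, or None if they are identical."""
    for index, (a, b) in enumerate(zip_longest(before, after, fillvalue=_MISSING)):
        if a != b:
            return index
    return None


def same_rows_any_order(before, after):
    """Return True if both sequences hold the same values, ignoring order."""
    return sorted(before) == sorted(after)


if __name__ == "__main__":
    ids_before = ["a", "b", "c", "d"]
    ids_after = ["a", "c", "b", "d"]  # same rows, different order
    print(first_mismatch(ids_before, ids_after))
    print(same_rows_any_order(ids_before, ids_after))
```

To apply this to the dataset above, the two lists could be read before and after compaction with something like `dataset.to_table(columns=["id"])["id"].to_pylist()`. If `same_rows_any_order` holds but `first_mismatch` returns an index, the rows were only reordered, not dropped.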
My environment
I am using an Ubuntu server with 64 cores and 512 GB of memory.
The dataset has 5 columns: title(str), section(str), text(str), id(str), and vector(list[float]).
How to reproduce
The dataset has 38 million records, each with a 768-dimensional vector and payload. I'm not sure if it's feasible to share the dataset.
If there are enough files to justify multiple concurrent compaction tasks (by default this means at least 2Mi uncompacted rows), then we run the compaction tasks in parallel.
I'm not sure whether we put the results back in sequence afterward, but this seems a likely cause of the reordering.
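To illustrate the hypothesis, here is a small standalone simulation of how parallel tasks can come back out of order when results are collected as they finish rather than in submission order. This is not Lance's compaction code; `compact` and `run_tasks` are made-up names, and the delays stand in for tasks of different sizes:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def compact(task_id, delay):
    """Stand-in for one compaction task; sleeps to simulate work, then returns its id."""
    time.sleep(delay)
    return task_id


def run_tasks(delays, sequenced):
    """Run one task per delay in parallel and return the task ids in collection order.

    sequenced=True collects results in submission order.
    sequenced=False collects results in completion order.
    """
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(compact, i, d) for i, d in enumerate(delays)]
        if sequenced:
            return [f.result() for f in futures]
        return [f.result() for f in as_completed(futures)]


if __name__ == "__main__":
    delays = [0.3, 0.0, 0.15]  # task 0 is the slowest, task 1 the fastest
    print("submission order:", run_tasks(delays, sequenced=True))
    print("completion order:", run_tasks(delays, sequenced=False))
```

If compaction results are committed in completion order, the new fragments would end up in a different order than the ones they replaced, which would match the behavior reported here.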
Lance Version
pylance 0.20.0