
Commit 92c0bb9 (parent 648adf8)

feat: added example for reading and writing dataset in rust (lancedb#2349)

File tree

3 files changed (+97 −3 lines)


docs/examples/examples.rst (+4 −3)

@@ -2,7 +2,8 @@ Examples
 --------
 
 .. toctree::
-   :maxdepth: 1
+   :maxdepth: 1
 
-   Creating text dataset for LLM training using Lance <./llm_dataset_creation>
-   Training LLMs using a Lance text dataset <./llm_training>
+   Creating text dataset for LLM training using Lance <./llm_dataset_creation.rst>
+   Training LLMs using a Lance text dataset <./llm_training.rst>
+   Reading and writing a Lance dataset in Rust <./write_read_dataset.rst>

docs/examples/write_read_dataset.rst (+29, new file)

Writing and reading a dataset using Lance
-----------------------------------------

In this example, we will write a simple Lance dataset to disk, then read it back and print some basic properties, such as the schema and the size of each record batch in the dataset.
The example uses only one record batch, but it works the same way for larger datasets with multiple record batches.

Writing the raw dataset
~~~~~~~~~~~~~~~~~~~~~~~

.. literalinclude:: ../../rust/lance/examples/write_read_ds.rs
   :language: rust
   :linenos:
   :start-at: // Writes sample dataset to the given path
   :end-at: } // End write dataset

First we define a schema for our dataset and create a record batch from that schema. We then define the write parameters (set to overwrite) and write the record batches (only one in this case) to disk.

Reading a Lance dataset
~~~~~~~~~~~~~~~~~~~~~~~

Now that we have written the dataset to a new directory, we can read it back and print out some basic properties.

.. literalinclude:: ../../rust/lance/examples/write_read_ds.rs
   :language: rust
   :linenos:
   :start-at: // Reads dataset from the given path
   :end-at: // End read dataset

First we open the dataset and create a scanner object, which we use to build a `batch_stream` that lets us access each record batch in the dataset.
Then we iterate over the record batches and print the size and schema of each one.

rust/lance/examples/write_read_ds.rs (+64, new file)

// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The Lance Authors

use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
    // Define new schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt32, false),
        Field::new("value", DataType::UInt32, false),
    ]));

    // Create new record batches
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
            Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
        ],
    )
    .unwrap();

    let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());

    // Define write parameters (e.g. overwrite dataset)
    let write_params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };

    Dataset::write(batches, data_path, Some(write_params))
        .await
        .unwrap();
} // End write dataset

// Reads dataset from the given path and prints batch size and schema for all record batches
async fn read_dataset(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let scanner = dataset.scan();

    let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());

    while let Some(batch) = batch_stream.next().await {
        println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
        println!("Schema: {:?}", batch.schema()); // print schema of the record batch

        println!("Batch: {:?}", batch); // print the entire record batch (schema and data)
    }
} // End read dataset

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}
