Make Blocks addressable from the file reader #322

osopardo1 · 2024-04-25T06:04:44Z

From v0.6.0 onwards, a Table is composed of files that contain multiple blocks, each of which may belong to the same cube or to different cubes. This is part of the Multiblock format, which allows Qbeast to balance the file layout without losing indexing benefits.

Blocks help us locate a particular cube within a file, but a single block is not addressable or retrievable from the Spark reader. Although we use Delta File Skipping to discard data based on min/max statistics, we do not support such a fine-grained search when Sampling is applied.

This change depends on some of the work in #175. Datasource V2 is more extensible and allows us to implement our own reader. In this case, the reader should be designed to skip entire groups of rows based on the block number.
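To make the idea of skipping by block number concrete, here is a minimal, library-free sketch in Python. It is not the Qbeast reader or the Parquet API. `RowGroup`, `DataFile`, `read_blocks`, and the file path are hypothetical names invented for illustration. The sketch also assumes each group of rows stores exactly one block, which still has to be confirmed by the analysis listed in the TODOs.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class RowGroup:
    """A contiguous group of rows tagged with the block it stores.

    Assumption: one block per row group (hypothetical layout).
    """
    block_id: int
    rows: List[dict]


@dataclass(frozen=True)
class DataFile:
    """Toy stand-in for a multiblock file; not a real Qbeast or Parquet type."""
    path: str
    row_groups: List[RowGroup]


def read_blocks(data_file: DataFile, wanted: Set[int]) -> List[dict]:
    """Return rows only from the row groups whose block_id was requested."""
    out: List[dict] = []
    for group in data_file.row_groups:
        if group.block_id not in wanted:
            continue  # skip the whole group without touching its rows
        out.extend(group.rows)
    return out


# Hypothetical file with three blocks, one per row group.
example = DataFile(
    path="part-0000.parquet",
    row_groups=[
        RowGroup(block_id=0, rows=[{"id": 1}, {"id": 2}]),
        RowGroup(block_id=1, rows=[{"id": 3}]),
        RowGroup(block_id=2, rows=[{"id": 4}, {"id": 5}]),
    ],
)

# Only blocks 0 and 2 are read; the group holding block 1 is skipped entirely.
selected = read_blocks(example, wanted={0, 2})
```

In a real reader, the skip decision would have to rely on file metadata rather than in-memory objects, so that the bytes of the unwanted groups are never read from storage.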

PS: @alexeiakimov attempted this in previous issues, but other priorities took precedence.

TODOs:

- Analyze how to make blocks addressable from a Parquet file.
- Implement Datasource V2 for Qbeast.
- Make a PoC.
- Develop the feature and test it.

osopardo1 added the type: enhancement Improvement of existing feature or code label Apr 25, 2024
