First of all, file skipping based on stats does seem to work correctly. For example, in the following query kernel correctly skips files based on the predicate, so only one of the two files is passed to DuckDB for scanning:
FROM delta_scan('${DAT_PATH}/out/reader_tests/generated/basic_append/delta')
WHERE number > 4
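To illustrate the kind of stats-based skipping meant here, the following is a minimal sketch and not kernel's actual implementation. The file names and min/max values are hypothetical stand-ins for per-file column statistics; they are not read from the basic_append table.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file min/max statistics for the `number` column."""
    path: str
    min_number: int
    max_number: int


def may_match_greater_than(stats: FileStats, threshold: int) -> bool:
    # A file can only contain a row with number > threshold
    # if its maximum value exceeds the threshold.
    return stats.max_number > threshold


# Hypothetical files and statistics, for illustration only.
files = [
    FileStats("file_a.parquet", min_number=1, max_number=3),
    FileStats("file_b.parquet", min_number=4, max_number=5),
]

# Predicate: number > 4
kept = [f.path for f in files if may_match_greater_than(f, 4)]
print(kept)
```

With these assumed statistics, only the second file survives the predicate, which matches the "1 of the two files" behaviour described above.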
I have now added some test data to duckdb delta that aims to test file skipping for every type we can currently push down. To do so, I generate a few tables at paths of the form /generated/test_file_skipping/{type}/delta_lake. See the line generating these tables here.
What I would expect is to be able to skip files in this table using:
FROM delta_scan('./data/generated/test_file_skipping/bigint/delta_lake')
WHERE part=0
However, when I instrument DuckDB to print the files kernel passes to it, I can see that even though the filter is pushed down, both files are passed:
Pushing down filter part = 0
Scanning path file:///Users/sam/Development/delta-kernel-testing/data/generated/test_file_skipping/bigint/delta_lake/part=0/0-00900a4a-99cf-4d43-993c-41950d6ed025-0.parquet
Scanning path file:///Users/sam/Development/delta-kernel-testing/data/generated/test_file_skipping/bigint/delta_lake/part=1/0-00900a4a-99cf-4d43-993c-41950d6ed025-0.parquet
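To make the expected behaviour concrete, here is a rough sketch of partition-based pruning for the two files above. It is an illustration only and not how kernel works: the helper names are hypothetical, and it parses partition values from the hive-style path segments for simplicity, whereas a Delta reader would normally take them from the partition values recorded in the transaction log.

```python
from urllib.parse import unquote


def hive_partition_values(path: str) -> dict:
    """Extract key=value pairs from the directory segments of a hive-style path."""
    values = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)
    return values


def prune_by_partition(paths: list, column: str, expected: str) -> list:
    # Keep a file if its partition value matches. Files without the
    # column are kept too, since they cannot be ruled out safely.
    return [
        p for p in paths
        if hive_partition_values(p).get(column, expected) == expected
    ]


# Paths relative to the table root, taken from the log output above.
paths = [
    "part=0/0-00900a4a-99cf-4d43-993c-41950d6ed025-0.parquet",
    "part=1/0-00900a4a-99cf-4d43-993c-41950d6ed025-0.parquet",
]

# Predicate: part = 0 (partition values compared as strings in this sketch)
to_scan = prune_by_partition(paths, "part", "0")
print(to_scan)
```

Under these assumptions only the part=0 file would be handed to DuckDB, which is the result I was expecting from the query.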
Hey Sam, I'm currently working on this. Right now data skipping doesn't take hive-style partition paths like this into account. I have to upstream a few expression changes so that this is also compatible with delta-rs, but just so you're aware, it's on my radar.
In the duckdb delta extension I'm not seeing file skipping based on partitions.