CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Closed
@ilyanoskov

Description

I recently had a case where I had to process a Pandas dataframe with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving the data to CSV / Parquet and querying the file, chDB computed the results in 4-5 seconds; when operating over the in-memory Arrow table, the same queries took close to 30 seconds.

Steps to reproduce are simple: create a dataframe with random data in 5 columns (id, time, val1, val2, val3) and 70M rows, then perform complex GROUP BY / WINDOW operations over it in memory. Then save the dataframe to a file and run the same queries over the file. The file-based queries are significantly faster; a rough reproduction sketch follows.
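A minimal sketch of that comparison, assuming a recent chdb build where an in-memory DataFrame can be queried through the Python() table function (older releases expose a chdb.dataframe wrapper instead, so the in-memory call may need adjusting); the column names, row count, file name, and the particular GROUP BY / window queries are only illustrative placeholders:

```python
import time

import numpy as np
import pandas as pd
import chdb

N = 70_000_000  # reduce if memory is tight; the gap is visible at smaller sizes too
df = pd.DataFrame({
    "id": np.random.randint(0, 1_000, N),
    "time": np.arange(N),
    "val1": np.random.rand(N),
    "val2": np.random.rand(N),
    "val3": np.random.rand(N),
})

# One aggregation query and one window query, parameterised by the data source.
group_sql = "SELECT id, sum(val1) AS s1, avg(val2) AS a2 FROM {src} GROUP BY id"
window_sql = ("SELECT id, val3 - avg(val3) OVER (PARTITION BY id) AS centered "
              "FROM {src}")

def run(src: str) -> float:
    """Run both queries against the given source and return elapsed seconds."""
    start = time.time()
    chdb.query(group_sql.format(src=src))
    chdb.query(window_sql.format(src=src))
    return time.time() - start

# In-memory path: query the DataFrame directly (Arrow under the hood).
arrow_seconds = run("Python(df)")

# File path: dump to Parquet and run the same queries over the file.
df.to_parquet("data.parquet")
parquet_seconds = run("file('data.parquet', Parquet)")

print(f"in-memory: {arrow_seconds:.1f}s, Parquet file: {parquet_seconds:.1f}s")
```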

I would have imagined that working with Arrow tables in memory would be faster, since accessing memory is faster than accessing disk?

Labels: Arrow (Apache Arrow support)
