CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Closed
@ilyanoskov

Description

I recently had a case where I had to process a Pandas dataframe with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving the data to CSV / Parquet and querying the file, chDB computed the results in 4-5 seconds; when operating over the in-memory Arrow table, the same queries took close to 30 seconds.

Steps to reproduce are simple: create a dataframe with random data in 5 columns (id, time, val1, val2, val3) and 70M rows, then perform complex GROUP BY / WINDOW operations over it in memory. Then save the dataframe to a file and run the same queries over the file. The file-based queries are significantly faster; a rough reproduction sketch follows.
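A minimal sketch of that comparison, assuming a recent chdb build where an in-memory DataFrame can be queried through the Python() table function (older releases expose a chdb.dataframe wrapper instead, so the in-memory call may need adjusting); the column names, row count, file name, and the particular GROUP BY / window queries are only illustrative placeholders:

```python
import time

import numpy as np
import pandas as pd
import chdb

N = 70_000_000  # reduce if memory is tight; the gap is visible at smaller sizes too
df = pd.DataFrame({
    "id": np.random.randint(0, 1_000, N),
    "time": np.arange(N),
    "val1": np.random.rand(N),
    "val2": np.random.rand(N),
    "val3": np.random.rand(N),
})

# One aggregation query and one window query, parameterised by the data source.
group_sql = "SELECT id, sum(val1) AS s1, avg(val2) AS a2 FROM {src} GROUP BY id"
window_sql = ("SELECT id, val3 - avg(val3) OVER (PARTITION BY id) AS centered "
              "FROM {src}")

def run(src: str) -> float:
    """Run both queries against the given source and return elapsed seconds."""
    start = time.time()
    chdb.query(group_sql.format(src=src))
    chdb.query(window_sql.format(src=src))
    return time.time() - start

# In-memory path: query the DataFrame directly (Arrow under the hood).
arrow_seconds = run("Python(df)")

# File path: dump to Parquet and run the same queries over the file.
df.to_parquet("data.parquet")
parquet_seconds = run("file('data.parquet', Parquet)")

print(f"in-memory: {arrow_seconds:.1f}s, Parquet file: {parquet_seconds:.1f}s")
```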

I would have imagined that working with Arrow tables in memory would be faster, since accessing memory is faster than accessing disk?

Labels: Arrow (Apache Arrow support)
