Cannot query pandas table with index starting above zero #282

Open
ADBond opened this issue Oct 30, 2024 · 2 comments

ADBond commented Oct 30, 2024

If I query a pandas DataFrame whose index does not contain the value 0, I get an error: chdb.ChdbError: Code: 1001. DB::Exception: pybind11::error_already_set: KeyError: 0. If the index contains 0 but skips other values, everything works as expected.

import pandas as pd
import chdb

df_okay = pd.DataFrame(
    data={"id": ["id1", "id2", "id3"], "name": ["nm1", "nm2", "nm3"]},
    index=[0, 2, 3],
)
# this is fine:
chdb.query("SELECT * FROM Python(df_okay)").show()

df_problem = pd.DataFrame(
    data={"id": ["id1", "id2", "id3"], "name": ["nm1", "nm2", "nm3"]},
    index=[1, 2, 3],
)
# error:
chdb.query("SELECT * FROM Python(df_problem)").show()

This happens on several Python versions (3.9-3.12) and on both macOS and Ubuntu. For reference:

chdb==2.1.1
pandas==2.2.3

auxten commented Oct 31, 2024

Forgive my ignorance, I really didn't know that a DataFrame can have its index set. I will debug the issue you mentioned.
BTW, I'm curious about the application scenarios and objectives of setting an index like this?

auxten self-assigned this Oct 31, 2024

ADBond commented Oct 31, 2024

> Forgive my ignorance, I really didn't know that a DataFrame can have its index set. I will debug the issue you mentioned. BTW, I'm curious about the application scenarios and objectives of setting an index like this?

No worries, thanks!

I am no expert, as I rarely work with pandas indexes directly, but my understanding is that they are more performant than ordinary columns for certain operations, such as row selection and joins (DataFrame.merge in pandas terms). So you may want an index that holds semantically meaningful data, if that suits the kind of operations you are doing.
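As a rough illustration of that kind of usage (a sketch only; the frames and the "score" column are made up for the example, and it shows the API rather than measuring any performance difference), label-based row selection and an index-on-index merge look like this:

```python
import pandas as pd

# Hypothetical frames that use a meaningful "id" index instead of 0..n-1.
people = pd.DataFrame(
    {"name": ["nm1", "nm2", "nm3"]},
    index=pd.Index(["id1", "id2", "id3"], name="id"),
)
scores = pd.DataFrame(
    {"score": [10, 20]},
    index=pd.Index(["id1", "id3"], name="id"),
)

# Row selection by index label rather than by filtering a column.
row = people.loc["id2"]

# Join on the index of both frames (the default merge is an inner join).
joined = people.merge(scores, left_index=True, right_index=True)
print(joined)
```

Only the ids present in both frames survive the inner merge, so "id2" is dropped from the result.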

In my case it happens as a side effect of subsetting some data: the pandas frame I receive is a (row) subset of some other frame, and by default subsetting leaves the index values unchanged (so I run into this whenever the rows I select don't include the row with index value 0). In this case I think there is a fairly easy workaround: I can reset the index before running a query on it.
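A minimal sketch of that workaround, assuming it is fine to discard the original index labels. The helper name with_zero_based_index and the parent frame are just for illustration, and the chdb call is left as a comment so the snippet runs with pandas alone:

```python
import pandas as pd


def with_zero_based_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose index is the default 0..n-1 range."""
    # drop=True discards the old index instead of keeping it as a column.
    return df.reset_index(drop=True)


# A row subset keeps the parent frame's index labels, so 0 can go missing.
parent = pd.DataFrame(
    data={
        "id": ["id0", "id1", "id2", "id3"],
        "name": ["nm0", "nm1", "nm2", "nm3"],
    },
)
subset = parent[parent["id"] != "id0"]    # index is [1, 2, 3]
df_fixed = with_zero_based_index(subset)  # index is [0, 1, 2]

# With chdb installed, the query from the report can then be run as:
#   import chdb
#   chdb.query("SELECT * FROM Python(df_fixed)").show()
```

If the old index values matter, reset_index() without drop=True keeps them as an ordinary column instead.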
