Skip to content

describe does not handle mixed case or dots in column names #16017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
johnkerl opened this issue May 10, 2025 · 4 comments · Fixed by #16082
Closed

describe does not handle mixed case or dots in column names #16017

johnkerl opened this issue May 10, 2025 · 4 comments · Fixed by #16082
Assignees
Labels
bug Something isn't working

Comments

@johnkerl
Copy link

johnkerl commented May 10, 2025

Describe the bug

Doing describe on a table with any upper-case or dot characters in column names results in

Error: Execution("Schema error: No field named ...")

To Reproduce

./Cargo.toml

[package]
name = "describe-with-case"
version = "0.1.0"
edition = "2024"

[dependencies]
"datafusion" = "47"
tokio = { version = "1.45", features = ["rt", "rt-multi-thread", "macros"] }

src/main.rs

use std::env;
use datafusion::error::Result;
use datafusion::execution::context::SessionContext;
use datafusion::prelude::CsvReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();
    let ctx = SessionContext::new();
    for arg in args.iter().skip(1) {
        println!("");
        println!("Filename: {arg}");

        let df = ctx.read_csv(arg, CsvReadOptions::new()).await?;
        let stat = df.describe().await?.collect().await?;
        println!("{stat:?}");
    }

    Ok(())
}

./desc-good.csv

abc,def,ghi
1,2,3
4,5,6
7,8,9

./desc-bad.csv

abc,Def,gh.i
1,2,3
4,5,6
7,8,9

Expected behavior

With column names abc,def,ghi we see

cargo run ./desc-good.csv

Filename ./desc-good.csv
[RecordBatch { schema: Schema { fields: [Field { name: "describe", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "abc", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "def", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "ghi", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StringArray
[
  "count",
  "null_count",
  "mean",
  "std",
  "min",
  "max",
  "median",
], PrimitiveArray<Float64>
[
  3.0,
  0.0,
  4.0,
  3.0,
  1.0,
  7.0,
  4.0,
], PrimitiveArray<Float64>
[
  3.0,
  0.0,
  5.0,
  3.0,
  2.0,
  8.0,
  5.0,
], PrimitiveArray<Float64>
[
  3.0,
  0.0,
  6.0,
  3.0,
  3.0,
  9.0,
  6.0,
]], row_count: 7 }]

With column names abc,Def,gh.i I would expect similar. But I actually see:

cargo run ./desc-bad.csv

Filename ./desc-bad.csv
Error: Execution("Schema error: No field named def. Valid fields are \"?table?\".abc, \"?table?\".\"Def\", \"?table?\".\"gh.i\".")

Additional context

No response

@johnkerl johnkerl added the bug Something isn't working label May 10, 2025
@jfahne
Copy link
Contributor

jfahne commented May 14, 2025

take

@jfahne
Copy link
Contributor

jfahne commented May 16, 2025

Just an update. I was able to reproduce the error with the following goofy test added to the dataframe tests

#[tokio::test]
async fn bad_describe_behavior() -> Result<()> {
    let config = SessionConfig::from_string_hash_map(&HashMap::from([(
        "datafusion.sql_parser.enable_ident_normalization".to_owned(),
        "true".to_owned(),
    )]))?;
    let ctx = SessionContext::new_with_config(config);
    let name = "aggregate_test_100";
    register_aggregate_csv(&ctx, name).await?;
    let df = ctx.table(name);

    let df = df
        .await?
        .filter(col("c2").eq(lit(3)).and(col("c1").eq(lit("a"))))?
        .limit(0, Some(1))?
        .sort(vec![
            // make the test deterministic
            col("c1").sort(true, true),
            col("c2").sort(true, true),
            col("c3").sort(true, true),
        ])?
        .select_columns(&["c1"])?;

    let df_renamed = df.clone().with_column_renamed("c1", "CoLu.Mn1")?;

    let res = &df_renamed.clone().collect().await?;

    println!("{:?}", df_renamed.describe().await.unwrap());

    Ok(())
}

I get a similar error to the one you shared @johnkerl :

thread 'dataframe::bad_describe_behavior' panicked at datafusion/core/tests/dataframe/mod.rs:1883:50:
called `Result::unwrap()` on an `Err` value: Execution("Schema error: No field named colu.mn1. Valid fields are aggregate_test_100.\"CoLu.Mn1\".")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test dataframe::bad_describe_behavior ... FAILED

failures:

failures:
    dataframe::bad_describe_behavior

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 606 filtered out; finished in 0.10s

One of my initial thoughts for where this error would be coming from is an unexpected behavior of the identifier normalization which is enabled by default. This seems to be the right thread to pull at as disabling datafusion.sql_parser.enable_ident_normalization and running the same test leads to an identical error.

@johnkerl
Copy link
Author

Thanks @jfahne ! :)

@jfahne
Copy link
Contributor

jfahne commented May 16, 2025

I figured it out! The error is coming from here in the call to the describe(...) method (link to line 937 of mod.rs). The col(f.name()) call is passing the string "CoLu.Mn1" to col which is defined here (link to line 65 of expr_fn.rs). col implicitly normalizes the identifier. Using a format! macro, I can pass in the identifier with escaped quotes which leads to the expected behavior. I'll clean it up and submit a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants