`describe` does not handle mixed case or dots in column names #16017

johnkerl · 2025-05-10T14:51:32Z

Describe the bug

Doing describe on a table with any upper-case or dot characters in column names results in

Error: Execution("Schema error: No field named ...")

To Reproduce

./Cargo.toml

[package]
name = "describe-with-case"
version = "0.1.0"
edition = "2024"

[dependencies]
"datafusion" = "47"
tokio = { version = "1.45", features = ["rt", "rt-multi-thread", "macros"] }

src/main.rs

use std::env;
use datafusion::error::Result;
use datafusion::execution::context::SessionContext;
use datafusion::prelude::CsvReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();
    let ctx = SessionContext::new();
    for arg in args.iter().skip(1) {
        println!("");
        println!("Filename: {arg}");

        let df = ctx.read_csv(arg, CsvReadOptions::new()).await?;
        let stat = df.describe().await?.collect().await?;
        println!("{stat:?}");
    }

    Ok(())
}

./desc-good.csv

abc,def,ghi
1,2,3
4,5,6
7,8,9

./desc-bad.csv

abc,Def,gh.i
1,2,3
4,5,6
7,8,9

Expected behavior

With column names abc,def,ghi we see

cargo run ./desc-good.csv

Filename ./desc-good.csv
[RecordBatch { schema: Schema { fields: [Field { name: "describe", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "abc", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "def", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "ghi", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StringArray
[
  "count",
  "null_count",
  "mean",
  "std",
  "min",
  "max",
  "median",
], PrimitiveArray<Float64>
[
  3.0,
  0.0,
  4.0,
  3.0,
  1.0,
  7.0,
  4.0,
], PrimitiveArray<Float64>
[
  3.0,
  0.0,
  5.0,
  3.0,
  2.0,
  8.0,
  5.0,
], PrimitiveArray<Float64>
[
  3.0,
  0.0,
  6.0,
  3.0,
  3.0,
  9.0,
  6.0,
]], row_count: 7 }]

With column names abc,Def,gh.i I would expect similar. But I actually see:

cargo run ./desc-bad.csv

Filename ./desc-bad.csv
Error: Execution("Schema error: No field named def. Valid fields are \"?table?\".abc, \"?table?\".\"Def\", \"?table?\".\"gh.i\".")

Additional context

No response

The text was updated successfully, but these errors were encountered:

jfahne · 2025-05-14T11:55:49Z

take

jfahne · 2025-05-16T14:57:16Z

Just an update. I was able to reproduce the error with the following goofy test added to the dataframe tests

#[tokio::test]
async fn bad_describe_behavior() -> Result<()> {
    let config = SessionConfig::from_string_hash_map(&HashMap::from([(
        "datafusion.sql_parser.enable_ident_normalization".to_owned(),
        "true".to_owned(),
    )]))?;
    let ctx = SessionContext::new_with_config(config);
    let name = "aggregate_test_100";
    register_aggregate_csv(&ctx, name).await?;
    let df = ctx.table(name);

    let df = df
        .await?
        .filter(col("c2").eq(lit(3)).and(col("c1").eq(lit("a"))))?
        .limit(0, Some(1))?
        .sort(vec![
            // make the test deterministic
            col("c1").sort(true, true),
            col("c2").sort(true, true),
            col("c3").sort(true, true),
        ])?
        .select_columns(&["c1"])?;

    let df_renamed = df.clone().with_column_renamed("c1", "CoLu.Mn1")?;

    let res = &df_renamed.clone().collect().await?;

    println!("{:?}", df_renamed.describe().await.unwrap());

    Ok(())
}

I get a similar error to the one you shared @johnkerl :

thread 'dataframe::bad_describe_behavior' panicked at datafusion/core/tests/dataframe/mod.rs:1883:50:
called `Result::unwrap()` on an `Err` value: Execution("Schema error: No field named colu.mn1. Valid fields are aggregate_test_100.\"CoLu.Mn1\".")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test dataframe::bad_describe_behavior ... FAILED

failures:

failures:
    dataframe::bad_describe_behavior

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 606 filtered out; finished in 0.10s

One of my initial thoughts for where this error would be coming from is an unexpected behavior of the identifier normalization which is enabled by default. This seems to be the right thread to pull at as disabling datafusion.sql_parser.enable_ident_normalization and running the same test leads to an identical error.

johnkerl · 2025-05-16T16:01:11Z

Thanks @jfahne ! :)

jfahne · 2025-05-16T21:52:13Z

I figured it out! The error is coming from here in the call to the describe(...) method (link to line 937 of mod.rs). The col(f.name()) call is passing the string "CoLu.Mn1" to col which is defined here (link to line 65 of expr_fn.rs). col implicitly normalizes the identifier. Using a format! macro, I can pass in the identifier with escaped quotes which leads to the expected behavior. I'll clean it up and submit a PR.

johnkerl added the bug Something isn't working label May 10, 2025

github-actions bot assigned jfahne May 14, 2025

jfahne mentioned this issue May 18, 2025

fix: describe escaped quoted identifiers #16082

Merged

alamb closed this as completed in #16082 May 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`describe` does not handle mixed case or dots in column names #16017

`describe` does not handle mixed case or dots in column names #16017

johnkerl commented May 10, 2025 •

edited

Loading

jfahne commented May 14, 2025

Uh oh!

jfahne commented May 16, 2025 •

edited

Loading

Uh oh!

johnkerl commented May 16, 2025

Uh oh!

jfahne commented May 16, 2025

Uh oh!

describe does not handle mixed case or dots in column names #16017

describe does not handle mixed case or dots in column names #16017

Comments

johnkerl commented May 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the bug

To Reproduce

Expected behavior

Additional context

jfahne commented May 14, 2025

Uh oh!

jfahne commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnkerl commented May 16, 2025

Uh oh!

jfahne commented May 16, 2025

Uh oh!

`describe` does not handle mixed case or dots in column names #16017

`describe` does not handle mixed case or dots in column names #16017

johnkerl commented May 10, 2025 •

edited

Loading

jfahne commented May 16, 2025 •

edited

Loading