Print rows with invalid data type #654

Lavi2015 · 2021-10-07T02:43:02Z

Question about pandera

Hi,

I am trying out pandera for schema validation.
In the following example, I would like to know whether the 1st row will be flagged as an error (because of "a" in "int_column").

print(err.failure_cases) reports "int_column dtype('int64')" as the failed check, but without a row or index value.

According to the documentation page (err.data # invalid dataframe), err.data should return only the invalid row.
But err.data prints all 3 rows instead of just the 1st row, even though the data types in the 2nd and 3rd rows match the schema.
Please correct me if my understanding is wrong.

Is it possible to print only the rows where the data type does not match the schema in the following example?
Thanks and regards

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

schema = pa.DataFrameSchema(
    columns={
        "int_column": Column(int),
        "float_column": Column(float),
        "str_column": Column(str),
    },
    strict=True
)

df = pd.DataFrame({
    "int_column": ["a", 2,3],
    "float_column": [0.0, 1.0, 2.0],
    "str_column": ["a", "b", "c"],
})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    print(err.failure_cases)
    print("\nDataFrame object that failed validation:")
    print(err.data)

Output:

Schema errors and failure cases:
  schema_context      column           check check_number failure_case index
0         Column  int_column  dtype('int64')         None       object  None

DataFrame object that failed validation:
  int_column  float_column str_column
0          a           0.0          a
1          2           1.0          b
2          3           2.0          c

jeffzi · 2021-10-10T21:52:46Z

Collaborator

Hi @Lavi2015.

By default, pandera does not check values individually; it checks the dtypes of the columns (i.e. DataFrame.dtypes). To see the exact failure cases, you can enable coerce=True. Pandera will then attempt to coerce the DataFrame to the schema dtypes and report the values that could not be coerced:

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

schema = pa.DataFrameSchema(
    columns={
        "int_column": Column(int),
        "float_column": Column(float),
        "str_column": Column(str),
    },
    strict=True,
    coerce=True, # <----
)

df = pd.DataFrame(
    {
        "int_column": ["a", 2, 3],
        "float_column": [0.0, 1.0, 2.0],
        "str_column": ["a", "b", "c"],
    }
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    print(err.failure_cases)
    print("\nDataFrame object that failed validation:")
    print(err.data)
#> Schema errors and failure cases:
#>   schema_context      column                  check check_number failure_case  \
#> 0         Column  int_column  coerce_dtype('int64')         None            a   
#> 1         Column  int_column         dtype('int64')         None       object   
#> 
#>   index  
#> 0     0  
#> 1  None  
#> 
#> DataFrame object that failed validation:
#>   int_column  float_column str_column
#> 0          a           0.0          a
#> 1          2           1.0          b
#> 2          3           2.0          c

It's true that this behavior could be made easier to discover. We've talked before about writing a cookbook. I think that would be a good recipe.
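
A recipe along those lines might look like the sketch below. It picks out only the offending rows of the original data by using the index column of failure_cases. To keep it runnable without pandera, failure_cases is rebuilt by hand from the output shown above; that stand-in, and the names bad_labels and invalid_rows, are assumptions for illustration. In real code you would use err.failure_cases from the caught SchemaErrors.

```python
import pandas as pd

# The DataFrame from the example above.
df = pd.DataFrame(
    {
        "int_column": ["a", 2, 3],
        "float_column": [0.0, 1.0, 2.0],
        "str_column": ["a", "b", "c"],
    }
)

# Hand-built stand-in for err.failure_cases, copied from the printed output
# above (assumption: in real code this comes from the caught exception).
failure_cases = pd.DataFrame(
    {
        "schema_context": ["Column", "Column"],
        "column": ["int_column", "int_column"],
        "check": ["coerce_dtype('int64')", "dtype('int64')"],
        "check_number": [None, None],
        "failure_case": ["a", "object"],
        "index": pd.Series([0, None], dtype=object),
    }
)

# Column-level dtype failures carry no row label (index is None), so drop them.
bad_labels = failure_cases["index"].dropna().unique().tolist()

# Keep only the rows of the original data whose label was reported.
invalid_rows = df[df.index.isin(bad_labels)]
print(invalid_rows)
```

Filtering with df.index.isin(...) rather than df.loc[...] is a deliberate choice here: it does not raise if a reported label is missing from df.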

Lavi2015 · 2021-10-11T09:44:49Z

Author

Hi @jeffzi,
Thanks for your response; I really appreciate your time.
I have installed pandera==0.7.2.

try:
    schema.validate(df_uc1, lazy=True)
    print("Schema validation is completed successfully")
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)

My dataset has around 100 thousand records, of which 90 rows fail due to a dtype mismatch.
But print(exc.failure_cases) does not print all 90 rows to stdout. I can see only the first and last 5 rows.

I tried redirecting the exceptions to a file but still I could see only first and last 5 lines. My intention is to find all the failing rows (from the index column) for troubleshooting and to identify all the issues. How can I achieve this, given that print(exc.failure_cases) does not print all 90 rows?

Also, what is the first column about? Thanks.

jeffzi · 2021-10-11T11:03:40Z

Collaborator

failure_cases is actually a DataFrame. You can ask pandas to print more rows with pd.options.display.max_rows = 999.

> I tried redirecting the exceptions to a file but still I could see only first and last 5 lines.

How did you redirect? Exporting to CSV with err.failure_cases.to_csv("failures.csv", index=False) should write all the failures.
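
To make the options concrete, here is a small sketch using a synthetic 90-row DataFrame in place of the real failure_cases (the data is made up for illustration). It shows three ways to get every row: lifting the display limit temporarily, rendering with to_string(), and writing CSV. An in-memory buffer is used so the snippet leaves no file behind; a path such as "failures.csv" works the same way.

```python
import io

import pandas as pd

# Synthetic stand-in for exc.failure_cases with 90 rows (made-up data).
failure_cases = pd.DataFrame(
    {
        "column": ["int_column"] * 90,
        "failure_case": ["a"] * 90,
        "index": range(90),
    }
)

# Option 1: lift the row limit only inside this block.
with pd.option_context("display.max_rows", None):
    print(failure_cases)

# Option 2: to_string() renders every row regardless of display options.
full_text = failure_cases.to_string()

# Option 3: write all rows as CSV; index=False leaves out the row counter.
buffer = io.StringIO()
failure_cases.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

On the earlier question about the first column: the unlabeled leftmost column in the printed output is the row index of failure_cases itself (0, 1, 2, ...), which pandas prints for any DataFrame. The column named index is the one that holds the row labels from the validated data.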

Lavi2015 · 2021-10-11T11:19:16Z

Author

@jeffzi , Awesome. It works. Thank you so much.
