Great Expectations vs Pandera #598
-
Question about pandera
Hi there, I've used pandera in the past to validate data processing pipelines for ML workflows. My current org is doing a spike on Great Expectations to try to improve the quality of our data ingestion process. Could anyone here provide insight as to the differences between Great Expectations and Pandera, whether or not they overlap or do similar things? It seems like there is some overlap but I'm sure the group here could tell me more about the nuances and differences between the two resources. Thanks in advance for your help!
Replies: 6 comments
-
Hi @Veganveins, thanks for your question!
One big caveat here is that I haven't used GE very extensively, but I'll do my best to summarize the similarities and differences.
Overlap
The main overlap is that both libraries aim to solve the same problem of ensuring data quality, but I think the approach pandera takes is closer in spirit to pydantic or dataclasses, in that it's a lightweight package that focuses on one thing: parsing and validation of in-memory dataframes. Think of this as run-time enforced type annotations for your dataframes.
Differences
GE provides data validation, profiling, and documentation, and is closer to a declarative tool that you'd integrate with your various data stores (SQL, etc.) or cluster computing environments like Spark. Their docs go into more detail on the package's functionality.
Pandera is designed to be useful with zero configuration, and its syntax is optimized for intuitiveness and ease of use for folks already familiar with pandas/pandas-like libraries. Currently only pandas is supported, but we're working on getting support for Koalas, Modin, and eventually Dask and other dataframe frameworks (SQL parsing/validation would be a heavy lift, but might be added to the roadmap if there's enough demand).
On the other hand, GE looks like it requires some upfront investment in configuration and setup, but once that's done it provides a whole suite of useful features (data profiling and docs look super useful), as well as a GUI for updating validation rules.
One thing that Pandera offers that GE doesn't is data synthesis strategies, which integrate with hypothesis to automatically generate mock data for use in a test suite (e.g. pytest).
Syntax
Syntactically, Pandera schemas are primarily written in Python, either with the object-based API or the class-based API, though it does support a YAML format and reading from frictionless schemas.
Pandera separates the schema specification from the object to be validated. With GE, it looks like the primary UX is to define validation rules declaratively in JSON files, which can then be loaded into a Python runtime to validate your tables of interest. It also exposes a Python API that (I think?) inherits from pandas dataframes and extends the …
Conclusion
Note that these two libraries are not mutually exclusive: e.g. you could use Pandera for in-memory parsing/validation and GE for validating data on disk, or use Pandera when prototyping and doing research and then port the Pandera schemas to a GE suite (a Pandera schema -> GE expectation suite conversion seems like a good idea to facilitate this 🤔).
Let me know if you have other questions!
-
You might find fugue interesting. They are running pandera on Spark and Dask through fugue; Kevin Kho (@kvnkho) wrote up a Medium post on it here!
-
Thank you @cosmicBboy and @rdmolony! Great content and very useful context. I don't have any other questions right now but I will follow up if I think of anything else :)
-
Thanks for tagging, @rdmolony. Coincidentally, there is this pull request into the pandera docs on how to use pandera on top of the Spark execution engine through Fugue. We connected with @cosmicBboy after PyCon. I talked about Great Expectations versus pandera in my PyCon presentation, but not in enough detail since it was only 30 minutes. @goodwanghan and I will also be using both in our upcoming O'Reilly course that came as a result of the PyCon presentation. I don't have much more to add to what @cosmicBboy said. Let's just say that Great Expectations has a larger surface area in your project, but you have to opt in to get those benefits (like data documentation). pandera is lightweight and non-invasive in your code. I'd be happy to chat more, @Veganveins, through Zoom or wherever if you're interested. My contact info is in my GitHub bio. 😄
-
Wow, thanks @kvnkho!! This presentation looks excellent and the O'Reilly course looks great!
-
@Veganveins thanks for the question, the discussion in here is great! Going to convert this to a GitHub discussion; would you mind selecting my response as the answer?