forked from rtdip/core
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Feature/00766 - Data Quality Article (rtdip#768)
* Article on Data Quality for Blog Signed-off-by: Amber-Rigg <[email protected]> * Review on Wording Signed-off-by: Amber-Rigg <[email protected]> * Update of date Signed-off-by: Amber-Rigg <[email protected]> * spelling reference Signed-off-by: Amber-Rigg <[email protected]> --------- Signed-off-by: Amber-Rigg <[email protected]>
- Loading branch information
1 parent
a5415b3
commit 4fe52f8
Showing
3 changed files
with
66 additions
and
4 deletions.
There are no files selected for viewing
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
--- | ||
date: 2024-06-24 | ||
authors: | ||
- GBARAS | ||
--- | ||
|
||
# Ensuring Data Quality at Speed with Real Time Data | ||
|
||
<center> | ||
![DataQualityImage](../images/data-quality.png){width=75%} | ||
</center> | ||
|
||
High quality data plays a pivotal role in business success across various dimensions. Accurate and reliable data empowers business leaders to make well informed decisions and achieve operational efficiency, promoting growth and profitability. Data quality encompasses more than just accuracy it also includes completeness, consistency, and relevance. | ||
|
||
<!-- more --> | ||
|
||
Maintaining consistent data quality becomes challenging without a robust data governance framework. Organizations often lack comprehensive data quality assessment procedures, so it’s crucial to regularly evaluate data quality using metrics and automated checks. Integrating data from various sources can introduce inconsistencies, but implementing data integration best practices ensures seamless data flow. Manual data entry is prone to errors, so automation reduces reliance on manual input. To measure data quality, define clear metrics such as accuracy and completeness, and track them consistently. Additionally, automate data cleansing routines (e.g., deduplication, validation) to streamline processes and reduce manual effort. Lastly, use of automation can help to identify incomplete or outdated records and regularly update data sources while retiring obsolete information. | ||
|
||
Maintaining data quality with time series data presents unique challenges. First, the high volume and velocity of incoming data makes real-time validation and processing difficult. Second, time series data often exhibits temporal dependencies, irregular sampling intervals, and missing values, requiring specialized handling. Lastly, dynamic data distribution due to seasonality, trends, or sudden events poses an ongoing challenge for adapting data quality checks. Ensuring data quality in time series streaming demands agility, adaptability and automation. | ||
|
||
## Data Quality Best Practices | ||
|
||
### Data Validation at Ingestion | ||
|
||
Implementing data validation checks when data enters a pipeline before any transformation can prevent issues from becoming lost and hard to track. It is possible to set this with automated scripts that can validate incoming data against predefined rules, for example, it is possible to check for duplication, outliers, missing values, inconsistent data types and much more. | ||
|
||
### Continuous Monitoring | ||
|
||
Monitoring of data quality can support the data validation and cleaning allowing the support team or developer to be notified of detected inconsistencies in the data. Early detection and alerting allow for quick action and prompt investigation which will prevent data quality degradation. | ||
|
||
### Data Cleansing and Preparation | ||
|
||
Automating data cleansing can be run as both a routine job and as a job triggered by failed data validation. Cleansing routines automatically correct or remove erroneous data, ensuring the dataset remains accurate and reliable. | ||
|
||
### Data Profiling | ||
|
||
Automated profiling tools can analyse data distributions, patterns, and correlations. By identifying these potential issues such as skewed distributions or duplicate records, businesses can proactively address them in their data validation and data cleansing processes. | ||
|
||
### Data Governance | ||
|
||
Data governance polices provide a clear framework to follow when ensuring data quality across a business. Managing access controls, data retention, and compliance, maintaining data quality and security. | ||
|
||
## RTDIP and Data Quality | ||
|
||
RTDIP now includes data quality scripts that support the end user in developing strong data quality pass gates for their datasets. The RTDIP component has been built using the open source tool Great Exceptions which is a Python-based open source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. | ||
|
||
RTDIP believes that data quality should be considered an integral part of any data pipeline, more information about RTDIPs data quality components can be found at [Examine Data Quality with Great Expectations](https://www.rtdip.io/sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations/). | ||
|
||
## Open Source Tools and Data Quality | ||
|
||
RTDIP empowers energy professionals to share solutions, RTDIP welcomes contributions and recognises the importance of sharing code. There are also a number of great open source data quality tools which have gained notoriety due to their transparency, adaptability, and community driven enhancements. | ||
|
||
Choosing the right tool depends on your specific requirements and architecture. Some notable open open source data quality tools include: | ||
|
||
* Built on Spark, Deequ is excellent for testing large datasets. It allows you to validate data using constraint suggestions and verification suites. | ||
* dbt Core is a data pipeline development platform. Its automated testing features include data quality checks and validations. | ||
* MobyDQ offers data profiling, monitoring, and validation. It helps maintain data quality by identifying issues and inconsistencies. | ||
* Soda Core focuses on data monitoring and anomaly detection allowing the business to the track data quality over time and alerting. | ||
|
||
## Contribute | ||
|
||
RTDIP empowers energy professionals to share solutions, RTDIP welcomes contributions and recognises the importance of sharing code. If you would like to contribute to RTDIP please follow our [Contributing](https://github.com/rtdip/core/blob/develop/CONTRIBUTING.md) guide. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters