Feature/00766 - Data Quality Article (rtdip#768)
* Article on Data Quality for Blog

Signed-off-by: Amber-Rigg <[email protected]>

* Review on Wording

Signed-off-by: Amber-Rigg <[email protected]>

* Update of date

Signed-off-by: Amber-Rigg <[email protected]>

* spelling reference

Signed-off-by: Amber-Rigg <[email protected]>

---------

Signed-off-by: Amber-Rigg <[email protected]>
Amber-Rigg authored Jul 16, 2024
1 parent a5415b3 commit 4fe52f8
Showing 3 changed files with 66 additions and 4 deletions.
Binary file added docs/blog/images/data-quality.png
62 changes: 62 additions & 0 deletions docs/blog/posts/rtdip_data_quality.md
@@ -0,0 +1,62 @@
---
date: 2024-06-24
authors:
- GBARAS
---

# Ensuring Data Quality at Speed with Real Time Data

<center>
![DataQualityImage](../images/data-quality.png){width=75%}
</center>

High-quality data plays a pivotal role in business success. Accurate and reliable data empowers business leaders to make well-informed decisions and achieve operational efficiency, promoting growth and profitability. Data quality encompasses more than just accuracy; it also includes completeness, consistency, and relevance.

<!-- more -->

Maintaining consistent data quality is challenging without a robust data governance framework. Organizations often lack comprehensive data quality assessment procedures, so it is crucial to evaluate data quality regularly using metrics and automated checks. Integrating data from various sources can introduce inconsistencies, which data integration best practices help to reduce. Manual data entry is prone to errors, so automate wherever possible to reduce reliance on manual input. To measure data quality, define clear metrics such as accuracy and completeness, and track them consistently. Automate data cleansing routines (e.g., deduplication, validation) to streamline processes and reduce manual effort. Lastly, automation can help to identify incomplete or outdated records, keep data sources up to date and retire obsolete information.

Maintaining data quality with time series data presents unique challenges. First, the high volume and velocity of incoming data make real-time validation and processing difficult. Second, time series data often exhibits temporal dependencies, irregular sampling intervals, and missing values, which require specialized handling. Lastly, data distributions shift with seasonality, trends, or sudden events, so data quality checks need to adapt continuously. Ensuring data quality in time series streaming demands agility, adaptability and automation.
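
As an illustration of the irregular sampling and missing value problem, the following is a minimal sketch that flags gaps in a series of timestamps. The function name `find_gaps`, the one-minute interval and the sample readings are assumptions made for this example; they are not part of RTDIP.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=timedelta(0)):
    """Return (previous, current) pairs that are further apart than expected."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval + tolerance:
            gaps.append((previous, current))
    return gaps


start = datetime(2024, 6, 24, 0, 0)
# Hypothetical one-minute sensor feed with the 00:03 and 00:04 readings missing.
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
gaps = find_gaps(readings, expected_interval=timedelta(minutes=1))
```

Here `gaps` contains a single pair, marking the jump from 00:02 to 00:05. The `tolerance` argument allows for small amounts of jitter in sources that do not sample on an exact schedule.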

## Data Quality Best Practices

### Data Validation at Ingestion

Implementing data validation checks when data enters a pipeline, before any transformation, prevents issues from being buried and becoming hard to trace. Automated scripts can validate incoming data against predefined rules, for example by checking for duplicates, outliers, missing values and inconsistent data types.
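
A minimal sketch of such a rule-based check is shown below, using only the Python standard library. The record layout (`tag`, `timestamp`, `value`), the function name `validate_reading` and the value range are assumptions for illustration, and the fixed range check is a simple stand-in for real outlier detection.

```python
from datetime import datetime


def validate_reading(reading, seen_keys, value_range=(-50.0, 150.0)):
    """Return a list of rule violations for one reading (empty means valid)."""
    errors = []
    tag = reading.get("tag")
    timestamp = reading.get("timestamp")
    value = reading.get("value")

    # Missing values
    for field, field_value in (("tag", tag), ("timestamp", timestamp), ("value", value)):
        if field_value is None:
            errors.append(f"missing {field}")

    # Inconsistent data types (bool is rejected because it is a subclass of int)
    if value is not None:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append("value is not numeric")
        elif not (value_range[0] <= value <= value_range[1]):
            # Simple range rule standing in for outlier detection
            errors.append("value out of range")
    if timestamp is not None and not isinstance(timestamp, datetime):
        errors.append("timestamp is not a datetime")

    # Duplicates, keyed on (tag, timestamp)
    if tag is not None and isinstance(timestamp, datetime):
        key = (tag, timestamp)
        if key in seen_keys:
            errors.append("duplicate reading")
        else:
            seen_keys.add(key)
    return errors


seen = set()
ts = datetime(2024, 6, 24, 12, 0)
batch = [
    {"tag": "temp_sensor_1", "timestamp": ts, "value": 21.5},
    {"tag": "temp_sensor_1", "timestamp": ts, "value": 21.5},    # duplicate
    {"tag": "temp_sensor_2", "timestamp": ts, "value": "21.5"},  # wrong type
    {"tag": "temp_sensor_3", "timestamp": ts, "value": 900.0},   # out of range
    {"tag": "temp_sensor_4", "timestamp": ts, "value": None},    # missing value
]
results = [validate_reading(reading, seen) for reading in batch]
```

Returning a list of violations rather than raising on the first failure lets the pipeline record every problem with a reading before deciding whether to reject or quarantine it.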

### Continuous Monitoring

Monitoring data quality supports data validation and cleansing by notifying the support team or developer when inconsistencies are detected. Early detection and alerting allow prompt investigation and quick action, which helps to prevent data quality degradation.
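
One way to sketch this idea is a rolling monitor that tracks the share of failed validations over the most recent records and signals when it crosses a threshold. The class name `QualityMonitor`, the window size and the threshold are hypothetical choices for this example; a real deployment would send a notification instead of returning a flag.

```python
from collections import deque


class QualityMonitor:
    """Track validation outcomes over a sliding window of recent records."""

    def __init__(self, window_size=100, max_failure_rate=0.05):
        self.results = deque(maxlen=window_size)  # oldest outcomes drop off automatically
        self.max_failure_rate = max_failure_rate

    def record(self, passed):
        """Store one outcome and return True when an alert should be raised."""
        self.results.append(bool(passed))
        failure_rate = self.results.count(False) / len(self.results)
        return failure_rate > self.max_failure_rate


monitor = QualityMonitor(window_size=10, max_failure_rate=0.25)
outcomes = [True] * 8 + [False] * 3
alerts = [monitor.record(ok) for ok in outcomes]
```

With these sample outcomes, only the final record triggers an alert, because that is the point at which three of the last ten validations have failed.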

### Data Cleansing and Preparation

Automated data cleansing can run both as a routine job and as a job triggered by failed data validation. Cleansing routines automatically correct or remove erroneous data, ensuring the dataset remains accurate and reliable.
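
The sketch below shows one possible cleansing routine: it removes readings with missing or non-finite values and deduplicates on `(tag, timestamp)`. The record layout and the rule that a later reading replaces an earlier one for the same key are assumptions for this example, not a description of how RTDIP cleanses data.

```python
import math


def cleanse(readings):
    """Drop readings without a finite numeric value, then deduplicate per key."""
    cleaned = {}
    for reading in readings:
        value = reading.get("value")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # missing or non-numeric value
        if not math.isfinite(value):
            continue  # NaN or infinity
        key = (reading["tag"], reading["timestamp"])
        cleaned[key] = reading  # a later reading for the same key replaces the earlier one
    return sorted(cleaned.values(), key=lambda r: (r["timestamp"], r["tag"]))


raw = [
    {"tag": "a", "timestamp": "2024-06-24T00:01:00", "value": 2.0},
    {"tag": "a", "timestamp": "2024-06-24T00:00:00", "value": 1.0},
    {"tag": "a", "timestamp": "2024-06-24T00:00:00", "value": 1.5},  # correction
    {"tag": "a", "timestamp": "2024-06-24T00:02:00", "value": float("nan")},
    {"tag": "a", "timestamp": "2024-06-24T00:03:00", "value": None},
]
clean = cleanse(raw)
```

Whether to keep the first or the last duplicate, and whether to drop or repair bad values, depends on the source system, so these rules should be agreed with the data owners.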

### Data Profiling

Automated profiling tools can analyse data distributions, patterns, and correlations. By identifying potential issues such as skewed distributions or duplicate records, businesses can proactively address them in their data validation and data cleansing processes.
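
As a rough illustration, a basic profile of a numeric column can be built with the Python standard library. The function name `profile` and the sample values are hypothetical, and the sketch assumes the column holds at least one non-missing value.

```python
import statistics
from collections import Counter


def profile(values):
    """Summarise a numeric column; assumes at least one non-missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "duplicates": sum(c - 1 for c in counts.values()),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


summary = profile([10.0, 10.0, 11.0, 12.0, None, 57.0])
```

In this sample the mean (20.0) sits well above the median (11.0), a simple hint that the distribution is skewed by the 57.0 reading and that the value may deserve a validation rule.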

### Data Governance

Data governance policies provide a clear framework to follow when ensuring data quality across a business. They cover access controls, data retention, and compliance, which together maintain data quality and security.

## RTDIP and Data Quality

RTDIP now includes data quality scripts that support the end user in developing strong data quality pass gates for their datasets. The RTDIP component has been built using Great Expectations, a Python-based open source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams.

RTDIP believes that data quality should be considered an integral part of any data pipeline. More information about RTDIP's data quality components can be found at [Examine Data Quality with Great Expectations](https://www.rtdip.io/sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations/).

## Open Source Tools and Data Quality

RTDIP empowers energy professionals to share solutions. RTDIP welcomes contributions and recognises the importance of sharing code. There are also a number of great open source data quality tools which have gained popularity due to their transparency, adaptability, and community-driven enhancements.

Choosing the right tool depends on your specific requirements and architecture. Some notable open source data quality tools include:

* Built on Spark, Deequ is excellent for testing large datasets. It allows you to validate data using constraint suggestions and verification suites.
* dbt Core is a data pipeline development platform. Its automated testing features include data quality checks and validations.
* MobyDQ offers data profiling, monitoring, and validation. It helps maintain data quality by identifying issues and inconsistencies.
* Soda Core focuses on data monitoring and anomaly detection, allowing the business to track data quality over time and receive alerts.

## Contribute

RTDIP empowers energy professionals to share solutions. RTDIP welcomes contributions and recognises the importance of sharing code. If you would like to contribute to RTDIP, please follow our [Contributing](https://github.com/rtdip/core/blob/develop/CONTRIBUTING.md) guide.
8 changes: 4 additions & 4 deletions docs/blog/posts/rtdip_energy_forecasting.md
@@ -14,13 +14,13 @@ Energy forecasting plays a pivotal role in our modern world, where energy consum

Energy forecasting involves predicting the demand load and price of various energy sources, including both fossil fuels and renewable energy resources like wind and solar.

-With an accurate energy usage forecast, a business can efficiently allocate and manage resources, this is crucial to maintain a stable energy supply to the consumer; energy forecasting is fundamental as we transition to renewable energy sources which do not produce consistent energy. Energy companies, grid operators and industrial consumers rely on forecasts to optimize their operations. Over- or undercontracting can lead to significant financial losses, so precise forecasts are essential.
+With an accurate energy usage forecast, a business can efficiently allocate and manage resources, this is crucial to maintain a stable energy supply to the consumer; energy forecasting is fundamental as we transition to renewable energy sources which do not produce consistent energy. Energy companies, grid operators and industrial consumers rely on forecasts to optimise their operations. Over- or undercontracting can lead to significant financial losses, so precise forecasts are essential.

<!-- more -->

Energy load prices and forecasts greatly influence the energy sector and the decisions made across multiple departments in energy companies. For example, medium to long-term energy forecasts are vital for planning and investing in new capacity, they guide decisions on new assets, transmission lines and distribution networks. Another example is risk mitigation, unstable electricity prices can be handled with accurate forecasting of the market, companies can develop bidding strategies, production schedules and consumption patterns to minimize risk and maximize profits.

-Energy forecasting is foused on performance, i.e. how much over or under a forecast is and performance during extreme weather days. Quantifying a financial impact relative to market conditions can be diffcult. However, a rough estimate of savings from a 1% reduction in the mean absolute percentage error (MAPE) for a utility with a 1 GW peak load includes:
+Energy forecasting is focused on performance, i.e. how much over or under a forecast is and performance during extreme weather days. Quantifying a financial impact relative to market conditions can be diffcult. However, a rough estimate of savings from a 1% reduction in the mean absolute percentage error (MAPE) for a utility with a 1 GW peak load includes:

- $500,000 per year from long-term load forecasting
- $300,000 per year from short-term load forecasting
@@ -30,7 +30,7 @@ Energy Forecasting allows for significant cost avoidance due to better price for

## Energy Forecasting with RTDIP

-RTDIP can be a powerful tool for businesses looking to forecast energy usage. RTDIP supports load forecasting applications, a critical technique used by RTOs(Regional Transmission Organisations)/TSOs(Transmission System Operators), ISOs (Independent System Operators) and energy providers. Load forecasting allows a business to predict the power or energy needed to maintain the balance between energy demand and supply on the grid. Two primary inputs for load forecasting are weather data and meter data, RTDIP has developed pipeline components for these types of data.
+RTDIP can be a powerful tool for businesses looking to forecast energy usage. RTDIP supports load forecasting applications, a critical technique used by RTOs (Regional Transmission Organisations)/TSO (Transmission System Operators), ISOs (Independent System Operators) and energy providers. Load forecasting allows a business to predict the power or energy needed to maintain the balance between energy demand and supply on the grid. Two primary inputs for load forecasting are weather data and meter data, RTDIP has developed pipeline components for these types of data.

RTDIP provides example pipelines for weather forecast data ingestion. Accurate weather data helps predict energy production in renewable assets based on factors like temperature, humidity and wind patterns.

@@ -88,4 +88,4 @@ Data conversion into 'Meters Data Model' via transformers

## Contribute

-RTDIP empowers energy professionals to share solutions, RTDIP welcomes contributions and recognises the importance of sharing code. There are multiple sources for weather and metering data crucial to forecasting energy needs, if you have anymore you’d like to add to RTDIP please raise a feature request and contribute.
+RTDIP empowers energy professionals to share solutions, RTDIP welcomes contributions and recognises the importance of sharing code. There are multiple sources for weather and metering data crucial to forecasting energy needs, if you have anymore you’d like to add to RTDIP please follow our [Contributing](https://github.com/rtdip/core/blob/develop/CONTRIBUTING.md) guide.
