- Apache Griffin is an open source Data Quality solution for distributed data systems at any scale in both streaming or batch data context.
- Users will primarily access this application from a PC.
After you log into the system, you may follow the steps:
- First, create a new measure.
- Then, create a job to process the measure periodically.
- Finally, the heatmap and dashboard will show the data diagram of the measures.
You can check the data assets by clicking "DataAssets" on the top right corner.
Then you can see all the data assets listed here.
By clicking "Measures", and then choose "Create Measure". You can use the measure to process data and get the result you want.
There are mainly four kinds of measures for you to choose, which are:
- if you want to measure the match rate between source and target, choose accuracy.
- if you want to check the specific value of the data(such as: null column count), choose profiling.
At current we only support accuracy measure creation from UI.
2.2.1 Accuracy [1]
Definition:
Measured by how the values agree with an identified source of truth.
Steps:
- Choose source
Select the source dataset and fields which will be used for comparision.
For example, we choose 3 columns here.
- Choose target:
Select the target dataset and fields which will be used for comparision.
- Mapping source and target
- Step1: "Map To": Select which rule to match the source and the target. Here are 6 options to choose:
- = : data of the two columns should be exactly matched.
- != : data of the two columns should be different.
- > : the target column data should be bigger than the source one.
- >= : the target column data should be bigger than or equal to the source one.
- < : the target column data should be smaller than the source one.
- <= : the target column data should be smaller than or equal to the source one.
- Step2: "Source fields": choose the source column that you want to compare with the target column.
- Partition Configuration
Set partition configuration for source dataset and target dataset.
The partition size means hive database minimum data unit,used to split data you want to calculate
Done file path means format of done file path
- Configuration
Set up the measure required information.
The organization means the group of your measure, you can manage your measurement dashboard by group later.
- Measure information
After you create a new accuracy measure, you can check the measure you've created by selecting it in the listed measurements' page.
Example:
Suppose the source table A has 1000 records and the target table B only has 999 records which can perfectly match with A in selected fields, then the accuracy rate=999/1000*100%=99.9%.
By clicking "Jobs", and then choose "Create Job". You can submit a job to execute your measure periodically.
At current we only support simple periodically scheduling job for measures.
Fill out the block of job configuration.
- Job Name: you can set a name for your job.
- Measure Name: name of the measure you want to schedule. you need to choose it from the list of measures you've created before.
- Cron Expression: cron expression of scheduler. For example: 0 0/4 * * *.
- Begin: data segment start time comparing with trigger time
- End: data segment end time comparing with trigger time.
After submit the job, Apache Griffin will schedule the job in background, and after calculation, you can monitor the dashboard to view the result on UI.
After the processing work has done, here are 3 ways to show the data diagram.
-
Click on "Health", it shows the heatmap of metrics data.
-
Click on "DQ Metrics".
You can see the diagrams of metrics.
By clicking on the diagram, you can get the zoom-in picture of it, and know the metrics at the selected time window.
-
The metrics is shown on the right side of the page. By clicking on the measure, you can get the diagram and details about the measure result.
###Six core data quality dimensions
Content adapted from THE SIX PRIMARY DIMENSIONS FOR DATA QUALITY ASSESSMENT, DAMA, UK
Title | Accuracy |
---|---|
Definition | The degree to which data correctly describes the "real world" object or event being described. |
Reference | Ideally the "real world" truth is established through primary research. However, as this is often not practical, it is common to use 3rd party reference data from sources which are deemed trustworthy and of the same chronology. |
Measure | The degree to which the data mirrors the characteristics of the real world object or objects it represents. |
Scope | Any "real world" object or objects that may be characterized or described by data, held as data item, record, data set or database. |
Unit of Measure | The percentage of data entries that pass the data accuracy rules. |
Type of Measure:
|
Assessment, e.g. primary research or reference against trusted data. Continuous Measurement, e.g. age of students derived from the relationship between the students’ dates of birth and the current date. Discrete Measurement, e.g. date of birth recorded. |
Related Dimension | Validity is a related dimension because, in order to be accurate, values must be valid, the right value and in the correct representation. |
Optionality | Mandatory because - when inaccurate - data may not be fit for use. |
Applicability | |
Example(s) | A European school is receiving applications for its annual September intake and requires students to be aged 5 before the 31st August of the intake year. In this scenario, the parent, a US Citizen, applying to a European school completes the Date of Birth (D.O.B) on the application form in the US date format, MM/dd/yyyy rather than the European dd/MM/yyyy format, causing the representation of days and months to be reversed. As a result, 09/08/yyyy really meant 08/09/yyyy causing the student to be accepted as the age of 5 on the 31st August in yyyy. The representation of the student’s D.O.B.–whilst valid in its US context–means that in Europe the age was not derived correctly and the value recorded was consequently not accurate |
Pseudo code | ((Count of accurate objects)/ (Count of accurate objects + Counts of inaccurate objects)) x 100 Example: (Count of children who applied aged 5 before August/yyyy)/ (Count of children who applied aged 5 before August 31st yyyy+ Count of children who applied aged 5 after August /yyyy and before December 31st/yyyy) x 100 |
Title | Profiling |
---|---|
Definition | Data are valid if it conforms to the syntax (format, type, range) of its definition. |
Reference | Database, metadata or documentation rules as to the allowable types (string, integer, floating point etc.), the format (length, number of digits etc.) and range (minimum, maximum or contained within a set of allowable values). |
Measure | Comparison between the data and the metadata or documentation for the data item. |
Scope | All data can typically be measured for Validity. Validity applies at the data item level and record level (for combinations of valid values). |
Unit of Measure | Percentage of data items deemed Valid to Invalid. |
Type of Measure:
|
Assessment, Continuous and Discrete |
Related dimension | Accuracy, Completeness, Consistency and Uniqueness |
Optionality | Mandatory |
Applicability | |
Example(s) | Each class in a UK secondary school is allocated a class identifier; this consists of the 3 initials of the teacher plus a two digit year group number of the class. It is declared as AAA99 (3 Alpha characters and two numeric characters). Scenario 1: A new year 9 teacher, Sally Hearn (without a middle name) is appointed therefore there are only two initials. A decision must be made as to how to represent two initials or the rule will fail and the database will reject the class identifier of “SH09”. It is decided that an additional character “Z” will be added to pad the letters to 3: “SZH09”, however this could break the accuracy rule. A better solution would be to amend the database to accept 2 or 3 initials and 1 or 2 numbers. Scenario 2: The age at entry to a UK primary & junior school is captured on the form for school applications. This is entered into a database and checked that it is between 4 and 11. If it were captured on the form as 14 or N/A it would be rejected as invalid. |
Pseudo code | Scenario 1: Evaluate that the Class Identifier is 2 or 3 letters a-z followed by 1 or 2 numbers 7 – 11. Scenario 2: Evaluate that the age is numeric and that it is greater than or equal to 4 and less than or equal to 11. |