Topics: |
Highlight Defects is a feature that helps you find inconsistencies and potential errors in the data.
This feature can be accessed using the Highlight defect button in upper-right corner.
The Highlight defect button is available on the Profile page and on the Data table page, as shown in the following image.
To activate this feature, click the Highlight defects button.
When the Highlight defects functionality is enabled, the color of the button is changed to blue and data areas with issues are highlighted in orange, as shown in the following image.
To disable the Highlight defects functionality, re-click the Highlight defects button.
The Highlight defects functionality is saved per session. For example, if you activate Highlight defects and refresh the page, the highlighting defects functionality will be turned off. The same will happen when you decide to reset the session.
The Highlight defects functionality works with the latest data. If you execute some rules, the changes that were applied after processing those rules will also affect highlighting defects. This means that if processing the rules succeeded while the Highlight defects functionality was turned on, the defects highlighting will be re-calculated and the defect will be highlighted based on the new updated data.
The following image shows how a sample highlighting defect for the Customer Subject appeared before Rules were executed:
The following image shows how the Highlighting defect for Customer Subject appeared after the Cleanse_SSN operation was applied to the Country and gender attributes:
The same will happen if you apply a Filter while the Highlight defects functionality is activated. In this example, the defects highlighting will be re-calculated and the defect will be highlighted based on filtered data.
The following image shows how Highlighting defects for the Customer Subject appeared without an added filter:
The following image shows how the Highlighting defect for Customer Subject appeared after a Filter was added to the fullName bar chart:
The Highlight Defects functionality works by placing a highlight_defects.json file in the following directory where the environment is configured:
C:\iWay_OGC\ogc\data\workbench
The highlight_defects.json file contains a set of expression-rules based on which Highlight functionality made the calculation to determine whether a value is to be highlighted.
The expressions can be edited (if needed) inside the JSON file. After OGC Tomcat is restarted, the calculation of the highlight defects will be made according to the updated expressions.
Highlight defects can be applied to the following bar charts:
If the value does not drop above the threshold [30%], it will not be highlighted.
If the value does not fulfill the criteria above, it will not be highlighted.
If the value does not fulfill the criteria above, it will not be highlighted.
The Total number for patterns is highlighted for the following instances:
The following image shows the profiling bar charts and sections.
Highlight defects can be applied to the following bar charts:
To define which of the 20 most and least frequent values charts should be highlighted, the following complex computation should be solved.
TMP_TOTAL := $total; TMP_DISTINCT := $distinct; 20_THRESHOLD := TOTAL/5; for each row from top 5 most frequent do if $records > 20_THRESHOLD do TMP_TOTAL := TMP_TOTAL - $records; TMP_DISTINCT := TMP_DISTINCT - 1; done done LOW_THRESHOLD := TMP_TOTAL/TMP_DISTINCT; for each row in least frequent do if $records < LOW_THRESHOLD/10 do highlight done done Let's consider the following example:
The following image shows the highlighting for 20 most/least frequent patterns charts for the firstName attribute.
The following sample calculations identify which values should be highlighted:
TMP_TOTAL :=2000 TMP_DISTINCT := 23
20_THRESHOLD := TOTAL/5 = 2000/5 = 400(static and doesn't change)
for each row from top 5 most frequent do if $records > 20_THRESHOLD
if 58 > 400 (false) done LOW_THRESHOLD := TMP_TOTAL/TMP_DISTINCT;
LOW_THRESHOLD := 2000/23;for each row in least frequent
do
if 1<8.69 (true)
do highlight
if 57<8.69 (false)
done done
The following image shows the results of the calculation containing the highlighted charts with the value of 1.
Highlighting for 20 of the most and least frequent patterns and charts fulfills the following criteria:
For example, if the 5 most frequent patterns exceed 60% of the total cases and do not contain digits, then all patterns that do contain digits should be highlighted.
It is easier to define which of the 20 most and least frequent patterns charts should be highlighted based on the following complex computation:
TMP_TOTAL := pattern_total = sum(#records for all most frequent patterns) TMP_DISTINCT := pattern_distinct = # of rows in most frequent patterns
20_THRESHOLD := TOTAL/5; for each row from top 5 most frequent do if $records > 20_THRESHOLD do TMP_TOTAL := TMP_TOTAL - $records; TMP_DISTINCT := TMP_DISTINCT - 1; done done LOW_THRESHOLD := TMP_TOTAL/TMP_DISTINCT; for each row in least frequent do if $records < LOW_THRESHOLD/10 do highlight done done Let's consider the following example:
The following example shows highlighting for the 20 most/least frequent pattern charts for the firstName attribute.
The following calculations are used to identify which values are highlighted:
TMP_TOTAL := pattern_total = sum(#records for all most frequent patterns)
TMP_TOTAL :=418+346+334+255+175+163+76+29+12+4+3=1815 TMP_DISTINCT := pattern_distinct = # of rows in most frequent patterns
TMP_DISTINCT := 11
20_THRESHOLD := TOTAL/5 = 1815/5 = 363 (static and doesn't change)
for each row from top 5 most frequent do if $records > 20_THRESHOLD
if 418 > 363 (true) do TMP_TOTAL := 1815 – 418 = 1397;
TMP_DISTINCT := 11-1 = 10;
if 346 > 363 (false) done done
LOW_THRESHOLD := TMP_TOTAL/TMP_DISTINCT;
LOW_THRESHOLD := 1397/10 = 139.7;
for each row in least frequent do if $records < LOW_THRESHOLD/10
if 3 < 13.97 (true) do highlight done
if 4 < 13.97 (true) do highlight done
if 12 < 13.97 (true) do highlight done
if 29 < 13.97 (false)
done done
The following image shows the highlighted charts, containing values 3, 4, and 12.
The same calculation can be performed to define highlighting for L1/D1 and W/N patterns.
Since the bar chart numbers of the L1/D1 patterns match bar charts numbers for L/D patterns, the result after highlighting defects will be the same.
Since the most and least frequent values for W/N patterns look different when defining which bar charts should be highlighted, new calculations should be performed.
The following calculations are used to identify which values are highlighted:
TMP_TOTAL := pattern_total = 1816 TMP_DISTINCT := 1
20_THRESHOLD := 1816/5 = 363.2
for each row from top 5 most frequent do if 1816 > 363.2 (true) TMP_TOTAL := 1816 - 1816 = 0; TMP_DISTINCT := 1- 1 = 0;
done done
The resulting output is that no bar charts are highlighted, as shown in the following image.
The following image shows a sample bar chart for the most and least frequent W/N patterns to view how highlighting will appear for the fullName attribute.
The following calculations are used to identify which values are highlighted:
TMP_TOTAL := pattern_total = 1580+371+19 = 1970
TMP_DISTINCT := 3
20_THRESHOLD := 1970/5 = 394
for each row from top 5 most frequent do if 1580 > 394 (true)
do TMP_TOTAL := 1970-1580=390; TMP_DISTINCT := 2 if 371>390 (false) do LOW_THRESHOLD := 390/2=195 for each row in least frequent do if 19<19.5 (true)
do highlight done
if 371<19.5 (false)
done
done
The resulting value for the highlighted bar chart is 19.
The Highlight functionality can be also applied to On-screen filters on the Frequency and Patterns tabs.
The Rules that define which bar chart will be highlighted are the same as for the Profile page.