Data Quality Components

This section describes the related Data Quality (DQ) components.

Data Quality Processes

The DQ processes for Cleansing, Matching, Merging, and Remediation can be started and stopped using the Omni Console. Please note that Merging is available only for the MDM Edition and not the DQ Edition.

The services shown in the following image can be managed using the Omni Console.

The Omni Console provides a link to the Data Quality Console for further details. The Omni Console shows only whether a DQ process has been launched successfully. It does not check whether the services defined within the process have been loaded. If a deployment bundle contains an erroneous or incorrectly configured plan, the process may start, but the service may be unable to load.

The status of the services within a DQ process can be seen in the console of the process. The console is available under the HTTP port defined for the process, for example:

The list of loaded services can be found in the Applications section.

If an expected service is not listed, it generally indicates an error in the plan implementing the service. In this case, the DQ logs should indicate an error.

Logs for each of the processes are in OmniGenData/logs/dq. Each DQ process writes four logs, as described in the following table.

| Log file suffix | Contents |
|---|---|
| _access | HTTP requests that are received by the server. |
| _perf | Execution times for service invocations. |
| _err | Messages that occur during execution of a plan. |
| _online | Messages that occur during execution of a service. Generally this duplicates the _err file. |
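When triaging a DQ process, it can help to see which of the four logs are present for it. The following is a minimal sketch that groups file names by the suffixes listed above. It assumes that the suffix is the last part of the file name before the extension, which should be checked against an actual installation. The function names and any sample file names are hypothetical.

```python
from pathlib import Path

# Suffixes and descriptions as documented for the DQ process logs.
LOG_SUFFIXES = {
    "_access": "HTTP requests received by the server",
    "_perf": "Execution times for service invocations",
    "_err": "Messages that occur during execution of a plan",
    "_online": "Messages that occur during execution of a service",
}


def classify_dq_log(filename):
    """Return the matching log suffix for a file name, or None.

    Assumption: the suffix ends the file name just before the extension.
    """
    stem = Path(filename).stem
    for suffix in LOG_SUFFIXES:
        if stem.endswith(suffix):
            return suffix
    return None


def group_dq_logs(filenames):
    """Group file names by log suffix, skipping files that match none."""
    groups = {suffix: [] for suffix in LOG_SUFFIXES}
    for name in filenames:
        suffix = classify_dq_log(name)
        if suffix is not None:
            groups[suffix].append(name)
    return groups
```

An empty list for a suffix means that no file of that kind was found under the assumed naming pattern. It does not by itself prove that the process wrote no such log.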

In addition, the data exchanged between Server and each DQ process can be logged by enabling the DQ Trace option in the Omni Console.

After the option is enabled, Server must be restarted. Enable the option only for temporary debugging on small loads. When it is enabled, Server writes a set of CSV files into OmniGenData/logs. Each file is named according to the DQ process, the transaction ID of the work order being executed, and "send" or "receive" to indicate whether the file contains data sent from Server to DQ or received by Server from DQ.
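Because trace files accumulate per process and per work order, a small helper can pick out the files for one transaction. The sketch below only assumes that the transaction ID and the word "send" or "receive" appear somewhere in the CSV file name. The exact naming pattern is not specified here, so the sample names in the usage note are hypothetical.

```python
from pathlib import Path


def trace_files_for(filenames, transaction_id):
    """Split the trace CSV files of one transaction into send and receive.

    Assumption: the file name contains the transaction ID and either
    "send" or "receive"; the real naming pattern may differ.
    """
    result = {"send": [], "receive": []}
    for name in filenames:
        path = Path(name)
        # Ignore non-CSV files and files that belong to other transactions.
        if path.suffix.lower() != ".csv" or transaction_id not in path.stem:
            continue
        for direction in result:
            if direction in path.stem.lower():
                result[direction].append(name)
                break
    return result
```

For example, given the hypothetical names `cleansing_tx42_send.csv` and `cleansing_tx42_receive.csv`, calling `trace_files_for(names, "tx42")` places the first under "send" and the second under "receive".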

Configuration Options

The HTTP listener port and JVM properties for each process can be modified in the console in the appropriate tab under Managed Services.

The TCP port that Server uses to exchange data with executing DQ plans, the DQ Listener Port, is defined under Server Settings in the console. This is not an HTTP port and should never be opened in a browser or by any program except the plugin components embedded within a DQ plan.

Modifying these settings is not recommended. If any of them is modified, restart both the DQ process and the Server process.

Data Quality Processing

The following diagram illustrates the process flow within a DQ plan. This example describes a cleansing process. Matching, merging, and remediation process flows are similar.

The general flow is:

  1. Server invokes a DQ REST service that is linked to a Cleansing, Matching, Merging, or Remediation plan. Server provides the subject and the transaction ID associated with the work order being processed.
  2. The plan executes, in parallel, branches for the root subject and each of its subcollections.
  3. Each branch begins with an "OmniBatchReader" component. Each reader opens a TCP connection to the DQ server and requests a set of columns. Server then streams the requested data through the connection, and the reader sends the records into the plan for processing.
  4. After a record is processed, it is sent to an "OmniBatchWriter" component. Upon first access, the component sends to Server the set of entity attributes that it will write. It then streams the records it receives from the plan back to Server through another TCP connection.
  5. When all branches have read, processed, and written all the records provided by Server, a count of all the records processed is computed and returned as the result of the REST invocation.
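The steps above can be pictured with a toy in-memory model. This is only an illustration of the control flow (parallel branches, read, process, write, then a total count). It does not use the actual OmniBatchReader or OmniBatchWriter components, the REST service, or TCP streaming, and every name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branch(records, process):
    """Model one plan branch: read each record, process it, write it back.

    `records` stands in for the stream a reader would receive, and the
    returned list stands in for what a writer would stream back.
    """
    written = []
    for record in records:               # reader step
        written.append(process(record))  # processing step, then writer step
    return written


def run_plan(branches, process):
    """Run all branches in parallel and return (results, total record count)."""
    with ThreadPoolExecutor(max_workers=max(1, len(branches))) as pool:
        futures = {
            name: pool.submit(run_branch, records, process)
            for name, records in branches.items()
        }
        # Wait for every branch to finish before counting.
        results = {name: future.result() for name, future in futures.items()}
    total = sum(len(rows) for rows in results.values())
    return results, total
```

In this model, the total is computed only after every branch has finished, which mirrors the final step in which the record count is returned as the result of the invocation.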

DQ Process Activities

The following table describes the data sent and received in each of the DQ processes.

| Process | Sent | Received |
|---|---|---|
| cleansing | Cleansing overrides and source records associated with the work order | Cleaned values. Instance records are updated. |
| matching | Instance records associated with the work order | IDs, master IDs, and match quality values for all root subject instances affected by the plan execution. This may be a superset of the instance records sent into the plan. |
| merging | Master IDs and match quality values for all root subject records affected by the matching results | A set of master root subject records and all the subcollection records associated with them. Master root subject records are inserted or updated. Any existing subcollection records associated with the root subject masters are deleted and the new subcollection records are inserted. |
| remediation | Instance records associated with the work order | Cleansing and matching tickets. Inserted into omni_remediation_ticket. |
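The way merging results are applied (masters inserted or updated, subcollections replaced wholesale) can be sketched with plain dictionaries. This is a conceptual model of the behavior described for the merging process, not the product's storage code. The function name and data layout are assumptions made for illustration.

```python
def apply_merge_results(masters, subcollections, received_masters, received_subs):
    """Apply received merge output to in-memory stores.

    masters:          dict of master ID -> master root subject record
    subcollections:   dict of master ID -> list of subcollection records
    received_masters: dict of master ID -> new or updated master record
    received_subs:    dict of master ID -> new subcollection records
    """
    for master_id, record in received_masters.items():
        # Insert the master if it is new, otherwise update it.
        masters[master_id] = record
        # Delete any existing subcollection records for this master
        # and insert the newly received ones.
        subcollections[master_id] = list(received_subs.get(master_id, []))
```

Note that a master with no received subcollection records ends up with an empty list, because its old subcollection records are removed rather than kept.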

Warnings and Errors

This section describes warnings and errors, along with workarounds to resolve them.

Server fails to connect

If Server is unable to connect to a DQ process, OmniGenData/logs/server/server.log will contain a "Not Found for URL" message such as:

com.ibi.omni.server.services.ServiceException: Not Found for URL http://localhost:9502/Person/cleanse

The most likely cause is that the plan failed to load due to an error in the plan definition. This can be verified by loading the DQ console for the process and checking that the referenced service is available. A new deployment bundle will have to be generated with a corrected plan.
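To find out which services to check in the DQ console, the failing URLs can be pulled out of the server log. The sketch below is based only on the message text shown above. The function name is hypothetical, and other log layouts may need a different pattern.

```python
import re
from urllib.parse import urlsplit

# Matches the message text whether or not the log viewer wrapped the line.
NOT_FOUND = re.compile(r"Not Found for\s+URL\s+(\S+)")


def unreachable_services(log_text):
    """Return (port, path) pairs for each "Not Found for URL" message."""
    found = []
    for match in NOT_FOUND.finditer(log_text):
        url = urlsplit(match.group(1))
        found.append((url.port, url.path))
    return found
```

The port identifies the DQ process whose console should be opened, and the path identifies the service that is expected to appear in its list of loaded services.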

Invalid name warnings

When a reader component in a plan requests a column that is undefined, a warning message is written to OmniGenData/logs/server/server.log, for example:

WARN com.ibi.omni.cleanse.CleansingSender:64 [] [] Requested column ssn not available in entity Person

Server allows the plan to continue execution. However, the requested value is not transmitted to the DQ process, and the DQ process log also includes a warning, for example:

<message>[306] ssn not sent by omni</message>

Similarly, if a writer component in a plan indicates to Server that it will write a column whose name is not recognized by Server, a "No field found" warning message is generated:

WARN com.ibi.omni.cleanse.CleansingReceiver:59 [] [] No field found for Person.firstName
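Both kinds of warning point to a name mismatch between a plan and the entity definition, so it can be useful to collect them in one pass. The following sketch matches only the two message texts shown above. The function name and the "reader" and "writer" labels are illustrative choices, not product terminology for the log.

```python
import re

# Patterns follow the sample messages; \s+ tolerates wrapped lines.
MISSING_COLUMN = re.compile(
    r"Requested\s+column\s+(\S+)\s+not\s+available\s+in\s+entity\s+(\S+)"
)
UNKNOWN_FIELD = re.compile(r"No\s+field\s+found\s+for\s+(\w+)\.(\w+)")


def name_mismatches(log_text):
    """Return sorted, de-duplicated (side, entity, name) tuples."""
    issues = set()
    for column, entity in MISSING_COLUMN.findall(log_text):
        issues.add(("reader", entity, column))
    for entity, field in UNKNOWN_FIELD.findall(log_text):
        issues.add(("writer", entity, field))
    return sorted(issues)
```

Each tuple names the plan side that used the unknown name, which narrows down whether a reader or a writer component in the plan needs to be corrected.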

Process Failure

Server errors that occur while processing DQ streams are logged in the server log. If the DQ plan is still receiving data, an error message is also sent to the executing DQ plan, causing it to abort. The DQ plan will log the error message it receives from Server and also log its own failure message, which is typically just:

com.ataccama.dqc.online.core.RuntimeErrorReporterException: Configuration execution failed.

If an error occurs within the DQ plan itself, the plan will attempt to send an error message to Server. Server will log this as a com.ibi.omni.dq.ReceivedErrorException and stop all active senders and receivers.
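The two failure paths leave different traces, which suggests a simple first check when reading the server log. The sketch below is a heuristic built only on the exception name given above. The function name and return labels are hypothetical, and a "server_or_unknown" result still requires reading the log itself.

```python
RECEIVED_ERROR = "com.ibi.omni.dq.ReceivedErrorException"


def failure_origin(server_log_lines):
    """Guess where a failure started, based on server log lines.

    Returns "dq_plan" if Server logged an error that it received from a
    DQ plan, otherwise "server_or_unknown".
    """
    for line in server_log_lines:
        if RECEIVED_ERROR in line:
            return "dq_plan"
    return "server_or_unknown"
```

When the result is "dq_plan", the DQ process logs are the next place to look. Otherwise, the server log entries around the failure are the better starting point.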