Understanding a Pipeline

A pipeline can be thought of as a chain of stages that are connected from a source to a target. Data is transferred between stages in the pipeline. Each stage may perform a specific operation on the data (for example, Show or Transform).

A pipeline may have multiple compute stages between a source and a target. Each compute stage and the target have an input and an output schema associated with them. The source has no input schema, as its schema is available from the data source itself. Because each node has an associated schema, intelligent operations can be performed on the data for the stage, such as validation, building expressions, and constructing WHERE clauses.
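In the Spark code that a pipeline ultimately runs, the schema associated with a stage corresponds to the schema of the DataFrame that the stage produces. The following Scala sketch is illustrative only: the sales.orders table and its columns are hypothetical, and the code assumes a cluster with Hive support.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("schema-example")
      .enableHiveSupport()
      .getOrCreate()

    // The output schema of a source stage is the schema of the DataFrame it produces.
    val orders = spark.table("sales.orders")
    orders.printSchema()

    // Knowing the schema up front lets expressions and WHERE clauses be validated
    // before the pipeline runs.
    val taxed = orders
      .where("amount > 100")
      .selectExpr("order_id", "amount * 1.1 AS amount_with_tax")
    taxed.printSchema()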

Pipelines can be used for any number of operations, including data transformation, in-memory computing, data cleansing with iWay Data Quality Server (DQS), modeling with RStat, and other operations.

A pipeline has a single execution path, without branching or looping. Pipelines can be chained together so that the target stage of one pipeline becomes the source stage of the next, which allows multiple or complex operations to be performed on the data.
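Chaining can be pictured as one pipeline writing to a Hive table and the next pipeline reading that table as its source. A minimal sketch, reusing the SparkSession from the previous example (the path and table names are hypothetical):

    // Pipeline 1: HDFS source -> Filter compute stage -> Hive target.
    val raw = spark.read.option("header", "true").csv("hdfs:///landing/customers.csv")
    raw.filter("age > 15")
      .write.mode("overwrite")
      .saveAsTable("staging.customers_clean")

    // Pipeline 2: the previous pipeline's target becomes the new source.
    val cleaned = spark.table("staging.customers_clean")
    cleaned.groupBy("country").count()
      .write.mode("overwrite")
      .saveAsTable("marts.customers_by_country")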

Source Stages

A source stage brings data from outside the pipeline into the pipeline for processing in subsequent stages. If the source is already in Hive, then transformations can be performed on the source object. Other sources can use a separate compute stage for actions or operations on part or all of the source data. A source can also consume the results of another pipeline's target stage.

Source (inbound data) use case selector (see the Spark read sketch after this list):

  • Streaming
    • Stream. Reads data from a stream into a DataFrame.
    • Kafka. Reads data from a Kafka queue into a DataFrame.
    • Kafka Direct. Reads data from Kafka using the receiver-less direct approach introduced in Spark 1.3, which provides stronger end-to-end guarantees.
    • Nifi. Reads data from an Apache NiFi stream, through an output port, into a DataFrame.
  • Defined
    • Hive. Reads data from a Hive table into a DataFrame.
    • HQL. Executes an HQL query and loads the result into a DataFrame.
    • HDFS. Reads data from HDFS into a DataFrame.
  • Structured
    • RDBMS. Reads data from an external RDBMS table into a DataFrame.
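These source stages correspond to standard Spark read paths. The sketch below is illustrative only: the table, path, topic, and connection details are hypothetical, the Kafka read assumes the spark-sql-kafka connector is on the classpath, and iBDI's own streaming stages may use the older DStream-based direct API rather than Structured Streaming.

    // Defined: Hive table, HQL query, and HDFS file into DataFrames.
    val fromHive = spark.table("sales.orders")
    val fromHql  = spark.sql("SELECT order_id, amount FROM sales.orders WHERE amount > 0")
    val fromHdfs = spark.read.option("header", "true").csv("hdfs:///data/orders.csv")

    // Structured: external RDBMS table read over JDBC.
    val fromRdbms = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl")
      .option("password", "secret")
      .load()

    // Streaming: Kafka topic as a streaming DataFrame.
    val fromKafka = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "orders-topic")
      .load()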

Compute Stages

A compute stage performs an action or operation on part or all of the source data. Multiple compute stages can be chained together to build a pipeline target result. Data can also be changed using the Transformer tool.

The following list describes the available compute stages in iBDI.

  • Cleanse. Submits a DataFrame to be cleansed in iWay Data Quality Server (DQS).
  • Coalesce. Returns a new DataFrame with the exact number of partitions provided. This operation results in a narrow dependency. For example, if you go from 1000 partitions to 100 partitions, there will not be a shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions.
  • Distinct. Returns a new DataFrame that contains only the unique rows from this DataFrame.
  • Drop. Returns a new DataFrame with a column dropped. This is a no-op if the schema does not contain the given column name.
  • Filter. Filters rows using the provided SQL expression, for example: age > 15.
  • Hive. Transforms a DataFrame by joining it with existing Hive tables.
  • HQL. Executes an HQL statement.
  • RStat Model. Scores the incoming data and adds a column to the DataFrame containing the prediction.
  • Order by. Returns a new DataFrame sorted by the given expressions. This is an alias of the Sort function.
  • Sample. Returns a new DataFrame by sampling a fraction of rows.
  • Save. Saves the DataFrame into Hive without stopping the pipeline.
  • Select. Selects an ordered list of columns from the DataFrame.
  • Show. Shows the first 20 records in a DataFrame.
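Most of these compute stages map directly onto DataFrame operations. A minimal sketch of a chain of compute stages, continuing from the fromHive DataFrame in the earlier source sketch (the column names are hypothetical):

    // Filter -> Select -> Distinct -> Order by -> Coalesce -> Show
    val result = fromHive
      .filter("amount > 15")               // Filter: rows matching a SQL expression
      .select("customer_id", "amount")     // Select: an ordered list of columns
      .distinct()                          // Distinct: unique rows only
      .orderBy("amount")                   // Order by: alias of sort
      .coalesce(10)                        // Coalesce: fewer partitions, no shuffle

    result.show()                          // Show: first 20 records by default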

Target Stages

A target stage is the final stage of a pipeline, which serves as a destination for the data from the pipeline. Pipeline targets can also function as sources for subsequent pipelines, depending on the type.

The following list describes the available target stages in iBDI.

  • Console. Displays the result of the pipeline operations in the Console pane of iBDI.
  • Stream
    • Kafka. Writes each row of the DataFrame into a Kafka topic.
  • Defined
    • HDFS. Writes the DataFrame into HDFS.
    • Hive. Writes the data to a Hive table.
    • Hive CSV. Writes the DataFrame into HDFS as CSV.
  • Structured
    • RDBMS. Writes the data to an external RDBMS table.
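Like the source stages, the target stages correspond to standard Spark write paths. A minimal sketch, reusing the result DataFrame from the compute sketch; the table names, paths, topic, and connection details are hypothetical, and the Kafka write again assumes the spark-sql-kafka connector.

    // Console: print to the console (iBDI surfaces this in its Console pane).
    result.show()

    // Defined: Hive table, HDFS, and HDFS as CSV.
    result.write.mode("overwrite").saveAsTable("marts.orders_summary")
    result.write.mode("overwrite").parquet("hdfs:///marts/orders_summary")
    result.write.mode("overwrite").option("header", "true").csv("hdfs:///marts/orders_summary_csv")

    // Stream: each row written to a Kafka topic (key and value must be string or binary).
    result.selectExpr("CAST(customer_id AS STRING) AS key", "CAST(amount AS STRING) AS value")
      .write.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "orders-out")
      .save()

    // Structured: external RDBMS table written over JDBC.
    result.write.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/marts")
      .option("dbtable", "public.orders_summary")
      .option("user", "etl")
      .option("password", "secret")
      .mode("append")
      .save()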