A pipeline can be thought of as a chain of stages that are connected from a source to a target. Data is transferred between stages in the pipeline. Each stage may perform a specific operation on the data (for example, Show or Transform).
A pipeline may have multiple compute stages between a source and a target. Each compute stage and the target have an input and an output schema associated with them. The source has no input schema, as the input schema is available from the data source itself. Because each node has an associated schema, intelligent operations can be performed on the data at each stage, such as validation, expression building, and Where clauses.
Pipelines can be used for many purposes, including data transformation, in-memory computing, data cleansing with iWay Data Quality Server (DQS), and modeling with RStat.
A pipeline has a single execution path, with no branching or looping. However, pipelines can be chained together, so that the target stage of one pipeline becomes the source stage of the next, enabling multiple or complex operations on data.
A source stage brings data from outside the pipeline into the pipeline for processing in subsequent stages. If the source is already in Hive, then transformations can be performed on the source object. Other sources can use a separate compute stage for actions or operations on part or all of the source data. A source can also consume the results of another pipeline's target stage.
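iBDI assembles these pipelines visually, but conceptually each one maps onto a chain of Spark DataFrame operations. The following is a minimal Scala/Spark sketch of that model, not code generated by iBDI; the file path and the `sales_raw`/`sales_clean` table names are hypothetical. It shows one pipeline writing its target to Hive, and a second pipeline consuming that Hive table as its source.

```scala
import org.apache.spark.sql.SparkSession

object PipelineChainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-chain-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Pipeline 1: source (external CSV) -> compute -> target (Hive table).
    val source = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/sales_raw.csv")          // hypothetical path

    source
      .filter("amount > 0")                // compute stage
      .write.mode("overwrite")
      .saveAsTable("sales_clean")          // target stage

    // Pipeline 2: the previous pipeline's target is now this pipeline's source.
    spark.table("sales_clean").show()      // console-style target

    spark.stop()
  }
}
```

Note how the single execution path holds within each pipeline: each stage feeds exactly one successor, and branching or looping logic is expressed by chaining separate pipelines instead.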
A compute stage performs an action or operation on part or all of the source data. Multiple compute stages can be chained together to build the pipeline target result. Data can also be changed using the Transformer tool.
The following table lists and describes the available compute stages in iBDI.
| Compute Stage | Description |
|---|---|
| Cleanse | Submits a DataFrame to be cleansed in iWay Data Quality Server (DQS). |
| Coalesce | Returns a new DataFrame with exactly the number of partitions provided. This operation results in a narrow dependency. For example, if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. |
| Distinct | Returns a new DataFrame that contains only the unique rows from this DataFrame. |
| Drop | Returns a new DataFrame with a column dropped. This is a no-op if the schema does not contain the column name. |
| Filter | Filters rows using the provided SQL expression. For example: age > 15 |
| Hive | Transforms a DataFrame by joining it with existing Hive tables. |
| HQL | Executes an HQL statement. |
| RStat Model | Scores the incoming data and adds a column to the DataFrame containing the prediction. |
| Order by | Returns a new DataFrame sorted by the given expressions. This is an alias of the Sort function. |
| Sample | Returns a new DataFrame by sampling a fraction of rows. |
| Save | Saves the DataFrame into Hive without stopping the pipeline. |
| Select | Selects an ordered list of columns from the DataFrame. |
| Show | Shows the first 20 records in a DataFrame. |
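Most of these compute stages correspond directly to Spark DataFrame operations. The following Scala sketch chains several of them, assuming a hypothetical Hive table `people` with `name` and `age` columns; it illustrates the underlying operations rather than code emitted by iBDI.

```scala
import org.apache.spark.sql.SparkSession

object ComputeStagesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compute-stages-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.table("people")       // hypothetical source table

    val result = df
      .filter("age > 15")                // Filter: SQL expression on rows
      .select("name", "age")             // Select: ordered list of columns
      .distinct()                        // Distinct: unique rows only
      .drop("nickname")                  // Drop: no-op, column is absent here
      .orderBy("age")                    // Order by: alias of sort
      .coalesce(10)                      // Coalesce: narrow dependency, no shuffle
      .sample(withReplacement = false, fraction = 0.1) // Sample: fraction of rows

    result.show()                        // Show: first 20 records
    spark.stop()
  }
}
```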
A target stage is the final stage of a pipeline and serves as the destination for the data from the pipeline. Depending on its type, a pipeline target can also function as the source for a subsequent pipeline.
The following table lists and describes the available target stages in iBDI.
| Target Stage | Description |
|---|---|
| Console | Displays the result of the pipeline operations in the Console pane of iBDI. |
| Stream (Kafka) | Writes each row of the DataFrame into a Kafka topic. |
| Defined | |
| Structured (RDBMS) | Writes the data to an external table. |
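In Spark terms, the Console, Stream (Kafka), and Structured (RDBMS) targets roughly correspond to the writes sketched below. This is an assumption-laden Scala illustration, not iBDI output: the broker address, topic name, JDBC URL, and credentials are placeholders, and the Kafka batch write requires the spark-sql-kafka connector on the classpath.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object TargetStagesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("target-stages-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.table("sales_clean")  // hypothetical pipeline result

    // Console target: print the result.
    df.show()

    // Stream (Kafka) target: each row becomes a Kafka record.
    // Kafka expects a string or binary "value" column.
    df.selectExpr("to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("topic", "pipeline-out")                  // placeholder topic
      .save()

    // Structured (RDBMS) target: write to an external table over JDBC.
    val props = new Properties()
    props.setProperty("user", "etl")         // placeholder credentials
    props.setProperty("password", "secret")
    df.write.jdbc("jdbc:postgresql://db:5432/warehouse", "sales_out", props)

    spark.stop()
  }
}
```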