Greenplum Database Considerations

Topics:

The build_all.sh script is intended to be used to build HealthViews when the Omni-Patient data resides on the same database. However, many customers choose to use Greenplum as their data warehouse database for HealthViews and PostgreSQL for the more transactional-oriented Omni-Patient. In this type of deployment, additional scripts are required to extract the data from the PostgreSQL database and insert it into Greenplum. A set of utilities is provided for this task.

Data Load Scripts

Topics:

Separate scripts have been created to extract the data from Omni-Patient, insert the data into a staging schema called stg in Greenplum and add a hash value to the data once it is in a Greenplum database. In addition, scripts have been created to create the schemas necessary for the above steps as well as truncate unused data once Omni-Patient has been loaded. All of the data load scripts are described in this section.

EXTRACT_OMNI_DATA.BASH

This script uses the psql copy function to extract data from the Omni-Patient database and write it to flat files in a known location, one file per Omni-Patient table to be replicated.

LOAD_OMNI_STG_TABLES.BASH

This script loads data from the flat files into a staging schema using the GPLOAD utility of Greenplum.

TRANSFORM_ALL_TABLES.BASH

This script transforms the data in the staging schema by adding a hashed value to the ID field and distributing the data by that field. It calls a sub-script, TRANSFORM_DATA_TO_HUB.BASH to accomplish this.

EXTRACT_LOAD.BASH

This is a master script to coordinate the execution of all of the above scripts in the proper order. It is typically configured to run as a cron job. It can also be modified to run other scripts as required.

GP_DB_INIT.BASH

This script reads the schema from Omni-Patient and then creates the staging and hub schemas in Greenplum. If those schemas already exist it will drop them and recreate them, all data in those schemas will be lost. It is typically not run on a regular basis.

GP_GEN_SCHEMA_ONLY.BASH

This is similar to the GP_DB_INIT.BASH script except it does not actually create the schema. Instead, it creates a DDL file to for both the stg and hub schemas. This can be useful to run if new tables are added to Omni-Patient.

BACKUP_DB.BASH

This script creates a backup copy of the current HealthViews database by creating a new database using the current production database as a template. It is typically run before loading any new data into the staging or hub schemas.

TRUNCATE_STG.BASH

This script can be used to truncate the stg schema after HealthViews has been populated in order to save space and reduce backup time.

Configuring Greenplum Scripts

Topics:

Configuring the various scripts for populating Omni-Patient HealthViews in Greenplum is performed through the replicator.properties file. This properties file contains the following settings:

  • postgres.hostname=

    Name of the host running the PostgreSQL Omni-Patient database.

  • postgres.port_number=

    Port for that host.

  • postgres.database_name=

    Name of the database for Omni-Patient.

  • postgres.schema_name=

    Name of the schema for Omni-Patient (normally Public).

  • postgres.username=

    PostgreSQL database admin user name.

  • postgres.password=

    PostgreSQL database admin password (optional if configured in the .pgpass file of the user).

  • greenplum.hostname=

    Name of the host running the HealthViews database.

  • greenplum.port_number=

    Port for that host.

  • greenplum.database_name=

    Name of the database to be used (normally omnipatient_bi).

  • greenplum.staging.schema_name=

    Name of the staging schema (normally stg).

  • greenplum.staging.ddl_file=

    Location of the ddl file for the staging schema.

  • greenplum.hub.schema_name=

    Name of the hub schema (normally hub).

  • greenplum.hub.ddl_file=

    Location of the ddl file for the staging schema.

  • greenplum.username=

    Greenplum database admin name.

  • greenplum.yaml_dir=

    Location of the YAML files needed for GPLOAD.

  • greenplum.yaml_template=

    Template to be used by GPLOAD.

  • replicator.lzdir=

    Root directory for the Omni-Patient to HealthViews replication processes.

Greenplum YAML Files

YAML files are used by the Greenplum GPLOAD utility to define the data format for the flat files that are being loaded into Greenplum. The YAML file contains a description of the database server to be accessed and information on the formatting of the files to be loaded. In addition, the YAML file must be customized to the specific environment. The YAML file is structured as follows:

VERSION: YAML version to be used.  Should be left as 1.0.0.1DATABASE: Database to be used.  Should be L_GP_DATABASE which allows replicator.properties to specify the database.USER: Name of the user to load the files.HOST: Name of the host with the databasePORT: Port number on the host.GPLOAD:   INPUT:    - SOURCE:         LOCAL_HOSTNAME:           -  Name of the host with the flat files to be loaded.
         FILE: 
           - Name of the file to be loaded.  Should be L_LOADPIPE to match the value passed by the loading script.
         PORT:            Port number to read the file in on.  Should be 8081
    - FORMAT: format of the file.  Should be csv.
    - DELIMITER: Delimiter to mark fields.  Should match the delimiter used in the extract_omni_data.bash script.
    - ESCAPE: Escape character to use.  Should match the escape value in the extract_omni_data.bash script.
    - NULL_AS: Leave as "
    - QUOTE: Quote character.  Should be ""
    - HEADER: Whether or not a header is used in the flat file.  Should be set to FALSE.
    - ERROR_LIMIT: Number of errors to ignore before rejecting the flat file.  Typically set to 1000.
    - ERROR_TABLE: Table to write errors to.
   OUTPUT:    - TABLE: The table to write the files to.  Should be set to L_SCHEMA.L_TABLE.
    - MODE: The mode to use.  Should be set to insert.