Glossary

This appendix provides a reference for common terms that are used in Big Data discussions and iWay Big Data Integrator (iBDI).

Avro

A row-wise data serialization file format used in Hadoop. Avro files can be queried with Hive. Use Avro if you intend to consume most of the columns in each row.

Change Data Capture

Techniques and solutions for managing incremental data loads.

Flume

Flume is used to import log data into HDFS as it is generated. For example, it can ingest logs from web servers, application servers, and click streams, while Hadoop simultaneously cleans, aggregates, and stores the data for long-term use. The data can then be output to BI and batch reports, or intercepted and altered before it is written to disk.

Metadata (for example, rows and columns) can be applied to streaming source data to provide the look and feel of a database.
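The idea of applying column metadata to a raw log stream can be sketched in a few lines of Python. This is an illustrative example only, not Flume itself; the field names and the sample log line are assumptions:

```python
# Illustrative sketch: apply column metadata to raw log lines so a
# stream of text can be treated like database rows. Field names and
# the sample record are assumptions, not part of Flume.
FIELDS = ["timestamp", "client_ip", "url", "status"]

def apply_metadata(line):
    """Split a space-delimited log line into a column-named record."""
    values = line.split()
    return dict(zip(FIELDS, values))

record = apply_metadata("2021-06-01T12:00:00 10.0.0.5 /index.html 200")
print(record["status"])  # the raw line now behaves like a table row
```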

Hadoop

An open-source framework in Java for distributed storage and processing of very large data sets on computer clusters built from commodity hardware.

HDFS

Hadoop Distributed File System (HDFS) is composed of one NameNode and multiple data nodes, which hold the blocks of the files placed on HDFS. The NameNode retains metadata regarding the location of each block within the cluster. Each block is distributed and replicated across multiple data nodes to ensure redundancy and efficient data retrieval.

HDFS ensures that any data node failure will not result in data loss. For example, when HDFS receives a query, it first acquires metadata information about the location of the file from the NameNode, and then retrieves actual data blocks from the data nodes.
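The block-placement idea behind this redundancy can be sketched as a toy model in Python. This is not the real HDFS API; the data node names and the round-robin placement policy are illustrative assumptions (HDFS's actual policy is rack-aware):

```python
# Toy sketch of HDFS-style block placement: the "NameNode" is a dict
# mapping each block to the data nodes holding its replicas. Node names
# and the simple round-robin policy are assumptions for illustration.
from itertools import cycle

DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]   # assumed cluster
REPLICATION = 3                              # HDFS default replication

def place_blocks(num_blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes, round-robin."""
    rotation = cycle(range(len(nodes)))
    namenode = {}
    for block_id in range(num_blocks):
        start = next(rotation)
        namenode[block_id] = [nodes[(start + r) % len(nodes)]
                              for r in range(replication)]
    return namenode

layout = place_blocks(4)
# Any single data-node failure still leaves two replicas of every block.
print(layout[0])
```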

Hive

Originally developed by Facebook™ for data warehousing, Hive is now an open-source Apache project that uses an SQL-like language called HiveQL that generates MapReduce jobs running on the Hadoop cluster.

Hive works by turning HiveQL queries into MapReduce jobs that are submitted to the cluster. A Hive query operates on HDFS directories (tables) containing one or more files, similar to tables in an RDBMS. Hive also supports many formats for data storage and retrieval.

Hive can identify the structure and location of the tables because they are specified when the table metadata is created. This metadata is stored in the Hive metastore, which resides in an RDBMS such as Postgres or MySQL. The metastore contains each table's column format and location, which Hive retrieves when processing a query. The query itself operates on files stored in a directory on HDFS.
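The metastore lookup described above can be pictured as a simple dictionary keyed by table name. This is an illustrative sketch only; the table name, column names, and warehouse path below are assumptions, not real metastore schema:

```python
# Toy sketch of the metastore lookup Hive performs before running a
# query: map a table name to its column layout and HDFS location.
# Table, column, and path names here are illustrative assumptions.
METASTORE = {
    "web_logs": {
        "columns": ["ts", "client_ip", "url", "status"],
        "location": "/user/hive/warehouse/web_logs",
        "format": "parquet",
    }
}

def describe_table(name):
    """Return the stored metadata for a table, as a metastore would."""
    return METASTORE[name]

meta = describe_table("web_logs")
print(meta["location"])  # the query itself reads files under this directory
```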

MapReduce

MapReduce is a fault-tolerant method for distributing a task across multiple nodes, where each node processes the data stored locally on that node. Tasks are distributed automatically and run in parallel.

The distribution and assigning of data to the nodes consists of the following phases:

  • Map. Each Map task processes the data in a single HDFS block and runs on the data node where that block is stored.
  • Shuffle and Sort. This phase sorts and consolidates intermediate data from all Map tasks. It occurs after all Map tasks are complete but before the Reduce tasks begin.
  • Reduce. The Reduce task takes the shuffled and sorted intermediate data (that is, the output of the Map tasks) and produces the final output.
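The three phases above can be sketched as an in-process Python word count. This runs entirely in local memory with no Hadoop cluster involved; it is a minimal illustration of the phase structure, not of Hadoop's actual execution:

```python
# Minimal in-process sketch of the three MapReduce phases, using the
# classic word-count example (no actual Hadoop cluster involved).
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle and Sort: group intermediate values by key, in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: combine each key's grouped values into the final output."""
    return {key: sum(values) for key, values in grouped}

lines = ["big data", "big cluster"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
print(counts)  # {'big': 2, 'cluster': 1, 'data': 1}
```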

Parquet

A column-wise data serialization file format used in Hadoop. Parquet files can be queried with Hive. Use Parquet if you intend to consume only a few columns in a row.
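The row-wise versus column-wise distinction can be sketched with plain Python lists and dicts. This is an illustrative contrast of the two layouts, not how Avro or Parquet encode bytes on disk, and the records are made up:

```python
# Illustrative contrast between row-wise (Avro-style) and column-wise
# (Parquet-style) layouts for the same records. The records are made up.
records = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
]

# Row-wise: the values of one record are stored together.
row_store = [list(r.values()) for r in records]

# Column-wise: the values of one column are stored together.
col_store = {k: [r[k] for r in records] for k in records[0]}

# Reading a single column from the columnar layout touches only that
# column's data; the row layout forces a scan over every full record.
print(col_store["score"])  # [10, 20]
```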

Sqoop

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs, which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.

The Sqoop (SQL to Hadoop) process transfers data between an RDBMS and HDFS. For example, you can use Sqoop to import data from Postgres, MySQL, and Microsoft SQL Server databases, and then use Hadoop (which handles transformation, aggregation, and long-term storage) to produce BI batch reports.

Sqoop performs this process very efficiently through a Map-only MapReduce job. It also supports JDBC and ODBC, as well as direct connectors for specific databases such as PostgreSQL and MySQL.
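The way a Map-only import parallelizes work can be sketched as splitting a table's key range among mappers, which mirrors the idea behind Sqoop's split column. This is an illustrative sketch under that assumption, not Sqoop's actual splitting code:

```python
# Illustrative sketch of the kind of split a Map-only import computes:
# divide a table's key range among mappers so each imports its own
# slice in parallel. This is not Sqoop's actual implementation.
def split_key_range(min_key, max_key, num_mappers):
    """Return half-open (low, high) key ranges, one per mapper."""
    span = max_key - min_key + 1
    chunk = span // num_mappers
    splits = []
    low = min_key
    for i in range(num_mappers):
        # The last mapper absorbs any remainder from integer division.
        high = max_key + 1 if i == num_mappers - 1 else low + chunk
        splits.append((low, high))
        low = high
    return splits

# Four mappers each import roughly a quarter of keys 1..1000.
print(split_key_range(1, 1000, 4))
```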

YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology. YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open source distributed processing framework. Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications.

The YARN (MRv2) daemons consist of the following:

  • ResourceManager. One per cluster, it starts ApplicationMasters and allocates resources on slave nodes.
  • ApplicationMaster. One per job, it requests resources and manages individual Map and Reduce tasks.
  • NodeManager. One per slave (data node), it manages resources on individual slave nodes.
  • JobHistory. One per cluster, it archives metrics and metadata about jobs that finished running (regardless of success).
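The ResourceManager's allocation role can be pictured with a toy greedy scheduler that places container requests on whichever slave node has the most free memory. The node names, memory sizes, and greedy policy are all illustrative assumptions; YARN's real schedulers (Capacity, Fair) are far more sophisticated:

```python
# Toy sketch of a ResourceManager placing container requests on the
# slave node with the most free memory. Node names, sizes, and the
# greedy policy are assumptions for illustration only.
def allocate(requests_mb, free_mb):
    """Greedily place each container on the node with most free memory."""
    placements = []
    for need in requests_mb:
        node = max(free_mb, key=free_mb.get)
        if free_mb[node] < need:
            raise RuntimeError("cluster cannot satisfy request")
        free_mb[node] -= need
        placements.append(node)
    return placements

nodes = {"dn1": 4096, "dn2": 2048}
print(allocate([1024, 1024, 1024], nodes))
```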

ZooKeeper

Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper's architecture supports high availability through redundant services.
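The naming-registry idea can be sketched as a small hierarchical store of path-to-data entries, loosely modeled on ZooKeeper's znode tree. This toy has none of ZooKeeper's replication, watches, or coordination guarantees, and the paths and values are made up:

```python
# Toy sketch of a ZooKeeper-style naming registry: slash-separated
# paths map to small pieces of configuration data. No replication or
# coordination here; paths and values are illustrative assumptions.
class Registry:
    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        """Register a piece of data under a hierarchical path."""
        self.nodes[path] = data

    def get(self, path):
        """Look up the data stored at a path."""
        return self.nodes[path]

    def children(self, path):
        """List the immediate child names under a path, sorted."""
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

zk = Registry()
zk.create("/services/db", "host=10.0.0.7:5432")
zk.create("/services/cache", "host=10.0.0.8:6379")
print(zk.children("/services"))  # ['cache', 'db']
```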