
Hadoop

Big Data and Scalable Databases

Operational Challenge
Cloud Computing Databases
Apache Hadoop Open Source Platform
Apache Hadoop Processing Framework
Hadoop - Oracle and IBM Corporation Databases
Best Practices
Copyright Acknowledgement

Operational Challenge

Scalable database management systems, whether used for update-intensive application workloads or for decision support systems performing descriptive and deep analytics, are central to cloud infrastructure. The cloud computing paradigm of service-oriented computing is being extended to database as a service and storage as a service. Elasticity, pay-per-use pricing, small upfront investment, low time to market, and transfer of risk are the features that make these services attractive for deployment in organizational enterprise infrastructure.

Transitioning to cloud infrastructure environments raises several challenges: scalability, elasticity, fault tolerance, self-manageability, and the ability to run on commodity hardware. Most relational database systems were designed for enterprise infrastructures and do not meet all of these goals. Cloud data management systems are being implemented to meet these operational requirements.


Cloud Computing Databases

The facilities and feature set of the cloud make it attractive for deploying new applications, developing scalable data management solutions, supporting update-heavy applications, and running ad hoc analytics and decision support. The central design objective is to build a single-tenant database management system with one large database for applications. An application can start small, but as its popularity increases, its data footprint will grow beyond the limits of a traditional relational database.

A distributed cloud application exposed to the Internet can experience both explosive growth and wide fluctuations in use. Application servers can readily scale out; however, the data management infrastructure can become a bottleneck.


Apache Hadoop Open Source Platform

Apache Hadoop is an open source platform for consolidating large-scale data from real-time and legacy sources. It complements existing data management systems with new analyses and processing tools. Hadoop was designed for the efficient, reliable analysis of terabytes and petabytes of both structured and unstructured data. An enterprise can deploy Hadoop in conjunction with its legacy information technology systems; this allows old data and new datasets to be combined.

Hadoop divides large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster, more efficient processing. It is part of a larger framework of related technologies:

HDFS The Hadoop Distributed File System, which handles storage and provides high-throughput access to application data.
Hadoop MapReduce A framework for the distributed, parallel processing of large datasets.
YARN Yet Another Resource Negotiator provides resource and node management.

HBase Column-oriented, non-relational, schema-less, distributed database modeled after Google’s BigTable; it is designed to provide random, real-time read/write access to big data (see the client sketch following this list).
Hive Data warehouse system which provides a SQL interface and projects structure onto the otherwise unstructured underlying data.
Pig A platform and language for managing and analyzing large datasets.
ZooKeeper Centralized service for maintaining configuration information, naming, distributed synchronization, and group services.
Spark Performs in-memory processing of data.
Mahout An environment for creating machine learning applications which are scalable.
Drill Supports different kinds of NoSQL databases and file systems including: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, and NAS.
Oozie Service for scheduling inside Hadoop.
Flume Service for ingesting unstructured and semistructured data into HDFS.
Sqoop Service for importing and exporting structured data between Hadoop and RDBMS: relational database management systems or enterprise data warehouses.
Solr and Lucene Services used for searching and indexing in a Hadoop ecosystem.
Ambari ASF: Apache Software Foundation project for making Hadoop ecosystems more manageable.
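
As an illustration of the random, real-time read/write access described in the HBase entry above, the following is a minimal sketch using the HBase Java client API. The table name web_metrics, column family cf, qualifier views, and row key page#1001 are hypothetical placeholders; the cluster address is assumed to come from an hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws Exception {
            // Cluster settings are read from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("web_metrics"))) {

                // Write a single cell: row key, column family, qualifier, value.
                Put put = new Put(Bytes.toBytes("page#1001"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("views"), Bytes.toBytes("42"));
                table.put(put);

                // Random, real-time read of the same row.
                Get get = new Get(Bytes.toBytes("page#1001"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("views"));
                System.out.println("views = " + Bytes.toString(value));
            }
        }
    }

The same Table handle also supports Scan and Delete operations; bulk analytical reads over an HBase table are more commonly driven through MapReduce or Hive than through individual Get calls.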


Apache Hadoop Processing Framework

Hadoop is a generic processing framework designed to execute queries and other batch read operations against extremely large datasets that can be hundreds of terabytes or petabytes in size. The data is loaded into or appended to the HDFS: Hadoop Distributed File System. Hadoop then performs brute-force scans through the data and writes the results to output files.
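
As a minimal sketch of that first step, the file below is copied into HDFS with the Hadoop FileSystem Java API. The local and HDFS paths are hypothetical, and fs.defaultFS is assumed to be set in core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLoad {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local extract into HDFS, where MapReduce jobs can scan it.
            Path local = new Path("/data/exports/clickstream-2023-01-01.log");
            Path target = new Path("/user/etl/clickstream/2023-01-01.log");
            fs.copyFromLocalFile(local, target);

            System.out.println("Loaded " + fs.getFileStatus(target).getLen() + " bytes into HDFS");
            fs.close();
        }
    }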

Hadoop is not a relational database; it does not perform updates or transactional processing, and it does not natively support indexing or a SQL interface. Open source projects are adding these capabilities. Hadoop operates on massive datasets by horizontally scaling the processing across large numbers of servers through MapReduce. Vertical scaling, executing on a single more powerful server, can be both limiting and expensive. MapReduce splits up a problem, sends the sub-problems to different servers, and lets each server solve its sub-problem in parallel. It then merges the sub-problem solutions and writes the result to files, which can be used as inputs to additional MapReduce steps.
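
A word count job is the customary minimal illustration of this split, solve, and merge pattern; it is a generic sketch rather than any specific workload discussed here. Each mapper emits (word, 1) pairs from its input split, the framework shuffles and groups the pairs by key, and each reducer merges one key's values into a total.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: each mapper works on one input split and emits (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce step: the framework groups pairs by key; each reducer merges one key's values.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output live in HDFS; the output is a directory of part files.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A run would look something like: hadoop jar wordcount.jar WordCount /user/etl/input /user/etl/output, where the jar name and paths are hypothetical. The output directory contains one part file per reducer, which can in turn serve as input to a further MapReduce step.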

Hadoop has been useful in environments where massive server farms collect the data. Hadoop processes parallel queries as large background batch jobs on the same server farm. This saves the organization from having to acquire additional hardware for processing the data. It also obviates the requirement for loading the data into another system.


Hadoop - Oracle and IBM Corporation Databases

Leading commercial enterprise database software offers capabilities that Hadoop does not provide: performance optimizations, analytic functions, declarative features for complex analysis, enterprise security, auditing, maximum availability, and disaster recovery. The Oracle Exadata Database is the technology leader; it can coexist with and complement Hadoop. The inexpensive cycles of server farms running Hadoop can transform masses of unstructured data with low information density into dense, structured data, which is then loaded into Oracle Exadata.

ODI: Oracle Data Integrator is based on Hive and provides native Hadoop integration. A user interface is provided for creating programs which load data to and from files or relational datastores. Oracle Loader for Hadoop implementations require Java MapReduce code to be written and executed on the Hadoop cluster; ODI and the ODI Application Adapter for Hadoop provide a graphical user interface for creating these programs. The ODI Application Adapter for Hadoop generates optimized HiveQL, which in turn generates native MapReduce programs that execute on the Hadoop cluster.

Oracle Loader for Hadoop is a MapReduce utility for optimized data loading from Hadoop into the Oracle database. It sorts, partitions, and converts data into Oracle database formats within Hadoop, then loads the converted data into the database. Because this preprocessing runs as a Hadoop job on the Hadoop cluster, it reduces CPU and I/O utilization on the database and results in faster index creation once the data is in the database. ODI and the ODI Application Adapter for Hadoop provide native Hadoop integration. The ODI Application Adapter for Hadoop includes ODI knowledge modules optimized for Hive and for Oracle Loader for Hadoop. These knowledge modules can be used to build Hadoop metadata within ODI, load data into Hadoop, transform data within Hadoop, and load data directly into an Oracle database utilizing Oracle Loader for Hadoop.

IBM InfoSphere BigInsights Basic is the IBM distribution of Hadoop. IBM provides add-ons: a text analysis engine, development tools, data exploration, enterprise software integration, platform administration, and run-time performance improvements. There is also a BigInsights Enterprise Edition, which includes a text processing engine and a library of annotators for querying and identifying items of interest in documents and messages.

The Enterprise Edition employs IBM-specific software to enhance administration and performance. Built-in support is available for data formats: JSON data, comma-separated values, tab-separated values, character-delimited data, and others. Plug-ins can be created for processing additional data formats and executing custom functions. There is an optional job scheduling mechanism for tuning resource allocation among long- and short-running jobs. BigInsights supports LDAP authentication to its web console; LDAP and a reverse proxy restrict access to appropriately authorized users.


Best Practices

Apply these techniques for improving Hadoop performance.

Combine the processing of multiple small input files into a smaller number of maps. This technique reduces bookkeeping overhead, and together with task JVM reuse (running multiple map tasks in one JVM) it also saves JVM startup overhead. Combiners help the shuffle phase of an application by reducing network traffic. The shuffle phase passes the sorted output of the mappers to the Reducer as its input. The Reducer reduces a set of intermediate values which share a key to a smaller set of values. Used appropriately, a Combiner will significantly reduce the data sent from the maps.
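
The driver below is a sketch of both practices, assuming the TokenizerMapper and IntSumReducer classes from the word count example earlier in this section: CombineTextInputFormat packs many small files into fewer input splits, and the reducer class is reused as a Combiner so that partial sums are computed on the map side before the shuffle.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TunedWordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "tuned word count");
            job.setJarByClass(TunedWordCountDriver.class);

            // Pack many small input files into fewer splits, and therefore fewer map tasks.
            job.setInputFormatClass(CombineTextInputFormat.class);
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // cap each combined split at ~256 MB

            // Mapper and reducer classes from the word count sketch above (an assumption of this example).
            job.setMapperClass(WordCount.TokenizerMapper.class);
            // The reducer logic doubles as a Combiner: partial sums are computed on the map side
            // before the shuffle, reducing the data sent across the network.
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }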

An excessive number of maps, or a large number of maps with extremely short run-times, is counterproductive. Maps should be sized so that their map outputs can be sorted in a single pass within the sort buffer.

Applications should be designed so that each reduce processes a minimum of 1 - 2 GB and a maximum of 5 - 10 GB of data. Use larger HDFS block sizes when processing very large datasets.
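
The sketch below shows where these two settings are applied; the paths are hypothetical and the 500 GB figure is an assumed map-output volume used only to illustrate the arithmetic. The client-side dfs.blocksize property controls the block size of files the client writes, and the number of reduce tasks is derived so that each reduce handles roughly 5 GB.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class BlockSizeAndReduceSizing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Files written by this client use a larger HDFS block size (256 MB instead of the default).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/user/etl/large/events.log"))) {
                out.writeBytes("...\n"); // placeholder; a real loader would stream the dataset here
            }

            // Choose the number of reduce tasks so each reduce handles roughly 5 GB of map output.
            long totalMapOutputBytes = 500L * 1024 * 1024 * 1024; // assumed 500 GB of map output
            long targetBytesPerReduce = 5L * 1024 * 1024 * 1024;  // ~5 GB per reduce
            int reduces = (int) Math.max(1, totalMapOutputBytes / targetBytesPerReduce);

            Job job = Job.getInstance(conf, "sized reduce phase");
            job.setNumReduceTasks(reduces); // 100 reduces for the assumed 500 GB
            // Mapper, reducer, and input/output paths would be configured here as in the earlier sketches.
        }
    }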


Copyright Acknowledgement

SYS-ED makes no representations regarding ownership and intellectual property rights associated with the software that it provides training on.