PIVOTAL HD Features
Simple and Complete Cluster Management: Command Center
Command Center is a robust cluster management tool that allows users to install, configure, monitor and manage Hadoop components and services through a Web graphical interface. It provides a comprehensive dashboard with instant views of cluster health and key performance metrics. Users can also view live and historical host-, application- and job-level metrics across the entire Pivotal HD cluster. Command Center also provides a command-line interface and Web Services APIs for integration into enterprise monitoring services.
Big Data + Big Computing: GraphLab on Open MPI
GraphLab on Open MPI (Message Passing Interface) is a widely used, mature, high-performance graph-based distributed computation framework for data scientists and analysts. Its supported integration into Pivotal HD eliminates costly data movement and long data science cycles. Combined with MADlib, Pivotal HD delivers an advanced analytics platform for rapid, deep discovery cycles, helping companies stay ahead of the competition.
Hadoop In the Cloud: Pivotal HD Virtualized by VMware
Hadoop Virtualization Extensions (HVE) are plug-ins that make Hadoop aware of the virtualized environment in which it runs. Pivotal HD is the first Hadoop distribution to include HVE plug-ins, enabling easy deployment of Hadoop in enterprise environments. With HVE, Pivotal HD Enterprise can deliver truly elastic scalability in the cloud, augmenting on-premises deployment options that include software and appliance deployments.
Spring Data: Build Distributed Processing Solutions with Apache Hadoop
Spring for Apache Hadoop simplifies developing Big Data applications by providing a unified configuration model and easy-to-use APIs for using HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem projects such as Spring XD, enabling you to develop solutions for Big Data ingest/export and Hadoop workflow orchestration.
A Fast, Proven SQL Database Engine for Hadoop
Unlike new SQL-on-Hadoop entrants, Pivotal HAWQ brings more than 10 years of innovation that has resulted in a rich, powerful SQL query optimizer and processor optimized to run analytical queries and mixed query workloads in massively parallel, distributed environments.
Parallel Query Optimizer
HAWQ’s query optimizer utilizes mature and proven technology innovation from the Greenplum database. HAWQ’s cost-based query optimizer can effortlessly find the optimal query plan for the most demanding of queries, including queries with more than 30 joins.
Parallel Data Flow Framework – Dynamic Pipelining™
Dynamic Pipelining is a parallel data flow framework that combines an adaptive, high-speed UDP interconnect, a runtime execution environment, a runtime resource management layer, and a seamless data partitioning mechanism. Tuned for Big Data, Dynamic Pipelining implements the operations that underlie all SQL queries and, even for very demanding queries on heavily utilized clusters, ensures that queries run to completion.
Extension Frameworks for HBase, Hive and More – PXF
Pivotal Xtension Framework (PXF) is an external table interface in HAWQ that allows you to read data stored within the Hadoop ecosystem. External tables can be used to load data into HAWQ from Hadoop or to query Hadoop data in place without materializing it into HAWQ. PXF enables loading and querying of data stored in HDFS, HBase and Hive, and supports a wide range of data formats, including text, Avro, Hive, sequence file and RCFile formats, as well as HBase.
Example use cases include applying HAWQ's statistical and analytical functions (e.g., MADlib) to HBase or Hive data, joining in-database dimensions with HBase facts, leveraging analytical capabilities on Hadoop data files of various kinds, and fast ingest of data into HAWQ for in-database processing and analytics.
PXF provides parallel data collaboration between HAWQ, Pivotal HD and Hadoop data-processing modules, creating a single, fast analytic workflow.
Big Data Analytics Capability and Productivity
Analyzing big data efficiently requires massively parallel architectures like Hadoop. To take advantage of the computational capacity of MPP systems, statistical, mathematical and machine-learning algorithms must be refactored to run efficiently in a parallel environment. Pivotal HD's Advanced Database Services offer these capabilities through MADlib, a library of MPP-capable algorithms that extends the SQL capabilities of Hadoop, while also supporting user-defined functions written in PL/R, PL/Python and PL/Java. In addition, Pivotal HD Enterprise includes Apache Mahout, an open-source parallelized analytics library for MapReduce users.
GemFire XD Features
Enterprise Real-Time Data Service on Hadoop
GemFire XD, built on over a decade of innovation, combines with Pivotal HD and HAWQ to provide the industry’s first production quality platform for creating closed loop analytics solutions. It does this by providing:
- The performance of in memory, combined with the scale of big data
- Larger data sizes in the same size JVM
- Direct write to a big data store (HDFS) allowing for back-end analytics
- SQL without the penalties of relational databases
In-Memory with Big Data:
GemFire XD enables the creation of low-latency, scale-out OLTP applications integrated out of the box with a big data store (HDFS). This provides sub-second response to applications, while allowing the data to be analyzed in the back end via HAWQ or MapReduce in near real time.
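The write-behind pattern described above can be sketched in a few lines: the application is acknowledged from memory immediately, while a background worker persists data to a slower store. This is a toy Python stand-in with all names invented; a plain dictionary models the HDFS persistence tier, not GemFire XD's actual implementation.

```python
import queue
import threading

class WriteBehindStore:
    """Toy write-behind cache: reads are served from memory, writes are
    acknowledged immediately and flushed to a slower backing store
    (standing in for HDFS) by a background thread."""

    def __init__(self):
        self.memory = {}             # in-memory table (fast path)
        self.backing = {}            # stand-in for the HDFS persistence tier
        self.pending = queue.Queue()
        self.flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self.flusher.start()

    def put(self, key, value):
        self.memory[key] = value         # sub-second ack to the application
        self.pending.put((key, value))   # queued for asynchronous persistence

    def get(self, key):
        return self.memory.get(key)

    def _flush_loop(self):
        while True:
            key, value = self.pending.get()
            self.backing[key] = value    # slow write happens off the hot path
            self.pending.task_done()

store = WriteBehindStore()
store.put("txn-1", {"amount": 42})
print(store.get("txn-1"))   # served from memory immediately
store.pending.join()        # wait until the background flush catches up
print(store.backing["txn-1"])
```

The point of the design is that the application's latency is bounded by the in-memory write, while durability in the big data store follows asynchronously, which is what makes back-end analytics on fresh data possible.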
Closed-loop analytics with HDFS
GemFire XD can be configured to write incoming data directly to HDFS. This enables a number of interesting scenarios:
- Capture streams of data for analysis in memory, and for historical roll-up after the fact.
- Route transactions through a reliable in-memory system with assurances that data is available on disk for audit and compliance.
- Take advantage of consumer generated streams, like Twitter, for sentiment analysis.
- Detect fraud and shut it down in real time by knowing what “normal” patterns are and applying that to current data.
Scale out and scale up in-memory, scale out in HDFS
In addition to clustering which enables elastic scale out, and HDFS integration which enables scaling of the persistence layer, GemFire XD with Off-Heap Storage allows applications to scale individual servers to hundreds of gigabytes without incurring penalties associated with traditional garbage collection in servers.
Public performance benchmarks (YCSB) show GemFire XD with Pivotal HD delivering two to three times the throughput of HBase, at lower latency, across a variety of workloads.
Familiar SQL Interface
To access data in GemFire XD, developers use a standard ANSI SQL interface. Combined with the PXF connector that lets HAWQ read GemFire XD data, this gives you SQL access to your data whether it is in memory or on disk, enabling OLAP and OLTP on the same data set.
For applications, GemFire XD provides both JDBC and ODBC interfaces, allowing powerful applications to be built using the familiar and friendly ecosystems of Spring, Java and C++.
GemFire XD achieves this by providing:
- Relational technology based on Apache Derby
- An ANSI SQL-92-compliant query engine
- Powerful distributed stored procedure execution
- Referential integrity on a distributed system
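To illustrate the kind of standard SQL such an application would issue, here is a sketch using Python's built-in sqlite3 as a stand-in database. In a real deployment the same statements would go to GemFire XD over JDBC or ODBC; the schema and data below are hypothetical.

```python
import sqlite3

# sqlite3 stands in for GemFire XD so the sketch is runnable; GemFire XD
# accepts standard ANSI SQL, so equivalent statements would be issued
# through a JDBC/ODBC connection in a real application.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount     REAL
    )
""")
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)],
)
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('acme', 200.0), ('globex', 50.0)]
```

Because the interface is plain SQL, the same query text can serve both the OLTP path (against in-memory data) and, via PXF and HAWQ, the OLAP path against the persisted copy.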
PIVOTAL HD Technology
What’s Included in Pivotal HD?
Pivotal HD is a commercially supported, enterprise-capable distribution of the Apache Hadoop stack. It includes the Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, HBase, ZooKeeper, YARN and Mahout. Running Pivotal HD's commercial Hadoop distribution on a Pivotal DCA helps you eliminate the pain of building out, debugging and monitoring Hadoop clusters from scratch, as other distributions require.
Simple and Complete Cluster Management: Command Center
Command Center is a robust cluster management tool that allows your users to install, configure, monitor and manage Hadoop components and services through a Web graphical interface. It simplifies Hadoop cluster installation, upgrading and expansion using a comprehensive dashboard with instant views of cluster health and key performance metrics. Your users can view live and historical host-, application- and job-level metrics across the entire Pivotal HD cluster. Command Center also provides a command-line interface and Web Services APIs for integration into enterprise monitoring services.
Machine Learning on Graph Data: GraphLab on Open MPI
GraphLab on Open MPI (Message Passing Interface) is a widely used, mature, high-performance graph-based distributed computation framework that easily scales to graphs with billions of vertices and edges. It can now run natively within an existing Hadoop cluster, eliminating costly data movement. This allows data scientists and analysts to apply popular algorithms such as PageRank, collaborative filtering and computer vision natively in Hadoop, rather than copying data elsewhere to run the analytics, which would lengthen data science cycles. Combined with MADlib's machine-learning algorithms for relational data, Pivotal HD becomes the world's leading advanced analytical platform for machine learning.
Hadoop In the Cloud: Pivotal HD Virtualized by VMware
You can use Hadoop Virtualization Extensions (HVE) plug-ins to make Hadoop aware of the virtual topology and scale Hadoop nodes dynamically in a virtual environment. Pivotal HD is the first Hadoop distribution to include HVE plug-ins, enabling easy deployment of Hadoop in your enterprise environment. With HVE, Pivotal HD can deliver truly elastic scalability in the cloud, augmenting on-premises deployment options.
Distributed Processing Solutions with Apache Hadoop: Spring Data
Spring Data makes it easier for your organization to build Spring-powered applications that use new data-access technologies such as non-relational databases, map-reduce frameworks, and cloud-based data services. Spring for Apache Hadoop simplifies developing big-data applications by providing a unified configuration model and easy-to-use APIs for using HDFS, MapReduce, Pig and Hive. It also provides integration with other Spring ecosystem projects such as Spring XD, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.
What Are Pivotal Data Computing Appliances?
Pivotal Data Computing Appliances (DCAs) provide your organization with pre-tested, pre-optimized and pre-configured Pivotal HD infrastructure. Delivered as modular systems, DCAs enable Pivotal HD to be delivered and at work within days—and DCAs can be easily scaled without disruption.
DCA-based HD deployments improve information availability, leveraging a fully redundant architecture.
Once installed, you can easily manage DCAs using the same Command Center console you use to manage Pivotal HD. When configured to do so, DCAs can report detailed system status to your data center management infrastructure using the Simple Network Management Protocol (SNMP) or to EMC support centers.
What’s Included in HAWQ?
HAWQ integrates the industry's first native, mature massively parallel processing (MPP) SQL query processor with Apache Hadoop. HAWQ enables you to leverage existing SQL-capable business intelligence and analytics tools, extract, transform and load (ETL) processes, and your workforce's SQL skills to simplify Hadoop-based data analytics development. This increases your team's productivity and helps you reduce costs. HAWQ's benefits include unprecedented query processing performance (a 100X improvement in query performance), as well as true, interactive and deep SQL processing and powerful analytics. Unlike new SQL-on-Hadoop entrants, Pivotal HAWQ's years of innovation have resulted in a rich, powerful SQL query optimizer and processor, optimized to run analytical queries and mixed query workloads in massively parallel, distributed environments.
Query Optimization: Cost-Based, Parallel Query Optimizer
Leveraging ten years of technology innovation in MPP SQL-based analytics, HAWQ’s cost-based query optimizer delivers unmatched query optimization. It can help you effortlessly find the optimal query plan for the most demanding of queries, including queries with more than 30 joins, decisively outperforming less mature SQL-based or SQL-like Hadoop alternatives.
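As a rough illustration of what cost-based planning means, the sketch below enumerates join orders for three hypothetical tables under an invented cost model and picks the cheapest. HAWQ's actual optimizer uses dynamic programming and real table statistics; every name, cardinality and selectivity here is made up.

```python
from itertools import permutations

# Invented cardinalities and pairwise join selectivities.
cards = {"fact": 1_000_000, "dim_date": 365, "dim_cust": 10_000}
sel = {frozenset(("fact", "dim_date")): 1 / 365,
       frozenset(("fact", "dim_cust")): 1 / 10_000,
       frozenset(("dim_date", "dim_cust")): 1.0}  # no predicate: cross join

def plan_cost(order):
    """Cost of a left-deep join order = sum of intermediate result sizes."""
    size, cost = cards[order[0]], 0
    joined = {order[0]}
    for rel in order[1:]:
        # Simplified: apply the most selective predicate available.
        s = min(sel[frozenset((rel, j))] for j in joined)
        size = size * cards[rel] * s
        cost += size
        joined.add(rel)
    return cost

best = min(permutations(cards), key=plan_cost)
print(best, plan_cost(best))
```

Even in this toy, an order that starts with the cross join of the two dimension tables costs orders of magnitude more than a plan that applies selective predicates early, which is the kind of difference a cost-based optimizer avoids automatically.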
Parallel Data Flow Framework: Dynamic Pipelining
Dynamic Pipelining™ is a parallel data flow framework technology that lets you combine and orchestrate the various steps in the execution of complex queries. It features an adaptive, high-speed user datagram protocol (UDP) interconnect, a runtime execution environment, a runtime resource management layer, and a seamless data partitioning mechanism. Tuned for big data, Dynamic Pipelining implements the operations that underlie all SQL queries and, even for very demanding queries on heavily utilized clusters, ensures that queries run to completion.
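One idea behind pipelined execution can be shown in miniature: operators stream rows to one another instead of materializing intermediate results between steps. The toy operators below are invented for illustration only; Dynamic Pipelining's actual implementation additionally spans many processes over a UDP interconnect.

```python
def scan(rows):
    # Producer: emits one row at a time instead of materializing the table.
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    # Streams through only the rows that satisfy the predicate.
    for row in rows:
        if predicate(row):
            yield row

def aggregate(rows, key, value):
    # Terminal operator: consumes the stream and builds grouped totals.
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

table = [{"region": "east", "sales": 10},
         {"region": "west", "sales": 5},
         {"region": "east", "sales": 7}]

# Stages are chained; each row flows through the whole pipeline without
# intermediate results being written out between operators.
result = aggregate(filter_stage(scan(table), lambda r: r["sales"] > 5),
                   key="region", value="sales")
print(result)  # {'east': 17}
```

Because no operator waits for its predecessor to finish the entire input, memory stays bounded and downstream work starts as soon as the first rows arrive.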
Hadoop Data from SQL: Xtension Framework
Pivotal Xtension Framework (PXF) is an external table interface in HAWQ that allows you to read data stored within the Hadoop ecosystem. External tables can be used to load data into HAWQ from Hadoop or to query Hadoop data in place without materializing it into HAWQ. PXF enables loading and querying of data stored in HDFS, HBase and Hive, and supports a wide range of data formats, including text, Avro, Hive, sequence file and RCFile formats, as well as HBase.
You can use PXF to apply HAWQ's statistical and analytical functions (e.g., MADlib) to HBase or Hive data, join in-database dimensions with HBase facts, leverage analytical capabilities on Hadoop data files of various kinds, and quickly ingest data into HAWQ for in-database processing and analytics.
PXF provides parallel data collaboration between HAWQ, Pivotal HD and Hadoop data-processing modules, creating a single, fast analytic workflow.
Big Data Capabilities and Enhanced Productivity: Advanced Analytics Functions
Analyzing big data efficiently requires massively parallel architectures like Hadoop. To take advantage of the computational capacity of MPP systems, statistical, mathematical and machine-learning algorithms must be refactored to run efficiently in a parallel environment. HAWQ includes a library of parallelized analytics algorithms (MADlib) to speed analytics development and execution, and also supports user-defined functions written in PL/R, PL/Python and PL/Java. In addition, Pivotal HD includes Apache Mahout, an open-source parallelized analytics library for MapReduce users. With HAWQ, you can apply the computational power of MPP and Hadoop to run compute-intensive statistical, mathematical and machine-learning calculations on HDFS data. Algorithms are built into the query processor as SQL commands, acting directly on data stored in HDFS. Compared to traditional approaches, HAWQ algorithms can often accelerate analytical computation by orders of magnitude.
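The refactoring described above usually means expressing an algorithm as partial aggregates that combine associatively. A minimal sketch with invented data, computing mean and variance the way an MPP engine could across two segments:

```python
def partial_stats(chunk):
    """Per-segment partial aggregate: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(a, b):
    """Partial aggregates combine associatively, which is what lets the
    computation run independently on many segments and be merged at the end."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(stats):
    n, s, ss = stats
    mean = s / n
    variance = ss / n - mean * mean   # population variance
    return mean, variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Pretend the data lives on two segments of an MPP cluster.
seg1, seg2 = data[:4], data[4:]
mean, var = finalize(merge(partial_stats(seg1), partial_stats(seg2)))
print(mean, var)  # 5.0 4.0
```

An algorithm written this way scales with the number of segments, because only the tiny partial-aggregate tuples cross the network, never the raw rows.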
What is Pivotal Analytics Workbench?
Pivotal Analytics Workbench is a test-bed cluster that consists of 1,000 hardware nodes with 24 petabytes of physical storage. This is the equivalent of nearly half of the entire written works of mankind, from the beginning of recorded history.
You can use the Pivotal Analytics Workbench to test Pivotal HD and HAWQ and certify it at scale, giving you confidence in large-cluster environments. We use the Pivotal Analytics Workbench to test the limits of scale-out infrastructure technology and redefine the models for applying big data analytics.
Pivotal Analytics Workbench is the result of collaboration between leading hardware and software vendors, including EMC, Intel, Mellanox Technologies, Micron, Seagate, Supermicro, Switch and VMware.
What Hadoop Frameworks does Pivotal HD Support?
Pivotal HD supports the following Hadoop systems and frameworks:
- Hadoop Distributed File System (HDFS) - HDFS is a Java-based file system that provides scalable and reliable data storage. With industry installations spanning thousands of nodes, HDFS has proven to be a solid foundation for any Hadoop deployment.
- MapReduce - MapReduce is a Hadoop framework for easily writing applications that process large amounts of unstructured and structured data in parallel, in a reliable and fault-tolerant manner. The framework is resilient to hardware failures, handling them transparently to user applications.
- Hive - Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop-compatible file systems. Its SQL-like interface gives users row-based storage, which, combined with compression, yields an improved compression ratio for stored data.
- Mahout - Mahout is a library of scalable machine-learning algorithms. Mahout's core algorithms for recommendation mining, clustering, classification and batch-based collaborative filtering are implemented on top of Hadoop using the MapReduce paradigm, and the number of implemented algorithms continues to grow.
- Pig - Pig is a procedural language for processing large, semi-structured data sets using the Hadoop MapReduce platform. It enables developers to write MapReduce jobs more easily by providing an alternative to programming directly in Java.
- HBase - HBase is a distributed, versioned, column-oriented storage platform that provides random real-time read/write access to big data for user applications.
- YARN - Hadoop YARN ("Yet Another Resource Negotiator") is a framework for job scheduling and cluster resource management. It facilitates writing arbitrary distributed processing frameworks and applications, freeing framework developers to work on their frameworks rather than infrastructure details. YARN is a subproject of Apache Hadoop.
- ZooKeeper - ZooKeeper is a highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to key configuration information.
- Oozie - Oozie is a workflow scheduler system for managing Hadoop jobs such as Java MapReduce, Streaming MapReduce, Pig and Hive. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
- Flume - Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic application.
- Sqoop - Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases and enterprise data warehouses. You can use Sqoop to import/export data from/to external structured datastores in/out of Hadoop Distributed File System or related systems like Hive and HBase.
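The MapReduce model listed above can be illustrated with a toy, single-process word count that mimics the map, shuffle and reduce phases. This sketch uses plain Python with invented inputs, not the Hadoop APIs.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word, independently per input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine all values for a key into a final count.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

In Hadoop, each split is mapped on the node that stores it and the shuffle moves data over the network; the fault tolerance the list describes comes from re-running failed map or reduce tasks on other nodes.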
Why Scale-Out Network-Attached Storage with Pivotal HD?
Pivotal HD provides a supported big data storage and analytic solution with EMC Isilon that increases reliability and reduces risk. Isilon OneFS is the first and only proven, enterprise scale-out network-attached storage (NAS) platform that natively integrates the Hadoop Distributed File System (HDFS) protocol.
Isilon is engineered for Hadoop in the enterprise. Built-in, proven enterprise data protection provides an always-on Hadoop data environment. With automatic load balancing, Isilon removes the need for data staging. Isilon also distributes the NameNode to provide high availability and load balancing, with no single point of failure. The FlexProtect feature in the Isilon OneFS operating system protects against data loss at a level beyond any other storage solution and provides N+4 data protection.
The integration of Pivotal HD and Isilon gives organizations the flexibility to keep large content files stored in Isilon while using their Pivotal HD cluster(s) for processing-intensive analytical and transactional workloads. You can build an efficient scale-out, shared-storage infrastructure that grows with your data, and add a high-performance analytics platform that improves insight.
As a result, your organization gains the ability to leverage data to make better decisions, faster, and with less risk. With EMC’s Isilon storage systems, your enterprise can achieve 80 percent utilization for greater storage efficiency, while eliminating the resource-intensive import/export of data into Hadoop.
PIVOTAL HD Technology
GemFire XD is the industry's premier "in-memory with big data" relational OLTP data store, combining the power of storing and processing data in memory with scale-out persistence to Pivotal HD. GemFire XD, which supports ANSI SQL, allows the creation of linearly scalable, highly available, elastic applications with high throughput and low latency, designed to run at cloud scale. GemFire XD is a Java-based product that runs in a stock JVM, yet its enterprise-class technology allows servers to host hundreds of gigabytes of data in memory in a single process without incurring the penalties usually associated with JVM garbage collection. This makes it possible to create and manage large volumes of in-memory data for transactional applications. GemFire XD also allows the execution of parallel stored procedures on large volumes of data, reducing network I/O and allowing everything from MapReduce-like functions to arbitrary behavior execution to run efficiently on the data.
GemFire XD is closely integrated with other elements of the Pivotal data stack, including Pivotal HD, and Pivotal Advanced Database Services (also known as HAWQ) through the Pivotal Extensions Framework (PXF). Data written into HDFS from GemFire XD can be consumed by other elements of the Hadoop ecosystem.
Powered by GemFire
GemFire XD leverages a decade of R&D that made GemFire the data grid of choice for some of the biggest enterprises on the planet. It provides support for:
- Highly optimized in-memory data management
- Split brain detection
- Group membership management
- Highly optimized metadata management
- Extreme data volume support in-memory and on disk
- High availability through a variety of techniques
- WAN replication capabilities
- In-memory stored procedure support
- Scalable management and monitoring framework
Pivotal HD as the scale-out persistence layer
With GemFire XD, applications can rely on the in-memory scale-out capabilities of GemFire and the robust, proven disk persistence of HDFS, provided by Pivotal HD. The ability to persist application data to Hadoop in near real time enables a new class of applications that combine transaction processing and analytics processing in a single deployment cluster comprising Pivotal HD, GemFire XD and HAWQ.
Off-Heap Storage to increase in-memory density
GemFire XD includes patent-pending technology from GemFire that allows servers to host hundreds of gigabytes of data in memory in a single process without incurring the penalties usually associated with JVM garbage collection (GemFire XD is a Java-based product that runs in a stock JVM). This makes it possible to create and manage large volumes of in-memory data for transactional applications. Off-Heap Storage keeps long-lived data outside the reach of the Java garbage collector, making the most efficient use of the heap to deliver higher throughput and responsiveness.
PXF interface to Pivotal HAWQ
GemFire XD supports a PXF connector that allows Pivotal HAWQ to query data written to HDFS by GemFire XD. The connector relies on GemFire XD's Hadoop InputFormat/OutputFormat implementations to run HAWQ queries. The ability to run HAWQ queries on post-transactional data, in addition to running parallel MapReduce jobs on it, lets users build closed-loop applications that adjust the transactional application based on analysis performed by MapReduce and HAWQ.
Data-aware Java Stored Procedures
GemFire XD allows applications to run both data-aware and data-independent stored procedures on the cluster. These stored procedures run in parallel and return results to the caller, allowing both synchronous and partially synchronous behavior execution in the data grid. They deliver extremely high throughput because they execute on data that is typically already in memory.
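Data-aware routing can be sketched as follows: rows are hash-partitioned by a key, and a procedure executes only against the partition that owns its key rather than being broadcast to every server. The partitioning scheme and data below are invented for illustration and do not reflect GemFire XD's internals.

```python
import zlib

N_PARTITIONS = 2

def partition_for(key):
    # Stable hash partitioning (scheme invented for this sketch).
    return zlib.crc32(key.encode()) % N_PARTITIONS

# Distribute rows across partitions by customer key.
rows = [{"cust": "acme", "amount": 120},
        {"cust": "acme", "amount": 80},
        {"cust": "globex", "amount": 50}]
partitions = {p: [] for p in range(N_PARTITIONS)}
for row in rows:
    partitions[partition_for(row["cust"])].append(row)

def total_for(cust):
    """Data-aware procedure: runs only against the partition that owns
    this key, instead of touching every partition in the cluster."""
    local = partitions[partition_for(cust)]
    return sum(r["amount"] for r in local if r["cust"] == cust)

print(total_for("acme"))    # 200
print(total_for("globex"))  # 50
```

In a real grid each partition lives on a different server, so shipping the procedure to the data (rather than the data to the procedure) is what eliminates most of the network I/O.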