Hadoop, MongoDB, and Data Warehouse - Selecting The Right Tool for The Job
March 07, 2012
By Serhiy Verovka, Abiliton System Architect at SoftServe Inc.
What is Big Data?
The term Big Data has been around for a while, but in recent years more and more enterprises and internet services have turned their attention to it. But what is Big Data, and how big is it?
The notion of big is relative. There is no size threshold that defines big and non-big data. It all depends on the organization that is dealing with the data. For some organizations, terabytes can define big. For others, petabytes or more. Big Data is not a precise term. It is rather a characterization of the process and problem of continuously accumulating more and more data. Its main characteristics are:
- Big volume
- Big variety of data sources and formats
- Big rate of volume increase and data change
Big data is also characterized by technology implementation challenges of scope and scale. Datasets grow so large they become more difficult to work with. Related challenges in storage, search, sharing, analytics, and visualization also emerge.
Some organizations are trying to collect as much data as possible. They try to measure every aspect related to each transaction and operation. The problem is not in the storage of big volumes of data. The situation is quite the opposite: storage becomes cheaper every year and allows organizations to store larger and larger volumes of data. The problem is how to extract real value out of the increasing volumes of data.
Other organizations are dealing with the seemingly unlimited demand of Internet consumers for popular new web and mobile services. Some of the largest social network applications and Internet services have successfully moved away from SQL-based data management. The leading open-source platform in this space is Hadoop. Large-scale content and document management platforms have also emerged to deal with rapidly growing demand for storage and retrieval capacity. Leading platforms in this space include MongoDB and CouchDB.
Hadoop is a scalable, fault-tolerant, distributed data storage and processing system. It is designed to store terabytes and even petabytes of data on commodity hardware and to process data at that scale. Hadoop automatically detects and recovers from hardware, software and system failures. The two base components of the Hadoop ecosystem are:
- HDFS is a distributed file system that provides high-throughput access to application data. HDFS is organized into a cluster of servers where each server stores a small fragment of the complete data set, and each piece of data is replicated on more than one server.
- Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters.
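The MapReduce model described above can be illustrated with a minimal single-process sketch. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase) and the sample input are assumptions made for illustration. On a real cluster each input split would be a block stored in HDFS, and the map and reduce steps would run in parallel on different servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a framework
    performs between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits standing in for blocks of a large file
splits = ["big data is big", "data warehouse stores data"]
counts = word_count(splits)
print(counts)
```

Because each map call sees only its own split and each reduce call sees only one key, both phases can be spread across many machines, which is what lets this style of processing scale with the size of the cluster.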
A data warehouse is a central repository for all enterprise data. Data can be collected from various systems inside an organization. Usually a data warehouse is hosted on a relational database. It differs from the traditional relational databases used in OLTP systems in that it is designed purely for reporting and analytics. In large enterprises with a wide set of business lines, departments, tools and data formats, it is often very difficult to trust the data. That trust is critical for analysts and executives, who usually work with a high-level view of the enterprise's data. In this case the data warehouse plays the role of "a single version of truth".
Basic attributes of data warehouses are:
- Wide ranges of historical data
- Consolidated, conformed and valid data
- De-normalized data structures (usually multidimensional)
- Fast report response time
- Flexible ad-hoc analysis
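As a rough illustration of a multidimensional structure and ad-hoc analysis, the sketch below builds a tiny star schema in an in-memory SQLite database: one fact table surrounded by two dimension tables. The table names, columns, and sample rows are all hypothetical, and SQLite merely stands in for a real warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    date_id INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);
INSERT INTO dim_date VALUES (1, 2011, 'Q4'), (2, 2012, 'Q1');
INSERT INTO dim_product VALUES (10, 'books'), (20, 'music');
INSERT INTO fact_sales VALUES
    (1, 10, 100.0), (1, 20, 50.0), (2, 10, 70.0), (2, 10, 30.0);
""")

# Dimension attributes an analyst may slice by (whitelisted, since the
# column name is interpolated into the SQL text)
ALLOWED_DIMENSIONS = {"year", "quarter", "category"}


def sales_by(dimension):
    """Ad-hoc rollup: total sales grouped by one dimension attribute."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    query = f"""
        SELECT {dimension}, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d ON f.date_id = d.date_id
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY {dimension}
        ORDER BY {dimension}
    """
    return conn.execute(query).fetchall()


by_year = sales_by("year")
by_category = sales_by("category")
print(by_year)
print(by_category)
```

The same fact table answers both questions; only the grouping attribute changes. That is the sense in which a dimensional layout supports flexible analysis without a new data model for every report.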
In addition to traditional row-based relational databases, data warehouses now support alternative technical approaches to storing and accessing data. Two trends have been evolving very rapidly in recent years:
- Column-oriented RDBMSs, which store the data by columns; this allows high compression ratios and improved I/O performance
- In-memory data storage engines, which allow queries to run in memory without touching slow disk systems
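To see why storing data by columns helps compression and I/O, consider the minimal sketch below. It pivots a few hypothetical rows into columns and applies run-length encoding, one simple compression technique chosen here for illustration; real column stores use a range of encodings, and the sample data and helper names are assumptions.

```python
from itertools import groupby

# Hypothetical row-oriented records: (day, country, amount)
rows = [
    ("2012-03-01", "US", 10),
    ("2012-03-01", "US", 20),
    ("2012-03-01", "UA", 5),
    ("2012-03-02", "UA", 7),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["day", "country", "amount"])

# Values within one column tend to repeat, so they compress well
encoded_day = run_length_encode(columns["day"])
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column reads only that column's values
total = sum(columns["amount"])

print(encoded_day)
print(encoded_country)
print(total)
```

In a row-oriented layout the same aggregate would have to read every field of every row, while here the day and country columns are never touched.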
The main approaches in DW design are Inmon's style (Top-Down) and Kimball's style (Bottom-Up). In the Top-Down approach, the data warehouse is designed for the entire enterprise and then propagated to meet the needs of individual departments. The Bottom-Up approach recommends building data marts for small tasks (such as the needs of a single department or service) and then integrating them into the enterprise data warehouse.
As always in real life, there is no silver bullet, and selecting the right tool for each task is the key. In this short article we have tried to cover the main defining characteristics of Hadoop, MongoDB, and more traditional data warehouses.