Big Data
MongoDB vs. Hadoop: Big Solutions for Big Problems
White Paper
In recent years there has been an explosion of data. IDC predicts that the digital universe will grow to 2.7 zettabytes in 2012, up 48% from 2011, and to 8 zettabytes by 2015 (one zettabyte is 10^21 bytes, or 1,000,000,000,000,000,000,000 bytes). More than half of this is unstructured data from social networks, mobile devices, web applications and other similar sources. Traditional RDBMS systems are data storage solutions that were designed decades ago, when most data was structured and had a very different character than today's. We need solutions that address Big Data problems and that were designed around these recent changes in the nature of data.(1) (2)
There are many solutions that address the Big Data problem. Some are tweaks or hacks around standard RDBMS products; others are independent data processing frameworks, data storage solutions, or both. A website lists many of them.(3)
According to Jaspersoft, there are two leading technologies in this new field. The first is Apache Hadoop, a framework that allows for the distributed processing of large data sets using a simple programming model. The other is MongoDB, an open-source, scalable, distributed NoSQL database. Both of these solutions have their strengths, weaknesses and unique characteristics.(4) (5) (6)
MongoDB and Hadoop are fundamentally different systems. MongoDB is a database, while Hadoop is a data processing and analysis framework. MongoDB focuses on storage and efficient retrieval of data, while Hadoop focuses on data processing using MapReduce. In spite of this basic difference, both technologies offer similar functionality: MongoDB has its own MapReduce framework, and Hadoop has HBase, a scalable database similar to MongoDB.(7) (8)
The main flaw in Hadoop is that it has a single point of failure, namely the “NameNode”. If the NameNode goes down, the entire system becomes unavailable. There are a few workarounds for this, but they involve manually restoring the NameNode.(9) Regardless, the single point of failure exists. MongoDB has no such single point of failure: if a primary, a config server, or any other node goes down at any point in time, a replicated resource can take over its responsibility automatically.(10) (11)
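As a hedged illustration of this failover behavior, the minimal pymongo sketch below connects a client to a replica set; the host names and the replica set name ("rs0") are assumptions for illustration. If the primary fails, the driver simply rediscovers the newly elected primary.

```python
# Hedged sketch: connecting to a MongoDB replica set with pymongo so that
# the driver fails over automatically if the primary goes down.
# Host names and the replica set name ("rs0") are assumptions.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://node1.example.com:27017,node2.example.com:27017,node3.example.com:27017",
    replicaSet="rs0",                     # driver discovers the current primary
    readPreference="secondaryPreferred",  # reads can fall back to secondaries
)

db = client["mydb"]
# Writes always go to the current primary; if it fails, the replica set
# elects a new primary and the driver reconnects to it transparently.
db.events.insert_one({"msg": "hello from a failover-aware client"})
```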
MongoDB supports rich queries like traditional RDBMS systems, and its queries are written in a standard JavaScript shell. Hadoop has two different components for writing MapReduce (MR) code: Pig and Hive. Pig is a scripting language (similar to Python or Perl) that generates MR code, while Hive is a more SQL-like language. Hive is mainly used to impose structure on data and provides a rich set of queries. Data has to be in JSON or CSV format to be imported into MongoDB. Hadoop, on the other hand, can accept data in almost any format. Hadoop structures data using Hive, but can easily handle unstructured data using Pig. With the help of Apache Sqoop, data can even be moved between relational databases and Hadoop.(12) (13) (14)
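As a hedged sketch of what such a “rich query” looks like in practice, the pymongo snippet below (the "orders" collection and its fields are hypothetical) filters, projects, sorts and limits in a single query, much as a SQL SELECT would:

```python
# Hedged sketch: a "rich" MongoDB query expressed through pymongo.
# The "orders" collection and its fields are hypothetical.
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Roughly equivalent to:
#   SELECT customer, total FROM orders
#   WHERE status = 'shipped' AND total > 100
#   ORDER BY total DESC LIMIT 10;
cursor = (
    orders.find(
        {"status": "shipped", "total": {"$gt": 100}},  # filter
        {"customer": 1, "total": 1, "_id": 0},          # projection
    )
    .sort("total", DESCENDING)
    .limit(10)
)

for doc in cursor:
    print(doc)
```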
What about transactions?
A database transaction is a logical unit of work that usually consists of one or more database operations.(15)
MongoDB has no transactions. Initially, that fact may be a concern for people coming from an RDBMS background, but transactions are somewhat obsolete in the age of the web. When dealing with distributed systems, long-running database operations and concurrent data contention, the concept of a “database transaction” may require a different strategy (see Jim Gray, “The Transaction Concept: Virtues and Limitations”).(16) MongoDB does have something “transaction-like” in that a database write can happen in a single blocking, synchronous “fsync”. MongoDB supports atomic operations to some extent: as long as the schema is structured correctly, you can have a reliable write for a single document.
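To make that concrete, here is a hedged pymongo sketch of an atomic single-document write (the "accounts" collection and its fields are hypothetical); the $inc below is applied atomically to one document, which is MongoDB's practical substitute for a short transaction:

```python
# Hedged sketch: an atomic single-document update in MongoDB.
# The "accounts" collection and its fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
accounts = client["bank"]["accounts"]

accounts.insert_one({"_id": "alice", "balance": 100})

# $inc is applied atomically to this one document: no other client can
# observe a partially applied update. Cross-document transactions are a
# different matter and are not shown here.
result = accounts.update_one(
    {"_id": "alice", "balance": {"$gte": 30}},  # only if funds are sufficient
    {"$inc": {"balance": -30}},
)
print("modified:", result.modified_count)
```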
1. http://www.smartercomputingblog.com/2012/03/21/finally-more-data-about-big-data/
2. http://en.wikipedia.org/wiki/Big_data
3. http://nosql-database.org/
4. http://nosql.mypopescu.com/post/20001178842/nosql-databases-adoption-in-numbers
5. http://hadoop.apache.org/
6. http://www.mongodb.org/
7. http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Overview
8. http://hbase.apache.org/
9. http://wiki.apache.org/hadoop/NameNode
10. http://www.mongodb.org/display/DOCS/Replica+Sets
11. http://www.mongodb.org/display/DOCS/Configuring+Sharding#ConfiguringSharding-ConfigServers
12. http://pig.apache.org/
13. http://hive.apache.org/
14. http://sqoop.apache.org/
15. http://en.wikipedia.org/wiki/Database_transaction
16. http://en.wikipedia.org/wiki/Database_transaction
MongoDB (written in C++) manages memory more cost-efficiently than Hadoop's HBase (written in Java). While Java garbage collection (GC) can, in theory, be as CPU- and time-efficient as unmanaged memory, it requires 5-10x as much memory to do so, and in practice GC imposes a large performance cost on these kinds of large-scale distributed systems. The two systems also take different approaches to space utilization: MongoDB pre-allocates space for storage, which improves performance but wastes space, while Hadoop optimizes space usage but ends up with lower write performance than MongoDB.
Hadoop is not a single product, but rather a software family. Its common components are the following:
• Pig, a scripting language used to quickly write MapReduce code that handles unstructured sources
• Hive, used to impose structure on the data
• HCatalog, used to provide interoperability between these internal systems (15)
• HBase, essentially a database built on top of Hadoop
• HDFS, the underlying file system for Hadoop (16)
MongoDB is a standalone product with supported binaries. The learning
curve for MongoDB is generally lower than that of Hadoop.
Recently, there has been a lot of talk about security with NoSQL databases. Both MongoDB and Hadoop have basic security: MongoDB has simple authentication with MD5-hashed passwords, and Hadoop offers fairly rudimentary security in its various frameworks. This is not a flaw. NoSQL databases like MongoDB were never designed to handle security; they were designed to handle Big Data efficiently, which they do. It is simple to implement security in your application instead of expecting it from your data solution. More on this can be found here.(17)
For data processing and data analysis, there is almost no technology, open source or otherwise, that beats Hadoop. Hadoop was designed specifically to address this problem, and it therefore contains all the components necessary to rapidly process terabytes to petabytes of information. Writing MapReduce code in Hadoop is straightforward: Pig is easy to learn and makes it uncomplicated to write user-defined functions. MongoDB has its own MapReduce framework which, though inferior to Hadoop's, does the job well. When it comes down to sheer numbers, Hadoop is ahead: Yahoo has a 4,000-node Hadoop cluster and aims to test a 10,000-node cluster soon.(18) MongoDB is typically used in clusters of around 30-50 shards, with a 100-shard cluster under testing. MongoDB is typically used with systems holding less than approximately 5 TB of data, while Hadoop has been used for systems larger than 100 TB, including systems containing petabytes of data.
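To make “MongoDB has its own MapReduce framework” concrete, here is a hedged sketch using PyMongo 3.x (Collection.map_reduce was removed in PyMongo 4; the "pages" collection, its fields and the output collection name are all hypothetical):

```python
# Hedged sketch of MongoDB's built-in MapReduce via PyMongo 3.x.
# The "pages" collection and its "domain" field are hypothetical.
from bson.code import Code
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["crawl"]["pages"]

# Map and reduce functions are written in JavaScript and run inside mongod.
mapper = Code("function () { emit(this.domain, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Results are written to the "pages_per_domain" collection.
result = pages.map_reduce(mapper, reducer, "pages_per_domain")
for doc in result.find().sort("value", -1).limit(5):
    print(doc["_id"], doc["value"])
```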
What is MapReduce? Why do we want it?
MapReduce is a framework for processing highly distributable problems across huge datasets. It divides the basic problem into a set of smaller, manageable tasks and assigns them to a large number of computers (nodes). An ideal MapReduce task is too large for any one node to process, but can be accomplished efficiently by multiple nodes. MapReduce is named for the two steps at the heart of the framework:
• Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the result back to its master node. There can be multiple levels of workers.
• Reduce step: The master node collects the results of all the sub-problems, combines them into groups based on their keys, and assigns the groups to worker nodes called reducers. Each reducer processes its values and sends the result back to the master node.
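As a concrete, hedged sketch of these two steps, here is a word count in the style of Hadoop Streaming (the word-count task and file layout are illustrative; in Streaming, Hadoop pipes raw input lines to the mapper's stdin and the sorted mapper output to the reducer's stdin):

```python
#!/usr/bin/env python
# Hedged sketch of the Map and Reduce steps as Hadoop Streaming scripts.
# Hadoop feeds input splits to mapper() via stdin and the sorted, grouped
# mapper output to reducer() via stdin; the word-count task is illustrative.
import sys


def mapper():
    # Map step: break each input line into (word, 1) pairs.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Reduce step: input arrives sorted by key, so we can sum counts
    # for one word at a time and emit (word, total).
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")


if __name__ == "__main__":
    # Run with argument "map" for the map step, anything else for reduce.
    mapper() if sys.argv[1:] == ["map"] else reducer()
```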
MapReduce can be a huge help in analyzing and processing large chunks of data, for example:
• buying-pattern analysis and customer usage and interest patterns in e-commerce
• processing the large amounts of data generated in science and medicine
• processing and analyzing security data, credit scores and other large datasets in the financial industry
These and many other uses make MapReduce an indispensable tool in the software industry.
15. http://incubator.apache.org/hcatalog/
16. http://hadoop.apache.org/hdfs/
17. http://www.darkreading.com/blog/232600288/a-response-to-nosql-security-concerns.html
18. http://developer.yahoo.com/blogs/hadoop/posts/2008/09/scaling_hadoop_to_4000_nodes_a/
There are several use cases of both systems.
Hadoop
• Log processing (the best use case) - Log files are usually very large, and there are typically lots of them, which creates huge amounts of data. A single machine may not be able to process them efficiently. Hadoop is the best answer to this problem: splitting the logs into smaller, workable chunks and assigning them to workers results in very fast processing.
• ETL - For unstructured data streaming in real time, say from a web application, Hadoop is a good choice for structuring the data and then storing it. Additionally, Hadoop provides ways to pre-process data prior to structuring it.
• Analytics - Various organizations use Hadoop's superior processing capabilities to analyze huge amounts of data. The most famous example is Facebook's 21 PB, 2,000-node cluster, which Facebook uses for several tasks.
• Genomics - Scientists working in this field constantly need to process DNA sequences. These are very long and complicated strands of information that require large amounts of storage and processing. Hadoop provides a simple, cheap solution to their processing problem.
• Machine learning - Apache Mahout is built on top of Hadoop and works along with it to facilitate tasks such as targeted recommendations in e-commerce.
MongoDB
• Archiving - Craigslist uses MongoDB as efficient storage for archives. They still use SQL for active data, while they archive most of their data in MongoDB.
• Flexibility in data handling - With data in a non-standard format (like photos), it is easy to leverage MongoDB's unstructured, document-oriented storage, in addition to its scaling capabilities.
• E-commerce - OpenSky had a very complicated e-commerce model. It was very difficult to design the correct schema for it, and even then the tables had to be altered many times. MongoDB allows them to alter the schema as and when required, and to scale at the same time.
• Online website data storage - MongoDB is very good at real-time inserts, updates and queries, and provides the scalability and replication required for very large websites.
• Mobile systems - MongoDB is a great choice for mobile systems due to its geospatial indexes; a small sketch follows this list.
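As a hedged sketch of that geospatial support (the "places" collection, the coordinates and the choice of a 2dsphere index are assumptions for illustration), pymongo can index documents by location and answer "what is near me?" queries directly:

```python
# Hedged sketch: a geospatial "near me" query with pymongo.
# The "places" collection, its documents and the coordinates are hypothetical.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
places = client["mobile"]["places"]

# Store locations as GeoJSON points and index them.
places.insert_one({
    "name": "Coffee shop",
    "location": {"type": "Point", "coordinates": [-78.90, 35.99]},  # [lng, lat]
})
places.create_index([("location", GEOSPHERE)])

# Find places within roughly 2 km of the user's current position.
nearby = places.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-78.91, 36.00]},
            "$maxDistance": 2000,  # meters
        }
    }
})
for place in nearby:
    print(place["name"])
```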
Sometimes it is helpful to use MongoDB and Hadoop together in the same system. For instance, for a reasonably large system that requires rich queries with indexes and effective retrieval, we can leverage the best qualities of both systems.
Why NoSQL?
Relational databases have been used for decades. NoSQL databases are radically different and less mature. Relational databases can scale and handle Big Data; the difference is that relational databases require more effort to scale. Facebook uses MySQL-based database storage. With 800 million users and ever-growing numbers, Facebook is the best example of how relational databases can scale.(17)
The question arises: why do we need one of these new NoSQL databases? Is it to scale? Clearly not; relational databases have been part of large-scale systems for years. The issue is the amount of work needed to make them scale. For example, Twitter began as a relational database accessed by a Rails application. Amazon began as a relational database accessed by a C++ application. Facebook also began as a relational database accessed with PHP. While these giants have been successful at scaling, their success has not come easily. They had to make many tweaks and changes, and even implement a few software and hardware tricks, to achieve it.
Consider the simple act of “sharding” and what would be required to shard a set of customer records on a traditional RDBMS. What about the related data? How do you handle an unusual number of orders landing on the node for customers whose names begin with the letter S? You can shard an RDBMS, but a lot of man-hours will go into what a NoSQL database can handle automatically.
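For comparison, here is a hedged sketch of what that “automatic” sharding setup looks like in MongoDB (database name, collection name and shard key are assumptions, and a running sharded cluster with config servers and a mongos router is presumed):

```python
# Hedged sketch: enabling sharding on a collection through a mongos router.
# Database name, collection name and shard key are hypothetical; a running
# sharded cluster (config servers + shards + mongos) is assumed.
from pymongo import MongoClient

# Connect to the mongos query router, not directly to a shard.
client = MongoClient("mongodb://mongos.example.com:27017")

# Tell the cluster to shard the database, then the collection by customer_id.
client.admin.command("enableSharding", "crm")
client.admin.command("shardCollection", "crm.customers", key={"customer_id": 1})

# From here on, mongos routes inserts and queries to the right shard and
# the balancer migrates chunks between shards automatically.
client["crm"]["customers"].insert_one({"customer_id": 42, "name": "S. Smith"})
```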
What of ETL? Processing terabytes of data is possible using parallel queries, star schemas, disk layouts and other techniques. Using these traditional means, you can eke out as much CPU as possible and then push the data into separate databases for additional processing. However, it will still be hard to achieve, through this manual structuring, what you can achieve with Hadoop or a similar solution. It can be done, and has been done many times before, but it will not be as easy or as quick as with one of these modern automated solutions.
What about reliability? Oracle RAC is tried and true. However, it is expensive and difficult to work with on larger systems, and it requires abandoning many of the features of the Oracle RDBMS. Microsoft SQL Server also supports clustering, with a similar approach and similar limitations to Oracle's. MongoDB and some (but not all) of the other NoSQL solutions support reliability out of the box.
For a startup or a growing company, a scalable database that requires investing a lot of effort and time to make the storage solution scale is not an option. This is where NoSQL databases are the clear winner. For a larger company, with IT budget cuts and increasing pressure to make the best use of staff and of out-of-the-box solutions, NoSQL and Big Data solutions can no longer be ignored.
Hadoop + MongoDB
• The “Catching Osama” problem:
Problem: A few billion pages with geospatial tags require tremendous storage and processing. MongoDB alone fails to process the data effectively because of a single-thread bottleneck per node, while Hadoop alone has no indexes, so data retrieval takes longer than usual.
Solution: Store all images in MongoDB with geospatial indexes, then retrieve them using the Hadoop framework and assign processing to multiple mappers and reducers.
• Demographic data analysis: Whenever demographic data spread over a large geographical area needs to be processed, MongoDB's geospatial indexes are unmatched and it becomes the ideal storage, while Hadoop is essential for processing hundreds of terabytes of information.
Given an unstructured data source and the desire to store it in MongoDB, Hadoop can be used to structure the data first, which makes it easier to store in MongoDB. Hadoop can then be used repeatedly to process it later in large chunks.
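A hedged sketch of this pattern (the log format, field names and the "events" collection are all hypothetical): a Hadoop Streaming-style mapper turns raw, unstructured log lines into JSON, and a small loader then writes the structured records into MongoDB.

```python
# Hedged sketch: structure raw data with a Hadoop Streaming mapper, then
# load the structured output into MongoDB. Log format, field names and the
# "events" collection are hypothetical.
import json
import sys

from pymongo import MongoClient


def mapper():
    # Hadoop Streaming pipes raw log lines in on stdin; emit one JSON
    # document per line (e.g. "2012-05-01 alice login" -> structured record).
    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 3:
            print(json.dumps({"date": parts[0], "user": parts[1], "action": parts[2]}))


def load_into_mongodb(path):
    # Read the structured output (copied locally here for brevity)
    # and insert it into MongoDB.
    events = MongoClient("mongodb://localhost:27017")["analytics"]["events"]
    with open(path) as structured:
        docs = [json.loads(line) for line in structured if line.strip()]
        if docs:
            events.insert_many(docs)


if __name__ == "__main__":
    # Run with argument "map" as the Streaming mapper; otherwise treat the
    # argument as the path of structured output to load into MongoDB.
    mapper() if sys.argv[1:] == ["map"] else load_into_mongodb(sys.argv[1])
```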
Both MongoDB and Hadoop have their place in modern systems. MongoDB is rapidly becoming the choice for highly scalable operational systems and websites, while Hadoop offers exceptional features for applying MapReduce to Big Data and large analytical systems. Currently, both are displacing traditional RDBMS systems.
This white paper was written by Deep Mistry, Open Software Integrators. Open Software Integrators, LLC is a professional services company that provides consulting, training, support and course development.