To Hadoop or Not

Mention “Big Data” or “Data Analytics” today and you will almost certainly hear of “Hadoop”. Hadoop is often positioned as “The Framework” that will solve all your Big Data needs. It is an open source framework from the Apache Software Foundation, i.e. anyone can use it for free, and it is capable of handling huge data sets across a large cluster of commodity hardware. This makes it popular among the development community.

Big Data, as the name suggests, refers to data that is huge in volume, whether structured or unstructured. Examples are the enormous amounts of data on social media sites, generated by the day-to-day interactions of the users of Facebook, Twitter, LinkedIn, etc. Such data includes the chat messages, images, videos, etc. that the users of these applications share. Other sources are IoT applications, for example process data from industries or manufacturing units, such as temperature and pressure readings, which keep changing in real time and add up to huge data sets when measured over a period of time. Still other examples are telecom usage data, space or weather data, stock trading data, etc.

Hadoop is handy if your needs are those of ETL (Extract, Transform and Load) operations. However, do not fall into the Hadoop trap unless you have a clear understanding of your business needs. Ask yourself the following questions before you decide to invest your time and money in implementing Hadoop.

  • Do you really have terabytes or petabytes of data to be processed? Hadoop was designed to handle data volumes of this scale. However, a report from Microsoft states that the majority of jobs process less than 100 GB of data. If your data size is below the terabyte range, you may not need Hadoop. Even if your data runs to terabytes or more, do you really need to process all of it?
  • Are your data needs real time? If you expect data to be processed in real time, Hadoop is not the best tool to meet your needs. Hadoop takes some time to process data and is a very good batch processing tool. If your business is to interpret data in real time, such as movements in the stock markets that drive immediate decisions to buy or sell stocks, Hadoop is not the answer.
  • Do you require a quick response? Your response time requirements need to be well understood. If your users are not willing to wait a minute or so for a response on large data sets, you may have to use real time applications and not Hadoop.
  • Does your requirement involve complex and computation intensive algorithms? MapReduce, the processing model in Hadoop, handles large volumes of data efficiently in parallel by dividing large files into smaller pieces and storing them across machines. However, it is not apt for requirements which are computation intensive and have a large number of intermediate steps that depend on one another, e.g. the computation of the Fibonacci series. A few machine learning algorithms also do not fit the MapReduce paradigm. The pertinent question, therefore, is whether the business relies heavily on specialised algorithms. If it does, the technology experts need to analyze whether the required algorithms are MapReducible.
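To make the last question more concrete, here is a minimal sketch in plain Python of what a "MapReducible" job looks like. It is not Hadoop code and runs on a single machine; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`, `fibonacci`) are illustrative assumptions, not part of any Hadoop API. The word count splits cleanly into independent chunks, while each Fibonacci step needs the result of the step before it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # No chunk depends on another, so each could be mapped on a different machine.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


def fibonacci(n):
    # Every step needs the previous step's result, so the loop
    # cannot be cut into independent chunks the way word_count can.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    print(word_count(["big data big", "data hadoop"]))
    print(fibonacci(10))
```

If your algorithm looks like `word_count`, it is a candidate for MapReduce. If it looks like `fibonacci`, splitting the data across machines will not help much.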

Hadoop would be your choice when:

  • You want to transform largely unstructured or semi-structured data into a usable form.
  • You want to analyze a huge set of data to obtain insights, and you have ample time for such analysis.
  • You have overnight batch jobs to process, e.g. daily transaction processing at credit card companies.
  • The insight gained from the analysis stays applicable over a longer period of time, e.g. social behavior analysis on social sites, such as likes and affinity analysis, or job suggestions based on browsing history.
  • Your data fits naturally into key-value pairs, which Hadoop uses efficiently during its processing.
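As an illustration of the key-value point, an overnight batch job over card transactions could be shaped roughly as below. This is a single-machine sketch in plain Python, not Hadoop code. The record layout `(card_id, date, amount_in_cents)` and the function name `daily_totals` are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def daily_totals(transactions):
    """Sum amounts per (card_id, date) key from (card_id, date, amount) records."""
    # Turn each record into a key-value pair: key = (card, day), value = amount.
    keyed = [((card, day), amount) for card, day, amount in transactions]
    # Sorting by key stands in for the grouping a framework does between map and reduce.
    keyed.sort(key=itemgetter(0))
    return {
        key: sum(amount for _, amount in group)
        for key, group in groupby(keyed, key=itemgetter(0))
    }


if __name__ == "__main__":
    sample = [
        ("card-1", "2024-01-01", 1000),
        ("card-2", "2024-01-01", 500),
        ("card-1", "2024-01-01", 250),
    ]
    print(daily_totals(sample))
```

Amounts are kept as whole cents here to avoid rounding issues. Once the data is expressed as key-value pairs like this, the per-key totals can be computed independently of one another.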

So… before you rush to select Hadoop as your framework, analyze your needs carefully. Though Hadoop is a free framework, implementing it takes effort and money, and the implementation budget may not be small.

Sudipta Choudhury
Sudipta Choudhury is a Technical / Business Writer with considerable industry experience in various domains. He can be reached at sudipta_choudhury@writeforvalue.com
