To Hadoop or Not

Mention “Big Data” or “Data Analytics” today and you will almost certainly hear of “Hadoop”. Hadoop is often positioned as “The Framework” that will solve all your Big Data needs. This framework from the Apache Software Foundation is open source, i.e. anyone can use it for free, and it is capable of handling huge data sets across a large cluster of commodity hardware. It is therefore popular among the development community.

Big Data, as the name suggests, refers to data that is huge, whether structured or unstructured. Examples include the enormous amounts of data on social media sites, generated by the day-to-day interactions of users of Facebook, Twitter, LinkedIn, etc. Such data includes chat messages, images, videos and other content that users of these applications create and share. Another source is IoT applications, for example industrial or manufacturing process readings such as temperature and pressure, which change continuously in real time and add up to huge data sets when measured over a period of time. Other examples include telecom usage data, space or weather data, stock trading data, etc.

Hadoop is handy if your needs are those of ETL (Extract, Transform and Load) operations. However, do not fall into the Hadoop trap unless you have a clear understanding of your business needs. Ask yourself the following questions before you decide to invest your time and money in implementing Hadoop.

  • Do you really have terabytes or petabytes of data to be processed? Hadoop was designed to handle data volumes of this scale. However, a report from Microsoft states that the majority of jobs process less than 100 GB of data. If your data size is below the terabyte range, you may not require Hadoop. Even if your data size runs to terabytes or more, do you really need to process all of it?
  • Are your data needs real time? If you expect data to be processed in real time, Hadoop is not the best tool to meet your needs. Hadoop takes some time to process data and is best seen as a very good batch processing tool. If your business is to interpret data in real time, such as tracking stock market movements to make immediate buy or sell decisions, Hadoop is not the answer.
  • Do you require a quick response? Your response time requirements need to be well understood. If your users are not willing to wait a minute for a response on large data sets, you may have to use other real time applications and not Hadoop.
  • Does your requirement involve complex and computation intensive algorithms? The MapReduce model in Hadoop is efficient at processing large volumes of data in parallel: large files are divided into smaller blocks that are stored and processed across machines. However, it is not apt for requirements that are computation intensive and involve a large number of intermediate steps, e.g. computing the Fibonacci series, where each value depends on the previous ones. A few machine learning algorithms also do not fit the MapReduce paradigm, so the pertinent question is whether the business relies heavily on specialised algorithms. If it does, the technology experts need to analyze whether the required algorithms are MapReducible.
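The difference between a MapReducible task and a sequential one can be illustrated with a minimal sketch. This is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`, `fibonacci`) are made up for illustration, and the "chunks" stand in for the file blocks that Hadoop would distribute across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) key-value pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key independently of other keys."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk is mapped on its own, so on a cluster the map calls
    # could run on different machines at the same time.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle_phase(mapped))


def fibonacci(n):
    """Sequential by nature: every step needs the result of the previous
    step, so the work cannot be split into independent chunks."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    print(word_count(["big data big", "data hadoop"]))
    print(fibonacci(10))
```

Word counting splits cleanly because no chunk needs to know anything about another chunk until the reduce step. The Fibonacci loop has no such independent pieces, which is why it is a poor fit for this model.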

Hadoop would be your choice when:

  • You want to transform largely unstructured or semi-structured data into usable form.
  • You want to analyze information of a huge set of data to obtain insights, and you have ample time for such analysis.
  • You have overnight batch jobs to process, e.g. daily transaction processing at credit card companies.
  • The insight gained from such data analysis remains applicable over a longer period of time, e.g. social behavior analysis on social sites, such as likes and affinity analysis, job suggestions based on browsing history, etc.
  • Your data fits naturally into key-value pairs, which Hadoop uses efficiently during its processing.

So… before you rush to select Hadoop as your framework, analyze your needs carefully. Though Hadoop is a free framework, implementing it takes effort, and the budget for implementation may not be that cheap.

Is the World a Better Place for Travel?

The worldwide economic slowdown, conflicts and terrorist attacks, and the European refugee crisis would all seem to have hurt the global travel & tourism industry. However, when we look at the actual trend, the industry has maintained its growth despite all the adversities. The IPK International World Travel Monitor report for 2015 shows 4.5% growth in actual outbound trips in 2015, with a healthy increase of 4.3% estimated for 2016.

Outbound travel is primarily fueled by Asia Pacific and North America. Germany is the ‘world travel champion’, the leading country in outbound travel. The United States follows Germany and continues to be both a leading source and a leading destination for travel. China also holds a leadership position in the travel industry, second only to the USA in terms of spending.

The European economy has improved slightly from 2014, primarily due to the growth of Germany. Overall, the forecast for Europe is a net growth of 2.8% in outbound travelers. While Europeans have maintained their momentum, they are likely to travel to safer destinations, avoiding zones of conflict and terrorism. There is also good growth in inbound travel to Europe, expected to be between 3% and 4% in 2016. Travelers from China and other Asia Pacific countries, the USA and Japan are all keen to travel to Europe.

[Figure: Outbound Travel Forecast, Percentage Growth]

Economic growth has slowed down to a certain extent in Asia Pacific but, despite the slowdown, the number of travelers has only increased, and the projected growth of 6.3% for outbound travelers is on expected lines. The IMF report for 2015 shows that India’s growth rate is the highest at 7.6%, ahead of China, and therefore travel from and to India is expected to increase.

While North America is expected to show good growth in outbound travel, South America’s 1.9% growth is a cause for concern. Brazil and Argentina account for about half of South America’s outbound travel market. Traditionally, South Americans who travel internationally do so within their own region. One international event that could boost international travel to Brazil is the Olympic Games, to be hosted in the city of Rio de Janeiro this year. The last Football World Cup, held in Brazil in 2014, drew more than half a million visitors to the country, and a similar trend is expected during the Olympics.

The Middle East travel market is one of the fastest growing markets, and countries like Saudi Arabia and the United Arab Emirates (UAE) are leaders in this area. The region is noted for travelers with deep pockets, who usually travel for long durations (with trips averaging more than 14 nights). Also, more than 30% of travelers are immigrants travelling to meet friends and relatives. Inbound travel to the Middle East has been seriously hurt by the ongoing conflicts in that zone.

Last but not least, social media plays a vital role in international travel today. Seventy percent of international travelers are active users of social media such as Facebook, Twitter, WhatsApp, LinkedIn and Google+. About 30% of international travelers actively use social media to plan their trips. Marketers would do well to find innovative approaches to influence this section of buyers as they plan their trips.

Do Not Let Your Data Kill You – The Need for 3 R’s – Reduce, Recycle and Reuse

As the saying goes, anything in excess is a waste. Isn’t that true of information today? Information or “data”, the four-letter word that best represents the digital world, has overwhelmed you, me and everyone in this space. Data in this form goes by various names: the more popular “Big Data”, large or complex data, humongous data, etc.

On average, company data has been growing at a rapid pace, about 100% or more every year. Also, with users of social media being overactive, data transactions have multiplied manifold in real time. Though technical advances are being made to store this data in large repositories, there is a need to derive context, i.e. meaningful information, so as to Reduce, Recycle and Reuse data. For example, companies would like to use their data to understand and interpret information such as employee interactions, communications and client engagements. Data that is not used but occupies useful repository space is a costly waste and needs to be eliminated. Regulations require one to use data to create intelligent and statutory reports that can be audited easily if need be. The 3 R’s, put into practice, improve data management in a business environment:

Reduce: Regulatory requirements for data, e.g. PCI data storage requirements or other information governance or compliance standards, require one to be circumspect before planning to reduce data. This makes cleaning up data a challenge: large volumes of unused data accumulate, and users also save data in local repositories that the IT team subsequently backs up.

So how do I reduce unused data? A Document Retention Policy is the first proactive step that any company can adopt to ensure that only appropriate data is maintained. It specifies the criteria for holding or removing data, the process governing such a decision, and the relevant owners who implement and oversee it. With a policy in place, the discipline to actually implement it enables a large reduction in unused data.
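As a rough illustration of what the "criteria for holding or removing data" might look like once written down, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Document` fields, the categories, and the retention periods in `RETENTION_DAYS` are hypothetical, and real values would come from your own policy and the regulations that apply to you.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Document(NamedTuple):
    # Hypothetical fields, chosen only for this illustration.
    name: str
    category: str
    last_used: date
    legal_hold: bool = False


# Hypothetical retention periods in days per category.
RETENTION_DAYS = {"financial": 7 * 365, "email": 2 * 365, "scratch": 90}


def is_removable(doc, today, retention_days=RETENTION_DAYS):
    """Return True only if the policy clearly allows removing the document."""
    if doc.legal_hold:
        return False  # data under a hold is never removed
    limit = retention_days.get(doc.category)
    if limit is None:
        return False  # unknown category: keep it and ask the data owner
    return today - doc.last_used > timedelta(days=limit)


if __name__ == "__main__":
    today = date(2016, 1, 1)
    docs = [
        Document("tmp_export.csv", "scratch", date(2015, 1, 1)),
        Document("ledger_2015.xls", "financial", date(2015, 1, 1)),
        Document("dispute_mail.eml", "email", date(2010, 1, 1), legal_hold=True),
    ]
    for doc in docs:
        print(doc.name, "removable" if is_removable(doc, today) else "keep")
```

The sketch defaults to keeping data whenever the policy is silent, which reflects the need to be circumspect about regulatory requirements before deleting anything.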

Recycle: Regulatory reporting is an important aspect for many industries. For example, in the US, health industry reports are mandatory, not only for the companies but also for the patients, and the industry is well regulated. Taxation and financial obligations also require statutory reporting and audits. It is important for the data to be recycled and processed into useful reports for the auditors and the statutory authorities. Usually, intelligent software and ETL techniques help in recycling such data.

Reuse: The most interesting part of data management is the reuse of data. The world of Business Analytics and Business Intelligence offers options for deriving business insights from large data sets and intelligently reusing data. A new science, “Data Science”, has evolved in its own right and has been championed by the Harvard Business Review. The HBR article by Thomas H Davenport and D J Patil in fact refers to the job of a data scientist as the “sexiest job of the 21st century”.

A few terms often used for reuse of data are:

  • Data Science: This term loosely covers the combination of computer science, analytics, statistics, and data modeling. While some companies have developed their own courses or certifications, it still needs to mature as a science with comprehensive tenets and elaborate literature.
  • Smart data: Smart data is usually a subset of Big Data, with the noise filtered out. While Big Data can be characterized by its attributes of variety, velocity and volume, smart data is usually characterized by velocity and value. Smart data is a key ingredient for intelligent BI reporting.
  • Predictive Analytics: This involves smart methodologies that apply machine learning techniques and statistical algorithms to data in order to predict future outcomes. Companies gain from predictive analytics by deriving or planning important outcomes, e.g. revenue or profit, from past data.
  • Real Time Analytics: Analytics served in real time, e.g. stock prices moving up or down, updates on page views, sessions, bounce rates and page navigation, advertisements dynamically adjusted based on the type and frequency of customer usage, etc.
  • Intelligent Decision Systems: The use of artificial intelligence in association with data helps users to arrive at the best, optimized decisions based on a large number of input variables. While this area is still evolving, it can be used in a number of ways, such as building marketing systems that make offers to customers based on profile analysis, blocking fraudulent transactions in credit card operations, etc.
  • Data Visualization: Pictorial or graphical representation of data, presented intelligently and interactively, helps business professionals to identify trends and patterns in their data, e.g. sales data by region or by customer profile.
  • Big Data Analytics: Reuse of data is not complete unless we use the term Big Data. The concept of Big Data analytics has spread from companies managing huge sets of data, such as oil or telecommunication companies, to social media such as Facebook, Twitter and LinkedIn, which also involve large data sets. This form of analytics helps us to uncover hidden patterns, market trends, customer preferences, unknown correlations, etc.

Business Data Analytics is therefore in its infancy, to be nurtured, developed and evolved over the years. The attraction is immense, and so is the job of the Data Scientist!