Tag Archives: data

To Hadoop or Not

Mention “Big Data” or “Data Analytics” today and you will almost certainly hear of “Hadoop”. Hadoop is often positioned as “The Framework” that will solve all your Big Data needs. This framework from the Apache Software Foundation is open source, i.e. anyone can use it for free, and it is capable of handling huge data sets across a large cluster of commodity hardware. It is therefore popular among the development community.

Big Data, as the name suggests, refers to data that is huge, whether structured or unstructured. Examples are the enormous amounts of data on social media sites, generated by the day-to-day interactions of users of Facebook, Twitter, LinkedIn, etc. Such data includes chat information, images, videos, etc. shared by the users of these applications. Another source is IoT applications, e.g. process data from industries or manufacturing units such as temperature and pressure, which keeps changing in real time and results in huge data sets when measured over a period of time. Other examples are telecom usage data, space or weather data, stock trading data, etc.

Hadoop is handy if your needs are those of ETL (Extract, Transform and Load) operations. However, do not fall into the Hadoop trap unless you have a clear understanding of your business needs. Ask yourself the following questions before you decide to invest your time and money in implementing Hadoop.

  • Do you really have terabytes or petabytes of data to be processed? Hadoop was designed to handle data volumes of this scale. However, a report from Microsoft states that the majority of jobs process less than 100 GB of data. If your data size is below the terabyte range, you may not require Hadoop. Even if it is larger, do you really need to process all your data?
  • Are your data needs real time? If you expect data to be processed in real time, Hadoop is not the best tool for your needs. Hadoop takes some time to process data and is best suited to batch processing. If your business has to interpret data in real time, such as stock market movements that drive immediate decisions to buy or sell, Hadoop is not the answer.
  • Do you require a quick response? Your response time requirements need to be well understood. If your users are not willing to wait a minute or more for a response on large data sets, you may have to use other, real-time applications and not Hadoop.
  • Does your requirement involve complex and computation-intensive algorithms? The MapReduce model in Hadoop is efficient at processing large volumes of data in parallel, with large files divided into smaller blocks that are stored across machines. However, it is not apt for requirements that are computation intensive and involve a large number of intermediate steps that depend on one another, e.g. the computation of the Fibonacci series. A few machine learning algorithms also do not fit the MapReduce paradigm, so the pertinent question is whether the business requires heavy use of specialised algorithms. If so, the technology experts need to analyze whether the required algorithms are MapReducible.
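The difference between a MapReducible job and one that is not can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count, fibonacci) are assumptions made for this example. Word counting splits cleanly into independent chunks that emit key-value pairs, whereas each Fibonacci term depends on the two before it, so the steps cannot be handed out to separate machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk independently, then shuffle and reduce.

    In a real cluster each chunk could be mapped on a different machine;
    here the chunks are simply processed one after another.
    """
    pairs = []
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


def fibonacci(n):
    """Return the n-th Fibonacci number (fibonacci(0) == 0).

    Every iteration needs the result of the previous one, so the loop
    cannot be split into independent chunks the way word_count can.
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Calling word_count(["big data big", "data hadoop"]) counts each chunk separately before merging the totals, which is the shape of work MapReduce handles well; fibonacci offers no such split.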

Hadoop would be your choice when:

  • You want to transform largely unstructured or semi-structured data into a usable form.
  • You want to analyze a huge set of data to obtain insights, and you have ample time for such analysis.
  • You have overnight batch jobs to process, e.g. daily transaction processing at credit card companies.
  • The insight gained from such analysis remains applicable over a longer period of time, e.g. social behavior analysis on social sites, such as likes and affinity analysis or job suggestions based on browsing history.
  • Your data can be expressed as key-value pairs, which Hadoop uses efficiently during its processing.

So… before you rush to select Hadoop as your framework, analyze your needs carefully. Though Hadoop is a free framework, implementing it takes effort, and the budget for implementation may not be that cheap.

Brexit, Automation, Digital Age and Us

The Brexit referendum, in which the people of the United Kingdom sought to part ways with the European Union (EU), raises a few pointed questions about where the world is heading. The instability is compounded by visible cues from the Republican Party in the United States, whose projected leader, Donald Trump, seeks to stir up xenophobic feelings against migration and hatred for a religious minority.

  • Are we seeking to go back to our past to resurrect a state, where countries were isolated, and collaborations were limited?
  • In this connected world, where messages move almost at the speed of light, can isolation work?

In the last two decades, the internet has become deeply entrenched in our day-to-day lives. Technical advancements have been quick, and people in general have gained better control over, and exposure to, services. These advancements have affected some jobs, but have also created a whole range of new services and work areas. The Information Technology (IT) industry has therefore not only created numerous jobs, but also given ample options for reskilling and redistribution of labor.

One such advancement that is creating a buzz is “robots”. These were created and used by the Japanese, who were pioneers in research on Artificial Intelligence. More often than not, they used these robots as toys and in games. Robots did replace a few mundane and repetitive jobs in manufacturing as part of Flexible Manufacturing Systems, particularly in automobile companies like Toyota. That displacement, however, was not as far-reaching as recent experiments suggest. With U.S. and Japanese companies, along with Research & Development establishments, devoting considerable effort to creating humanoids and building intelligence, new developments termed “disruptive” have emerged, each of which could potentially remove millions of jobs. Take for instance the “driverless car” experiments from Google and a few auto companies in the USA. What happens to the jobs of drivers? Drones from the US Department of Defense have been operational for some time, particularly for reconnaissance and in war against terrorists, without “pilots” onboard.

Replacing labor in industries to gain operational efficiency and profitability has been in vogue for quite some time: from the introduction of machines to automate the textile industry, to automated systems and flexible manufacturing systems, to companies like IBM and Accenture moving to destinations abroad in search of cheaper techies in place of the expensive ones in the United States.

When it comes to robots, however, the cost advantage they bring to the table cannot be matched by any salaried employee; robots do not work for a salary. They are adept at doing repetitive jobs at no extra cost and without fatigue. Not surprisingly, three of the world’s largest employers, Foxconn, the US Department of Defense and Walmart, are replacing workers with robots, reports Business Insider. Foxconn, a key manufacturer for Google, Apple and Amazon, is the 10th largest employer in the world and has used robots to replace 60,000 workers. Citi and Oxford predict that about 77% of jobs in China and 57% of jobs across 34 OECD countries are at risk due to automation. The World Economic Forum estimates that by 2020, 5 million jobs could be lost globally. A form of automation dubbed Robotic Process Automation (RPA) is also being developed and implemented, which could potentially replace cheap BPO jobs as well. It is not just the workers who are at risk. Even highly educated professionals such as doctors or journalists could be at risk if Artificial Intelligence has its way. In fact, IBM has claimed to have developed a computer that can diagnose cancer better than doctors.

It is this undercurrent of technology, coupled with a long recession, that seems to have created a jingoistic attitude among the masses, one of holding on to whatever is available, or even grabbing what is not, and this has led to Brexit. It reminds one of famine, such as the Russian famine, which forced human beings into cannibalism in order to survive.

The neo-Luddites are back. The Luddites, an army of textile workers, protested against the machines introduced during the Industrial Revolution in England. The struggle seems to have continued ever since, albeit in a different form, in this Fourth Industrial Revolution of robots, RPA, Artificial Intelligence and digital technologies alongside human beings.

While robots usually create efficiency and profits for the companies they work for, they also require human support to manage and monitor them, and to maintain and replace them if need be. Artificial intelligence (AI) is, after all, “artificial”: it is developed entirely by humans. Each piece of intelligence built into a robot with AI requires development, testing and implementation by human beings. These programs also require inputs from data analysts, since any artificial intelligence requires a great deal of learning by the robot (humanoid). For example, just to make a robot understand what a “flower” is, it has to be fed a lot of comprehensive data. Recently, Google’s artificial intelligence program erred by tagging a Black man as a gorilla, due to inadequate data. The vast amount of data that Artificial Intelligence and digital technologies require is called Big Data in software parlance. This in turn requires new software, and technical staff to maintain such data and databases. Data Analyst is a new job profile created to define and analyze such data.

Each technological revolution leads to job losses for some and new jobs for others. It is important to reskill and keep oneself updated with skills that will be applicable in the transformed environment. If the number of jobs created is lower than the number lost, we are going to see more and more social adjustments like Brexit, or social unrest. Governments need to collaborate more on job creation, since only job creation for human beings will keep the social environment under control.

The future for us is CHANGE, so as to adjust to rapid automation. We need to learn to coexist with robots and to be productive in the age of robots.


Do Not Let Your Data Kill You – The Need for 3 R’s – Reduce, Recycle and Reuse

As the saying goes, anything in excess is a waste. Isn’t that true of information today? Information, or “data”, the four-letter word that best represents the digital world, has overwhelmed you, me and everyone in this space. Data in this sense goes by various names: the popular “Big Data”, large or complex data, humongous data, etc.

On average, companies’ data has been growing at a rapid pace, about 100% or more every year. Also, with social media users being so active, data transactions have multiplied manifold in real time. Though technical advances are being made to store this data in large repositories, there is a need to derive context, i.e. meaningful information, so as to Reduce, Recycle and Reuse data. For example, companies would like to use their data to understand and interpret information such as employee interactions, communications and client engagements. Data that is not used but occupies useful repository space is a costly waste and needs to be eliminated. Regulations require one to use data to create intelligent and statutory reports that can be audited easily if need be. Put into practice, the 3 R’s improve data management in a business environment:

Reduce:  Regulatory requirements for data, e.g. PCI data storage requirements or other information governance or compliance standards, require one to be circumspect before planning any reduction of data. This makes cleaning up data a challenge: it results not only in a large volume of unused data, but also in users saving data in local repositories, which the IT team subsequently backs up.

So how do I reduce unused data? A Document Retention Policy, specifying the criteria for holding or removing data, the process governing such decisions and the owners responsible for implementing and overseeing it, is the first proactive step any company can take to ensure that only appropriate data is maintained. With a policy in place, the discipline to actually implement it enables a large reduction in unused data.
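The criteria in such a policy can be written down as a simple rule, which makes them easier to apply consistently. The sketch below is a minimal illustration in Python, not a compliance tool: the Document fields, the categories and the retention periods in RETENTION_DAYS are all hypothetical, and real values would come from the company's own policy and the regulations that apply to it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    """A stored item as seen by the retention check (hypothetical fields)."""
    name: str
    category: str
    last_used: date
    legal_hold: bool = False


# Hypothetical retention periods in days for each category of data.
RETENTION_DAYS = {
    "financial": 7 * 365,
    "email": 2 * 365,
    "scratch": 90,
}


def should_remove(doc, today):
    """Return True only when the policy clearly allows removal.

    Documents under legal hold, or in a category the policy does not
    cover, are always kept so that an owner can review them.
    """
    if doc.legal_hold:
        return False
    limit = RETENTION_DAYS.get(doc.category)
    if limit is None:
        return False
    return today - doc.last_used > timedelta(days=limit)
```

Defaulting to "keep" for anything on hold or uncategorised reflects the caution that regulatory requirements call for: the rule removes data only when the policy explicitly permits it.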

Recycle:  Regulatory reporting is an important aspect for many industries. For example, in the US the health industry is well regulated, and health-related reports are mandatory not only for the companies but also for the patients. Taxation and financial obligations also require statutory reporting and audits. It is important for the data to be recycled and processed into useful reports for the auditors and the statutory authorities. Usually, intelligent software and ETL techniques help in recycling such data.

Reuse: The most interesting part of data management is the reuse of data. The world of Business Analytics and Business Intelligence offers options for deriving business insights from a large data set and reusing data intelligently. A new science, “Data Science”, has evolved in its own right and is actively advocated by the Harvard Business Review. The HBR article by Thomas H. Davenport and D. J. Patil in fact refers to the job of a data scientist as the “sexiest job of the 21st century”.

A few terms often used for reuse of data are:

  • Data Science: This term loosely covers a combination of computer science, analytics, statistics and data modeling. While the combination is loose, and some companies have evolved their own courses or certifications, it still needs to mature as a science with comprehensive tenets and elaborate literature.
  • Smart data: Smart data is usually a subset of Big Data, with the noise filtered out. While Big Data is characterized by its attributes of variety, velocity and volume, smart data is usually characterized by velocity and value. Smart data is a key ingredient for intelligent BI reporting.
  • Predictive Analytics: This involves smart methodologies, i.e. machine learning techniques and statistical algorithms, that use past data to predict future outcomes. Companies gain from predictive analytics by deriving or planning important outcomes, e.g. revenue or profit, from past data.
  • Real Time Analytics: Analytics served real time, e.g. stock prices moving up or down, updates on page views, sessions, bounce rates, page navigation, advertisements dynamically adjusted based on type and frequency of customer usage, etc.
  • Intelligent Decision Systems: The use of Artificial Intelligence in association with data helps users derive the best, optimized decisions based on a large number of input variables. While this area is still evolving, it can be used in a number of ways, such as building marketing systems that make offers to customers based on profile analysis, blocking fraudulent transactions in credit card operations, etc.
  • Data Visualization: Pictorial or graphical representation of data in an intelligent, interactive way helps business professionals identify trends and patterns in their data, e.g. sales data region-wise or by customer profile.
  • Big Data Analytics: Reuse of data is not complete unless we use the term Big Data. The concept of Big Data analytics has evolved from companies managing huge sets of data, such as oil or telecommunication companies, to social media such as Facebook, Twitter and LinkedIn that involve large data sets. This form of analytics helps us derive hidden patterns, market trends, customer preferences, unknown correlations, etc.
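As a small illustration of the Predictive Analytics item above, the sketch below fits a straight line to past values by ordinary least squares and extends it one step ahead. It is a deliberately simple stand-in for the machine learning and statistical techniques mentioned; the function names fit_line and predict, and the idea of numbering periods 1, 2, 3, …, are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    xs and ys are equal-length sequences of numbers, and xs must
    contain at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(xs, ys, x_new):
    """Estimate y at x_new from the line fitted to the past data."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept
```

With periods [1, 2, 3, 4] and revenues that grow by the same amount each period, predict simply continues the trend into period 5. Real business data is rarely this regular, so a fitted line is a starting point for planning rather than a forecast to rely on.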

Business Data Analytics is therefore in its infancy, to be nurtured, developed and evolved over the years. The attraction is immense, and so is the job of the Data Scientist!!!