PART ONE: The Rise of Big Data
CHAPTER 1
What Is Big Data and Why Does It Matter?
Perhaps nothing will have as large an impact on advanced analytics in the coming years as the ongoing explosion of new and powerful data sources. When analyzing customers, for example, the days of relying exclusively on demographics and sales history are past. Virtually every industry has at least one completely new data source coming online soon, if it isn't here already. Some of the data sources apply widely across industries; others are primarily relevant to a very small number of industries or niches. Many of these data sources fall under a new term that is receiving a lot of buzz: big data.
Big data is sprouting up everywhere, and using it appropriately will drive competitive advantage. Ignoring big data will put an organization at risk and cause it to fall behind the competition. To stay competitive, it is imperative that organizations aggressively pursue capturing and analyzing these new data sources to gain the insights they offer. Analytic professionals have a lot of work to do! It won't be easy to incorporate big data alongside all the other data that has been used for analysis for years.
This chapter begins with some background on big data and what it is all about. Then it will cover a number of considerations in terms of how an organization can make use of big data. Readers will need to understand what is in this chapter as much as or more than anything else in the book if they are to tame the big data tidal wave successfully.
WHAT IS BIG DATA?
There is not a consensus in the marketplace as to how to define big data, but there are a couple of consistent themes. Two sources have done a good job of capturing the essence of what most would agree big data is all about. The first definition is from Gartner's Merv Adrian in a Q1, 2011 Teradata Magazine article. He said, "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population."1 Another good definition is from a paper by the McKinsey Global Institute in May 2011: "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."2
These definitions imply that what qualifies as big data will change over time as technology advances. What was big data historically or what is big data today won't be big data tomorrow. This aspect of the definition of big data is one that some people find unsettling. The preceding definitions also imply that what constitutes big data can vary by industry, or even organization, if the tools and technologies in place vary greatly in capability. We will talk more about this later in the chapter in the section titled "Today's Big Data Is Not Tomorrow's Big Data."
A few interesting facts in the McKinsey paper help bring into focus how much data is out there today:
- $600 today can buy a disk drive that will store all of the world's music.
- There are 30 billion pieces of information shared on Facebook each month.
- Fifteen of 17 industry sectors in the United States have more data per company on average than the U.S. Library of Congress.3
THE "BIG" IN BIG DATA ISN'T JUST ABOUT VOLUME
While big data certainly involves having a lot of data, big data doesn't refer to data volume alone. Big data also has increased velocity (i.e., the rate at which data is transmitted and received), complexity, and variety compared to data sources of the past.
Big data isn't just about the raw size of the data. According to the Gartner Group, the "big" in big data also refers to several other characteristics of a big data source.4 These aspects include not just increased volume but increased velocity and increased variety. These factors, of course, lead to extra complexity as well. What this means is that you aren't just getting a lot of data when you work with big data. It's also coming at you fast, it's coming at you in complex formats, and it's coming at you from a variety of sources.
It is easy to see why the wealth of big data coming toward us can be likened to a tidal wave and why taming it will be such a challenge! The analytics techniques, processes, and systems within organizations will be strained up to, or even beyond, their limits. It will be necessary to develop additional analysis techniques and processes utilizing updated technologies and methods in order to analyze and act upon big data effectively. We will talk about all these topics before the book is done with the goal of demonstrating why the effort to tame big data is more than worth it.
IS THE "BIG" PART OR THE "DATA" PART MORE IMPORTANT?
It is already time to take a brief quiz! Stop for a minute and consider the following question before you read on: What is the most important part of the term big data? Is it (1) the "big" part, (2) the "data" part, (3) both, or (4) neither? Take a minute to think about it, and once you've locked in your answer, proceed to the next paragraph. In the meantime, imagine the "contestants are thinking" music from a game show playing in the background.
Okay, now that you've locked in your answer, let's find out if you got it right. The answer to the question is choice (4). Neither the "big" part nor the "data" part is the most important part of big data. Not by a long shot. What organizations do with big data is what is most important. The analysis your organization does against big data, combined with the actions taken to improve your business, is what matters.
Having a big source of data does not in and of itself add any value whatsoever. Maybe your data is bigger than mine. Who cares? In fact, having any set of data, however big or small it may be, doesn't add any value by itself. Data that is captured but not used for anything is of no more value than some of the old junk stored in an attic or basement. Data is irrelevant without being put into context and put to use. As with any source of data, big or small, the power of big data is in what is done with that data. How is it analyzed? What actions are taken based on the findings? How is the data used to make changes to a business?
Reading a lot of the hype around big data, many people are led to believe that just because big data has high volume, velocity, and variety, it is somehow better or more important than other data. This is not true. As we will discuss later in the chapter in the section titled "Most Big Data Doesn't Matter," many big data sources have a far higher percentage of useless or low-value content than virtually any historical data source. By the time you trim down a big data source to what you actually need, it may not even be so big anymore. But that doesn't really matter, because whether it stays big or ends up being small when you're done processing it, the size isn't important. It's what you do with it.
IT ISN'T HOW BIG IT IS. IT'S HOW YOU USE IT!
We're talking about big data of course! Neither the fact that big data is big nor the fact that it is data adds any inherent value. The value is in how you analyze and act upon the data to improve your business.
The first critical point to remember as we start into the book is that big data is both big and it's data. However, that's not what's going to make it exciting for you and your organization. The exciting part comes from all the new and powerful analytics that will be possible as the data is utilized. We're going to talk about a number of those new analytics as we proceed.
HOW IS BIG DATA DIFFERENT?
There are some important ways that big data is different from traditional data sources. Not every big data source will have every feature that follows, but most big data sources will have several of them.
First, big data is often automatically generated by a machine. Instead of a person being involved in creating new data, it's generated purely by machines in an automated way. If you think about traditional data sources, there was always a person involved. Consider retail or bank transactions, telephone call detail records, product shipments, or invoice payments. All of those involve a person doing something in order for a data record to be generated. Somebody had to deposit money, or make a purchase, or make a phone call, or send a shipment, or make a payment. In each case, there is a person taking action as part of the process of new data being created. This is not so for big data in many cases. A lot of sources of big data are generated without any human interaction at all. A sensor embedded in an engine, for example, spits out data about its surroundings even if nobody touches it or asks it to.
Second, big data is typically an entirely new source of data. It is not simply an extended collection of existing data. For example, with the use of the Internet, customers can now execute a transaction with a bank or retailer online. But the transactions they execute are not fundamentally different from the ones they would have executed traditionally. They've simply executed the transactions through a different channel. An organization may capture web transactions, but they are really just more of the same old transactions that have been captured for years. However, actually capturing browsing behaviors as customers execute a transaction creates fundamentally new data, which we'll discuss in detail in Chapter 2.
Sometimes "more of the same" can be taken to such an extreme that the data becomes something new. For example, your power meter has probably been read manually each month for years. An argument can be made that automatic readings every 15 minutes by a smart meter are just more of the same. It can also be argued that they are so much more of the same, and enable such a different, more in-depth level of analytics, that the data really is a new source, as the rough arithmetic sketched below suggests. We'll discuss this data in Chapter 3.
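As a loose illustration of the jump in scale, here is a quick back-of-the-envelope sketch in Python. The 15-minute interval comes from the example above; everything else is simple arithmetic, not data from any actual utility.

```python
# Rough, illustrative comparison of meter readings per household per year.
# The 15-minute interval comes from the smart meter example above.

manual_readings_per_year = 12                    # one manual read per month
smart_readings_per_year = (60 // 15) * 24 * 365  # a reading every 15 minutes, all year

print(manual_readings_per_year)                  # 12
print(smart_readings_per_year)                   # 35040
print(smart_readings_per_year // manual_readings_per_year)  # roughly 2,920 times as many
```

Roughly 35,000 readings a year instead of 12 starts to look less like "more of the same" and more like a data source that supports an entirely different depth of analysis.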
Third, many big data sources are not designed to be friendly. In fact, some of the sources aren't designed at all! Take text streams from a social media site. There is no way to ask users to follow certain standards of grammar, or sentence ordering, or vocabulary. You are going to get what you get when people make a posting. Such data can be difficult to work with at best and very, very ugly at worst. We'll discuss text data in Chapters 3 and 6. Most traditional data sources, by contrast, were designed up-front to be friendly. Systems used to capture transactions, for example, provide data in a clean, preformatted template that makes the data easy to load and use. This was driven in part by the historical need to be highly efficient with space. There was no room for excess fluff.
BIG DATA CAN BE MESSY AND UGLY
Traditional data sources were very tightly defined up-front. Every bit of data had a high level of value or it would not be included. With the cost of storage space becoming almost negligible, big data sources are not always tightly defined up-front and typically capture everything that may be of use. This can lead to having to wade through messy, junk-filled data when doing an analysis.
Last, large swaths of big data streams may not have much value. In fact, much of the data may even be close to worthless. Within a web log, there is information that is very powerful. There is also a lot of information that doesn't have much value at all. It is necessary to weed through the data and pull out the valuable and relevant pieces. Traditional data sources, on the other hand, were defined up-front to be 100 percent relevant. This was because of the scalability limitations of the time; it was far too expensive to include anything in a data feed that wasn't critical. Not only were data records predefined, but every piece of data in them was high-value. Storage space is no longer a primary constraint, which has led to the default with big data being to capture everything possible and worry later about what matters. This ensures nothing will be missed, but it can also make the process of analyzing big data more painful.
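To make the idea of weeding through a web log a bit more concrete, here is a minimal, hypothetical sketch in Python. The log format (a common-log-format file named access.log), the helper function, and the notion that image and asset requests are the "junk" are all assumptions made for illustration, not a prescription for any particular tool.

```python
# Minimal sketch: keep only web log lines that look like page views and
# discard requests for images, stylesheets, and scripts.
# The file name and log format here are illustrative assumptions.

JUNK_EXTENSIONS = (".png", ".jpg", ".gif", ".css", ".js", ".ico")

def is_page_view(log_line: str) -> bool:
    try:
        # In common log format the request looks like: "GET /products/123 HTTP/1.1"
        request = log_line.split('"')[1]
        url = request.split()[1]
    except IndexError:
        return False  # malformed line: treat it as junk
    return not url.lower().split("?")[0].endswith(JUNK_EXTENSIONS)

with open("access.log") as raw_log:  # hypothetical file name
    page_views = [line for line in raw_log if is_page_view(line)]

print(f"Kept {len(page_views)} page-view records")
```

Even a simple filter like this can shrink a raw log dramatically, which is exactly the point: the valuable portion of a big data source is often a small fraction of what gets captured.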
HOW IS BIG DATA MORE OF THE SAME?
As with any new topic getting a lot of attention, there are all sorts of claims about how big data is going to fundamentally change everything about how analysis is done and how it is used. If you take the time to think about it, however, that really isn't the case. It is an example of the hype going beyond the reality.
The fact that big data is big and poses scalability issues isn't new. Most new data sources were considered big and difficult when they first came into use. Big data is just the next wave of new, bigger data that pushes current limits. Analysts were able to tame past data sources, given the constraints at the time, and big data will be tamed as well. After all, analysts have been at the forefront of exploring new data sources for a long time. That's going to continue.
Who first started to analyze call detail records within telecom companies? Analysts did. I was doing churn analysis against mainframe tapes at my first job. At the time, the data was mind-bogglingly big. Who first started digging into retail point-of-sale data to figure out what nuggets it held? Analysts did. Originally, analyzing data about tens to hundreds of thousands of products across thousands of stores was considered a huge problem. Today, not so much.
The analytical professionals who first dipped their toes into such sources were dealing with what were, at the time, unthinkably large amounts of data. They had to figure out how to analyze it and make use of it within the constraints they faced. Many people doubted it was possible, and some even questioned the value of such data. That sounds a lot like big data today, doesn't it?
Big data really isn't going to change what analytic professionals are trying to do or why they are doing it. Even as some begin to define themselves as data scientists, rather than analysts, the goals and objectives are the same. Certainly the problems addressed will evolve with big data, just as they have always evolved. But at the end of the day, analysts and data scientists will simply be exploring new and unthinkably large data sets to uncover valuable trends and patterns as they have always done. For the purposes o...