1
HOW BIG IS BIG DATA?
Just outside Memphis, an industrial symphony of machines and humans shuttles goods to and fro, their carefully orchestrated movements and identifying marks tracked by bar-code scanners and chips emitting radio waves. Mechanical arms snatch the plastic shrink-wrapped bundles off a conveyor belt, as forklifts ferry the packages onto trucks for long-distance travel. Flesh-and-blood humans guide and monitor the flow of goods and drive the forklifts and trucks.
McKesson, which distributes about a third of all of the pharmaceutical products in America, runs this sprawling showcase of efficiency. Its buildings span the equivalent of more than eight football fields, forming the hub of McKesson's national distribution network, a feat of logistics that sends goods to 26,000 customer locations, from neighborhood pharmacies to Walmart. The main cargo is drugs, roughly 240 million pills a day. The pharmaceutical distribution business is one of high volumes and razor-thin profit margins. So, understandably, efficiency has been all but a religion for McKesson for decades.
Yet in the last few years, McKesson has taken a striking step further by cutting the inventory flowing through its network at any given time by $1 billion. The payoff came from insights gleaned from harvesting all the product, location, and transport data from scanners and sensors, and then mining that data with clever software to identify potential time-saving and cost-cutting opportunities. The technology-enhanced view of the business was a breakthrough that Donald Walker, a senior McKesson executive, calls "making the invisible visible."
In Atlanta, I stand outside one of the glassed-in rooms in the fifth-floor intensive care unit at Emory University Hospital. Inside, a dense thicket of electronic devices, a veritable forest of medical computing, crowds the room: a respirator, a kidney machine, infusion machines pumping antibiotics and painkilling opiates, and gadgets monitoring heart rate, breathing, blood pressure, oxygen saturation, and other vital signs. Nearly every machine has its own computer monitor, each emitting an electronic cacophony of beeps and alerts. I count a dozen screens, from larger flat panels down to smaller ones the size of a smartphone.
A typical twenty-bed intensive care unit generates an estimated 160,000 data points a second. Amid all that data, informed and distracted by it, doctors and nurses make decisions at a rapid clip, about 100 decisions a day per patient, according to research at Emory, or more than 9.3 million decisions about care during a year in an ICU. So there is ample room for error. The overwhelmed people need help. And Emory is one of a handful of medical research centers that are working to transform critical care with data, in both adult and neonatal intensive care wards. The data streams from the medical devices monitoring patients are parsed by software that has been trained to spot early warning signals that a patient's condition is worsening.
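To put that torrent in rough perspective, a quick bit of arithmetic helps. The sketch below simply scales the quoted 160,000-data-points-a-second figure up to a day and a year; nothing else in it comes from Emory.

```python
# A back-of-the-envelope scaling of the ICU data rate quoted above.
# Only the 160,000-points-per-second estimate comes from the text; the rest is arithmetic.
points_per_second = 160_000              # estimated for a twenty-bed ICU
seconds_per_day = 60 * 60 * 24           # 86,400

points_per_day = points_per_second * seconds_per_day
points_per_year = points_per_day * 365

print(f"per day:  {points_per_day:,}")   # 13,824,000,000 (~14 billion)
print(f"per year: {points_per_year:,}")  # 5,045,760,000,000 (~5 trillion)
```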
Digesting vast amounts of data and spotting seemingly subtle patterns is where computers and software algorithms excel, more so than humans. Dr. Timothy Buchman heads up such an effort at Emory. A surgeon, scientist, and experienced pilot, Buchman uses a flight analogy to explain his goal. GPS (Global Positioning System) location data on planes is translated to screen images that show air-traffic controllers when a flight is going astray, "off trajectory," as he puts it, well before a plane crashes. Buchman wants the same sort of early warning system for patients whose pattern of vital signs is off trajectory, before they crash, in medical terms. "That's where big data is taking us," he says.
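Emory's actual software is far more sophisticated than anything that fits on a page, but the flavor of an early-warning signal can be sketched in a few lines. The toy example below flags any heart-rate reading that drifts well outside its recent rolling baseline, a crude stand-in for Buchman's "off trajectory"; the window, threshold, and simulated readings are all invented for illustration.

```python
from collections import deque
from statistics import mean, stdev

def off_trajectory(readings, window=30, threshold=3.0):
    """Flag readings that drift far from their recent rolling baseline.

    A toy stand-in for an ICU early-warning signal: any value more than
    `threshold` standard deviations away from the mean of the previous
    `window` readings is reported as off trajectory.
    """
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            baseline, spread = mean(recent), stdev(recent)
            if spread > 0 and abs(value - baseline) > threshold * spread:
                alerts.append((i, value))
        recent.append(value)
    return alerts

# Simulated heart-rate stream: steady around 72-74 bpm, then a sudden climb.
heart_rate = [72 + (i % 3) for i in range(60)] + [95, 104, 118]
print(off_trajectory(heart_rate))   # flags the late spike
```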
Big data is coming of age, moving well beyond Internet incubators in Silicon Valley, such as Google and Facebook. It began in the digital-only world of bits, and is rapidly marching into the physical world of atoms, into the mainstream. The McKesson distribution center and the Emory intensive care unit show the way: big data saving money and saving lives. Indeed, the long view of the technology is that it will become a layer of data-driven artificial intelligence that resides on top of both the digital and the physical realms. Today, we're seeing the early steps toward that vision. Big-data technology is ushering in a revolution in measurement that promises to be the basis for the next wave of efficiency and innovation across the economy. But more than technology is at work here. Big data is also the vehicle for a point of view, or philosophy, about how decisions will be, and perhaps should be, made in the future. David Brooks, my colleague at the New York Times, has referred to this rising mind-set as "data-ism," a term I've adopted as well because it suggests the breadth of the phenomenon. The tools of innovation matter, as we've often seen in the past, not only for economic growth but because they can reshape how we see the world and make decisions about it.
A bundle of technologies fly under the banner of big data. The first is all the old and new sources of data: Web pages, browsing habits, sensor signals, social media, GPS location data from smartphones, genomic information, and surveillance videos. The data surge just keeps rising, about doubling in volume every two years. But I would argue that the most exaggerated, and often least important, aspect of big data is the "big." The global data count becomes a kind of nerd's parlor game of estimates and projections, an excursion into the linguistic backwater of zettabytes, yottabytes, and brontobytes. The numbers and their equivalents are impressive. Ninety percent of all of the data in history, by one estimate, was created in the last two years. In 2014, International Data Corporation estimated the data universe at 4.4 zettabytes, which is 4.4 trillion gigabytes. That volume of information, the research firm said, straining for perspective, would fill enough slender iPad Air tablets to create a stack more than 157,000 miles high, or two thirds of the way to the moon.
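The iPad comparison is easy enough to check. The sketch below redoes the arithmetic under two round assumptions of my own, not IDC's: a 128-gigabyte iPad Air and a tablet thickness of 7.5 millimeters. It lands in the same neighborhood as the firm's 157,000-mile stack.

```python
# Rough check of IDC's iPad-stack comparison for a 4.4-zettabyte data universe.
# Assumptions (mine, not IDC's): each iPad Air holds 128 GB and is 7.5 mm thick.
ZETTABYTE_GB = 1e12            # 1 zettabyte = 1 trillion gigabytes
data_universe_gb = 4.4 * ZETTABYTE_GB

ipad_capacity_gb = 128.0       # assumed capacity of one tablet
ipad_thickness_mm = 7.5        # assumed thickness of one tablet

tablets = data_universe_gb / ipad_capacity_gb
stack_miles = tablets * ipad_thickness_mm / 1_000_000 / 1.609  # mm -> km -> miles

print(f"{tablets:.2e} tablets, stack ~{stack_miles:,.0f} miles")
# roughly 3.4e10 tablets and a stack on the order of 160,000 miles,
# about two thirds of the ~239,000 miles to the moon
```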
But not all data is created equal, or is equally valuable. The mind-numbing data totals are inflated by the rise in the production of digital images and video. Just think of all of the smartphone friend and family pictures and video clips taken and sent. It is said that a picture is worth a thousand words. Yet in the arithmetic of digital measurement, that is a considerable understatement, because images are bit gluttons. Text, by contrast, is a bit-sipping medium. There are eight bits in a byte. A letter of text consumes one byte, while a standard, high-resolution picture is measured in megabytes, millions of bytes. And video, in its appetite for bits, dwarfs still pictures. Forty-eight hours of video are uploaded to YouTube every minute, as I write this, and the pace is only likely to increase.
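The disparity is easy to put in numbers. The sketch below compares a thousand words of plain text with a single photo and a minute of video; the photo and video sizes are round assumptions of mine (a few megabytes for a high-resolution picture, on the order of a hundred megabytes for a minute of high-definition video), not measurements.

```python
# How many words of plain text fit in the space of one photo or one minute of video?
# The media sizes below are round assumptions for illustration, not measurements.
BYTES_PER_CHAR = 1                  # one byte per letter of plain text
AVG_WORD_LEN = 6                    # roughly five letters plus a space

bytes_per_word = AVG_WORD_LEN * BYTES_PER_CHAR
photo_bytes = 3 * 1024**2           # assume a ~3 MB high-resolution photo
video_minute_bytes = 100 * 1024**2  # assume ~100 MB per minute of HD video

print(f"'A picture is worth a thousand words' costs {1000 * bytes_per_word:,} bytes of text")
print(f"One photo holds the bytes of ~{photo_bytes // bytes_per_word:,} words")
print(f"One minute of video holds the bytes of ~{video_minute_bytes // bytes_per_word:,} words")
```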
The big in big data matters, but a lot less than many people think. There's a lot of water in the ocean, too, but you can't drink it. The more pressing issue is being able to use and make sense of data. The success stories in this book involve lots of data, but typically not in volumes that would impress engineers at Google. And while advances in computer processing, storage, and memory are helping with the data challenge, the biggest step ahead is in software. The crucial code comes largely from the steadily evolving toolkit of artificial intelligence, like machine-learning software.
Data and smart technology are opening the door to new horizons of measurement, both from afar and close-up. Big-data technology is the digital-age equivalent of the telescope or the microscope. Both of those made it possible to see and measure things as never before: with the telescope, it was the heavens and new galaxies; with the microscope, it was the mysteries of life down to the cellular level.
Just as modern telescopes transformed astronomy and modern microscopes did the same for biology, big data holds a similar promise, but more broadly, in every field and every discipline. Far-reaching advances in technology are engines of economic change. The Internet transformed the economics of communication. Then other technologies, like the Web, were built on top of the Internet, which has become a platform for innovation and new businesses. Similarly, big data, though still a young technology, is transforming the economics of discovery, becoming a platform, if you will, for human decision making.
Decisions of all kinds will increasingly be made based on data and analysis rather than on experience and intuition: more science and less gut feel.
Throughout history, technological change has challenged traditional practices, ways of educating people, and even ways of understanding the world. In 1959, at the dawn of the modern computer age, the English chemist and novelist C. P. Snow delivered a lecture at Cambridge University, "The Two Cultures." In it, Snow dissected the differences, and the widening gap, between two camps, the sciences and the humanities. The schism between scientific and "literary intellectuals," he warned, threatened to stymie economic and social progress if those in the humanities remained ignorant of the advances in science and their implications. The lecture was widely read in America, and among those influenced were two professors at Dartmouth College, John Kemeny and Thomas Kurtz. Kemeny, a mathematician and a former research assistant to Albert Einstein, would go on to become the president of Dartmouth. Kurtz was a young math professor in the early 1960s when he approached Kemeny with the idea of giving nearly all students at Dartmouth a taste of programming on a computer.
Kemeny and Kurtz saw the rise of computing as a major technological force that would sweep across the economy and society. But only a quarter of Dartmouth students majored in science or engineering, the group most likely to be interested in computing. Yet "most of the decision makers of business and government" typically came from the less technically inclined 75 percent of the student population, Kurtz explained. So Kurtz and Kemeny devised a simple programming language, BASIC (Beginner's All-purpose Symbolic Instruction Code), intended to be accessible to non-engineers. In 1964, they began teaching Dartmouth students to write programs in BASIC. And variants of Dartmouth's BASIC would eventually be used by millions of people to write software. Bill Gates wrote a stripped-down BASIC to run on early personal computers, and Microsoft BASIC was the company's founding product. Years later, Gates fondly recalled the feat of writing a shrunken version of BASIC to work on the primitive personal computers of the mid-1970s. "Of all the programming I've done," Gates told me, "it's the thing I'm most proud of."
Back in the 1960s, Kemeny and Kurtz had no intention of making Dartmouth a training ground for professional programmers. They wanted to give their students a feel for interacting with these digital machines and for computational thinking, which involves analyzing and logically organizing data so that computers can help solve problems. The Dartmouth professors weren't really teaching programming. They were trying to change minds, to encourage their students to see things differently. Today, when people talk about the need to retool education and training for the data age, it is often a fairly narrow discussion of specific skills. But the larger picture has less to do with a wizard's mastery of data than with a fundamental curiosity about data. The bigger goal is to foster a mind-set, so that thinking about data becomes an intellectual first principle, the starting point of inquiry. It's a mentality that can be summed up in a question: What story does the data tell you?
The promise of big data is that the story is far richer and more detailed than ever before, making it suddenly possible to see more and learn faster, or in the McKesson executive's words, "to make the invisible visible." And the improvement is not a little bit better, but fundamentally different. I think of this as the deeper meaning of Moore's Law. In a technical sense, the law, formulated by Intel's cofounder Gordon Moore in 1965, is the observation that transistor density on computer chips doubles about every two years and that computing power improves at that exponential pace. But in a practical sense, it also means that seemingly quantitative changes become qualitative, opening the door to new possibilities and new ways of doing things. In computing, you start by calculating the flight trajectory of artillery shells, the task assigned to the ENIAC (Electronic Numerical Integrator and Computer) in 1946. And by 2011, you have IBM's Watson beating the best humans in the question-and-answer game Jeopardy!
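The arithmetic behind that leap is simple compounding, as the sketch below shows. The spans are only illustrative, since Moore's Law dates from 1965 and real machines never tracked the curve exactly.

```python
# Compounding under Moore's Law: a doubling every two years.
def growth_factor(years, doubling_period=2):
    """How many times over a capability grows if it doubles every `doubling_period` years."""
    return 2 ** (years / doubling_period)

# Illustrative spans, not precise benchmarks.
print(f"1965 -> 2011 ({2011 - 1965} years): ~{growth_factor(2011 - 1965):,.0f}x")
print(f"ENIAC (1946) -> Watson (2011): ~{growth_factor(2011 - 1946):.2e}x")
```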
To a computer, it's all just the 1's and 0's of digital code. Yet the massive quantitative improvement in performance over time drastically changes what can be done. Trained physicists in the data world often compare the quantitative-to-qualitative transformation to a "phase change," or change of state, as when a gas becomes a liquid or a liquid becomes a solid. It is an apt, descriptive comparison. But I prefer the Moore's Law reference, and here's why. When the temperature drops below thirty-two degrees Fahrenheit or zero degrees Celsius, water freezes. It happens naturally, a law of nature. Moore's Law is an observation about what had happened for years, and what could well happen in the future. But it is not a law of nature. Moore's Law has held for so many years because of human ingenuity, endeavor, and investment. Scientists, companies, and investors made it happen.
The same is true of big data. It has become technically possible thanks to a bounty of improvements in computing, sensing, and communications. But the steady advance in software and hardware, and the rise of data-ism, will happen because of brains, energy, and money. The big-data revolution requires both trailblazing individuals and institutional commitment. The narrative of this book is built around one of each: a young man and an old company. The young man is Jeffrey Hammerbacher, thirty-two, who personifies the mind-set of data-ism and whose career traces the widening horizons of data technology and methods. Hammerbacher grew up in Indiana, went to Harvard University, and was briefly a quant at a Wall Street investment bank before building the first team of so-called data scientists at Facebook. He left to become a cofounder and chief scientist of Cloudera, a start-up that makes software for data scientists. Then, beginning in the summer of 2012, he embarked on a very different professional path. He joined the Icahn School of Medicine at Mount Sinai in New York, where he is leading a data group that is exploring genetic and other medical information in search of breakthroughs in disease modeling and treatment. Medical research, he figures, is the best use of his skills today.
At the other pole of the modern data world is IBM, a century-old technology giant known for its research prowess and its mainstream corporate clientele. Its customers provide a window into the progress data techniques are making, as well as the challenges, across a spectrum of industries. IBM itself has lined up its research, its strategy, and its investment behind the big-data business. "We are betting the company on this," Virginia Rometty, the chief executive, told me in an interview.
But for IBM, big data is a threat as well as an opportunity. The new, low-cost hardware and software that power many big-data applications (cloud computing and open-source code) will supplant some of IBM's traditional products. The company must expand in the new data markets faster than its old-line businesses wither. No company can match IBM's history in the data field; the founding technology of the company that became IBM, punched cards, developed by Herman Hollerith, triumphed in counting and tabulating the 1890 census, when the American population grew to sixty-three million, the big data of its day. Today, IBM researchers are at the forefront of big-data technology. The projects at McKesson and Emory, which will be examined in greater detail later, are collaborations with IBM scientists. And IBM's Watson, that engine of data-driven artificial intelligence, is no longer merely a game-playing science experiment but a full-fledged business unit within IBM, supported by an investment of $1 billion, as it applies its smarts to medicine, customer service, financial services, and other fields. The Watson technology is now a cloud service, delivered over the Internet from distant data centers, and IBM is encouraging software engineers to write applications that run on Watson, as if it were an operating system for the future.
The new and the old, the individual and the institution, are at times conflicting forces but also complementary. It is hard to imagine that Hammerbacher and IBM would ever be a comfortable fit, but they are heading in the same direction, and both are big-data enthusiasts.
Another conflicting yet complementary subject runs through this book, and it centers on decision making. Big data can be a powerful tool indeed, but it has its limits. So much depends on context: what is being measured and how it is measured. Data can always be gathered, and patterns can be observed. But is the pattern significant, and are you measuring what you really want to know? Or are you measuring what is most easily measured rather than what is most meaningful? There is a natural tension between the measurement imperative and measurement myopia. Two quotes frame the issue succinctly. The first: "You can't manage what you can't measure." For this one, there appear to be twin claims of attribution, either W. Edwards Deming, the statistician and quality control expert, or Peter Drucker, the management consultant. Who said it first doesn't matter so much. It's a mantra in business, and it has the ring of commonsense truth.
The second quote is not as well known, but there is a lot of truth in it as well: "Not everything that can be counted counts, and not everything that counts can be counted." Albert Einstein usually gets credit for this one, but the stronger claim of origin belongs to the sociologist William Bruce Cameron, though again, who said it first matters far less than what it says...