From an academic perspective, the debates, or arguments, over specific and sophisticated technical concepts are largely hype. How so? Let's take a quick look at the essence of information technology reform (IT reform) – digitization. Technically, digitization is the process of encoding the "information" generated in the real world by the human mind as "data" and storing it in cyberspace. No matter what new technologies emerge, the data stays the same. As the Oxford scholar Viktor Mayer-Schönberger once said [1], it is time to focus on the "I" in IT reform. "I," as information, can only be obtained by analyzing data. The challenge we expect to face is the burst of a "data tsunami," or "data explosion," and data reform is already underway. The world of "being digital," as advocated some time ago by Nicholas Negroponte [2], has gradually been transformed into "being in cyberspace."1
With the "big data wave" touching nearly all human activities, not only are academic circles resolved to change their way of exploring the world under the "fourth paradigm,"2 but the industrial community is also looking forward to profiting from "inexhaustible" data innovations. Admittedly, it is not difficult to predict that the emerging data industry will become a strategic industry in the near future. The initiative is therefore ours to seize – and ours to extend to the enterprising individual who seeks creative destruction through a business startup or who wants to revamp a traditional industry to secure its survival. We ask the reader to follow us, if only for a cursory glimpse, into the emerging big data industry, which handily exhibits the properties of all four categories in the Fisher–Clark classification: the resource property of primary industry, the manufacturing property of secondary industry, the service property of tertiary industry, and the "increasing profits of other industries" property of quaternary industry.
At present, industrial transformation and the emerging business of the data industry pose big challenges for most IT giants. Both the business magnate Warren Buffett and the financial wizard George Soros are bullish that such transformations will happen. For example,3 after IBM shifted its business model toward "big data," Buffett and Soros increased their holdings in IBM in 2012 by 5.5% and 11%, respectively.
1.1 DATA
Scientists attempting to unravel the mysteries of humankind are usually interested in intelligence. For instance, Sir Francis Galton,4 the founder of differential psychology, tried to evaluate human intelligence by measuring a subject's physical performance and sense perception. In 1971, another psychologist, Raymond Cattell, was acclaimed for establishing the theories of Crystallized Intelligence and Fluid Intelligence, which differentiate general intelligence [3]. Crystallized Intelligence refers to "the ability to use skills, knowledge, and experience"5 acquired through education and prior experience, and it improves as a person ages. Fluid Intelligence is the biological capacity "to think logically and solve problems in novel situations, independently of acquired knowledge."5
The primary objective of twentieth-century IT reform was to endow the computing machine with "intelligence," "brainpower," and, in effect, "wisdom." This all started back in 1946 when John von Neumann, while involved in the development of the ENIAC (electronic numerical integrator and computer), observed several important differences between the functioning of the computer and the human mind (such as processing speed and parallelism) [4]. Like the human mind, the machine used a "storing device" to save data and a "binary system" to organize data. By this analogy, the complexities of the machine's "memory" and "comprehension" could be worked out.
What, then, is data? Data is often regarded as the potential source of factual information or scientific knowledge, and it is physically stored in bytes (a unit of measurement). Data is a "discrete and objective" factual description related to an event, organized as atomic data, data items, data objects, and data sets (collected data) [5]. Metadata, simply put, is data that describes data. Data that processes data, such as a program or software, is known as a data tool. A data set is a collection of data objects, a data object is an assembly of data items, a data item can be seen as a quantity of atomic data, and atomic data represents the lowest level of detail in a computer system. A data item describes a characteristic of a data object (naming and defining its data type) without carrying an independent meaning. A data object goes by other names [6] (record, point, vector, pattern, case, sample, observation, entity, etc.) and is described by a number of attributes (e.g., variables, features, fields, or dimensions) that capture phenomena in nature.
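To make this hierarchy concrete, here is a minimal Python sketch of the terminology above; the class and attribute names are our own illustration and are not drawn from [5] or [6].

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class DataItem:
    """A named attribute (field/feature) whose value is atomic data."""
    name: str        # e.g., "temperature"
    dtype: type      # names and defines the data type
    value: Any       # the atomic datum: the lowest level of detail

@dataclass
class DataObject:
    """A record/point/observation: an assembly of data items."""
    items: List[DataItem] = field(default_factory=list)

    def attribute(self, name: str) -> Any:
        """Look up one attribute's atomic value by name."""
        return next(i.value for i in self.items if i.name == name)

@dataclass
class DataSet:
    """A collection of data objects, plus metadata (data about data)."""
    objects: List[DataObject] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

# One observation with two attributes, gathered into a small data set
obs = DataObject([DataItem("city", str, "Boston"), DataItem("temp_c", float, 21.5)])
weather = DataSet([obs], metadata={"source": "sensor feed", "units": {"temp_c": "Celsius"}})
print(weather.objects[0].attribute("temp_c"))   # 21.5
```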
1.1.1 Data Resources
Reaping the benefits of Moore's law, mass storage has dropped in cost from roughly US$6,000 per megabyte in 1955 to less than 1 cent in 2010, and this vast change in storage capacity makes big data storage feasible.
Moreover, data today is being generated at a sharply growing speed, and even data that was handwritten decades ago is being collected and stored with new tools. To measure data size easily, the academic community has added terms describing ever larger storage units: kilobyte (KB), megabyte (MB), gigabyte (GB), terabyte (TB), petabyte (PB), exabyte (EB), zettabyte (ZB), yottabyte (YB), nonabyte (NB), doggabyte (DB), and coydonbyte (CB).
To put this in perspective, a special report in The Economist (February 2010), "All too much: monstrous amounts of data,"6 offers an ingenious description of the magnitude of these storage units. For instance, "a kilobyte can hold about half of a page of text, while a megabyte holds about 500 pages of text."7 On a larger scale, the data held by the American Library of Congress amounts to about 15 TB. And if 1 ZB of 5 MB songs stored in MP3 format were played nonstop at a rate of 1 MB per minute, it would take 1.9 billion years to finish the playlist.
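A quick back-of-the-envelope check, assuming decimal (SI) prefixes (1 ZB = 10^21 bytes, 1 MB = 10^6 bytes), reproduces the 1.9-billion-year figure; the snippet below is only an illustrative calculation, not part of the cited report.

```python
# Back-of-the-envelope check of the zettabyte playback illustration,
# assuming decimal prefixes: 1 ZB = 10**21 bytes, 1 MB = 10**6 bytes.
ZB_IN_MB = 10**21 / 10**6            # megabytes in one zettabyte: 10**15

playback_rate_mb_per_min = 1.0       # "played nonstop at a rate of 1 MB per minute"
minutes = ZB_IN_MB / playback_rate_mb_per_min
years = minutes / (60 * 24 * 365.25)

print(f"{years / 1e9:.2f} billion years")   # ~1.90 billion years
```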
A study by Martin Hilbert of the University of Southern California and Priscila López of the Open University of Catalonia provides another interesting observation: "the total amount of global data is 295 EB" [7]. A follow-up to this finding came from the data storage giant EMC, which sponsored the "Digital Universe" market surveys conducted by the well-known research firm IDC (International Data Corporation). The surveys from 2007 to 2011 were themed, in turn, "The Expanding Digital Universe: A Forecast of Worldwide Information Growth," "The Diverse and Exploding Digital Universe," "As the Economy Contracts, the Digital Universe Expands," "The Digital Universe Decade – Are You Ready?" and "Extracting Value from Chaos."
The 2009 report estimated the scale of data for that year and pointed out that, despite the Great Recession, total data increased by 62% over 2008, approaching 0.8 ZB; it forecasted that total data in 2010 would grow to 1.2 ZB. The 2010 report forecasted that total data in 2020 would be 44 times that of 2009, amounting to roughly 35 ZB, and that the number of discrete data objects would grow even faster than the total volume of data. The 2011 report brought us to the unsettling realization that we have reached a stage where new data tools are needed to handle big data, which is certain to change our lifestyles completely.
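Those forecasts also imply a steep compound growth rate. The following sketch is our own arithmetic sanity check on the figures quoted above, not a calculation taken from the IDC reports.

```python
# Rough sanity check of the quoted forecasts (our own arithmetic).
data_2009_zb = 0.8             # ~0.8 ZB reported for 2009
growth_factor_2020 = 44        # 2020 forecast: 44 times the 2009 volume
years = 2020 - 2009

forecast_2020_zb = data_2009_zb * growth_factor_2020
annual_growth = growth_factor_2020 ** (1 / years) - 1

print(f"2020 forecast: ~{forecast_2020_zb:.0f} ZB")              # ~35 ZB, matching the report
print(f"Implied compound annual growth: ~{annual_growth:.0%}")   # ~41% per year
```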
When data organizations connected by logic, and data areas assembled from huge volumes of data, reach a certain scale, these massive and diverse data sets become "data resources" [5]. The reason a data resource can rank among humanity's vital modern strategic resources – possibly even exceeding, in the twenty-first century, the combined resources of oil, coal, and mineral products – is that all human activities, including without exception the exploration, exploitation, transportation, processing, and sale of petroleum, coal, and mineral products, now generate and rely on data.
Today, data resources are generated and stored across many scientific disciplines, such as astronomy, geography, geochemistry, geology, oceanography, meteorology, biology, and medical science. Moreover, various large-scale transnational collaborative experiments continuously produce big data that can be captured, stored, communicated, aggregated, and analyzed, such as CERN's LHC (Large Hadron Collider),8 American Pan-STARRS (Panoramic Sur...