
eBook - ePub
Principles and Practice of Big Data
Preparing, Sharing, and Analyzing Complex Information
- 480 pages
- English
- ePUB (mobile friendly)
- Available on iOS & Android
eBook - ePub
Principles and Practice of Big Data
Preparing, Sharing, and Analyzing Complex Information
About this book
Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information, Second Edition updates and expands on the first edition, bringing a set of techniques and algorithms that are tailored to Big Data projects. The book stresses the point that most data analyses conducted on large, complex data sets can be achieved without the use of specialized suites of software (e.g., Hadoop), and without expensive hardware (e.g., supercomputers). The core of every algorithm described in the book can be implemented in a few lines of code using just about any popular programming language (Python snippets are provided).
Through the use of new multiple examples, this edition demonstrates that if we understand our data, and if we know how to ask the right questions, we can learn a great deal from large and complex data collections. The book will assist students and professionals from all scientific backgrounds who are interested in stepping outside the traditional boundaries of their chosen academic disciplines.
- Presents new methodologies that are widely applicable to just about any project involving large and complex datasets
- Offers readers informative new case studies across a range scientific and engineering disciplines
- Provides insights into semantics, identification, de-identification, vulnerabilities and regulatory/legal issues
- Utilizes a combination of pseudocode and very short snippets of Python code to show readers how they may develop their own projects without downloading or learning new software
Trusted by 375,005 students
Access to over 1 million titles for a fair monthly price.
Study more efficiently using our study tools.
Information
1
Introduction
Abstract
Big Data is not synonymous with lots and lots of data. Useful Big Data resources adhere to a set of data management principles that are fundamentally different from the traditional practices followed for small data projects. The areas of difference include: data collection; data annotation (including metadata and identifiers); location and distribution of stored data; classification of data; data access rules; data curation; data immutability; data permanence; verification and validity methods for the contained data; analytic methods; costs; and incumbent legal, social, and ethical issues. Skilled professionals who are adept in the design and management of small data resources may be unprepared for the unique challenges posed by Big Data. This chapter is an introduction to topics that will be fully explained in later chapters.
Keywords
Big data definition; Small data; Data filtering; Data reduction
Section 1.1. Definition of Big Data
It's the data, stupid.
Jim Gray
Back in the mid 1960s, my high school held pep rallies before big games. At one of these rallies, the head coach of the football team walked to the center of the stage carrying a large box of printed computer paper; each large sheet was folded flip-flop style against the next sheet and they were all held together by perforations. The coach announced that the athletic abilities of every member of our team had been entered into the school's computer (we were lucky enough to have our own IBM-360 mainframe). Likewise, data on our rival team had also been entered. The computer was instructed to digest all of this information and to produce the name of the team that would win the annual Thanksgiving Day showdown. The computer spewed forth the aforementioned box of computer paper; the very last output sheet revealed that we were the pre-ordained winners. The next day, we sallied forth to yet another ignominious defeat at the hands of our long-time rivals.
Fast-forward about 50 years to a conference room at the National Institutes of Health (NIH), in Bethesda, Maryland. A top-level science administrator is briefing me. She explains that disease research has grown in scale over the past decade. The very best research initiatives are now multi-institutional and data-intensive. Funded investigators are using high-throughput molecular methods that produce mountains of data for every tissue sample in a matter of minutes. There is only one solution; we must acquire supercomputers and a staff of talented programmers who can analyze all our data and tell us what it all means!
The NIH leadership believed, much as my high school coach believed, that if you have a really big computer and you feed it a huge amount of information, then you can answer almost any question.
That day, in the conference room at the NIH, circa 2003, I voiced my concerns, indicating that you cannot just throw data into a computer and expect answers to pop out. I pointed out that, historically, science has been a reductive process, moving from complex, descriptive data sets to simplified generalizations. The idea of developing an expensive supercomputer facility to work with increasing quantities of biological data, at higher and higher levels of complexity, seemed impractical and unnecessary. On that day, my concerns were not well received. High performance supercomputing was a very popular topic, and still is. [Glossary Science, Supercomputer]
Fifteen years have passed since the day that supercomputer-based cancer diagnosis was envisioned. The diagnostic supercomputer facility was never built. The primary diagnostic tool used in hospital laboratories is still the microscope, a tool invented circa 1590. Today, we augment microscopic findings with genetic tests for specific, key mutations; but we do not try to understand all of the complexities of human genetic variations. We know that it is hopeless to try. You can find a lot of computers in hospitals and medical offices, but the computers do not calculate your diagnosis. Computers in the medical workplace are relegated to the prosaic tasks of collecting, storing, retrieving, and delivering medical records. When those tasks are finished, the computer sends you the bill for services rendered.
Before we can take advantage of large and complex data sources, we need to think deeply about the meaning and destiny of Big Data.
Big Data is defined by the three V's:
- 1. Volume—large amounts of data;.
- 2. Variety—the data comes in different forms, including traditional databases, images, documents, and complex records;.
- 3. Velocity—the content of the data is constantly changing through the absorption of complementary data collections, the introduction of previously archived data or legacy collections, and from streamed data arriving from multiple sources.
It is important to distinguish Big Data from “lotsa data” or “massive data.” In a Big Data Resource, all three V's must apply. It is the size, complexity, and restlessness of Big Data resources that account for the methods by which these resources are designed, operated, and analyzed. [Glossary Big Data resource, Data resource]
The term “lotsa data” is often applied to enormous collections of simple-format records. For example: every observed star, its magnitude and its location; the name and cell phone number of every person living in the United States; and the contents of the Web. These very large data sets are sometimes just glorified lists. Some “lotsa data” collections are spreadsheets (2-dimensional tables of columns and rows), so large that we may never see where they end.
Big Data resources are not equivalent to large spreadsheets, and a Big Data resource is never analyzed in its totality. Big Data analysis is a multi-step process whereby data is extracted, filtered, and transformed, with analysis often proceeding in a piecemeal, sometimes recursive, fashion. As you read this book, you will find that the gulf between “lotsa data” and Big Data is profound; the two subjects can seldom be discussed productively within the same venue.
Section 1.2. Big Data Versus Small Data
Actually, the main function of Big Science is to generate massive amounts of reliable and easily accessible data.... Insight, understanding, and scientific progress are generally achieved by ‘small science.’
Dan Graur, Yichen Zheng, Nicholas Price, Ricardo Azevedo, Rebecca Zufall, and Eran Elhaik [1].
Big Data is not small data that has become bloated to the point that it can no longer fit on a spreadsheet, nor is it a database that happens to be very large. Nonetheless, some professionals who customarily work with relatively small data sets, harbor the false impression that they can apply their spreadsheet and database know-how directly to Big Data resources without attaining new skills or adjusting to new analytic paradigms. As they see things, when the data gets bigger, only the computer must adjust (by getting faster, acquiring more volatile memory, and increasing its storage capabilities); Big Data poses no special problems that a supercomputer could not solve. [Glossary Database]
This attitude, which seems to be prevalent among database managers, programmers, and statisticians, is highly counterproductive. It will lead to slow and ineffective software, huge investment losses, bad analyses, and the production of useless and irreversibly defective Big Data resources.
Let us look at a few of the general differences that can help distinguish Big Data and small data.
- – Goals
small data—Usually designed to answer a specific question or serve a particular goal.
Big Data—Usually designed with a goal in mind, but the goal is flexible and the questions posed are protean. Here is a short, imaginary funding announcement for Big Data grants designed “to combine high quality data from fisheries, coast guard, commercial shipping, and coastal management agencies for a growing data collection that can be used to support a variety of governmental and commercial management studies in the Lower Peninsula.” In this fictitious case, there is a vague goal, but it is obvious that there really is no way to completely specify what the Big Data resource will contain, how the various types of data held in the resource will be organized, connected to other data resources, or usefully analyzed. Nobody can specify, with any degree of confidence, the ultimate destiny of any Big Data project; it usually comes as a surprise.
- – Location
small data—Typically, contained within one institution, often on one computer, sometimes in one file.
Big Data—Spread throughout electronic space and typically parceled onto multiple Internet servers, located anywhere on earth.
- – Data structure and content
small data—Ordinarily contains highly structured data. The data domain is restricted to a single discipline or sub-discipline. The data often comes in the form of uniform records in an ordered spreadsheet.
Big Data—Must be capable of absorbing unstructured data (e.g., such as free-text documents, images, motion pictures, sound recordings, physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, Big Data resources. [Glossary Data object]
- – Data preparation
small data—In many cases, the data user prepares her own data, for her own purposes.
Big Data—The data comes from many diverse sources, and it is prepared by many people. The people who use the data are seldom the people who have prepared the data.
- – Longevity
small data—When the data project ends, the data is kept for a limited time (seldo...
Table of contents
- Cover image
- Title page
- Table of Contents
- Copyright
- Other Books by Jules J. Berman
- Dedication
- About the Author
- Author's Preface to Second Edition
- Author's Preface to First Edition
- 1: Introduction
- 2: Providing Structure to Unstructured Data
- 3: Identification, Deidentification, and Reidentification
- 4: Metadata, Semantics, and Triples
- 5: Classifications and Ontologies
- 6: Introspection
- 7: Standards and Data Integration
- 8: Immutability and Immortality
- 9: Assessing the Adequacy of a Big Data Resource
- 10: Measurement
- 11: Indispensable Tips for Fast and Simple Big Data Analysis
- 12: Finding the Clues in Large Collections of Data
- 13: Using Random Numbers to Knock Your Big Data Analytic Problems Down to Size
- 14: Special Considerations in Big Data Analysis
- 15: Big Data Failures and How to Avoid (Some of) Them
- 16: Data Reanalysis: Much More Important Than Analysis
- 17: Repurposing Big Data
- 18: Data Sharing and Data Security
- 19: Legalities
- 20: Societal Issues
- Index
Frequently asked questions
Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription
No, books cannot be downloaded as external files, such as PDFs, for use outside of Perlego. However, you can download books within the Perlego app for offline reading on mobile or tablet. Learn how to download books offline
Perlego offers two plans: Essential and Complete
- Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
- Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 990+ topics, we’ve got you covered! Learn about our mission
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more about Read Aloud
Yes! You can use the Perlego app on both iOS and Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app
Yes, you can access Principles and Practice of Big Data by Jules J. Berman in PDF and/or ePUB format, as well as other popular books in Technology & Engineering & Business Intelligence. We have over one million books available in our catalogue for you to explore.