Principles and Practice of Big Data

Preparing, Sharing, and Analyzing Complex Information
About this book

Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information, Second Edition updates and expands on the first edition, bringing a set of techniques and algorithms that are tailored to Big Data projects. The book stresses the point that most data analyses conducted on large, complex data sets can be achieved without the use of specialized suites of software (e.g., Hadoop), and without expensive hardware (e.g., supercomputers). The core of every algorithm described in the book can be implemented in a few lines of code using just about any popular programming language (Python snippets are provided). Through the use of multiple new examples, this edition demonstrates that if we understand our data, and if we know how to ask the right questions, we can learn a great deal from large and complex data collections. The book will assist students and professionals from all scientific backgrounds who are interested in stepping outside the traditional boundaries of their chosen academic disciplines.
  • Presents new methodologies that are widely applicable to just about any project involving large and complex datasets
  • Offers readers informative new case studies across a range of scientific and engineering disciplines
  • Provides insights into semantics, identification, de-identification, vulnerabilities, and regulatory/legal issues
  • Utilizes a combination of pseudocode and very short snippets of Python code to show readers how they may develop their own projects without downloading or learning new software

1

Introduction

Abstract

Big Data is not synonymous with lots and lots of data. Useful Big Data resources adhere to a set of data management principles that are fundamentally different from the traditional practices followed for small data projects. The areas of difference include: data collection; data annotation (including metadata and identifiers); location and distribution of stored data; classification of data; data access rules; data curation; data immutability; data permanence; verification and validity methods for the contained data; analytic methods; costs; and incumbent legal, social, and ethical issues. Skilled professionals who are adept in the design and management of small data resources may be unprepared for the unique challenges posed by Big Data. This chapter is an introduction to topics that will be fully explained in later chapters.

Keywords

Big data definition; Small data; Data filtering; Data reduction

Section 1.1. Definition of Big Data

It's the data, stupid.
Jim Gray
Back in the mid 1960s, my high school held pep rallies before big games. At one of these rallies, the head coach of the football team walked to the center of the stage carrying a large box of printed computer paper; each large sheet was folded flip-flop style against the next sheet and they were all held together by perforations. The coach announced that the athletic abilities of every member of our team had been entered into the school's computer (we were lucky enough to have our own IBM-360 mainframe). Likewise, data on our rival team had also been entered. The computer was instructed to digest all of this information and to produce the name of the team that would win the annual Thanksgiving Day showdown. The computer spewed forth the aforementioned box of computer paper; the very last output sheet revealed that we were the pre-ordained winners. The next day, we sallied forth to yet another ignominious defeat at the hands of our long-time rivals.
Fast-forward about 50 years to a conference room at the National Institutes of Health (NIH), in Bethesda, Maryland. A top-level science administrator is briefing me. She explains that disease research has grown in scale over the past decade. The very best research initiatives are now multi-institutional and data-intensive. Funded investigators are using high-throughput molecular methods that produce mountains of data for every tissue sample in a matter of minutes. There is only one solution: we must acquire supercomputers and a staff of talented programmers who can analyze all our data and tell us what it all means!
The NIH leadership believed, much as my high school coach believed, that if you have a really big computer and you feed it a huge amount of information, then you can answer almost any question.
That day, in the conference room at the NIH, circa 2003, I voiced my concerns, indicating that you cannot just throw data into a computer and expect answers to pop out. I pointed out that, historically, science has been a reductive process, moving from complex, descriptive data sets to simplified generalizations. The idea of developing an expensive supercomputer facility to work with increasing quantities of biological data, at higher and higher levels of complexity, seemed impractical and unnecessary. On that day, my concerns were not well received. High performance supercomputing was a very popular topic, and still is. [Glossary Science, Supercomputer]
Fifteen years have passed since the day that supercomputer-based cancer diagnosis was envisioned. The diagnostic supercomputer facility was never built. The primary diagnostic tool used in hospital laboratories is still the microscope, a tool invented circa 1590. Today, we augment microscopic findings with genetic tests for specific, key mutations; but we do not try to understand all of the complexities of human genetic variations. We know that it is hopeless to try. You can find a lot of computers in hospitals and medical offices, but the computers do not calculate your diagnosis. Computers in the medical workplace are relegated to the prosaic tasks of collecting, storing, retrieving, and delivering medical records. When those tasks are finished, the computer sends you the bill for services rendered.
Before we can take advantage of large and complex data sources, we need to think deeply about the meaning and destiny of Big Data.
Big Data is defined by the three V's:
  1. Volume—large amounts of data.
  2. Variety—the data comes in different forms, including traditional databases, images, documents, and complex records.
  3. Velocity—the content of the data is constantly changing through the absorption of complementary data collections, the introduction of previously archived data or legacy collections, and from streamed data arriving from multiple sources.
It is important to distinguish Big Data from “lotsa data” or “massive data.” In a Big Data Resource, all three V's must apply. It is the size, complexity, and restlessness of Big Data resources that account for the methods by which these resources are designed, operated, and analyzed. [Glossary Big Data resource, Data resource]
The term “lotsa data” is often applied to enormous collections of simple-format records. For example: every observed star, its magnitude and its location; the name and cell phone number of every person living in the United States; and the contents of the Web. These very large data sets are sometimes just glorified lists. Some “lotsa data” collections are spreadsheets (2-dimensional tables of columns and rows), so large that we may never see where they end.
Big Data resources are not equivalent to large spreadsheets, and a Big Data resource is never analyzed in its totality. Big Data analysis is a multi-step process whereby data is extracted, filtered, and transformed, with analysis often proceeding in a piecemeal, sometimes recursive, fashion. As you read this book, you will find that the gulf between “lotsa data” and Big Data is profound; the two subjects can seldom be discussed productively within the same venue.
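To make this extract-filter-transform rhythm concrete, here is a minimal Python sketch in the spirit of the short snippets used throughout this book. It streams a large tab-delimited file one record at a time, keeps only the records of interest, and reduces each surviving record to a simpler form. The file name (measurements.tab) and the three-field layout (identifier, category, value) are hypothetical, invented only for illustration.

# A minimal sketch of piecemeal Big Data analysis: extract, filter, and
# transform records one at a time, never loading the whole collection
# into memory. File name and field layout are hypothetical.

def records(filename):
    """Extract: yield one tab-delimited record at a time."""
    with open(filename, "r", encoding="utf-8") as infile:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3:              # skip malformed lines
                yield fields

def keep(record):
    """Filter: retain only the records tagged as 'tissue' observations."""
    identifier, category, value = record
    return category == "tissue"

def transform(record):
    """Transform: reduce a record to an (identifier, numeric value) pair."""
    identifier, category, value = record
    return identifier, float(value)

if __name__ == "__main__":
    total = 0.0
    count = 0
    for rec in records("measurements.tab"):   # hypothetical file name
        if keep(rec):
            identifier, value = transform(rec)
            total += value
            count += 1
    if count:
        print("Filtered records:", count, " mean value:", total / count)

Because the records are consumed one at a time by a generator, the same few lines behave identically whether the file holds a thousand rows or far more rows than could ever fit in memory, and the filtered output of one pass can serve as the input to the next pass, which is the piecemeal, sometimes recursive, fashion described above.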

Section 1.2. Big Data Versus Small Data

Actually, the main function of Big Science is to generate massive amounts of reliable and easily accessible data.... Insight, understanding, and scientific progress are generally achieved by ‘small science.’
Dan Graur, Yichen Zheng, Nicholas Price, Ricardo Azevedo, Rebecca Zufall, and Eran Elhaik [1].
Big Data is not small data that has become bloated to the point that it can no longer fit on a spreadsheet, nor is it a database that happens to be very large. Nonetheless, some professionals who customarily work with relatively small data sets harbor the false impression that they can apply their spreadsheet and database know-how directly to Big Data resources without attaining new skills or adjusting to new analytic paradigms. As they see things, when the data gets bigger, only the computer must adjust (by getting faster, acquiring more volatile memory, and increasing its storage capabilities); Big Data poses no special problems that a supercomputer could not solve. [Glossary Database]
This attitude, which seems to be prevalent among database managers, programmers, and statisticians, is highly counterproductive. It will lead to slow and ineffective software, huge investment losses, bad analyses, and the production of useless and irreversibly defective Big Data resources.
Let us look at a few of the general differences that can help distinguish Big Data from small data.
  • Goals
small data—Usually designed to answer a specific question or serve a particular goal.
Big Data—Usually designed with a goal in mind, but the goal is flexible and the questions posed are protean. Here is a short, imaginary funding announcement for Big Data grants designed “to combine high quality data from fisheries, coast guard, commercial shipping, and coastal management agencies for a growing data collection that can be used to support a variety of governmental and commercial management studies in the Lower Peninsula.” In this fictitious case, there is a vague goal, but it is obvious that there really is no way to completely specify what the Big Data resource will contain, how the various types of data held in the resource will be organized, connected to other data resources, or usefully analyzed. Nobody can specify, with any degree of confidence, the ultimate destiny of any Big Data project; it usually comes as a surprise.
  • Location
small data—Typically, contained within one institution, often on one computer, sometimes in one file.
Big Data—Spread throughout electronic space and typically parceled onto multiple Internet servers, located anywhere on earth.
  • Data structure and content
small data—Ordinarily contains highly structured data. The data domain is restricted to a single discipline or sub-discipline. The data often comes in the form of uniform records in an ordered spreadsheet.
Big Data—Must be capable of absorbing unstructured data (e.g., free-text documents, images, motion pictures, sound recordings, physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, Big Data resources. [Glossary Data object]
  • Data preparation
small data—In many cases, the data user prepares her own data, for her own purposes.
Big Data—The data comes from many diverse sources, and it is prepared by many people. The people who use the data are seldom the people who have prepared the data.
  • Longevity
small data—When the data project ends, the data is kept for a limited time (seldo...

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Other Books by Jules J. Berman
  6. Dedication
  7. About the Author
  8. Author's Preface to Second Edition
  9. Author's Preface to First Edition
  10. 1: Introduction
  11. 2: Providing Structure to Unstructured Data
  12. 3: Identification, Deidentification, and Reidentification
  13. 4: Metadata, Semantics, and Triples
  14. 5: Classifications and Ontologies
  15. 6: Introspection
  16. 7: Standards and Data Integration
  17. 8: Immutability and Immortality
  18. 9: Assessing the Adequacy of a Big Data Resource
  19. 10: Measurement
  20. 11: Indispensable Tips for Fast and Simple Big Data Analysis
  21. 12: Finding the Clues in Large Collections of Data
  22. 13: Using Random Numbers to Knock Your Big Data Analytic Problems Down to Size
  23. 14: Special Considerations in Big Data Analysis
  24. 15: Big Data Failures and How to Avoid (Some of) Them
  25. 16: Data Reanalysis: Much More Important Than Analysis
  26. 17: Repurposing Big Data
  27. 18: Data Sharing and Data Security
  28. 19: Legalities
  29. 20: Societal Issues
  30. Index