Principles of Big Data

Preparing, Sharing, and Analyzing Complex Information
About this book

Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
  • Learn general methods for specifying Big Data in a way that is understandable to humans and to computers
  • Avoid the pitfalls in Big Data design and analysis
  • Understand how to create and use Big Data safely and responsibly, with a set of laws, regulations, and ethical standards that apply to the acquisition, distribution, and integration of Big Data resources

Chapter 1

Providing Structure to Unstructured Data

Outline
Background
Machine Translation
Autocoding
Indexing
Term Extraction
I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again.
Oscar Wilde

Background

In the early days of computing, data was always highly structured. All data was divided into fields, the fields had a fixed length, and the data entered into each field was constrained to a predetermined set of allowed values. Data was entered into punch cards, with preconfigured rows and columns. Depending on the intended use of the cards, various entry and read-out methods were chosen to express binary data, numeric data, fixed-size text, or programming instructions (see Glossary item, Binary data). Key-punch operators produced mountains of punch cards. For many analytic purposes, card-encoded data sets were analyzed without the assistance of a computer; all that was needed was a punch card sorter. If you wanted the data cards for all males over the age of 18 who had graduated from high school and had passed their physical exam, then the sorter would need to make four passes. The sorter would pull every card listing a male, then from the male cards it would pull all the cards of people over the age of 18, and from this double-sorted substack it would pull the cards that met the next criterion, and so on. As a high school student in the 1960s, I loved playing with the card sorters. Back then, all data was structured data, and it seemed to me, at the time, that a punch-card sorter was all that anyone would ever need to analyze large sets of data.
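For readers who think in code, the sorter's four passes amount to successive filters over a deck of records; here is a minimal Python sketch of the same selection, with invented field names standing in for the card columns.
    # A sketch of the card sorter's four selection passes, expressed as
    # successive filters over a deck of records. The field names are
    # invented stand-ins for the punch card columns.
    cards = [
        {"sex": "M", "age": 24, "hs_grad": True,  "physical": True},
        {"sex": "F", "age": 31, "hs_grad": True,  "physical": True},
        {"sex": "M", "age": 17, "hs_grad": False, "physical": True},
        {"sex": "M", "age": 45, "hs_grad": True,  "physical": False},
    ]

    deck = cards
    passes = [
        lambda c: c["sex"] == "M",   # pass 1: pull the male cards
        lambda c: c["age"] > 18,     # pass 2: over the age of 18
        lambda c: c["hs_grad"],      # pass 3: graduated high school
        lambda c: c["physical"],     # pass 4: passed the physical exam
    ]
    for keep in passes:
        deck = [card for card in deck if keep(card)]

    print(deck)  # the one record surviving all four passes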
Of course, I was completely wrong. Today, most data entered by humans is unstructured, in the form of free text. The free text comes in e-mail messages, tweets, documents, and so on. Structured data has not disappeared, but it sits in the shadows cast by mountains of unstructured text. Free text may be more interesting to read than punch cards, but the venerable punch card, in its heyday, was much easier to analyze than its free-text descendant. To get much informational value from free text, it is necessary to impose some structure. This may involve translating the text to a preferred language, parsing the text into sentences, extracting and normalizing the conceptual terms contained in the sentences, mapping terms to a standard nomenclature (see Glossary items, Nomenclature, Thesaurus), annotating the terms with codes from one or more standard nomenclatures, extracting and standardizing data values from the text, assigning data values to specific classes of data belonging to a classification system, assigning the classified data to a storage and retrieval system (e.g., a database), and indexing the data in the system. All of these activities are difficult to do on a small scale and virtually impossible to do on a large scale. Nonetheless, every Big Data project that uses unstructured data must deal with these tasks to yield the best possible results with the resources available.
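None of these steps is exotic in miniature. As a rough illustration only, the following Python sketch performs three of the simpler tasks named above (splitting text into sentences, normalizing terms, and indexing them); the helper names and the stop-word list are invented for this example.
    # A rough sketch of three of the simpler structuring tasks named
    # above: parsing text into sentences, normalizing terms, and
    # indexing them. Helper names and the stop-word list are invented.
    import re
    from collections import defaultdict

    def sentences(text):
        return re.split(r"(?<=[.!?])\s+", text.strip())

    def terms(sentence):
        # crude normalization: lowercase, keep alphabetic tokens only
        words = re.findall(r"[a-z]+", sentence.lower())
        return [w for w in words if w not in {"the", "a", "of", "to", "is"}]

    def build_index(text):
        index = defaultdict(set)
        for position, sentence in enumerate(sentences(text)):
            for term in terms(sentence):
                index[term].add(position)
        return index

    doc = "The punch card was easy to analyze. Free text is harder to analyze."
    print(sorted(build_index(doc)["analyze"]))  # [0, 1]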

Machine Translation

The purpose of narrative is to present us with complexity and ambiguity.
Scott Turow
The term unstructured data refers to data objects whose contents are not organized into arrays of attributes or values (see Glossary item, Data object). Spreadsheets, with data distributed in cells, marked by a row and column position, are examples of structured data. This paragraph is an example of unstructured data. You can see why data analysts prefer spreadsheets over free text. Without structure, the contents of the data cannot be sensibly collected and analyzed. Because Big Data is immense, the tasks of imposing structure on text must be automated and fast.
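A tiny example makes the contrast concrete. In the sketch below, with invented data, the structured record is queried directly by attribute, while the free-text version must be parsed before any value can be pulled out.
    # The same fact, structured and unstructured; the data are invented.
    # The structured record answers a query directly; the free text
    # must be parsed before any value can be retrieved.
    import re

    structured = {"name": "Smith", "age": 62, "city": "Baltimore"}
    unstructured = "Smith, aged 62, has lived in Baltimore for years."

    print(structured["age"])                          # direct retrieval: 62
    match = re.search(r"aged\s+(\d+)", unstructured)  # free text needs parsing
    print(match.group(1) if match else "not found")   # '62'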
Machine translation is one of the better-known areas in which computational methods have been applied to free text. Ultimately, the job of machine translation is to translate text from one language into another language. The process of machine translation begins with extracting sentences from text, parsing the words of the sentence into grammatic parts, and arranging the grammatic parts into an order that imposes logical sense on the sentence. Once this is done, each of the parts can be translated with a dictionary that finds equivalent terms in the target language; the translated parts are then reassembled by applying the grammatic positioning rules appropriate for that language. Because this process uses the natural rules for sentence constructions in a foreign language, the process is often referred to as natural language machine translation.
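A toy sketch suggests the flavor of the process; the three-word lexicon and the single positioning rule below are invented for illustration, and a real translator would need vastly more grammar than this.
    # A toy dictionary-based translator: tag each word, translate it,
    # and reassemble using one positioning rule of the target language
    # (a Spanish adjective follows its noun). Lexicon and rule are
    # invented for illustration.
    lexicon = {  # English word -> (Spanish translation, part of speech)
        "the":   ("el",    "DET"),
        "black": ("negro", "ADJ"),
        "cat":   ("gato",  "NOUN"),
    }

    def translate(sentence):
        tagged = [lexicon[w] for w in sentence.lower().rstrip(".").split()]
        out = []
        for word, pos in tagged:
            if out and out[-1][1] == "ADJ" and pos == "NOUN":
                adjective = out.pop()    # positioning rule: the noun
                out.append((word, pos))  # comes first, the adjective
                out.append(adjective)    # is emitted after it
            else:
                out.append((word, pos))
        return " ".join(word for word, _ in out) + "."

    print(translate("The black cat."))  # -> "el gato negro."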
It all seems simple and straightforward. In a sense, it is—if you have the proper look-up tables. Relatively good automatic translators are now widely available. The drawback of all these applications is that there are many instances where they fail utterly. Complex sentences, as you might expect, are problematic. Beyond the complexity of the sentences are other problems, deeper problems that touch upon the dirtiest secret common to all human languages—languages do not make much sense. Computers cannot find meaning in sentences that have no meaning. If we, as humans, find meaning in the English language, it is only because we impose our own cultural prejudices onto the sentences we read, to create meaning where none exists.
It is worthwhile to spend a few moments on some of the inherent limitations of English. Our words are polymorphous; their meanings change depending on the context in which they occur. Word polymorphism can be used for comic effect (e.g., “Both the martini and the bar patron were drunk”). As humans steeped in the culture of our language, we effortlessly invent the intended meaning of each polymorphic pair in the following examples: “a bandage wound around a wound,” “farming to produce produce,” “please present the present in the present time,” “don’t object to the data object,” “teaching a sow to sow seed,” “wind the sail before the wind comes,” and countless others.
Words lack compositionality; their meaning cannot be deduced by analyzing root parts. For example, there is neither pine nor apple in pineapple, no egg in eggplant, and hamburgers are made from beef, not ham. You can assume that a lover will love, but you cannot assume that a finger will “fing.” Vegetarians will eat vegetables, but humanitarians will not eat humans. Overlook and oversee should, logically, be synonyms, but they are antonyms.
For many words, meaning is determined by the case of the first letter: for example, Nice and nice, Polish and polish, Herb and herb, August and august.
It is possible, given enough effort, that a machine translator may cope with all the aforementioned impedimenta. Nonetheless, no computer can create meaning out of ambiguous gibberish, and a sizable portion of written language has no meaning, in the informatics sense (see Glossary item, Meaning). As someone who has dabbled in writing machine translation tools, my favorite gripe relates to the common use of reification—the process whereby the subject of a sentence is inferred, without actually being named (see Glossary item, Reification). Reification is accomplished with pronouns and other subject references.
Here is an example, taken from a newspaper headline: “Husband named person of interest in slaying of mother.” First off, we must infer that it is the husband who was named as the person of interest, not that the husband suggested the name of the person of interest. As anyone who follows crime headlines knows, this sentence refers to a family consisting of a husband, wife, and at least one child. There is a wife because there is a husband. There is a child because there is a mother. The reader is expected to infer that the mother is the mother of the husband’s child, not the mother of the husband. The mother and the wife are the same person. Putting it all together, the husband and wife are father and mother, respectively, to the child. The sentence conveys the news that the husband is a suspect in the slaying of his wife, the mother of the child. The word “husband” reifies the existence of a wife (i.e., creates a wife by implication from the husband-wife relationship). The word “mother” reifies a child. Nowhere is any individual husband or mother identified; it’s all done with pointers pointing to other pointers. The sentence is all but meaningless; any meaning extracted from the sentence comes as a creation of our vivid imaginations.
Occasionally, a sentence contains a reification of a group of people, and the reification contributes absolutely nothing to the meaning of the sentence. For example, “John married aunt Sally.” Here, a familial relationship is established (“aunt”) for Sally, but the relationship does not extend to the only other person mentioned in the sentence (i.e., Sally is not John’s aunt). Instead, the word “aunt” reifies a group of individuals; specifically, the group of people who have Sally as their aunt. The reification seems to serve no purpose other than to confuse.
Here is another example, taken from a newspaper article: “After her husband disappeared on a 1944 recon mission over Southern France, Antoine de Saint-Exupery’s widow sat down and wrote this memoir of their dramatic marriage.” There are two reified persons in the sentence: “her husband” and “Antoine de Saint-Exupery’s widow.” In the first phrase, “her husband” is a relationship (i.e., “husband”) established for a pronoun (i.e., “her”) referenced to the person in the second phrase. The person in the second phrase is reified by a relationship to Saint-Exupery (i.e., “widow”), who just happens to be the reification of the person in the first phrase (i.e., “Saint-Exupery is her husband”).
We write self-referential reifying sentences every time we use a pronoun: “It was then that he did it for them.” The first “it” reifies an event, the word “then” reifies a time, the word “he” reifies a subject, the second “it” reifies some action, and the word “them” reifies a group of individuals representing the recipients of the reified action.
Strictly speaking, all of these examples are meaningless. The subjects of the sentence are not properly identified and the references to the subjects are ambiguous. Such sentences cannot be sensibly evaluated by computers.
A final example is “Do you know who I am?” There are no identifiable individuals; everyone is reified and reduced to an unspecified pronoun (“you,” “I”). Though there are just a few words in the sentence, half of them are superfluous. The words “Do,” “who,” and “am” are merely fluff, with no informational purpose. In an object-oriented world, the question would be transformed into an assertion, “You know me,” and the assertion would be sent a query message, “true?” (see Glossary item, Object-oriented programming). We are jumping ahead. Objects, assertions, and query messages will be discussed in later chapters.
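For concreteness, here is one minimal way the assertion-and-query idea might look in code; the class and method names are invented and are not the book's notation.
    # One minimal way to express the reformulation above: the question
    # "Do you know who I am?" becomes the assertion "You know me",
    # which is then sent a query message asking whether it holds.
    # Class and method names are invented for illustration.
    class Assertion:
        def __init__(self, subject, verb, obj):
            self.triple = (subject, verb, obj)

        def true(self, known_facts):
            """Answer the query message: is this assertion a known fact?"""
            return self.triple in known_facts

    facts = {("you", "know", "me")}
    assertion = Assertion("you", "know", "me")
    print(assertion.true(facts))  # True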
Accurate machine translation is beyond being difficult. It is simply impossible. It is impossible because computers cannot understand nonsense. The best we can hope for is a translation that allows the reader to impose the same subjective interpretation of the text in the translation language as he or she would have made in the original language. The expectation that sentences can be reliably parsed into informational units is fantasy. Nonetheless, it is possible to compose meaningful sentences in any language, if you have a deep understanding of informational meaning. This topic will be addressed in Chapter 4.

Autocoding

The beginning of wisdom is to call things by their right names.
Chinese proverb
Coding, as used in the context of unstructured textual data, is the process of tagging terms with an identifier code that corresponds to a synonymous term listed in a standard nomenclature (see Glossary item, Identifier). For example, a medical nomenclature might contain the term renal cell carcinoma, a type of kidney cancer, and attach to the term a unique identifier code, such as “C9385000.” There are about 50 recognized synonyms for “renal cell carcinoma.” A few of these synonyms and near-synonyms are listed here to show that a single concept can be expressed many different ways, including adenocarcinoma arising from kidney, adenocarcinoma involving kidney, cancer arising from kidney, carcinoma of kidney, Grawitz tumor, Grawitz tumour, hypernephroid tumor, hypernephroma, kidney adenocarcinoma, renal adenocarcinoma, and renal cell carcinoma. All of these terms could be assigned the same identifier code, “C9385000.”
The process of coding a text document involves finding all the terms that belong to a specific nomenclature and tagging the term with the corresponding identifier code.
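In miniature, an autocoder can be as simple as a dictionary of coded terms and a scan of the text. The sketch below abridges the renal cell carcinoma synonym list given above; a real nomenclature would hold many thousands of coded terms.
    # A bare-bones autocoder: scan the text for nomenclature terms and
    # tag each match with its identifier code. The synonym list is
    # abridged from the renal cell carcinoma example above.
    nomenclature = {
        "renal cell carcinoma":  "C9385000",
        "hypernephroma":         "C9385000",
        "grawitz tumor":         "C9385000",
        "kidney adenocarcinoma": "C9385000",
    }

    def autocode(text):
        lowered = text.lower()
        return [(term, code) for term, code in nomenclature.items()
                if term in lowered]

    report = "Biopsy confirmed a Grawitz tumor (renal cell carcinoma)."
    print(autocode(report))
    # [('renal cell carcinoma', 'C9385000'), ('grawitz tumor', 'C9385000')]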
A nomenclature is a specialized vocabulary, usually containing terms that comprehensively cover a well-defined and circumscribed area (see Glossary item, Vocabulary). For example, there may be a nomenclature of diseases, or celestial bodies, or makes and models of automobiles. Some nomenclatures are ordered alphabetically. Others are ordered by synonymy, wherein all synonyms and plesionyms (near-synonyms, see Glossary item, Plesionymy) are collected under a canonical (i.e., best or preferred) term. Synonym indexes are always corrupted by the inclusion of polysemous terms (i.e., terms with multiple meanings; see Glossary item, Polysemy). In many nomenclatures, grouped synonyms are collected under a code (i.e., a unique alphanumeric string) assigned to all of the terms in the group (see Glossary items, Uniqueness, String). Nomenclatures have many purposes: to enhance interoperability and integration, to allow synonymous terms to be retrieved regardless of which specific synonym is entered as a query, to support comprehensive analyses of textual data, to express detail, to tag information in textual documents, and to drive down the complexity of documents by uniting synonymous terms under a common code. Sets of documents held in more than one Big Data resource can be ha...

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Acknowledgments
  7. Author Biography
  8. Preface
  9. Introduction
  10. Chapter 1. Providing Structure to Unstructured Data
  11. Chapter 2. Identification, Deidentification, and Reidentification
  12. Chapter 3. Ontologies and Semantics
  13. Chapter 4. Introspection
  14. Chapter 5. Data Integration and Software Interoperability
  15. Chapter 6. Immutability and Immortality
  16. Chapter 7. Measurement
  17. Chapter 8. Simple but Powerful Big Data Techniques
  18. Chapter 9. Analysis
  19. Chapter 10. Special Considerations in Big Data Analysis
  20. Chapter 11. Stepwise Approach to Big Data Analysis
  21. Chapter 12. Failure
  22. Chapter 13. Legalities
  23. Chapter 14. Societal Issues
  24. Chapter 15. The Future
  25. Glossary
  26. References
  27. Index