I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again.
Background
In the early days of computing, data was always highly structured. All data was divided into fields, the fields had a fixed length, and the data entered into each field was constrained to a predetermined set of allowed values. Data was entered into punch cards, with preconfigured rows and columns. Depending on the intended use of the cards, various entry and read-out methods were chosen to express binary data, numeric data, fixed-size text, or programming instructions (see Glossary item, Binary data). Key-punch operators produced mountains of punch cards. For many analytic purposes, card-encoded data sets were analyzed without the assistance of a computer; all that was needed was a punch card sorter. If you wanted the data cards on all males, over the age of 18, who had graduated from high school, and had passed their physical exam, then the sorter would need to make four passes. The sorter would pull every card listing a male, then from the male cards it would pull all the cards of people over the age of 18, and from this double-sorted substack it would pull cards that met the next criterion, and so on. As a high school student in the 1960s, I loved playing with the card sorters. Back then, all data was structured data, and it seemed to me, at the time, that a punch-card sorter was all that anyone would ever need to analyze large sets of data.
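The successive-pass logic of the card sorter is easy to mimic in a few lines of code. The sketch below is purely illustrative; the card fields and sample records are invented, but each call to `sorter_pass` corresponds to one physical pass through the sorter.

```python
# Toy sketch of multi-pass punch-card sorting: each pass pulls the
# subset of cards matching one criterion, as a card sorter would.
# Fields and sample data are invented for illustration.

cards = [
    {"sex": "M", "age": 22, "hs_grad": True,  "physical_passed": True},
    {"sex": "F", "age": 30, "hs_grad": True,  "physical_passed": True},
    {"sex": "M", "age": 17, "hs_grad": False, "physical_passed": True},
    {"sex": "M", "age": 45, "hs_grad": True,  "physical_passed": False},
]

def sorter_pass(stack, criterion):
    """One pass of the card sorter: keep only the matching cards."""
    return [card for card in stack if criterion(card)]

# Four passes, one per criterion, as described above.
stack = cards
stack = sorter_pass(stack, lambda c: c["sex"] == "M")
stack = sorter_pass(stack, lambda c: c["age"] > 18)
stack = sorter_pass(stack, lambda c: c["hs_grad"])
stack = sorter_pass(stack, lambda c: c["physical_passed"])
print(len(stack))  # cards surviving all four passes
```

Note that, just as with the physical sorter, the order of the passes does not change the final stack, only the size of the intermediate stacks.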
Of course, I was completely wrong. Today, most data entered by humans is unstructured, in the form of free text. The free text comes in e-mail messages, tweets, documents, and so on. Structured data has not disappeared, but it sits in the shadows cast by mountains of unstructured text. Free text may be more interesting to read than punch cards, but the venerable punch card, in its heyday, was much easier to analyze than its free-text descendant. To get much informational value from free text, it is necessary to impose some structure. This may involve translating the text to a preferred language, parsing the text into sentences, extracting and normalizing the conceptual terms contained in the sentences, mapping terms to a standard nomenclature (see Glossary items, Nomenclature, Thesaurus), annotating the terms with codes from one or more standard nomenclatures, extracting and standardizing data values from the text, assigning data values to specific classes of data belonging to a classification system, assigning the classified data to a storage and retrieval system (e.g., a database), and indexing the data in the system. All of these activities are difficult to do on a small scale and virtually impossible to do on a large scale. Nonetheless, every Big Data project that uses unstructured data must deal with these tasks to yield the best possible results with the resources available.
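Two of the structuring steps listed above, parsing text into sentences and indexing the extracted terms, can be sketched in a few lines. This is a deliberately naive illustration (the regular-expression sentence splitter and word tokenizer are crude assumptions, not production tools), but it shows the kind of structure that must be imposed before free text can be queried.

```python
# Minimal sketch of two structuring steps described above: parsing
# free text into sentences, and building a term index that maps each
# word to the sentences in which it occurs. Tokenization is naive.
import re
from collections import defaultdict

def parse_sentences(text):
    """Split text into sentences at terminal punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def index_terms(sentences):
    """Build an inverted index: lowercase word -> sentence numbers."""
    index = defaultdict(set)
    for n, sentence in enumerate(sentences):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            index[word].add(n)
    return index

sentences = parse_sentences("Free text is hard. Structured data is easy.")
index = index_terms(sentences)
print(sorted(index["is"]))  # "is" occurs in both sentences: [0, 1]
```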
Machine Translation
The purpose of narrative is to present us with complexity and ambiguity.
Scott Turow
The term unstructured data refers to data objects whose contents are not organized into arrays of attributes or values (see Glossary item, Data object). Spreadsheets, with data distributed in cells, marked by a row and column position, are examples of structured data. This paragraph is an example of unstructured data. You can see why data analysts prefer spreadsheets over free text. Without structure, the contents of the data cannot be sensibly collected and analyzed. Because Big Data is immense, the tasks of imposing structure on text must be automated and fast.
Machine translation is one of the better known areas in which computational methods have been applied to free text. Ultimately, the job of machine translation is to translate text from one language into another language. The process of machine translation begins with extracting sentences from text, parsing the words of the sentence into grammatic parts, and arranging the grammatic parts into an order that imposes logical sense on the sentence. Once this is done, each of the parts can be translated by a dictionary that finds equivalent terms in a foreign language to be reassembled by applying grammatic positioning rules appropriate for the target language. Because this process uses the natural rules for sentence constructions in a foreign language, the process is often referred to as natural language machine translation.
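The dictionary-lookup-plus-reordering process described above can be caricatured in a few lines. The two-word lexicon, the part-of-speech tags, and the single grammatical rule (adjectives follow nouns in the target language, as in French) are all invented for illustration; real natural language translation systems use far richer grammars and statistical models.

```python
# Toy dictionary-based translator with one grammatical reordering
# rule for the target language (adjectives follow nouns, as in
# French). Lexicon and part-of-speech tags are invented.

lexicon = {
    "the": ("le", "DET"),
    "red": ("rouge", "ADJ"),
    "ball": ("ballon", "NOUN"),
}

def translate(sentence):
    tokens = [lexicon[w] for w in sentence.lower().split()]
    out = []
    i = 0
    while i < len(tokens):
        word, pos = tokens[i]
        # Reordering rule: move an adjective after the noun it precedes.
        if pos == "ADJ" and i + 1 < len(tokens) and tokens[i + 1][1] == "NOUN":
            out.extend([tokens[i + 1][0], word])
            i += 2
        else:
            out.append(word)
            i += 1
    return " ".join(out)

print(translate("the red ball"))  # le ballon rouge
```

The fragility of the approach is already visible: any word missing from the lexicon, and any sentence whose structure falls outside the reordering rules, defeats the translator, which is precisely the problem the following paragraphs describe.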
It all seems simple and straightforward. In a sense, it is, if you have the proper look-up tables. Relatively good automatic translators are now widely available. The drawback of all these applications is that there are many instances where they fail utterly. Complex sentences, as you might expect, are problematic. Beyond the complexity of the sentences are other problems, deeper problems that touch upon the dirtiest secret common to all human languages: languages do not make much sense. Computers cannot find meaning in sentences that have no meaning. If we, as humans, find meaning in the English language, it is only because we impose our own cultural prejudices onto the sentences we read, to create meaning where none exists.
It is worthwhile to spend a few moments on some of the inherent limitations of English. Our words are polymorphous; their meanings change depending on the context in which they occur. Word polymorphism can be used for comic effect (e.g., "Both the martini and the bar patron were drunk"). As humans steeped in the culture of our language, we effortlessly invent the intended meaning of each polymorphic pair in the following examples: "a bandage wound around a wound," "farming to produce produce," "please present the present in the present time," "don't object to the data object," "teaching a sow to sow seed," "wind the sail before the wind comes," and countless others.
Words lack compositionality; their meaning cannot be deduced by analyzing root parts. For example, there is neither pine nor apple in pineapple, no egg in eggplant, and hamburgers are made from beef, not ham. You can assume that a lover will love, but you cannot assume that a finger will "fing." Vegetarians will eat vegetables, but humanitarians will not eat humans. Overlook and oversee should, logically, be synonyms, but they are antonyms.
For many words, meaning is determined by the case of the first letter of the word: for example, Nice and nice, Polish and polish, Herb and herb, August and august.
It is possible, given enough effort, that a machine translator may cope with all the aforementioned impedimenta. Nonetheless, no computer can create meaning out of ambiguous gibberish, and a sizable portion of written language has no meaning, in the informatics sense (see Glossary item, Meaning). As someone who has dabbled in writing machine translation tools, my favorite gripe relates to the common use of reification: the process whereby the subject of a sentence is inferred, without actually being named (see Glossary item, Reification). Reification is accomplished with pronouns and other subject references.
Here is an example, taken from a newspaper headline: "Husband named person of interest in slaying of mother." First off, we must infer that it is the husband who was named as the person of interest, not that the husband suggested the name of the person of interest. As anyone who follows crime headlines knows, this sentence refers to a family consisting of a husband, wife, and at least one child. There is a wife because there is a husband. There is a child because there is a mother. The reader is expected to infer that the mother is the mother of the husband's child, not the mother of the husband. The mother and the wife are the same person. Putting it all together, the husband and wife are father and mother, respectively, to the child. The sentence conveys the news that the husband is a suspect in the slaying of his wife, the mother of the child. The word "husband" reifies the existence of a wife (i.e., creates a wife by implication from the husband-wife relationship). The word "mother" reifies a child. Nowhere is any individual husband or mother identified; it's all done with pointers pointing to other pointers. The sentence is all but meaningless; any meaning extracted from the sentence comes as a creation of our vivid imaginations.
Occasionally, a sentence contains a reification of a group of people, and the reification contributes absolutely nothing to the meaning of the sentence. For example, "John married aunt Sally." Here, a familial relationship is established ("aunt") for Sally, but the relationship does not extend to the only other person mentioned in the sentence (i.e., Sally is not John's aunt). Instead, the word "aunt" reifies a group of individuals; specifically, the group of people who have Sally as their aunt. The reification seems to serve no purpose other than to confuse.
Here is another example, taken from a newspaper article: "After her husband disappeared on a 1944 recon mission over Southern France, Antoine de Saint-Exupery's widow sat down and wrote this memoir of their dramatic marriage." There are two reified persons in the sentence: "her husband" and "Antoine de Saint-Exupery's widow." In the first phrase, "her husband" is a relationship (i.e., "husband") established for a pronoun (i.e., "her") referenced to the person in the second phrase. The person in the second phrase is reified by a relationship to Saint-Exupery (i.e., "widow"), who just happens to be the reification of the person in the first phrase (i.e., "Saint-Exupery is her husband").
We write self-referential reifying sentences every time we use a pronoun: "It was then that he did it for them." The first "it" reifies an event, the word "then" reifies a time, the word "he" reifies a subject, the second "it" reifies some action, and the word "them" reifies a group of individuals representing the recipients of the reified action.
Strictly speaking, all of these examples are meaningless. The subjects of the sentence are not properly identified and the references to the subjects are ambiguous. Such sentences cannot be sensibly evaluated by computers.
A final example is "Do you know who I am?" There are no identifiable individuals; everyone is reified and reduced to an unspecified pronoun ("you," "I"). Though there are just a few words in the sentence, half of them are superfluous. The words "Do," "who," and "am" are merely fluff, with no informational purpose. In an object-oriented world, the question would be transformed into an assertion, "You know me," and the assertion would be sent a query message, "true?" (see Glossary item, Object-oriented programming). We are jumping ahead. Objects, assertions, and query messages will be discussed in later chapters.
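The object-oriented recasting just described can be sketched in miniature. The class name, the triple representation, and the `query` method below are all invented for illustration; the point is only that the question becomes an explicit assertion object that can be sent a message and answer it unambiguously.

```python
# Sketch of the object-oriented recasting described above: the
# question "Do you know who I am?" becomes an assertion object,
# which is sent a query message. All names here are illustrative.

class Assertion:
    """An assertion expressed as a (subject, predicate, object) triple."""

    def __init__(self, subject, predicate, obj):
        self.triple = (subject, predicate, obj)

    def query(self, known_facts):
        """Respond to the 'true?' message: is this assertion known?"""
        return self.triple in known_facts

facts = {("you", "know", "me")}
assertion = Assertion("you", "know", "me")  # "You know me"
print(assertion.query(facts))  # True
```

Unlike the English question, the assertion object leaves nothing to be inferred: the subject, the relationship, and the object are all named explicitly.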
Accurate machine translation is beyond being difficult. It is simply impossible. It is impossible because computers cannot understand nonsense. The best we can hope for is a translation that allows the reader to impose the same subjective interpretation of the text in the translation language as he or she would have made in the original language. The expectation that sentences can be reliably parsed into informational units is fantasy. Nonetheless, it is possible to compose meaningful sentences in any language, if you have a deep understanding of informational meaning. This topic will be addressed in Chapter 4.
Autocoding
The beginning of wisdom is to call things by their right names.
Chinese proverb
Coding, as used in the context of unstructured textual data, is the process of tagging terms with an identifier code that corresponds to a synonymous term listed in a standard nomenclature (see Glossary item, Identifier). For example, a medical nomenclature might contain the term renal cell carcinoma, a type of kidney cancer, and attach a unique identifier code to the term, such as "C9385000." There are about 50 recognized synonyms for "renal cell carcinoma." A few of these synonyms and near-synonyms are listed here to show that a single concept can be expressed many different ways, including adenocarcinoma arising from kidney, adenocarcinoma involving kidney, cancer arising from kidney, carcinoma of kidney, Grawitz tumor, Grawitz tumour, hypernephroid tumor, hypernephroma, kidney adenocarcinoma, renal adenocarcinoma, and renal cell carcinoma. All of these terms could be assigned the same identifier code, "C9385000."
The process of coding a text document involves finding all the terms that belong to a specific nomenclature and tagging the term with the corresponding identifier code.
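A minimal autocoder following this description can be sketched as below. The nomenclature is abridged from the renal cell carcinoma example above, and the matching strategy (longest term first, case-insensitive substring matching) is a simplifying assumption; real autocoders must handle word boundaries, term variants, and overlapping matches far more carefully.

```python
# Minimal autocoder: every synonym in the nomenclature maps to one
# identifier code, and each matching term in the text is tagged with
# that code. Matching is naive: longest term first, case-insensitive.
import re

nomenclature = {
    "renal cell carcinoma": "C9385000",
    "hypernephroma": "C9385000",
    "grawitz tumor": "C9385000",
}

def autocode(text):
    tagged = text
    # Match longer terms first so a long term is not broken up by
    # the tagging of one of its shorter substrings.
    for term in sorted(nomenclature, key=len, reverse=True):
        code = nomenclature[term]
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        tagged = pattern.sub(lambda m: f"{m.group(0)} [{code}]", tagged)
    return tagged

print(autocode("Biopsy confirmed hypernephroma (Grawitz tumor)."))
# Biopsy confirmed hypernephroma [C9385000] (Grawitz tumor [C9385000]).
```

Because every synonym carries the same code, a query for "C9385000" retrieves the document regardless of which synonym the author happened to use.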
A nomenclature is a specialized vocabulary, usually containing terms that comprehensively cover a well-defined and circumscribed area (see Glossary item, Vocabulary). For example, there may be a nomenclature of diseases, or celestial bodies, or makes and models of automobiles. Some nomenclatures are ordered alphabetically. Others are ordered by synonymy, wherein all synonyms and plesionyms (near-synonyms, see Glossary item, Plesionymy) are collected under a canonical (i.e., best or preferred) term. Synonym indexes are always corrupted by the inclusion of polysemous terms (i.e., terms with multiple meanings; see Glossary item, Polysemy). In many nomenclatures, grouped synonyms are collected under a code (i.e., a unique alphanumeric string) assigned to all of the terms in the group (see Glossary items, Uniqueness, String). Nomenclatures have many purposes: to enhance interoperability and integration, to allow synonymous terms to be retrieved regardless of which specific synonym is entered as a query, to support comprehensive analyses of textual data, to express detail, to tag information in textual documents, and to drive down the complexity of documents by uniting synonymous terms under a common code. Sets of documents held in more than one Big Data resource can be ha...