Data Architecture: A Primer for the Data Scientist
eBook - ePub

  1. 431 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
About this book

Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There remains a need for people to take a look at the "bigger picture" and to understand where their data fit into the grand scheme of things. Data Architecture: A Primer for the Data Scientist, Second Edition addresses the larger architectural picture of how big data fits within the existing information infrastructure or data warehousing systems. This is an essential topic not only for data scientists, analysts, and managers but also for researchers and engineers who increasingly need to deal with large and complex sets of data. Until data are gathered and can be placed into an existing framework or architecture, they cannot be used to their full potential. Drawing upon years of practical experience and using numerous examples and case studies from across various industries, the authors seek to explain this larger picture into which big data fits, giving data scientists the necessary context for how pieces of the puzzle should fit together.
  • New case studies include expanded coverage of textual management and analytics
  • New chapters on visualization and big data
  • Discussion of new visualizations of the end-state architecture

Data Architecture: A Primer for the Data Scientist by W.H. Inmon, Daniel Linstedt, and Mary Levins is available in PDF and/or ePUB format.

Information

Year
2019
eBook ISBN
9780128169179
Edition
2
Chapter 1.1

An Introduction to Data Architecture

Abstract

Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is between structured data and unstructured data. As a rule, there are far more unstructured data than structured data. Unstructured data have two basic divisions: repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the “great divide.” Yet many professionals are not even aware that this divide exists. As a rule, nonrepetitive big data has much greater business value than repetitive big data.

Keywords

Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data
Data architecture is about the larger picture of data and how it fits together in a typical organization. The natural starting point for looking at how data fit together in a corporation is all the data in the corporation.
Fig. 1.1.1 depicts symbolically all the data—of every kind—in the corporation.
Fig. 1.1.1 The totality of corporate data.
Fig. 1.1.1 depicts every kind of data found in the corporation. It depicts data generated by running transactions. It depicts e-mail. It depicts telephone conversations. It depicts data found in personal computers. It depicts metering data. It depicts office memos. It depicts contracts, safety reports, and time sheets. It depicts pay ledgers.
In short, if it is data and it is in the corporation, it is depicted by the bar shown in Fig. 1.1.1.

Subdividing Data

There are many ways to subdivide the data shown in Fig. 1.1.1. The way that is shown is only one of many ways data can be understood.
One way to understand the data found in the corporation is to divide it into structured data and unstructured data. Fig. 1.1.2 shows this subdivision of data.
Fig. 1.1.2 Structured data is only a small part of corporate data.
Structured data are data that are well defined. Structured data are typically repetitive. The same structure of data recurs repeatedly. The only difference between one occurrence of data and another is in the contents of the data. As a simple example of structured data, consider the records of the sale of a good (an “SKU”) made by a retailer. Each time Walmart makes a sale, the item sold, the amount of the sale, the tax paid, and the date and location of the sale are recorded. In a day's time, Walmart will create many records of the sale of many items. From a structural standpoint, the record of the sale of one item will be identical to the record of the sale of another item. The data are called “structured” because of the similarity of the structure of the records.
The high degree of structure and definition of the records makes the records easy to handle inside a database management system.
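As a rough illustration of what "identical structure, different contents" means, the sale record described above might be sketched as a Python dataclass. The class name, the field names, and the sample values are assumptions made for this sketch; the text only says that the item, amount, tax, date, and location are recorded.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class SaleRecord:
    """One retail sale. Field names are illustrative, not taken from the book."""
    sku: str
    amount: float
    tax: float
    sale_date: date
    location: str


def field_names(record) -> tuple:
    """Return the ordered field names that make up a record's structure."""
    return tuple(f.name for f in fields(record))


# Two made-up sales: same structure, different contents.
sale_a = SaleRecord("SKU-0001", 19.99, 1.60, date(2019, 1, 15), "Store 12")
sale_b = SaleRecord("SKU-0457", 4.25, 0.34, date(2019, 1, 15), "Store 98")

same_structure = field_names(sale_a) == field_names(sale_b)
same_contents = sale_a == sale_b
print(same_structure, same_contents)  # True False
```

Because every record exposes the same fields in the same order, each field maps directly onto a column in a database table, which is one way to read the claim that structured records are easy for a database management system to handle.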
However, structured records are hardly the only kind of data in the corporation. In fact, structured data typically represent only a small fraction of the data found in the corporation. The other kind of data found in the corporation is called unstructured data.
How much of the data in the corporation is structured and how much is unstructured is a matter of conjecture. Estimates of the structured share run from as low as 2% to as high as 20%. The estimate depends on the nature of the corporation's business and on which data are included in the calculation.

Repetitive/Nonrepetitive Unstructured Data

There are two basic kinds of unstructured data in the corporation—repetitive unstructured data and nonrepetitive unstructured data.
Fig. 1.1.3 depicts the different kinds of unstructured data in the corporation.
Fig. 1.1.3 Repetitive data and nonrepetitive data.
A typical form of repetitive unstructured data in the corporation might be the data generated by an analog machine. For example, a farmer has a machine that reads the identification of railroad cars as the railroad cars pass through the farmer's property. Trains pass through the property night and day. The electronic eye reads and records the passage of each car on the track.
Nonrepetitive unstructured data are data whose form and content vary from one occurrence to the next, such as e-mails. Each e-mail can be long or short. The e-mail can be in English or Spanish (or some other language). The author of the e-mail can say anything that he/she pleases. It is only a pure accident if the contents of any e-mail are identical to the contents of any other e-mail. And there are many forms of nonrepetitive unstructured data. There are voice recordings, there are contracts, there are customer feedback messages, etc.
Because of their irregular form, unstructured data do not fit well with standard database management systems.
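The contrast between repetitive and nonrepetitive unstructured data can be sketched in a few lines of Python. The log line format below (`CAR=... TS=...`) is purely hypothetical, since the text does not say what the electronic eye records, and the e-mail texts are invented. The point is only that one fixed pattern can parse every repetitive record, while the same approach finds nothing to hold on to in free-form text.

```python
import re

# Hypothetical line format for the electronic-eye log; the source does not specify one.
CAR_READING = re.compile(
    r"^CAR=(?P<car_id>[A-Z]{4}\d{6}) "
    r"TS=(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})$"
)


def parse_all(lines):
    """Split lines into (parsed_rows, unparsed_lines) using the one fixed pattern."""
    parsed, unparsed = [], []
    for line in lines:
        match = CAR_READING.match(line)
        if match:
            parsed.append(match.groupdict())
        else:
            unparsed.append(line)
    return parsed, unparsed


# Repetitive: every reading has the same shape.
railcar_log = [
    "CAR=ABCD123456 TS=2019-01-15T02:14:07",
    "CAR=WXYZ654321 TS=2019-01-15T02:14:09",
]
# Nonrepetitive: length, language, and content all vary.
emails = [
    "Hi Pat, the contract draft is attached. Call me if the dates look wrong.",
    "Hola, ¿podemos mover la reunión al jueves?",
]

rail_rows, rail_misses = parse_all(railcar_log)
mail_rows, mail_misses = parse_all(emails)
print(len(rail_rows), len(rail_misses))  # 2 0
print(len(mail_rows), len(mail_misses))  # 0 2
```

A single pattern is enough for the railcar readings because each one repeats the same layout. No comparable fixed pattern exists for the e-mails, which is the practical sense in which nonrepetitive data resist standard record-oriented handling.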

The Great Divide of Data

It is not obvious at all, but the dividing line in unstructured data between unstructured repetitive data and unstructured nonrepetitive data is very significant. In fact, the dividing line between unstructured repetitive data and unstructured nonrepetitive data is so important that the division can be called the “great divide” of data.
Fig. 1.1.4 shows the great divide of data.
Fig. 1.1.4 The great divide.
It is hardly obvious why there should be this great divide of data. But there are some very good reasons for the divide:
  • Repetitive data usually have very limited business value, wh...

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Chapter 1.1: An Introduction to Data Architecture
  7. Chapter 1.2: The Data Infrastructure
  8. Chapter 1.3: The “Great Divide”
  9. Chapter 1.4: Demographics of Corporate Data
  10. Chapter 1.5: Corporate Data Analysis
  11. Chapter 1.6: The Life Cycle of Data: Understanding Data Over Time
  12. Chapter 1.7: A Brief History of Data
  13. Chapter 2.1: The End-State Architecture—The “World Map”
  14. Chapter 3.1: Transformations in the End-State Architecture
  15. Chapter 4.1: A Brief History of Big Data
  16. Chapter 4.2: What Is Big Data?
  17. Chapter 4.3: Parallel Processing
  18. Chapter 4.4: Unstructured Data
  19. Chapter 4.5: Contextualizing Repetitive Unstructured Data
  20. Chapter 4.6: Textual Disambiguation
  21. Chapter 4.7: Taxonomies
  22. Chapter 5.1: The Siloed Application Environment
  23. Chapter 6.1: Introduction to Data Vault 2.0
  24. Chapter 6.2: Introduction to Data Vault Modeling
  25. Chapter 6.3: Introduction to Data Vault Architecture
  26. Chapter 6.4: Introduction to Data Vault Methodology
  27. Chapter 6.5: Introduction to Data Vault Implementation
  28. Chapter 7.1: The Operational Environment: A Short History
  29. Chapter 7.2: The Standard Work Unit
  30. Chapter 7.3: Data Modeling for the Structured Environment
  31. Chapter 8.1: A Brief History of Data Architecture
  32. Chapter 8.2: Big Data/Existing System Interface
  33. Chapter 8.3: The Data Warehouse/Operational Environment Interface
  34. Chapter 8.4: Data Architecture: A High-Level Perspective
  35. Chapter 9.1: Repetitive Analytics: Some Basics
  36. Chapter 9.2: Analyzing Repetitive Data
  37. Chapter 9.3: Repetitive Analysis
  38. Chapter 10.1: Nonrepetitive Data
  39. Chapter 10.2: Mapping
  40. Chapter 10.3: Analytics From Nonrepetitive Data
  41. Chapter 11.1: Operational Analytics: Response Time
  42. Chapter 12.1: Operational Analytics
  43. Chapter 13.1: Personal Analytics
  44. Chapter 14.1: Data Models Across the End-State Architecture
  45. Chapter 15.1: The System of Record
  46. Chapter 16.1: Business Value and the End-State Architecture
  47. Chapter 17.1: Managing Text
  48. Chapter 18.1: An Introduction to Data Visualizations
  49. Glossary
  50. Index