Data Architecture: A Primer for the Data Scientist
eBook - ePub

Big Data, Data Warehouse and Data Vault

  1. 378 pages
  2. English
  3. ePUB (mobile friendly)
About this book

Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within the existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can't be used to its full potential.

Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems.

You'll be able to:
- Turn textual information into a form that can be analyzed by standard tools
- Make the connection between analytics and Big Data
- Understand how Big Data fits within an existing systems environment
- Conduct analytics on repetitive and non-repetitive data

This book:
- Discusses the value in Big Data that is often overlooked (non-repetitive data) and why there is significant business value in using it
- Shows how to turn textual information into a form that can be analyzed by standard tools
- Explains how Big Data fits within an existing systems environment
- Presents new opportunities that are afforded by the advent of Big Data
- Demystifies the murky waters of repetitive and non-repetitive data in Big Data

Information

1.1

Corporate Data

Abstract

Corporate data includes everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule, there is much more unstructured data than structured data. Unstructured data has two basic divisions – repetitive data and nonrepetitive data. Big Data is made up of unstructured data. Nonrepetitive Big Data has a fundamentally different form than repetitive Big Data. In fact, the differences between nonrepetitive Big Data and repetitive Big Data are so large that they can be called the boundaries of the “great divide.” Even so, many professionals are not even aware that this divide exists. As a rule, nonrepetitive Big Data has much greater business value than repetitive Big Data.

Keywords

Big Data
business value
corporate data
great divide of data
nonrepetitive data
repetitive data
structured data
unstructured data

In today’s world it is easy to get lost when dealing with data. There are many different types of data, and each type has its own peculiarities and idiosyncrasies. Products, vendors, and applications become so focused on their own specific world that the larger picture of how things fit together often gets lost. It is often useful to step back and look at the larger picture to gain a proper perspective.

The Totality of Data Across the Corporation

Consider the totality of data found in the corporation. A simplistic depiction of it is seen in Figure 1.1.1.
Figure 1.1.1
The totality of data represented here includes everything to do with data of any kind found in the corporation.
There are many ways to subdivide the totality of data in the corporation. One such way (but hardly the only way) is to divide it into structured data and unstructured data, as seen in Figure 1.1.2.
Figure 1.1.2
Structured data is data that has a predictable and regularly occurring format. Typically, structured data is managed by a database management system (DBMS) and consists of records, attributes, keys, and indexes. Structured data is well defined, predictable, and managed by an elaborate infrastructure. As a rule, most units of data in the structured environment can be located very quickly and easily.
Unstructured data, conversely, is data that is unpredictable and has no structure that is recognizable to a computer. As a rule, unstructured data is rather clumsy to access, because long strings of data have to be sequentially searched (parsed) in order to find a given unit of data. There are many forms and variations of unstructured data. Perhaps the most commonly occurring form of unstructured data is text. However, by no stretch of the imagination is text the only form of unstructured data.
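
The contrast in access paths can be sketched in a few lines of Python. This is only an illustration, not something taken from the book: the account records, the `index_by_id` dictionary (standing in for a DBMS index), the sample note, and the `A-<digits>` identifier pattern are all hypothetical.

```python
import re

# Structured data: records with named attributes, located through a key.
# The account ids, owners, and balances are made-up illustration values.
accounts = [
    {"account_id": "A-100", "owner": "Lee", "balance": 250.0},
    {"account_id": "A-101", "owner": "Ortiz", "balance": 980.5},
]

# A dictionary keyed on account_id plays the role of a DBMS index here.
index_by_id = {rec["account_id"]: rec for rec in accounts}


def find_structured(account_id):
    """Locate a record directly through the key index (no scanning)."""
    return index_by_id.get(account_id)


# Unstructured data: free text with no attributes a program can address.
note = (
    "Customer called about a late fee. Mentioned account A-101, "
    "then asked whether A-100 could be closed."
)


def find_unstructured(text, pattern=r"A-\d+"):
    """Scan the whole text sequentially and collect anything shaped like an id."""
    return re.findall(pattern, text)


print(find_structured("A-101"))
print(find_unstructured(note))
```

The structured lookup goes straight to one record through its key, while the text has to be read from start to end, and the result is only as good as the assumed pattern.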

Dividing Unstructured Data

Unstructured data can further be divided into two basic forms of data – repetitive unstructured data and nonrepetitive unstructured data. As is the case with the division of corporate data, the method shown here is but one of many ways to subdivide unstructured data. This simple subdivision of unstructured data is shown in Figure 1.1.3.
Figure 1.1.3
Repetitive unstructured data is data that occurs many, many times, often in the same structure and even in the exact same embodiment. The structure of each repetitive record looks exactly the same or substantially the same as that of the previous record. There is no massive and elaborate infrastructure managing the content of repetitive unstructured data.
Nonrepetitive unstructured data is data where the records are substantially different from each other. In general, each nonrepetitive record is markedly different from every other record.
The division of data types in the corporation has many different embodiments. Consider the data as shown in Figure 1.1.4.
Figure 1.1.4
Structured data is typically found as a by-product of transactions. Every time a sale is made, every time a bank account encounters a withdrawal, every time someone transacts an ATM activity, and every time a bill is sent, a record of the transaction is made. The record of the transaction ends up as a structured record.
Unstructured repetitive data is quite different. Unstructured repetitive records are typically records of machine interactions, such as the analog verification of product coming off a manufacturing process or the metering of energy usage by a consumer. Consider metering. The records created by metered readings repeat heavily in both form and substance.
Unstructured nonrepetitive information is fundamentally different from unstructured repetitive records. With unstructured nonrepetitive records there is little or no repetition of either form or content from one record to the next. Some examples of unstructured nonrepetitive information include email, call center conversations, and market research. When you look at one email, the odds are very good that the next email in the database will be different from the previous email. The same is true for call center information, warranty claims, market research, and so forth.
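
One rough way to see the difference between metered readings and email is to reduce each record to a structural "shape" and count how often the most common shape occurs. The sketch below is a hypothetical heuristic chosen for illustration, not a method described in the book; the `shape` and `repetition_ratio` functions and the sample records are all assumptions.

```python
import re
from collections import Counter


def shape(record):
    """Reduce a record to a rough structural signature (hypothetical heuristic)."""
    s = re.sub(r"\d+", "9", record)    # any run of digits becomes "9"
    s = re.sub(r"[A-Za-z]+", "a", s)   # any run of letters becomes "a"
    return s


def repetition_ratio(records):
    """Share of records that have the most common shape (1.0 = fully repetitive)."""
    counts = Counter(shape(r) for r in records)
    return counts.most_common(1)[0][1] / len(records)


# Made-up metered readings: same layout every time, only the values change.
meter_readings = [
    "meter=1041 ts=1700000000 kwh=12",
    "meter=1041 ts=1700003600 kwh=14",
    "meter=2210 ts=1700000000 kwh=9",
]

# Made-up emails: neither form nor content repeats from one record to the next.
emails = [
    "Hi team, the shipment arrived late again.",
    "Can we move the review to 3 pm?",
    "Thanks!",
]

print(repetition_ratio(meter_readings))
print(repetition_ratio(emails))
```

Under these assumptions, the meter readings all collapse to a single shape, while each email produces its own, which mirrors the repetition "in both form and substance" described above.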

Business Relevancy

Unstructured repetitive data and unstructured nonrepetitive data have very different characteristics, in many different ways. One of the ways that these...

Table of contents

  1. Cover
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Preface
  7. About the Authors
  8. 1.1: Corporate Data
  9. 1.2: The Data Infrastructure
  10. 1.3: The “Great Divide”
  11. 1.4: Demographics of Corporate Data
  12. 1.5: Corporate Data Analysis
  13. 1.6: The Life Cycle of Data – Understanding Data Over Time
  14. 1.7: A Brief History of Data
  15. 2.1: A Brief History of Big Data
  16. 2.2: What is Big Data?
  17. 2.3: Parallel Processing
  18. 2.4: Unstructured Data
  19. 2.5: Contextualizing Repetitive Unstructured Data
  20. 2.6: Textual Disambiguation
  21. 2.7: Taxonomies
  22. 3.1: A Brief History of Data Warehouse
  23. 3.2: Integrated Corporate Data
  24. 3.3: Historical Data
  25. 3.4: Data Marts
  26. 3.5: The Operational Data Store
  27. 3.6: What a Data Warehouse is Not
  28. 4.1: Introduction to Data Vault
  29. 4.2: Introduction to Data Vault Modeling
  30. 4.3: Introduction to Data Vault Architecture
  31. 4.4: Introduction to Data Vault Methodology
  32. 4.5: Introduction to Data Vault Implementation
  33. 5.1: The Operational Environment – A Short History
  34. 5.2: The Standard Work Unit
  35. 5.3: Data Modeling for the Structured Environment
  36. 5.4: Metadata
  37. 5.5: Data Governance of Structured Data
  38. 6.1: A Brief History of Data Architecture
  39. 6.2: Big Data/Existing Systems Interface
  40. 6.3: The Data Warehouse/Operational Environment Interface
  41. 6.4: Data Architecture – A High-Level Perspective
  42. 7.1: Repetitive Analytics – Some Basics
  43. 7.2: Analyzing Repetitive Data
  44. 7.3: Repetitive Analysis
  45. 8.1: Nonrepetitive Data
  46. 8.2: Mapping
  47. 8.3: Analytics from Nonrepetitive Data
  48. 9.1: Operational Analytics
  49. 10.1: Operational Analytics
  50. 11.1: Personal Analytics
  51. 12.1: A Composite Data Architecture
  52. Glossary
  53. Index