eBook - ePub

Data Architecture: A Primer for the Data Scientist

Name: Data Architecture: A Primer for the Data Scientist
ISBN: 9780128020913

Big Data, Data Warehouse and Data Vault

W.H. Inmon, Daniel Linstedt

378 pages
English

About this book

Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it cannot be used to its full potential.

Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems.

You'll be able to:

- Turn textual information into a form that can be analyzed by standard tools
- Make the connection between analytics and Big Data
- Understand how Big Data fits within an existing systems environment
- Conduct analytics on repetitive and non-repetitive data

The book:

- Discusses the often-overlooked value in Big Data, namely non-repetitive data, and why there is significant business value in using it
- Shows how to turn textual information into a form that can be analyzed by standard tools
- Explains how Big Data fits within an existing systems environment
- Presents new opportunities that are afforded by the advent of Big Data
- Demystifies the murky waters of repetitive and non-repetitive data in Big Data


Publisher

Morgan Kaufmann

Year

2014

eBook ISBN

9780128020913

Topic

Business Administration

Subtopic

Business Intelligence

1.1

Corporate Data

Abstract

Corporate data includes everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule, there is much more unstructured data than structured data. Unstructured data has two basic divisions – repetitive data and nonrepetitive data. Big Data is made up of unstructured data. Nonrepetitive Big Data has a fundamentally different form than repetitive Big Data. In fact, the differences between nonrepetitive Big Data and repetitive Big Data are so large that they can be called the boundaries of the “great divide.” The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive Big Data has much greater business value than repetitive Big Data.

Keywords

Big Data

business value

corporate data

great divide of data

nonrepetitive data

repetitive data

structured data

unstructured data

In today’s world it is easy to get lost when dealing with data. There are many different types of data, and each type has its own peculiarities and idiosyncrasies. Products, vendors, and applications become so focused on their own specific world that the larger picture of how things fit together often gets lost. It is often useful to step back and look at the larger picture to gain a proper perspective.

The Totality of Data Across the Corporation

Consider the totality of data found in the corporation. A simplistic depiction of it is seen in Figure 1.1.1.

The totality of data represented here includes everything to do with data of any kind found in the corporation.

There are many ways to subdivide the totality of data in the corporation. One such way (but hardly the only way) is to divide it into structured data and unstructured data, as seen in Figure 1.1.2.

Structured data is data that has a predictable and regularly occurring format. Typically, structured data is managed by a database management system (DBMS) and consists of records, attributes, keys, and indexes. Structured data is well defined, predictable, and managed by an elaborate infrastructure. As a rule, most units of data in the structured environment can be located very quickly and easily.

Unstructured data, conversely, is data that is unpredictable and has no structure that is recognizable to a computer. As a rule, unstructured data is rather clumsy to access: long strings of data have to be sequentially searched (parsed) in order to find a given unit of data. There are many forms and variations of unstructured data. Perhaps the most commonly occurring form of unstructured data is text. However, by no stretch of the imagination is text the only form of unstructured data.
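The difference in access described above can be sketched in a few lines of Python. This is a minimal illustration, not anything taken from the book: the account records, the `index_by_id` dictionary standing in for a DBMS index, and the free-text `notes` string are all hypothetical.

```python
import re

# Structured: records with named attributes, reachable through a key.
# All names and values below are hypothetical illustration data.
accounts = [
    {"account_id": 1001, "owner": "Ada", "balance": 250.0},
    {"account_id": 1002, "owner": "Grace", "balance": 975.5},
]

# A dictionary keyed on account_id plays the role of a DBMS index here.
index_by_id = {rec["account_id"]: rec for rec in accounts}


def find_structured(account_id):
    """Locate a record directly through its key; no scanning is needed."""
    return index_by_id.get(account_id)


# Unstructured: free text with no attributes or keys a program can rely on.
notes = (
    "Customer Ada called about her card. "
    "Later Grace asked to raise the limit on account 1002 before travelling."
)


def find_unstructured(text, pattern):
    """Scan the text from start to end and return every match of the pattern."""
    return [m.group(0) for m in re.finditer(pattern, text)]
```

Calling `find_structured(1002)` goes straight to the record through the key, while `find_unstructured(notes, r"account \d+")` has to walk the whole string to find the same account number.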

Dividing Unstructured Data

Unstructured data can be further divided into two basic forms of data – repetitive unstructured data and nonrepetitive unstructured data. As is the case with the division of corporate data, the method shown here is but one of many ways to subdivide unstructured data. This simple subdivision is shown in Figure 1.1.3.

Repetitive unstructured data is data that occurs many, many times, often in the same structure and even in the exact same embodiment. Each repetitive record looks exactly the same or substantially the same as the previous record. There is no massive and elaborate infrastructure managing the content of repetitive unstructured data.

Nonrepetitive unstructured data is data where the records are substantially different from each other. In general, each nonrepetitive record is markedly different from every other record.

The division of data types in the corporation has many different embodiments. Consider the data as shown in Figure 1.1.4.

Structured data is typically found as a by-product of transactions. Every time a sale is made, every time a bank account encounters a withdrawal, every time someone performs an ATM transaction, and every time a bill is sent, a record of the transaction is made. The record of the transaction ends up as a structured record.

Unstructured repetitive data is quite different. Unstructured repetitive records are typically records of machine interactions, such as the analog verification of a product coming off a manufacturing process or the metering of energy usage by a consumer. Consider metering. The records created from metered readings show great repetition in both form and substance.

Unstructured nonrepetitive information is fundamentally different from unstructured repetitive records. With unstructured nonrepetitive records there is little or no repetition of either form or content from one record to the next. Some examples of unstructured nonrepetitive information include email, call center conversations, and market research. When you look at one email, the odds are very good that the next email in the database will be different from the previous one. The same is true for call center information, warranty claims, market research, and so forth.
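The contrast between repetitive and nonrepetitive records can be made concrete with a rough heuristic. The sketch below is an assumption for illustration only, not a method from the book: it reduces each record to a crude "shape" and measures how many records share the most common shape. The function names, the sample meter readings, and the sample emails are all hypothetical.

```python
import re
from collections import Counter


def shape(record: str) -> str:
    """Reduce a record to a rough structural signature.

    Each run of digits becomes '9' and each run of letters becomes 'a',
    so 'meter=17 kwh=3.25' and 'meter=42 kwh=11.0' share one shape.
    """
    s = re.sub(r"[0-9]+", "9", record)
    s = re.sub(r"[A-Za-z]+", "a", s)
    return s


def repetition_ratio(records):
    """Fraction of records that share the most common shape (0.0 if empty)."""
    if not records:
        return 0.0
    counts = Counter(shape(r) for r in records)
    return counts.most_common(1)[0][1] / len(records)


# Hypothetical sample data: metered readings versus emails.
meter_readings = [
    "meter=17 kwh=3.25",
    "meter=42 kwh=11.0",
    "meter=8 kwh=0.75",
]
emails = [
    "Hi team, the Q3 report is attached.",
    "Can we move lunch to 1pm?",
    "Warranty claim #553: screen cracked on arrival",
]
```

On this sample data every meter reading has the same shape, so `repetition_ratio(meter_readings)` is 1.0, while no two emails share a shape, so `repetition_ratio(emails)` is one third. Real data would need a more careful signature, but the heuristic captures the idea that repetitive records repeat in form while nonrepetitive records do not.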

Business Relevancy

Unstructured repetitive data and unstructured nonrepetitive data have very different characteristics, in many different ways. One of the ways that these...

Cover
Title page
Table of Contents
Copyright
Dedication
Preface
About the Authors
1.1: Corporate Data
1.2: The Data Infrastructure
1.3: The “Great Divide”
1.4: Demographics of Corporate Data
1.5: Corporate Data Analysis
1.6: The Life Cycle of Data – Understanding Data Over Time
1.7: A Brief History of Data
2.1: A Brief History of Big Data
2.2: What is Big Data?
2.3: Parallel Processing
2.4: Unstructured Data
2.5: Contextualizing Repetitive Unstructured Data
2.6: Textual Disambiguation
2.7: Taxonomies
3.1: A Brief History of Data Warehouse
3.2: Integrated Corporate Data
3.3: Historical Data
3.4: Data Marts
3.5: The Operational Data Store
3.6: What a Data Warehouse is Not
4.1: Introduction to Data Vault
4.2: Introduction to Data Vault Modeling
4.3: Introduction to Data Vault Architecture
4.4: Introduction to Data Vault Methodology
4.5: Introduction to Data Vault Implementation
5.1: The Operational Environment – A Short History
5.2: The Standard Work Unit
5.3: Data Modeling for the Structured Environment
5.4: Metadata
5.5: Data Governance of Structured Data
6.1: A Brief History of Data Architecture
6.2: Big Data/Existing Systems Interface
6.3: The Data Warehouse/Operational Environment Interface
6.4: Data Architecture – A High-Level Perspective
7.1: Repetitive Analytics – Some Basics
7.2: Analyzing Repetitive Data
7.3: Repetitive Analysis
8.1: Nonrepetitive Data
8.2: Mapping
8.3: Analytics from Nonrepetitive Data
9.1: Operational Analytics
10.1: Operational Analytics
11.1: Personal Analytics
12.1: A Composite Data Architecture
Glossary
Index

