Measuring Data Quality for Ongoing Improvement
A Data Quality Assessment Framework

Laura Sebastian-Coleman

About This Book

The Data Quality Assessment Framework shows you how to measure and monitor data quality, ensuring quality over time. You'll start with general concepts of measurement and work your way through a detailed framework of more than three dozen measurement types related to five objective dimensions of quality: completeness, timeliness, consistency, validity, and integrity. Ongoing measurement, rather than one-time activities, will help your organization reach a new level of data quality. This plain-language approach to measuring data can be understood by both business and IT, and it provides practical guidance on how to apply the DQAF within any organization, enabling you to prioritize measurements and report effectively on results. The book includes strategies for using data measurement to govern and improve the quality of data, along with guidelines for applying the framework within a data asset. You'll come away able to prioritize which measurement types to implement, knowing where to place them in a data flow and how frequently to measure. Also included are common conceptual models for defining and storing data quality results for purposes of trend analysis, as well as generic business requirements for ongoing measuring and monitoring, including the calculations and comparisons that make measurements meaningful, reveal trends, and help detect anomalies.

  • Demonstrates how to leverage a technology-independent data quality measurement framework for your specific business priorities and data quality challenges
  • Enables discussions between business and IT with a non-technical vocabulary for data quality measurement
  • Describes how to measure data quality on an ongoing basis with generic measurement types that can be applied to any situation


Information

Year: 2012
ISBN: 9780123977540
Pages: 376
Language: English
Section 1. Concepts and Definitions
“Does not any analysis of measurement require concepts more fundamental than measurement?”
—John Stewart Bell, Irish physicist, 1928–1990
The purpose of Measuring Data Quality for Ongoing Improvement is to help people understand ways of measuring data quality so that they can improve the quality of the data they are responsible for. A working assumption is that most people—even those who work in the field of information quality—find data quality measurement difficult or perplexing. The book will try to reduce that difficulty by describing the Data Quality Assessment Framework (DQAF), a set of 48 generic measurement types based on five dimensions of data quality: completeness, timeliness, validity, consistency, and integrity. The DQAF focuses on objective characteristics of data. Using the framework requires a wider context than just these dimensions. Effective data quality measurement requires building knowledge of your organization’s data: where it comes from, where it is stored, how it moves within the organization, who uses it, and to what ends.
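To make the framework’s shape concrete before Section One unpacks these concepts, here is a minimal sketch, in Python, of how the five dimensions and a generic measurement type might be represented. The class names, fields, and threshold logic are illustrative assumptions for this sketch, not part of the DQAF itself.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five objective dimensions of quality used by the DQAF."""
    COMPLETENESS = "completeness"
    TIMELINESS = "timeliness"
    VALIDITY = "validity"
    CONSISTENCY = "consistency"
    INTEGRITY = "integrity"


@dataclass
class MeasurementType:
    """A generic, repeatable measurement pattern (names are illustrative)."""
    name: str                 # e.g., "Validity of a coded field against its domain"
    dimension: Dimension      # which dimension the measurement speaks to
    description: str          # what is compared to what, and why


@dataclass
class MeasurementResult:
    """One execution of a measurement type against a specific data set."""
    measurement: MeasurementType
    measured_value: float     # e.g., percentage of records passing the check
    threshold: float          # expected or acceptable value for comparison

    def is_anomalous(self) -> bool:
        # Flag results that fall below the expected threshold for follow-up.
        return self.measured_value < self.threshold


# Example: a validity measurement on a coded field, taken on an ongoing basis.
validity_check = MeasurementType(
    name="Code field validity",
    dimension=Dimension.VALIDITY,
    description="Compare incoming code values to the defined domain of valid codes.",
)
result = MeasurementResult(validity_check, measured_value=97.2, threshold=99.0)
print(result.is_anomalous())  # True -> investigate the drop in valid codes
```

The point of the sketch is only the structure: each measurement type belongs to a dimension, and each execution of it produces a result that can be compared to an expectation and tracked over time.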
Section One defines a set of foundational concepts related to data, the ways it is managed within organizations, and how we understand and measure its quality. These concepts provide the context for the other subjects covered in the book, so they are explored in depth. Concise definitions are captured in the glossary. While this section introduces most of the basic concepts, it does not describe requirements (these will be covered in Section Four), strategy (see Section Five), or statistics (see Section Six). Each of these concepts is covered in the sections focused on those topics. Section One includes four chapters.
Chapter 1: Data presents an extended definition of data that emphasizes data’s semiotic function. Data represents things other than itself. How well it does so influences our perceptions of data quality. The chapter defines information as a variation on the concept of data and discusses the relation of both to knowledge. A primary implication of the chapter is that there cannot be data without knowledge.
Chapter 2: Data, People, and Systems defines a set of roles related to data and data management: data producer, data consumer, data broker, data steward, data quality program, and stakeholder. In many organizations individuals play multiple roles. It is not always people who fill these roles, however. Systems also produce and consume data. The chapter also addresses the sometimes challenging relationship between people who work in information technology (IT) and businesspeople.
Chapter 3: Data Management, Models, and Metadata presents a set of concepts related to data management, data models, and metadata, as these have a direct bearing on data quality. Data management implies the need for particular kinds of knowledge about data: what data represents, how much of it an organization has, how it is organized and maintained, where it resides, who uses it, and the like. Knowledge about data captured in data models and metadata is necessary input for the process of data quality measurement.
Chapter 4: Data Quality and Measurement introduces the concept of the data quality dimension as a means through which data quality can be assessed. It presents a definition of measurement, discusses the challenges associated with establishing tools for or systems of measurement, and presents characteristics of effective measurements. It then defines several general concepts associated with data quality assessment and some specific terms, such as the measurement type, used by the DQAF.
This book is about data quality measurement. That means it is about four things: how we understand data, how we understand quality, how we understand measurement, and how the first three relate to each other. Understanding each of these requires breaking through assumptions and preconceptions that have obscured the reality of data and our uses of it. When we acknowledge the representational aspects of data, we can leverage it more effectively for specific purposes. We can also better understand how to measure and improve its quality.
A note on the definitions: My starting point for definitions of common words and their etymologies is the New Oxford American Dictionary, Second Edition (2005) (cited as NOAD). For terms specific to data and information quality, I have synthesized dictionary definitions and definitions in published works on the subject. In most cases, the definitions I have adopted do not differ significantly from other published definitions. However, I have defined them in relation to how they apply to the DQAF.
Chapter 1
Data
“The spirit of Plato dies hard. We have been unable to escape the philosophical tradition that what we can see and measure in the world is merely the superficial and imperfect representation of an underlying reality.”
—Stephen Jay Gould, The Mismeasure of Man
“Data! Data! Data!” he cried impatiently. “I cannot make bricks without clay.”
—Sherlock Holmes, “The Adventure of the Copper Beeches”

Purpose

This chapter presents an extended definition of the concept of data. Understanding what data is and how it works is essential for measuring its quality. The chapter focuses on data’s role in representing objects, events, and concepts. It also discusses the relation between data and information.

Data

The New Oxford American Dictionary defines data first as “facts and statistics collected together for reference or analysis.” The American Society for Quality (ASQ) defines data as “A set of collected facts. There are two basic kinds of numerical data: measured or variable data, such as ‘16 ounces,’ ‘4 miles’ and ‘0.75 inches;’ and counted or attribute data, such as ‘162 defects’” (ASQ.org). And the International Organization for Standardization (ISO) defines data as “re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing” (ISO 11179).
The term data is the plural of the Latin datum,1 the past participle of dare, “to give.”2 Its literal meaning is “something given.” Despite this generic root, the term has a strong association with numbers, measurement, mathematics, and science. Seventeenth-century philosophers used the term to refer to “things known or assumed as facts, making the basis of reasoning or calculation.” A singular datum provides “a fixed starting point of a scale or operation.” We often think of data in relation to computing, where data refers to “the quantities, characters, or symbols, on which operations are performed by a computer, being stored and transmitted in the form of electrical signals and recorded on magnetic, optical or mechanical recording media” (NOAD) (although as we enter the age of “big data,” we may even have grown beyond the boundaries of this characterization).3
Today, we most often use the word data to refer to facts stored and shared electronically, in databases or other computer applications.4 These facts may be measurements, codified information, or simply descriptive attributes of objects in the world, such as names, locations, and physical characteristics. Because they are stored electronically, we can more quickly understand aspects of their content. The definition of data I will use throughout this book highlights not only data’s existence as part of information systems, but also data’s constructed-ness: Data are abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage.

Data as Representation

Each piece of this definition is important. The adjective abstract means “existing in thought or as an idea but not having a physical or concrete existence” (NOAD). As a verb, to abstract has several definitions, all of which include the concept of separating things from each other—for example, abstracting an idea from its historical context, or producing a summary (an abstract) of a book or article. Data enable us to understand facets of reality by abstracting (separating) and representing them in summary form (an abstract) (Peirce, 1955, p. 98). This ability works with facts based on well-known criteria (for example, birthdates based on the Julian calendar) as well as with facts based on more complex formulas, such as the performance of stocks on the New York Stock Exchange.
Data are always representations. Their function is primarily semiotic. They stand for things other than themselves (Chisholm, 2010; Orr, 1998). These things can be objects (people, places, things) or events or concepts. They can even be other data. Data functions as a sign of the thing it represents in the real world (semantics). It is rare that there would be one and only one way of representing the same thing. To be understood, any piece of data also operates within a system of signs, the meaning of which is dependent on each other (syntactics). And, finally, it is used for specific purposes and has particular effects (pragmatics) (Semiotics, Chandler, 2009).
Data represent only selected characteristics of the objects, events, and concepts. In this sense, data is a model of reality.5 We will talk about data models in Chapter 3. Models play an important function in our ability to understand the world. As Emanuel Derman asserts in Models Behaving Badly, “The world is impossible to grasp in its entirety. We can focus on only a small part of its vast confusion. …Models project a detailed and complex world onto a smaller subspace” (Derman, 2011, pp. 58–59). But, he continues, “Models are simplifications and simplification can be dangerous” (Derman, 2011, p. 59). The primary risk in using models is that we may believe in them. “The greatest conceptual danger,” writes Derman, “is idolatry. …Though I will use the models. …I will always look over my shoulder and never forget that the model is not the world” (2011, p. 198). Data presents a similar risk. In Data and Reality, his comprehensive exploration of the limits of data, William Kent observes, “A model is a basic system of constructs used in describing reality. …[It] is more than a passive medium for recording our view of reality. It shapes that view, and limits our perceptions” (Kent, 2000, p.107). Kent’s book delineates all the choices that go into representing “amorphous, disordered, contradictory, inconsistent, non-rational, and non-objective” reality in data models. Deciding what is an entity (what equals “one” of the things you are representing), what is an attribute, a category, the right level of detail in a description, what is the system, what is a name—all of these decisions contribute to selecting the parts of reality we choose to understand through data. Despite the impossibility of reaching an “absolute definition of truth and beauty,” Kent concludes that we can share a common and stable view of reality (p. 228). At the root of this shared reality is shared language (p. 226) (though he sees, as any semiotician would, that language itself participates in the problem of models).
I will give two examples to illustrate the kinds of decisions that are made when we structure data. The first has to do with the concept of “one” thing—even for a familiar concept. When we talk about a pair of shoes, are we talking about one thing or two things? The answer depends on definition and use. For most people, a pair of shoes is one thing. But for someone who has orthopedic problems, a pair of shoes means two things: a left shoe that may need one kind of alteration to be usable and a right shoe that may need another kind of alteration.
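A minimal sketch of these two modeling choices, using hypothetical Python classes, shows where the decision bites: in the first, the pair is the entity and side-specific facts have nowhere to go; in the second, each shoe is an entity and can carry its own alteration.

```python
from dataclasses import dataclass
from typing import Optional


# Choice 1: the pair is "one" thing; side-specific detail cannot be recorded.
@dataclass
class ShoePair:
    style: str
    size: float


# Choice 2: each shoe is "one" thing; left and right can differ.
@dataclass
class Shoe:
    style: str
    size: float
    side: str                          # "left" or "right"
    alteration: Optional[str] = None   # e.g., an orthopedic lift on one side only


everyday_view = ShoePair(style="oxford", size=9.5)
orthopedic_view = [
    Shoe(style="oxford", size=9.5, side="left", alteration="heel lift"),
    Shoe(style="oxford", size=9.5, side="right"),
]
```

Neither choice is wrong; each selects the characteristics of reality that matter for a particular use.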
The second example is about the arbitrariness of representation. As any high school student who has written a research paper knows, a necessary part of research is compiling a bibliography. There are different conventions for presenting information about the sources used in research. The American Psychological Association (APA) represents books like this:
Talburt, J. (2011). Entity resolution and information quality. Boston, MA: Morgan Kaufmann.
whereas the Modern Language Association (MLA) represents them like this:
Talburt, John R. Entity Resolution and Information Quality. Boston, MA: Morgan Kaufmann, 2011. Print.
And the Chicago Manual of Style (CMS) has yet another variation:
Talburt, John R. Entity Resolution and Information Quality. Boston, MA: Morgan Kaufmann, 2011.
These citations present the same basic facts about the same book—author, title, publication date, publisher, and place of publication—but the conventions of representation differ between them. Undoubtedly, thought went into these choices, so I assume there is some significance to them. But that significance is not apparent within the representation itself. (Why does the APA not capitalize all the nouns in the title? Why does MLA include the medium of publication while APA and CMS do not?) To almost anyone using them, they are simply conventions of representation.
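To put the same point in code: a single record of facts about the book can be rendered under any of the three conventions. The sketch below is illustrative only; its field names and formatting functions are assumptions that cover just the simple single-author case shown above.

```python
from dataclasses import dataclass


@dataclass
class BookCitation:
    """The same underlying facts, independent of any presentation convention."""
    author_last: str
    author_first: str
    title: str
    city: str
    publisher: str
    year: int


def _author_full(b: BookCitation) -> str:
    # Avoid a doubled period when the first name already ends in an initial.
    return f"{b.author_last}, {b.author_first}".rstrip(".") + "."


def apa(b: BookCitation) -> str:
    # APA: initial only, year in parentheses, title in sentence case.
    return (f"{b.author_last}, {b.author_first[0]}. ({b.year}). "
            f"{b.title.capitalize()}. {b.city}: {b.publisher}.")


def mla(b: BookCitation) -> str:
    return f"{_author_full(b)} {b.title}. {b.city}: {b.publisher}, {b.year}. Print."


def chicago(b: BookCitation) -> str:
    return f"{_author_full(b)} {b.title}. {b.city}: {b.publisher}, {b.year}."


book = BookCitation("Talburt", "John R.", "Entity Resolution and Information Quality",
                    "Boston, MA", "Morgan Kaufmann", 2011)
for render in (apa, mla, chicago):
    print(render(book))
```

Running the three renderers over the same record reproduces the three citations above, which is the point: the facts do not change, only the convention of representation does.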

The Implications of Data’s Semiotic Function

One implication of the semiotic function of data is that data do not simply exist; they are created. Another implication is that any given data model is only one way of representing reality. Data are thus both an interpretation of the objects they represent and themselves objects that must be interpreted.
We tend to use words that imply otherwise.6 Recognizing data’s representational function is another way of saying that data cannot be understood without context, but it goes a bit further than just that assertion. To be fully understood, data must be recognized within the particular context of its creation (or production). Most work on data quality emphasizes whether the data meet consumers’ requirements (whether data is “fit for use”).7 Ultimately, however, understanding whether data meet requirements requires understanding where the data come from and how they represent reality. Data always must be interpreted. In some situations, such as the presentation of financial reports to lay audiences or the description of scientific assertions to students, the need to describe context is recognized. Such data are presented with a full context. But in business, we often skip this step. Even for relatively straightforward data, it is important to remember that data are created through a set of choices about how to represent reality. Underlying these choices are assumptions about what constitutes reality in the first place.
Data are f...
