eBook - ePub

Information Quality

Name: Information Quality
ISBN: 9781118890653

The Potential of Data and Analytics to Generate Knowledge

Ron S. Kenett and Galit Shmueli

English

About this book

Provides an important framework for data analysts in assessing the quality of data and its potential to provide meaningful insights through analysis

Analytics and statistical analysis have become pervasive topics, mainly due to the growing availability of data and analytic tools. Technology, however, fails to deliver insights with added value if the quality of the information it generates is not assured. Information Quality (InfoQ) is a tool developed by the authors to assess the potential of a dataset to achieve a goal of interest, using data analysis. Whether the information quality of a dataset is sufficient is of practical importance at many stages of the data analytics journey, from the pre-data collection stage to the post-data collection and post-analysis stages. It is also critical to various stakeholders: data collection agencies, analysts, data scientists, and management.

This book:

Explains how to integrate the notions of goal, data, analysis and utility that are the main building blocks of data analysis within any domain.
Presents a framework for integrating domain knowledge with data analysis.
Provides a combination of both methodological and practical aspects of data analysis.
Discusses issues surrounding the implementation and integration of InfoQ in both academic programmes and business / industrial projects.
Showcases numerous case studies in a variety of application areas such as education, healthcare, official statistics, risk management and marketing surveys.
Presents a review of software tools from the InfoQ perspective along with example datasets on an accompanying website.

This book will benefit researchers in academia and industry, analysts, consultants, and agencies that collect and analyse data, and it is suitable for undergraduate and postgraduate courses involving data analysis.

Publisher: Wiley
Year: 2016
Print ISBN: 9781118874448
eBook ISBN: 9781118890653
Topic: Mathematics
Subtopic: Probability & Statistics

Part I
THE INFORMATION QUALITY FRAMEWORK

1
Introduction to information quality

1.1 Introduction

Suppose you are conducting a study on online auctions and consider purchasing a dataset from eBay, the online auction platform, for the purpose of your study. The data vendor offers you four options that are within your budget:

Data on all the online auctions that took place in January 2012
Data on all the online auctions, for cameras only, that took place in 2012
Data on all the online auctions, for cameras only, that will take place in the next year
Data on a random sample of online auctions that took place in 2012

Which option would you choose? Perhaps none of these options are of value? Of course, the answer depends on the goal of the study. But it also depends on other considerations such as the analysis methods and tools that you will be using, the quality of the data, and the utility that you are trying to derive from the analysis. In the words of David Hand (2008):

Statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.

While those experienced with data analysis will find this dilemma familiar, the statistics and related literature do not provide guidance on how to approach this question in a methodical fashion and how to evaluate the value of a dataset in such a scenario.

Statistics, data mining, econometrics, and related areas are disciplines that are focused on extracting knowledge from data. They provide a toolkit for testing hypotheses of interest, predicting new observations, quantifying population effects, and summarizing data efficiently. In these empirical fields, measurable data is used to derive knowledge. Yet, a clean, exact, and complete dataset, which is analyzed professionally, might contain no useful information for the problem under investigation. In contrast, a very “dirty” dataset, with missing values and incomplete coverage, can contain useful information for some goals. In some cases, available data can even be misleading (Patzer, 1995, p. 14):

Data may be of little or no value, or even negative value, if they misinform.

The focus of this book is on assessing the potential of a particular dataset for achieving a given analysis goal by employing data analysis methods and considering a given utility. We call this concept information quality (InfoQ). We propose a formal definition of InfoQ and provide guidelines for its assessment. Our objective is to offer a general framework that applies to empirical research. Such an element has not received much attention in the body of knowledge of the statistics profession and can be considered a contribution to both the theory and the practice of applied statistics (Kenett, 2015).

A framework for assessing InfoQ is needed both when designing a study to produce findings of high InfoQ and at the postdesign stage, after the data has been collected. Questions regarding the value of data to be collected, or that have already been collected, have important implications both in academic research and in practice. With this motivation in mind, we construct the concept of InfoQ and then operationalize it so that it can be implemented in practice.

In this book, we address a high‐level issue at the core of any data analysis. Rather than concentrate on a specific set of methods or applications, we consider a general concept that underlies any empirical analysis. The InfoQ framework therefore contributes to the literature on statistical strategy, also known as metastatistics (see Hand, 1994).

1.2 Components of InfoQ

Our definition of InfoQ involves four major components that are present in every data analysis: an analysis goal, a dataset, an analysis method, and a utility (Kenett and Shmueli, 2014). The discussion and assessment of InfoQ require examining and considering the complete set of its components as well as the relationships between the components. In such an evaluation we also consider eight dimensions that deconstruct the InfoQ concept. These dimensions are presented in Chapter 3. We start our introduction of InfoQ by defining each of its components.

Before describing each of the four InfoQ components, we introduce the following notation and definitions to help avoid confusion:

g denotes a specific analysis goal.
X denotes the available dataset.
f is an empirical analysis method.
U is a utility measure.

We use subscript indices to indicate alternatives. For example, to convey K different analysis goals, we use g₁, g₂,…, g_K; J different methods of analysis are denoted f₁, f₂,…, f_J.

Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we can think of the InfoQ framework as one for evaluating the application of a technology (data analysis) to a resource (data) for a given purpose.
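The notation above can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the class and function names (Goal, evaluate, closeness), the benchmark value, and the toy prices are all hypothetical and do not come from the book, and composing the pieces as "apply f to X for goal g, then score the result with U" is one plausible reading of the notation rather than the authors' formal definition.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Callable, Dict, Sequence

Dataset = Sequence[float]  # X: the available dataset (here, a plain list of numbers)


@dataclass(frozen=True)
class Goal:
    """g: a specific analysis goal (hypothetical container for this sketch)."""
    description: str
    benchmark: float  # made-up reference value that the toy utility compares against


Method = Callable[[Dataset, Goal], float]   # f: an empirical analysis method
Utility = Callable[[float, Goal], float]    # U: a utility measure


def evaluate(f: Method, X: Dataset, g: Goal, U: Utility) -> float:
    """Apply method f to dataset X for goal g, then score the result with U."""
    return U(f(X, g), g)


def closeness(estimate: float, g: Goal) -> float:
    """Toy utility: 1.0 when the estimate equals the benchmark, falling toward 0."""
    return 1.0 / (1.0 + abs(estimate - g.benchmark))


# Alternatives f_1, ..., f_J are indexed by name here instead of by subscript.
methods: Dict[str, Method] = {
    "mean": lambda X, g: mean(X),
    "median": lambda X, g: median(X),
}

g1 = Goal("estimate a typical closing price", benchmark=11.5)
X = [10.0, 12.0, 11.0, 95.0]  # made-up prices with one extreme value

# Same goal, same data, same utility; only the analysis method changes.
scores = {name: evaluate(f, X, g1, closeness) for name, f in methods.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With these toy numbers the median scores higher than the mean, because the single extreme value pulls the mean away from the benchmark. The point of the sketch is only that the score depends jointly on g, X, f, and U: changing any one of them can change the result.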

1.2.1 Goal (g)

Data analysis is used for a variety of purposes in research and in industry. The term “goal” can refer to two goals: the high‐level goal of the study (the “domain goal”) and the empirical goal (the “analysis goal”). One starts from the domain goal and then converts it into an analysis goal. A classic example is translating a hypothesis driven by a theory into a set of statistical hypotheses.
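As a hedged illustration of this translation (the model and symbols below are assumptions of this sketch, not taken from the book), suppose the domain hypothesis is that an input x affects an outcome y, and the analyst adopts a simple linear regression model. The analysis goal can then be written as a pair of statistical hypotheses about the slope:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
H_0:\ \beta_1 = 0 \quad \text{versus} \quad H_1:\ \beta_1 \neq 0
```

Here the domain goal (does x affect y?) is expressed as an analysis goal (is the slope different from zero?). A different choice of model would lead to a different set of statistical hypotheses for the same domain goal.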

There are various classifications of study goals; some classifications span both the domain and analysis goals, while other classification systems focus on describing different analysis goals.

One classification approach divides the domain and analysis goals into three general classes: causal explanation, empirical prediction, and description (see Shmueli, 2010; Shmueli and Koppius, 2011).

Causal explanation is concerned with establishing and quantifying the causal relationship between inputs and outcomes of interest. Lab experiments in the life sciences are often intended to establish causal relationships. Academic research in the social sciences is typically focused on causal explanation. In the social science context, the causality structure is based on a theoretical model that establishes the causal effect of some constructs (abstract concepts) on other constructs. The data collection stage is therefore preceded by a construct operationalization stage, where the researcher establishes which measurable variables can represent the constructs of interest. An example is investigating the causal effect of parents’ intelligence on their children’s intelligence. The construct “intelligence” can be measured in various ways, such as via IQ tests.

The goal of empirical prediction differs from causal explanation. Examples include forecasting future values of a time series and predicting the output value for new observations given a set of input variables. Recommendation systems on various websites, which are aimed at predicting services or products that the user is most likely to be interested in, are one such example. Predictions of the economy are another type of predictive goal, with forecasts of particul...

Cover
Title Page
Table of Contents
Foreword
About the authors
Preface
Quotes about the book
About the companion website
Part I: THE INFORMATION QUALITY FRAMEWORK
Part II: APPLICATIONS OF InfoQ
Part III: IMPLEMENTING InfoQ
Index
End User License Agreement
