CHAPTER 1
INTRODUCTION
ELIZABETH M. PIERCE
Abstract: Data and information quality represent an important and maturing area in the field of Management Information Systems. This introduction explores the motivation for organizations to pursue better data and information quality. This pursuit is fraught with challenges as organizations discover the difficulties surrounding the definition, measurement, analysis, and improvement of quality for data and information. For help in dealing with these challenges, organizations can turn to a growing body of research on data and information quality, including this volume, which features fourteen new and innovative papers by leading researchers and practitioners in the field.
“Water, water, every where,
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.”
From “The Rime of the Ancient Mariner”
by Samuel Taylor Coleridge
MOTIVATION FOR IMPROVING DATA AND INFORMATION QUALITY
Like the lament of Coleridge’s ancient mariner who finds himself adrift at sea surrounded by water yet dying of thirst, many organizations find they are surrounded by data, yet much of it does not truly satisfy their information needs. Today we have at our disposal vast stores of information that come in a variety of forms: records, instructions, designs, blueprints, maps, images, sounds, metadata, detailed data, and summarized data, to name just a few. This information may be stored in places ranging from file cabinets to databases and from library shelves to the internet. Today’s organizations have achieved quantity of data and information, but not necessarily quality of either, meaning that the data or information lacks one or more vital characteristics necessary for it to be fit for use. Problems with the quality of data and information are further compounded by the struggles many organizations are experiencing as they try to improve their systems for knowledge management and organizational memory.
The relationship between data, information, and knowledge is important. Data are often viewed as simple facts. When data are put into a context and combined within some structure, information emerges. When information is given meaning by being interpreted, information becomes knowledge. Although it appears as if knowledge is built upon information and information is built upon data, Tuomi (1999–2000) argues that in fact one must first define the knowledge that is needed. Only then can one describe the information needed to convey that knowledge, and only once the information has been defined, can one describe the raw data and processes required to transform the data into information. For example, an organization may need knowledge about a customer’s desires in order to complete a sale. The need for that knowledge drives the organization to define a sales order, an information product composed of specified raw data about the customer, product(s), and service(s) that must be processed and arranged into an agreed-upon form.
Tuomi’s (1999–2000) concept of a reversed knowledge, information, and data hierarchy is illuminating because it implies that data and information quality criteria must also be defined according to this reversed hierarchy. An organization should identify the type of knowledge as well as the quality level of the knowledge it needs to conduct its day-to-day operations and to make decisions. Only then can the organization adequately specify the information products along with their quality criteria that will make it possible for the organization to retain and convey this knowledge. Once the quality criteria for the information are well understood, organizations can then proceed to make sound decisions about how to model, represent, and process the raw data to meet the necessary standards. Thus, in the example of a sales order, an organization cannot adequately specify its data-quality criteria for the different data components that make up the sales order until it fully comprehends the level of quality required for the sales order as a whole, and that in turn depends on the quality expectations for the knowledge needed to complete a sale to a customer.
The motivation for organizations to understand and improve data and information quality is more pressing than ever. Increasingly, many organizations no longer maintain face-to-face contact with customers, vendors, government regulators, or even employees. Their primary link to these people is through the data contained in the information passed between parties as goods and services are exchanged. And like the children’s game of telephone, whereby a message is sent by one child whispering into the ear of the next child who then passes the message on, the final information that is received may be too garbled to understand or the original context of the message may have been lost. Without the validation and verification of data that come from personal contact between all participants in the information supply chain, poor data and information quality may escape detection, only to cause the organizations problems in making sense of their store of organizational knowledge.
Here are a few examples to illustrate the difficulties caused by poor data and information quality. On July 24, 2002, nine miners from the Quecreek Coal Mine in western Pennsylvania became trapped twenty-four stories underground after they accidentally breached an adjacent abandoned mine that let loose 77 million gallons of frigid water. Miraculously, the miners were rescued four days later, thanks to a major rescue effort involving several hundred volunteers who sank an access hole in a farmer’s field. It is estimated that the accident and its aftermath cost over $2 million in rescue and cleanup costs and instigated four state and federal investigations and a pending lawsuit from several of the trapped miners, who say the coal company should have known its operations were too close to the abandoned mine (Erdley 2003a). The Pennsylvania State Department of Environmental Protection issued a draft report later that year identifying incomplete maps of the abandoned mine as a major factor in the Quecreek accident (Erdley and Prine 2003). Pennsylvania now requires old mine maps (one type of information) to be cross-referenced with coal production records (another type of information) to alert mine planners to possible inaccuracies in the maps (a data-quality problem) that would prevent a mining company from knowing where it could safely mine coal (knowledge) (Erdley 2003b).
NASA launched the Mars Climate Orbiter on December 11, 1998, as part of its $235.9 million Mars ’98 project (NASA 1998). After a ten-month journey to Mars, the spacecraft was to observe seasonal changes on the planet by mapping its surface for an entire Martian year (687 Earth days) (Boeing 1998). Unfortunately, the spacecraft was lost upon arrival at Mars on September 23, 1999. Engineers concluded that the Climate Orbiter entered the planet’s atmosphere too low and probably burned up (Jet Propulsion Laboratory 2003). An internal peer review by NASA’s Jet Propulsion Laboratory concluded that a failure to recognize and correct an error in an information transfer between the spacecraft team in Colorado and the mission navigation team in California led to the maneuvers that placed the spacecraft in an improper orbit. Apparently one team used English units (e.g., inches, feet, and pounds), while the other used metric units for key spacecraft operations (Jet Propulsion Laboratory 1999). This confusion over the interpretation of data contained in the design documents left the engineers without the information needed to correctly calculate the Climate Orbiter’s entry into Mars’s atmosphere.
In December 1999, the Institute of Medicine issued a report estimating that 44,000–98,000 Americans die each year from preventable medical errors such as prescription mistakes, mislabeled blood samples, and illegible handwritten patient data on paper forms (Dash 1999). This makes medical errors the eighth-leading cause of death in the United States, with the total cost of injuries related to medical errors exceeding $17 billion a year (Hamblen 2000). Hospitals are now working to reduce medical errors caused by data-quality problems through better practices and technology such as adopting handheld-based systems to link data on drugs, patients, and lab specimens so medical personnel have better information about patients (Hamblen 2000).
The October 2001 report of the U.S. Postal Service Mailing Industry Task Force recommended reducing undeliverable mail by improving address quality and by providing a “feedback loop” that captures and reports address errors. The Postal Service’s costs associated with undeliverable-as-addressed mail total $1.9 billion each year, and the Task Force estimates that industry costs are two times that amount (USPS 2003). This issue illustrates how data-quality problems in an information product as simple as a mailing label prevent an organization from having the right knowledge about where a customer is located.
Betts (2001a) described a 2001 study by PricewaterhouseCoopers in New York that found that 75 percent of the 599 companies surveyed experienced financial pain from defective data. The report indicated that poor data management cost global businesses more than $1.4 billion per year in billing, accounting, and inventory snafus (Betts 2001b). In addition, one-third of the companies said that “dirty data” forced them to delay or scrap a new system (Betts 2001b), while only 37 percent of the companies surveyed were “very confident” in the quality of their own data, and only 15 percent were “very confident” in the quality of the data of their trading partners (Betts 2001a). The study concluded that “poor data quality is threatening to undermine massive investment being made elsewhere,” such as customer relationship management and supply chain management systems. Thus, without data quality, one cannot have the quality information needed to support systems designed to enhance an organization’s knowledge management.
As a result of experiences such as these, many organizations are currently looking for ways to improve the quality of their data and information. Consequently, a new industry composed of academics, consultants, and companies is now offering products and services to address this problem. Some of the advice and solutions being offered are sound, while others are not. In any case, resolving information- and data-quality problems has proven enormously difficult for many organizations for a variety of reasons.
INFORMATION- AND DATA-QUALITY CHALLENGES
Costs of Poor Data and Information Quality Are Difficult to Quantify
The costs associated with poor data and information quality are often difficult to quantify because they involve both tangible and intangible components. Without accurate cost estimates, organizations may not realize the impact that poor data and information quality is having on their bottom line, and, therefore, improvement is not a priority. Redman (2003) estimates that without an active quality program in place, the cost of poor data and information quality for a typical organization is about 20 percent of revenue. Although the fear of bad publicity keeps many companies silent about this issue, Knight (1992) estimates that poor quality data residing in corporate databases costs U.S. businesses and the government billions each year.
The presence of poor-quality data and information can lead to higher costs in several ways. First, there is the cost of remedying the mistake caused by the poor-quality data or information along with the cost of correcting the data or information problem itself. Rectifying the harm caused by poor data and information quality may involve dealing with cleanup efforts, loss of lives, valuable equipment or production time, rework, lawsuits or penalties, and customer appeasements such as offering rebates or issuing apology letters. Redman (1996, 1–16) also cites other quality-related expenses such as different departments within the same organization maintaining their own stores of redundant information because no one trusts the information in the other’s database; managers forming poorer, less-confident decisions that take longer to make; and organizational difficulties in adopting new technologies such as data warehouse or business-reengineering projects.
Besides causing additional expense, the presence of poor quality data may cause revenue reduction due to dissatisfied customers and partners opting to do business with someone else. In some organizations there are so many duplicate supplier records (e.g., IBM Corp., I.B.M. Corporation, Intl. Bus. Mach.) that companies are unable to negotiate better deals and volume discounts because they lack a complete, accurate view of their business contacts (Betts 2001b). Dealing with the effects of poor quality data and information can frustrate employees, lower job satisfaction, and raise levels of organizational mistrust. In a tight labor ma...