The Data Book

Collection and Management of Research Data

Meredith Zozus

Book Information
The Data Book: Collection and Management of Research Data is the first practical book written for researchers and research team members covering how to collect and manage data for research. The book covers basic types of data and fundamentals of how data grow, move and change over time. Focusing on pre-publication data collection and handling, the text illustrates use of these key concepts to match data collection and management methods to a particular study, in essence, making good decisions about data.

The first section of the book defines data, introduces fundamental types of data that bear on methodology to collect and manage them, and covers data management planning and research reproducibility. The second section covers basic principles of and options for data collection and processing emphasizing error resistance and traceability. The third section focuses on managing the data collection and processing stages of research such that quality is consistent and ultimately capable of supporting conclusions drawn from data. The final section of the book covers principles of data security, sharing, and archival. This book will help graduate students and researchers systematically identify and implement appropriate data collection and handling methods.


Information

Pages: 336
Year: 2017
ISBN: 9781351647731
1
Collecting and Managing Research Data
Introduction
This chapter describes the scope and context of this book—principle-driven methods for the collection and management of research data. The importance of data to the research process is emphasized and illustrated through a series of true stories of data gone awry. These stories are discussed and analyzed to draw attention to common causes of data problems and the impact of those problems, such as the inability to use data, retracted manuscripts, jeopardized credibility, and wrong conclusions. This chapter closes with the presentation of the components of collecting and managing research data that outline the organization of the book. Chapter 1 introduces the reader to the importance of data collection and management to science and provides organizing frameworks.
Topics
• Importance of data to science
• Stories of data gone awry and analysis
• Quality system approach applied to data collection and management
• Determinants of the rigor with which data are managed
• Frameworks for thinking about managing research data
Data and Science
Data form the basic building blocks for all scientific inquiries. Research draws conclusions from analysis of recorded observations. Data management, the process by which observations including measurements are defined, documented, collected, and subsequently processed, is an essential part of almost every research endeavor. Data management and research are inextricably linked. The accuracy and validity of data have a direct effect on the conclusions drawn from them. As such, many research data management practices emanate directly from the requirements of research reproducibility and replicability, for example, those practices necessary to define data, prevent bias, and assure consistency. The methods used to formulate, obtain, handle, transfer, and archive data collected for a study stem directly from the principles of research reproducibility. For example, traceability, a fundamental precept of data management, requires that the raw data can be reconstructed from the file(s) used for the analysis and the study documentation, and vice versa. In other words, the data must speak for themselves. Research data management practices have evolved from, and both reflect and support, these principles. These principles apply to research across all disciplines.
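To make the traceability precept concrete, the following is a minimal sketch, not drawn from the book: each processing step appends a record containing checksums of its input and output files, so any analysis file can be traced back to the raw data and vice versa. The file names and log format are hypothetical assumptions.

```python
import hashlib
import json
import datetime

def sha256(path):
    """Checksum used to fingerprint a data file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def log_step(raw_path, derived_path, description, log_path="audit_log.jsonl"):
    """Append one traceability record linking a derived file to its source."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "step": description,
        "input": {"file": raw_path, "sha256": sha256(raw_path)},
        "output": {"file": derived_path, "sha256": sha256(derived_path)},
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

# Hypothetical usage: after deriving an analysis file from raw data,
# record the link so the analysis file remains traceable to its source.
# log_step("raw_visit_data.csv", "analysis_dataset.csv",
#          "range checks and unit conversion")
```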
In a 1999 report, the Institute of Medicine (IOM) defined quality data as “data strong enough to support conclusions and interpretations equivalent to those derived from error-free data” (Davis 1999). Here, the minimum standard, and the only standard, is that data collected for research purposes must be of sufficient quality to support the conclusions drawn from them. The level of quality meeting this minimum standard will differ from study to study and depends on the planned analysis. Thus, the question of how good is good enough (1) needs to be answered before data collection and (2) is inherently a statistical question, emphasizing the importance of statistical involvement in data management planning; a small simulation sketch illustrating this point follows the list below. The remainder of this book covers the following:
1. How to define, document, collect, and process data such that the resulting data quality is consistent, predictable, and appropriate for the planned analysis.
2. How to document data and data handling to support traceability, reuse, reproducibility, and replicability.
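To show why "how good is good enough" is a statistical question, here is a minimal simulation sketch. It is not from the book, and the effect size, sample size, and error model are all hypothetical assumptions: it simulates a two-group comparison and shows how increasing rates of random recording error erode the power to detect a true effect, so the tolerable error rate depends on the planned analysis.

```python
import random
import statistics

def simulate_power(error_rate, n=100, effect=0.5, trials=1000):
    """Fraction of trials in which the group difference exceeds ~2 standard errors."""
    detected = 0
    for _ in range(trials):
        control = [random.gauss(0, 1) for _ in range(n)]
        treated = [random.gauss(effect, 1) for _ in range(n)]
        # Corrupt a fraction of values to mimic data collection errors.
        for group in (control, treated):
            for i in range(n):
                if random.random() < error_rate:
                    group[i] = random.gauss(0, 3)  # noisy, off-scale entry
        diff = statistics.mean(treated) - statistics.mean(control)
        se = ((statistics.variance(treated) + statistics.variance(control)) / n) ** 0.5
        if diff / se > 1.96:
            detected += 1
    return detected / trials

for rate in (0.0, 0.05, 0.10, 0.20):
    print(f"error rate {rate:.0%}: approximate power {simulate_power(rate):.2f}")
```

Under these assumptions, power degrades as the error rate rises; a study with a larger true effect or sample size could tolerate more error, which is exactly why the threshold must be set relative to the planned analysis.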
Data management has been defined from an information technology perspective as the “function that develops and executes plans, policies, practices and projects that acquire, control, protect, deliver and enhance the value of data and information” (Mosley 2008). In a research context, this translates to data collection, processing, storage, sharing, and archiving. In research, data management covers the handling of data from their origination to final archiving or disposal, and the data life cycle has three phases: (1) the origination phase, during which data are first collected; (2) the active phase, during which data are accumulating and changing; and (3) the inactive phase, during which data are no longer expected to accumulate or change (Figure 1.1). This book focuses on data in the origination and active phases because decisions and activities that occur in these phases most directly impact the fitness of data for a particular use, and after the origination and active phases have passed, opportunities to improve the quality of data are slim. Other resources focus on data preservation and archival activities that occur in the inactive phase (Eynden et al. 2011, Keralis et al. 2013).
FIGURE 1.1
Phases of data management.
The FAIR Data Principles (Wilkinson 2016) state that data from research should be Findable, Accessible, Interoperable and Reusable (FAIR). These principles impact each phase of research data management. While Findable is achieved mainly through actions taken in the inactive phase, the foundations for Accessible, Interoperable and Reusable are created by decisions made during the planning stage, and actions taken during data origination and the active data management phase.
Data Gone Awry
Given the importance of good data to science, it is hard to imagine that lapses occur in research. However, problems clearly occur. The review by Fang, Steen, and Casadevall reported that of the 2047 papers listed as retracted in the online database PubMed, 21.3% were attributable to error, whereas 67.4% of retractions were attributable to misconduct, including fraud or suspected fraud (43.4%), duplicate publication (14.2%), and plagiarism (9.8%) (Fang 2012). These are only the known problems significant enough to prompt retraction. The following real stories include retractions and other instances of data problems in research. Each scenario below describes what happened, why, how the problem was detected, what (if anything) was done to correct it, and the impact on the research. Analyses of situations such as these illustrate common problems and can inform practice.
Case 1: False Alarm
A study published in the journal Analytical Methods, reporting the development and demonstration of a monitoring device for formaldehyde, earned an expression of concern and a subsequent retraction for unreliable data (Zilberstein 2016, Hughes 2016). The cause of the data discrepancy was identified through subsequent work, and it was agreed that the levels of formaldehyde reported in the paper could not have been present (Shea 2016).
By way of analysis, the published statement reports that the levels originally reported in the paper would have caused significant physical discomfort and, in the absence of any such reports, could not have been present. Unfortunately, this common-sense test of the data occurred only after publication. The lesson from this case is that where valid ranges for data are known, the data should be checked against them.
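As a concrete illustration of such a check, the sketch below flags values that fall outside a known valid range before they reach analysis. It is not from the study; the field name and plausibility limits are hypothetical, and real limits should come from instrument specifications or domain documentation.

```python
# Hypothetical field name and limits, for illustration only.
VALID_RANGES = {
    "formaldehyde_ppb": (0.0, 1000.0),
}

def range_check(records):
    """Yield (record_id, field, value) for every out-of-range observation."""
    for record in records:
        for field, (low, high) in VALID_RANGES.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                yield record["id"], field, value

measurements = [
    {"id": 1, "formaldehyde_ppb": 12.4},
    {"id": 2, "formaldehyde_ppb": 15200.0},  # fails the common-sense test
]
for rec_id, field, value in range_check(measurements):
    print(f"record {rec_id}: {field}={value} outside valid range")
```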
Case 2: Case of the Missing Eights
A large epidemiologic study including 1.2 million questionnaires was conducted in the early 1980s. The data, as described by Stellman (1989), were hand entered and verified, after which extensive range and logic checks were run. On the food frequency questionnaire for the study, participants were asked the number of days per week they ate each of 28 different foods. While examining patterns of missing data, the investigators discovered that although more than 6000 participants left no items on the questionnaire blank, and more than 2000 left one item blank, no participants left exactly eight items blank (about 1100 were expected based on the distribution), and likewise none left exactly 18 items blank (about 250 were expected based on the distribution). The observed situation was extremely unlikely. The research team systematically pulled files to identify forms with 8 or 18 missing items to trace how the forms could have been miscounted in this way. When they did, a problem in the computer programming was identified.
By way of analysis, suspicious results were noted when the research team reviewed aggregate descriptive statistics, in this case, the distribution of missing questionnaire items. As the review was done while the project was ongoing, the problem was detected before it caused problems in the analysis. As it was a programming problem rather than something wrong in the underlying data, once the computer program was corrected and tested, the problem was resolved. There are two lessons here: (1) test and validate all computer programming and (2) look at data as early as possible in the process and on an ongoing basis throughout the project.
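A review like the one that caught the missing eights can be as simple as tabulating how many items each respondent left blank. The sketch below is hypothetical and assumes each questionnaire is represented as a list of 28 responses with None marking a blank; a gap in the resulting distribution is exactly the kind of anomaly the study team noticed.

```python
from collections import Counter

def missing_item_distribution(questionnaires):
    """Map 'number of blank items on a form' -> 'number of forms'."""
    return Counter(sum(1 for item in form if item is None) for form in questionnaires)

# Made-up forms of 28 food-frequency items each, for illustration.
forms = [
    [3] * 28,                 # no blanks
    [None] + [2] * 27,        # one blank
    [None] * 8 + [1] * 20,    # eight blanks
]
for blanks, count in sorted(missing_item_distribution(forms).items()):
    print(f"{count} form(s) with {blanks} blank item(s)")
```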
Case 3: Unnoticed Outliers
A 2008 paper (Gethin et al. 2008) on the use of honey to promote wound healing was retracted (no author 2014) after the journal realized that an outlier had skewed the data analysis. One patient had a much larger wound (61 cm²) than the other patients in the study, whose wounds ranged from 0.9 to 22 cm². Removing the patient with the large wound from the analysis changed the conclusions of the study. The lead investigator, in a response to the retraction, stated, “I should have checked graphically and statistically for an outlier of this sort before running the regression, and my failing to do so led the paper to an incorrect conclusion” (Ferguson 2015a). The investigator further stated that the error came to light during reanalysis of the data in response to a query about a different aspect of the analysis (Ferguson 2015a).
In the analysis of this case, the outlier in the data was not detected before analysis and publication; it was detected only later, upon reanalysis. Because the data problem was detected after publication, correction could only be accomplished by retraction and republication. The problem could have been prevented by measures such as incorporating bias prevention into inclusion/exclusion criteria or by designing and running checks for data problems. The main lesson in this case is that data should be screened for outliers and other sources of bias as early in the project as possible and on an ongoing basis throughout the project.
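One simple pre-analysis screen of this sort is sketched below. It is a generic illustration using the common 1.5 × IQR rule, not the method from the retracted paper; only the 0.9, 22, and 61 cm² values come from the text above, and the intermediate wound areas are made up for illustration.

```python
import statistics

def iqr_outliers(values):
    """Return values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# 0.9, 22.0, and 61.0 are quoted in the text; the rest are hypothetical.
wound_areas_cm2 = [0.9, 2.5, 4.1, 6.0, 8.8, 12.3, 15.0, 18.2, 22.0, 61.0]
print("flag for review before regression:", iqr_outliers(wound_areas_cm2))
```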
Case 4: Data Leaks
In 2008, the Department of Health and Human Services publicly apologized after a laptop with social security numbers was stolen from a National Institutes of Health (NIH) employee’s car (Weiss and Nakashima 2008). The laptop contained data from more than 1000 patients enrolled in a clinical trial. The data on the laptop were not encrypted. A few weeks later, a similar apology was made when a surplus file cabinet from a state mental health facility was sold with patient files in it (Bonner 2008). The latter problem was brought to light when the buyer of the cabinet reported his discovery of the files to the state. The seller of the file cabinet reported that the center had moved several times and that everyone connected to the files was no longer there (Bonner 2008).
By way of analysis, in both cases, sensitive yet unprotected data were left accessible to others. Unfortunately, the problems came to light after inappropriate disclosure, after which little can be done to rectify the situation. The lesson here is that sensitive data, such as personal data, proprietary, competitive, or confidential information, should be protected from accidental disclosure.
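One basic safeguard against the stolen-laptop scenario is encrypting sensitive data at rest. The sketch below is a minimal illustration rather than a compliance recipe: it uses the third-party cryptography package, the sample record is a stand-in for a real data file, and real deployments also need key management, access control, and institutional security policy.

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # store the key separately from the data
cipher = Fernet(key)

record = b"participant_id,dob\n001,1962-04-17"    # stand-in for a sensitive file
ciphertext = cipher.encrypt(record)               # safe to store on a laptop
assert Fernet(key).decrypt(ciphertext) == record  # recoverable only with the key
```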
Case 5: Surprisingly Consistent Responses
A colleague had run several focus groups and had recordings of each session. She hired a student research assistant to type the transcripts so that the data could be loaded into qualitative analysis software and coded for subsequent analysis. As the investigator started working with the data, she noticed that the first several transcripts seemed too similar. Upon investigation, she discovered that the research assistant had repeatedly cut and pasted the text from the first transcript instead of typing a transcript of each recording. Fortunately, the problem was found before analysis and publication; however, the situation was costly for the project in terms of rework and elapsed time.
By way of analysis, the data problem was created by misconduct of a research assistant. The root cause, however, was lack of or delayed oversight by the principal investigator. The lesson here is clear: When data are processed by others, a priori training and ongoing oversight are required.
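One inexpensive oversight check that would have caught this earlier is fingerprinting each transcript file: identical checksums mean identical content. A minimal sketch follows, with a hypothetical directory and file layout.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicate_files(directory):
    """Group files by SHA-256 digest and return groups with more than one file."""
    by_digest = defaultdict(list)
    for path in Path(directory).glob("*.txt"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]

# Hypothetical usage against a folder of typed transcripts.
for group in find_duplicate_files("transcripts"):
    print("identical transcripts:", ", ".join(group))
```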
Case 6: What Data?
The 2011 European Journal of Cancer paper titled “Expression of a truncated Hmga1b gene induces gigantism, lipomatosis and B-cell lymphomas in mice” (Fedele 2011) has been retracted (Fedele 2015). The story, as reported by Retraction Watch (Oransky 2015), is that a reader contacted the editors of the journal regarding possible duplications in two figures. When contacted, the authors were unable to provide the editors with the data used to create the figures, and the journal retracted the paper in response. The corresponding author disagreed with the retraction, claiming that the source data files were “lost in the transfer of the laboratory in 2003” (Oransky 2015).
The analysis in this case is tough; the only available fact is that the data supporting the figures could not be provided. The lesson here is that data supporting publications should be archived in a manner that is durable over ...
