1
What's the problem with missing data?
Michael O'Kelly and Bohdana Ratitch
Key points
- Missing data for the purposes of this book are data that were planned to be recorded during a clinical trial but are not available. Non-monotone or intermediate missing data occur when a subject misses a visit but contributes data at later visits. Monotone missing data, where all data for a subject are missing after a certain time point because of early withdrawal from the study, are the more serious problem in interpreting the results of a trial.
- The most important thing about missing data is that it is missing: we can never be sure whether the assumptions made about it are true.
- An example illustrates the potential bias of using only observed data in an analysis (a favorable subset of subjects); and of using a subject's last available observation or baseline observation in place of missing values (bias varies and may be difficult to predict).
- Assuming that data are missing at random (i.e., that given the data and the model, missingness is independent of the unobserved values) allows one to use study data to infer likely values for missing data, but is likely to introduce bias, because it assumes that subjects who withdrew from the study have results like those of similar subjects who remained in the study.
- Given that we can never be sure whether the assumptions made about missingness in the primary analysis are true, sensitivity analyses are needed to stress-test the trial results for robustness to assumptions about missing data: sensitivity analyses will help the reader of the clinical study report to assess the credibility of a trial with missing data.
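The third key point can be made concrete with a minimal sketch. The numbers below are invented for illustration and are not trial data; the field names `observed` and `true_final` and the helper `last_observed` are hypothetical. The "true" final values are known here only because the data set is made up; in a real trial they would be unobserved.

```python
from statistics import mean

# Invented scores (higher = better) at three planned visits.
# None marks a value that was planned but not recorded.
subjects = [
    {"observed": [2, 4, 6], "true_final": 6},         # completer
    {"observed": [1, 3, 5], "true_final": 5},         # completer
    {"observed": [1, None, None], "true_final": 0},   # withdrew after visit 1
    {"observed": [0, -1, None], "true_final": -2},    # withdrew after visit 2
]


def last_observed(values):
    """Return the last non-missing value (last observation carried forward)."""
    seen = [v for v in values if v is not None]
    return seen[-1]


# Mean of the final-visit values we would have seen with no withdrawals.
true_mean = mean(s["true_final"] for s in subjects)

# Observed-data-only analysis: uses completers, a favorable subset here.
completers_mean = mean(
    s["observed"][-1] for s in subjects if s["observed"][-1] is not None
)

# LOCF analysis: each withdrawal contributes its last recorded value.
locf_mean = mean(last_observed(s["observed"]) for s in subjects)

print(true_mean, completers_mean, locf_mean)
```

In this invented data set both the completers-only mean and the LOCF mean are higher than the mean that would have been seen without withdrawals. With other patterns of withdrawal the LOCF estimate could err in either direction, which is why its bias is hard to predict.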
1.1 What do we mean by missing data?
This book is about missing data in clinical trials. In a clinical trial, missing data are data that were planned to be recorded but are not present in the database. No matter how well designed and conducted a trial is, some missing data can almost always be expected. Missingness may be entirely unrelated to the subject's medical condition and study treatment. For example, data could be missing due to a human error in recording data; due to a scheduling conflict that prevented the subject from attending the study visit; or due to a subject's moving to a region outside of the study's remit. On the other hand, data may be missing for reasons that are related to the subject's health and the experimental treatment he/she is undergoing. For example, subjects may decide to discontinue from the study prematurely if their condition worsens or fails to improve, or if they experience adverse reactions or adverse events (AEs). The opposite situation is also possible, although probably less common, where a subject is cured and observations are missing because the subject is not willing to bother with the rest of the study assessments. Apart from missingness due to missed visits, missing data can arise simply due to the nature of the measurement or the nature of the disease. An example of data that would be missing because they are not meaningful is a quality-of-life score for a subject who has died. Those cases where missingness is related to the subject's underlying condition and study treatment have the greatest potential to undermine the credibility of a trial. Sometimes, a subject's data collected prior to discontinuation reflect the reason for withdrawal (e.g., worsening, improvement or toxicity), but subjects can also discontinue without providing the crucial information that would have enabled us to assess the reason for missingness and thus incorporate it in our analysis. Such cases potentially hide some important information about treatment efficacy and/or safety, without which study conclusions may be biased.
When a subject has provided data over the course of the study, but some assessments, either in the middle of the trial or at the primary time point, are missing for any reason, their data can be referred to as partial subject data. In this book, we explore the implications of this partial data and ways to minimize the potential bias.
In many clinical trials, collected data are longitudinal in nature, that is, data about the same clinical parameter are collected on multiple occasions (e.g., during study visits or through subject diaries). In such studies, a primary endpoint (the clinical parameter used to evaluate the primary objective of the trial at a specific time point) is typically required to be measured at the end of the treatment period, or at the end of a period by which the clinical benefit is expected to be attained or maintained. Assessments are performed at that point as well as on several prior occasions, thus capturing the subject's progress after the start of the treatment. This is in contrast with another type of trial, where the primary endpoint is event-driven, for example, based on such events as death or disease progression. In this book, we focus primarily on the former type of trial, and we look at various ways in which partial subject data can be used for analysis.
Most of this book is about ways to handle missing data once it occurs, but it is also important to prevent missing data insofar as this is possible. Chapter 2 discusses this in detail, and describes some ways in which the statistician can contribute to prevention strategies. We now put some of the discussion above somewhat more formally.
1.1.1 Monotone and non-monotone missing data
A subject who completes a clinical trial may have data missing for a measurement because he/she failed to turn up for some visits in the middle of the trial. Such a measurement is said to have "non-monotone missing," "intermediate missing" or "intermittent missing" data, because the status of the measurement for a subject can switch from missing to non-missing and back as the patient progresses through the trial. In many clinical trials, this kind of missingness is more likely to be unrelated to the study condition or treatment. However, in some trials, it may indicate a temporary but important worsening of the subject's health (e.g., pulmonary exacerbations in lung diseases).
In contrast, monotone missingness occurs when data for a measurement are not available for a subject after some given time point; in the case of monotone missingness, once a measurement starts being missing, it will be missing for the subsequent visits in the trial, even though it had been planned to be collected. Subjects who discontinue early from the study are the usual source of monotone missing data. In most trials, the amount of monotone missing data is much greater than the amount of non-monotone missing data. In trials where the primary endpoint is based on a measurement at a specific time point, prior intermittent missing data will have a smaller impact on the primary analysis, compared to monotone missing data. Nevertheless, even in these cases, non-monotone missing data can affect study conclusions. This can happen if the intermediate data are utilized in a statistical model for analysis: the absence of such intermediate data may bias the estimates of the statistical model parameters. In this book, however, we will focus mostly on the problem of monotone missing data, because monotone missing data tend to pose more serious problems than non-monotone missing data when estimating and interpreting trial results. For a more detailed discussion of handling non-monotone missing data, see Section 6.2.1. In this chapter, to introduce some of the concepts and problems in handling missing data, we will look at some common methods of handling monotone missing data in clinical trials, and examine the implications of each method.
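The distinction between the two patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: one subject's measurements are held in a list ordered by planned visit, `None` marks a missing value, and the function name `classify_missingness` is hypothetical rather than taken from any library.

```python
def classify_missingness(values):
    """Classify one subject's visit sequence.

    values: measurements in planned-visit order, with None for a missing value.
    Returns "complete", "monotone" or "non-monotone".
    """
    missing = [v is None for v in values]
    if not any(missing):
        return "complete"
    first_missing = missing.index(True)
    # Monotone: once a value is missing, every later value is missing too.
    if all(missing[first_missing:]):
        return "monotone"
    # Otherwise the status switches back to observed at some later visit.
    # A subject with a missed intermediate visit followed by early withdrawal
    # also lands here, because the overall pattern is not monotone.
    return "non-monotone"


print(classify_missingness([5.1, 4.8, 4.7, 4.2]))    # all visits observed
print(classify_missingness([5.1, 4.8, None, None]))  # early withdrawal
print(classify_missingness([5.1, None, 4.7, 4.2]))   # missed one visit
```

Treating a mixed pattern (an intermittent gap followed by withdrawal) as non-monotone is a design choice made for this sketch; an analysis might instead handle the intermittent gap and the monotone tail separately.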
In Section 4.2.1, we will also briefly discuss situations where a subject discontinues study treatment prematurely, but may stay in the study and provide data at the time points originally planned, despite being off study treatment. These cases need special consideration when including data after treatment discontinuation in the analysis, so that the interpretation of results takes into account possible confounding factors arising after discontinuation (e.g., alternative treatments).
1.1.2 Modeling missingness, modeling the missing value and ignorability
In missing data methodology, we often use two terms: missing value and missingness (or missingness mechanism). It will be helpful to clarify what these terms refer to, as they both play important and distinct roles in the statistical analysis. Missing value refers to a datum that was planned to be collected but is not available. A datum may be missing because, for example, the measurement was not made or was not collected. Missing and non-missing data may also be referred to as unobserved and observed, respectively. Missingness refers to a binary outcome (Yes/No), that of the datum being missing or not missing at a given time point. Missingness mechanism refers to the underlying random process that determines when data may be missing. In other words, missingness mechanism refers to the probability distribution of the binary missingness ...