1 Introduction
Katie Harron1, Harvey Goldstein2,3 and Chris Dibben4
1 London School of Hygiene and Tropical Medicine, London, UK
2 Institute of Child Health, University College London, London, UK
3 Graduate School of Education, University of Bristol, Bristol, UK
4 University of Edinburgh, Edinburgh, UK
1.1 Introduction: data linkage as it exists
The increasing availability of large administrative databases for research has led to a dramatic rise in the use of data linkage. The speed and accuracy of linkage have improved considerably over recent decades with developments such as string comparators, coding systems and blocking, yet the methods underpinning most of the linkage performed today were proposed in the 1950s and 1960s. Linkage and analysis of data across sources remain problematic due to the lack of identifiers that are both fully accurate and discriminatory, missing data and regulatory issues, especially those concerning privacy.
In this context, recent developments in data linkage methodology have concentrated on bias in the analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage. Methodological Developments in Data Linkage brings together a collection of chapters on cutting-edge developments in data linkage methodology, contributed by members of the international data linkage community.
The first section of the book covers the current state of data linkage, methodological issues that are relevant to linkage systems and analyses today and case studies from the United Kingdom, Canada and Australia. In this introduction, we provide a brief background to the development of data linkage methods and introduce common terms. We highlight the most important issues that have emerged in recent years and describe how the remainder of the book attempts to deal with these issues. Chapter 2 summarises the advances in linkage accuracy and speed that have arisen from the traditional probabilistic methods proposed by Fellegi and Sunter. The first section concludes with a description of the data linkage environment as it is today, with case study examples. Chapter 3 describes the opportunities and challenges provided by data linkage, focussing on legal and security aspects and models for data access and linkage.
The middle section of the book focusses on the immediate future of data linkage, in terms of methods that have been developed and tested and can be put into practice today. It concentrates on analysis of linked data and the difficulties associated with linkage uncertainty, highlighting the problems caused by errors that occur in linkage (false matches and missed matches) and the impact that these errors can have on the reliability of results based on linked data. This section of the book discusses two methods for handling linkage error, the first relating to regression analyses and the second to an extension of the standard multiple imputation framework. Chapter 7 presents an alternative data storage solution compared to relational databases that provides significant benefits for linkage.
The final section of the book tackles an aspect of the potential future of data linkage. Ethical considerations relating to data linkage and research based on linked data are a subject of continued debate. Privacy-preserving data linkage attempts to avoid the controversial release of personal identifiers by providing means of linking and performing analysis on encrypted data. This section of the book describes the debate and provides examples.
The establishment of large-scale linkage systems has provided new opportunities for important and innovative research that, until now, have not been possible but that also present unique methodological and organisational challenges. New linkage methods are now emerging that take a different approach to the traditional methods that have underpinned much of the research performed using linked data in recent years, leading to new possibilities in terms of speed, accuracy and transparency of research.
1.2 Background and issues
A statistical definition of data linkage is "a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record" (Organisation for Economic Co-operation and Development (OECD)). Data linkage has many different synonyms (record linkage, record matching, re-identification, entity heterogeneity, merge/purge) within various fields of application (computer science, marketing, fraud detection, censuses, bibliographic data, insurance data) (Elmagarmid, Ipeirotis and Verykios, 2007).
The term "record linkage" was first applied to health research in 1946, when Dunn described linkage of vital records from the same individual (birth and death records) and referred to the process as "assembling the book of life" (Dunn, 1946). Dunn emphasised the importance of such linkage to both the individual and health and other organisations. Since then, data linkage has become increasingly important to the research environment.
The development of computerised data linkage meant that valuable information could be combined efficiently and cost-effectively, avoiding the high cost, time and effort associated with setting up new research studies (Newcombe et al., 1959). This led to a large body of research based on enhanced datasets created through linkage. Internationally, large linkage systems of note are the Western Australia Record Linkage System, which links multiple datasets (over 30) for up to 40 years at a population level, and the Manitoba Population-Based Health Information System (Holman et al., 1999; Roos et al., 1995). In the United Kingdom, several large-scale linkage systems have also been developed, including the Scottish Health Informatics Programme (SHIP), the Secure Anonymised Information Linkage (SAIL) Databank and the Clinical Practice Research Datalink (CPRD). As data linkage becomes a more established part of research relating to health and society, there has been an increasing interest in methodological issues associated with creating and analysing linked datasets (Maggi, 2008).
1.3 Data linkage methods
Data linkage brings together information relating to the same individual that is recorded in different files. A set of linked records is created by comparing records, or parts of records, in different files and applying a set of linkage criteria or rules to determine whether or not records belong to the same individual. These rules utilise the values on "linking variables" that are common to each file. The aim of linkage is to determine the true match status of each comparison pair: a match if records belong to the same individual and a non-match if records belong to different individuals.
As the true match status is unknown, linkage criteria are used to assign a link status for each comparison pair: a link if records are classified as belonging to the same individual and a non-link if records are classified as belonging to different individuals.
In a perfect linkage, all matches are classified as links, and all non-matches are classified as non-links. If comparison pairs are misclassified (false matches or missed matches), error is introduced. False matches occur when records from different individuals link erroneously; missed matches occur when records from the same individual fail to link.
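The distinction between true match status and assigned link status can be made concrete with a small sketch. The toy comparison pairs below are invented for illustration; each pair carries its (normally unknown) true match status alongside the link status a linkage procedure assigned.

```python
# Toy illustration (data invented, not from any real linkage): each tuple is
# (true_match, linked) for one comparison pair. Misclassified pairs are the
# two kinds of linkage error described in the text.
pairs = [
    (True, True),    # match correctly classified as a link
    (True, False),   # missed match: same individual, not linked
    (False, True),   # false match: different individuals, linked in error
    (False, False),  # non-match correctly classified as a non-link
]

false_matches = sum(1 for match, link in pairs if not match and link)
missed_matches = sum(1 for match, link in pairs if match and not link)

print(false_matches, missed_matches)  # 1 1
```

In a perfect linkage both counts would be zero; real linkage procedures trade one kind of error against the other.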
1.3.1 Deterministic linkage
In deterministic linkage, a set of predetermined rules is used to classify pairs of records as links and non-links. Typically, deterministic linkage requires exact agreement on a specified set of identifiers or matching variables. For example, two records may be classified as a link if their values of National Insurance number, surname and sex agree exactly. Modifications of strict deterministic linkage include "stepwise" deterministic linkage, which uses a succession of rules; the "n − 1" deterministic procedure, which allows a link to be made if all but one of a set of identifiers agree; and ad hoc deterministic procedures, which combine partial identifiers into a pseudo-identifier (Abrahams and Davy, 2002; Maso, Braga and Franceschi, 2001; Mears et al., 2010). For example, a combination of the first letter of surname, month of birth and postcode area (e.g. H01N19) could form the basis for linkage.
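The three deterministic variants above can be sketched as simple rules. This is an illustrative sketch only: the field names and record layout are assumptions, not taken from any particular linkage system.

```python
# Hedged sketch of deterministic linkage rules. Field names ("ni_number",
# "surname", "sex", "dob", etc.) are illustrative assumptions.

def exact_rule(a, b, fields=("ni_number", "surname", "sex")):
    """Strict deterministic rule: link only if all fields agree exactly."""
    return all(a[f] == b[f] for f in fields)

def n_minus_1_rule(a, b, fields=("ni_number", "surname", "sex", "dob")):
    """'n - 1' rule: link if at most one of the n identifiers disagrees."""
    disagreements = sum(a[f] != b[f] for f in fields)
    return disagreements <= 1

def pseudo_identifier(rec):
    """Ad hoc rule: combine partial identifiers into one pseudo-identifier,
    e.g. first letter of surname + 2-digit birth month + postcode area."""
    return f"{rec['surname'][0]}{rec['birth_month']:02d}{rec['postcode_area']}"
```

For instance, a record with surname "Harron", birth month January and postcode area N19 would yield the pseudo-identifier H01N19, matching the example in the text; two records would then be linked if their pseudo-identifiers agree.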
Strict deterministic methods that require identifiers to match exactly often have a high rate of missed matches, as any recording errors or missing values can prevent identifiers from agreeing. Conversely, the rate of false matches is typically low, as the majority of linked pairs are true matches (records are unlikely to agree exactly on a set of identifiers by chance) (Grannis, Overhage and McDonald, 2002). Deterministic linkage is a relatively straightforward and quick linkage method and is useful when records have highly discriminative or unique identifiers that are well completed and accurate. For example, the community health index (CHI) is used for much of the linkage in the Scottish Record Linkage System.
1.3.2 Probabilistic linkage
Newcombe was the first to propose that comparison pairs could be classified using a probabilistic approach (Newcombe et al., 1959). He suggested that a match weight be assigned to each comparison pair, representing the likelihood that two records are a true match, given the agreement of their identifiers. Each identifier contributes separately to an overall match weight. Identifier agreement contributes positively to the weight, and disagreement contributes a penalty. The size of the contribution depends on the discriminatory power of the identifier, so that agreement on name makes a larger contribution than agreement on sex (Zhu et al., 2009). Fellegi and Sunter formalised Newcombe's proposals into the statistical theory underpinning probabilistic linkage today (Fellegi and Sunter, 1969). Chapter 2 provides details on the match calculation.
In probabilistic linkage, link status is determined by comparing match weights with a threshold or cut-off match weight: pairs scoring above the threshold are classified as links and those below as non-links. In addition, manual review of record pairs is often performed to aid the choice of threshold and to resolve uncertain links (Krewski et al., 2005). If linkage error rates are known, thresholds can be selected to minimise the total number of errors, so that false matches and missed matches balance out. However, error rates are usually unknown. The subjective process of choosing probabilistic thresholds is a limitation of probabilistic linkage, as different linkers may choose different thresholds. This can result in multiple possible versions of the linked data.
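A minimal sketch of the weight-and-threshold idea follows, in the Fellegi–Sunter style of summing log-likelihood ratios. The m- and u-probabilities and the threshold are invented for illustration; in practice they must be estimated from the data, as Chapter 2 discusses.

```python
import math

# Illustrative probabilistic match weights (all numeric values are assumed):
#   m = P(identifier agrees | records are a true match)
#   u = P(identifier agrees | records are a non-match)
identifiers = {
    "surname": (0.95, 0.01),  # discriminative: large agreement contribution
    "sex":     (0.98, 0.50),  # weakly discriminative: small contribution
}

def match_weight(agreements):
    """Sum log2 likelihood ratios over identifiers.

    agreements maps identifier name -> True (agree) / False (disagree).
    Agreement contributes positively; disagreement contributes a penalty.
    """
    weight = 0.0
    for name, (m, u) in identifiers.items():
        if agreements[name]:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight

def classify(weight, threshold=3.0):
    """Assign link status by comparing the weight to a chosen cut-off."""
    return "link" if weight >= threshold else "non-link"
```

With these assumed values, agreement on surname contributes log2(0.95/0.01) ≈ 6.6 while agreement on sex contributes only log2(0.98/0.5) ≈ 1.0, reflecting the difference in discriminatory power noted above; shifting the threshold trades false matches against missed matches.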
There are certain problems with the standard probabilistic procedure. The first is the assumption of independence for the probabilities associated with the individual matching variables. For example, observing an individual in any given ethnic group category may be associated with certain surname structures, and hence, the joint probability of agreeing across matching variables may not simply be the product of the separate probabilities. Ways of dealing with this are suggested in Chapters 2 and 6. A second typical problem is that records with match weights that do not reach the threshold are excluded from data analysis, reducing efficiency and introducing bias if this is associated with the characteristics of the variables to be analysed. Chapter 6 suggests a way of dealing with this using missing data methods. A third problem occurs when the errors in one or more m...