eBook - ePub

Methodological Developments in Data Linkage

Name: Methodological Developments in Data Linkage
ISBN: 9781119072485

Katie Harron,

Harvey Goldstein,

Chris Dibben,

English

About this book

A comprehensive compilation of new developments in data linkage methodology

The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those which describe the seminal work from the 1950s and 1960s, with some updates. Linkage and analysis of data across sources remain problematic due to a lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage.

Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting edge methodology in this field. It presents opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.

Key Features:

Presents cutting edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally
Covers the essential issues associated with data linkage today
Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides
Novel approach incorporates technical aspects of linkage, management and analysis of linked data

This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.

Information

Topic: Medicine

Subtopic: Biostatistics

1 Introduction

Katie Harron¹, Harvey Goldstein²,³ and Chris Dibben⁴

¹ London School of Hygiene and Tropical Medicine, London, UK

² Institute of Child Health, University College London, London, UK

³ Graduate School of Education, University of Bristol, Bristol, UK

⁴ University of Edinburgh, Edinburgh, UK

1.1 Introduction: data linkage as it exists

The increasing availability of large administrative databases for research has led to a dramatic rise in the use of data linkage. The speed and accuracy of linkage have much improved over recent decades with developments such as string comparators, coding systems and blocking, yet the methods still underpinning most of the linkage performed today were proposed in the 1950s and 1960s. Linkage and analysis of data across sources remain problematic due to a lack of identifiers that are both fully accurate and discriminatory, missing data and regulatory issues, especially those concerning privacy.

In this context, recent developments in data linkage methodology have concentrated on bias in the analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage. Methodological Developments in Data Linkage brings together a collection of chapters on cutting-edge developments in data linkage methodology, contributed by members of the international data linkage community.

The first section of the book covers the current state of data linkage, methodological issues that are relevant to linkage systems and analyses today and case studies from the United Kingdom, Canada and Australia. In this introduction, we provide a brief background to the development of data linkage methods and introduce common terms. We highlight the most important issues that have emerged in recent years and describe how the remainder of the book attempts to deal with these issues. Chapter 2 summarises the advances in linkage accuracy and speed that have arisen from the traditional probabilistic methods proposed by Fellegi and Sunter. The first section concludes with a description of the data linkage environment as it is today, with case study examples. Chapter 3 describes the opportunities and challenges provided by data linkage, focussing on legal and security aspects and models for data access and linkage.

The middle section of the book focusses on the immediate future of data linkage, in terms of methods that have been developed and tested and can be put into practice today. It concentrates on analysis of linked data and the difficulties associated with linkage uncertainty, highlighting the problems caused by errors that occur in linkage (false matches and missed matches) and the impact that these errors can have on the reliability of results based on linked data. This section of the book discusses two methods for handling linkage error, the first relating to regression analyses and the second to an extension of the standard multiple imputation framework. Chapter 7 presents an alternative data storage solution compared to relational databases that provides significant benefits for linkage.

The final section of the book tackles an aspect of the potential future of data linkage. Ethical considerations relating to data linkage and research based on linked data are a subject of continued debate. Privacy-preserving data linkage attempts to avoid the controversial release of personal identifiers by providing means of linking and performing analysis on encrypted data. This section of the book describes the debate and provides examples.

The establishment of large-scale linkage systems has provided new opportunities for important and innovative research that, until now, has not been possible, but it also presents unique methodological and organisational challenges. New linkage methods are now emerging that take a different approach to the traditional methods that have underpinned much of the research performed using linked data in recent years, leading to new possibilities in terms of speed, accuracy and transparency of research.

1.2 Background and issues

A statistical definition of data linkage is ‘a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record’ (Organisation for Economic Co-operation and Development (OECD)). Data linkage has many different synonyms (record linkage, record matching, re-identification, entity heterogeneity, merge/purge) within various fields of application (computer science, marketing, fraud detection, censuses, bibliographic data, insurance data) (Elmagarmid, Ipeirotis and Verykios, 2007).

The term ‘record linkage’ was first applied to health research in 1946, when Dunn described linkage of vital records from the same individual (birth and death records) and referred to the process as ‘assembling the book of life’ (Dunn, 1946). Dunn emphasised the importance of such linkage to both the individual and health and other organisations. Since then, data linkage has become increasingly important to the research environment.

The development of computerised data linkage meant that valuable information could be combined efficiently and cost-effectively, avoiding the high cost, time and effort associated with setting up new research studies (Newcombe et al., 1959). This led to a large body of research based on enhanced datasets created through linkage. Internationally, large linkage systems of note are the Western Australia Record Linkage System, which links multiple datasets (over 30) for up to 40 years at a population level, and the Manitoba Population-Based Health Information System (Holman et al., 1999; Roos et al., 1995). In the United Kingdom, several large-scale linkage systems have also been developed, including the Scottish Health Informatics Programme (SHIP), the Secure Anonymised Information Linkage (SAIL) Databank and the Clinical Practice Research Datalink (CPRD). As data linkage becomes a more established part of research relating to health and society, there has been an increasing interest in methodological issues associated with creating and analysing linked datasets (Maggi, 2008).

1.3 Data linkage methods

Data linkage brings together information relating to the same individual that is recorded in different files. A set of linked records is created by comparing records, or parts of records, in different files and applying a set of linkage criteria or rules to determine whether or not records belong to the same individual. These rules utilise the values on ‘linking variables’ that are common to each file. The aim of linkage is to determine the true match status of each comparison pair: a match if records belong to the same individual and a non-match if records belong to different individuals.

As the true match status is unknown, linkage criteria are used to assign a link status for each comparison pair: a link if records are classified as belonging to the same individual and a non-link if records are classified as belonging to different individuals.

In a perfect linkage, all matches are classified as links, and all non-matches are classified as non-links. If comparison pairs are misclassified (false matches or missed matches), error is introduced. False matches occur when records from different individuals link erroneously; missed matches occur when records from the same individual fail to link.
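The distinction between match status (the truth) and link status (the classification) can be illustrated with a short Python sketch. It assumes a gold-standard sample in which the true match status happens to be known, which is rarely the case in practice; the function name and the example pairs are hypothetical.

```python
from collections import Counter

def tally_linkage_outcomes(pairs):
    """Count outcomes for comparison pairs given as (is_match, is_link) booleans."""
    labels = {
        (True, True): "true_link",        # match correctly classified as a link
        (False, False): "true_non_link",  # non-match correctly classified as a non-link
        (False, True): "false_match",     # records from different individuals linked
        (True, False): "missed_match",    # records from the same individual not linked
    }
    return Counter(labels[(is_match, is_link)] for is_match, is_link in pairs)

# Hypothetical comparison pairs: (true match status, assigned link status)
pairs = [(True, True), (True, True), (True, False), (False, False), (False, True)]
counts = tally_linkage_outcomes(pairs)
print(counts)
```

In a perfect linkage, the counts for false matches and missed matches would both be zero.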

1.3.1 Deterministic linkage

In deterministic linkage, a set of predetermined rules is used to classify pairs of records as links and non-links. Typically, deterministic linkage requires exact agreement on a specified set of identifiers or matching variables. For example, two records may be classified as a link if their values of National Insurance number, surname and sex agree exactly. Modifications of strict deterministic linkage include ‘stepwise’ deterministic linkage, which uses a succession of rules; the ‘n−1’ deterministic procedure, which allows a link to be made if all but one of a set of identifiers agree; and ad hoc deterministic procedures, which allow partial identifiers to be combined into a pseudo-identifier (Abrahams and Davy, 2002; Maso, Braga and Franceschi, 2001; Mears et al., 2010). For example, a combination of the first letter of surname, month of birth and postcode area (e.g. H01N19) could form the basis for linkage.
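The rules described above might be sketched in Python as follows. This is a minimal illustration, not a production implementation: the field names, the function names and the two records are invented for the example, and missing values are not handled.

```python
def exact_link(a, b, identifiers):
    """Strict deterministic rule: link only if every identifier agrees exactly."""
    return all(a[k] == b[k] for k in identifiers)

def n_minus_1_link(a, b, identifiers):
    """'n-1' rule: link if at most one of the identifiers disagrees."""
    disagreements = sum(a[k] != b[k] for k in identifiers)
    return disagreements <= 1

def pseudo_identifier(record):
    """Ad hoc rule: surname initial + two-digit month of birth + postcode area."""
    initial = record["surname"][0].upper()
    return f"{initial}{record['birth_month']:02d}{record['postcode_area']}"

# Two made-up records for the same person; the surname is misspelt in the second.
rec_a = {"ni_number": "QQ123456C", "surname": "Harris", "sex": "F",
         "birth_month": 1, "postcode_area": "N19"}
rec_b = {"ni_number": "QQ123456C", "surname": "Haris", "sex": "F",
         "birth_month": 1, "postcode_area": "N19"}
ids = ["ni_number", "surname", "sex"]

strict = exact_link(rec_a, rec_b, ids)       # the spelling error blocks the link
relaxed = n_minus_1_link(rec_a, rec_b, ids)  # one disagreement is tolerated
key_a = pseudo_identifier(rec_a)
key_b = pseudo_identifier(rec_b)
print(strict, relaxed, key_a, key_b)
```

A stepwise procedure would apply such rules in succession, typically starting with the strictest and passing unlinked records on to the next rule.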

Strict deterministic methods that require identifiers to match exactly often have a high rate of missed matches, as any recording errors or missing values can prevent identifiers from agreeing. Conversely, the rate of false matches is typically low, as the majority of linked pairs are true matches (records are unlikely to agree exactly on a set of identifiers by chance) (Grannis, Overhage and McDonald, 2002). Deterministic linkage is a relatively straightforward and quick linkage method and is useful when records have highly discriminative or unique identifiers that are well completed and accurate. For example, the community health index (CHI) is used for much of the linkage in the Scottish Record Linkage System.

1.3.2 Probabilistic linkage

Newcombe was the first to propose that comparison pairs could be classified using a probabilistic approach (Newcombe et al., 1959). He suggested that a match weight be assigned to each comparison pair, representing the likelihood that two records are a true match, given the agreement of their identifiers. Each identifier contributes separately to an overall match weight. Identifier agreement contributes positively to the weight, and disagreement contributes a penalty. The size of the contribution depends on the discriminatory power of the identifier, so that agreement on name makes a larger contribution than agreement on sex (Zhu et al., 2009). Fellegi and Sunter formalised Newcombe’s proposals into the statistical theory underpinning probabilistic linkage today (Fellegi and Sunter, 1969). Chapter 2 provides details on the match calculation.
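As a rough illustration of how identifiers contribute to a match weight, the sketch below uses the commonly cited Fellegi and Sunter formulation, in which each identifier has an m-probability (agreement given a true match) and a u-probability (agreement given a non-match). The formulation is not spelled out in this chapter, so treat it as an assumption here; the m and u values are invented for the example, and summing the weights assumes that identifiers agree independently of one another.

```python
import math

def field_weight(agrees, m, u):
    """Weight for one identifier: log2(m/u) on agreement, log2((1-m)/(1-u)) otherwise."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_weight(agreement, m_u):
    """Sum the per-identifier weights (assumes independence between identifiers)."""
    return sum(field_weight(agreement[f], *m_u[f]) for f in agreement)

# Illustrative (m, u) values only: surname is far more discriminating than sex.
m_u = {"surname": (0.95, 0.01), "sex": (0.98, 0.5)}

both_agree = match_weight({"surname": True, "sex": True}, m_u)
sex_differs = match_weight({"surname": True, "sex": False}, m_u)
print(both_agree, sex_differs)
```

With these values, agreement on surname adds much more to the weight than agreement on sex, because surnames rarely agree by chance whereas sex agrees by chance about half the time.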

In probabilistic linkage, link status is determined by comparing match weights to a threshold or cut-off match weight in order to classify each comparison pair as a link or non-link. In addition, manual review of record pairs is often performed to aid choice of threshold and to deal with uncertain links (Krewski et al., 2005). If linkage error rates are known, thresholds can be selected to minimise the total number of errors, so that the number of false matches and missed matches cancels out. However, error rates are usually unknown. The subjective process of choosing probabilistic thresholds is a limitation of probabilistic linkage, as different linkers may choose different thresholds. This can result in multiple possible versions of the linked data.
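The effect of threshold choice can be sketched in Python. The example assumes a two-threshold scheme in which pairs falling between a lower and an upper cut-off are sent for manual review; setting the two cut-offs equal gives a single threshold. The weights and cut-off values are invented for illustration.

```python
def classify(weight, lower, upper):
    """Assign link status by comparing a match weight with two cut-offs."""
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "review"  # uncertain pair, set aside for manual review

# Hypothetical match weights for four comparison pairs.
weights = [12.4, 7.1, 3.0, -2.5]

# Two linkers applying different cut-offs to the same pairs.
linker_a = [classify(w, lower=5.0, upper=5.0) for w in weights]   # single threshold
linker_b = [classify(w, lower=2.0, upper=10.0) for w in weights]  # with review band
print(linker_a)
print(linker_b)
```

The two linkers end up with different classifications of the same pairs, which is how subjective threshold choice can lead to multiple versions of the linked data.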

There are certain problems with the standard probabilistic procedure. The first is the assumption of independence for the probabilities associated with the individual matching variables. For example, observing an individual in any given ethnic group category may be associated with certain surname structures, and hence, the joint probability of agreeing across matching variables may not simply be the product of the separate probabilities. Ways of dealing with this are suggested in Chapters 2 and 6. A second typical problem is that records with match weights that do not reach the threshold are excluded from data analysis, reducing efficiency and introducing bias if this is associated with the characteristics of the variables to be analysed. Chapter 6 suggests a way of dealing with this using missing data methods. A third problem occurs when the errors in one or more m...

Cover
Title Page
Table of Contents
Foreword
Contributors
1 Introduction
2 Probabilistic linkage
3 The data linkage environment
4 Bias in data linkage studies
5 Secondary analysis of linked data
6 Record linkage
7 Using graph databases to manage linked data
8 Large-scale linkage for total populations in official statistics
9 Privacy-preserving record linkage
10 Summary
References
Index
Advert page
End User License Agreement
