| 1 | +--- |
| 2 | +title: The `automatedRecLin` Package |
| 3 | +output: github_document |
| 4 | +--- |
| 5 | + |
| 6 | +```{r, include = FALSE} |
| 7 | +knitr::opts_chunk$set( |
| 8 | + collapse = TRUE, |
| 9 | + comment = "#>", |
| 10 | + fig.path = "man/figures/README-", |
| 11 | + out.width = "100%" |
| 12 | +) |
| 13 | +``` |
| 14 | + |
| 15 | +## Description |
| 16 | + |
| 17 | +`automatedRecLin` is an R package for record linkage (also known as entity resolution) in unsupervised and supervised settings. It compares pairs of records from two datasets using user-selected comparison functions and estimates the probability or density ratio between matched and non-matched pairs. Based on these estimates, it predicts the set of matches that maximizes entropy. |
| 18 | + |
| 19 | +## Installation |
| 20 | + |
| 21 | +Install the development version from GitHub with the `pak` package. |
| 22 | + |
| 23 | +```{r, eval=FALSE} |
| 24 | +# install.packages("pak") # uncomment if needed |
| 25 | +pak::pkg_install("ncn-foreigners/automatedRecLin") |
| 26 | +``` |
| 27 | + |
| 28 | +## Basic usage |
| 29 | + |
| 30 | +Load the package for the examples. |
| 31 | + |
| 32 | +```{r} |
| 33 | +library(automatedRecLin) |
| 34 | +``` |
| 35 | + |
| 36 | +### Unsupervised maximum entropy classifier for record linkage |
| 37 | + |
| 38 | +Create two small datasets that share some records, several of which contain typos in one of the datasets. |
| 39 | + |
| 40 | +```{r} |
| 41 | +df_1 <- data.frame( |
| 42 | + name = c("Emma", "Liam", "Olivia", "Noah", "Ava", |
| 43 | + "Ethan", "Sophia", "Mason", "Isabella", "James"), |
| 44 | + surname = c("Smith", "Johnson", "Williams", "Brown", "Jones", |
| 45 | + "Garcia", "Miller", "Davis", "Rodriguez", "Wilson"), |
| 46 | + city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix", |
| 47 | + "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose") |
| 48 | +) |
| 49 | + |
| 50 | +df_2 <- data.frame( |
| 51 | + name = c( |
| 52 | + "Emma", "Liam", "Olivia", "Noah", |
| 53 | + "Ava", "Ehtan", "Sopia", "Mson", |
| 54 | + "Charlotte", "Benjamin", "Amelia", "Lucas" |
| 55 | + ), |
| 56 | + surname = c( |
| 57 | + "Smith", "Johnson", "Williams", "Brown", |
| 58 | + "Jnes", "Garca", "Miler", "Dvis", |
| 59 | + "Martinez", "Lee", "Hernandez", "Clark" |
| 60 | + ), |
| 61 | + city = c( |
| 62 | + "New York", "Los Angeles", "Chicago", "Houston", |
| 63 | + "Phonix", "Philadelpia", "San Antnio", "San Dieg", |
| 64 | + "Seattle", "Miami", "Boston", "Denver" |
| 65 | + ) |
| 66 | +) |
| 67 | +df_1 |
| 68 | +df_2 |
| 69 | +``` |
| 70 | + |
| 71 | +Specify the key variables used for record linkage. For each variable, select a comparison function (i.e., a function that compares pairs of records); for example, the `jarowinkler_complement` function from the `automatedRecLin` package (1 - Jaro-Winkler distance). Then choose a method for estimating the probability or density ratio for each variable. The available methods are `"binary"`, `"continuous_parametric"` and `"continuous_nonparametric"`. |
| 72 | + |
| 73 | +```{r} |
| 74 | +variables <- c("name", "surname", "city") |
| 75 | +comparators <- list( |
| 76 | + "name" = jarowinkler_complement(), |
| 77 | + "surname" = jarowinkler_complement(), |
| 78 | + "city" = jarowinkler_complement() |
| 79 | +) |
| 80 | +methods <- list( |
| 81 | + "name" = "continuous_parametric", |
| 82 | + "surname" = "continuous_parametric", |
| 83 | + "city" = "continuous_parametric" |
| 84 | +) |
| 85 | +``` |
| 86 | + |
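To make the idea of a comparison function concrete, here is a minimal base-R sketch that builds every candidate pair between two tiny tables and scores one variable. It is an illustration only: the data are hypothetical, `norm_edit_distance` is a made-up helper, and it uses a normalised Levenshtein distance from `utils::adist` as a stand-in, not the Jaro-Winkler-based comparator shipped with the package.

```r
# Two tiny illustrative tables (hypothetical data, not the README datasets)
a <- data.frame(name = c("Ethan", "Sophia"), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ehtan", "Sopia", "Lucas"), stringsAsFactors = FALSE)

# Every record in `a` is paired with every record in `b`
pairs <- expand.grid(a = seq_len(nrow(a)), b = seq_len(nrow(b)))

# Hypothetical helper: Levenshtein distance scaled to [0, 1],
# where 0 means the two strings are identical
norm_edit_distance <- function(x, y) {
  d <- mapply(function(s, t) utils::adist(s, t)[1, 1], x, y)
  unname(d / pmax(nchar(x), nchar(y)))
}

# One comparison value per candidate pair
pairs$name_dist <- norm_edit_distance(a$name[pairs$a], b$name[pairs$b])
pairs
```

Small values flag likely matches (for instance "Ethan" vs "Ehtan"), while unrelated names score close to 1. The package works with comparison values of this kind per variable, although its own comparators and estimation steps differ from this sketch.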
| 87 | +Perform record linkage using the `mec` function. The output contains the following information: |
| 88 | + |
| 89 | ++ the names of key variables, |
| 90 | ++ the number of predicted matches, |
| 91 | ++ the first 6 predicted matches (with their estimated probability or density ratio), |
| 92 | ++ the method for constructing the predicted set of matches (default: `"size"`), |
| 93 | ++ the estimated false link rate (FLR), |
| 94 | ++ the estimated missing match rate (MMR), |
| 95 | ++ the estimated parameters for variables that use the `"binary"` or `"continuous_parametric"` method. |
| 96 | + |
| 97 | +```{r} |
| 98 | +set.seed(1) |
| 99 | +unsup_result <- mec(A = df_1, B = df_2, |
| 100 | + variables = variables, |
| 101 | + comparators = comparators, |
| 102 | + methods = methods) |
| 103 | +unsup_result |
| 104 | +``` |
| 105 | + |
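The FLR and MMR reported by `mec` are estimates computed without knowing the true match status. When the truth is known, the underlying quantities can be computed directly, as in the sketch below. It assumes that rows 1 to 8 of `df_1` correspond to rows 1 to 8 of `df_2` (which is how the toy data appear to be constructed) and uses a hypothetical `predicted` set, not the actual output of `mec`.

```r
# Assumed ground truth for the toy data: rows 1..8 match one-to-one
true_matches <- data.frame(a = 1:8, b = 1:8)

# Hypothetical predicted set: six correct links plus one false link (9, 9)
predicted <- data.frame(a = c(1:6, 9), b = c(1:6, 9))

# Represent each pair as a single string so pairs can be compared with %in%
pair_key <- function(m) paste(m$a, m$b)
n_correct <- sum(pair_key(predicted) %in% pair_key(true_matches))

# FLR: share of predicted links that are not true matches
flr <- 1 - n_correct / nrow(predicted)
# MMR: share of true matches missing from the predicted set
mmr <- 1 - n_correct / nrow(true_matches)

c(FLR = flr, MMR = mmr)
```

A low FLR with a high MMR indicates a conservative predicted set; the opposite pattern indicates an overly liberal one.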
| 106 | +### Supervised maximum entropy classifier for record linkage |
| 107 | + |
| 108 | +Create two small training datasets that share some records, several of which contain typos in one of the datasets. |
| 109 | + |
| 110 | +```{r} |
| 111 | +df_1_train <- data.frame( |
| 112 | + "name" = c("John", "Emily", "Mark", "Anna", "David"), |
| 113 | + "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown") |
| 114 | +) |
| 115 | +df_2_train <- data.frame( |
| 116 | + "name" = c("John", "Emely", "Marc", "Michael"), |
| 117 | + "surname" = c("Smith", "Jonson", "Tailor", "Henderson") |
| 118 | +) |
| 119 | +df_1_train |
| 120 | +df_2_train |
| 121 | +``` |
| 122 | + |
| 123 | +Specify the key variables, select comparison functions and choose methods for estimating the probability or density ratio. Additionally, provide a `data.frame` indicating known matches. |
| 124 | + |
| 125 | +```{r} |
| 126 | +variables_train <- c("name", "surname") |
| 127 | +comparators_train <- list("name" = jarowinkler_complement(), |
| 128 | + "surname" = jarowinkler_complement()) |
| 129 | +methods_train <- list("name" = "continuous_nonparametric", |
| 130 | + "surname" = "continuous_nonparametric") |
| 131 | +matches_train <- data.frame("a" = 1:3, "b" = 1:3) |
| 132 | +``` |
| 133 | + |
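As a quick sanity check on the known matches, the sketch below places the matched training records side by side. It assumes that the `a` and `b` columns of `matches_train` hold row indices into the first and second training dataset respectively, which is consistent with the example data but is not confirmed here by the package documentation. The `matched_pairs` table is an illustrative name.

```r
# Training data and known matches, repeated so the sketch is self-contained
df_1_train <- data.frame(
  name = c("John", "Emily", "Mark", "Anna", "David"),
  surname = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2_train <- data.frame(
  name = c("John", "Emely", "Marc", "Michael"),
  surname = c("Smith", "Jonson", "Tailor", "Henderson")
)
matches_train <- data.frame(a = 1:3, b = 1:3)

# Assumption: `a` indexes rows of df_1_train and `b` indexes rows of df_2_train
matched_pairs <- data.frame(
  name_a = df_1_train$name[matches_train$a],
  name_b = df_2_train$name[matches_train$b],
  surname_a = df_1_train$surname[matches_train$a],
  surname_b = df_2_train$surname[matches_train$b]
)
matched_pairs
```

Each row should show the same person with at most small spelling differences; a row pairing clearly different people would suggest a mislabelled index.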
| 134 | +Train a record linkage model using the `train_rec_lin` function. |
| 135 | + |
| 136 | +```{r} |
| 137 | +model <- train_rec_lin(A = df_1_train, B = df_2_train, |
| 138 | + matches = matches_train, |
| 139 | + variables = variables_train, |
| 140 | + comparators = comparators_train, |
| 141 | + methods = methods_train) |
| 142 | +model |
| 143 | +``` |
| 144 | + |
| 145 | +Generate two new datasets for record linkage prediction. |
| 146 | + |
| 147 | +```{r} |
| 148 | +df_1_new <- data.frame( |
| 149 | + "name" = c("Jame", "Lia", "Tomas", "Matthew", "Andrew"), |
| 150 | + "surname" = c("Wilsen", "Thomsson", "Davis", "Robinson", "Scott") |
| 151 | +) |
| 152 | +df_2_new <- data.frame( |
| 153 | + "name" = c("James", "Leah", "Thomas", "Mathew", "Andrew", "Sophie"), |
| 154 | + "surname" = c("Wilson", "Thompson", "Davies", "Robins", "Scots", "Clarks") |
| 155 | +) |
| 156 | +df_1_new |
| 157 | +df_2_new |
| 158 | +``` |
| 159 | + |
| 160 | +Predict matches using the `predict` function. The output has a similar structure to that of the `mec` function. |
| 161 | + |
| 162 | +```{r} |
| 163 | +predict(model, df_1_new, df_2_new) |
| 164 | +``` |
| 165 | + |
| 166 | +## Funding |
| 167 | + |
| 168 | +Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations -- quality, data integration and estimation). |
| 169 | + |
| 170 | +## References |
| 171 | + |
| 172 | +Lee, D., Zhang, L.-C. and Kim, J. K. (2022). [Maximum entropy classification for record linkage.](https://www150.statcan.gc.ca/n1/pub/12-001-x/2022001/article/00007-eng.htm) Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1. |
| 173 | + |
| 174 | +Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). [Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system.](https://ideas.repec.org/a/eee/csdana/v179y2023ics0167947322002365.html) Computational Statistics & Data Analysis, 179, 107656. |
| 175 | + |
| 176 | +Sugiyama, M., Suzuki, T., Nakajima, S., et al. (2008). [Direct importance estimation for covariate shift adaptation.](https://doi.org/10.1007/s10463-008-0197-x) Annals of the Institute of Statistical Mathematics, 60, 699–746. |