Package 'bstrl'

Title: Bayesian Streaming Record Linkage
Description: Perform record linkage on streaming files using recursive Bayesian updating.
Authors: Brenda Betancourt [aut], Andee Kaplan [aut], Ian Taylor [aut, cre]
Maintainer: Ian Taylor <[email protected]>
License: MIT + file LICENSE
Version: 1.0.2
Built: 2024-11-05 04:13:48 UTC
Source: https://github.com/ianmtaylor1/bstrl

Help Index


Perform baseline bipartite record linkage before streaming updates

Description

This function establishes a baseline linkage between two files which can be built upon with streaming updates adding more files. It outsources the linkage work to the BRL package and appends information to the object which will allow streaming record linkage to continue

Usage

bipartiteRL(
  df1,
  df2,
  flds = NULL,
  flds1 = NULL,
  flds2 = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5),
  nIter = 1000,
  burn = round(nIter * 0.1),
  a = 1,
  b = 1,
  aBM = 1,
  bBM = 1,
  seed = 0
)

Arguments

df1, df2

Files 1 and 2 as dataframes where each row is a record and each column is a field.

flds

Names of the fields on which to compare the records in each file

flds1, flds2

Allows specifying field names differently for each file.

types

Types of comparisons to use for each field

breaks

Breaks to use for Levenshtein distance on string fields

nIter, burn

MCMC run length parameters. The returned number of samples is nIter - burn.

a, b

Prior parameters for m and u, respectively.

aBM, bBM

Prior parameters for beta-linkage prior.

seed

Random seed to set at beginning of MCMC run

Value

A list with class "bstrlstate" which can be passed to future streaming updates.

Examples

data(geco_small)

# Names of the columns on which to perform linkage
fieldnames <- c("given.name", "surname", "age", "occup",
                "extra1", "extra2", "extra3", "extra4", "extra5", "extra6")

# How to compare each of the fields
# First name and last name use normalized edit distance
# All others binary equal/unequal
types <- c("lv", "lv",
           "bi", "bi", "bi", "bi", "bi", "bi", "bi", "bi")
# Break continuous difference measures into 4 levels using these split points
breaks <- c(0, 0.25, 0.5)

res.twofile <- bipartiteRL(geco_small[[1]], geco_small[[2]],
                           flds = fieldnames, types = types, breaks = breaks,
                           nIter = 10, burn = 5, # Very small number of samples
                           seed = 0)

Create a streaming link object from known record entity id's

Description

Create a streaming link object from known record entity id's

Usage

fromentities(...)

Arguments

...

Vectors containing entity IDs of each record in each file. Each parameter represents the records in a file. The values are an entity ID for the records in that file. Two records having the same entity ID are coreferent.

Value

A streaming link object with S3 class 'streaminglinks', defining the links between records implied by the supplied entity IDs

Examples

data(geco_30over_3err)

file1 <- geco_30over_3err[[1]]
file2 <- geco_30over_3err[[2]]
file3 <- geco_30over_3err[[3]]
file4 <- geco_30over_3err[[4]]

fromentities(file1$entity, file2$entity, file3$entity, file4$entity)

Simulated Noisy Records

Description

A dataset containing several files of noisy simulated records. Records are simulated using GeCo (Tran, Vatsalan, and Cristen (2013)) and organized into files of 200 records each. The columns in each file consist of two ID columns for validating links:

Usage

data(geco_30over_3err)

Format

A list of 7 data.frames. Each data.frame has 200 rows and 16 columns.

Details

  • rec.id. Contains the entity number and duplicate number of each record. This is unique to a record.

  • entity. Contains the entity number of which this record is a copy. Is identical for all records which are noisy duplicates of the same original.

The columns also consist of fields used to perform linkage, into which 3 errors have been randomly inserted:

  • given.name, surname. Text fields with potential typographical errors.

  • age, occup, extra1, ..., extra10. Categorical fields with potential swapped category errors.

Linkage may be performed on either the full dataset or on only a subset of the fields.

See Also

geco_small


Simulated Noisy Records (smaller set)

Description

A dataset containing several files of noisy simulated records. Records are simulated using GeCo (Tran, Vatsalan, and Cristen (2013)) and organized into files of 10 records each. These files are subsets of the larger dataset. The columns in each file consist of two ID columns for validating links:

Usage

data(geco_small)

Format

A list of 7 data.frames. Each data.frame has 10 rows and 16 columns.

Details

  • rec.id. Contains the entity number and duplicate number of each record. This is unique to a record.

  • entity. Contains the entity number of which this record is a copy. Is identical for all records which are noisy duplicates of the same original.

The columns also consist of fields used to perform linkage, into which 3 errors have been randomly inserted:

  • given.name, surname. Text fields with potential typographical errors.

  • age, occup, extra1, ..., extra10. Categorical fields with potential swapped category errors.

Linkage may be performed on either the full dataset or on only a subset of the fields.

See Also

geco_30over_3err


Example linkage result object for small dataset

Description

This object is useful for examples and for testing code designed to process record linkage results. It is identical to the PPRB result obtained from 4 files in the 'streaming-rl-howto' vignette included with this package.

Usage

data(geco_small_result)

Format

An object of class 'bstrlstate', as is returned by the streaming update functions PPRBupdate, SMCMCupdate, multifileRL, or bipartiteRL.

See Also

geco_small, PPRBupdate


Return True/False whether the two record are coreferent

Description

Return True/False whether the two record are coreferent

Usage

islinked(sl, file1, record1, file2, record2)

Arguments

sl

A streaming link object

file1, record1

The file number and record number of the first record

file2, record2

The file number and record number of the second record. Note that file2 must be greater than file1.

Value

A boolean value. True if these two records are linked within sl, False otherwise.

Examples

data(geco_small_result)
samples <- extractlinks(geco_small_result)

# Are record 9 in file 1 and record 7 in file 4 linked in the first posterior sample?
islinked(samples[[1]], file1=1, record1=9, file2=4, record2=7)

# In what proportion of posterior samples are record 9 in file 1 and record 7 in file 4 linked?
mean(sapply(samples, islinked, file1=1, record1=9, file2=4, record2=7))

# In what proportion of posterior samples are record 8 in file 1 and record 1 in file 2 linked?
mean(sapply(samples, islinked, file1=1, record1=8, file2=2, record2=1))

Perform multifile record linkage via Gibbs sampling "from scratch"

Description

Perform multifile record linkage via Gibbs sampling "from scratch"

Usage

multifileRL(
  files,
  flds = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5),
  nIter = 1000,
  burn = round(nIter * 0.1),
  a = 1,
  b = 1,
  aBM = 1,
  bBM = 1,
  proposals = c("component", "LB"),
  blocksize = NULL,
  seed = 0,
  refresh = 0.1,
  maxtime = Inf
)

Arguments

files

A list of files

flds

Vector of names of the fields on which to compare the records in each file

types

Types of comparisons to use for each field

breaks

Breaks to use for Levenshtein distance on string fields

nIter, burn

MCMC run length parameters. The returned number of samples is nIter - burn.

a, b

Prior parameters for m and u, respectively.

aBM, bBM

Prior parameters for beta-linkage prior.

proposals

Which kind of full conditional proposals to use for the link vectors.

blocksize

What blocksize to use for locally balanced proposals. By default, LB proposals are not blocked

seed

Random seed to set at beginning of MCMC run

refresh

How often to output an update including the iteration number and percent complete. If refresh >= 1, taken as a number of iterations between messages (rounded). If 0 < refresh < 1, taken as the proportion of nIter. If refresh == 0, no messages are displayed.

maxtime

Amount of time, in seconds, after which the sampler will terminate with however many samples it has produced up to that point. The sample matrix columns for any unproduced samples will be filled with NAs

Value

An object of class "bstrlstate"

Examples

data(geco_small)

# Names of the columns on which to perform linkage
fieldnames <- c("given.name", "surname", "age", "occup",
                "extra1", "extra2", "extra3", "extra4", "extra5", "extra6")

# How to compare each of the fields
# First name and last name use normalized edit distance
# All others binary equal/unequal
types <- c("lv", "lv",
           "bi", "bi", "bi", "bi", "bi", "bi", "bi", "bi")
# Break continuous difference measures into 4 levels using these split points
breaks <- c(0, 0.25, 0.5)

# Three file linkage using first three files in example dataset
multifile.result <- multifileRL(geco_small[1:3],
                                flds = fieldnames, types = types, breaks = breaks,
                                nIter = 2, burn = 1, # Very small run for example
                                proposals = "comp",
                                seed = 0)

Perform a PPRB update of record linkage with a new file

Description

Perform a PPRB update of record linkage with a new file

Usage

PPRBupdate(
  state,
  newfile,
  flds = NULL,
  nIter = NULL,
  burn = 0,
  blocksize = NULL,
  threestep = TRUE,
  seed = 0,
  refresh = 0.1
)

Arguments

state

Existing record linkage state. Returned by either bipartiteRL, PPRBupdate, SMCMCupdate, or multifileRL.

newfile

A data.frame representing the new file: one row per record

flds

Names of fields in the new file to use for comparison. Only used if no global field names were specified in bipartiteRL initially.

nIter

Number of iterations for which to run the PPRB sampler. By default, this is the same as the number of samples present in 'state'.

burn

Number of initial iterations to discard. The total number of samples returned is nIter - burn.

blocksize

Size of blocks to use for locally balanced proposals. Default performs unblocked locally balanced proposals.

threestep

Whether to perform three Gibbs sampling steps per iteration, with past Z's updated with PPRB, m and u updated with full conditionals, and the current Z updated with locally balanced proposals. If false, a two step Gibbs sampler is used where past Z's, m and u are updated together using PPRB and the current Z is updated with locally balanced proposals

seed

Random seed to set at the beginning of the MCMC run

refresh

How often to output an update including the iteration number and percent complete. If refresh >= 1, taken as a number of iterations between messages (rounded). If 0 < refresh < 1, taken as the proportion of nIter. If refresh == 0, no messages are displayed.

Value

An object of class 'bstrlstate' containing posterior samples and necessary metadata for passing to future streaming updates.

Examples

data(geco_small)
data(geco_small_result)

# Add fifth file to previous four-file link result
file5.result <- PPRBupdate(geco_small_result, geco_small[[5]],
                           nIter=2, burn=1) # Very small run for example

Calculate the precision of estimated links relative to true links

Description

Calculate the precision of estimated links relative to true links

Usage

precision(sl.est, sl.true)

Arguments

sl.est

streaminglinks object representing link estimates

sl.true

streaminglinks object representing true links

Value

The precision of the estimated links.

Examples

data(geco_small)
data(geco_small_result)

sl.true <- fromentities(geco_small[[1]]$entity, geco_small[[2]]$entity,
                        geco_small[[3]]$entity, geco_small[[4]]$entity)

posterior <- extractlinks(geco_small_result)
# Compare one posterior sample to previously computed known truth
class(sl.true)
precision(posterior[[42]], sl.true)

Calculate the recall of estimated links relative to true links

Description

Calculate the recall of estimated links relative to true links

Usage

recall(sl.est, sl.true)

Arguments

sl.est

streaminglinks object representing link estimates

sl.true

streaminglinks object representing true links

Value

The recall of the estimated links.

Examples

data(geco_small)
data(geco_small_result)

sl.true <- fromentities(geco_small[[1]]$entity, geco_small[[2]]$entity,
                        geco_small[[3]]$entity, geco_small[[4]]$entity)

posterior <- extractlinks(geco_small_result)
# Compare one posterior sample to previously computed known truth
class(sl.true)
recall(posterior[[42]], sl.true)

Perform an SMCMC update of record linkage with a new file

Description

Perform an SMCMC update of record linkage with a new file

Usage

SMCMCupdate(
  state,
  newfile,
  flds = NULL,
  nIter.jumping = 5,
  nIter.transition = 10,
  cores = 1,
  proposals.jumping = c("component", "LB"),
  proposals.transition = c("LB", "component"),
  blocksize = NULL,
  seed = 0
)

Arguments

state

Existing record linkage state. Returned by either bipartiteRL, PPRBupdate, SMCMCupdate, or multifileRL. This is used as the ensemble of samples in SMCMC update

newfile

A data.frame representing the new file: one row per record

flds

Names of fields in the new file to use for comparison. Only used if no global field names were specified in bipartiteRL initially.

nIter.jumping, nIter.transition

Number of iterations to use in the jumping kernel and transition kernel, respectively, for each ensemble sample.

cores

Number of cores to use for parallel execution. If cores == 1, update is run sequentially. A cluster is created using parallel::makeCluster().

proposals.jumping, proposals.transition

Which kernel to use for Z updates in the jumping and transition kernels, respectively.

blocksize

Size of blocks to use for locally balanced proposals. Default performs unblocked locally balanced proposals.

seed

Random seed to set at the beginning of the MCMC run. This is ignored if cores > 1.

Value

An object of class 'bstrlstate' containing posterior samples and necessary metadata for passing to future streaming updates.

Examples

data(geco_small)
data(geco_small_result)

# Add fifth file to previous four-file link result
filtered <- thinsamples(geco_small_result, 2) # Filter ensemble to 2 - very small for example
file5.result <- SMCMCupdate(filtered, geco_small[[5]],
                            nIter.jumping=1, nIter.transition=1, # Very small run for example
                            proposals.jumping="LB", proposals.transition="LB",
                            blocksize=5)

Thin a bstrlstate object

Description

Thin a bstrlstate object by keeping only one sample every n, until the desired count remains.

Usage

thinsamples(state, count)

Arguments

state

Object of class bstrlstate, output by bipartiteRL, SMCMCupdate, PPRBupdate, or multifileRL

count

The number of desired samples after filtering

Details

This is useful to do before an SMCMC update. SMCMC produces independent samples, so fewer are required to get the same quality estimates.

Value

An object of class bstrlstate, containing count samples.

Examples

data(geco_small_result)
filtered <- thinsamples(geco_small_result, 50)
stopifnot(ncol(filtered$Z) == 50)