Predicting Matching Quality of Record Linkage Algorithms on Growing Data Sets


Martin Schuster, Lukas Tittmann, Andreas Wolf




Studies in health technology and informatics




A record linkage algorithm tries to identify records that belong to the same individual. We analyze the matching behavior of an approach used in the E-PIX matching tool on a very limited attribute set: first name, last name, date of birth, and sex. Our benchmark set contains almost 37,000 records from the Popgen biobank. We develop a model that predicts the clerical review workload for data sets growing by a factor of 10 or more, without requiring a data set of that size. Based on this model, we present two parameter sets with comparable detection rates for true duplicates, only one of which scales well on growing data sets. Our model provides realistic example records for each predicted match in an upscaled data set. It thus makes it possible to identify the parameters that need to be adjusted to improve the quality of the matching candidates. We also show that unreviewed merging of records is prone to homonym errors on data sets with 200,000 records and the limited attribute set above, even though the merged record pairs would be recognized as obviously different in clerical review.
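
To give a rough sense of why review workload can grow much faster than the data set itself, the following minimal sketch counts unordered record pairs and extrapolates a count of false match candidates. It is not the model developed in this paper. It assumes, purely for illustration, that candidates between different individuals grow in proportion to the number of record pairs. The function names, the round figure of 37,000 records, and the observed count of 50 are all hypothetical.

```python
def candidate_pairs(n_records: int) -> int:
    """Number of unordered record pairs in a data set of n_records records."""
    if n_records < 0:
        raise ValueError("n_records must be non-negative")
    return n_records * (n_records - 1) // 2


def scaled_false_candidates(observed: float, n_observed: int, n_target: int) -> float:
    """Extrapolate a count of false match candidates to a larger data set.

    Assumption (not taken from the paper): false candidates are
    proportional to the number of record pairs.
    """
    if n_observed < 2:
        raise ValueError("n_observed must contain at least one record pair")
    return observed * candidate_pairs(n_target) / candidate_pairs(n_observed)


if __name__ == "__main__":
    n_benchmark = 37_000        # illustrative round figure for the benchmark size
    n_upscaled = 10 * n_benchmark
    observed = 50               # hypothetical count, not a result from the paper
    growth = candidate_pairs(n_upscaled) / candidate_pairs(n_benchmark)
    print(f"pair count grows by a factor of about {growth:.1f}")
    print(f"extrapolated false candidates: "
          f"{scaled_false_candidates(observed, n_benchmark, n_upscaled):.0f}")
```

Under this simplifying assumption, a tenfold increase in records gives roughly a hundredfold increase in pairs, which is one reason a parameter set that looks acceptable on the benchmark may not scale.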