Supervised learning for record linkage through weighted means and OWA operators

Torra, V.; Navarro-Arribas, G.; Abril, D.

Content

Full texts:

http://matwbn.icm.edu.pl/ksiazki/cc/cc39/cc3946.pdf [remote]

Abstract

Record linkage is a technique for linking records in one database with records in another database that refer to the same individuals. Although it is normally used in database integration, it is also frequently applied in the context of data privacy. Distance-based record linkage links records by their closeness. In this paper we propose a supervised approach for linking records with numerical attributes. We provide two variants, one based on the weighted mean and the other on the OWA operator. In both cases the parameters are determined by solving an optimization problem. We evaluate our proposal and compare it with standard distance-based record linkage, which does not rely on parameterized distance functions. To that end, we test the proposal in the context of data privacy by linking a data file with its corresponding protected version.
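
As a rough illustration of the idea in the abstract, the sketch below links each original record to its closest protected record, aggregating per-attribute squared differences with either a weighted mean or an OWA operator. It is not the authors' implementation. The function names, the toy data, and the use of squared differences are assumptions made for illustration, and the weights are picked by hand here, whereas the paper obtains them by solving an optimization problem.

```python
from typing import Callable, List, Sequence

Aggregator = Callable[[Sequence[float], Sequence[float]], float]


def weighted_mean(weights: Sequence[float], values: Sequence[float]) -> float:
    """Weighted mean: weight i is tied to attribute i."""
    return sum(w * v for w, v in zip(weights, values))


def owa(weights: Sequence[float], values: Sequence[float]) -> float:
    """OWA operator: weight i is tied to the i-th largest value."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def link(original: List[List[float]], protected: List[List[float]],
         weights: Sequence[float], aggregate: Aggregator) -> List[int]:
    """For each original record, return the index of the nearest protected record."""
    links = []
    for a in original:
        def distance(j: int) -> float:
            # Assumption: per-attribute squared differences are aggregated.
            diffs = [(x - y) ** 2 for x, y in zip(a, protected[j])]
            return aggregate(weights, diffs)
        links.append(min(range(len(protected)), key=distance))
    return links


# Hypothetical toy data: protected[i] is the masked version of original[i].
# The second attribute is heavily perturbed, the first only slightly.
original = [[1.0, 5.0], [2.0, 9.0], [3.0, 1.0]]
protected = [[1.1, 8.0], [2.1, 2.0], [2.9, 6.0]]

uniform = link(original, protected, [0.5, 0.5], weighted_mean)
tuned_wm = link(original, protected, [1.0, 0.0], weighted_mean)
tuned_owa = link(original, protected, [0.0, 1.0], owa)

print(uniform, tuned_wm, tuned_owa)
```

With equal weights the heavily perturbed attribute dominates the distance and every record is linked incorrectly. The hand-picked weighted mean ignores that attribute, and the hand-picked OWA weights keep only the smallest per-attribute difference, so both recover all three links on this toy data. A supervised method would learn such weights from record pairs whose true links are known.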

Keywords

data privacy; disclosure risk; record linkage; supervised learning; weighted mean; OWA operator

Publisher

Systems Research Institute, Polish Academy of Sciences

Journal

Control and Cybernetics

Year

2010

Volume

Vol. 39, no. 4

Pages

1011-1026

Physical description

Bibliography: 31 items.

Contributors

author

Torra V.

author

Navarro-Arribas G.

author

Abril D.

IIIA, Institut d'Investigació en Intel·ligència Artificial - CSIC, Consejo Superior de Investigaciones Científicas, Campus UAB s/n, 08193 Bellaterra, Catalonia, Spain, vtorra@iiia.csic.es


Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BAT5-0060-0013