Which of the Four Options for the Resampling Technique Are Good for Continuous Data
- Original Article
Resampling methods for generating continuous multivariate synthetic data for disclosure control
Journal of Data, Information and Management, volume 3, pages 225–235 (2021)
Abstract
Sharing microdata within or outside of an organization may lead to the disclosure of sensitive information about an individual. Data stewarding organizations often disseminate synthetic data to reduce the likelihood of disclosing sensitive information. Synthetic data can be generated from posterior predictive distributions; however, finding a distribution in multidimensional space is not straightforward. If a distribution function is correctly estimated, synthetic data generated from the estimated distribution will retain all statistical properties of the original data. In practice, distribution functions are unknown, and estimating a distribution function under some assumptions may result in a synthetic data set that does not retain the statistical properties of the original data. This paper develops synthetic data generating methods based on resampling from singular vectors and eigenvalues, without requiring estimation of the posterior predictive distribution function for the data matrix. The methods developed in this paper have been implemented to generate continuous multivariate synthetic data, and their performance is studied by comparing disclosure risk and information loss measures. A rectangular cuboid is also constructed from the lower quartiles of the information loss and disclosure risk measures, and selecting synthetic data from this rectangular cuboid is found to further reduce the disclosure risk and information loss of these methods.
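The abstract does not spell out the algorithms, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that "resampling from singular vectors" can be illustrated by bootstrapping each left singular vector of the centered data matrix independently and rebuilding the data from the resampled vectors. The function names, the covariance-based information loss, the nearest-neighbour disclosure risk proxy, and the threshold `eps` are all hypothetical choices made for illustration. Because the sketch uses only two measures, the lower-quartile region is a rectangle rather than the rectangular cuboid mentioned in the abstract.

```python
import numpy as np

def synthesize(X, rng):
    """Illustrative only: rebuild data from independently resampled left singular vectors."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # Draw row indices separately for every column of U, so a synthetic
    # record mixes scores that came from different original records.
    idx = rng.integers(0, U.shape[0], size=U.shape)
    U_star = np.take_along_axis(U, idx, axis=0)
    return (U_star * s) @ Vt + mean

def information_loss(X, S):
    """Hypothetical measure: relative Frobenius distance between covariance matrices."""
    c0 = np.cov(X, rowvar=False)
    c1 = np.cov(S, rowvar=False)
    return float(np.linalg.norm(c0 - c1) / np.linalg.norm(c0))

def disclosure_risk(X, S, eps=0.1):
    """Hypothetical proxy: share of original records with a synthetic record closer than eps."""
    scale = X.std(axis=0)
    d = np.linalg.norm((X[:, None, :] - S[None, :, :]) / scale, axis=2)
    return float((d.min(axis=1) < eps).mean())

def select_lower_quartile_box(measures):
    """Indices of candidates whose every measure is at or below that measure's first quartile."""
    measures = np.asarray(measures, dtype=float)
    q1 = np.quantile(measures, 0.25, axis=0)
    return np.flatnonzero((measures <= q1).all(axis=1))

# Toy data standing in for confidential continuous microdata (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

# Generate several candidate synthetic data sets and score each one.
candidates = [synthesize(X, rng) for _ in range(20)]
scores = [(information_loss(X, S), disclosure_risk(X, S)) for S in candidates]

# Keep only candidates inside the lower-quartile box (the set may be empty).
kept = select_lower_quartile_box(scores)
print("candidates kept:", kept.tolist())
```

Resampling each column of `U` independently is a deliberate choice in this sketch: resampling whole rows would simply reproduce original records, which would defeat the purpose of disclosure control. Since the left singular vectors of centered data are mutually orthogonal, independent resampling keeps the covariance structure approximately intact while breaking the link to individual records.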
Acknowledgements
The authors wish to thank Dr. Christine M. O'Keefe for helpful comments on an early version of this manuscript. We are indebted to the anonymous reviewers of an earlier version of this paper for their insightful comments and for suggesting directions for additional work, which resulted in this paper. Without the anonymous reviewers' supportive work, this paper would not have been possible.
Ethics declarations
Competing interests
The authors have no competing interests to declare.
About this article
Cite this article
Khan, A.R., Kabir, E. Resampling methods for generating continuous multivariate synthetic data for disclosure control. J. of Data, Inf. and Manag. 3, 225–235 (2021). https://doi.org/10.1007/s42488-021-00054-2
DOI: https://doi.org/10.1007/s42488-021-00054-2
Keywords
- Confidentiality protection
- Disclosure risk
- Information loss
- Microdata
- Multivariate synthetic data
Source: https://link.springer.com/article/10.1007/s42488-021-00054-2