Which of the Four Options for the Resampling Technique Are Good for Continuous Data

Original Article
Published: 24 July 2021

Resampling methods for generating continuous multivariate synthetic data for disclosure control

Journal of Data, Information and Management, volume 3, pages 225–235 (2021)

Abstract

Sharing microdata within or outside an organization may disclose sensitive information about an individual. Data stewarding organizations often disseminate synthetic data to reduce the likelihood of such disclosure. Synthetic data can be generated from posterior predictive distributions; however, finding a distribution in multidimensional space is not straightforward. If a distribution function is correctly estimated, synthetic data generated from the estimated distribution will hold all statistical properties of the original data. In practice, distribution functions are unknown, and estimating a distribution function under some assumptions may result in a synthetic data set that does not hold the statistical properties of the original data. This paper develops synthetic data generating methods based on resampling from singular vectors and eigenvalues, without requiring estimation of the posterior predictive distribution function for the data matrix. The methods developed in this paper have been implemented to generate continuous multivariate synthetic data, and their performance is studied by comparing disclosure risk and information loss measures. A rectangular cuboid is also constructed from the lower quartiles of the information loss and disclosure risk measures, and selecting synthetic data from this rectangular cuboid is found to further reduce the disclosure risk and information loss of these methods.
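
The abstract does not spell out the resampling schemes or the exact measures, so the following is only a rough sketch of the general idea and should not be read as the authors' method. It assumes one plausible scheme (resampling each left singular vector of the centered data matrix independently) and two hypothetical measures (`information_loss` as a relative covariance distance, `disclosure_risk` as the share of original records with a synthetic record within an arbitrary `radius`). The paper's cuboid is built from its own measures; here a two-measure rectangle stands in for it.

```python
import numpy as np

def synthesize(X, rng):
    """One synthetic copy of X: resample each left singular vector independently."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    n, k = U.shape
    # Draw row indices separately for every singular vector, so a synthetic
    # record mixes scores that came from different original records.
    idx = rng.integers(0, n, size=(n, k))
    U_star = np.take_along_axis(U, idx, axis=0)
    return (U_star * s) @ Vt + mean

def information_loss(X, Z):
    """Hypothetical measure: relative Frobenius distance between covariance matrices."""
    cx, cz = np.cov(X, rowvar=False), np.cov(Z, rowvar=False)
    return np.linalg.norm(cx - cz) / np.linalg.norm(cx)

def disclosure_risk(X, Z, radius):
    """Hypothetical measure: share of original records with a synthetic record within radius."""
    d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
    return float((d.min(axis=1) < radius).mean())

def select_low_quartile(losses, risks):
    """Boolean mask of candidates at or below the lower quartile on both measures."""
    losses, risks = np.asarray(losses), np.asarray(risks)
    return (losses <= np.quantile(losses, 0.25)) & (risks <= np.quantile(risks, 0.25))

# Toy continuous data (an assumption for illustration, not data from the paper).
rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200)

candidates = [synthesize(X, rng) for _ in range(40)]
losses = [information_loss(X, Z) for Z in candidates]
risks = [disclosure_risk(X, Z, radius=0.2) for Z in candidates]
keep = select_low_quartile(losses, risks)
selected = [Z for Z, flag in zip(candidates, keep) if flag]
```

Generating many candidates and keeping only those in the low-loss, low-risk corner mirrors the selection step described in the abstract. The intersection can be empty for a small candidate pool, in which case more candidates would need to be drawn.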

Acknowledgements

The authors wish to thank Dr. Christine M. O'Keefe for helpful comments on an early version of this manuscript. We are indebted to the anonymous reviewers of an earlier version of this paper for their insightful comments and for suggesting directions for additional work, which has resulted in this paper. Without the anonymous reviewers' supportive work, this paper would not have been possible.

Author information

Authors and Affiliations

North South University, Dhaka, Bangladesh
Atikur R Khan

University of Southern Queensland, Toowoomba, QLD, 4350, Australia
Enamul Kabir

Corresponding author

Correspondence to Atikur R Khan.

Ethics declarations

Competing interests

The authors have no competing interests to declare.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khan, A.R., Kabir, E. Resampling methods for generating continuous multivariate synthetic data for disclosure control. J. of Data, Inf. and Manag. 3, 225–235 (2021). https://doi.org/10.1007/s42488-021-00054-2

Received: 30 December 2020
Accepted: 07 June 2021
Published: 24 July 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s42488-021-00054-2

Keywords

Confidentiality protection
Disclosure risk
Information loss
Microdata
Multivariate synthetic data

Source: https://link.springer.com/article/10.1007/s42488-021-00054-2