Which of the Four Options for the Resampling Technique Are Good for Continuous Data

  • Original Article

Resampling methods for generating continuous multivariate synthetic data for disclosure control

Abstract

Sharing microdata within or outside an organization may lead to the disclosure of sensitive information about an individual. Data stewarding organizations often disseminate synthetic data to reduce the likelihood of such disclosure. Synthetic data can be generated from posterior predictive distributions; however, finding a distribution in multidimensional space is not straightforward. If a distribution function is correctly estimated, synthetic data generated from the estimated distribution will retain all statistical properties of the original data. In practice, distribution functions are unknown, and estimating a distribution function under some assumptions may result in a synthetic data set that does not retain the statistical properties of the original data. This paper develops synthetic data generating methods based on resampling from singular vectors and eigenvalues, without requiring estimation of a posterior predictive distribution function for the data matrix. The methods developed in this paper have been implemented to generate continuous multivariate synthetic data, and their performance is studied by comparing disclosure risk and information loss measures. A rectangular cuboid is also constructed from the lower quartiles of the information loss and disclosure risk measures, and selecting synthetic data from within this cuboid is found to further reduce the disclosure risk and information loss of these methods.
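The abstract does not spell out the resampling algorithms, so the following is only a rough sketch of the general idea, not the authors' method. It assumes one plausible variant: decompose the centered data matrix by SVD, bootstrap each left singular vector independently, and rebuild the data with the original singular values and right singular vectors. The second function assumes the cuboid is defined by three measures (the abstract does not say which ones) and keeps candidates at or below the lower quartile on all of them. The function names `svd_resample_synthetic` and `select_in_cuboid` are hypothetical.

```python
import numpy as np


def svd_resample_synthetic(X, rng):
    """Illustrative sketch (assumed variant, not the paper's algorithm).

    Resamples the entries of each left singular vector independently,
    so a synthetic record mixes component scores of different originals.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mean = X.mean(axis=0)
    # Thin SVD of the centered data: X - mean = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    k = U.shape[1]
    # Independent bootstrap indices for every column of U
    idx = rng.integers(0, n, size=(n, k))
    U_star = np.take_along_axis(U, idx, axis=0)
    # Rebuild with the original singular values and right singular vectors
    return mean + (U_star * s) @ Vt


def select_in_cuboid(measures):
    """Return indices of candidates inside the lower-quartile cuboid.

    `measures` is an (m, 3) array with one row per candidate synthetic
    data set; the choice of three columns is an assumption.
    """
    measures = np.asarray(measures, dtype=float)
    q1 = np.quantile(measures, 0.25, axis=0)
    inside = np.all(measures <= q1, axis=1)
    return np.flatnonzero(inside)
```

Under these assumptions, one would generate many candidates with different seeds, compute information loss and disclosure risk for each, and release only a candidate returned by `select_in_cuboid`. Because the singular vectors of centered data are mutually orthogonal, resampling them independently should keep the covariance structure approximately intact while breaking the one-to-one link between synthetic and original records.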

Acknowledgements

The authors wish to thank Dr. Christine M. O'Keefe for helpful comments on an early version of this manuscript. We are indebted to the anonymous reviewers of an earlier version of this paper for their insightful comments and for suggesting directions for additional work, which has resulted in this paper. Without the anonymous reviewers' supportive work, this paper would not have been possible.

Author information

Corresponding author

Correspondence to Atikur R Khan.

Ethics declarations

Competing interests

The authors have no competing interests to declare.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khan, A.R., Kabir, E. Resampling methods for generating continuous multivariate synthetic data for disclosure control. J. of Data, Inf. and Manag. 3, 225–235 (2021). https://doi.org/10.1007/s42488-021-00054-2

  • DOI: https://doi.org/10.1007/s42488-021-00054-2

Keywords

  • Confidentiality protection
  • Disclosure risk
  • Information loss
  • Microdata
  • Multivariate synthetic data

Source: https://link.springer.com/article/10.1007/s42488-021-00054-2
