• Development of a data generator for multivariate numerical data with arbitrary correlations and distributions

      Vahldiek, Kai; Zhou, Libing; Zhu, Wenfeng; Klawonn, Frank; HZI,Helmholtz-Zentrum für Infektionsforschung GmbH, Inhoffenstr. 7,38124 Braunschweig, Germany. (IOS Press, 2021-01-01)
      Artificial or simulated data are particularly relevant in tests and benchmarks for machine learning methods, in teaching for exercises and for setting up analysis workflows. They are relevant when real data may not be used for reasons of data protection, or when special distributions or effects should be present in the data to test certain machine learning methods. In this paper a generator for multivariate numerical data with arbitrary marginal distributions and – as far as possible – arbitrary correlations is presented. The data generator is implemented in the open source statistics software R. It can also be used for categorical variables, if data are generated separately for the corresponding characteristics of a categorical variable. Additionally, outliers can be integrated. The use of the data generator is demonstrated with a concrete example.