• Stud Health Technol Inform · Jan 2005

    Tools for statistical analysis with missing data: application to a large medical database.

    • Cristian Preda, Alain Duhamel, Monique Picavet, and Tahar Kechadi.
    • Cristian Preda, CERIM, Faculté de médecine, 1 Place de Verdun, F-59045 Lille cedex, France. cpreda@univ-lille2.fr
    • Stud Health Technol Inform. 2005 Jan 1; 116: 181-6.

    Abstract: Missing data is a common feature of large data sets in general and medical data sets in particular. Depending on the goal of statistical analysis, various techniques can be used to tackle this problem. Imputation methods consist of substituting the missing values with plausible or predicted values so that the completed data can then be analysed with any chosen data mining procedure. In this work, we study imputation in the context of multivariate data and we evaluate a number of methods that are available in today's standard statistical software packages. Imputation using multivariate classification, multiple imputation and imputation by factorial analysis are compared using simulated data and a large medical database (from the diabetes field) with numerous missing values. Our main result is a control chart for assessing data quality after the imputation process. To this end, we developed an algorithm whose input is a set of parameters describing the underlying data (e.g., covariance matrix, distribution) and whose output is a chart that plots the change in the prediction error with respect to the proportion of missing values. The chart is built by means of an iterative algorithm involving four steps: (1) a sample of simulated data is drawn using the input parameters; (2) missing values are randomly generated; (3) an imputation method is used to fill in the missing data; and (4) the prediction error is computed. Steps 1 to 4 are repeated in order to estimate the distribution of the prediction error. The control chart was established for the three imputation methods studied here, assuming a multivariate normal distribution of the data. The use of this tool on a large medical database was then investigated. We show how the control chart can be used to assess the quality of the imputation process in the pre-processing step upstream of data mining procedures.
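    The four-step loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mean_impute`, `imputation_error_curve`) are hypothetical, column-mean imputation is a simple stand-in for the three methods actually compared in the paper, values are assumed to go missing completely at random, and the prediction error is assumed to be the root-mean-square error on the masked entries.

```python
import numpy as np


def mean_impute(data):
    """Stand-in imputation (assumption): replace each NaN with its column mean."""
    filled = data.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


def imputation_error_curve(mean, cov, missing_rates, n_samples=200,
                           n_repeats=50, impute=mean_impute, seed=0):
    """Estimate the imputation error for each proportion of missing values.

    Returns a dict mapping each rate to (mean error, 2.5th percentile,
    97.5th percentile) over the repetitions. Rates should lie strictly
    between 0 and 1 so that every column keeps some observed values.
    """
    rng = np.random.default_rng(seed)
    curve = {}
    for rate in missing_rates:
        errors = []
        for _ in range(n_repeats):
            # Step 1: draw a simulated sample from the input parameters
            # (multivariate normal, as assumed in the abstract).
            complete = rng.multivariate_normal(mean, cov, size=n_samples)
            # Step 2: generate missing values at random (MCAR is an assumption).
            mask = rng.random(complete.shape) < rate
            if not mask.any():
                continue  # nothing was masked, so there is no error to measure
            observed = complete.copy()
            observed[mask] = np.nan
            # Step 3: fill in the missing data with the chosen imputation method.
            completed = impute(observed)
            # Step 4: compute the prediction error (RMSE on masked entries,
            # an assumed definition).
            diff = completed[mask] - complete[mask]
            errors.append(float(np.sqrt(np.mean(diff ** 2))))
        # Repeating steps 1 to 4 gives an empirical distribution of the error.
        low, high = np.percentile(errors, [2.5, 97.5])
        curve[rate] = (float(np.mean(errors)), float(low), float(high))
    return curve


if __name__ == "__main__":
    mean = np.zeros(3)
    cov = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])
    for rate, (avg, low, high) in imputation_error_curve(
            mean, cov, [0.1, 0.2, 0.3, 0.4]).items():
        print(f"missing={rate:.1f}  error={avg:.3f}  band=[{low:.3f}, {high:.3f}]")
```

    Plotting the mean error and its percentile band against the proportion of missing values would give a chart of the kind the abstract describes. Any other imputation routine can be passed through the `impute` argument, provided it takes an array containing NaNs and returns a completed array of the same shape.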

