Behavior of Various Machine Learning Models in the Face of Noisy Data
Michael D. Blechner, M.D.
MIT HST.951 Final project
Fall 2005
Abstract
Although a great deal of attention has been focused on the future potential for
molecular-based cancer diagnosis, histologic examination of tissue specimens
remains the mainstay of diagnosis. The process of histologic diagnosis entails
the identification of visual features from a slide, followed by the recognition
of a feature pattern to which the case belongs. The combination of image
analysis and machine learning imitates this process and in certain
circumstances may be able to aid the pathologist. However, such an approach is
subject to a great deal of inherent variability and noise. A classification
model developed from data at one institution is therefore likely to perform
acceptably at other institutions only if it can handle such variability.
This paper compares the performance of machine learning models based on fuzzy
rules (FR), fuzzy decision trees (FDT), artificial neural networks (aNN) and
logistic regression (LR) and examines how these models behave in the face of
noisy and variant data. Results suggest that FDT models may be more resistant
to data noise.
Data & Software
1. Data - http://www.ics.uci.edu/~mlearn/databases/breast-cancer-wisconsin/wdbc.data
2. Documentation - http://www.ics.uci.edu/~mlearn/databases/breast-cancer-wisconsin/wdbc.names
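
As a rough illustration of how the data file above might be read and perturbed for a noise experiment, the following Python sketch parses rows in the wdbc.data layout (case ID, diagnosis M or B, then real-valued features) and adds zero-mean Gaussian noise to each feature. The function names, the relative noise model, and the sample rows are all assumptions made for this example; the sample rows are synthetic placeholders with only three features each (real rows have 30), and the noise scheme is not necessarily the one used in the report.

```python
import csv
import io
import random


def parse_wdbc(text):
    """Parse wdbc.data-style CSV text into (case_id, is_malignant, features) tuples."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        case_id, diagnosis = row[0], row[1]
        features = [float(value) for value in row[2:]]
        records.append((case_id, diagnosis == "M", features))
    return records


def add_gaussian_noise(records, relative_sigma, seed=0):
    """Return a copy of records with zero-mean Gaussian noise added to every feature.

    Assumption: the noise standard deviation is relative_sigma times the
    magnitude of the feature value. This is an illustrative noise model only.
    """
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    noisy = []
    for case_id, is_malignant, features in records:
        perturbed = [x + rng.gauss(0.0, relative_sigma * abs(x)) for x in features]
        noisy.append((case_id, is_malignant, perturbed))
    return noisy


# Synthetic placeholder rows in the same column layout as wdbc.data
# (hypothetical values, three features instead of the real 30).
SAMPLE = "1001,M,17.5,10.2,120.0\n1002,B,12.1,14.3,78.0\n"

records = parse_wdbc(SAMPLE)
noisy = add_gaussian_noise(records, relative_sigma=0.05)
```

A model trained on `records` could then be evaluated on `noisy` at increasing values of `relative_sigma` to see how quickly its performance degrades; the labels and case IDs are left untouched so that only the feature values vary.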