Behavior of Various Machine Learning Models in the Face of Noisy Data
Michael D. Blechner, M.D.
MIT HST.951 Final project
Although a great deal of attention has been focused on the future potential for molecular-based cancer diagnosis, histologic examination of tissue specimens remains the mainstay of diagnosis. The process of histologic diagnosis entails the identification of visual features from a slide, followed by the recognition of a feature pattern to which the case belongs. The combination of image analysis and machine learning imitates this process and in certain circumstances may be able to aid the pathologist. However, there is a great deal of variability and noise inherent in such an approach. Therefore, a classification model developed from data at one institution is likely to perform acceptably at other institutions only if the model can handle such variability. This paper compares the performance of machine learning models based on fuzzy rules (FR), fuzzy decision trees (FDT), artificial neural networks (aNN) and logistic regression (LR) and examines how these models behave in the face of noisy and variant data. Results suggest that FDT models may be more resistant to data noise.
Although a great deal of attention has been focused on the future potential of molecular-based cancer identification, histologic examination of tissue specimens remains the mainstay of diagnosis. The process of histologic diagnosis entails the identification of visual features from a slide, followed by the recognition of a feature pattern to which the case belongs. The pattern is associated with a high or low probability of cancer. For example, a pathologist examining a breast biopsy may identify breast epithelial cells with large, irregularly shaped nuclei and irregularly clumped chromatin, growing in poorly arranged sheets and showing invasion into the surrounding connective tissue with an associated fibrotic reaction. These findings compose a pattern that is highly correlated with malignancy and would warrant such a diagnosis.
Imaging equipment and image analysis software can partially, and perhaps eventually completely, automate the process of feature extraction [1, 2]. Given a list of previously identified visual features for a large number of cases, machine learning techniques can be used to discern patterns relevant to the separation of malignant from benign cases. The process of discerning such patterns from data results in a model of the domain. Diagnostic predictions can be made by applying such models to the data generated from new cases.
Wolberg et al. demonstrated the correspondence between human histologic diagnosis and the combined techniques of image analysis and machine learning, using the cytologic diagnosis of breast cancer for illustration [3]. Breast cancer is the most common cancer in women and the second leading cause of female cancer deaths. Cancer screening involves mammography followed by tissue sampling and histologic examination of any mammographically worrisome area. Tissue samples are also obtained without mammography in the setting of palpable breast lumps. Initial tissue sampling in either situation is typically by needle core biopsy or fine needle aspiration (FNA). Core biopsy provides more tissue and retains tissue architecture for evaluation, while FNA typically yields a smaller sample and destroys or severely alters the tissue architecture. Although more invasive, core biopsy is the initial tissue procurement technique of choice in most situations. However, FNA is less invasive, can be performed in the physician’s office at a moment’s notice and is less expensive, and therefore is still widely used. In addition, FNA is used more extensively for cancer diagnosis and screening in many other organ systems.
The histologic features used to diagnose breast cancer fall into 2 major categories: architectural and cytologic. Architectural features describe how groups of cells relate to one another and to the surrounding connective tissue. They include characteristics such as the presence or absence of irregular, distorted or excessively cellular glands, too many glandular structures and the presence of single epithelial cells invading into connective tissue. By and large, these features cannot be reliably ascertained in FNA specimens. Cytologic features describe characteristics of single cells and include cell size, nuclear size, nuclear membrane irregularity and nuclear chromatin distribution, to name a few. The FNA diagnosis of breast cancer is largely based on the nuclear cytologic features of increased nuclear size, nuclear membrane irregularity and irregularity of chromatin distribution. These features are relatively easily assessed by FNA.
Wolberg and his colleagues examined 569 breast FNA specimens [3]. Semi-automated image analysis techniques were applied to digital photomicrographs taken from each case. The image analysis process identified the nuclear outline of 10-20 human-selected cells within each image. Given a rough estimate of the location of a cell nucleus, image analysis techniques used variations in pixel values to automatically identify a nuclear contour. For each nucleus, the nuclear outline and pixel values within the nucleus were used to calculate the following 10 values: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. These attributes are all representations of the 3 key attributes mentioned above: nuclear size, nuclear membrane irregularity and irregularity of chromatin distribution. The values from each of the 10-20 selected cells were used to calculate the mean and standard error for each variable within each case. In addition, the three worst or largest values within a case were used to calculate a worst mean value for each attribute. The resulting data set consists of 30 variables for 569 cases. A 31st variable is the class assignment of benign or malignant, based on the pathologist’s final cytologic diagnosis, which was confirmed by subsequent histologic examination of any additional biopsies as well as clinical follow-up. The data set includes 212 cases of cancer and 357 cases of benign breast changes.
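The per-case summary described above can be sketched as follows. This is a minimal illustration and not the original image analysis code: the function name `summarize_case` and the radius values are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei.

```python
import math
import statistics

def summarize_case(values):
    """Return (mean, standard error, worst) for one feature over the nuclei of one case."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumed definition: standard error of the mean = sample SD / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    # "Worst" value: mean of the three largest measurements in the case.
    worst = statistics.fmean(sorted(values)[-3:])
    return mean, se, worst

# Hypothetical nuclear radii for the selected cells of a single case.
radii = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 12.0, 16.0, 11.0, 16.0]
mean_r, se_r, worst_r = summarize_case(radii)
print(mean_r, se_r, worst_r)
```

Applying the same three summaries to each of the 10 nuclear measurements yields the 30 variables per case.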
Wolberg and his colleagues subsequently applied 2 supervised machine learning algorithms to the data and then evaluated the diagnostic performance of these models. The algorithms used were logistic regression and a decision tree algorithm known as Multisurface Method-Tree (MSM-T). In order to avoid over-fitting the training data, a stepwise approach was used to select 3 of the 30 variables, one to represent each of nuclear size, texture and shape. The attributes worst area, worst smoothness and mean texture demonstrated a classification accuracy of 96.2% using logistic regression and 97.5% using MSM-T. Both results represent averages from 10-fold cross validation.
Although their purpose was not to develop an actual diagnostic technique for laboratory use, the general idea of combining image analysis and machine learning can be used for the automation of visual classification tasks in medicine. However, there is a great deal of variability and noise inherent in such an approach. The optical components of imaging equipment would likely vary from one laboratory to another, resulting in variability in image capture that could alter the results of feature extraction. Different image analysis software would likely add further variability. Even if the imaging equipment and software were standardized, differences in tissue processing from one lab to the next would result in significant variability. For example, the use of different varieties and concentrations of tissue fixatives, as well as variations in fixation times, can significantly alter nuclear size and staining of chromatin. In addition, biological variability, even within cancer of a single tissue type like breast epithelium, generates a great deal of variability in the histologic features. Therefore, a prediction model developed from data at one institution is likely to perform acceptably at other institutions only if the model can handle this variability.
Fuzzy logic is an extension of Boolean logic that replaces binary truth values with degrees of truth. It was introduced in 1965 by Prof. Lotfi Zadeh.
This paper compares the performance of machine learning models based on fuzzy rules (FR), fuzzy decision trees (FDT), artificial neural networks (aNN) and logistic regression (LR). The study hypothesizes that fuzzy-logic-based modeling approaches will exhibit significantly more stable classification performance with increasingly noisy test data. All models were built using an identical training set and evaluated on an unaltered holdout test set as well as multiple versions of the same test set distorted with noise to simulate variance from image analysis and biologic variance.
Materials & Methods
Data set: The Wisconsin Diagnostic Breast Cancer (WDBC) dataset was obtained from the UCI Machine Learning Repository [A]. The dataset was created by Wolberg, Street and Mangasarian and consists of data from 569 breast FNA cases containing 30 descriptive attributes and one binary classification variable (benign or malignant). The descriptive attributes were obtained by semi-automated image analysis applied to digital photomicrographs obtained from the FNA slides. The case distribution includes 357 cases of benign breast changes and 212 cases of malignant breast cancer. The descriptive attributes are recorded with four significant digits and include the nuclear radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The mean, standard error and mean of the worst 3 measurements are recorded for each of these ten attributes for a total of 30 variables. There are no missing attribute values.
Data pre-processing: The original dataset was divided into a training set containing the first 380 cases and a test set consisting of the remaining 189 cases. Models were constructed and tested using both the full 30-variable dataset and a limited dataset consisting of only the 3 variables used in the Wolberg models (worst area, worst smoothness and mean texture). Six additional test sets were created by adding increasing amounts of noise to the original test set data. The noise for each variable in each case was generated by selecting at random from a normal distribution with a mean of zero and a standard deviation of 0.001, 0.01, 0.1, 1, 10 and 100 for each of the six increasingly noisy datasets respectively. These six datasets were intended to simulate the kind of noise that might result from variability in image analysis and tissue processing. In the results and discussion, these test datasets will be referred to as “noisy” test datasets. One additional test set was generated by replacing each attribute value with a value selected at random from a normal distribution whose mean and standard deviation equal the mean and standard deviation of that attribute within the case’s own class. This was an attempt to simulate the natural biologic variability of these attributes within human cancers and benign tissues. Since this process essentially randomly redistributes attribute values from a pool within the class benign or malignant, this dataset will be referred to as the “redistributed” dataset.
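The two perturbation schemes can be sketched in a few lines. The study itself was carried out in R; the Python below is only an illustrative sketch, with hypothetical function names (`add_noise`, `redistribute`), a toy four-case test set and an arbitrary random seed. It assumes the class mean and standard deviation are computed from the same cases that are being redistributed, which the text does not state explicitly.

```python
import random
import statistics

NOISE_SDS = [0.001, 0.01, 0.1, 1, 10, 100]  # one noisy test set per SD

def add_noise(rows, sd, rng):
    """Add zero-mean Gaussian noise with the given SD to every attribute value."""
    return [[x + rng.gauss(0.0, sd) for x in row] for row in rows]

def redistribute(rows, labels, rng):
    """Replace each value with a draw from N(class mean, class SD) for its attribute."""
    out = [None] * len(rows)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        cols = list(zip(*(rows[i] for i in idx)))   # attribute columns for this class
        mus = [statistics.fmean(c) for c in cols]
        sds = [statistics.stdev(c) for c in cols]
        for i in idx:
            # Each attribute is drawn independently, so correlations are lost.
            out[i] = [rng.gauss(m, s) for m, s in zip(mus, sds)]
    return out

rng = random.Random(951)  # arbitrary seed, for repeatability only
# Toy rows: (mean texture, worst smoothness, worst area), values illustrative.
test_rows = [[17.3, 0.12, 880.0], [14.1, 0.10, 520.0],
             [21.0, 0.15, 1500.0], [12.9, 0.09, 480.0]]
test_labels = ["M", "B", "M", "B"]
noisy_sets = {sd: add_noise(test_rows, sd, rng) for sd in NOISE_SDS}
redistributed = redistribute(test_rows, test_labels, rng)
```

Note that the same absolute noise SD is applied to every attribute regardless of its scale, so a given noise level perturbs a small-valued attribute such as smoothness far more, in relative terms, than a large-valued one such as area.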
Software: All modeling and analysis was performed within R version 2.2.0 [B]. Additional packages and code used include the NNET library, available as part of Ripley’s VR bundle version 7.2-23 [C], and Vinterbo’s GCL library, version 1.05c [D]. The nrc() function for building artificial neural networks using nnet() and the lrc() function for building logistic regression models using glm() were provided by Vinterbo [E].
Models: An FR model and an FDT model were built using the gcl and tcl functions from the GCL package. 2x10-fold cross validation was used to select an appropriate setting for the “nlev” parameter based on the highest mean c-index with nlev set to 2 through 7. Final models were generated using an nlev of 2 for both the 30-variable and 3-variable FR models. Values of 5 and 6 were used for the 3- and 30-variable FDT models respectively. All other arguments to the gcl and tcl functions used the default values. Single hidden layer aNNs were generated using the nrc function. 5x10-fold cross validation was used to select an appropriate setting for the “nunits” parameter based on the highest mean c-index with nunits set to 1 through 20. The nunits parameter determines the number of units in the hidden layer. Two final aNN models were generated for each training set (3 and 30 variables) using nunits values of 9 and 20. All other arguments to the nrc function used the default values. A single LR model was generated for each training dataset using the lrc function with the default glm parameters.
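The parameter selection procedure (repeated 10-fold cross validation, choosing the setting with the highest mean c-index) can be sketched generically. This is not the code used in the study: `repeated_kfold`, `select_setting` and `toy_score` are hypothetical names, and `toy_score` merely stands in for "train a model with this setting on the training folds and return its c-index on the held-out fold".

```python
import random
import statistics

def repeated_kfold(n_cases, k, repeats, rng):
    """Yield (train_idx, valid_idx) pairs for `repeats` rounds of k-fold CV."""
    for _ in range(repeats):
        order = list(range(n_cases))
        rng.shuffle(order)
        for f in range(k):
            valid = order[f::k]              # every k-th shuffled case forms one fold
            held = set(valid)
            train = [i for i in order if i not in held]
            yield train, valid

def select_setting(settings, score_fn, n_cases, k, repeats, seed=0):
    """Return the setting with the highest mean score, plus all mean scores."""
    means = {}
    for s in settings:
        # Re-seeding gives every setting the same folds, so scores are paired.
        rng = random.Random(seed)
        scores = [score_fn(s, tr, va)
                  for tr, va in repeated_kfold(n_cases, k, repeats, rng)]
        means[s] = statistics.fmean(scores)
    best = max(means, key=means.get)
    return best, means

def toy_score(setting, train_idx, valid_idx):
    """Placeholder for model fitting and c-index evaluation (hypothetical)."""
    return 1.0 - abs(setting - 5) / 10

# 2x10-fold CV over nlev = 2..7 on a 380-case training set.
best_nlev, mean_scores = select_setting(range(2, 8), toy_score, 380, 10, 2)
```

Because each setting is scored on identical folds, the per-fold scores for two settings can be compared with a paired test.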
Parameter settings: The nlev parameter settings for the FR models were compared using 2x10-fold CV on the training set for both the 3 and 30-variable datasets. Values from 2 through 7 were examined. Figure 1 shows the results for the 3-variable data. An nlev of 2 had the highest mean performance (0.979902) and this value was used for the final FR model construction. Figure 2 shows similar results for the 30-variable dataset, with the highest mean performance (0.9752503) for an nlev setting of 2. It is worth noting that although an nlev of 2 results in the highest mean performance for both datasets, the difference in performance compared to other nlev values is not statistically significant for nlev of 5, 6, or 7 for either dataset (by paired t-test, p-value cutoff of p = 0.05).
The comparable results for the FDT models are shown in Figures 3 and 4, with the highest mean performance at an nlev of 5 (0.9779412) and 6 (0.9783514) for the 3- and 30-variable datasets respectively. Again, it is worth noting that although nlev values of 5 and 6 result in the highest mean performance for the 2 datasets, the difference in performance compared to other nlev values is not statistically significant except when compared to an nlev of 2 or 3 for the 3-variable data and an nlev of 3 for the full dataset.
Results of the 5x10-fold cross validation for aNN models are shown in Figures 5 and 6. For reasons that are not clear, aNN models showed wide variance in performance across different folds, ranging from a c-index of near 0.5 up to 1.0 in many models. There is a trend towards decreased variance with an increased number of hidden units, but this does not hold across the board, as witnessed by the poor performance in some folds with 13 and 14 hidden units. This variance in performance was not diminished by increasing the maximum number of weight adjustment iterations from the default 100 up to 1000. An nunits setting of 9 was selected for both the 3- and 30-variable data in order to strike a balance between maximizing average performance, minimizing performance variance and minimizing hidden units to avoid over-fitting. This was an ad hoc decision based largely on the visual data presented in Figures 5 and 6. For comparison, an additional aNN model was generated for both datasets using an nunits setting of 20.
Final Model performance: The performance of each model (2 FR models, 2 FDT models, 4 aNN models and 2 LR models) was evaluated by calculating the c-index from the results of applying the model to the appropriate test set (3 or 30 variables). Performance deterioration was determined by calculating the c-index from the results of applying the model to the noisy datasets as well as the redistributed dataset. The results are shown in Tables 1 and 2 and Figure 7.
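For a binary outcome, the c-index is the proportion of (malignant, benign) case pairs in which the malignant case receives the higher model output, with ties counted as one half; it is equivalent to the area under the ROC curve. A minimal sketch follows, in which the function name `c_index` and the example scores are illustrative only.

```python
def c_index(scores, labels):
    """Fraction of (malignant, benign) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # malignant cases
    neg = [s for s, y in zip(scores, labels) if y == 0]   # benign cases
    if not pos or not neg:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

# Illustrative model outputs for two malignant (1) and two benign (0) cases.
example = c_index([0.9, 0.4, 0.6, 0.3], [1, 1, 0, 0])
print(example)  # 0.75: three of the four malignant-benign pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.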
Due to the fuzzy nature of set membership in fuzzy logic approaches to modeling, this study hypothesized that such machine learning algorithms would be more resistant to noise in the data than other models. The study examined the response to increasing levels of random noise generated from a normal distribution around a mean of zero. This type of noise attempted to simulate noise generated from the imaging process. The response to biologic variability was examined by re-selecting each variable at random from a normal distribution with a mean and SD equivalent to those for that variable within the corresponding class (benign or malignant).
The results of the cross validation studies for parameter selection reveal an interesting trend. The variance in c-index values for the fuzzy algorithms appears to be slightly less on average for the 30-variable data while the aNN and LR models show distinctly lower variance with the 3-variable data. This suggests the possibility that fuzzy models are more stable in the face of excess variables.
In relation to the nlev setting, one might expect that a very low nlev value would not enable sufficient separation of the data while too large a value would result in over-fitting. However, the data do not support this conclusion, since the FR model performed best with an nlev of 2 for both datasets and an nlev of 2 significantly outperformed a value of 3 in all cases except FDTs built from the 3-variable data. A more detailed analysis of the effect of the nlev parameter was beyond the scope of this study. The wide variance in performance for the aNN model across numerous nunits settings remains a mystery. The nunits parameter signifies the number of hidden units. Alterations to the maximum number of weight adjustment iterations as well as the decay rate did not alter these results.
All of the final models performed quite well on the original, unaltered test data, with the lowest c-index obtained by the 30-variable LR model (0.947754). These results are in agreement with Wolberg’s original results and conclusion that the data are linearly separable. It is worth noting that all five 3-variable models outperformed their 30-variable counterparts. This underscores the benefits of variable selection for most situations. With increasing noise, the first model to exhibit a significant performance drop is the 30-variable LR model, which is sensitive to relatively low levels of noise; this argues strongly for variable selection in LR. However, as the noise level continues to increase, the first apparent trend is that the 3-variable models are more sensitive to noise, while the 30-variable models retain respectable performance longer on average. The aNN models based on a larger feature set appear to be significantly more resistant to noise. At a noise SD of 10 units, both FDT models (3 and 30-variable) as well as the 20-unit aNN retain reasonable performance. Surprisingly, even with a noise SD of 100 units, the 3-variable FDT model performs with a c-index approaching 0.9. In this analysis, FDTs appear to be most resistant to noisy data and appear to gain no significant benefit from maintaining a larger variable set.
This data set is linearly separable using the 3 variables of worst area, worst smoothness and mean texture. The mean worst area is 559 for benign cases compared with 1422 for malignant cases. The respective standard deviations are 164 and 598. The other 2 parameters exhibit significant overlap between benign and cancer populations. It appears that with the addition of noise of SD 100, the FDT model is able to retain fairly good classification based on the wide margin of separation that exists for worst area.
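A rough back-of-the-envelope check of this argument is possible from the class means (559 and 1422) and standard deviations (164 and 598) quoted above. The sketch below assumes, purely for illustration, that the attribute is normally distributed within each class and that independent zero-mean noise is added to every value; under those assumptions the probability that a malignant value exceeds a benign one is Φ(Δμ / sqrt(σ_b² + σ_m² + 2σ_noise²)). The function names are hypothetical and the result is a single-variable approximation, not a reproduction of any model in this study.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def overlap_c_index(mu_b, sd_b, mu_m, sd_m, noise_sd=0.0):
    """P(malignant value > benign value) for two independent normal classes."""
    # Independent zero-mean noise adds noise_sd**2 to the variance of each class.
    var = sd_b ** 2 + sd_m ** 2 + 2.0 * noise_sd ** 2
    return normal_cdf((mu_m - mu_b) / math.sqrt(var))

clean = overlap_c_index(559, 164, 1422, 598)
noisy = overlap_c_index(559, 164, 1422, 598, noise_sd=100)
print(round(clean, 3), round(noisy, 3))
```

Under these assumptions, noise of SD 100 lowers the single-variable value by well under 0.01, because the added variance is small relative to the existing within-class spread. This is consistent with the suggestion that a model relying mainly on this attribute would be largely unaffected at that noise level.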
Perhaps the most interesting finding is the marked difference in response to noise between FR and FDT models. The nlev parameter works identically for both models. The FR models both used an nlev of 2 while the FDT models used values of 5 and 6. Follow-up analysis should examine the behavior of FR models with higher nlev settings. However, conceptually, a higher nlev would be expected to over-fit the training data and thus perform more poorly. This idea can be best illustrated by imagining a very high nlev value that results in very narrow fuzzy sets. In this scenario, each fuzzy set for a given variable contains only one member (either a full or partial member) in the training data. As a result, the fuzziness of these sets becomes irrelevant and the creation of rules or decision tree boundaries is based on individual cases. Why the FDT models seem to be so much more resistant to noise than the FR models is not clear and requires more in-depth analysis of the actual algorithms.
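The effect of nlev on the width of the fuzzy sets can be illustrated with a simple triangular partition. The sketch below is not the GCL implementation: it assumes that nlev is the number of fuzzy sets per variable and that the sets are evenly spaced triangles over the variable's range, and the function names are hypothetical.

```python
def triangular_partition(lo, hi, nlev):
    """Return the peaks of nlev evenly spaced triangular fuzzy sets on [lo, hi]."""
    if nlev < 2:
        raise ValueError("nlev must be at least 2")
    step = (hi - lo) / (nlev - 1)
    return [lo + i * step for i in range(nlev)]

def memberships(x, peaks):
    """Degree of membership of x in each set; degrees sum to 1 within the range."""
    width = peaks[1] - peaks[0]
    x = min(max(x, peaks[0]), peaks[-1])   # clamp values outside the range
    return [max(0.0, 1.0 - abs(x - p) / width) for p in peaks]

coarse = triangular_partition(0.0, 100.0, 2)   # wide sets: peaks at 0 and 100
fine = triangular_partition(0.0, 100.0, 5)     # narrower sets: peaks every 25

# A shift from 30 to 33 (for example, from noise) changes memberships gradually.
print(memberships(30.0, coarse), memberships(33.0, coarse))
print(memberships(30.0, fine), memberships(33.0, fine))
```

With only 2 sets, a shift of 3 units changes each membership by 0.03, whereas with 5 sets the same shift changes the two active memberships by 0.12. In this simplified picture, narrower sets make the fuzzy representation more sensitive to a given perturbation, in line with the over-fitting argument above.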
The redistributed data attempt to simulate biologic variability. Arguably, the true level of biologic variability is already represented in the training set. The redistributed data could be considered to represent the extremes of such variability and provide one assessment of performance in a worst-case scenario. It is worth noting that this approach of “redistributing” the data destroys any co-dependencies between variables that may exist in the original data. With all 30 variables retained, the performance of both aNN models as well as the LR model diminished markedly. Both fuzzy-based models, however, retained a c-index of greater than 0.93. All five 3-variable models demonstrated similar performance degradation compared to the unaltered test set, but retained reasonably good performance with c-indices ranging from 0.89 to 0.93. The LR model had the best performance. However, the maintained performance after data redistribution is probably more a feature of the original linearly separable data than of the models.
It is worth noting that in controlled situations where noise in the test data can be kept to a minimum, these arguments do not apply and all of these models perform equally well. The importance of these findings for real-world applications may also be insignificant if the noise SD is below 0.01, assuming any LR model also applies variable selection, as is typically the case. Nonetheless, these results provide some insight into the behavior of these models and can be used as a guide to further analyzing and understanding these algorithms.
No clear and convincing patterns emerge from the results. The hypothesis of superior performance of fuzzy-based models in the face of noisy and redistributed data is not supported. However, the FDT models do appear to be more resistant to noise than the other models. Further study should focus on this algorithm while examining other parameter settings and evaluating other datasets, especially less linearly separable ones. Additional comparison with support vector machine performance might also be useful since the SVM algorithm’s ability to identify the separating plane with the largest margin might also provide significant performance protection from noisy data. The gcl and tcl functions use triangular fuzzy regions. Additional studies might examine the effects of more complex fuzzy set boundaries. The results do provide some insight into the behavior of these models and can be used as a guide for further algorithm analysis.
Data & Software
2. Documentation - http://www.ics.uci.edu/~mlearn/databases/breast-cancer-wisconsin/wdbc.names