Random Forests (RF) Classification on a Dataset — Cardiotocography data — with R-script

Phuong Del Rosario
5 min read · Feb 28, 2021

This article applies Random Forests to classify the fetal state (N = normal, S = suspect, P = pathologic) in the Cardiotocography dataset. This is part 3 of my project; please read the article below to understand more about the dataset and how to do an EDA.

1. Random Forests (RF) for classification

For a deeper understanding of RF classification with 3 classes, please check out this article:

Before applying RF to train our data, we need to split the data into TRAIN and TEST sets and balance it. Please read the article below for how to do this:
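
As a rough illustration of that step, here is a minimal sketch. The data frame name ctg, the class column NSP, the 80/20 split ratio, and up-sampling with the caret package are all assumptions for illustration, not the code from the linked article:

```r
# Minimal sketch of the split/balance step (assumed names and ratios)
library(caret)  # for upSample()

set.seed(123)                                        # arbitrary seed
idx   <- sample(nrow(ctg), floor(0.8 * nrow(ctg)))   # assumed 80/20 split
TRAIN <- ctg[idx, ]
TEST  <- ctg[-idx, ]

# Balance the training classes by up-sampling the minority classes;
# NSP (the class label) is assumed to be a factor
newTRAIN <- upSample(x = TRAIN[, setdiff(names(TRAIN), "NSP")],
                     y = TRAIN$NSP, yname = "NSP")
newTEST  <- TEST
```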

In R, the randomForest() function is used to train on the newTRAIN data, then predict() is used to predict the class of every case in the newTEST data. The two parameters that can be tuned in the RF model are the number of decision trees (ntree) and the number of features randomly sampled at each split (mtry).
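
As a concrete sketch of that train-then-predict step (the class column name NSP and the formula interface are assumptions; ntree = 100 is just one of the candidate values, tuned below):

```r
library(randomForest)

set.seed(123)
rf.fit  <- randomForest(NSP ~ ., data = newTRAIN, ntree = 100, mtry = 5)
rf.pred <- predict(rf.fit, newdata = newTEST)

# Test-set confusion matrix and global accuracy rate
conf <- table(Predicted = rf.pred, Actual = newTEST$NSP)
sum(diag(conf)) / sum(conf)
```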

The number of features used as potential candidates for each split (mtry) is usually set to sqrt(p), where p is the total number of features in the dataset. This dataset has 21 features; thus, mtry is set to sqrt(21) ≈ 4.6, which rounds to 5. The number of decision trees for this study is set to 100, 200, 300, and 400. Figure 15 shows the accuracy rates of newTRAIN and newTEST at ntree = 100, 200, 300, 400. Tables 9, 10, 11, and 12 show the test-data confusion matrices at those ntree values.
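
A tuning loop over those four ntree values might look like the sketch below (again assuming an NSP class column; the seed is arbitrary):

```r
library(randomForest)

ntrees <- c(100, 200, 300, 400)
acc    <- data.frame(ntree = ntrees, train = NA, test = NA)

for (i in seq_along(ntrees)) {
  set.seed(123)
  fit <- randomForest(NSP ~ ., data = newTRAIN, ntree = ntrees[i], mtry = 5)
  acc$train[i] <- mean(predict(fit, newTRAIN) == newTRAIN$NSP)
  acc$test[i]  <- mean(predict(fit, newTEST)  == newTEST$NSP)
}
acc  # accuracy rate of newTRAIN and newTEST at each ntree
```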

From the figure below, we see that the highest accuracy rate on the train data is 0.977 at ntree = 300, while the highest accuracy rate on the test data is 0.95 at ntree = 200. Based on the global accuracy rate alone, ntree = 200 is the best choice, since it produces the highest accuracy rate on the test data. However, to select the best ntree, we also need to look closely at the accuracy rate of each class in the confusion matrices at the different ntree values, and at their practical impact.
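
The accuracy curves in that figure can be reproduced from the acc table built in the loop above, for instance with base-R matplot():

```r
# Plot train/test accuracy curves against ntree
matplot(acc$ntree, acc[, c("train", "test")], type = "b", pch = 19,
        xlab = "ntree", ylab = "Accuracy rate",
        main = "Accuracy rate vs. ntree")
legend("bottomright", legend = c("newTRAIN", "newTEST"),
       col = 1:2, pch = 19, lty = 1:2)
```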

The figure “Accuracy rates by class vs. ntrees(K)” below shows the accuracy rate of each class at ntree = 100, 200, 300, 400. The highest accuracy rate for the Normal class is 0.979 at ntree = 200. The Suspect class has an accuracy rate of 0.949 at ntree = 100 and 200, and 0.945 at ntree = 300 and 400. For the Pathologic class, the highest accuracy rate is 0.903 at ntree = 300 and 400, and the lowest is 0.891 at ntree = 100.

In medical practice, we want to choose the ntree value that produces the highest true-positive rate and the lowest false-negative rate. That is, we want to predict Suspect and Pathologic cases as Suspect and Pathologic, respectively, and to minimize the cases predicted as Normal whose true class is Suspect or Pathologic.

Looking at the 4 confusion matrix tables below, the best ntree value is 200. Although the P class does not reach its highest accuracy rate at ntree = 200, it has its lowest false-negative rate there. Furthermore, the S class has both its lowest false-negative rate and its highest accuracy rate at ntree = 200. So the best value of ntree is 200.
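
Those per-class accuracy rates and false-negative rates can be read directly off a confusion matrix; here is a sketch, reusing the assumed NSP column and N/S/P labels:

```r
library(randomForest)

# Test confusion matrix at ntree = 200, then per-class metrics
set.seed(123)
fit200 <- randomForest(NSP ~ ., data = newTRAIN, ntree = 200, mtry = 5)
conf   <- table(Predicted = predict(fit200, newTEST), Actual = newTEST$NSP)

diag(conf) / colSums(conf)    # accuracy rate within each true class
conf["N", c("S", "P")] / colSums(conf)[c("S", "P")]  # S/P predicted as Normal
```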

2. RF Importance at mtry = 5 and best ntrees = 200

With the best ntree = 200 and mtry = 5, the randomForest() function is applied to train on newTRAIN and create an RF classifier denoted RF.200. The important features are then extracted from RF.200 and plotted, as seen in the figure below:
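
A sketch of that step: importance = TRUE asks randomForest() to track the permutation-based (mean decrease accuracy) measure alongside the Gini measure, and varImpPlot() draws the importance plot (the NSP formula is still an assumption):

```r
library(randomForest)

set.seed(123)
RF.200 <- randomForest(NSP ~ ., data = newTRAIN, ntree = 200, mtry = 5,
                       importance = TRUE)

importance(RF.200)   # MeanDecreaseAccuracy and MeanDecreaseGini per feature
varImpPlot(RF.200)   # importance plot, as in the figure below
```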

We see that the ASTV feature has both the highest mean decrease in accuracy and the highest mean decrease in Gini. In other words, permuting ASTV degrades the model's predictions on the training set more than permuting any other variable, and splits on ASTV produce the largest decrease in Gini node impurity, averaged over all trees in the forest. ASTV is therefore the most important feature.

3. Histograms and KS-test for the most important feature (ASTV)

Looking at the histograms of the ASTV feature for each class below, there are clear differences in distribution between the classes. The distribution of the N class looks normal, while the distributions of the S and P classes are left-skewed. ks.test() is used to compare the ASTV histograms of the N class versus the S class, the N class versus the P class, and the S class versus the P class.
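
The three pairwise comparisons might be run as below; the NSP column name and the N/S/P labels are assumed codings, and the tests are run on newTRAIN here purely for illustration:

```r
astv.N <- newTRAIN$ASTV[newTRAIN$NSP == "N"]
astv.S <- newTRAIN$ASTV[newTRAIN$NSP == "S"]
astv.P <- newTRAIN$ASTV[newTRAIN$NSP == "P"]

# Two-sample KS tests; ties in ASTV (an integer-valued percentage)
# will trigger a warning, and the p-values are then approximate
ks.test(astv.N, astv.S)
ks.test(astv.N, astv.P)
ks.test(astv.S, astv.P)
```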

All 3 p-values of the ks.test() comparisons between pairs of class histograms are very small, approximately 0. Therefore, for the ASTV feature, the distributions of the classes are distinct from each other, and the KS discriminating power of ASTV between classes is approximately 1. This means that the ASTV feature may discriminate well between the three classes N, S, and P.
