Data Insights
It was essential as a first step to pulling up some relationship between the columns in the dataset;
the ratio for diabetic to non-diabetic is increasing after the age of 40. It was also increasing as
the
glucose level increase for the patients, which does make sense.
The correlation matrix and the heat map show that no two variables have a high correlation.
The highest correlation is between Age and Pregnancies (0.54), which is not significant enough (>0.7)
to be explored more.
Modeling
After that, we developed a model to predict if a patient has diabetes or not.
For this, the first step would be to split the data into train and test data sets.
Logistic Regression Model
We started with Logistic Regression. Logistic regression, also called a logit model,
is used to model dichotomous outcome variables. In the logit model,
the log odds of the outcome are modeled as a linear combination of the predictor variables.
Logistic regression is conceptually similar to linear regression, where linear regression estimates
the
target variable.
Instead of predicting values, as in the linear regression, logistic regression would estimate the
odds
of a
particular event occurring.
Logistic Regression is a supervised parametric learning model. In other words,
the data should obey certain assumptions before developing the logistic regression model.
Assumptions of the Logistic Regression Model
- Binary logistic regression requires the dependent variable to be binary, and ordinal
logistic
regression requires the dependent variable to be ordinal.
- Our dependent variable (Outcome) is a binary categorical
variable.
- Logistic regression requires observations to be independent of each other. In other
words,
the
observations should not come from repeated measurements or matched data.
- Each row in the data set is independent of that of the other.
- Logistic regression requires there to be little or no multicollinearity among the
independent
variables. This means that the independent variables should not be too highly correlated
with each
other.
- From the correlation matrix, there is not much collinearity between the columns
- Logistic regression assumes the linearity of independent variables and logs odds.
Although
this
analysis does not require the dependent and independent variables to be related
linearly, it
requires that the independent variables are linearly related to the log odds.
- The independent variables are linear with log-odds.
- Logistic regression typically requires a large sample size.
- In our data set, we have 798 observations, which are large enough for
eight
independent variables.
Building the Model
As all the assumptions are satisfied. we developed the model
Understanding the Results
These were the results obtained for a logistic regression model.
The
result can be interpreted as follows:
Intercepts, Estimates, Standard Errors, and P values.
We have various columns and estimate values. The estimated values
indicate the slope for the best fit line, and each column’s value is multiplied by this
slope.
Then we have the standard error of the slope for each variable in the model. We also have
the z
scores associated with each column. The last column is the p-value for these z scores. If
the
p-value falls below 0.05(95% interval, then those columns are highlighted using the asterisk
(*)
symbol). The first row consists of the intercept, which indicates the offset that has to be
added to the model equation.
Null and Residual Deviance The null deviance shows how well the
response variable is predicted by a model that includes only the intercept (grand mean). In
contrast, residual deviance is the deviance with the inclusion of independent variables.
Hence
for calculating the null deviance, the degrees of freedom will be 512-1=511, and we have 503
degrees of freedom for residual deviance (503=512-8-1). We subtract eight, as we have eight
independent variables. Residual is the difference between the actual and predicted value. So
lower the residual score better than the model.
AIC Akaike’s Information Criterion (AIC) is -2log-likelihood+2k,
where
k is the number of estimated parameters. It is useful for comparing models of different
models.
Lower the AIC better is the performance of the model.
Fisher Scoring Iterations This is the number of iterations to fit
the
model. The logistic regression uses an iterative maximum likelihood algorithm to fit the
data.
The Fisher method is the same as fitting a model by iteratively re-weighting the least
squares.
It indicates the optimal number of iterations. Similar to Linear Regression, the model
generates
a linear equation using the given estimates and intercepts. Then the values from this
equation
will be converted into probabilities using the logit function. From the above result, we
find
that Pregnancies, Glucose, Blood Pressure, Diabetes Pedigree Function, MI, and Age are
significant variables as they have the p values close to 0.05 or less than 0.05. The
variable
which does not contribute much for the model would be skin thickness.
Predicting in train and test data
After that, we tried to predict the values in the train and test set using our developed model.
The prediction column in both the data sets will now have probability values associated with
each
row.
Now we need to convert this probability scores to whether a patient has diabetes or not using a
threshold value.
Identifying the correct threshold value:
Before setting the threshold we took a look at the plot below:
In an ideal situation, we would want the peaks of the two curves to be separated as much as
possible. We
also find that setting a threshold of 0.5 for the prediction score will not work out. From the
graph, we
find that the two curves cutoff at 0.3.
The other reason why a threshold of 0.5 will not work out is that, in our sample out of the 768
samples,
only 268 are affected by diabetes. This is not a 50-50 proportion. In other words, the probability
of
encountering a diabetic patient is less than that of a non-diabetic patient.
The (Receiver Operating Characteristics) ROC curve will now help us to fix the threshold.
We should choose the threshold value from this curve such that the ROC curve saturates and shows not
much
increase in True Positive Rate for a decrease in threshold value. So we will choose 0.30 to be our
threshold
value for our model.
Confusion Matrix
For a threshold of 0.3, the model performs well and gives an accuracy of 75.2% on train data. Both
sensitivity and specificity are also high.
The model performs similarly with the test data and has nearly the same accuracy,
sensitivity, and
specificity with that of the train data. This is a good indication that the model is not
trying to
overfit or underfit the data.
Decision Tree
After that, we developed our next model, Decision Trees. Which are non-parametric models
mean that
they don't have any underlying assumptions about the distribution of data and errors.
We were also able to plot the tree by selecting a very limited number of columns for better
visualization. We can see the splits at each level and the number of elements considered at
each level. This model is very easy to develop and understand.
Performance of the Model
The last step is analyzing the accuracy and other parameters of the decision tree model.
From the above result, we observe that the decision tree gives an accuracy of about 90% on train
data and 70% on test data. This is because the decision tree has overfitted itself to the train data
and hence the accuracy drops in test data. Even though the accuracy is less, the decision tree shows
a very high sensitivity to test data.
Future Work
We can improve these models by adding extra features and fine-tuning them. For example, the logistic
regression model can be improved by choosing a better threshold value. This can be done by
evaluating a cost function for various threshold values. The decision tree model could have been
pruned to avoid overfitting. Apart from these models, we could have tried other advanced models such
as Ada Boosting trees, Gradient Boosting in Random forest, Deep Learning, and Neural Network models.
We could also improve the training of all models by using k-fold cross-validation. One more option
to improve the model would be to add more external features that can be related to diabetes in Pima
Indians and we could also take a larger sample to improve the quality of our models.
References
This project was done with my colleague: Naveen Narayanan Meyyappan