Q1. This assignment requires an understanding of the concepts explained in the data mining, predictive
analytics, and machine learning sections.
(a) For this exercise, your goal is to build a model to identify inputs or predictors that differentiate risky
customers from others (based on patterns pertaining to previous customers) and then use those inputs
to predict new risky customers. This sample case is typical for this domain. The sample data to be used
in this exercise is CreditRisk.xlsx.
The data set has 425 cases and 15 variables pertaining to past and current customers who have borrowed
from a bank for various reasons. The data set contains customer-related information such as financial
standing, reason for the loan, employment, demographic information, and the outcome or dependent
variable for credit standing, classifying each case as good or bad, based on the institution’s past
Take 400 of the cases as training cases and set aside the other 25 for testing. Build a decision tree model
to learn the characteristics of the problem. Test its performance on the other 25 cases. Report on your
model’s learning and testing performance. Prepare a report that identifies the decision tree model and
training parameters, as well as the resulting performance on the test set.
You can use either R (with related packages, e.g., the rattle package) or the GUI-based software Weka.
To use Weka, go through the Learning Resource for Weka decision tree.
See the R resources posted on Blackboard.
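The bookkeeping for part (a), a 400/25 split followed by a performance report, can be sketched independently of the tool. The Python below is only an illustration of the logic, not a substitute for the R or Weka work the assignment asks for. The function names, the fixed seed, and the choice of "bad" as the positive class are assumptions made for this sketch, and the exact label coding in CreditRisk.xlsx may differ.

```python
import random


def holdout_split(n_cases, n_test, seed=42):
    """Randomly pick n_test case indices for testing; the rest are for training.

    The seed is an arbitrary choice that makes the split reproducible.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


def confusion_counts(actual, predicted, positive="bad"):
    """Count true/false positives and negatives.

    Treating "bad" credit standing as the positive class is an assumption.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of cases classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# 425 cases in total: 400 for training, 25 held out for testing.
train_idx, test_idx = holdout_split(425, 25)
```

Reporting the confusion counts alongside accuracy for both the 400 training cases and the 25 test cases shows whether the tree merely memorized the training data. With only 25 test cases, a single misclassification moves test accuracy by 4 percentage points, so the test figure should be read with caution.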
(b) Using the same dataset, also develop a Neural Network (NN) model using either R or Weka.
(c) Compare and evaluate the performance of the decision tree and NN models, using 10-fold
cross-validation and leave-one-out cross-validation for classification assessment. Also generate ROC
plots. Explain and discuss the results.
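The evaluation schemes named in part (c) can be sketched as follows. This is a tool-neutral Python illustration under stated assumptions, not the R or Weka procedure itself: the function names are hypothetical, "bad" is assumed to be the positive class, and each model is assumed to output a score where higher means more likely "bad". Leave-one-out is treated as k-fold cross-validation with k equal to the number of cases.

```python
def k_fold_indices(n_cases, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Leave-one-out is the special case k == n_cases.
    Cases are assumed to be shuffled beforehand; folds are contiguous blocks.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_cases // k + (1 if i < n_cases % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_cases))
        yield train_idx, test_idx
        start += size


def roc_points(labels, scores, positive="bad"):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    The threshold is swept from the highest score to the lowest.
    Assumes both classes occur in labels.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every case tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


ten_fold = list(k_fold_indices(425, 10))
leave_one_out = list(k_fold_indices(425, 425))
```

Collecting the held-out scores from every fold and passing them to roc_points once gives a single ROC curve per model, which makes the decision tree and the NN directly comparable on the same axes.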
(d) How can you improve the prediction accuracy? What pre-processing or post-processing
steps are required to improve the accuracy? Finally, implement them to show that they really improve