Homework

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Dataset

In this homework, we will use the Bank Marketing dataset. Download it from the UCI Machine Learning Repository, or get it with wget:

wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

We need the bank/bank-full.csv file from the downloaded zip file. Use a semicolon as the separator in the read_csv function.

In this dataset, the target for the classification task is the y variable: whether the client has subscribed to a term deposit.
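The loading step can be sketched like this; a toy semicolon-separated sample stands in for bank/bank-full.csv (the rows are hypothetical, but the sep=';' detail is the one that matters):

```python
import io

import pandas as pd

# Toy sample standing in for bank/bank-full.csv (hypothetical rows);
# the real file uses the same semicolon-separated layout.
sample = "age;job;y\n30;admin.;no\n45;technician;yes\n"

# The key detail: pass sep=';' because the file is semicolon-separated.
df = pd.read_csv(io.StringIO(sample), sep=";")
print(df.shape)  # (2, 3)
```

With the real file, replace the StringIO object with the path 'bank/bank-full.csv'.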

Features

For the rest of the homework, you'll need to use only these columns:

  • age,
  • job,
  • marital,
  • education,
  • balance,
  • housing,
  • contact,
  • day,
  • month,
  • duration,
  • campaign,
  • pdays,
  • previous,
  • poutcome,
  • y

Data preparation

  • Select only the features from above.
  • Check whether any of the features contain missing values.
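These two steps can be sketched as follows, on a toy frame with a subset of the homework columns (the values are hypothetical):

```python
import pandas as pd

# Toy frame with a few of the homework columns (hypothetical values).
df = pd.DataFrame({
    "age": [30, 45, 50],
    "education": ["primary", "secondary", "secondary"],
    "balance": [100.0, 200.0, None],
    "y": ["no", "yes", "no"],
})

features = ["age", "education", "balance", "y"]
df = df[features]            # keep only the listed columns
missing = df.isnull().sum()  # NaN count per column
print(missing)
```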

Question 1

What is the most frequent observation (mode) for the column education?

  • unknown
  • primary
  • secondary
  • tertiary
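One way to compute the mode, sketched on a toy column (with the real data, run the same call on df['education']):

```python
import pandas as pd

# Toy stand-in for df['education'] (hypothetical values).
education = pd.Series(["primary", "secondary", "secondary", "tertiary"])

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = education.mode()[0]
print(most_frequent)  # secondary
```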

Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

  • age and balance
  • day and campaign
  • day and pdays
  • pdays and previous
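A sketch of the correlation-matrix step on toy numeric columns (hypothetical values); with the real data, select the numeric columns first, e.g. with df.select_dtypes(include='number'):

```python
import pandas as pd

# Toy numeric columns (hypothetical values).
num = pd.DataFrame({
    "age": [30, 45, 50, 22],
    "balance": [100, 200, 150, 80],
    "pdays": [-1, 10, 5, -1],
})

corr = num.corr()  # pairwise Pearson correlation coefficients
# The diagonal is always 1; scan the off-diagonal entries
# for the most correlated pair.
print(corr.round(2))
```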

Target encoding

  • Now we want to encode the y variable.
  • Let's replace the values yes/no with 1/0.
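The encoding is a one-liner, shown here on a toy target column (hypothetical values):

```python
import pandas as pd

# Toy target column (hypothetical values).
df = pd.DataFrame({"y": ["no", "yes", "no"]})

# Replace yes/no with 1/0.
df["y"] = (df["y"] == "yes").astype(int)
print(df["y"].tolist())  # [0, 1, 0]
```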

Split the data

  • Split your data in train/val/test sets with 60%/20%/20% distribution.
  • Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
  • Make sure that the target value y is not in your dataframe.
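train_test_split only does one split at a time, so a common pattern is to call it twice: first peel off 20% for test, then split the remaining 80% with test_size=0.25 (0.25 × 80% = 20%), giving 60%/20%/20% overall. A sketch on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and target (hypothetical values).
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# 20% out for test, then 25% of the remaining 80% for validation.
X_full, X_test, y_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```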

Question 3

  • Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.
  • Round the scores to 2 decimals using round(score, 2).

Which of these variables has the biggest mutual information score?

  • contact
  • education
  • housing
  • poutcome
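A sketch of the score computation with scikit-learn's mutual_info_score, on a toy categorical column and target (hypothetical values); run the same call for each categorical variable against y on the training set:

```python
from sklearn.metrics import mutual_info_score

# Toy categorical column and binary target (hypothetical values).
poutcome = ["success", "failure", "success", "unknown"]
y = [1, 0, 1, 0]

score = round(mutual_info_score(poutcome, y), 2)
print(score)  # 0.69
```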

Question 4

  • Now let's train a logistic regression.
  • Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
  • Fit the model on the training dataset.
    • To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    • model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
  • Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

  • 0.6
  • 0.7
  • 0.8
  • 0.9
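A minimal sketch of the fit-and-score step on toy frames (hypothetical values). The one-hot step is shown with pd.get_dummies as one possible approach; whatever encoder you use, keep the validation columns aligned with the training columns:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy train/validation frames (hypothetical values).
train = pd.DataFrame({"job": ["admin.", "technician"] * 10, "y": [0, 1] * 10})
val = pd.DataFrame({"job": ["admin.", "technician"] * 5, "y": [0, 1] * 5})

# One-hot encode, aligning validation columns with the training columns.
X_train = pd.get_dummies(train[["job"]])
X_val = pd.get_dummies(val[["job"]]).reindex(columns=X_train.columns,
                                             fill_value=0)

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000,
                           random_state=42)
model.fit(X_train, train["y"])

acc = round(accuracy_score(val["y"], model.predict(X_val)), 2)
print(acc)  # 1.0 on this perfectly separable toy data
```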

Question 5

  • Let's find the least useful feature using the feature elimination technique.
  • Train a model using the same features and parameters as in Q4 (without rounding).
  • Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
  • For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

  • age
  • balance
  • marital
  • previous

Note: The difference doesn't have to be positive.
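The elimination loop can be sketched like this on toy numeric frames (hypothetical values; the real data also needs the one-hot encoding from Q4). The helper name accuracy_without is made up for this sketch:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def accuracy_without(train, val, dropped=None):
    """Fit on all features except `dropped`; return validation accuracy."""
    cols = [c for c in train.columns if c not in ("y", dropped)]
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000,
                               random_state=42)
    model.fit(train[cols], train["y"])
    return accuracy_score(val["y"], model.predict(val[cols]))

# Toy train/validation frames (hypothetical values).
train = pd.DataFrame({"age": range(20), "balance": [0, 1] * 10,
                      "y": [0, 1] * 10})
val = pd.DataFrame({"age": range(10), "balance": [0, 1] * 5,
                    "y": [0, 1] * 5})

base = accuracy_without(train, val)  # all features, not rounded
diffs = {f: base - accuracy_without(train, val, dropped=f)
         for f in ["age", "balance"]}
# Smallest difference = removing the feature changes accuracy the least.
least_useful = min(diffs, key=diffs.get)
print(diffs, least_useful)
```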

Question 6

  • Now let's train a regularized logistic regression.
  • Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].
  • Train models using all the features as in Q4.
  • Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these values of C leads to the best accuracy on the validation set?

  • 0.01
  • 0.1
  • 1
  • 10
  • 100

Note: If there are multiple options, select the smallest C.
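The tuning loop can be sketched as follows; synthetic data from make_classification stands in for the Q4 features and splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; with the real dataset, reuse the Q4
# features and splits.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

scores = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000,
                               random_state=42)
    model.fit(X_train, y_train)
    scores[C] = round(accuracy_score(y_val, model.predict(X_val)), 3)

# Best accuracy; ties broken in favour of the smallest C.
best_C = max(scores, key=lambda C: (scores[C], -C))
print(scores, best_C)
```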

Submit the results