4. Modeling

4.1 Loading libraries & dataset

4.2 Feature selection: univariate filter selection

Select the K best features using the chi-squared score (see the FSinR package: https://dicits.ugr.es/software/FSinR/).

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

We keep only the best-scoring features, add back the categorical variables (segment and channel), and remove the id column.
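As an illustration, here is a minimal sketch of such a univariate chi-squared filter in base R (using chisq.test rather than FSinR); the data frame name `train`, the target `gender`, the column names, the binning, and the value of K are assumptions, not the exact code of the notebook.

```r
library(dplyr)

# Numeric candidate features: drop id, the target and the categorical columns
# (names are assumptions based on the text).
numeric_feats <- train %>%
  select(-id, -gender, -segment, -channel) %>%
  select(where(is.numeric))

# Score each feature: coarsely bin it and test dependence with the class.
chi2_scores <- sapply(numeric_feats, function(x) {
  bins <- cut(x, breaks = 5)   # arbitrary discretization for the chi-squared test
  stat <- suppressWarnings(chisq.test(table(bins, train$gender)))$statistic
  unname(stat)
})

k <- 20                                            # illustrative choice of K
best_feats <- names(sort(chi2_scores, decreasing = TRUE))[seq_len(k)]

# Final modeling set: best numeric features + categorical variables, id removed.
model_data <- train %>% select(all_of(best_feats), segment, channel, gender)
```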

4.3 First model: logistic regression & class balancing techniques

We use the mlr3 package for modeling: https://mlr3.mlr-org.com/index.html and https://mlr3book.mlr-org.com/basics.html

Create an mlr3 Task that we will reuse for each model.
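A minimal sketch of the task creation, assuming the filtered data lives in `model_data` with target `gender`; the task id and positive class are illustrative.

```r
library(mlr3)

# Build the classification task once; every learner below will be trained on it.
task <- TaskClassif$new(
  id       = "gender_prediction",
  backend  = model_data,
  target   = "gender",
  positive = "F"            # assumed label of the minority class
)
task
```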

Let us plot a summary view of our data

Plot a correlation analysis of the selected features, excluding factors.
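A possible sketch of these two exploratory steps, reusing `model_data` from above; `corrplot` is one plotting package among several that would work here.

```r
library(corrplot)

summary(model_data)                                   # summary view of the data

# Correlation plot of the numeric features only (factors excluded).
num_only <- model_data[, sapply(model_data, is.numeric)]
corrplot(cor(num_only), method = "circle", type = "upper")
```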

First, let us define a logistic regression learner.

Create the class balancing techniques: undersampling, oversampling, and none.

Create a pipeline combining undersampling with the logistic regression learner, wrapped in an mlr3 GraphLearner.

Apply the same steps for oversampling.
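A sketch of the learner and the two balancing pipelines, following the usual mlr3pipelines class-balancing pattern; the ratio/reference settings are illustrative choices that balance the classes 1:1.

```r
library(mlr3learners)
library(mlr3pipelines)

learner_logreg <- lrn("classif.log_reg")

# Undersample the majority class down to the size of the minority class.
po_under <- po("classbalancing", id = "undersample",
               adjust = "major", reference = "minor", ratio = 1)

# Oversample the minority class up to the size of the majority class.
po_over <- po("classbalancing", id = "oversample",
              adjust = "minor", reference = "major", ratio = 1)

# Wrap each pipeline in a GraphLearner so it behaves like a regular learner.
lrn_under <- GraphLearner$new(po_under %>>% learner_logreg)
lrn_over  <- GraphLearner$new(po_over %>>% learner_logreg)
lrn_none  <- learner_logreg
```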

Here we want to benchmark (compare) the results of the different techniques. To do that, we use mlr3's benchmark_grid().

Define a resampling strategy for the benchmark. We want the benchmark to run on the same split for every learner, i.e. rsmp("holdout").
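A sketch of the benchmark design; benchmark_grid() instantiates the holdout resampling once per task, so all three learners are evaluated on the same split. The ratio and seed are illustrative.

```r
set.seed(42)                                      # reproducible split

resampling_holdout <- rsmp("holdout", ratio = 0.8)

design <- benchmark_grid(
  tasks       = task,
  learners    = list(lrn_none, lrn_under, lrn_over),
  resamplings = resampling_holdout
)
bmr <- benchmark(design)
```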

Retrieve the results from the benchmark with the accuracy metric (the metric used for the challenge).

Do the same for recall and precision.
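For example, the aggregation over all three metrics could look like this:

```r
# Aggregate accuracy, recall and precision per learner over the benchmark.
measures <- msrs(c("classif.acc", "classif.recall", "classif.precision"))
bmr$aggregate(measures)
```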

=> The over- and undersampling techniques do not increase accuracy. Precision is slightly better with the class-balancing techniques.
But the recall of plain classif.log_reg is suspiciously high (= 1).

We define a train/test split and train our different learners to get a detailed view of the predictions.
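A sketch of that manual split and inspection loop; partition() and the 80/20 ratio are illustrative choices.

```r
set.seed(42)
split <- partition(task, ratio = 0.8)              # train/test row ids

for (l in list(lrn_none, lrn_under, lrn_over)) {
  l$train(task, row_ids = split$train)
  pred <- l$predict(task, row_ids = split$test)
  print(pred$confusion)                            # confusion matrix per learner
  print(pred$score(msr("classif.acc")))
}
```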

=> This first model confirms that the best accuracy is obtained without any sampling technique. Nonetheless, the confusion matrix shows that the model does not distinguish men from women: it predicts that all ids are men.
The main reason is most likely the imbalanced data. Let us confirm that.

Here we can see that, thanks to undersampling, the model does distinguish men from women. Let us look at the results for oversampling.

=> Precision is better, accuracy is lower.

Let us use resampling with 3-fold cross-validation to check the consistency of the oversampling technique.
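A sketch of this check, resampling the oversampling pipeline with 3-fold CV:

```r
rr <- resample(task, lrn_over, rsmp("cv", folds = 3))
rr$score(msr("classif.acc"))       # per-fold accuracy
rr$aggregate(msr("classif.acc"))   # mean accuracy
```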

We can see that classification accuracy does not change much across the splits, which confirms our choice of oversampling for the next models.

4.4 Benchmarking models

4.4.1 Define models and pipe

Here we define different classification models inside a pipeline combining the oversampling technique with one-hot factor encoding (see the sketch after the benchmark grid below).

Random Forest

Elastic Net Regularization Regression Learner

Log Regression

Single-hidden-layer neural network

Design the benchmark grid. We will compare our models over 5 cross-validation splits.
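A sketch of the four pipelines and the benchmark design, assuming the mlr3learners keys classif.ranger, classif.glmnet, classif.log_reg and classif.nnet for the four models named above:

```r
library(mlr3learners)

# Oversampling + one-hot encoding + learner, wrapped as a GraphLearner.
make_pipe <- function(learner) {
  GraphLearner$new(po_over %>>% po("encode", method = "one-hot") %>>% learner)
}

learners <- list(
  make_pipe(lrn("classif.ranger")),    # random forest
  make_pipe(lrn("classif.glmnet")),    # elastic net regularized regression
  make_pipe(lrn("classif.log_reg")),   # logistic regression
  make_pipe(lrn("classif.nnet"))       # single-hidden-layer neural network
)

design <- benchmark_grid(task, learners, rsmp("cv", folds = 5))
```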

Let us launch our benchmark.

And plot the metric results.
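For example, with mlr3viz:

```r
library(mlr3viz)

bmr_models <- benchmark(design)
bmr_models$aggregate(msrs(c("classif.acc", "classif.recall", "classif.precision")))
autoplot(bmr_models, measure = msr("classif.acc"))   # boxplot of accuracy per learner
```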

The single-hidden-layer neural net does not perform badly.
We now construct a more complex neural net to benchmark against the other models.
We use the keras library, which integrates well with the mlr3 framework.

4.4.2 Define a more complex Neural Net using keras

Let us re-run the benchmark with the keras model.
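Since the exact keras architecture is not shown here, the following is only an illustrative sketch, trained directly with keras (outside mlr3) on the numeric features selected earlier (`best_feats`, `model_data` and `split` come from the previous sketches); the layer sizes, epochs and label encoding are assumptions.

```r
library(keras)

x_train <- as.matrix(model_data[split$train, best_feats])
y_train <- as.numeric(model_data$gender[split$train] == "F")  # assumed label encoding

# A deeper feed-forward network than the single-hidden-layer nnet above.
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = optimizer_sgd(),
  loss      = "binary_crossentropy",
  metrics   = "accuracy"
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 32,
  validation_split = 0.2, verbose = 0
)
```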

Best model: the Neural Net, but its recall value is near 1. As we have already seen with this phenomenon, it means that our model does not distinguish women from men. This probably comes from the loss optimized by the SGD optimizer set inside keras, which ignores the class imbalance.
One idea would be to compensate for this with class weights.
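A sketch of that idea, reusing the keras model above; the weight of 5 for the minority class is purely illustrative.

```r
# Give the (assumed) minority class "1" a larger weight in the loss.
history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 32,
  class_weight = list("0" = 1, "1" = 5),
  verbose = 0
)
```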

Let us set up the tuning part with the Random Forest instead (our second-best model).

4.5 Hyperparameter tuning

We will evaluate all hyperparameter configurations using 10-fold CV. We use a fixed train/test split, i.e. the same splits for each evaluation. Otherwise, some evaluations could get unusually “hard” splits, which would make comparisons unfair.

Set the search space for the parameter grid search.

We choose the area under the precision-recall curve (PR-AUC) as the metric for our grid search.
This is a recommended metric for imbalanced datasets: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/
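A sketch of the tuning setup, assuming the oversampling + one-hot + ranger pipeline from above; the hyperparameter ranges and grid resolution are illustrative.

```r
library(mlr3tuning)
library(paradox)

rf_pipe <- GraphLearner$new(
  po_over %>>% po("encode", method = "one-hot") %>>% lrn("classif.ranger")
)
rf_pipe$predict_type <- "prob"        # PR-AUC needs predicted probabilities

# Small grid over two ranger hyperparameters (illustrative ranges).
search_space <- ps(
  classif.ranger.mtry          = p_int(lower = 2, upper = 8),
  classif.ranger.min.node.size = p_int(lower = 1, upper = 10)
)

at <- AutoTuner$new(
  learner      = rf_pipe,
  resampling   = rsmp("cv", folds = 10),    # inner 10-fold CV for tuning
  measure      = msr("classif.prauc"),
  search_space = search_space,
  terminator   = trm("none"),               # grid search visits every point
  tuner        = tnr("grid_search", resolution = 4)
)
```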

We can now use the learner like any other learner, calling the train() and predict() methods.
This time, however, we pass it to benchmark() to compare the tuned learner to the same pipeline without tuning.
This way, the AutoTuner does its resampling for tuning on the training set of the respective split of the outer resampling.
The learner then makes predictions using the test set of the outer resampling. This yields unbiased performance measures, as the observations in the test set have not been used during tuning or fitting of the respective learner. This is called nested resampling.
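A sketch of this nested-resampling benchmark; the outer 5-fold CV is an illustrative choice.

```r
# Untuned copy of the same pipeline for comparison.
rf_pipe_untuned <- rf_pipe$clone(deep = TRUE)

design_tuning <- benchmark_grid(
  tasks       = task,
  learners    = list(at, rf_pipe_untuned),
  resamplings = rsmp("cv", folds = 5)       # outer resampling
)

bmr_tuning <- benchmark(design_tuning)
bmr_tuning$aggregate(msrs(c("classif.acc", "classif.prauc")))
```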

We can see that our accuracy has increased with the tuned parameters. This model will then be used in part 5 to predict gender on unseen data, i.e. X_test.