U1-Predicting-response-times-of-the-Paris-Fire_Brigade-vehicles

Edgar Jullien, Antoine Settelen, Simon Weiss

2020-11-14

  • Navigation
    • 1. Introduction
      • 1.0 Presentation of the project
      • 1.0 Features description
      • 1.1 Load libraries
      • 1.2 Load data, transform them into Datatable for better computation time and combine them.
      • 1.3 File structure and content
      • 1.4 Missing values
      • 1.5 Conversion into right format
    • 2. EDA
      • 2.1 Intro : map
      • 2.2 Dependent Variables Ys
    • 3. Explanatory data : Manage Outliers
      • 3.1 Verify data outliers
      • 3.2. What is time difference between departure.presentation and OSRM estimated duration ?
    • 4. Feature engineering
      • 4.1: Speed [km/h]
      • 4.2 Add month, day of week
      • 4.3 Order of the brigade
      • 4.4 GPS
    • 5. Correlation analysis
    • 6. Boosted Tree with XGB
      • 6.1 Sample
      • 6.2 Yo : Selection - Departure
      • 6.3 Y1 : Departure - Presentation
      • 6.4 Ys : Compute global
      • 6.5 Use feature importance to create group for factor variables for futur models
    • 7. OLS - Regression
      • 7.1 Yo : Selection - Departure
      • 7.2 Y1
      • 7.4 Ys : Compute global
    • 8. Actual predict : Compute Global for x_test
    • 9. Xgboost : x_test
      • 9.1 reduced x_test
      • 9.2 all x_test
      • Y0
      • Y1
      • Compute global
    • 11. To Follow …
    • Acknowledgements and references


1. Introduction

1.0 Presentation of the project

This group project responds to the instructions of Professor Nadine Galy, reproduced below:
The project involves identifying a real-world business problem or opportunity and designing and implementing an analysis plan to address it using at least one of the modelling methods studied in the course. You are free to choose any business problem or opportunity or public policy issue that you consider challenging and useful to address using business analytics. The data that you use should be readily available and verifiable.

This is a notebook for the Paris Fire Brigade data challenge 2020 with ENS and College de France.

The goal of this playground challenge is to predict the response times of the Paris Fire Brigade vehicles, i.e. the delay between:

  • the selection of a rescue vehicle (the time when a rescue team is warned)
  • the rescue team's arrival at the scene of the request (information sent manually via portable radio)

This measurement is composed of the following two periods of time:

  • the activation period of the rescue team
  • the transit time of the rescue team

The prediction is based on features such as the intervention coordinates, the selection date and time, the type of location of the event, and the vehicle.

The data cover the entire year 2018, with inoperable records already filtered out, and come as 219,337 training observations and 108,033 test observations. Each row describes one selection of an emergency vehicle for a Paris Fire Brigade intervention.

“Response time is one of the most important factors for firefighters because their ability to save lives and rescue people depends on it. Every fire department in the world seeks strategies to decrease their response time, and several analyses have been conducted in the past years to determine what could impact response time. In the meantime, fire departments have been collecting data on their interventions; yet, few of them actually use data science to develop a data-driven decision making approach.” (https://medium.com/crim/predicting-the-response-times-of-firefighters-using-data-science-da79f6965f93)

“A lot of fire departments and emergency services rely on geographic information systems tools, such as ESRI ARCGis or Network Analyst, to obtain estimations about the response time. These tools rely on computing the shortest route using a graphical representation of the road network, which usually gives an accurate estimate of the travel time. Their drawback is that they cannot always take into consideration external dynamic factors such as the weather, traffic or type of units or intervention. Hence, there is an opportunity for machine learning tools to be used here.”

In this notebook, we first study and visualize the original data, engineer new features, and examine potential outliers. Then we implement a boosted tree as our first model, reduce the dimension of the qualitative features, and implement a linear regression. Finally, we create a final prediction and upload it to the data platform.

We hope that this notebook achieves good results in the challenge and fully meets Nadine Galy's requirements. As always, any feedback, questions, or constructive criticism are much appreciated.

1.0 Features description

Input parameters (x_train.csv and x_test.csv):

  • [ID] emergency vehicle selection: identifier of the selection instance of an emergency vehicle for an intervention
  • Intervention
    • intervention: identifier of the intervention
    • Alert reason
      • alert reason category (category): alert reason category
      • alert reason (category): alert reason
    • Address
      • intervention on public roads (boolean): 1 when it concerns an intervention on public roads, 0 otherwise
      • floor (int): floor of the intervention
      • location of the event (category): qualifies the location of the emergency request, for example: entrance hall, boiler room, motorway, etc.
      • longitude intervention (float): approximate longitude of the intervention address. ATTENTION: intervention_longitude !
      • latitude intervention (float): approximate latitude of the intervention address. ATTENTION: intervention_latitude !
    • Emergency vehicle
      • emergency vehicle: identifier of the emergency vehicle
        • emergency vehicle type (category): type of the emergency vehicle
        • rescue center (category): identifier of the rescue center to which belong the vehicle (parking spot of the emergency vehicle)
      • selection time (datetime): selection time of the emergency vehicle
        • date key selection (int): selection date in YYYYMMDD format
        • time key selection (int): selection time in HHMMSS format
      • State of the emergency vehicle preceding its selection for an intervention
        • Operational status of the vehicle preceding its selection
          • status preceding selection (category): status of the emergency vehicle prior to selection. An emergency vehicle is in various statuses during an intervention:
            • Selection - selection of the emergency vehicle by the rescue commitment application
            • Departed - the vehicle starts its route to the location of the emergency request
            • Presented - the vehicle arrives at the location of the request
            • Hospital transportation - the vehicle starts its transport of a victim to hospital
            • Hospital arrival - the vehicle arrives at the hospital
            • Leaving hospital - the vehicle leaves the hospital
            • Returned - the vehicle has returned to its parking spot
            • Leave the premises - because the vehicle can also simply leave the scene of an intervention without having to transport any victim
            • Not available - for various reasons the vehicle can be in an unavailable position
            • Not relevant - statuses without interest
          • delta status preceding selection-selection (int): number of seconds before the vehicle was selected when its previous status was entered
        • departed from its rescue center (boolean) : 1 when the vehicle departed from its rescue center (emergency vehicle parking spot), 0 otherwise
        • GPS position of the vehicle before departure
          • longitude before departure (float): longitude of the position of the vehicle preceding his departure. ATTENTION: departure_longitude !
          • latitude previous departure (float): latitude of the position of the vehicle preceding his departure. ATTENTION: departure_latitude !
          • delta position gps previous departure-departure (int): number of seconds before the selection of the vehicle where its GPS position was recorded (when not parked at its emergency center)
        • GPS tracks
          • GPS tracks departure-presentation (float pair list): successive GPS positions (longitude,latitude;longitude,latitude, etc.) of the vehicle between departure and presentation. This information is for informational purposes to study vehicle behaviors. (The beacons, emitting the GPS positions of vehicles, are currently not always lit)
          • GPS tracks departure-presentation datetime (datetime list): datetime associated with successive GPS positions between the departure and the presentation of the vehicle.
        • Estimated route
          • OSRM estimated route (json object): service route response of an OSRM instance (http://project-osrm.org/docs/v5.15.2/api/#route-service) setup with the Ile-de-France OpenStreetMap data
          • OSRM estimated distance (float): distance calculated by the OSRM route service
          • OSRM estimated duration (float): transit delay calculated by the OSRM route service

Output parameters (y_train.csv and y_test.csv):

  • [ID] emergency vehicle selection: identifier of the selection instance of an emergency vehicle for an intervention
  • [TO PREDICT] delta selection-departure(int): elapsed time in seconds between the selection and the departure of the emergency vehicle
  • [TO PREDICT] delta departure-presentation(int): elapsed time in seconds between the departure of the emergency vehicle and its presentation on the intervention scene
  • [TO PREDICT] delta selection-presentation(int): elapsed time in seconds between the selection of the emergency vehicle and its presentation on the intervention scene (delta selection-departure + delta departure-presentation)
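
The third target is described as the sum of the first two. A minimal sketch of that identity, using as toy values the first three training rows that appear in the glimpse output further down (the name `y_toy` is hypothetical, not part of the notebook):

```r
# Toy targets copied from the first three training rows shown in glimpse(data)
y_toy <- data.frame(
  delta.selection.departure    = c(86, 164, 125),
  delta.departure.presentation = c(324, 297, 365),
  delta.selection.presentation = c(410, 461, 490)
)

# The global response time should equal activation time + transit time
recomputed <- y_toy$delta.selection.departure + y_toy$delta.departure.presentation
all_consistent <- all(recomputed == y_toy$delta.selection.presentation)
print(all_consistent)
```

If the identity holds on the full data, predicting the two components separately and adding them is one way to obtain the global prediction.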

Supplementary files (x_train_additional_file.csv and x_test_additional_file.csv)

  • [ID] emergency vehicle selection: identifier of the selection instance of an emergency vehicle for an intervention
  • OSRM estimate from last observed GPS position(json object): service route response from last observed GPS position of an OSRM instance (http://project-osrm.org/docs/v5.15.2/api/#route-service) setup with the Ile-de-France OpenStreetMap data
  • OSRM estimated distance from last observed GPS position(float): distance (in meters) calculated by the OSRM route service from last observed GPS position
  • OSRM estimated duration from last observed GPS position(float): transit delay (in seconds) calculated by the OSRM route service from last observed GPS position
  • time elapsed between selection and last observed GPS position (float): in seconds
  • updated OSRM estimated duration (float): time elapsed (in seconds) between selection and last observed GPS position + OSRM estimated duration from last observed GPS position

Enjoy the read!

1.1 Load libraries

library(magrittr)
library(data.table)
library(sandwich)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(hms)
library(imputeTS)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(leaflet)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(FactoMineR)
library(corrplot)
## corrplot 0.84 loaded
library(Matrix)
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(DataExplorer)

1.2 Load data, transform them into Datatable for better computation time and combine them.

x_train <- read.csv("x_train.csv") %>% setDT
y_train <- read.csv("y_train.csv") %>% setDT
x_test <- read.csv("x_test.csv") %>% setDT
#View(x_train)
data <- cbind(x_train,y_train[,-1]) # drop the ID column of y_train to avoid a duplicate (cbind assumes both files share the same row order)

# Rename the two columns whose names contain a special character
c<- colnames(data)
c[14] <- "date.key.selection"
c[15] <- "time.key.selection"
colnames(data) <- c

#Same for x_test
c<- colnames(x_test)
c[14] <- "date.key.selection"
c[15] <- "time.key.selection"
colnames(x_test) <- c
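
The cbind call above relies on x_train and y_train being in the same row order. A hedged alternative, sketched here on toy tables rather than the real files, is to join on the shared ID column; only the column names come from the dataset, the values and the names `x_toy`, `y_toy`, and `joined` are illustrative.

```r
library(data.table)

# Toy stand-ins for x_train and y_train, deliberately in different row orders
x_toy <- data.table(emergency.vehicle.selection = c(3L, 1L, 2L),
                    floor = c(0L, 1L, 2L))
y_toy <- data.table(emergency.vehicle.selection = c(1L, 2L, 3L),
                    delta.selection.departure = c(164L, 125L, 86L))

# Join on the ID so each row gets its own target, whatever the file order
joined <- merge(x_toy, y_toy, by = "emergency.vehicle.selection")
print(joined)
```

The join costs a little more than cbind but does not silently misalign rows if one file is ever reordered or filtered.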

Store the ID order of x_test

id_order<-data.frame(x_test$emergency.vehicle.selection)

1.3 File structure and content

Let’s get an overview of the data sets using the plot_intro and glimpse tools. First the training data:

plot_intro(data)

glimpse(data)
## Rows: 219,337
## Columns: 29
## $ emergency.vehicle.selection                     <int> 5105452, 4720915, 5...
## $ intervention                                    <int> 13264186, 12663715,...
## $ alert.reason.category                           <int> 3, 3, 3, 3, 3, 3, 9...
## $ alert.reason                                    <int> 2162, 2124, 2163, 2...
## $ intervention.on.public.roads                    <int> 0, 0, 0, 0, 0, 0, 0...
## $ floor                                           <int> 0, 1, 2, 0, 3, 1, 4...
## $ location.of.the.event                           <dbl> 148, 136, 139, 136,...
## $ longitude.intervention                          <dbl> 2.284796, 2.247464,...
## $ latitude.intervention                           <dbl> 48.87967, 48.81819,...
## $ emergency.vehicle                               <int> 4511, 4327, 4509, 5...
## $ emergency.vehicle.type                          <chr> "VSAV BSPP", "PSE",...
## $ rescue.center                                   <int> 2447, 2464, 2438, 2...
## $ selection.time                                  <chr> "2018-07-08 19:02:4...
## $ date.key.selection                              <int> 20180708, 20180104,...
## $ time.key.selection                              <int> 190243, 90259, 1011...
## $ status.preceding.selection                      <chr> "Rentré", "Rentré...
## $ delta.status.preceding.selection.selection      <int> 2027, 28233, 1981, ...
## $ departed.from.its.rescue.center                 <int> 1, 1, 0, 1, 1, 1, 1...
## $ longitude.before.departure                      <dbl> 2.288053, 2.268519,...
## $ latitude.before.departure                       <dbl> 48.88470, 48.82396,...
## $ delta.position.gps.previous.departure.departure <dbl> NA, NA, 33, NA, NA,...
## $ GPS.tracks.departure.presentation               <chr> "2.289000,48.885113...
## $ GPS.tracks.datetime.departure.presentation      <chr> "2018-07-08 19:04:4...
## $ OSRM.response                                   <chr> "{\"code\":\"Ok\",\...
## $ OSRM.estimated.distance                         <dbl> 952.5, 2238.5, 3026...
## $ OSRM.estimated.duration                         <dbl> 105.8, 243.2, 295.4...
## $ delta.selection.departure                       <int> 86, 164, 125, 168, ...
## $ delta.departure.presentation                    <int> 324, 297, 365, 160,...
## $ delta.selection.presentation                    <int> 410, 461, 490, 328,...
plot_intro(x_test)

glimpse(x_test)
## Rows: 108,033
## Columns: 26
## $ emergency.vehicle.selection                     <int> 5271704, 5092931, 5...
## $ intervention                                    <int> 13535032, 13244794,...
## $ alert.reason.category                           <int> 3, 3, 3, 3, 3, 3, 1...
## $ alert.reason                                    <int> 2113, 2113, 2112, 2...
## $ intervention.on.public.roads                    <int> 0, 0, 1, 0, 1, 0, 0...
## $ floor                                           <int> 2, 0, 0, 0, 0, 15, ...
## $ location.of.the.event                           <dbl> 136, 228, 148, 201,...
## $ longitude.intervention                          <dbl> 2.464084, 2.325948,...
## $ latitude.intervention                           <dbl> 48.81844, 48.92520,...
## $ emergency.vehicle                               <int> 5755, 3100, 3538, 6...
## $ emergency.vehicle.type                          <chr> "VSAV BSPP", "VSAV ...
## $ rescue.center                                   <int> 2483, 2462, 2482, 2...
## $ selection.time                                  <chr> "2018-10-02 12:41:2...
## $ date.key.selection                              <int> 20181002, 20180703,...
## $ time.key.selection                              <int> 124122, 131447, 134...
## $ status.preceding.selection                      <chr> "Rentré", "Rentré...
## $ delta.status.preceding.selection.selection      <int> 953, 1906, 654, 108...
## $ departed.from.its.rescue.center                 <int> 1, 1, 1, 1, 1, 1, 1...
## $ longitude.before.departure                      <dbl> 2.481148, 2.301399,...
## $ latitude.before.departure                       <dbl> 48.84103, 48.92930,...
## $ delta.position.gps.previous.departure.departure <dbl> NA, NA, NA, NA, NA,...
## $ GPS.tracks.departure.presentation               <chr> "", "2.309139,48.92...
## $ GPS.tracks.datetime.departure.presentation      <chr> "", "2018-07-03 13:...
## $ OSRM.response                                   <chr> "{\"code\":\"Ok\",\...
## $ OSRM.estimated.distance                         <dbl> 3266.8, 2710.3, 914...
## $ OSRM.estimated.duration                         <dbl> 336.5, 218.4, 85.1,...

We find:

  • We have a good mix of qualitative and quantitative data.
  • Some qualitative data are characters, such as status.preceding.selection; others are dummy variables coded 0/1, such as intervention.on.public.roads.
  • We have NA values.
  • We have ID variables that we can remove.
  • We will have to deal with outliers.

1.4 Missing values

# visualize missing data
introduce(data)
##      rows columns discrete_columns continuous_columns all_missing_columns
## 1: 219337      29                6                 23                   0
##    total_missing_values complete_rows total_observations memory_usage
## 1:               227144          4553            6360773    167533280
plot_missing(data)

introduce(x_train)
##      rows columns discrete_columns continuous_columns all_missing_columns
## 1: 219337      26                6                 20                   0
##    total_missing_values complete_rows total_observations memory_usage
## 1:               227144          4553            5702762    164900528
plot_missing(x_train)

We remove the useless columns; rows with empty cells are dropped further below.

data <- data[,-24] # delete the OSRM response column (JSON object)
data <- data[,-21] # delete the column delta position gps (mostly missing)

Same with x_test data

x_test<-x_test[,-24]
x_test<-x_test[,-21]

Let’s check how many NA values remain

sum(is.na(data))
## [1] 12710
sum(is.na(x_test))
## [1] 6331

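A good proportion: 12,710 missing cells remain in the training data, which is under 6% of the 219,337 rows even if every one falls in a different row, so it is acceptable to omit them.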

data <- na.omit(data)

which(is.na(data))
## integer(0)
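
Before calling na.omit it can help to see where the missing cells sit and how many rows would be lost. A minimal base R sketch on a toy table (column names and values are illustrative, not the real data):

```r
# Toy table with missing cells; column names are illustrative only
toy <- data.frame(
  floor    = c(0, NA, 2, 3),
  duration = c(105.8, 243.2, NA, 167.0),
  distance = c(952.5, 2238.5, 3026.0, 1934.0)
)

na_per_column <- colSums(is.na(toy))        # missing cells in each column
rows_with_na  <- sum(!complete.cases(toy))  # rows that na.omit() would drop
share_dropped <- rows_with_na / nrow(toy)   # fraction of rows lost

print(na_per_column)
print(share_dropped)
```

Counting rows rather than cells gives the share of observations that is actually discarded.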

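(We apply the same cleaning to x_train and y_train, just in case. Note that x_train still contains the mostly empty GPS delta column, so na.omit keeps only its complete rows: 4,553 according to introduce above.)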

y_train <- na.omit(y_train)
x_train<-na.omit(x_train)

1.5 Conversion into right format

Convert qualitative variables into factors

data$alert.reason.category<- as.factor(data$alert.reason.category)
data$alert.reason<-as.factor(data$alert.reason)
data$location.of.the.event <- as.factor(data$location.of.the.event)
data$intervention.on.public.roads <- as.factor(data$intervention.on.public.roads)
data$emergency.vehicle.type <- as.factor(data$emergency.vehicle.type)
data$rescue.center <- as.factor(data$rescue.center)
data$status.preceding.selection  <- as.factor(data$status.preceding.selection)
data$departed.from.its.rescue.center  <- as.factor(data$departed.from.its.rescue.center)
data$floor<-as.factor(data$floor)
data$emergency.vehicle<-as.factor(data$emergency.vehicle)

Apply the same to x_test

x_test$alert.reason.category<- as.factor(x_test$alert.reason.category)
x_test$alert.reason<-as.factor(x_test$alert.reason)
x_test$location.of.the.event <- as.factor(x_test$location.of.the.event)
x_test$intervention.on.public.roads <- as.factor(x_test$intervention.on.public.roads )
x_test$emergency.vehicle.type <- as.factor(x_test$emergency.vehicle.type)
x_test$rescue.center <- as.factor(x_test$rescue.center)
x_test$status.preceding.selection  <- as.factor(x_test$status.preceding.selection)
x_test$departed.from.its.rescue.center  <- as.factor(x_test$departed.from.its.rescue.center)
x_test$floor<-as.factor(x_test$floor)
x_test$emergency.vehicle<-as.factor(x_test$emergency.vehicle)
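
Since data and x_test are data.tables, the repeated as.factor calls can also be written once over a vector of column names. This is only a sketch on a toy table; the name `factor_cols` is hypothetical and the values are made up.

```r
library(data.table)

# Toy table; the three column names are taken from the dataset
toy <- data.table(alert.reason.category = c(3L, 3L, 9L),
                  floor = c(0L, 1L, 4L),
                  OSRM.estimated.distance = c(952.5, 2238.5, 3026.0))

factor_cols <- c("alert.reason.category", "floor")

# Convert every listed column to a factor in place
toy[, (factor_cols) := lapply(.SD, as.factor), .SDcols = factor_cols]
str(toy)
```

Keeping the column list in one place makes it easier to apply exactly the same conversion to the training and test tables.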

Manage selection time

  • selection time (datetime): selection time of the emergency vehicle
    • date key selection (int): selection date in YYYYMMDD format
    • time key selection (int): selection time in HHMMSS format

We can delete selection time, since it duplicates the two keys. We keep the numeric format for date key selection and time key selection for now (more manageable for correlation analysis and regression). For the EDA part, we could convert them into date and time formats.

data<-data[,-c("selection.time")]
x_test<-x_test[,-c("selection.time")]
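
For reference, the two integer keys can be decoded with base R alone. The sketch below uses the first four key values visible in the str output further down; the variable names are illustrative.

```r
# Keys copied from the first four training rows
date_key <- c(20180708L, 20180104L, 20181116L, 20180115L)
time_key <- c(190243L, 90259L, 101147L, 3846L)

# YYYYMMDD integer -> Date
sel_date <- as.Date(as.character(date_key), format = "%Y%m%d")

# HHMMSS integer -> hour, minute, second (leading zeros are lost in the integer)
sel_hour   <- time_key %/% 10000L
sel_minute <- (time_key %/% 100L) %% 100L
sel_second <- time_key %% 100L

# Seconds since midnight, convenient as a numeric feature
sec_of_day <- sel_hour * 3600L + sel_minute * 60L + sel_second
print(data.frame(sel_date, sel_hour, sel_minute, sel_second, sec_of_day))
```

Integer division avoids having to pad values such as 3846 (00:38:46) back to six digits before parsing.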

Store the qualitative variables, the quantitative variables with the Ys, and the quantitative variables alone

str(data)
## Classes 'data.table' and 'data.frame':   206627 obs. of  26 variables:
##  $ emergency.vehicle.selection               : int  5105452 4720915 5365374 4741586 5381209 4731603 5196431 4774057 5277444 5277017 ...
##  $ intervention                              : int  13264186 12663715 13675521 12695745 13698743 12680636 13415648 12744692 13544161 13543445 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 3 3 3 3 3 3 9 3 3 3 ...
##  $ alert.reason                              : Factor w/ 122 levels "1911","1912",..: 60 41 61 60 60 31 101 60 32 31 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ floor                                     : Factor w/ 45 levels "-10","-6","-5",..: 8 9 10 8 11 9 12 8 8 8 ...
##  $ location.of.the.event                     : Factor w/ 210 levels "100","101","102",..: 48 36 39 36 5 56 36 36 48 94 ...
##  $ longitude.intervention                    : num  2.28 2.25 2.26 2.39 2.46 ...
##  $ latitude.intervention                     : num  48.9 48.8 48.8 48.8 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 639 levels "1815","1823",..: 398 334 396 517 473 330 293 575 398 606 ...
##  $ emergency.vehicle.type                    : Factor w/ 41 levels "AR","BEAA BSPP",..: 37 24 37 37 37 37 24 37 37 37 ...
##  $ rescue.center                             : Factor w/ 79 levels "2418","2434",..: 15 30 6 72 42 55 42 16 15 50 ...
##  $ date.key.selection                        : int  20180708 20180104 20181116 20180115 20181124 20180109 20180824 20180130 20181005 20181004 ...
##  $ time.key.selection                        : int  190243 90259 101147 3846 3426 222327 81800 74243 82648 5918 ...
##  $ status.preceding.selection                : Factor w/ 2 levels "Disponible","Rentré": 2 2 1 2 2 2 2 2 2 2 ...
##  $ delta.status.preceding.selection.selection: int  2027 28233 1981 1842 2716 5592 37282 5661 31361 304 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
##  $ longitude.before.departure                : num  2.29 2.27 2.27 2.39 2.44 ...
##  $ latitude.before.departure                 : num  48.9 48.8 48.9 48.8 48.9 ...
##  $ GPS.tracks.departure.presentation         : chr  "2.289000,48.885113;2.288861,48.884998;2.288000,48.883335;2.284444,48.878582;2.286250,48.880196" "" "2.272972,48.850498;2.269056,48.847443;2.262611,48.839554;2.260306,48.836887;2.257917,48.836250" "2.394278,48.782112;2.393639,48.776833" ...
##  $ GPS.tracks.datetime.departure.presentation: chr  "2018-07-08 19:04:43;2018-07-08 19:05:55;2018-07-08 19:07:07;2018-07-08 19:08:19;2018-07-08 19:09:18" "" "2018-11-16 10:14:31;2018-11-16 10:15:43;2018-11-16 10:16:57;2018-11-16 10:18:07;2018-11-16 10:19:19" "2018-01-15 00:42:46;2018-01-15 00:43:58" ...
##  $ OSRM.estimated.distance                   : num  952 2238 3026 1934 2707 ...
##  $ OSRM.estimated.duration                   : num  106 243 295 167 263 ...
##  $ delta.selection.departure                 : int  86 164 125 168 138 124 104 118 103 129 ...
##  $ delta.departure.presentation              : int  324 297 365 160 523 419 452 404 411 237 ...
##  $ delta.selection.presentation              : int  410 461 490 328 661 543 556 522 514 366 ...
##  - attr(*, ".internal.selfref")=<externalptr>
colnames(data)
##  [1] "emergency.vehicle.selection"               
##  [2] "intervention"                              
##  [3] "alert.reason.category"                     
##  [4] "alert.reason"                              
##  [5] "intervention.on.public.roads"              
##  [6] "floor"                                     
##  [7] "location.of.the.event"                     
##  [8] "longitude.intervention"                    
##  [9] "latitude.intervention"                     
## [10] "emergency.vehicle"                         
## [11] "emergency.vehicle.type"                    
## [12] "rescue.center"                             
## [13] "date.key.selection"                        
## [14] "time.key.selection"                        
## [15] "status.preceding.selection"                
## [16] "delta.status.preceding.selection.selection"
## [17] "departed.from.its.rescue.center"           
## [18] "longitude.before.departure"                
## [19] "latitude.before.departure"                 
## [20] "GPS.tracks.departure.presentation"         
## [21] "GPS.tracks.datetime.departure.presentation"
## [22] "OSRM.estimated.distance"                   
## [23] "OSRM.estimated.duration"                   
## [24] "delta.selection.departure"                 
## [25] "delta.departure.presentation"              
## [26] "delta.selection.presentation"
data.quali<-data[,c(3,4,5,6,7,10,11,12,15,17)]


data.quanti.y<-data[,-c(3,4,5,6,7,10,11,12,15,17)] #with Ys

data.quanti<-data[,-c(3,4,5,6,7,10,11,12,15,17,24,25,26)]
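
The positional indices above break silently if a column is added or removed. A hedged alternative, shown on a toy data frame, is to split by column type once the factors have been created; the names `is_quali`, `toy_quali`, and `toy_quanti` are hypothetical.

```r
# Toy table mixing factors and numerics; column names are taken from the dataset
toy <- data.frame(
  floor = factor(c("0", "1", "2")),
  emergency.vehicle.type = factor(c("VSAV BSPP", "PSE", "VSAV BSPP")),
  OSRM.estimated.distance = c(952.5, 2238.5, 3026.0),
  delta.selection.departure = c(86L, 164L, 125L)
)

is_quali <- vapply(toy, is.factor, logical(1))  # TRUE for factor columns

toy_quali  <- toy[, is_quali, drop = FALSE]     # qualitative block
toy_quanti <- toy[, !is_quali, drop = FALSE]    # everything that is not a factor
print(names(toy_quali))
print(names(toy_quanti))
```

On the real table the non-factor block would also contain the character GPS columns, so those would still need to be excluded by name.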

2. EDA

2.1 Intro : map

We start with a map of Paris and overlay a manageable number of coordinates to get a general overview of the locations and distances in question. For this visualization we use the leaflet package, which includes a variety of nice tools for interactive maps. In this map you can zoom and pan through the intervention locations:

set.seed(1234)
foo <- sample_n(data, 8e3)

leaflet(data = foo) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircleMarkers(~ longitude.intervention, ~latitude.intervention, radius = 1,
                   color = "blue", fillOpacity = 0.3)

Fig. 1

2.2 Dependent Variables Ys

We have 2 dependent variables and 1 global variable (their sum)

summary(y_train[,-1])
##  delta.selection.departure delta.departure.presentation
##  Min.   :    0.0           Min.   :    1.0             
##  1st Qu.:  100.0           1st Qu.:  231.0             
##  Median :  131.0           Median :  319.0             
##  Mean   :  138.8           Mean   :  356.2             
##  3rd Qu.:  168.0           3rd Qu.:  434.0             
##  Max.   :17758.0           Max.   :22722.0             
##  delta.selection.presentation
##  Min.   :    4.0             
##  1st Qu.:  363.0             
##  Median :  458.0             
##  Mean   :  494.9             
##  3rd Qu.:  581.0             
##  Max.   :22934.0

Extreme values for our Ys variables: the maxima reach roughly 5 to 6 hours (17,758 and 22,722 seconds)!

Plot the density distribution of the selection-to-departure delay

data %>%
  ggplot(aes(delta.selection.departure)) +
  geom_density(fill = "red") +  # geom_density() has no bins argument
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 5 rows containing non-finite values (stat_density).

Note the logarithmic x-axis and square-root y-axis.
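
The warning about removed non-finite values is most likely explained by the zero-second delays (the summary above reports a minimum of 0): a logarithmic axis cannot place a value of 0. A small sketch on toy values, not the real data:

```r
# Toy delays in seconds, including zeros like the minimum seen in the summary
delays <- c(0, 0, 86, 164, 125)

log_delays <- log10(delays)                # log10(0) is -Inf
n_dropped  <- sum(!is.finite(log_delays))  # values a log axis cannot display
print(n_dropped)
```

Such rows are dropped from the plot only; they remain in the data unless filtered explicitly.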

We find:

  • The majority of selection-to-departure delays follow a rather smooth distribution that looks almost log-normal, with a peak just short of 200 seconds, i.e. a little over 3 minutes.

  • There are several suspiciously short delays of less than 10 seconds.

  • Additionally, there is a strange delta-shaped peak far out in the right tail and even a few values beyond it (the maximum in the summary above is 17,758 seconds).

Plot the density distribution of the departure-to-presentation delay

data %>%
  ggplot(aes(delta.departure.presentation)) +
  geom_density(fill = "red") +  # geom_density() has no bins argument
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

We find:

  • The same smooth distribution that looks almost log-normal, with a peak around 400 to 500 seconds, i.e. roughly 7 to 8 minutes.

  • The global variable (selection to arrival) is around 8 to 10 minutes.

  • In many fire departments the measurement of turnout time and travel time are done manually. An officer presses a button located in the vehicle to signal his departure and his arrival. This process introduces irregularities and variation in the data, which will need to be cleaned.

Box plots Ys

For delta.selection.departure

data %>% 
  ggplot(aes(delta.selection.departure)) +
  geom_boxplot() 

For delta.departure.presentation

data %>% 
  ggplot(aes(delta.departure.presentation)) +
  geom_boxplot() 

Correlation between Ys

corr<-rcorr(as.matrix(y_train[,-c(1,4)])) # We drop the ID and the global variable, which is the sum of the other two
y_train_cor= corr$r

corr
##                              delta.selection.departure
## delta.selection.departure                         1.00
## delta.departure.presentation                      0.03
##                              delta.departure.presentation
## delta.selection.departure                            0.03
## delta.departure.presentation                         1.00
## 
## n= 219337 
## 
## 
## P
##                              delta.selection.departure
## delta.selection.departure                             
## delta.departure.presentation  0                       
##                              delta.departure.presentation
## delta.selection.departure     0                          
## delta.departure.presentation
corrplot(y_train_cor, method="square",type="upper", order="hclust", tl.col="black", tl.srt=45)

The two Ys are essentially uncorrelated with each other (r = 0.03).

2.2.1 Manage outliers for Ys values

Plot delta.selection.departure

y_train %>% 
  mutate(
      delta.selection.departure_minutes = round(delta.selection.departure / 60, 0),
    duration_grp = case_when(
      between(delta.selection.departure_minutes, 0,  2)  ~ "Less than 2 minutes",
      between(delta.selection.departure_minutes, 2, 4) ~ "2 to 4 minutes",
      between(delta.selection.departure_minutes, 4, 8) ~ "4 to 8 minutes",
      delta.selection.departure_minutes >= 8 ~ "8 or more minutes"
    ),
    # labels and factor levels must match exactly, otherwise the group becomes NA
    duration_grp = factor(duration_grp, 
                          levels = c("Less than 2 minutes", "2 to 4 minutes", "4 to 8 minutes", "8 or more minutes"))
  ) %>%
  group_by(duration_grp) %>% 
  ggplot(aes(x=duration_grp,group=duration_grp)) + 
  geom_bar(fill="#E41A1C") +
  labs(x="Group", y="count") +
  theme_minimal()

Most of the data lies within the first few minutes for delta.selection.departure (the third quartile is 168 seconds).
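
A note on the grouping: between() is inclusive at both ends, so a boundary value such as 2 minutes satisfies two conditions and case_when() keeps the first match. A hedged alternative is base R's cut() with left-closed bins, sketched here on hypothetical values:

```r
# Toy delays in minutes (hypothetical values)
minutes <- c(1, 2, 3.5, 5, 12)

# Left-closed bins [0,2), [2,4), [4,8), [8,Inf): each value falls in exactly one
duration_grp <- cut(
  minutes,
  breaks = c(0, 2, 4, 8, Inf),
  labels = c("Less than 2 minutes", "2 to 4 minutes",
             "4 to 8 minutes", "8 or more minutes"),
  right = FALSE
)
print(table(duration_grp))
```

cut() returns an ordered set of factor levels directly, so no separate factor() call is needed to fix the bar order.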

Departure to presentation

y_train %>% 
  mutate(
      delta.departure.presentation_minutes = round(delta.departure.presentation / 60, 0),
    duration2_grp = case_when(
      between(delta.departure.presentation_minutes, 0,  5)  ~ "Less than 5 minutes",
      between(delta.departure.presentation_minutes, 5, 10) ~ "5 to 10 minutes",
      between(delta.departure.presentation_minutes, 10, 15) ~ "10 to 15 minutes",
      between(delta.departure.presentation_minutes, 15, 20) ~ "15 to 20 minutes",
      delta.departure.presentation_minutes >= 20 ~ "20 or more minutes"
    ),
    duration2_grp = factor(duration2_grp, 
                          levels = c("Less than 5 minutes", "5 to 10 minutes", "10 to 15 minutes", "15 to 20 minutes","20 or more minutes" ))
  ) %>%
  group_by(duration2_grp) %>% 
  ggplot(aes(x=duration2_grp,group=duration2_grp)) + 
  geom_bar(fill="#E41A1C") +
  labs(x="Group", y="count") +
  theme_minimal()

Most values are at most 20 minutes.

Selection to presentation

y_train %>% 
  mutate(
      delta.selection.presentation_minutes = round(delta.selection.presentation / 60, 0),
    duration3_grp = case_when(
      between(delta.selection.presentation_minutes, 0,  5)  ~ "Less than 5 minutes",
      between(delta.selection.presentation_minutes, 5, 10) ~ "5 to 10 minutes",
      between(delta.selection.presentation_minutes, 10, 15) ~ "10 to 15 minutes",
      between(delta.selection.presentation_minutes, 15, 20) ~ "15 to 20 minutes",
      delta.selection.presentation_minutes >= 20 ~ "20 or more minutes"
    ),
    duration3_grp = factor(duration3_grp, 
                          levels = c("Less than 5 minutes", "5 to 10 minutes", "10 to 15 minutes", "15 to 20 minutes","20 or more minutes" ))
  ) %>%
  group_by(duration3_grp) %>% 
  ggplot(aes(x=duration3_grp,group=duration3_grp)) + 
  geom_bar(fill="#E41A1C") +
  labs(x="Group", y="count") +
  theme_minimal()

Most durations are between 0 and 20 minutes (1200 seconds), so we use this threshold for outlier cleaning.

data<- data[data$delta.selection.departure < 1200,]
data<-data[data$delta.departure.presentation < 1200,]
data<-data[data$delta.selection.presentation < 1200,]
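Before applying a cutoff like this, it can help to check how many rows it removes. A minimal sketch on a toy data frame (the values are hypothetical; only the column names and the 1200-second threshold come from the text):

```r
# Toy data frame with hypothetical durations in seconds.
toy <- data.frame(
  delta.selection.departure    = c(60, 150, 1500, 90, 200),
  delta.departure.presentation = c(300, 1300, 400, 250, 600)
)
threshold <- 1200  # 20 minutes, as in the text

# Keep only rows where both durations are below the threshold.
keep <- toy$delta.selection.departure < threshold &
  toy$delta.departure.presentation < threshold
toy_clean <- toy[keep, ]

# Share of rows dropped by the filter.
removed_share <- 1 - nrow(toy_clean) / nrow(toy)
print(removed_share)
```

A small removed share suggests the threshold only trims extreme values; a large one would mean the cutoff is discarding ordinary interventions.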

2.2.2 Relation analysis between Y and other variables

In this part, we first check whether there is a relation between our dependent variables and the independent variables.

Let’s study Y0 ‘selection-departure’, Y1 ‘departure-presentation’ and Y2 ‘selection-presentation’ against the feature variables.

Data and Ys without the GPS track columns

data.nogps<-data[,-c(21,20)]

We draw a random sample of 3000 observations for the scatter plots.

set.seed(1234)
foocor.all <- sample_n(data.nogps, 3000)

Let’s use scatter plots here. A scatter plot places the observations along two axes, and the pattern of the resulting points reveals any correlation present. One pattern of special interest is a linear one, where the points roughly follow a line going uphill or downhill.

For instance, let’s plot Y0 against OSRM.estimated.duration.

# car >= 3.0 names this argument `smooth`; `smoother` is passed on as a graphical parameter and triggers warnings
car::scatterplot(delta.selection.departure ~ OSRM.estimated.duration, data = foocor.all, 
                 smooth = TRUE, grid = TRUE)

Here we can see that there is no clear linear correlation, at most a slight smooth trend. Now let’s look at delta.departure.presentation against the OSRM estimated duration.

# car >= 3.0 names this argument `smooth`; `smoother` is passed on as a graphical parameter and triggers warnings
car::scatterplot(delta.departure.presentation ~ OSRM.estimated.duration, data = foocor.all, 
                 smooth = TRUE, grid = TRUE)

We can observe a clear positive linear correlation. This is expected, since the estimated travel duration and the actual time to presentation are related.
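A visual trend like this can be backed by a number. The sketch below fits a simple linear regression and computes the Pearson correlation on simulated data; the variables `osrm` and `actual`, the slope of 1.2 and the noise level are assumptions for illustration, not estimates from the project's data:

```r
# Simulated example: observed duration roughly proportional to the OSRM estimate.
set.seed(1)
osrm   <- runif(500, min = 100, max = 900)    # hypothetical OSRM durations (s)
actual <- 1.2 * osrm + rnorm(500, sd = 60)    # hypothetical observed durations (s)

fit   <- lm(actual ~ osrm)                    # ordinary least squares fit
slope <- unname(coef(fit)["osrm"])            # estimated seconds per OSRM second
r     <- cor(osrm, actual)                    # Pearson correlation

print(c(slope = slope, correlation = r))
```

On the real sample, the same two calls could be run with delta.departure.presentation and OSRM.estimated.duration from foocor.all.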

Let’s draw scatter plot matrices. For now we only look at the first row of each matrix, which shows the Y against the other variables, and try to identify visually whether any independent variable has strong explanatory power.

Let’s start by Y0 ‘selection-departure’

First five features

pairs(delta.selection.departure~.,data=foocor.all[,c(1,2,3,4,5,22)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

There does not seem to be any clear linear correlation. Let’s plot the next features.

Next six features

pairs(delta.selection.departure~.,data=foocor.all[,c(6,7,8,9,10,11,22)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

Same conclusion.

Next six features

pairs(delta.selection.departure~.,data=foocor.all[,c(12,13,14,15,16,17,22)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

Same conclusion.

Remaining columns

pairs(delta.selection.departure~.,data=foocor.all[,c(18,19,20,21,22,23,24)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

Same conclusion.

Here, we find no clear visual linear relation between Y0 and the other features.

We will deepen our correlation analysis in section 5 (Correlation analysis).

Let’s quickly check the scatter plot matrices for delta.departure.presentation.

pairs(delta.departure.presentation~.,data=foocor.all[,c(1,2,3,4,5,23)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

Same conclusion.

pairs(delta.departure.presentation~.,data=foocor.all[,c(6,7,8,9,10,11,23)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

Same conclusion.

pairs(delta.departure.presentation~.,data=foocor.all[,c(12,13,14,15,16,17,23)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

Same conclusion.

pairs(delta.departure.presentation~.,data=foocor.all[,c(18,19,20,21,22,23)],
   main="Simple Scatterplot Matrix",lower.panel = NULL)

We find a linear correlation between Y1 and the OSRM estimated duration and distance, which is expected.

We will deepen our correlation analysis in section 5 (Correlation analysis).

But first, let’s manage outliers with some EDA on the independent variables.

3. Explanatory data : Manage Outliers

3.1 Verify data outliers

3.1.1 Quantitative Variables

3.1.1.1 Histograms
which(is.na(data.quanti)) #check for NA values before starting
## integer(0)
hist(data.quanti$longitude.intervention, col="blue",main="Longitude intervention")

hist(data.quanti$latitude.intervention, col="blue",main="Latitude intervention")

hist(data.quanti$delta.status.preceding.selection.selection, col="blue",main="delta status preceding selection-selection")

hist(data.quanti$longitude.before.departure, col="blue",main="longitude before departure ")

hist(data.quanti$latitude.before.departure, col="blue",main="Latitude before departure ")

hist(data.quanti$OSRM.estimated.distance, col="blue",main="OSRM estimated distance")

hist(data.quanti$OSRM.estimated.duration, col="blue",main="OSRM estimated duration")

3.1.1.2 Plot Variables Density Distribution
data.quanti %>%
  ggplot(aes(longitude.intervention)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

data.quanti %>%
  ggplot(aes(latitude.intervention)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

data.quanti %>%
  ggplot(aes(delta.status.preceding.selection.selection)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing non-finite values (stat_density).

data.quanti %>%
  ggplot(aes(longitude.before.departure)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

data.quanti %>%
  ggplot(aes(latitude.before.departure)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

data.quanti %>%
  ggplot(aes(OSRM.estimated.distance)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

data.quanti %>%
  ggplot(aes(OSRM.estimated.duration)) +
  geom_density(fill = "red") +
  scale_x_log10() +
  scale_y_sqrt() + 
  theme_minimal()

3.1.1.3 Boxplots
boxplot(data.quanti$longitude.intervention, col="blue",main="Longitude intervention")

boxplot(data.quanti$latitude.intervention, col="blue",main="Latitude intervention")

boxplot(data.quanti$delta.status.preceding.selection.selection , col="blue",main="delta status preceding selection-selection")

boxplot(data.quanti$longitude.before.departure, col="blue",main="longitude before departure")

boxplot(data.quanti$latitude.before.departure, col="blue",main="latitude before departure")

boxplot(data.quanti$OSRM.estimated.distance, col="blue",main="OSRM estimated distance")

boxplot(data.quanti$OSRM.estimated.duration, col="blue",main="OSRM estimated duration")
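The points drawn beyond the whiskers of a boxplot can also be listed numerically with boxplot.stats(), which applies the same 1.5 × IQR rule that boxplot() uses by default. A minimal sketch on a hypothetical vector (the values are made up for illustration):

```r
# Hypothetical values with one extreme observation.
x <- c(10, 12, 11, 13, 12, 14, 11, 100)

# boxplot.stats() (grDevices) returns the five-number summary and the
# observations lying more than coef * IQR beyond the hinges (coef = 1.5).
stats <- boxplot.stats(x)

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)    # values flagged as outliers
```

Applied to a column such as data.quanti$OSRM.estimated.duration, length(stats$out) would give the number of flagged points, which helps when choosing a cleaning threshold.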

3.1.1.4 Manage Outliers
data.clean <- data[data$delta.status.preceding.selection.selection < 100000,]
# filter data.clean (not data) so that the first filter is not overwritten
data.clean <- data.clean[data.clean$OSRM.estimated.duration < 1000,]


#Assign the new value to data.quanti
colnames(data.clean)
##  [1] "emergency.vehicle.selection"               
##  [2] "intervention"                              
##  [3] "alert.reason.category"                     
##  [4] "alert.reason"                              
##  [5] "intervention.on.public.roads"              
##  [6] "floor"                                     
##  [7] "location.of.the.event"                     
##  [8] "longitude.intervention"                    
##  [9] "latitude.intervention"                     
## [10] "emergency.vehicle"                         
## [11] "emergency.vehicle.type"                    
## [12] "rescue.center"                             
## [13] "date.key.selection"                        
## [14] "time.key.selection"                        
## [15] "status.preceding.selection"                
## [16] "delta.status.preceding.selection.selection"
## [17] "departed.from.its.rescue.center"           
## [18] "longitude.before.departure"                
## [19] "latitude.before.departure"                 
## [20] "GPS.tracks.departure.presentation"         
## [21] "GPS.tracks.datetime.departure.presentation"
## [22] "OSRM.estimated.distance"                   
## [23] "OSRM.estimated.duration"                   
## [24] "delta.selection.departure"                 
## [25] "delta.departure.presentation"              
## [26] "delta.selection.presentation"
data.quanti<-data.clean[,-c(3,4,5,6,7,10,11,12,15,17,24,25,26)]
data.quanti.y<-data.clean[,-c(3,4,5,6,7,10,11,12,15,17)]

3.1.2 Qualitative Variables

3.1.2.1 Tables
str(data.quali)
## Classes 'data.table' and 'data.frame':   206627 obs. of  10 variables:
##  $ alert.reason.category          : Factor w/ 9 levels "1","2","3","4",..: 3 3 3 3 3 3 9 3 3 3 ...
##  $ alert.reason                   : Factor w/ 122 levels "1911","1912",..: 60 41 61 60 60 31 101 60 32 31 ...
##  $ intervention.on.public.roads   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ floor                          : Factor w/ 45 levels "-10","-6","-5",..: 8 9 10 8 11 9 12 8 8 8 ...
##  $ location.of.the.event          : Factor w/ 210 levels "100","101","102",..: 48 36 39 36 5 56 36 36 48 94 ...
##  $ emergency.vehicle              : Factor w/ 639 levels "1815","1823",..: 398 334 396 517 473 330 293 575 398 606 ...
##  $ emergency.vehicle.type         : Factor w/ 41 levels "AR","BEAA BSPP",..: 37 24 37 37 37 37 24 37 37 37 ...
##  $ rescue.center                  : Factor w/ 79 levels "2418","2434",..: 15 30 6 72 42 55 42 16 15 50 ...
##  $ status.preceding.selection     : Factor w/ 2 levels "Disponible","Rentré": 2 2 1 2 2 2 2 2 2 2 ...
##  $ departed.from.its.rescue.center: Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
table(data.quali$alert.reason.category)
## 
##      1      2      3      4      5      6      7      8      9 
##  15032   5402 172294   1185    614   4019    715    188   7178
prop.table(table(data.quali$alert.reason.category))
## 
##            1            2            3            4            5            6 
## 0.0727494471 0.0261437276 0.8338406888 0.0057349717 0.0029715381 0.0194505074 
##            7            8            9 
## 0.0034603416 0.0009098521 0.0347389257
table(data.quali$alert.reason) 
## 
##  1911  1912  1914  1918  1920  1922  1923  1924  1926  1927  1929  1932  1934 
##   111   699   977  1305     3  8426   261   264     5    24   575     3     2 
##  1940  1941  1942  1944  1951  1952  2011  2012  2014  2015  2017  2018  2020 
##   433  1911     6     2    23     2  1914    22     3    21  1490  1934     4 
##  2021  2022  2026  2028  2112  2113  2115  2116  2118  2119  2120  2121  2122 
##     3     3     5     3 36679 22790    82   511  1722  2164  3327    62    11 
##  2123  2124  2126  2127  2128  2129  2131  2132  2133  2134  2135  2136  2137 
##    17  5693   184     1     1    73    33   791     6  3362  7498  1661    43 
##  2141  2142  2143  2144  2145  2146  2147  2162  2163  2210  2211  2212  2213 
##   457    18   341    10     1     8    66 66311 15052     4    28   556    17 
##  2214  2215  2216  2218  2221  2311  2312  2313  2314  2317  2411  2412  2413 
##   358     3   212     1     9   162   143   229    79     1   163    93     1 
##  2414  2416  2421  2422  2423  2424  2426  2430  2431  2432  2511  2514  2519 
##     3     6  2350   164    28    19   621     2   240    45    21    30     1 
##  2523  2524  2525  2532  2612  2613  2614  2623  2624  2711  2712  2714  2715 
##    79   467    55    62     1     2    10   165    10  5744    13     5     2 
##  2716  2720  2724  2725  2727  2752  2753  4926  5061  5062  5811  7912  7913 
##  1185    84    16    18     1    56    10    12   369  2934     1    57   162 
##  7914 10821 10971 11021 93529 
##    65     1     1     5    37
prop.table(table(data.quali$alert.reason))
## 
##         1911         1912         1914         1918         1920         1922 
## 5.371999e-04 3.382907e-03 4.728327e-03 6.315728e-03 1.451892e-05 4.077879e-02 
##         1923         1924         1926         1927         1929         1932 
## 1.263146e-03 1.277665e-03 2.419819e-05 1.161513e-04 2.782792e-03 1.451892e-05 
##         1934         1940         1941         1942         1944         1951 
## 9.679277e-06 2.095564e-03 9.248549e-03 2.903783e-05 9.679277e-06 1.113117e-04 
##         1952         2011         2012         2014         2015         2017 
## 9.679277e-06 9.263068e-03 1.064720e-04 1.451892e-05 1.016324e-04 7.211061e-03 
##         2018         2020         2021         2022         2026         2028 
## 9.359861e-03 1.935855e-05 1.451892e-05 1.451892e-05 2.419819e-05 1.451892e-05 
##         2112         2113         2115         2116         2118         2119 
## 1.775131e-01 1.102954e-01 3.968504e-04 2.473055e-03 8.333858e-03 1.047298e-02 
##         2120         2121         2122         2123         2124         2126 
## 1.610148e-02 3.000576e-04 5.323602e-05 8.227386e-05 2.755206e-02 8.904935e-04 
##         2127         2128         2129         2131         2132         2133 
## 4.839639e-06 4.839639e-06 3.532936e-04 1.597081e-04 3.828154e-03 2.903783e-05 
##         2134         2135         2136         2137         2141         2142 
## 1.627086e-02 3.628761e-02 8.038640e-03 2.081045e-04 2.211715e-03 8.711349e-05 
##         2143         2144         2145         2146         2147         2162 
## 1.650317e-03 4.839639e-05 4.839639e-06 3.871711e-05 3.194161e-04 3.209213e-01 
##         2163         2210         2211         2212         2213         2214 
## 7.284624e-02 1.935855e-05 1.355099e-04 2.690839e-03 8.227386e-05 1.732591e-03 
##         2215         2216         2218         2221         2311         2312 
## 1.451892e-05 1.026003e-03 4.839639e-06 4.355675e-05 7.840214e-04 6.920683e-04 
##         2313         2314         2317         2411         2412         2413 
## 1.108277e-03 3.823314e-04 4.839639e-06 7.888611e-04 4.500864e-04 4.839639e-06 
##         2414         2416         2421         2422         2423         2424 
## 1.451892e-05 2.903783e-05 1.137315e-02 7.937007e-04 1.355099e-04 9.195313e-05 
##         2426         2430         2431         2432         2511         2514 
## 3.005416e-03 9.679277e-06 1.161513e-03 2.177837e-04 1.016324e-04 1.451892e-04 
##         2519         2523         2524         2525         2532         2612 
## 4.839639e-06 3.823314e-04 2.260111e-03 2.661801e-04 3.000576e-04 4.839639e-06 
##         2613         2614         2623         2624         2711         2712 
## 9.679277e-06 4.839639e-05 7.985404e-04 4.839639e-05 2.779888e-02 6.291530e-05 
##         2714         2715         2716         2720         2724         2725 
## 2.419819e-05 9.679277e-06 5.734972e-03 4.065296e-04 7.743422e-05 8.711349e-05 
##         2727         2752         2753         4926         5061         5062 
## 4.839639e-06 2.710198e-04 4.839639e-05 5.807566e-05 1.785827e-03 1.419950e-02 
##         5811         7912         7913         7914        10821        10971 
## 4.839639e-06 2.758594e-04 7.840214e-04 3.145765e-04 4.839639e-06 4.839639e-06 
##        11021        93529 
## 2.419819e-05 1.790666e-04

According to the previous results, many alert.reason levels have very few observations, so several categories should be combined. However, we don’t have enough information to group the levels of that variable.
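When no domain knowledge is available, one generic option is to merge infrequent levels into a single "Other" level based on counts alone. The sketch below is only an illustration of that idea: the helper `lump_rare()` and the cutoff `min_n` are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: merge factor levels with fewer than min_n observations.
lump_rare <- function(f, min_n) {
  f <- as.factor(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  lv <- levels(f)
  lv[lv %in% rare] <- "Other"
  levels(f) <- lv  # assigning duplicated level names merges those levels
  f
}

# Toy factor using a few alert.reason codes as labels (counts are made up).
x <- factor(c("2162", "2162", "2162", "2112", "2112", "1920", "2127"))
y <- lump_rare(x, min_n = 2)

print(table(y))
```

A count-based grouping ignores what the codes mean, so it is best treated as a fallback rather than a substitute for a meaningful grouping.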

table(data.quali$intervention.on.public.roads) 
## 
##      0      1 
## 174494  32133
prop.table(table(data.quali$intervention.on.public.roads))
## 
##         0         1 
## 0.8444879 0.1555121
table(data.quali$floor) 
## 
##    -10     -6     -5     -4     -3     -2     -1      0      1      2      3 
##      1      5      5    113    141    612   3702 127325  17707  14725  12298 
##      4      5      6      7      8      9     10     11     12     13     14 
##   9770   6892   4708   2756   1691   1121    789    564    418    316    250 
##     15     16     17     18     19     20     21     22     23     24     25 
##    182    131    134     73     25     14     23     23     11     12      8 
##     26     27     28     29     30     31     32     33     37     52     79 
##     15     17      3     12     10      1      7      2      4      1      1 
##    100 
##      9
prop.table(table(data.quali$floor))
## 
##          -10           -6           -5           -4           -3           -2 
## 4.839639e-06 2.419819e-05 2.419819e-05 5.468792e-04 6.823890e-04 2.961859e-03 
##           -1            0            1            2            3            4 
## 1.791634e-02 6.162070e-01 8.569548e-02 7.126368e-02 5.951788e-02 4.728327e-02 
##            5            6            7            8            9           10 
## 3.335479e-02 2.278502e-02 1.333804e-02 8.183829e-03 5.425235e-03 3.818475e-03 
##           11           12           13           14           15           16 
## 2.729556e-03 2.022969e-03 1.529326e-03 1.209910e-03 8.808142e-04 6.339927e-04 
##           17           18           19           20           21           22 
## 6.485116e-04 3.532936e-04 1.209910e-04 6.775494e-05 1.113117e-04 1.113117e-04 
##           23           24           25           26           27           28 
## 5.323602e-05 5.807566e-05 3.871711e-05 7.259458e-05 8.227386e-05 1.451892e-05 
##           29           30           31           32           33           37 
## 5.807566e-05 4.839639e-05 4.839639e-06 3.387747e-05 9.679277e-06 1.935855e-05 
##           52           79          100 
## 4.839639e-06 4.839639e-06 4.355675e-05

According to the previous results, we need to combine several categories for the floor variable.
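Since floor is an ordered quantity stored as a factor, one way to combine its levels is to convert it back to numbers and bin it with cut(). In this sketch the break points and labels are arbitrary assumptions chosen for illustration, and the toy factor is not taken from the dataset:

```r
# Toy factor mimicking the floor variable (values are hypothetical).
floor_factor <- factor(c("-2", "0", "1", "3", "7", "15", "100"))

# Convert factor labels to numbers; as.numeric() alone would return level codes.
floor_num <- as.numeric(as.character(floor_factor))

# Right-closed bins: (-Inf,-1], (-1,0], (0,5], (5,10], (10,Inf)
floor_grp <- cut(
  floor_num,
  breaks = c(-Inf, -1, 0, 5, 10, Inf),
  labels = c("Basement", "Ground", "1 to 5", "6 to 10", "Above 10")
)

print(table(floor_grp))
```

This reduces 45 levels to a handful while keeping the ordering of the floors.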

table(data.quali$location.of.the.event) 
## 
##   100   101   102   103   104   105   106   107   108   109   110   111   112 
##  2357   388     6    93  2687  1354   161  4869   350     2    50   498   579 
##   113   114   115   116   118   119   120   121   122   123   124   125   126 
##   146   677   228    14   315    26    94   181    16    61     2   124    24 
##   127   128   129   130   131   132   133   134   135   136   137   138   139 
##   308   513    13   401  3130    54   120   304  1796 44400  4679   516 51031 
##   140   141   142   143   144   145   146   147   148   149   150   151   153 
##  5237   235    90   562     8     9   320  1024 31997  9905     5     1     2 
##   154   155   156   157   158   159   160   162   163   164   165   166   167 
##     3    18    18   149    30     2     1    15    11   709   100    72   364 
##   168   169   170   171   172   174   175   177   178   179   180   181   182 
##    74     6    14    21    12     6     4    12     9   142    28    74     8 
##   183   184   186   187   188   189   190   191   192   193   194   195   196 
##     2    19   111     7    16   115    17   260    34    18   107    70  1776 
##   197   198   199   200   201   202   203   204   205   206   207   208   209 
##    11   292   785     1   747  1297   165   281   465   478   199    92   135 
##   210   211   212   213   214   215   216   217   218   219   220   221   222 
##  1040   763   194     1    65     1   622    81   842  1866   289    98   391 
##   223   224   225   226   227   228   229   230   231   232   233   234   235 
##     3   945    78   727   274  2142   321   382   330   706   366    35    63 
##   236   237   238   239   240   241   242   243   244   245   246   247   248 
##    93   186    27    12     3  1932   193    34   384     2    18     1     1 
##   250   252   254   255   256   257   258   259   260   261   262   263   264 
##     3    11   137    29    49   749   544  2349   134   226    13   103   201 
##   265   267   268   269   270   271   272   274   275   276   277   279   280 
##     1     6    10    41   238    34    31   617    17   127    10     2     5 
##   282   284   285   286   287   288   289   290   291   292   293   294   295 
##     3     1     3   189     6    48     4     1    12     2     1    34    17 
##   296   297   298   299   300   301   302   303   305   306   307   308   309 
##    20     2     3     2    32     3     1     3     7     3    12    80    31 
##   310   311   312   313   315   316   317   318   319   320   321   322   323 
##     6     2    38     2     2   135    34     8    21   110    41   143   589 
##   324   325 
##     1  2413
prop.table(table(data.quali$location.of.the.event))
## 
##          100          101          102          103          104          105 
## 1.140703e-02 1.877780e-03 2.903783e-05 4.500864e-04 1.300411e-02 6.552871e-03 
##          106          107          108          109          110          111 
## 7.791818e-04 2.356420e-02 1.693874e-03 9.679277e-06 2.419819e-04 2.410140e-03 
##          112          113          114          115          116          118 
## 2.802151e-03 7.065872e-04 3.276435e-03 1.103438e-03 6.775494e-05 1.524486e-03 
##          119          120          121          122          123          124 
## 1.258306e-04 4.549260e-04 8.759746e-04 7.743422e-05 2.952180e-04 9.679277e-06 
##          125          126          127          128          129          130 
## 6.001152e-04 1.161513e-04 1.490609e-03 2.482735e-03 6.291530e-05 1.940695e-03 
##          131          132          133          134          135          136 
## 1.514807e-02 2.613405e-04 5.807566e-04 1.471250e-03 8.691991e-03 2.148800e-01 
##          137          138          139          140          141          142 
## 2.264467e-02 2.497254e-03 2.469716e-01 2.534519e-02 1.137315e-03 4.355675e-04 
##          143          144          145          146          147          148 
## 2.719877e-03 3.871711e-05 4.355675e-05 1.548684e-03 4.955790e-03 1.548539e-01 
##          149          150          151          153          154          155 
## 4.793662e-02 2.419819e-05 4.839639e-06 9.679277e-06 1.451892e-05 8.711349e-05 
##          156          157          158          159          160          162 
## 8.711349e-05 7.211061e-04 1.451892e-04 9.679277e-06 4.839639e-06 7.259458e-05 
##          163          164          165          166          167          168 
## 5.323602e-05 3.431304e-03 4.839639e-04 3.484540e-04 1.761628e-03 3.581333e-04 
##          169          170          171          172          174          175 
## 2.903783e-05 6.775494e-05 1.016324e-04 5.807566e-05 2.903783e-05 1.935855e-05 
##          177          178          179          180          181          182 
## 5.807566e-05 4.355675e-05 6.872287e-04 1.355099e-04 3.581333e-04 3.871711e-05 
##          183          184          186          187          188          189 
## 9.679277e-06 9.195313e-05 5.371999e-04 3.387747e-05 7.743422e-05 5.565584e-04 
##          190          191          192          193          194          195 
## 8.227386e-05 1.258306e-03 1.645477e-04 8.711349e-05 5.178413e-04 3.387747e-04 
##          196          197          198          199          200          201 
## 8.595198e-03 5.323602e-05 1.413174e-03 3.799116e-03 4.839639e-06 3.615210e-03 
##          202          203          204          205          206          207 
## 6.277011e-03 7.985404e-04 1.359938e-03 2.250432e-03 2.313347e-03 9.630881e-04 
##          208          209          210          211          212          213 
## 4.452467e-04 6.533512e-04 5.033224e-03 3.692644e-03 9.388899e-04 4.839639e-06 
##          214          215          216          217          218          219 
## 3.145765e-04 4.839639e-06 3.010255e-03 3.920107e-04 4.074976e-03 9.030766e-03 
##          220          221          222          223          224          225 
## 1.398656e-03 4.742846e-04 1.892299e-03 1.451892e-05 4.573458e-03 3.774918e-04 
##          226          227          228          229          230          231 
## 3.518417e-03 1.326061e-03 1.036651e-02 1.553524e-03 1.848742e-03 1.597081e-03 
##          232          233          234          235          236          237 
## 3.416785e-03 1.771308e-03 1.693874e-04 3.048972e-04 4.500864e-04 9.001728e-04 
##          238          239          240          241          242          243 
## 1.306702e-04 5.807566e-05 1.451892e-05 9.350182e-03 9.340502e-04 1.645477e-04 
##          244          245          246          247          248          250 
## 1.858421e-03 9.679277e-06 8.711349e-05 4.839639e-06 4.839639e-06 1.451892e-05 
##          252          254          255          256          257          258 
## 5.323602e-05 6.630305e-04 1.403495e-04 2.371423e-04 3.624889e-03 2.632763e-03 
##          259          260          261          262          263          264 
## 1.136831e-02 6.485116e-04 1.093758e-03 6.291530e-05 4.984828e-04 9.727674e-04 
##          265          267          268          269          270          271 
## 4.839639e-06 2.903783e-05 4.839639e-05 1.984252e-04 1.151834e-03 1.645477e-04 
##          272          274          275          276          277          279 
## 1.500288e-04 2.986057e-03 8.227386e-05 6.146341e-04 4.839639e-05 9.679277e-06 
##          280          282          284          285          286          287 
## 2.419819e-05 1.451892e-05 4.839639e-06 1.451892e-05 9.146917e-04 2.903783e-05 
##          288          289          290          291          292          293 
## 2.323027e-04 1.935855e-05 4.839639e-06 5.807566e-05 9.679277e-06 4.839639e-06 
##          294          295          296          297          298          299 
## 1.645477e-04 8.227386e-05 9.679277e-05 9.679277e-06 1.451892e-05 9.679277e-06 
##          300          301          302          303          305          306 
## 1.548684e-04 1.451892e-05 4.839639e-06 1.451892e-05 3.387747e-05 1.451892e-05 
##          307          308          309          310          311          312 
## 5.807566e-05 3.871711e-04 1.500288e-04 2.903783e-05 9.679277e-06 1.839063e-04 
##          313          315          316          317          318          319 
## 9.679277e-06 9.679277e-06 6.533512e-04 1.645477e-04 3.871711e-05 1.016324e-04 
##          320          321          322          323          324          325 
## 5.323602e-04 1.984252e-04 6.920683e-04 2.850547e-03 4.839639e-06 1.167805e-02

According to the previous results, we need to combine several categories for the location.of.the.event variable. However, we don’t have enough information to group the levels of that variable.

table(data.quali$emergency.vehicle)
## 
## 1815 1823 1828 1832 1834 1843 1844 1847 1856 1859 1867 1877 1879 1880 1893 1895 
##   61   62  210  281  272   11   96  221   92  157   17    2  503  103   99    2 
## 1901 1910 1913 1914 1926 1933 1941 1943 1952 1969 1980 1985 1986 1991 1992 1994 
##  328  328  716  240   59    1  130   52  494  287    8  264  191   34  457  160 
## 1996 2004 2008 2019 2021 2022 2047 2053 2056 2062 2065 2067 2068 2094 2098 2100 
##   39   36  494  261  157   72  307   46  150   19  608  468   30  247    5  496 
## 2105 2115 2118 2125 2132 2144 2145 2151 2154 2171 2174 2207 2216 2219 2221 2231 
##  107  288  339  380  127  128    6   15  239  325   19    4  199  243  104   46 
## 2241 2242 2244 2252 2253 2255 2276 2280 2281 2289 2290 2291 2292 2293 2294 2296 
##  378  155  204   33  543  284   27  300   60  662  265   15   51  277  341   39 
## 2297 2298 2302 2303 2304 2307 2308 2311 2312 2319 2321 2324 2326 2327 2329 2339 
##  278   53   50  407   10   39  269   60  225   31   60  284  406   61   91   57 
## 2348 2349 2364 2403 2407 2414 2417 2418 2419 2420 2422 2426 2428 2429 2432 2434 
##  185  354  169   47    3   32   25  435   36    6   11  147  217  175  177   34 
## 2456 2473 2476 2477 2478 2485 2493 2496 2499 2500 2501 2515 2517 2518 2525 2526 
##    3  199   76   40  144  105  306    7  129   31   86    1   90  206    4    8 
## 2527 2532 2535 2536 2537 2539 2540 2541 2543 2545 2552 2555 2557 2560 2561 2567 
##    4  420  320   43   48   71  268  242  650   35  142   42   39  264  230  100 
## 2569 2572 2575 2577 2581 2582 2583 2586 2587 2588 2590 2592 2615 2626 2629 2634 
##   47   30   68   84   17    1   68    2   25  108  649   52    1  106   42  662 
## 2635 2638 2639 2641 2663 2664 2669 2670 2671 2672 2699 2705 2734 2766 2896 2902 
##  402  427  811  584   34    2  430  131    5   48  124   76   18   28   60    2 
## 2903 2940 2958 2959 3032 3033 3041 3042 3051 3053 3054 3063 3065 3067 3073 3075 
##    4  489    1   20   26   49   80   24   60    1   30    1  142  414   44   30 
## 3076 3077 3078 3081 3082 3083 3084 3085 3086 3087 3089 3091 3093 3094 3095 3097 
##   33  475  499   78   46   80   23   44    1   74   64   64   22  187   64  424 
## 3099 3100 3101 3103 3105 3107 3113 3114 3121 3122 3123 3133 3135 3136 3218 3219 
##  574  692  508  626  273   23    2    1    6    7    7  225   14  210   62   13 
## 3221 3223 3228 3229 3230 3231 3233 3234 3235 3236 3251 3257 3258 3291 3293 3294 
##    1    1    1    1   23    4   47    1   58   11    1    8   12   13    1   12 
## 3295 3296 3298 3300 3301 3304 3305 3440 3441 3442 3520 3521 3524 3528 3530 3532 
##   14   57   63   39   14    4   30   13    1    1  533  250  326   71   38  867 
## 3533 3536 3537 3538 3539 3540 3542 3543 3544 3550 3551 3603 3902 3903 3904 3905 
##  652  473   48  707  831  321  908  384  889  241  472   28   45   62   61   51 
## 4114 4175 4176 4177 4178 4179 4180 4181 4184 4185 4186 4187 4188 4189 4190 4207 
##    1  334  433  208  719  289  273  673    6  158  182  154  116  298  160    1 
## 4209 4210 4211 4214 4215 4216 4219 4220 4223 4226 4227 4228 4229 4230 4231 4235 
##  451  429  195  246  609    4    2    3  330  195  215  562  150  269  209  202 
## 4238 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4293 4304 4305 4307 
##  512  583  604  441    1   54   90  112   11   99   60   66    1  415  451  458 
## 4308 4309 4311 4313 4314 4315 4316 4317 4318 4319 4320 4322 4323 4327 4328 4333 
##  658  909  309  476  945  716 1020  887  821  709  419  289 1070  334    1    4 
## 4335 4336 4340 4342 4343 4345 4349 4351 4353 4359 4362 4363 4366 4367 4370 4372 
##  492    1  437   38   38  104  560    2    6   87   35  312  137  609    3    3 
## 4375 4376 4377 4379 4386 4387 4388 4389 4391 4399 4400 4401 4406 4407 4415 4430 
##  194  617  216    1  374  912  676  654    2   59  699  775   55  349  259  135 
## 4432 4436 4438 4442 4444 4453 4454 4455 4456 4458 4460 4461 4462 4466 4467 4469 
##    6   80    2 1197  326   25    7   32   34   18  674  385    2  744   32 1187 
## 4471 4475 4485 4486 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 
##  770  577   21   17  727 1203 1185  544 1087  512  662 1228 1256 1291  613  873 
## 4515 4516 4518 4519 4520 4522 4523 4524 4530 4531 4533 4534 4536 4537 4538 4539 
## 1015  386   17 1160  818   63 1133  948    3    2    1  174 1198 1024  493 1305 
## 4540 4541 4542 4543 4544 4545 4546 4565 4648 4653 4844 4870 4871 4872 4873 4874 
## 1281 1337  724 1121  743  669    4    1  594    1    1  665  717  591 1055  643 
## 4875 4876 4877 4878 4879 4891 4893 4894 4899 4900 4902 4903 4904 4905 4906 4907 
##  785  819   96  220 1268   30   55    1    2 1109   11 1196  755  755  864   23 
## 4908 4909 4910 4911 4912 4913 4914 4915 4920 4923 4924 4926 4927 4928 4929 4930 
## 1291 1039  447  709  718    2   86   22   16   23    1  711  549  930   71  876 
## 4931 4940 5059 5060 5264 5269 5621 5622 5623 5624 5625 5633 5635 5637 5642 5645 
## 1122   48    3   40    1    1 1287 1255 1615  933  980  912    4  990    1   61 
## 5646 5657 5661 5665 5666 5670 5671 5673 5683 5684 5685 5686 5695 5697 5699 5700 
##   16    4 1708    3    1  687  912 1155   97   36   42   55   52  415   17   61 
## 5701 5702 5703 5708 5710 5712 5713 5715 5717 5718 5719 5720 5721 5723 5725 5726 
##    1 1524   54   24   45   46   94 1248  116   92 1466  616 1224  956   82 1390 
## 5727 5728 5729 5730 5731 5732 5734 5742 5745 5746 5749 5750 5751 5752 5754 5755 
##  325 1070 1368  323 1024 1027 1123  853  856  947 1228   37   50  966   41  974 
## 5757 5758 5760 5761 5762 5763 5764 5765 5766 5767 5772 5774 5775 5776 5778 5779 
##    3   85 1455 1257  772  674  356  870  587   28   63   57 1394   55   69  735 
## 5780 5781 5783 5785 5787 5788 5789 5795 5798 5799 5803 5804 5805 5806 5807 5808 
##  700 1343  542  125   49    2    1  372    3   38   88  131   27    4  108    4 
## 5810 5811 5812 5813 5814 5816 5817 5819 5820 5821 5822 5823 5881 5882 5883 5885 
##   29  116   99  135    6  467   42  105 1898 1384  384  316  826 1334 1272 1184 
## 5886 5887 5888 5889 5890 5894 5896 5917 5921 5923 5924 5930 5934 5936 5940 5948 
##  941 1285  591  651   13  632    1  110    1   94   64    3   47   48    3    7 
## 5949 5962 5963 5964 5965 5966 5968 5969 5975 5982 5995 5996 5997 5998 6000 6001 
##    3    2   53  210   27  277  247  373    1   66    1  684  725 1801 1378  836 
## 6002 6003 6004 6005 6008 6030 6032 6033 6035 6037 6038 6039 6041 6042 6043 6044 
## 1580  439 1308 1273    1   37  150    4  138  619  548  652  361  719    1    4 
## 6045 6046 6047 6048 6049 6051 6052 6053 6055 6056 6064 6065 6067 6080 6092 
##   36   38   38   22  229   49   53   42   41    5  322  657  603   24    3
prop.table(table(data.quali$emergency.vehicle))
## 
##         1815         1823         1828         1832         1834         1843 
## 2.952180e-04 3.000576e-04 1.016324e-03 1.359938e-03 1.316382e-03 5.323602e-05 
##         1844         1847         1856         1859         1867         1877 
## 4.646053e-04 1.069560e-03 4.452467e-04 7.598233e-04 8.227386e-05 9.679277e-06 
##         1879         1880         1893         1895         1901         1910 
## 2.434338e-03 4.984828e-04 4.791242e-04 9.679277e-06 1.587401e-03 1.587401e-03 
##         1913         1914         1926         1933         1941         1943 
## 3.465181e-03 1.161513e-03 2.855387e-04 4.839639e-06 6.291530e-04 2.516612e-04 
##         1952         1969         1980         1985         1986         1991 
## 2.390781e-03 1.388976e-03 3.871711e-05 1.277665e-03 9.243710e-04 1.645477e-04 
##         1992         1994         1996         2004         2008         2019 
## 2.211715e-03 7.743422e-04 1.887459e-04 1.742270e-04 2.390781e-03 1.263146e-03 
##         2021         2022         2047         2053         2056         2062 
## 7.598233e-04 3.484540e-04 1.485769e-03 2.226234e-04 7.259458e-04 9.195313e-05 
##         2065         2067         2068         2094         2098         2100 
## 2.942500e-03 2.264951e-03 1.451892e-04 1.195391e-03 2.419819e-05 2.400461e-03 
##         2105         2115         2118         2125         2132         2144 
## 5.178413e-04 1.393816e-03 1.640637e-03 1.839063e-03 6.146341e-04 6.194737e-04 
##         2145         2151         2154         2171         2174         2207 
## 2.903783e-05 7.259458e-05 1.156674e-03 1.572883e-03 9.195313e-05 1.935855e-05 
##         2216         2219         2221         2231         2241         2242 
## 9.630881e-04 1.176032e-03 5.033224e-04 2.226234e-04 1.829383e-03 7.501440e-04 
##         2244         2252         2253         2255         2276         2280 
## 9.872863e-04 1.597081e-04 2.627924e-03 1.374457e-03 1.306702e-04 1.451892e-03 
##         2281         2289         2290         2291         2292         2293 
## 2.903783e-04 3.203841e-03 1.282504e-03 7.259458e-05 2.468216e-04 1.340580e-03 
##         2294         2296         2297         2298         2302         2303 
## 1.650317e-03 1.887459e-04 1.345420e-03 2.565008e-04 2.419819e-04 1.969733e-03 
##         2304         2307         2308         2311         2312         2319 
## 4.839639e-05 1.887459e-04 1.301863e-03 2.903783e-04 1.088919e-03 1.500288e-04 
##         2321         2324         2326         2327         2329         2339 
## 2.903783e-04 1.374457e-03 1.964893e-03 2.952180e-04 4.404071e-04 2.758594e-04 
##         2348         2349         2364         2403         2407         2414 
## 8.953331e-04 1.713232e-03 8.178989e-04 2.274630e-04 1.451892e-05 1.548684e-04 
##         2417         2418         2419         2420         2422         2426 
## 1.209910e-04 2.105243e-03 1.742270e-04 2.903783e-05 5.323602e-05 7.114269e-04 
##         2428         2429         2432         2434         2456         2473 
## 1.050202e-03 8.469368e-04 8.566160e-04 1.645477e-04 1.451892e-05 9.630881e-04 
##         2476         2477         2478         2485         2493         2496 
## 3.678125e-04 1.935855e-04 6.969080e-04 5.081621e-04 1.480929e-03 3.387747e-05 
##         2499         2500         2501         2515         2517         2518 
## 6.243134e-04 1.500288e-04 4.162089e-04 4.839639e-06 4.355675e-04 9.969655e-04 
##         2525         2526         2527         2532         2535         2536 
## 1.935855e-05 3.871711e-05 1.935855e-05 2.032648e-03 1.548684e-03 2.081045e-04 
##         2537         2539         2540         2541         2543         2545 
## 2.323027e-04 3.436143e-04 1.297023e-03 1.171193e-03 3.145765e-03 1.693874e-04 
##         2552         2555         2557         2560         2561         2567 
## 6.872287e-04 2.032648e-04 1.887459e-04 1.277665e-03 1.113117e-03 4.839639e-04 
##         2569         2572         2575         2577         2581         2582 
## 2.274630e-04 1.451892e-04 3.290954e-04 4.065296e-04 8.227386e-05 4.839639e-06 
##         2583         2586         2587         2588         2590         2592 
## 3.290954e-04 9.679277e-06 1.209910e-04 5.226810e-04 3.140925e-03 2.516612e-04 
##         2615         2626         2629         2634         2635         2638 
## 4.839639e-06 5.130017e-04 2.032648e-04 3.203841e-03 1.945535e-03 2.066526e-03 
##         2639         2641         2663         2664         2669         2670 
## 3.924947e-03 2.826349e-03 1.645477e-04 9.679277e-06 2.081045e-03 6.339927e-04 
##         2671         2672         2699         2705         2734         2766 
## 2.419819e-05 2.323027e-04 6.001152e-04 3.678125e-04 8.711349e-05 1.355099e-04 
##         2896         2902         2903         2940         2958         2959 
## 2.903783e-04 9.679277e-06 1.935855e-05 2.366583e-03 4.839639e-06 9.679277e-05 
##         3032         3033         3041         3042         3051         3053 
## 1.258306e-04 2.371423e-04 3.871711e-04 1.161513e-04 2.903783e-04 4.839639e-06 
##         3054         3063         3065         3067         3073         3075 
## 1.451892e-04 4.839639e-06 6.872287e-04 2.003610e-03 2.129441e-04 1.451892e-04 
##         3076         3077         3078         3081         3082         3083 
## 1.597081e-04 2.298828e-03 2.414980e-03 3.774918e-04 2.226234e-04 3.871711e-04 
##         3084         3085         3086         3087         3089         3091 
## 1.113117e-04 2.129441e-04 4.839639e-06 3.581333e-04 3.097369e-04 3.097369e-04 
##         3093         3094         3095         3097         3099         3100 
## 1.064720e-04 9.050124e-04 3.097369e-04 2.052007e-03 2.777953e-03 3.349030e-03 
##         3101         3103         3105         3107         3113         3114 
## 2.458536e-03 3.029614e-03 1.321221e-03 1.113117e-04 9.679277e-06 4.839639e-06 
##         3121         3122         3123         3133         3135         3136 
## 2.903783e-05 3.387747e-05 3.387747e-05 1.088919e-03 6.775494e-05 1.016324e-03 
##         3218         3219         3221         3223         3228         3229 
## 3.000576e-04 6.291530e-05 4.839639e-06 4.839639e-06 4.839639e-06 4.839639e-06 
##         3230         3231         3233         3234         3235         3236 
## 1.113117e-04 1.935855e-05 2.274630e-04 4.839639e-06 2.806990e-04 5.323602e-05 
##         3251         3257         3258         3291         3293         3294 
## 4.839639e-06 3.871711e-05 5.807566e-05 6.291530e-05 4.839639e-06 5.807566e-05 
##         3295         3296         3298         3300         3301         3304 
## 6.775494e-05 2.758594e-04 3.048972e-04 1.887459e-04 6.775494e-05 1.935855e-05 
##         3305         3440         3441         3442         3520         3521 
## 1.451892e-04 6.291530e-05 4.839639e-06 4.839639e-06 2.579527e-03 1.209910e-03 
##         3524         3528         3530         3532         3533         3536 
## 1.577722e-03 3.436143e-04 1.839063e-04 4.195967e-03 3.155444e-03 2.289149e-03 
##         3537         3538         3539         3540         3542         3543 
## 2.323027e-04 3.421624e-03 4.021740e-03 1.553524e-03 4.394392e-03 1.858421e-03 
##         3544         3550         3551         3603         3902         3903 
## 4.302439e-03 1.166353e-03 2.284309e-03 1.355099e-04 2.177837e-04 3.000576e-04 
##         3904         3905         4114         4175         4176         4177 
## 2.952180e-04 2.468216e-04 4.839639e-06 1.616439e-03 2.095564e-03 1.006645e-03 
##         4178         4179         4180         4181         4184         4185 
## 3.479700e-03 1.398656e-03 1.321221e-03 3.257077e-03 2.903783e-05 7.646629e-04 
##         4186         4187         4188         4189         4190         4207 
## 8.808142e-04 7.453043e-04 5.613981e-04 1.442212e-03 7.743422e-04 4.839639e-06 
##         4209         4210         4211         4214         4215         4216 
## 2.182677e-03 2.076205e-03 9.437295e-04 1.190551e-03 2.947340e-03 1.935855e-05 
##         4219         4220         4223         4226         4227         4228 
## 9.679277e-06 1.451892e-05 1.597081e-03 9.437295e-04 1.040522e-03 2.719877e-03 
##         4229         4230         4231         4235         4238         4272 
## 7.259458e-04 1.301863e-03 1.011484e-03 9.776070e-04 2.477895e-03 2.821509e-03 
##         4273         4274         4275         4276         4277         4278 
## 2.923142e-03 2.134281e-03 4.839639e-06 2.613405e-04 4.355675e-04 5.420395e-04 
##         4279         4280         4281         4282         4293         4304 
## 5.323602e-05 4.791242e-04 2.903783e-04 3.194161e-04 4.839639e-06 2.008450e-03 
##         4305         4307         4308         4309         4311         4313 
## 2.182677e-03 2.216554e-03 3.184482e-03 4.399231e-03 1.495448e-03 2.303668e-03 
##         4314         4315         4316         4317         4318         4319 
## 4.573458e-03 3.465181e-03 4.936431e-03 4.292759e-03 3.973343e-03 3.431304e-03 
##         4320         4322         4323         4327         4328         4333 
## 2.027809e-03 1.398656e-03 5.178413e-03 1.616439e-03 4.839639e-06 1.935855e-05 
##         4335         4336         4340         4342         4343         4345 
## 2.381102e-03 4.839639e-06 2.114922e-03 1.839063e-04 1.839063e-04 5.033224e-04 
##         4349         4351         4353         4359         4362         4363 
## 2.710198e-03 9.679277e-06 2.903783e-05 4.210486e-04 1.693874e-04 1.509967e-03 
##         4366         4367         4370         4372         4375         4376 
## 6.630305e-04 2.947340e-03 1.451892e-05 1.451892e-05 9.388899e-04 2.986057e-03 
##         4377         4379         4386         4387         4388         4389 
## 1.045362e-03 4.839639e-06 1.810025e-03 4.413750e-03 3.271596e-03 3.165124e-03 
##         4391         4399         4400         4401         4406         4407 
## 9.679277e-06 2.855387e-04 3.382907e-03 3.750720e-03 2.661801e-04 1.689034e-03 
##         4415         4430         4432         4436         4438         4442 
## 1.253466e-03 6.533512e-04 2.903783e-05 3.871711e-04 9.679277e-06 5.793047e-03 
##         4444         4453         4454         4455         4456         4458 
## 1.577722e-03 1.209910e-04 3.387747e-05 1.548684e-04 1.645477e-04 8.711349e-05 
##         4460         4461         4462         4466         4467         4469 
## 3.261916e-03 1.863261e-03 9.679277e-06 3.600691e-03 1.548684e-04 5.744651e-03 
##         4471         4475         4485         4486         4502         4503 
## 3.726522e-03 2.792471e-03 1.016324e-04 8.227386e-05 3.518417e-03 5.822085e-03 
##         4504         4505         4506         4507         4508         4509 
## 5.734972e-03 2.632763e-03 5.260687e-03 2.477895e-03 3.203841e-03 5.943076e-03 
##         4510         4511         4512         4513         4515         4516 
## 6.078586e-03 6.247973e-03 2.966698e-03 4.225004e-03 4.912233e-03 1.868100e-03 
##         4518         4519         4520         4522         4523         4524 
## 8.227386e-05 5.613981e-03 3.958824e-03 3.048972e-04 5.483311e-03 4.587977e-03 
##         4530         4531         4533         4534         4536         4537 
## 1.451892e-05 9.679277e-06 4.839639e-06 8.420971e-04 5.797887e-03 4.955790e-03 
##         4538         4539         4540         4541         4542         4543 
## 2.385942e-03 6.315728e-03 6.199577e-03 6.470597e-03 3.503898e-03 5.425235e-03 
##         4544         4545         4546         4565         4648         4653 
## 3.595851e-03 3.237718e-03 1.935855e-05 4.839639e-06 2.874745e-03 4.839639e-06 
##         4844         4870         4871         4872         4873         4874 
## 4.839639e-06 3.218360e-03 3.470021e-03 2.860226e-03 5.105819e-03 3.111888e-03 
##         4875         4876         4877         4878         4879         4891 
## 3.799116e-03 3.963664e-03 4.646053e-04 1.064720e-03 6.136662e-03 1.451892e-04 
##         4893         4894         4899         4900         4902         4903 
## 2.661801e-04 4.839639e-06 9.679277e-06 5.367159e-03 5.323602e-05 5.788208e-03 
##         4904         4905         4906         4907         4908         4909 
## 3.653927e-03 3.653927e-03 4.181448e-03 1.113117e-04 6.247973e-03 5.028384e-03 
##         4910         4911         4912         4913         4914         4915 
## 2.163318e-03 3.431304e-03 3.474860e-03 9.679277e-06 4.162089e-04 1.064720e-04 
##         4920         4923         4924         4926         4927         4928 
## 7.743422e-05 1.113117e-04 4.839639e-06 3.440983e-03 2.656962e-03 4.500864e-03 
##         4929         4930         4931         4940         5059         5060 
## 3.436143e-04 4.239523e-03 5.430074e-03 2.323027e-04 1.451892e-05 1.935855e-04 
##         5264         5269         5621         5622         5623         5624 
## 4.839639e-06 4.839639e-06 6.228615e-03 6.073746e-03 7.816016e-03 4.515383e-03 
##         5625         5633         5635         5637         5642         5645 
## 4.742846e-03 4.413750e-03 1.935855e-05 4.791242e-03 4.839639e-06 2.952180e-04 
##         5646         5657         5661         5665         5666         5670 
## 7.743422e-05 1.935855e-05 8.266103e-03 1.451892e-05 4.839639e-06 3.324832e-03 
##         5671         5673         5683         5684         5685         5686 
## 4.413750e-03 5.589783e-03 4.694449e-04 1.742270e-04 2.032648e-04 2.661801e-04 
##         5695         5697         5699         5700         5701         5702 
## 2.516612e-04 2.008450e-03 8.227386e-05 2.952180e-04 4.839639e-06 7.375609e-03 
##         5703         5708         5710         5712         5713         5715 
## 2.613405e-04 1.161513e-04 2.177837e-04 2.226234e-04 4.549260e-04 6.039869e-03 
##         5717         5718         5719         5720         5721         5723 
## 5.613981e-04 4.452467e-04 7.094910e-03 2.981217e-03 5.923718e-03 4.626694e-03 
##         5725         5726         5727         5728         5729         5730 
## 3.968504e-04 6.727098e-03 1.572883e-03 5.178413e-03 6.620626e-03 1.563203e-03 
##         5731         5732         5734         5742         5745         5746 
## 4.955790e-03 4.970309e-03 5.434914e-03 4.128212e-03 4.142731e-03 4.583138e-03 
##         5749         5750         5751         5752         5754         5755 
## 5.943076e-03 1.790666e-04 2.419819e-04 4.675091e-03 1.984252e-04 4.713808e-03 
##         5757         5758         5760         5761         5762         5763 
## 1.451892e-05 4.113693e-04 7.041674e-03 6.083426e-03 3.736201e-03 3.261916e-03 
##         5764         5765         5766         5767         5772         5774 
## 1.722911e-03 4.210486e-03 2.840868e-03 1.355099e-04 3.048972e-04 2.758594e-04 
##         5775         5776         5778         5779         5780         5781 
## 6.746456e-03 2.661801e-04 3.339351e-04 3.557134e-03 3.387747e-03 6.499635e-03 
##         5783         5785         5787         5788         5789         5795 
## 2.623084e-03 6.049548e-04 2.371423e-04 9.679277e-06 4.839639e-06 1.800346e-03 
##         5798         5799         5803         5804         5805         5806 
## 1.451892e-05 1.839063e-04 4.258882e-04 6.339927e-04 1.306702e-04 1.935855e-05 
##         5807         5808         5810         5811         5812         5813 
## 5.226810e-04 1.935855e-05 1.403495e-04 5.613981e-04 4.791242e-04 6.533512e-04 
##         5814         5816         5817         5819         5820         5821 
## 2.903783e-05 2.260111e-03 2.032648e-04 5.081621e-04 9.185634e-03 6.698060e-03 
##         5822         5823         5881         5882         5883         5885 
## 1.858421e-03 1.529326e-03 3.997541e-03 6.456078e-03 6.156020e-03 5.730132e-03 
##         5886         5887         5888         5889         5890         5894 
## 4.554100e-03 6.218936e-03 2.860226e-03 3.150605e-03 6.291530e-05 3.058652e-03 
##         5896         5917         5921         5923         5924         5930 
## 4.839639e-06 5.323602e-04 4.839639e-06 4.549260e-04 3.097369e-04 1.451892e-05 
##         5934         5936         5940         5948         5949         5962 
## 2.274630e-04 2.323027e-04 1.451892e-05 3.387747e-05 1.451892e-05 9.679277e-06 
##         5963         5964         5965         5966         5968         5969 
## 2.565008e-04 1.016324e-03 1.306702e-04 1.340580e-03 1.195391e-03 1.805185e-03 
##         5975         5982         5995         5996         5997         5998 
## 4.839639e-06 3.194161e-04 4.839639e-06 3.310313e-03 3.508738e-03 8.716189e-03 
##         6000         6001         6002         6003         6004         6005 
## 6.669022e-03 4.045938e-03 7.646629e-03 2.124601e-03 6.330247e-03 6.160860e-03 
##         6008         6030         6032         6033         6035         6037 
## 4.839639e-06 1.790666e-04 7.259458e-04 1.935855e-05 6.678701e-04 2.995736e-03 
##         6038         6039         6041         6042         6043         6044 
## 2.652122e-03 3.155444e-03 1.747110e-03 3.479700e-03 4.839639e-06 1.935855e-05 
##         6045         6046         6047         6048         6049         6051 
## 1.742270e-04 1.839063e-04 1.839063e-04 1.064720e-04 1.108277e-03 2.371423e-04 
##         6052         6053         6055         6056         6064         6065 
## 2.565008e-04 2.032648e-04 1.984252e-04 2.419819e-05 1.558364e-03 3.179643e-03 
##         6067         6080         6092 
## 2.918302e-03 1.161513e-04 1.451892e-05
table(data.quali$emergency.vehicle.type)
## 
##         AR  BEAA BSPP         CA   CCR BSPP    CD BSPP        CFS       CRAC 
##        766        307          6       1285         21          8         21 
##        CRF        DAP        DEP   EPA BSPP       EPAN       EPSA       ESAV 
##       1417          1          4       1322       1247        339          1 
##         FA       FFSS FMOGP BSPP       FNPC   FPT BSPP  FPT SSLIA  FPTL BSPP 
##       2228        126          1        856       3607          1        288 
##      OHFOM        PEV        PSE        PST       SFCB         SP       SPTT 
##         83          7      30149         17         50          3          1 
##       UMPS        VID       VLHP   VLR BSPP        VPS        VRA        VRM 
##         29        666         37       4154        151          3          6 
##  VSAV BALA  VSAV BSPP  VSAV SDIS VSAV SSLIA       VSIS       VSTI 
##          1     157393          2         20          2          1
prop.table(table(data.quali$emergency.vehicle.type))
## 
##           AR    BEAA BSPP           CA     CCR BSPP      CD BSPP          CFS 
## 3.707163e-03 1.485769e-03 2.903783e-05 6.218936e-03 1.016324e-04 3.871711e-05 
##         CRAC          CRF          DAP          DEP     EPA BSPP         EPAN 
## 1.016324e-04 6.857768e-03 4.839639e-06 1.935855e-05 6.398002e-03 6.035029e-03 
##         EPSA         ESAV           FA         FFSS   FMOGP BSPP         FNPC 
## 1.640637e-03 4.839639e-06 1.078271e-02 6.097945e-04 4.839639e-06 4.142731e-03 
##     FPT BSPP    FPT SSLIA    FPTL BSPP        OHFOM          PEV          PSE 
## 1.745658e-02 4.839639e-06 1.393816e-03 4.016900e-04 3.387747e-05 1.459103e-01 
##          PST         SFCB           SP         SPTT         UMPS          VID 
## 8.227386e-05 2.419819e-04 1.451892e-05 4.839639e-06 1.403495e-04 3.223199e-03 
##         VLHP     VLR BSPP          VPS          VRA          VRM    VSAV BALA 
## 1.790666e-04 2.010386e-02 7.307854e-04 1.451892e-05 2.903783e-05 4.839639e-06 
##    VSAV BSPP    VSAV SDIS   VSAV SSLIA         VSIS         VSTI 
## 7.617252e-01 9.679277e-06 9.679277e-05 9.679277e-06 4.839639e-06

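According to these results, most levels of the emergency.vehicle.type variable are very rare (VSAV BSPP alone accounts for about 76% of the observations), so several categories should be combined. However, we do not have enough information to group the levels of that variable in a meaningful way.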
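When no domain knowledge is available, one possible fallback is to merge levels purely by frequency. The sketch below is only an illustration on made-up counts: `lump_rare()` and its `min.prop` threshold are hypothetical names, not part of the original analysis.

```r
# Toy factor standing in for data.quali$emergency.vehicle.type (counts are made up).
vehicle.type <- factor(c(rep("VSAV BSPP", 80), rep("PSE", 15),
                         rep("VSTI", 1), rep("DAP", 1), rep("PEV", 3)))

# lump_rare() is a hypothetical helper: levels whose share of the
# observations is below min.prop are merged into one level named `other`.
lump_rare <- function(x, min.prop = 0.01, other = "Other") {
  x <- droplevels(as.factor(x))
  props <- prop.table(table(x))
  rare <- names(props)[props < min.prop]
  # Assigning the same name to several levels merges them.
  levels(x)[levels(x) %in% rare] <- other
  x
}

vehicle.type.lumped <- lump_rare(vehicle.type, min.prop = 0.05)
table(vehicle.type.lumped)
```

A frequency threshold ignores what the vehicles actually are, so it should be read as a baseline rather than a substitute for a grouping based on vehicle function.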

table(data.quali$rescue.center)
## 
##   2418   2434   2435   2436   2437   2438   2439   2440   2441   2442   2443 
##     15   2325   1906   3481   3088   3079   3793   1321   4143   4219   3556 
##   2444   2445   2446   2447   2448   2449   2450   2451   2452   2454   2455 
##   3270   2851   4973   4825   3029   3713   3276   2512   2053   3123   2614 
##   2456   2457   2458   2459   2460   2462   2463   2464   2465   2467   2469 
##   2739   2489   2586   2030   4061   2630   4591   1848   2399   2756   4511 
##   2470   2471   2472   2473   2474   2475   2476   2477   2478   2479   2480 
##   2968   1076   2930   2257   4127   5607   1383   5637   3911   3201   3123 
##   2481   2482   2483   2484   2485   2486   2487   2488   2489   2490   2491 
##   2856   1416   2182   2799   2825   4472   1563   5122   2506   2186   3932 
##   2492   2493   2494   2495   2496   2497   2498   2499   2500   2501   2503 
##   2242   3136   1759   1235   1588   4443   1265   2588   2185   3037   1088 
##   2505   2506   2507   2508   2509   2510 266281 266290 266294 266296 266321 
##    622   2175   2874   1768   2620   2952      2   1150      3      1      3 
## 266323 266324 
##      6      1
prop.table(table(data.quali$rescue.center))
## 
##         2418         2434         2435         2436         2437         2438 
## 7.259458e-05 1.125216e-02 9.224351e-03 1.684678e-02 1.494480e-02 1.490125e-02 
##         2439         2440         2441         2442         2443         2444 
## 1.835675e-02 6.393163e-03 2.005062e-02 2.041844e-02 1.720975e-02 1.582562e-02 
##         2445         2446         2447         2448         2449         2450 
## 1.379781e-02 2.406752e-02 2.335126e-02 1.465927e-02 1.796958e-02 1.585466e-02 
##         2451         2452         2454         2455         2456         2457 
## 1.215717e-02 9.935778e-03 1.511419e-02 1.265082e-02 1.325577e-02 1.204586e-02 
##         2458         2459         2460         2462         2463         2464 
## 1.251531e-02 9.824466e-03 1.965377e-02 1.272825e-02 2.221878e-02 8.943652e-03 
##         2465         2467         2469         2470         2471         2472 
## 1.161029e-02 1.333804e-02 2.183161e-02 1.436405e-02 5.207451e-03 1.418014e-02 
##         2473         2474         2475         2476         2477         2478 
## 1.092306e-02 1.997319e-02 2.713585e-02 6.693220e-03 2.728104e-02 1.892783e-02 
##         2479         2480         2481         2482         2483         2484 
## 1.549168e-02 1.511419e-02 1.382201e-02 6.852928e-03 1.056009e-02 1.354615e-02 
##         2485         2486         2487         2488         2489         2490 
## 1.367198e-02 2.164286e-02 7.564355e-03 2.478863e-02 1.212813e-02 1.057945e-02 
##         2491         2492         2493         2494         2495         2496 
## 1.902946e-02 1.085047e-02 1.517711e-02 8.512924e-03 5.976954e-03 7.685346e-03 
##         2497         2498         2499         2500         2501         2503 
## 2.150251e-02 6.122143e-03 1.252498e-02 1.057461e-02 1.469798e-02 5.265527e-03 
##         2505         2506         2507         2508         2509         2510 
## 3.010255e-03 1.052621e-02 1.390912e-02 8.556481e-03 1.267985e-02 1.428661e-02 
##       266281       266290       266294       266296       266321       266323 
## 9.679277e-06 5.565584e-03 1.451892e-05 4.839639e-06 1.451892e-05 2.903783e-05 
##       266324 
## 4.839639e-06

Several levels have very few observations and can be considered outliers; we need to remove them.

table(data.quali$status.preceding.selection)
## 
## Disponible    Rentré 
##       4553     202074
prop.table(table(data.quali$status.preceding.selection))
## 
## Disponible    Rentré 
## 0.02203487 0.97796513

According to the results, we would need to combine several categories of the status.preceding.selection variable. However, we do not have enough information to group the levels of that variable.

table(data.quali$departed.from.its.rescue.center)
## 
##      0      1 
##   4553 202074
prop.table(table(data.quali$departed.from.its.rescue.center))
## 
##          0          1 
## 0.02203487 0.97796513
3.1.2.2 Barplots
barplot(table(data.quali$alert.reason.category),horiz = F,cex.names=0.8,col="blue",main="Alert Reason Category",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$alert.reason),horiz = F,cex.names=0.8,col="blue",main="Alert Reason",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$intervention.on.public.roads),horiz = F,cex.names=0.8,col="blue",main="Intervention on public road",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$floor),horiz = F,cex.names=0.8,col="blue",main="Floors",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$location.of.the.event),horiz = F,cex.names=0.8,col="blue",main="Location of the event",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$emergency.vehicle),horiz = F,cex.names=0.8,col="blue",main="Emergency Vehicle",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$emergency.vehicle.type),horiz = F,cex.names=0.8,col="blue",main="Emergency Vehicle Type",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$rescue.center),horiz = F,cex.names=0.8,col="blue",main="Rescue center variable",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$status.preceding.selection),horiz = F,cex.names=0.8,col="blue",main="Status preceding selection",ylab="Frequency", plot=TRUE)

barplot(table(data.quali$departed.from.its.rescue.center),horiz = F,cex.names=0.8,col="blue",main="Departed from its rescue center",ylab="Frequency", plot=TRUE)
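Since every call above has the same shape, the plots could also be produced in a loop that takes each title from the column name. This is a minimal sketch on a toy data frame; `plot_frequencies()` and `toy.quali` are hypothetical names introduced for illustration.

```r
# Toy stand-in for data.quali (made-up values); only the column names
# are taken from the variables plotted above.
toy.quali <- data.frame(
  floor = factor(c(0, 0, 1, 2, 0)),
  intervention.on.public.roads = factor(c(0, 1, 0, 0, 1))
)

# plot_frequencies() is a hypothetical helper: it draws one barplot per
# column, titled with the column name, and returns the frequency tables
# invisibly so they can be inspected afterwards.
plot_frequencies <- function(df, col = "blue") {
  tabs <- lapply(df, table)
  for (name in names(tabs)) {
    barplot(tabs[[name]], horiz = FALSE, cex.names = 0.8, col = col,
            main = name, ylab = "Frequency")
  }
  invisible(tabs)
}

freqs <- plot_frequencies(toy.quali)
```

Deriving the title from the column name keeps each plot labelled with the variable it actually shows.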

3.1.2.3 Group Together Categories With Low Frequency & Manage Outliers

For the floor variable, we group every floor at or below -2 into level -2 and every floor at or above 17 into level 17.

table(data.clean$floor)
## 
##    -10     -6     -5     -4     -3     -2     -1      0      1      2      3 
##      1      5      5    112    140    605   3672 126288  17577  14629  12193 
##      4      5      6      7      8      9     10     11     12     13     14 
##   9705   6846   4667   2728   1673   1117    777    561    415    314    247 
##     15     16     17     18     19     20     21     22     23     24     25 
##    179    129    133     73     24     14     23     23     11     12      8 
##     26     27     28     29     30     31     32     33     37     52     79 
##     15     16      3     12     10      1      7      2      4      1      1 
##    100 
##      9
levels(data.clean$floor)<-list("-2"=c("-10","-9","-6","-5","-4","-3", "-2"),"-1"=c("-1"),"0"=c("0"),"1"=c("1"),"2"=c("2"),"3"=c("3"),"4"=c("4"),"5"=c("5"),"6"=c("6"),"7"=c("7"),"8"=c("8"),"9"=c("9"),"10"=c("10"),"11"=c("11"),"12"=c("12"),"13"=c("13"),"14"=c("14"),"15"=c("15"),"16"=c("16"),"17"=c("17","18","19","20","21","22","23","24","25","26","27","28","29", "30", "31", "32", "33", "37","52","79","100"))


table(data.clean$floor)
## 
##     -2     -1      0      1      2      3      4      5      6      7      8 
##    868   3672 126288  17577  14629  12193   9705   6846   4667   2728   1673 
##      9     10     11     12     13     14     15     16     17 
##   1117    777    561    415    314    247    179    129    402
barplot(table(data.clean$floor),horiz = F,cex.names=0.8,col="blue",main="Floors",ylab="Frequency", plot=TRUE)

data.quali<-data.clean[,c(3,4,5,6,7,10,11,12,15,17)]

Apply the same grouping to x_test.

levels(x_test$floor)<-list("-2"=c("-10","-9","-6","-5","-4","-3", "-2"),"-1"=c("-1"),"0"=c("0"),"1"=c("1"),"2"=c("2"),"3"=c("3"),"4"=c("4"),"5"=c("5"),"6"=c("6"),"7"=c("7"),"8"=c("8"),"9"=c("9"),"10"=c("10"),"11"=c("11"),"12"=c("12"),"13"=c("13"),"14"=c("14"),"15"=c("15"),"16"=c("16"),"17"=c("17","18","19","20","21","22","23","24","25","26","27","28","29", "30", "31", "32", "33", "37","52","79","100"))
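The explicit level list only covers floor values that were enumerated by hand. As far as we understand the list form of `levels<-`, a level that is not mentioned in the list is turned into NA, which could matter if x_test contains a floor that was never seen in the training data. A possible alternative, sketched here on made-up values, is to clamp the numeric floor; `group_floor()` is a hypothetical helper, not part of the original code.

```r
# group_floor() is a hypothetical helper: it clamps floor values to
# [lower, upper] and returns a factor with one level per remaining floor.
group_floor <- function(floor, lower = -2, upper = 17) {
  # Go through character first so that factor inputs yield their labels,
  # not their internal integer codes.
  values <- as.integer(as.character(floor))
  clamped <- pmin(pmax(values, lower), upper)
  factor(clamped, levels = lower:upper)
}

# Toy inputs (made up); floor 45 does not appear in the explicit list above.
train.floor <- factor(c(-10, -3, 0, 5, 17, 100))
test.floor  <- factor(c(-1, 18, 45))

table(group_floor(train.floor))
group_floor(test.floor)
```

Because the same function is applied to both data sets, the training and test factors end up with identical levels.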

3.1.2.4 Clean outliers for rescue.center

table(data.quali$rescue.center)
## 
##   2418   2434   2435   2436   2437   2438   2439   2440   2441   2442   2443 
##     15   2314   1900   3466   3051   3060   3783   1306   4112   4169   3503 
##   2444   2445   2446   2447   2448   2449   2450   2451   2452   2454   2455 
##   3233   2840   4958   4780   2988   3696   3265   2487   2038   3085   2590 
##   2456   2457   2458   2459   2460   2462   2463   2464   2465   2467   2469 
##   2711   2457   2542   2010   4025   2586   4572   1819   2381   2740   4491 
##   2470   2471   2472   2473   2474   2475   2476   2477   2478   2479   2480 
##   2933   1058   2915   2207   4105   5575   1372   5607   3872   3184   3078 
##   2481   2482   2483   2484   2485   2486   2487   2488   2489   2490   2491 
##   2848   1403   2171   2787   2819   4456   1547   5113   2457   2178   3911 
##   2492   2493   2494   2495   2496   2497   2498   2499   2500   2501   2503 
##   2221   3110   1748   1213   1569   4401   1252   2577   2159   3013   1083 
##   2505   2506   2507   2508   2509   2510 266281 266290 266294 266296 266321 
##    615   2167   2845   1742   2608   2938      2   1146      2      1      2 
## 266323 266324 
##      3      1

We remove some levels of data.clean$rescue.center that have too few samples to be useful for modelling.

table(data.clean$rescue.center)

levels(data.clean$rescue.center)

data.clean$rescue.center <- subset(data.clean$rescue.center, data.clean$rescue.center != "266281")

data.clean <- data.clean[data.clean$rescue.center != 266281,]

data.clean <- data.clean[data.clean$rescue.center != 266294,]

data.clean <- data.clean[data.clean$rescue.center != 266296,]

data.clean <- data.clean[data.clean$rescue.center != 266321,]

data.clean <- data.clean[data.clean$rescue.center != 266323,]

data.clean <- data.clean[data.clean$rescue.center != 266324,]

table(droplevels(data.clean$rescue.center))

table(data.clean$rescue.center)
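
Removing rare centers one ID at a time can also be written as a single count-based filter. The sketch below uses toy data and an assumed threshold (min_count = 5 is illustrative, not a value taken from this analysis).

```r
# Toy data: three frequent centers and two rare ones
toy <- data.frame(
  rescue.center = factor(c(rep("2434", 50), rep("2435", 40),
                           rep("2436", 30), "266281", "266294"))
)

min_count <- 5                                   # assumed threshold
counts <- table(toy$rescue.center)
keep <- names(counts)[counts >= min_count]       # levels with enough samples

toy <- toy[toy$rescue.center %in% keep, , drop = FALSE]
toy$rescue.center <- droplevels(toy$rescue.center)  # assign back to drop unused levels
table(toy$rescue.center)
```

Note that droplevels() returns a new factor: the unused levels only disappear from the data if the result is assigned back to the column.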

3.2. What is time difference between departure.presentation and OSRM estimated duration ?

Let’s compute the time difference between delta.departure.presentation (actual) and the OSRM estimated duration (predicted). This difference cannot be used as a feature in our future model, because it is computed from the target variable itself.

time.diff<-data.clean$delta.departure.presentation-data.clean$OSRM.estimated.duration
time.diff %>% head
## [1] 218.2  53.8  69.6  -6.6 260.4 231.3
hist(time.diff, col="blue",main="Time diff")

summary(time.diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -640.6    27.1    89.0   110.8   169.2  1008.9

Here we can see that the median difference between the actual travel time and the OSRM estimated duration is about 90 s (1 min 30 s), with a maximum of about 1,000 s. The minimum is negative (about -640 s), which means that the brigade can also drive faster than OSRM predicted.

4. Feature engineering

In this section we build new features from the existing ones, trying to find better predictors for our target variable. We prefer to define all these new features in a single code block below and then study them in the following subsections.

4.1: Speed [km/h]

Speed estimated by OSRM: distance / time, with the distance converted from metres to kilometres and the duration from seconds to hours.

data.fe<-data.clean
data.fe$OSRM.estimated.speed<-(data.clean$OSRM.estimated.distance/1000)/(data.clean$OSRM.estimated.duration/60^2)
data.fe$OSRM.estimated.speed %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.661  26.154  30.780  31.369  35.601  69.382
hist(data.fe$OSRM.estimated.speed , col="blue",main="Estimated Speed km/h")

Same for x_test

x_test$OSRM.estimated.speed<-(x_test$OSRM.estimated.distance/1000)/(x_test$OSRM.estimated.duration/60^2)
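
As a quick check of the unit conversion, here is the formula applied by hand to the first row that str(data.fe) prints later in this report (values rounded as displayed: 1283 m and 214 s). Dividing metres by seconds and multiplying by 3.6 is an equivalent shortcut.

```r
distance_m <- 1283   # OSRM.estimated.distance, in metres (rounded display value)
duration_s <- 214    # OSRM.estimated.duration, in seconds (rounded display value)

# Same formula as above: km divided by hours
speed_kmh <- (distance_m / 1000) / (duration_s / 60^2)

# Equivalent shortcut: m/s times 3.6
speed_kmh_alt <- distance_m / duration_s * 3.6

round(speed_kmh, 1)
```

The result agrees with the value of about 21.6 km/h shown for that row in the OSRM.estimated.speed column.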

4.1.2 How do the top predictors and estimated speed correlate?

data.fe.sample <- sample_n(data.fe, 100)
data.fe.sample %>%
  ggplot(aes(x=alert.reason, y=OSRM.estimated.speed,group=alert.reason)) +
  geom_boxplot()

data.fe.sample %>%
  ggplot(aes(x=rescue.center, y=OSRM.estimated.speed)) +
  geom_boxplot()

data.fe.sample %>%
  ggplot(aes(x=alert.reason.category, y=OSRM.estimated.speed)) +
  geom_boxplot()

4.2 Add month, day of week

data.fe$month <-month(as.Date(as.character(data.fe$date.key.selection), "%Y%m%d"))

data.fe$weekdays <-weekdays(as.Date(as.character(data.fe$date.key.selection), "%Y%m%d"))

Same for x_test

x_test$month <-month(as.Date(as.character(x_test$date.key.selection), "%Y%m%d"))

x_test$weekdays <-weekdays(as.Date(as.character(x_test$date.key.selection), "%Y%m%d"))
data.fe$hours <-as.hms(formatC(as.integer(data.fe$time.key.selection), big.mark = ":", big.interval = 2L))
## Warning: as.hms() is deprecated, please use as_hms().
## This warning is displayed once per session.
## Warning: Lossy cast from <character> to <hms> at position(s) 4, 5, 10, 84,
## 103, ... (and 7415 more)
data.fe$hours <- replace(data.fe$hours, is.na(data.fe$hours), "00:00:00")

Same for x_test

x_test$hours <-as.hms(formatC(as.integer(x_test$time.key.selection), big.mark = ":", big.interval = 2L))
## Warning: Lossy cast from <character> to <hms> at position(s) 36, 40, 54, 130,
## 138, ... (and 3880 more)
x_test$hours <- replace(x_test$hours, is.na(x_test$hours), "00:00:00")
data.fe$hours <- substr(data.fe$hours, 1, 2)
x_test$hours <- substr(x_test$hours, 1, 2)
table(data.fe$month)
## 
##     1     2     3     4     5     6     7     8    10    11    12 
## 19483 17123 18860 17661 18879 19098 19792 16327 19134 19065 19565
prop.table(table(data.fe$month))
## 
##          1          2          3          4          5          6          7 
## 0.09504505 0.08353213 0.09200583 0.08615668 0.09209852 0.09316688 0.09655246 
##          8         10         11         12 
## 0.07964895 0.09334250 0.09300590 0.09544508
barplot(table(data.fe$month),horiz = F,cex.names=0.8,col="blue",main="Month",ylab="Frequency", plot=TRUE)

table(data.fe$weekdays)
## 
## dimanche    jeudi    lundi    mardi mercredi   samedi vendredi 
##    28066    29257    31000    29163    29178    28093    30230
prop.table(table(data.fe$weekdays))
## 
##  dimanche     jeudi     lundi     mardi  mercredi    samedi  vendredi 
## 0.1369160 0.1427261 0.1512291 0.1422676 0.1423407 0.1370477 0.1474728
barplot(table(data.fe$weekdays),horiz = F,cex.names=0.8,col="blue",main="Day",ylab="Frequency", plot=TRUE)
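
The weekday labels above are in French because weekdays() follows the locale of the R session, and the factor levels are sorted alphabetically rather than chronologically. A minimal sketch of a locale-independent alternative, using illustrative dates (not taken from the data set) and the same integer date encoding as date.key.selection:

```r
# Illustrative dates encoded as YYYYMMDD integers
date_key <- c(20180101L, 20180106L, 20180107L)
d <- as.Date(as.character(date_key), "%Y%m%d")

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday, whatever the locale
wday_num <- as.integer(format(d, "%u"))

# Fixed English labels in chronological order
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekday_f <- factor(day_names[wday_num], levels = day_names)

# Month number without an extra package
month_num <- as.integer(format(d, "%m"))
```

With explicit levels, the bars of a frequency plot would appear in calendar order, and the result would not change if the report is knitted on a machine with a different locale.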

table(data.fe$hours)
## 
##    00    01    02    03    04    05    06    07    08    09    10    11    12 
##  7420  6346  5153  4706  4142  3953  4376  5562  7596  9230  9871 10546 11326 
##    13    14    15    16    17    18    19    20    21    22    23 
## 11159 10737 10307 10159 10282 10684 11349 11122 10462  9767  8732
prop.table(table(data.fe$hours))
## 
##         00         01         02         03         04         05         06 
## 0.03619742 0.03095806 0.02513818 0.02295755 0.02020616 0.01928415 0.02134770 
##         07         08         09         10         11         12         13 
## 0.02713343 0.03705601 0.04502725 0.04815427 0.05144716 0.05525228 0.05443760 
##         14         15         16         17         18         19         20 
## 0.05237893 0.05028124 0.04955924 0.05015928 0.05212038 0.05536449 0.05425710 
##         21         22         23 
## 0.05103738 0.04764692 0.04259782
barplot(table(data.fe$hours),horiz = F,cex.names=0.8,col="blue",main="Hours",ylab="Frequency", plot=TRUE)
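
The as.hms() route used above triggers lossy-cast warnings whenever formatC() cannot produce a full HH:MM:SS string, and the resulting NA values are then replaced by "00:00:00". Assuming time.key.selection is an integer encoded as HHMMSS (an assumption inferred from the formatting code, not from the data description), the hour can also be obtained directly by integer division. The values below are made up for illustration.

```r
# Made-up HHMMSS integers, including times shortly after midnight
time_key <- c(235959L, 123456L, 515L, 7L, 10000L)

# Integer division by 10000 keeps only the HH part; pad to two digits
hours_chr <- sprintf("%02d", time_key %/% 10000L)
hours_chr
```

This sketch avoids the intermediate NA values, and times before 01:00:00 fall into hour "00" by construction rather than through the replacement step.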

We can now remove date.key.selection and time.key.selection

data.fe<-data.fe[,-c(13,14)]

Same for x_test

x_test<-x_test[,-c(13,14)]

Convert the derived date and time variables into factors

data.fe$hours <- as.factor(data.fe$hours)
data.fe$weekdays <- as.factor(data.fe$weekdays)
data.fe$month <- as.factor(data.fe$month)
x_test$hours <- as.factor(x_test$hours)
x_test$weekdays <- as.factor(x_test$weekdays)
x_test$month <- as.factor(x_test$month)

4.3 Order of the brigade

For the same event, several brigades (vehicles) can be dispatched. It is worth checking whether the order in which they leave is related to the preparation time. As a proxy, we count the number of vehicles dispatched per intervention; once this count is attached to each row, we can remove the intervention ID.

data.fe$intervention %>% as.factor() %>% str
##  Factor w/ 196480 levels "12649492","12650159",..: 111710 2131 168556 8491 173225 5521 137030 17551 143501 143369 ...
data.fe$emergency.vehicle.selection %>% str
##  int [1:204987] 5105452 4720915 5365374 4741586 5381209 4731603 5196431 4774057 5277444 5277017 ...
204987-196480
## [1] 8507
x_test$intervention %>% as.factor() %>% str
##  Factor w/ 102466 levels "12649492","12650159",..: 73849 56296 64950 86532 68610 73049 784 52060 18501 92121 ...

Count the number of rows (vehicles) per intervention ID.

a<-data.fe %>%
        group_by(intervention)%>%
         tally()
b<-x_test %>%
        group_by(intervention)%>%
         tally()
a %>% setDT
a %>% str
## Classes 'data.table' and 'data.frame':   196480 obs. of  2 variables:
##  $ intervention: int  12649492 12650159 12651734 12651765 12651790 12651791 12651829 12651860 12651907 12651960 ...
##  $ n           : int  1 1 1 1 1 1 1 1 1 1 ...
b %>% setDT
data.fe<-merge(a,data.fe, by.x = 'intervention', by.y = 'intervention', all=FALSE)
x_test<-merge(b,x_test, by.x = 'intervention', by.y = 'intervention', all=FALSE)
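
The tally-and-merge step above attaches the number of vehicles per intervention to every row. The sketch below reproduces that count on toy data with base R, and additionally derives a within-intervention rank. The column name dispatch_rank is hypothetical, and ranking by emergency.vehicle.selection assumes that these IDs increase with selection time, which has not been verified on the real data.

```r
# Toy data: intervention 101 has three vehicles, 102 has one, 103 has two
toy <- data.frame(
  intervention = c(101L, 101L, 101L, 102L, 103L, 103L),
  emergency.vehicle.selection = c(11L, 12L, 15L, 13L, 14L, 16L)
)

# Number of vehicles per intervention (same information as tally() + merge())
toy$n <- ave(toy$intervention, toy$intervention, FUN = length)

# Hypothetical feature: position of each vehicle within its intervention,
# ASSUMING selection IDs are assigned in chronological order
toy$dispatch_rank <- ave(toy$emergency.vehicle.selection, toy$intervention,
                         FUN = function(v) rank(v))
toy
```

The count n describes the size of the response, while a rank of this kind would be closer to the "order of the brigade" named in the section title.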

Convert the count n to a factor

data.fe$n<-data.fe$n %>% as.factor
x_test$n<-x_test$n %>% as.factor

Now that we have the intervention frequency as a factor, we can remove the intervention ID

data.fe<-data.fe[,-c(1)]
x_test<-x_test[,-c(1)]

4.4 GPS

We decide to remove the GPS columns, since the information they carry is already available in other features.

data.fe<-data.fe[,-c(18,19)]
x_test<-x_test[,-c(18,19)]

5. Correlation analysis

Update data.quanti and data.quali

colnames(data.fe)
##  [1] "n"                                         
##  [2] "emergency.vehicle.selection"               
##  [3] "alert.reason.category"                     
##  [4] "alert.reason"                              
##  [5] "intervention.on.public.roads"              
##  [6] "floor"                                     
##  [7] "location.of.the.event"                     
##  [8] "longitude.intervention"                    
##  [9] "latitude.intervention"                     
## [10] "emergency.vehicle"                         
## [11] "emergency.vehicle.type"                    
## [12] "rescue.center"                             
## [13] "status.preceding.selection"                
## [14] "delta.status.preceding.selection.selection"
## [15] "departed.from.its.rescue.center"           
## [16] "longitude.before.departure"                
## [17] "latitude.before.departure"                 
## [18] "OSRM.estimated.distance"                   
## [19] "OSRM.estimated.duration"                   
## [20] "delta.selection.departure"                 
## [21] "delta.departure.presentation"              
## [22] "delta.selection.presentation"              
## [23] "OSRM.estimated.speed"                      
## [24] "month"                                     
## [25] "weekdays"                                  
## [26] "hours"
str(data.fe)
## Classes 'data.table' and 'data.frame':   204987 obs. of  26 variables:
##  $ n                                         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ emergency.vehicle.selection               : int  4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ alert.reason                              : Factor w/ 122 levels "1911","1912",..: 6 7 60 32 60 3 4 60 3 31 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event                     : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type                    : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                             : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ status.preceding.selection                : Factor w/ 2 levels "Disponible","Rentré": 2 2 2 2 2 2 1 2 2 2 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ longitude.before.departure                : num  2.33 2.28 2.34 2.29 2.18 ...
##  $ latitude.before.departure                 : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ delta.departure.presentation              : int  174 376 214 268 409 678 98 187 623 181 ...
##  $ delta.selection.presentation              : int  413 423 332 417 506 791 162 307 757 275 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
data.quanti<-data.fe[,c(23,22,19,18,17,16,14,9,8,2)]

data.quanti.y<-data.fe[,c(23,22,21,20,19,18,17,16,14,9,8,2)]

data.quali<-data.fe[,-c(23,22,21,20,19,18,17,16,14,9,8,2)]

data.quali.y<-data.fe[,-c(23,22,19,18,17,16,14,9,8,2)]

str(data.quanti)
## Classes 'data.table' and 'data.frame':   204987 obs. of  10 variables:
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ delta.selection.presentation              : int  413 423 332 417 506 791 162 307 757 275 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ latitude.before.departure                 : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ longitude.before.departure                : num  2.33 2.28 2.34 2.29 2.18 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ emergency.vehicle.selection               : int  4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(data.quanti.y)
## Classes 'data.table' and 'data.frame':   204987 obs. of  12 variables:
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ delta.selection.presentation              : int  413 423 332 417 506 791 162 307 757 275 ...
##  $ delta.departure.presentation              : int  174 376 214 268 409 678 98 187 623 181 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ latitude.before.departure                 : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ longitude.before.departure                : num  2.33 2.28 2.34 2.29 2.18 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ emergency.vehicle.selection               : int  4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(data.quali)
## Classes 'data.table' and 'data.frame':   204987 obs. of  14 variables:
##  $ n                              : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alert.reason.category          : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ alert.reason                   : Factor w/ 122 levels "1911","1912",..: 6 7 60 32 60 3 4 60 3 31 ...
##  $ intervention.on.public.roads   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                          : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event          : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ emergency.vehicle              : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type         : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                  : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ status.preceding.selection     : Factor w/ 2 levels "Disponible","Rentré": 2 2 2 2 2 2 1 2 2 2 ...
##  $ departed.from.its.rescue.center: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ month                          : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                       : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                          : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(data.quali.y)
## Classes 'data.table' and 'data.frame':   204987 obs. of  16 variables:
##  $ n                              : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alert.reason.category          : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ alert.reason                   : Factor w/ 122 levels "1911","1912",..: 6 7 60 32 60 3 4 60 3 31 ...
##  $ intervention.on.public.roads   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                          : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event          : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ emergency.vehicle              : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type         : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                  : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ status.preceding.selection     : Factor w/ 2 levels "Disponible","Rentré": 2 2 2 2 2 2 1 2 2 2 ...
##  $ departed.from.its.rescue.center: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ delta.selection.departure      : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ delta.departure.presentation   : int  174 376 214 268 409 678 98 187 623 181 ...
##  $ month                          : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                       : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                          : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

After engineering new features and before starting the modeling, we have to visualize the relations between our parameters using a correlation matrix. For this, we need to change all the factor features into a numerical format. The visualization uses the corrplot function from the eponymous package. Corrplot gives us great flexibility in manipulating the style of our plot.

What we see below, are the color-coded correlation coefficients for each combination of two features. In simplest terms: this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation (dark blue and dark red in the plots below).

Here, we are of course interested in whether and how strongly our features correlate with the ys. But we also want to know whether our potential predictors are correlated with each other, so that we can reduce the collinearity in our data set and improve the robustness of our prediction.
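
As a toy illustration of what the Spearman coefficient measures, the sketch below compares it with the Pearson coefficient on simulated data (nothing here comes from the fire brigade data set). It also shows how as.integer() turns a factor into level codes, which is what happens to the categorical features before the correlation matrix is computed.

```r
set.seed(42)
x <- runif(50, 1, 10)
y_mono <- exp(x)                               # monotone but strongly non-linear in x

rho_spearman <- cor(x, y_mono, method = "spearman")  # rank-based: perfect monotone link
rho_pearson  <- cor(x, y_mono, method = "pearson")   # linear: below 1 here

# Factor -> integer uses the position of each level, sorted alphabetically by default
f <- factor(c("b", "a", "c", "a", "b"))
codes <- as.integer(f)
codes
```

For nominal factors such as rescue.center or emergency.vehicle, the integer codes reflect only the ordering of the labels, so the corresponding coefficients should probably be read as rough indications rather than as measures of a monotone relationship.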

data.fe %>%
  select(-emergency.vehicle.selection) %>%
  mutate(n = as.integer(n),
         alert.reason = as.integer(alert.reason),
         floor = as.integer(floor),
         emergency.vehicle = as.integer(emergency.vehicle),
         rescue.center = as.integer(rescue.center),
         delta.selection.presentation = as.integer(delta.selection.presentation),
         month = as.integer(month),
         hours = as.integer(hours),
        weekdays = as.integer(weekdays),
         alert.reason.category = as.integer(alert.reason.category),
         intervention.on.public.roads = as.integer(intervention.on.public.roads),
         location.of.the.event = as.integer(location.of.the.event),
         emergency.vehicle.type = as.integer(emergency.vehicle.type),
         status.preceding.selection = as.integer(status.preceding.selection),
          departed.from.its.rescue.center = as.integer(departed.from.its.rescue.center))%>%

  cor(use="complete.obs", method = "spearman") %>%
  corrplot(type="lower", method="pie",order="hclust", 
         diag=FALSE)

Add coefficients

data.fe %>%
  select(-emergency.vehicle.selection) %>%
  mutate(n = as.integer(n),
         alert.reason = as.integer(alert.reason),
         floor = as.integer(floor),
         emergency.vehicle = as.integer(emergency.vehicle),
         rescue.center = as.integer(rescue.center),
         delta.selection.presentation = as.integer(delta.selection.presentation),
         month = as.integer(month),
         hours = as.integer(hours),
        weekdays = as.integer(weekdays),
         alert.reason.category = as.integer(alert.reason.category),
         intervention.on.public.roads = as.integer(intervention.on.public.roads),
         location.of.the.event = as.integer(location.of.the.event),
         emergency.vehicle.type = as.integer(emergency.vehicle.type),
         status.preceding.selection = as.integer(status.preceding.selection),
          departed.from.its.rescue.center = as.integer(departed.from.its.rescue.center))%>%

  cor(use="complete.obs", method = "spearman") %>%
  corrplot(type="lower", method="square",order="hclust", 
         addCoef.col = "black", diag=FALSE)

We find :

  • alert.reason is correlated with alert.reason.category (0.65). We remove alert.reason.
  • OSRM.estimated.speed and OSRM.estimated.distance are correlated (which is expected, since the speed is derived from the distance). We decide to keep both features.
  • status.preceding.selection and departed.from.its.rescue.center are perfectly correlated. We remove status.preceding.selection.
  • longitude.before.departure and longitude.intervention are highly correlated. We can remove longitude.before.departure.
  • latitude.before.departure and latitude.intervention are highly correlated. We can remove latitude.before.departure.
  • Apart from the OSRM features, no other features are highly correlated.
data.fe<-data.fe[,-c("alert.reason","status.preceding.selection","longitude.before.departure","latitude.before.departure")]
x_test<-x_test[,-c("alert.reason","status.preceding.selection","longitude.before.departure","latitude.before.departure")]

6. Boosted Tree with XGB

Boosted trees, here with XGBoost. XGBoost is a well-known and efficient open-source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). Training proceeds iteratively, adding new trees that predict the residuals or errors of the previous trees; these are then combined with the previous trees to make the final prediction.

In other words, each new tree concentrates on the observations that the current ensemble predicts poorly and corrects part of that error.

Let us train our model with a maximum of 100 boosting iterations and use it to predict.
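
To make the residual-fitting idea concrete, here is a toy gradient boosting loop for squared error written in base R, using one-split trees (stumps) on simulated data. It is a sketch of the general principle only, not of XGBoost itself: there is no regularization, no subsampling and no second-order information. The helper names fit_stump and predict_stump are hypothetical; eta = 0.3 and 100 rounds simply echo the settings used below.

```r
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

# Hypothetical helper: best single split on x for the current residuals r
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2        # midpoints between sorted x values
  sse <- sapply(cuts, function(cut) {
    left <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

# Hypothetical helper: piecewise-constant prediction of a stump
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

eta <- 0.3                          # shrinkage (learning rate)
nrounds <- 100                      # number of boosting iterations
pred <- rep(mean(y), length(y))     # start from the mean of the target
rmse <- numeric(nrounds)

for (m in seq_len(nrounds)) {
  r <- y - pred                             # residuals of the current ensemble
  s <- fit_stump(x, r)                      # new weak learner fitted to the residuals
  pred <- pred + eta * predict_stump(s, x)  # add a shrunken correction
  rmse[m] <- sqrt(mean((y - pred)^2))       # training RMSE after round m
}

range(rmse)
```

Each round removes part of the remaining training error, which is why the training RMSE never increases in this sketch; on real data, held-out error is what decides how many rounds are useful.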

6.1 Sample

data.fe %>% str
## Classes 'data.table' and 'data.frame':   204987 obs. of  22 variables:
##  $ n                                         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ emergency.vehicle.selection               : int  4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event                     : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type                    : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                             : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ delta.departure.presentation              : int  174 376 214 268 409 678 98 187 623 181 ...
##  $ delta.selection.presentation              : int  413 423 332 417 506 791 162 307 757 275 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Store final dataset

dataset<-data.fe

Sample data

set.seed(123)
n_train <- floor(0.8 * nrow(dataset)) # integer number of training rows
train_indices <- sample(1:nrow(dataset), n_train)
trainset <- dataset[train_indices, ]
testset <- dataset[-train_indices, ]

6.2 Yo : Selection - Departure

trainset %>% head
##    n emergency.vehicle.selection alert.reason.category
## 1: 1                     5387610                     3
## 2: 1                     5410147                     3
## 3: 1                     5167483                     3
## 4: 1                     5132649                     3
## 5: 3                     5315418                     1
## 6: 1                     5058200                     3
##    intervention.on.public.roads floor location.of.the.event
## 1:                            0     0                   136
## 2:                            0     0                   107
## 3:                            1     0                   228
## 4:                            1     0                   149
## 5:                            0     2                   140
## 6:                            1     0                   148
##    longitude.intervention latitude.intervention emergency.vehicle
## 1:               2.358910              48.88494              4316
## 2:               2.328205              48.86534              4305
## 3:               2.229676              48.91456              4931
## 4:               2.436133              48.85821              6065
## 5:               2.390189              48.94845              1834
## 6:               2.374644              48.88676              5885
##    emergency.vehicle.type rescue.center
## 1:              VSAV BSPP          2469
## 2:                    PSE          2493
## 3:              VSAV BSPP          2455
## 4:              VSAV BSPP          2449
## 5:                    PSE          2467
## 6:              VSAV BSPP          2439
##    delta.status.preceding.selection.selection departed.from.its.rescue.center
## 1:                                       5681                               1
## 2:                                       1625                               1
## 3:                                       1074                               1
## 4:                                       2365                               1
## 5:                                       5046                               1
## 6:                                        748                               1
##    OSRM.estimated.distance OSRM.estimated.duration delta.selection.departure
## 1:                  1533.9                   166.9                       166
## 2:                  1800.5                   219.8                       156
## 3:                  2899.7                   293.5                        65
## 4:                  2953.5                   370.9                       143
## 5:                  2755.3                   248.9                       126
## 6:                  2086.2                   295.8                       126
##    delta.departure.presentation delta.selection.presentation
## 1:                          292                          458
## 2:                          592                          748
## 3:                          354                          419
## 4:                          467                          610
## 5:                          512                          638
## 6:                          138                          264
##    OSRM.estimated.speed month weekdays hours
## 1:             33.08592    11    lundi    23
## 2:             29.48954    12    jeudi    17
## 3:             35.56702     8    lundi    18
## 4:             28.66703     7 vendredi    17
## 5:             39.85167    10    lundi    23
## 6:             25.38986     6    mardi    02

Store only y0 target value

trainset0_y0<-dataset[train_indices,delta.selection.departure]
trainset0<-trainset[,-c("delta.selection.departure","delta.departure.presentation","delta.selection.presentation")]
testset0_y0<-dataset[-train_indices,delta.selection.departure]
testset0<-testset[,-c("delta.selection.departure","delta.departure.presentation","delta.selection.presentation")]
trainset$delta.selection.departure %>% head
## [1] 166 156  65 143 126 126
dataset[train_indices,delta.selection.departure] %>% head
## [1] 166 156  65 143 126 126

Create a one-hot encoded sparse matrix for the categorical variables

sparse_matrix <- sparse.model.matrix( ~ ., data = trainset0)[,-1]
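
The sketch below shows, on toy data, what the ~ . formula produces, using the dense base R function model.matrix() as a stand-in for sparse.model.matrix(). With the default treatment contrasts, each factor yields one indicator column per level except the first (reference) level, and [,-1] drops the intercept. The toy column names type and dist are made up for illustration.

```r
# Toy training data: one factor with three levels and one numeric column
train_toy <- data.frame(
  type = factor(c("PSE", "VSAV", "AR"), levels = c("AR", "PSE", "VSAV")),
  dist = c(1533.9, 1800.5, 2899.7)
)
X_train <- model.matrix(~ ., data = train_toy)[, -1]
colnames(X_train)

# Toy test data where only one level is observed, but the factor
# is declared with the SAME levels as in training
test_toy <- data.frame(
  type = factor(c("VSAV", "VSAV"), levels = levels(train_toy$type)),
  dist = c(2000, 2500)
)
X_test <- model.matrix(~ ., data = test_toy)[, -1]

same_columns <- identical(colnames(X_train), colnames(X_test))
same_columns
```

Keeping the factor levels aligned between the training data and x_test matters because the model expects the same columns at prediction time.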

Store in an xgb.DMatrix

dtrain1 <- xgb.DMatrix(sparse_matrix,label = trainset$delta.selection.departure)

Set parameters for our xgb tree

xgb_params <- list(colsample_bytree = 0.7, #variables per tree 
                   subsample = 0.7, #data subset per tree 
                   booster = "gbtree",
                   max_depth = 5, #tree levels
                   eta = 0.3, #shrinkage
                   eval_metric = "rmse", 
                   objective = "reg:linear"
                   )

Train our model

set.seed(4321)
gb_0_dt <- xgb.train(params = xgb_params,
                   data = dtrain1,
                   print_every_n = 100,
                   nrounds = 100)
## [01:26:25] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.

Plot the 30 most important features in our model

importance_matrix <- xgb.importance(model = gb_0_dt)
xgb.plot.importance(importance_matrix[1:30,])

We find :

  • delta.status.preceding.selection.selection
  • hours
  • latitude.intervention
  • longitude.intervention
  • emergency.vehicle.type VSAV$VD (group: is or is not)
  • OSRM data
  • rescue.center 2474, 2477, 2475, 2439 (group: is or is not)
  • alert.reason.category 3
  • departed.from.its.rescue.center = 1

Predict with testset0

sparse_matrix2 <- sparse.model.matrix( ~ ., data =testset0 )[,-1]

Store it in an xgb.DMatrix

dtest <- xgb.DMatrix(sparse_matrix2,label=testset$delta.selection.departure)

Predict

pred_xgboost_y0 <- predict(gb_0_dt,dtest)
pred_xgboost_y0 %>% head
## [1] 140.30222  96.39616 154.99208 115.06355  85.18041 126.75983
testset$delta.selection.departure %>% head
## [1]  47 149  97 113 124 145

Compute RMSE, R2 and MAE

postResample(pred = pred_xgboost_y0, obs = testset$delta.selection.departure)
##       RMSE   Rsquared        MAE 
## 46.4920917  0.2753537 34.1598139
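For reference, the three figures returned by caret's postResample can be reproduced in base R. The sketch below is an illustration, not caret's code: the helper name regression_metrics is hypothetical, and it is applied only to the six predictions and observations printed above, so its output is not comparable to the full test-set metrics.

```r
# Hypothetical helper mirroring the three metrics reported by postResample().
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared Pearson correlation
    MAE      = mean(abs(pred - obs))        # mean absolute error
  )
}

# The six test rows shown above (head of predictions and of observed Y0).
pred_head <- c(140.30222, 96.39616, 154.99208, 115.06355, 85.18041, 126.75983)
obs_head  <- c(47, 149, 97, 113, 124, 145)

regression_metrics(pred_head, obs_head)
```

Because Rsquared is a squared correlation, it ignores any systematic bias in the predictions. RMSE and MAE are the figures that reflect the error in seconds.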

6.3 Y1 : Departure - Presentation

Store only the y1 target value

trainset1_y1<-trainset$delta.departure.presentation
trainset1<-trainset[,-c("delta.selection.departure","delta.selection.presentation","delta.departure.presentation")]
testset1_y1<-testset$delta.departure.presentation
testset1<-testset[,-c("delta.selection.departure","delta.selection.presentation","delta.departure.presentation")]

Create a one-hot encoded sparse matrix for the categorical variables

sparse_matrix <- sparse.model.matrix( ~ ., data = trainset1)[,-1]

Store it in an xgb.DMatrix

dtrain2 <- xgb.DMatrix(sparse_matrix,label = trainset$delta.departure.presentation)

Set parameters for our xgb tree

xgb_params <- list(colsample_bytree = 0.5, #variables per tree 
                   subsample = 0.5, #data subset per tree 
                   booster = "gbtree",
                   max_depth = 3, #tree levels
                   eta = 0.3, #shrinkage
                   eval_metric = "rmse", 
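                   objective = "reg:linear", # deprecated alias of "reg:squarederror" (see the warning below)
                   seed = 4321 # ignored by the R package; the set.seed() call below fixes the seed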
                   )

Train our model

set.seed(4321)
gb_1_dt <- xgb.train(params = xgb_params,
                   data = dtrain2,
                   print_every_n = 100,
                   nrounds = 100)
## Warning in xgb.train(params = xgb_params, data = dtrain2, print_every_n = 100, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:26:38] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.

Plot the 30 most important features of our model

importance_matrix <- xgb.importance(model = gb_1_dt)
xgb.plot.importance(importance_matrix[1:30,])

We find:
- OSRM data
- latitude.intervention
- longitude.intervention
- delta.status.preceding.selection.selection
- emergency vehicle types VSAV BSPP and VID (to be grouped into an is / is-not indicator)

Predict

sparse_matrix2 <- sparse.model.matrix( ~ ., data =testset1 )[,-1]

Store it in an xgb.DMatrix

dtest <- xgb.DMatrix(sparse_matrix2,label=testset$delta.departure.presentation)

Predict

pred_xgboost_y1 <- predict(gb_1_dt,dtest)
pred_xgboost_y1 %>% head
## [1] 415.4552 306.6591 403.4247 377.4039 363.4391 397.6378
testset$delta.departure.presentation %>% head
## [1] 376 268 409 678 432 236
postResample(pred = pred_xgboost_y1, obs = testset$delta.departure.presentation)
##        RMSE    Rsquared         MAE 
## 125.2175578   0.4038202  89.9848882

6.4 Ys : Compute global

pred_xgboost_ys=pred_xgboost_y0+pred_xgboost_y1
pred_xgboost=data.frame(pred_xgboost_y0,pred_xgboost_y1,pred_xgboost_ys)

pred_xgboost %>% head
##   pred_xgboost_y0 pred_xgboost_y1 pred_xgboost_ys
## 1       140.30222        415.4552        555.7574
## 2        96.39616        306.6591        403.0553
## 3       154.99208        403.4247        558.4167
## 4       115.06355        377.4039        492.4674
## 5        85.18041        363.4391        448.6195
## 6       126.75983        397.6378        524.3976
testset$delta.selection.presentation %>% head
## [1] 423 417 506 791 556 381
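# NB: obs here is delta.departure.presentation (Y1 only); the global target matching
# pred_xgboost_ys is delta.selection.presentation, printed just above.
postResample(pred = pred_xgboost_ys, obs = testset$delta.departure.presentation)
##       RMSE   Rsquared        MAE 
## 186.759072   0.384372 165.226319

The accuracy looks acceptable, but these figures compare the summed prediction with Y1 alone (delta.departure.presentation). They should be recomputed against testset$delta.selection.presentation before drawing conclusions on the global response time.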
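The choice of the observed column matters here because the global target is the sum of the two partial delays. The sketch below checks this on the six test rows printed in sections 6.2 to 6.4, then scores the summed prediction against both candidate targets. The helper name mae is hypothetical, and six rows are far too few to say anything about the full test set.

```r
# Observed values for the first six test rows, copied from the outputs above.
y0_obs <- c(47, 149, 97, 113, 124, 145)      # delta.selection.departure
y1_obs <- c(376, 268, 409, 678, 432, 236)    # delta.departure.presentation
ys_obs <- c(423, 417, 506, 791, 556, 381)    # delta.selection.presentation

# The global delay is exactly the sum of the two partial delays.
all(y0_obs + y1_obs == ys_obs)

# Predictions for the same six rows.
y0_pred <- c(140.30222, 96.39616, 154.99208, 115.06355, 85.18041, 126.75983)
y1_pred <- c(415.4552, 306.6591, 403.4247, 377.4039, 363.4391, 397.6378)
ys_pred <- y0_pred + y1_pred

# Hypothetical helper: mean absolute error.
mae <- function(pred, obs) mean(abs(pred - obs))

mae_vs_total <- mae(ys_pred, ys_obs)  # summed prediction vs. global target
mae_vs_y1    <- mae(ys_pred, y1_obs)  # summed prediction vs. Y1 only
c(vs_total = mae_vs_total, vs_y1 = mae_vs_y1)
```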

6.5 Use feature importance to create groups for factor variables for future models

For Y0

importance_matrix_y0 <- xgb.importance(model = gb_0_dt)
xgb.plot.importance(importance_matrix_y0[1:30,])

importance_matrix_y0[1:30,]
##                                        Feature        Gain        Cover
##  1: delta.status.preceding.selection.selection 0.156960966 0.0353014695
##  2:                                    hours03 0.072181643 0.0134732246
##  3:                                    hours04 0.070018506 0.0130817159
##  4:                                    hours05 0.069297197 0.0128591216
##  5:                                    hours02 0.068886010 0.0132399501
##  6:                                    hours01 0.060420820 0.0125660520
##  7:                                    hours06 0.037428929 0.0131693354
##  8:            emergency.vehicle.typeVSAV BSPP 0.033491678 0.0108941036
##  9:                     alert.reason.category3 0.029556868 0.0038505879
## 10:                emergency.vehicle.selection 0.025238333 0.0288528568
## 11:                      latitude.intervention 0.023163862 0.0152647111
## 12:                          rescue.center2474 0.017850762 0.0104794051
## 13:                     longitude.intervention 0.014201064 0.0083155053
## 14:                    OSRM.estimated.duration 0.013281196 0.0141529769
## 15:                          rescue.center2475 0.012587701 0.0090285526
## 16:                          rescue.center2477 0.012335731 0.0106209831
## 17:                    OSRM.estimated.distance 0.008561129 0.0026964813
## 18:                       OSRM.estimated.speed 0.008227623 0.0005511891
## 19:                  emergency.vehicle.typeVID 0.007142448 0.0054117444
## 20:                          rescue.center2439 0.007037138 0.0099891437
## 21:                  emergency.vehicle.typeCRF 0.006013882 0.0049193400
## 22:                                    hours23 0.005624843 0.0062668227
## 23:                                    hours12 0.005465856 0.0154358905
## 24:                                    hours13 0.005420156 0.0127640800
## 25:                                    hours11 0.005342021 0.0160590534
## 26:                          rescue.center2456 0.004569964 0.0105871828
## 27:           departed.from.its.rescue.center1 0.004407864 0.0054555106
## 28:                                    hours10 0.004372949 0.0143542630
## 29:                          rescue.center2493 0.003969963 0.0085054840
## 30:                                    hours14 0.003918268 0.0123038207
##                                        Feature        Gain        Cover
##       Frequency
##  1: 0.096484375
##  2: 0.006640625
##  3: 0.005468750
##  4: 0.008203125
##  5: 0.006640625
##  6: 0.009765625
##  7: 0.006640625
##  8: 0.007031250
##  9: 0.009765625
## 10: 0.095703125
## 11: 0.058984375
## 12: 0.003125000
## 13: 0.052343750
## 14: 0.032031250
## 15: 0.002343750
## 16: 0.003515625
## 17: 0.042187500
## 18: 0.037500000
## 19: 0.004296875
## 20: 0.004296875
## 21: 0.001562500
## 22: 0.003906250
## 23: 0.004296875
## 24: 0.007031250
## 25: 0.006250000
## 26: 0.002734375
## 27: 0.006640625
## 28: 0.005078125
## 29: 0.001953125
## 30: 0.004687500
##       Frequency

We find:
- delta.status.preceding.selection.selection
- hours
- latitude.intervention
- longitude.intervention
- emergency.vehicle.typeVSAV BSPP (to be grouped into an is / is-not indicator)
- emergency.vehicle.typePSE
- emergency.vehicle.typeVID
- emergency.vehicle.typeCRF
- emergency.vehicle.typeFNPC
- OSRM data
- rescue centers 2474, 2477, 2475, 2439, 2435, 2464, 2488, 2493 (to be grouped into an is / is-not indicator)
- alert.reason.category3 and alert.reason.category2
- departed.from.its.rescue.center1

We regroup the emergency vehicle types and the rescue centers into binary indicators (1 if the level belongs to the list above, 0 otherwise).

trainset0.regroup <- trainset0
testset0.regroup <- testset0

trainset0.regroup = trainset0.regroup %>%
    mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~  1,
                                  emergency.vehicle.type =="PSE" ~  1,
                                  emergency.vehicle.type == "VID" ~  1,
                                  emergency.vehicle.type == "CRF" ~  1,
                                  emergency.vehicle.type == "FNPC" ~  1,
                                  TRUE ~ 0))
trainset0.regroup = trainset0.regroup %>%
    mutate(rescue.center.regroup = case_when(rescue.center == "2474" ~  1,
                                  rescue.center =="2477" ~  1,
                                  rescue.center == "2475" ~  1,
                                  rescue.center == "2439" ~  1,
                                  rescue.center == "2435" ~  1,
                                  rescue.center == "2464" ~  1,
                                  rescue.center == "2488" ~  1,
                                  rescue.center == "2493" ~  1,
                                  TRUE ~ 0))


testset0.regroup = testset0.regroup %>%
    mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~  1,
                                  emergency.vehicle.type =="PSE" ~  1,
                                  emergency.vehicle.type == "VID" ~  1,
                                  emergency.vehicle.type == "CRF" ~  1,
                                  emergency.vehicle.type == "FNPC" ~  1,
                                  TRUE ~ 0))
testset0.regroup = testset0.regroup  %>%
    mutate(rescue.center.regroup = case_when(rescue.center == "2474" ~  1,
                                  rescue.center =="2477" ~  1,
                                  rescue.center == "2475" ~  1,
                                  rescue.center == "2439" ~  1,
                                  rescue.center == "2435" ~  1,
                                  rescue.center == "2464" ~  1,
                                  rescue.center == "2488" ~  1,
                                  rescue.center == "2493" ~  1,
                                  TRUE ~ 0))

Drop the unimportant features from trainset0 and testset0

trainset0.regroup$rescue.center<-NULL
trainset0.regroup$n<-NULL
trainset0.regroup$emergency.vehicle.type<-NULL
trainset0.regroup$floor<-NULL
trainset0.regroup$emergency.vehicle<-NULL
trainset0.regroup$weekdays<-NULL
trainset0.regroup$intervention.on.public.roads<-NULL
trainset0.regroup$location.of.the.event<-NULL
trainset0.regroup$month<-NULL
testset0.regroup$rescue.center<-NULL
testset0.regroup$emergency.vehicle.type<-NULL

testset0.regroup$n<-NULL
testset0.regroup$floor<-NULL
testset0.regroup$emergency.vehicle<-NULL
testset0.regroup$weekdays<-NULL
testset0.regroup$intervention.on.public.roads<-NULL
testset0.regroup$location.of.the.event<-NULL
testset0.regroup$month<-NULL

Same for x_test

x_test0.regroup <- x_test

x_test0.regroup = x_test0.regroup %>%
    mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~  1,
                                  emergency.vehicle.type =="PSE" ~  1,
                                  emergency.vehicle.type == "VID" ~  1,
                                  emergency.vehicle.type == "CRF" ~  1,
                                  emergency.vehicle.type == "FNPC" ~  1,
                                  TRUE ~ 0))
x_test0.regroup = x_test0.regroup %>%
    mutate(rescue.center.regroup = case_when(rescue.center == "2474" ~  1,
                                  rescue.center =="2477" ~  1,
                                  rescue.center == "2475" ~  1,
                                  rescue.center == "2439" ~  1,
                                  rescue.center == "2435" ~  1,
                                  rescue.center == "2464" ~  1,
                                  rescue.center == "2488" ~  1,
                                  rescue.center == "2493" ~  1,
                                  TRUE ~ 0))
x_test0.regroup$rescue.center<-NULL
x_test0.regroup$emergency.vehicle.type<-NULL

x_test0.regroup$n<-NULL
x_test0.regroup$floor<-NULL
x_test0.regroup$emergency.vehicle<-NULL
x_test0.regroup$weekdays<-NULL
x_test0.regroup$intervention.on.public.roads<-NULL
x_test0.regroup$location.of.the.event<-NULL
x_test0.regroup$month<-NULL
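The same regrouping is repeated for the train set, the test set and x_test. A possible way to keep the three copies in sync is to define the level lists once and apply one helper to each data set. The sketch below assumes base R only; the helper names flag_levels and add_y0_groups and the toy data frame are hypothetical.

```r
# Hypothetical helper: 1 if the value belongs to `keep`, 0 otherwise
# (NA values map to 0, like the TRUE ~ 0 branch of case_when).
flag_levels <- function(x, keep) as.numeric(as.character(x) %in% keep)

# Level lists taken from the Y0 feature importance analysis above.
y0_vehicle_types  <- c("VSAV BSPP", "PSE", "VID", "CRF", "FNPC")
y0_rescue_centers <- c("2474", "2477", "2475", "2439",
                       "2435", "2464", "2488", "2493")

# Add both indicators to any data set that has the two source columns.
add_y0_groups <- function(df) {
  df$emergency.vehicle.type.regroup <-
    flag_levels(df$emergency.vehicle.type, y0_vehicle_types)
  df$rescue.center.regroup <-
    flag_levels(df$rescue.center, y0_rescue_centers)
  df
}

# Toy rows (hypothetical) to show the result.
toy <- data.frame(
  emergency.vehicle.type = factor(c("VSAV BSPP", "FPT BSPP", "VID")),
  rescue.center          = factor(c("2474", "2506", "2493"))
)
add_y0_groups(toy)
```

With this layout, a change in one of the level lists automatically applies to every data set passed through add_y0_groups.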

For Y1

importance_matrix_y1 <- xgb.importance(model = gb_1_dt)
xgb.plot.importance(importance_matrix_y1[1:30,])

importance_matrix_y1[1:30,]
##                                        Feature        Gain       Cover
##  1:                    OSRM.estimated.distance 0.709412629 0.058331801
##  2:                    OSRM.estimated.duration 0.126576169 0.065559114
##  3:                      latitude.intervention 0.015988622 0.037607944
##  4:                     longitude.intervention 0.011153806 0.027810403
##  5:                       OSRM.estimated.speed 0.010997592 0.016156794
##  6: delta.status.preceding.selection.selection 0.006750294 0.011458927
##  7:                emergency.vehicle.selection 0.006270702 0.030779022
##  8:                  emergency.vehicle.typeVID 0.005569648 0.009361946
##  9:                          rescue.center2506 0.004225152 0.015055506
## 10:                  emergency.vehicle.typePSE 0.003136839 0.012488449
## 11:            emergency.vehicle.typeVSAV BSPP 0.003132041 0.003476509
## 12:                   location.of.the.event139 0.002581180 0.013239570
## 13:                     alert.reason.category6 0.002392792 0.012834226
## 14:                                    hours21 0.002281380 0.011122909
## 15:                   location.of.the.event136 0.002224477 0.012620189
## 16:                                         n2 0.002174565 0.017383398
## 17:                          rescue.center2481 0.002140234 0.005750444
## 18:              intervention.on.public.roads1 0.002021465 0.002513100
## 19:                  emergency.vehicle.typeCRF 0.001972039 0.010239683
## 20:                          rescue.center2507 0.001919102 0.013192729
## 21:                                    hours22 0.001869608 0.011406353
## 22:                                    hours09 0.001794345 0.009540648
## 23:                          rescue.center2479 0.001601613 0.012979384
## 24:                          rescue.center2491 0.001499905 0.013078067
## 25:                          rescue.center2485 0.001403521 0.009811000
## 26:                          rescue.center2442 0.001329027 0.006739388
## 27:                          rescue.center2450 0.001323689 0.004130654
## 28:                          rescue.center2455 0.001275787 0.002079660
## 29:                                    hours18 0.001251206 0.009525889
## 30:                          rescue.center2498 0.001221759 0.009782415
##                                        Feature        Gain       Cover
##       Frequency
##  1: 0.093567251
##  2: 0.089181287
##  3: 0.068713450
##  4: 0.059941520
##  5: 0.036549708
##  6: 0.033625731
##  7: 0.036549708
##  8: 0.007309942
##  9: 0.007309942
## 10: 0.005847953
## 11: 0.007309942
## 12: 0.005847953
## 13: 0.008771930
## 14: 0.005847953
## 15: 0.005847953
## 16: 0.011695906
## 17: 0.004385965
## 18: 0.004385965
## 19: 0.007309942
## 20: 0.005847953
## 21: 0.007309942
## 22: 0.007309942
## 23: 0.007309942
## 24: 0.005847953
## 25: 0.005847953
## 26: 0.008771930
## 27: 0.002923977
## 28: 0.002923977
## 29: 0.004385965
## 30: 0.005847953
##       Frequency

We find:
- OSRM data
- latitude.intervention
- longitude.intervention
- delta.status.preceding.selection.selection
- emergency.vehicle.typeVID
- emergency.vehicle.typePSE
- emergency.vehicle.typeCRF
- rescue.center 2506, 2507, 2485, 2481, 2498, 2500
- emergency.vehicle.typeVSAV BSPP
- emergency.vehicle.typeFPT BSPP
- alert.reason.category6
- location.of.the.event 139, 136, 259

trainset1.regroup <- trainset1
testset1.regroup <- testset1

trainset1.regroup = trainset1.regroup %>%
    mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~  1,
                                  emergency.vehicle.type =="FPT BSPP" ~  1,
                                  emergency.vehicle.type == "VID" ~  1,
                                  emergency.vehicle.type == "CRF" ~  1,
                                  emergency.vehicle.type == "PSE" ~  1,
                                  TRUE ~ 0))

trainset1.regroup = trainset1.regroup %>%
    mutate(rescue.center.regroup = case_when(rescue.center == "2506" ~  1,
                                  rescue.center =="2507" ~  1,
                                  rescue.center == "2485" ~  1,
                                  rescue.center == "2481" ~  1,
                                  rescue.center == "2498" ~  1,
                                  rescue.center == "2500" ~  1,
                                  TRUE ~ 0))
# flag the event locations listed above (139, 136, 259)
trainset1.regroup = trainset1.regroup %>%
    mutate(location.of.the.event.regroup = case_when(location.of.the.event == "139" ~  1,
                                  location.of.the.event == "136" ~  1,
                                  location.of.the.event == "259" ~  1,
                                  TRUE ~ 0))

testset1.regroup = testset1.regroup %>%
    mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~  1,
                                  emergency.vehicle.type =="FPT BSPP" ~  1,
                                  emergency.vehicle.type == "VID" ~  1,
                                  emergency.vehicle.type == "CRF" ~  1,
                                  emergency.vehicle.type == "PSE" ~  1,
                                  TRUE ~ 0))

testset1.regroup = testset1.regroup %>%
    mutate(rescue.center.regroup = case_when(rescue.center == "2506" ~  1,
                                  rescue.center =="2507" ~  1,
                                  rescue.center == "2485" ~  1,
                                  rescue.center == "2481" ~  1,
                                  rescue.center == "2498" ~  1,
                                  rescue.center == "2500" ~  1,
                                  TRUE ~ 0))
# flag the event locations listed above (139, 136, 259)
testset1.regroup = testset1.regroup%>%
    mutate(location.of.the.event.regroup = case_when(location.of.the.event == "139" ~  1,
                                  location.of.the.event == "136" ~  1,
                                  location.of.the.event == "259" ~  1,
                                  TRUE ~ 0))

Drop the unimportant features

trainset1.regroup$rescue.center<-NULL
trainset1.regroup$emergency.vehicle.type<-NULL
trainset1.regroup$location.of.the.event<-NULL
trainset1.regroup$emergency.vehicle<-NULL
trainset1.regroup$departed.from.its.rescue.center<-NULL
testset1.regroup$rescue.center<-NULL
testset1.regroup$emergency.vehicle.type<-NULL
testset1.regroup$location.of.the.event<-NULL
testset1.regroup$emergency.vehicle<-NULL
testset1.regroup$departed.from.its.rescue.center<-NULL

Same for x_test

x_test1.regroup <- x_test

x_test1.regroup = x_test1.regroup %>%
    mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~  1,
                                  emergency.vehicle.type =="FPT BSPP" ~  1,
                                  emergency.vehicle.type == "VID" ~  1,
                                  emergency.vehicle.type == "CRF" ~  1,
                                  emergency.vehicle.type == "PSE" ~  1,
                                  TRUE ~ 0))

x_test1.regroup = x_test1.regroup %>%
    mutate(rescue.center.regroup = case_when(rescue.center == "2506" ~  1,
                                  rescue.center =="2507" ~  1,
                                  rescue.center == "2485" ~  1,
                                  rescue.center == "2481" ~  1,
                                  rescue.center == "2498" ~  1,
                                  rescue.center == "2500" ~  1,
                                  TRUE ~ 0))
# flag the event locations listed above (139, 136, 259)
x_test1.regroup = x_test1.regroup %>%
    mutate(location.of.the.event.regroup = case_when(location.of.the.event == "139" ~  1,
                                  location.of.the.event == "136" ~  1,
                                  location.of.the.event == "259" ~  1,
                                  TRUE ~ 0))

Drop the unimportant features

x_test1.regroup$rescue.center<-NULL
x_test1.regroup$emergency.vehicle.type<-NULL
x_test1.regroup$location.of.the.event<-NULL
x_test1.regroup$emergency.vehicle<-NULL
x_test1.regroup$departed.from.its.rescue.center<-NULL

7. OLS - Regression

7.1 Yo : Selection - Departure

Remove the id variable and add the Y0 target back

trainset0.regroup<-trainset0.regroup[,-c("emergency.vehicle.selection")]
trainset0.regroup$delta.selection.departure<-trainset$delta.selection.departure
testset0.regroup<-testset0.regroup[,-c("emergency.vehicle.selection")]
testset0.regroup$delta.selection.departure<-testset$delta.selection.departure

Same for x_test

x_test0.regroup<-x_test0.regroup[,-c("emergency.vehicle.selection")]
# Linear regression

options(max.print = 10000)

lm <- lm(delta.selection.departure ~.,data = trainset0.regroup)

summary(lm)
## 
## Call:
## lm(formula = delta.selection.departure ~ ., data = trainset0.regroup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -219.10  -31.19   -4.02   26.07  901.11 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -1.742e+03  1.242e+02 -14.027
## alert.reason.category2                     -1.868e+01  9.088e-01 -20.559
## alert.reason.category3                     -1.699e+01  5.410e-01 -31.404
## alert.reason.category4                      3.479e+00  1.664e+00   2.091
## alert.reason.category5                      1.805e+01  2.334e+00   7.733
## alert.reason.category6                      1.266e+01  9.736e-01  13.000
## alert.reason.category7                      2.694e+01  2.279e+00  11.819
## alert.reason.category8                      7.039e+00  3.975e+00   1.771
## alert.reason.category9                      3.767e+00  8.048e-01   4.681
## longitude.intervention                      2.430e+01  1.432e+00  16.976
## latitude.intervention                       3.803e+01  2.540e+00  14.976
## delta.status.preceding.selection.selection  6.206e-05  3.028e-06  20.495
## departed.from.its.rescue.center1           -2.555e+00  8.275e-01  -3.087
## OSRM.estimated.distance                    -2.495e-03  5.199e-04  -4.799
## OSRM.estimated.duration                     4.455e-02  4.977e-03   8.950
## OSRM.estimated.speed                        1.124e-01  3.795e-02   2.962
## hours01                                     1.502e+01  9.266e-01  16.215
## hours02                                     2.770e+01  9.852e-01  28.115
## hours03                                     3.238e+01  1.011e+00  32.019
## hours04                                     3.531e+01  1.055e+00  33.459
## hours05                                     3.576e+01  1.070e+00  33.412
## hours06                                     1.891e+01  1.037e+00  18.230
## hours07                                    -1.944e+01  9.657e-01 -20.134
## hours08                                    -3.107e+01  8.879e-01 -34.991
## hours09                                    -2.651e+01  8.491e-01 -31.220
## hours10                                    -3.685e+01  8.353e-01 -44.115
## hours11                                    -3.918e+01  8.255e-01 -47.457
## hours12                                    -4.004e+01  8.124e-01 -49.282
## hours13                                    -3.599e+01  8.142e-01 -44.210
## hours14                                    -3.667e+01  8.209e-01 -44.671
## hours15                                    -3.574e+01  8.277e-01 -43.181
## hours16                                    -3.740e+01  8.326e-01 -44.923
## hours17                                    -3.401e+01  8.289e-01 -41.034
## hours18                                    -2.932e+01  8.229e-01 -35.628
## hours19                                    -3.356e+01  8.113e-01 -41.367
## hours20                                    -3.354e+01  8.150e-01 -41.147
## hours21                                    -3.249e+01  8.255e-01 -39.359
## hours22                                    -3.068e+01  8.364e-01 -36.686
## hours23                                    -1.832e+01  8.584e-01 -21.340
## emergency.vehicle.type.regroup             -7.772e+00  5.581e-01 -13.926
## rescue.center.regroup                       9.675e+00  3.503e-01  27.620
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## alert.reason.category2                      < 2e-16 ***
## alert.reason.category3                      < 2e-16 ***
## alert.reason.category4                      0.03656 *  
## alert.reason.category5                     1.06e-14 ***
## alert.reason.category6                      < 2e-16 ***
## alert.reason.category7                      < 2e-16 ***
## alert.reason.category8                      0.07661 .  
## alert.reason.category9                     2.86e-06 ***
## longitude.intervention                      < 2e-16 ***
## latitude.intervention                       < 2e-16 ***
## delta.status.preceding.selection.selection  < 2e-16 ***
## departed.from.its.rescue.center1            0.00202 ** 
## OSRM.estimated.distance                    1.60e-06 ***
## OSRM.estimated.duration                     < 2e-16 ***
## OSRM.estimated.speed                        0.00306 ** 
## hours01                                     < 2e-16 ***
## hours02                                     < 2e-16 ***
## hours03                                     < 2e-16 ***
## hours04                                     < 2e-16 ***
## hours05                                     < 2e-16 ***
## hours06                                     < 2e-16 ***
## hours07                                     < 2e-16 ***
## hours08                                     < 2e-16 ***
## hours09                                     < 2e-16 ***
## hours10                                     < 2e-16 ***
## hours11                                     < 2e-16 ***
## hours12                                     < 2e-16 ***
## hours13                                     < 2e-16 ***
## hours14                                     < 2e-16 ***
## hours15                                     < 2e-16 ***
## hours16                                     < 2e-16 ***
## hours17                                     < 2e-16 ***
## hours18                                     < 2e-16 ***
## hours19                                     < 2e-16 ***
## hours20                                     < 2e-16 ***
## hours21                                     < 2e-16 ***
## hours22                                     < 2e-16 ***
## hours23                                     < 2e-16 ***
## emergency.vehicle.type.regroup              < 2e-16 ***
## rescue.center.regroup                       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.68 on 163948 degrees of freedom
## Multiple R-squared:  0.1989, Adjusted R-squared:  0.1987 
## F-statistic:  1018 on 40 and 163948 DF,  p-value: < 2.2e-16

Almost all coefficients are significant at the 5% level; the exception is alert.reason.category8 (p = 0.077). We check the model hypotheses in 7.1.3.

7.1.2 Y0 : Multicollinearity

library(sandwich)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(lm)
##                                                 GVIF Df GVIF^(1/(2*Df))
## alert.reason.category                       1.537152  8        1.027235
## longitude.intervention                      1.051810  1        1.025578
## latitude.intervention                       1.068139  1        1.033508
## delta.status.preceding.selection.selection  1.181290  1        1.086872
## departed.from.its.rescue.center             1.010514  1        1.005243
## OSRM.estimated.distance                    32.339705  1        5.686801
## OSRM.estimated.duration                    23.118015  1        4.808120
## OSRM.estimated.speed                        4.662221  1        2.159218
## hours                                       1.029844 23        1.000639
## emergency.vehicle.type.regroup              1.540721  1        1.241258
## rescue.center.regroup                       1.093090  1        1.045509

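Apart from the OSRM variables (GVIF of 32.3 for OSRM.estimated.distance and 23.1 for OSRM.estimated.duration), no predictor shows strong multicollinearity. We decide to keep the OSRM variables because they are important for our model.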
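To make the VIF figures more concrete: for a single-column predictor, the VIF equals 1 / (1 - R2), where R2 comes from regressing that predictor on all the others. The sketch below reproduces this in base R on simulated data. The variable names echo the report but the values are simulated, and the helper name vif_manual is hypothetical.

```r
set.seed(1)
n <- 1000

# Simulated predictors: duration is almost a copy of distance, hour is unrelated.
distance <- rnorm(n)
duration <- distance + rnorm(n, sd = 0.1)
hour     <- rnorm(n)
X <- data.frame(distance, duration, hour)

# Hypothetical helper: regress each column on the remaining ones and
# turn the resulting R-squared into a variance inflation factor.
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

vif_manual(X)
```

The two nearly collinear columns get a large VIF while the unrelated one stays close to 1, which is the same pattern as OSRM.estimated.distance and OSRM.estimated.duration in the table above. Common rules of thumb flag values above 5 or 10.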

7.1.3 Residuals analysis

What we have done so far is not enough to validate the model. We need to study the residuals of the selected model: if the hypotheses on the residuals do not hold (see course), the tests on the coefficients are not valid.

7.1.3.1 Is the mean of the residuals 0?
# mean of residuals
summary(lm$residuals) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -219.100  -31.192   -4.021    0.000   26.069  901.115

Yes, it is! This is expected: OLS residuals always average to zero when the model includes an intercept.

7.1.3.2 Are residuals normally distributed?

Normality test: Jarque-Bera test (shapiro.test in R only accepts up to 5,000 observations), H0: normality against H1: no normality

library(tseries)
## 
## Attaching package: 'tseries'
## The following object is masked from 'package:imputeTS':
## 
##     na.remove
jarque.bera.test(lm$residuals)
## 
##  Jarque Bera Test
## 
## data:  lm$residuals
## X-squared = 290909, df = 2, p-value < 2.2e-16

Normality is rejected because the p-value is far below 5%: the residuals are not normally distributed. NB: non-normality could be caused by outliers.
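The Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4) and is compared with a chi-squared distribution with 2 degrees of freedom. The sketch below computes it in base R on simulated samples. The helper name jarque_bera is hypothetical and the code is an illustration, not the tseries implementation.

```r
# Hypothetical helper: Jarque-Bera statistic and p-value from sample moments.
jarque_bera <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m4 <- mean(d^4)
  skew <- m3 / m2^1.5   # sample skewness (0 for a symmetric sample)
  kurt <- m4 / m2^2     # sample kurtosis (3 for a normal distribution)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat,
    p.value   = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(1)
jarque_bera(rnorm(2000))  # simulated normal sample
jarque_bera(rexp(2000))   # simulated right-skewed sample
```

A long right tail, like the maximum residual of about 901 seconds seen above, increases both the skewness and the kurtosis terms, which is why a few extreme interventions can drive the statistic up.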

qqnorm(lm$residuals)
qqline(lm$residuals)

7.1.3.3 Are residuals homoskedastic?
  1. First model, which does not account for heteroskedasticity
plot(lm$residuals~lm$fitted)

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:imputeTS':
## 
##     na.locf
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Breusch-Pagan test, H0: homoskedasticity against H1: heteroskedasticity

bptest(lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm
## BP = 2239.7, df = 40, p-value < 2.2e-16

The p-value is far below 5%, so we reject H0: the residuals are heteroskedastic.
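The studentized Breusch-Pagan statistic can be obtained by hand: regress the squared residuals on the regressors and multiply the R2 of that auxiliary regression by n. The sketch below does this in base R on simulated data whose noise grows with x. All values are simulated and the variable names are hypothetical.

```r
set.seed(1)
n <- 500

# Simulated data: the error standard deviation increases with x.
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressor.
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R2, chi-squared with one degree of freedom per regressor.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(BP = bp_stat, p.value = bp_p)
```

In the report's model the auxiliary regression has 40 regressors, which is why the bptest output above shows df = 40.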

In case of heteroskedasticity, we have to use a robust standard error estimator. Otherwise, all our t-tests will be wrong.
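As an illustration of what a robust estimator computes, the sketch below builds the HC1 covariance matrix by hand in base R: the usual sandwich (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k). The data are simulated and the variable names are hypothetical; it is not the code of the sandwich package.

```r
set.seed(1)
n <- 500

# Simulated heteroskedastic data: the noise grows with x.
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

X <- model.matrix(fit)   # n x k design matrix, intercept included
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X: each row of X is scaled by e_i
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

# Classical versus robust standard errors, side by side.
se_classic <- sqrt(diag(vcov(fit)))
se_robust  <- sqrt(diag(vcov_hc1))
cbind(classic = se_classic, HC1 = se_robust)
```

The coefficient estimates are unchanged by this correction. Only the standard errors, and therefore the t-values and p-values, are affected.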

  2. Second model, taking heteroskedasticity into account

We calculate the robust covariance matrix

library(sandwich)
vcov_y0 <- vcovHC(lm, type = "HC1")
coeftest(lm, vcov. = vcov_y0)
## 
## t test of coefficients:
## 
##                                               Estimate  Std. Error  t value
## (Intercept)                                -1.7421e+03  1.2710e+02 -13.7064
## alert.reason.category2                     -1.8685e+01  9.2106e-01 -20.2861
## alert.reason.category3                     -1.6990e+01  6.2786e-01 -27.0595
## alert.reason.category4                      3.4795e+00  1.6322e+00   2.1318
## alert.reason.category5                      1.8047e+01  3.7930e+00   4.7579
## alert.reason.category6                      1.2656e+01  1.0128e+00  12.4965
## alert.reason.category7                      2.6938e+01  3.7618e+00   7.1609
## alert.reason.category8                      7.0385e+00  3.8429e+00   1.8315
## alert.reason.category9                      3.7669e+00  8.6163e-01   4.3719
## longitude.intervention                      2.4304e+01  1.4147e+00  17.1794
## latitude.intervention                       3.8031e+01  2.5966e+00  14.6462
## delta.status.preceding.selection.selection  6.2062e-05  4.9377e-06  12.5691
## departed.from.its.rescue.center1           -2.5549e+00  9.3812e-01  -2.7234
## OSRM.estimated.distance                    -2.4948e-03  5.2900e-04  -4.7161
## OSRM.estimated.duration                     4.4547e-02  5.0658e-03   8.7937
## OSRM.estimated.speed                        1.1241e-01  3.8333e-02   2.9324
## hours01                                     1.5025e+01  1.0488e+00  14.3262
## hours02                                     2.7699e+01  1.1384e+00  24.3308
## hours03                                     3.2377e+01  1.1847e+00  27.3287
## hours04                                     3.5306e+01  1.2587e+00  28.0499
## hours05                                     3.5755e+01  1.2618e+00  28.3375
## hours06                                     1.8906e+01  1.2225e+00  15.4646
## hours07                                    -1.9443e+01  1.0544e+00 -18.4395
## hours08                                    -3.1067e+01  9.4702e-01 -32.8052
## hours09                                    -2.6510e+01  9.3268e-01 -28.4236
## hours10                                    -3.6849e+01  8.7700e-01 -42.0165
## hours11                                    -3.9176e+01  8.6615e-01 -45.2301
## hours12                                    -4.0035e+01  8.2627e-01 -48.4532
## hours13                                    -3.5994e+01  8.4550e-01 -42.5711
## hours14                                    -3.6671e+01  8.4597e-01 -43.3476
## hours15                                    -3.5741e+01  8.5382e-01 -41.8607
## hours16                                    -3.7403e+01  8.6637e-01 -43.1715
## hours17                                    -3.4015e+01  8.5510e-01 -39.7786
## hours18                                    -2.9320e+01  8.6035e-01 -34.0791
## hours19                                    -3.3562e+01  8.3594e-01 -40.1494
## hours20                                    -3.3535e+01  8.3792e-01 -40.0219
## hours21                                    -3.2491e+01  8.4999e-01 -38.2249
## hours22                                    -3.0684e+01  8.6388e-01 -35.5189
## hours23                                    -1.8318e+01  9.2811e-01 -19.7363
## emergency.vehicle.type.regroup             -7.7724e+00  7.4798e-01 -10.3913
## rescue.center.regroup                       9.6752e+00  3.7435e-01  25.8451
##                                             Pr(>|t|)    
## (Intercept)                                < 2.2e-16 ***
## alert.reason.category2                     < 2.2e-16 ***
## alert.reason.category3                     < 2.2e-16 ***
## alert.reason.category4                      0.033025 *  
## alert.reason.category5                     1.958e-06 ***
## alert.reason.category6                     < 2.2e-16 ***
## alert.reason.category7                     8.050e-13 ***
## alert.reason.category8                      0.067020 .  
## alert.reason.category9                     1.232e-05 ***
## longitude.intervention                     < 2.2e-16 ***
## latitude.intervention                      < 2.2e-16 ***
## delta.status.preceding.selection.selection < 2.2e-16 ***
## departed.from.its.rescue.center1            0.006462 ** 
## OSRM.estimated.distance                    2.406e-06 ***
## OSRM.estimated.duration                    < 2.2e-16 ***
## OSRM.estimated.speed                        0.003364 ** 
## hours01                                    < 2.2e-16 ***
## hours02                                    < 2.2e-16 ***
## hours03                                    < 2.2e-16 ***
## hours04                                    < 2.2e-16 ***
## hours05                                    < 2.2e-16 ***
## hours06                                    < 2.2e-16 ***
## hours07                                    < 2.2e-16 ***
## hours08                                    < 2.2e-16 ***
## hours09                                    < 2.2e-16 ***
## hours10                                    < 2.2e-16 ***
## hours11                                    < 2.2e-16 ***
## hours12                                    < 2.2e-16 ***
## hours13                                    < 2.2e-16 ***
## hours14                                    < 2.2e-16 ***
## hours15                                    < 2.2e-16 ***
## hours16                                    < 2.2e-16 ***
## hours17                                    < 2.2e-16 ***
## hours18                                    < 2.2e-16 ***
## hours19                                    < 2.2e-16 ***
## hours20                                    < 2.2e-16 ***
## hours21                                    < 2.2e-16 ***
## hours22                                    < 2.2e-16 ***
## hours23                                    < 2.2e-16 ***
## emergency.vehicle.type.regroup             < 2.2e-16 ***
## rescue.center.regroup                      < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient estimates are unchanged, since only the standard errors are recomputed. All coefficients remain significant at the 5% level except alert.reason.category8 (p ≈ 0.067).
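To make clear what the robust covariance matrix changes, here is a minimal sketch on simulated data (not our dataset) that builds the HC1 estimator by hand in base R. The variables `x`, `y` and `fit` are hypothetical and only serve the illustration; the formula used is the usual sandwich form scaled by n / (n - k), which is what `type = "HC1"` refers to.

```r
# Simulated, heteroskedastic data: the error spread grows with x
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

X <- model.matrix(fit)   # design matrix (intercept + x)
e <- residuals(fit)      # OLS residuals
k <- ncol(X)             # number of estimated coefficients

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X (each row of X scaled by its residual)
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

robust_se  <- sqrt(diag(vcov_hc1))
classic_se <- sqrt(diag(vcov(fit)))

# Same estimates in both cases; only the standard errors (and t values) differ
cbind(estimate = coef(fit), classic_se, robust_se,
      robust_t = coef(fit) / robust_se)
```

The estimates column is identical under both approaches, which is why the tables before and after the correction show the same coefficients.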

7.2.3.4 Are the residuals correlated ?

Several tests for autocorrelation exist; the Durbin-Watson test is one of the most widely used.

H0 : Residuals are not autocorrelated
H1 : Residuals are autocorrelated

  1. First model which is not considering heteroskedasticity

We can’t run the Durbin-Watson test here because of the size of our linear regression.
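As a rough alternative, the Durbin-Watson statistic itself is cheap to compute from the residuals, even on a large sample. The sketch below uses simulated data with independent errors (the names `x`, `y`, `fit` are hypothetical). It gives the statistic only, not a p-value, and it is meaningful only if the rows are sorted in time order, which is an assumption we would have to check on our own data.

```r
# Simulated regression with independent errors
set.seed(42)
n <- 200000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
dw_stat <- sum(diff(e)^2) / sum(e^2)

# Implied first-order autocorrelation, using DW ~ 2 * (1 - rho)
rho_hat <- 1 - dw_stat / 2

c(dw = dw_stat, rho_hat = rho_hat)
```

A value close to 2 suggests no first-order autocorrelation; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.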

  2. Second model taking into account heteroskedasticity and autocorrelation (Newey-West)
library(sandwich)

#Calculate the robust covariance matrix

vcov_y0_2 <- NeweyWest(lm)

coeftest(lm, vcov. = vcov_y0_2)
## 
## t test of coefficients:
## 
##                                               Estimate  Std. Error  t value
## (Intercept)                                -1.7421e+03  1.2711e+02 -13.7049
## alert.reason.category2                     -1.8685e+01  9.1994e-01 -20.3107
## alert.reason.category3                     -1.6990e+01  6.2878e-01 -27.0201
## alert.reason.category4                      3.4795e+00  1.6303e+00   2.1342
## alert.reason.category5                      1.8047e+01  3.7978e+00   4.7519
## alert.reason.category6                      1.2656e+01  1.0120e+00  12.5056
## alert.reason.category7                      2.6938e+01  3.7616e+00   7.1614
## alert.reason.category8                      7.0385e+00  3.8426e+00   1.8317
## alert.reason.category9                      3.7669e+00  8.6374e-01   4.3612
## longitude.intervention                      2.4304e+01  1.4121e+00  17.2107
## latitude.intervention                       3.8031e+01  2.5973e+00  14.6421
## delta.status.preceding.selection.selection  6.2062e-05  4.9610e-06  12.5100
## departed.from.its.rescue.center1           -2.5549e+00  9.3620e-01  -2.7290
## OSRM.estimated.distance                    -2.4948e-03  5.2772e-04  -4.7274
## OSRM.estimated.duration                     4.4547e-02  5.0594e-03   8.8047
## OSRM.estimated.speed                        1.1241e-01  3.8229e-02   2.9404
## hours01                                     1.5025e+01  1.0435e+00  14.3980
## hours02                                     2.7699e+01  1.1367e+00  24.3676
## hours03                                     3.2377e+01  1.1792e+00  27.4571
## hours04                                     3.5306e+01  1.2448e+00  28.3635
## hours05                                     3.5755e+01  1.2565e+00  28.4552
## hours06                                     1.8906e+01  1.2160e+00  15.5467
## hours07                                    -1.9443e+01  1.0524e+00 -18.4759
## hours08                                    -3.1067e+01  9.4355e-01 -32.9260
## hours09                                    -2.6510e+01  9.2218e-01 -28.7474
## hours10                                    -3.6849e+01  8.7357e-01 -42.1815
## hours11                                    -3.9176e+01  8.5723e-01 -45.7004
## hours12                                    -4.0035e+01  8.2020e-01 -48.8118
## hours13                                    -3.5994e+01  8.3903e-01 -42.8994
## hours14                                    -3.6671e+01  8.3854e-01 -43.7320
## hours15                                    -3.5741e+01  8.4643e-01 -42.2260
## hours16                                    -3.7403e+01  8.6307e-01 -43.3366
## hours17                                    -3.4015e+01  8.4964e-01 -40.0342
## hours18                                    -2.9320e+01  8.5407e-01 -34.3298
## hours19                                    -3.3562e+01  8.2919e-01 -40.4762
## hours20                                    -3.3535e+01  8.3080e-01 -40.3648
## hours21                                    -3.2491e+01  8.3965e-01 -38.6958
## hours22                                    -3.0684e+01  8.6078e-01 -35.6467
## hours23                                    -1.8318e+01  9.1927e-01 -19.9262
## emergency.vehicle.type.regroup             -7.7724e+00  7.5135e-01 -10.3446
## rescue.center.regroup                       9.6752e+00  3.7590e-01  25.7388
##                                             Pr(>|t|)    
## (Intercept)                                < 2.2e-16 ***
## alert.reason.category2                     < 2.2e-16 ***
## alert.reason.category3                     < 2.2e-16 ***
## alert.reason.category4                      0.032827 *  
## alert.reason.category5                     2.017e-06 ***
## alert.reason.category6                     < 2.2e-16 ***
## alert.reason.category7                     8.022e-13 ***
## alert.reason.category8                      0.066993 .  
## alert.reason.category9                     1.294e-05 ***
## longitude.intervention                     < 2.2e-16 ***
## latitude.intervention                      < 2.2e-16 ***
## delta.status.preceding.selection.selection < 2.2e-16 ***
## departed.from.its.rescue.center1            0.006354 ** 
## OSRM.estimated.distance                    2.276e-06 ***
## OSRM.estimated.duration                    < 2.2e-16 ***
## OSRM.estimated.speed                        0.003279 ** 
## hours01                                    < 2.2e-16 ***
## hours02                                    < 2.2e-16 ***
## hours03                                    < 2.2e-16 ***
## hours04                                    < 2.2e-16 ***
## hours05                                    < 2.2e-16 ***
## hours06                                    < 2.2e-16 ***
## hours07                                    < 2.2e-16 ***
## hours08                                    < 2.2e-16 ***
## hours09                                    < 2.2e-16 ***
## hours10                                    < 2.2e-16 ***
## hours11                                    < 2.2e-16 ***
## hours12                                    < 2.2e-16 ***
## hours13                                    < 2.2e-16 ***
## hours14                                    < 2.2e-16 ***
## hours15                                    < 2.2e-16 ***
## hours16                                    < 2.2e-16 ***
## hours17                                    < 2.2e-16 ***
## hours18                                    < 2.2e-16 ***
## hours19                                    < 2.2e-16 ***
## hours20                                    < 2.2e-16 ***
## hours21                                    < 2.2e-16 ***
## hours22                                    < 2.2e-16 ***
## hours23                                    < 2.2e-16 ***
## emergency.vehicle.type.regroup             < 2.2e-16 ***
## rescue.center.regroup                      < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

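The coefficient estimates are unchanged, and the Newey-West standard errors are very close to the HC1 ones. All coefficients remain significant at the 5% level except alert.reason.category8 (p ≈ 0.067).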

7.1.4 Y0 : Prediction

pred_reg_y0 <- predict(lm,testset0.regroup)
pred_reg_y0 %>% head
##        1        2        3        4        5        6 
## 155.5281 153.3825 167.5792 186.7568 169.1919 165.1531
testset$delta.selection.departure %>% head
## [1]  47 149  97 113 124 145

Compute R2

postResample(pred = pred_reg_y0, obs = testset0.regroup$delta.selection.departure)
##       RMSE   Rsquared        MAE 
## 49.0785462  0.1924098 36.3520945
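For reference, the three metrics above can be reproduced by hand. The sketch below uses only the six predicted and observed values printed above, so its numbers will not match the full test set. As far as we know, the `Rsquared` reported by `postResample` is the squared correlation between predictions and observations; the classical 1 - SSE/SST version is shown next to it for comparison, and the two can differ noticeably.

```r
# First six observed and predicted values shown above (illustration only)
obs  <- c(47, 149, 97, 113, 124, 145)
pred <- c(155.5281, 153.3825, 167.5792, 186.7568, 169.1919, 165.1531)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
mae  <- mean(abs(pred - obs))        # mean absolute error

# Squared correlation between predictions and observations
r2_corr <- cor(pred, obs)^2

# Classical coefficient of determination (can be negative out of sample)
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(RMSE = rmse, MAE = mae, R2_corr = r2_corr, R2_trad = r2_trad)
```

The squared correlation ignores any systematic bias in the predictions, so a low RMSE remains the more direct indicator of prediction quality here.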

Prediction for x_test

pred_reg_final_y0 <- predict(lm,x_test0.regroup)

7.2 Y1

Remove id variable

trainset1.regroup<-trainset1.regroup[,-c("emergency.vehicle.selection")]
trainset1.regroup$delta.departure.presentation<-trainset$delta.departure.presentation
testset1.regroup<-testset1.regroup[,-c("emergency.vehicle.selection")]
testset1.regroup$delta.departure.presentation<-testset$delta.departure.presentation

Same for x_test

x_test1.regroup<-x_test1.regroup[,-c("emergency.vehicle.selection")]
#Linear Regression,

options(max.print = 10000)

lm1 <- lm(delta.departure.presentation ~.,data = trainset1.regroup)


summary(lm1)
## 
## Call:
## lm(formula = delta.departure.presentation ~ ., data = trainset1.regroup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -672.05  -80.89  -23.92   52.61  876.20 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -8.394e+03  3.271e+02 -25.660
## n2                                         -2.398e+01  1.626e+00 -14.753
## n3                                         -3.451e+01  2.954e+00 -11.685
## n4                                         -4.055e+01  5.405e+00  -7.502
## n5                                         -2.865e+01  7.910e+00  -3.622
## n6                                         -1.674e+01  9.024e+00  -1.855
## n7                                         -9.879e-02  1.125e+01  -0.009
## n8                                          2.928e+01  3.866e+01   0.757
## n10                                         9.906e+01  4.842e+01   2.046
## n11                                         1.414e+01  2.866e+01   0.493
## n12                                         4.716e+01  2.867e+01   1.645
## n14                                        -4.331e+01  3.555e+01  -1.218
## alert.reason.category2                     -3.444e+01  2.513e+00 -13.707
## alert.reason.category3                     -2.722e+01  1.580e+00 -17.232
## alert.reason.category4                      5.772e+00  4.437e+00   1.301
## alert.reason.category5                      7.400e+01  6.180e+00  11.974
## alert.reason.category6                      3.670e+01  2.629e+00  13.957
## alert.reason.category7                      7.934e+01  6.020e+00  13.179
## alert.reason.category8                      6.163e+01  1.046e+01   5.894
## alert.reason.category9                      8.487e-01  2.253e+00   0.377
## intervention.on.public.roads1              -7.392e+00  1.009e+00  -7.328
## floor-1                                     6.431e+00  5.373e+00   1.197
## floor0                                      1.644e+01  4.877e+00   3.372
## floor1                                      1.745e+01  4.987e+00   3.498
## floor2                                      2.206e+01  5.014e+00   4.401
## floor3                                      2.127e+01  5.041e+00   4.220
## floor4                                      2.268e+01  5.083e+00   4.462
## floor5                                      2.287e+01  5.171e+00   4.423
## floor6                                      2.505e+01  5.301e+00   4.725
## floor7                                      1.970e+01  5.586e+00   3.527
## floor8                                      3.346e+01  6.014e+00   5.563
## floor9                                      4.080e+01  6.537e+00   6.241
## floor10                                     2.471e+01  7.079e+00   3.491
## floor11                                     2.587e+01  7.723e+00   3.349
## floor12                                     2.201e+01  8.448e+00   2.606
## floor13                                     2.530e+01  9.374e+00   2.698
## floor14                                     4.352e+01  1.042e+01   4.177
## floor15                                     2.023e+01  1.156e+01   1.750
## floor16                                     7.719e+00  1.335e+01   0.578
## floor17                                     3.578e+01  8.688e+00   4.118
## longitude.intervention                     -5.784e+01  3.823e+00 -15.130
## latitude.intervention                       1.759e+02  6.690e+00  26.298
## delta.status.preceding.selection.selection  4.355e-05  8.013e-06   5.436
## OSRM.estimated.distance                     1.608e-02  1.366e-03  11.773
## OSRM.estimated.duration                     5.828e-01  1.308e-02  44.565
## OSRM.estimated.speed                        2.515e+00  9.941e-02  25.295
## month2                                      1.091e+01  1.497e+00   7.283
## month3                                      3.130e+00  1.461e+00   2.142
## month4                                     -3.234e+00  1.487e+00  -2.175
## month5                                     -2.860e+00  1.463e+00  -1.955
## month6                                      4.627e+00  1.458e+00   3.174
## month7                                      3.177e+00  1.445e+00   2.198
## month8                                     -1.017e+01  1.519e+00  -6.695
## month10                                     9.886e+00  1.457e+00   6.787
## month11                                     1.607e+01  1.460e+00  11.007
## month12                                     1.195e+01  1.448e+00   8.251
## weekdaysjeudi                               1.631e+01  1.197e+00  13.623
## weekdayslundi                               1.101e+01  1.182e+00   9.314
## weekdaysmardi                               1.706e+01  1.199e+00  14.238
## weekdaysmercredi                            1.737e+01  1.198e+00  14.503
## weekdayssamedi                              5.466e+00  1.208e+00   4.525
## weekdaysvendredi                            1.691e+01  1.188e+00  14.234
## hours01                                     1.080e+01  2.435e+00   4.434
## hours02                                     1.605e+01  2.589e+00   6.199
## hours03                                     1.976e+01  2.658e+00   7.434
## hours04                                     2.467e+01  2.773e+00   8.896
## hours05                                     2.657e+01  2.813e+00   9.446
## hours06                                     2.054e+01  2.726e+00   7.533
## hours07                                     1.539e+01  2.540e+00   6.060
## hours08                                     2.229e+01  2.336e+00   9.545
## hours09                                     2.714e+01  2.234e+00  12.147
## hours10                                     1.002e+01  2.198e+00   4.557
## hours11                                     8.148e+00  2.173e+00   3.750
## hours12                                    -4.206e+00  2.138e+00  -1.967
## hours13                                    -4.523e+00  2.143e+00  -2.110
## hours14                                     1.598e+00  2.161e+00   0.739
## hours15                                     9.796e+00  2.179e+00   4.496
## hours16                                     8.533e+00  2.191e+00   3.894
## hours17                                     2.195e+01  2.180e+00  10.069
## hours18                                     2.327e+01  2.164e+00  10.752
## hours19                                     9.492e+00  2.134e+00   4.449
## hours20                                    -5.096e+00  2.142e+00  -2.378
## hours21                                    -1.355e+01  2.170e+00  -6.244
## hours22                                    -1.204e+01  2.198e+00  -5.478
## hours23                                    -8.427e+00  2.256e+00  -3.736
## emergency.vehicle.type.regroup              6.517e+00  1.550e+00   4.205
## rescue.center.regroup                      -5.469e+00  1.280e+00  -4.274
## location.of.the.event.regroup               1.549e+01  7.811e-01  19.831
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## n2                                          < 2e-16 ***
## n3                                          < 2e-16 ***
## n4                                         6.32e-14 ***
## n5                                         0.000293 ***
## n6                                         0.063603 .  
## n7                                         0.992993    
## n8                                         0.448771    
## n10                                        0.040778 *  
## n11                                        0.621810    
## n12                                        0.099993 .  
## n14                                        0.223106    
## alert.reason.category2                      < 2e-16 ***
## alert.reason.category3                      < 2e-16 ***
## alert.reason.category4                     0.193300    
## alert.reason.category5                      < 2e-16 ***
## alert.reason.category6                      < 2e-16 ***
## alert.reason.category7                      < 2e-16 ***
## alert.reason.category8                     3.78e-09 ***
## alert.reason.category9                     0.706379    
## intervention.on.public.roads1              2.34e-13 ***
## floor-1                                    0.231304    
## floor0                                     0.000747 ***
## floor1                                     0.000468 ***
## floor2                                     1.08e-05 ***
## floor3                                     2.45e-05 ***
## floor4                                     8.14e-06 ***
## floor5                                     9.74e-06 ***
## floor6                                     2.31e-06 ***
## floor7                                     0.000420 ***
## floor8                                     2.66e-08 ***
## floor9                                     4.35e-10 ***
## floor10                                    0.000482 ***
## floor11                                    0.000810 ***
## floor12                                    0.009161 ** 
## floor13                                    0.006966 ** 
## floor14                                    2.96e-05 ***
## floor15                                    0.080091 .  
## floor16                                    0.563094    
## floor17                                    3.83e-05 ***
## longitude.intervention                      < 2e-16 ***
## latitude.intervention                       < 2e-16 ***
## delta.status.preceding.selection.selection 5.47e-08 ***
## OSRM.estimated.distance                     < 2e-16 ***
## OSRM.estimated.duration                     < 2e-16 ***
## OSRM.estimated.speed                        < 2e-16 ***
## month2                                     3.27e-13 ***
## month3                                     0.032179 *  
## month4                                     0.029646 *  
## month5                                     0.050628 .  
## month6                                     0.001504 ** 
## month7                                     0.027930 *  
## month8                                     2.16e-11 ***
## month10                                    1.15e-11 ***
## month11                                     < 2e-16 ***
## month12                                     < 2e-16 ***
## weekdaysjeudi                               < 2e-16 ***
## weekdayslundi                               < 2e-16 ***
## weekdaysmardi                               < 2e-16 ***
## weekdaysmercredi                            < 2e-16 ***
## weekdayssamedi                             6.05e-06 ***
## weekdaysvendredi                            < 2e-16 ***
## hours01                                    9.25e-06 ***
## hours02                                    5.69e-10 ***
## hours03                                    1.06e-13 ***
## hours04                                     < 2e-16 ***
## hours05                                     < 2e-16 ***
## hours06                                    4.99e-14 ***
## hours07                                    1.36e-09 ***
## hours08                                     < 2e-16 ***
## hours09                                     < 2e-16 ***
## hours10                                    5.19e-06 ***
## hours11                                    0.000177 ***
## hours12                                    0.049232 *  
## hours13                                    0.034819 *  
## hours14                                    0.459723    
## hours15                                    6.93e-06 ***
## hours16                                    9.85e-05 ***
## hours17                                     < 2e-16 ***
## hours18                                     < 2e-16 ***
## hours19                                    8.64e-06 ***
## hours20                                    0.017385 *  
## hours21                                    4.26e-10 ***
## hours22                                    4.30e-08 ***
## hours23                                    0.000187 ***
## emergency.vehicle.type.regroup             2.61e-05 ***
## rescue.center.regroup                      1.92e-05 ***
## location.of.the.event.regroup               < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127.9 on 163901 degrees of freedom
## Multiple R-squared:  0.3732, Adjusted R-squared:  0.3728 
## F-statistic:  1122 on 87 and 163901 DF,  p-value: < 2.2e-16

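Most of our coefficients are significant at the 5% level. The exceptions are a few factor levels: n6, n7, n8, n11, n12, n14, alert.reason.category4, alert.reason.category9, floor-1, floor15, floor16, month5 and hours14.

7.2.2 Y1 : Multicollinearity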

library(sandwich)
library(car)

vif(lm1)
##                                                 GVIF Df GVIF^(1/(2*Df))
## n                                           1.714904 11        1.024819
## alert.reason.category                       2.084703  8        1.046985
## intervention.on.public.roads                1.342675  1        1.158739
## floor                                       1.418573 19        1.009244
## longitude.intervention                      1.086565  1        1.042384
## latitude.intervention                       1.073770  1        1.036229
## delta.status.preceding.selection.selection  1.198188  1        1.094618
## OSRM.estimated.distance                    32.351598  1        5.687847
## OSRM.estimated.duration                    23.116192  1        4.807930
## OSRM.estimated.speed                        4.634243  1        2.152729
## month                                       1.030278 10        1.001493
## weekdays                                    1.023036  6        1.001900
## hours                                       1.079047 23        1.001655
## emergency.vehicle.type.regroup              1.452641  1        1.205255
## rescue.center.regroup                       1.055779  1        1.027511
## location.of.the.event.regroup               1.519869  1        1.232830
#qualitative variables

Except for the OSRM variables (GVIF of about 32 for OSRM.estimated.distance and 23 for OSRM.estimated.duration), no predictor shows strong collinearity. We decide to keep them as they are important for our model.
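To illustrate where such large values come from, here is a minimal sketch of the variance inflation factor computed by hand, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data are simulated and the helper `vif_by_hand` is hypothetical; it handles numeric predictors only, whereas `car::vif` reports a generalized VIF for factors.

```r
# Simulated predictors: duration is almost proportional to distance
set.seed(7)
n <- 1000
distance  <- runif(n, 500, 10000)               # hypothetical metres
duration  <- distance / 8 + rnorm(n, sd = 60)   # hypothetical seconds
hour_load <- rnorm(n)                           # unrelated predictor
y <- 100 + 0.5 * duration + 5 * hour_load + rnorm(n, sd = 50)
fit <- lm(y ~ distance + duration + hour_load)

# VIF of each numeric predictor: regress it on all the others
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(fit)
vifs
```

Two predictors that carry nearly the same information inflate each other's VIF, while the unrelated predictor stays close to 1. This is the pattern we see between the OSRM distance and duration.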

7.2.3 Y1 : Residuals analysis

What we have done so far is not enough to validate the model. We also need to study the residuals of the selected model: if the hypotheses on the residuals do not hold (see course), the tests on the coefficients are not valid.

7.2.3.1 Is the mean of the residuals 0?
#mean of residuals
summary(lm1$residuals) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -672.05  -80.89  -23.92    0.00   52.61  876.20

Yes, it is. This is expected: OLS residuals always have mean zero when the model includes an intercept.

7.2.3.2 Are residuals normally distributed?

Normality tests : Jarque-Bera and Anderson-Darling (the Shapiro-Wilk test in R is limited to 5000 observations). H0 : normality, H1 : no normality

library(tseries)

jarque.bera.test(lm1$residuals)
## 
##  Jarque Bera Test
## 
## data:  lm1$residuals
## X-squared = 165384, df = 2, p-value < 2.2e-16
#install.packages("nortest")
library(nortest)
ad.test(lm1$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  lm1$residuals
## A = 3622.5, p-value < 2.2e-16

Both p-values are far below 5%, so we reject H0: the residuals are not normally distributed. NB : the non-normality could be caused by outliers.
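To see what drives such a large Jarque-Bera statistic, the sketch below recomputes it by hand from the sample skewness and kurtosis, JB = n/6 * (S^2 + (K - 3)^2 / 4), compared to a chi-squared distribution with 2 degrees of freedom. The helper `jb_by_hand` and the two simulated residual vectors are hypothetical and only illustrate the formula.

```r
# Jarque-Bera statistic from sample skewness (S) and kurtosis (K)
jb_by_hand <- function(e) {
  n  <- length(e)
  d  <- e - mean(e)
  m2 <- mean(d^2)
  skew <- mean(d^3) / m2^1.5
  kurt <- mean(d^4) / m2^2
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(skewness = skew, kurtosis = kurt, JB = stat,
    p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(3)
normal_res <- rnorm(5000)      # symmetric, light tails
skewed_res <- rexp(5000) - 1   # right-skewed, heavy right tail

jb_normal <- jb_by_hand(normal_res)
jb_skewed <- jb_by_hand(skewed_res)
rbind(normal = jb_normal, skewed = jb_skewed)
```

Because the statistic grows linearly with n, even a moderate skewness leads to a very small p-value on a sample as large as ours, so the QQ-plot is a useful complement to judge how severe the departure really is.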

qqnorm(lm1$residuals)
qqline(lm1$residuals)

7.2.3.3 Are residuals homoskedastic?
  1. First model which is not considering heteroskedasticity
# Residuals of the Y1 model against the fitted values of the same model
plot(lm1$residuals ~ lm1$fitted.values)

library(lmtest)

# Breusch-Pagan test H0 : homoskedasticity against H1 : heteroskedasticity 

bptest(lm1)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 1115.7, df = 87, p-value < 2.2e-16

The p-value is << 5%, so we reject H0: the residuals are heteroskedastic.

In case of heteroskedasticity, we have to use a robust standard error estimator. Otherwise, all our t-tests will be wrong.
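The studentized Breusch-Pagan statistic can be reproduced with an auxiliary regression: regress the squared residuals on the regressors and take n times the R-squared, compared to a chi-squared distribution with as many degrees of freedom as there are regressors. The sketch below does this on simulated data, not on our dataset, and the names `x`, `y`, `fit` and `aux` are hypothetical.

```r
# Simulated data whose error spread increases with x
set.seed(11)
n <- 2000
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressors
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) BP statistic
bp_df   <- length(coef(aux)) - 1        # number of regressors, intercept excluded
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

This is why the test above reports df = 87: it is the number of regressors in the model, intercept excluded.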

  2. Second model taking into account heteroskedasticity
library(sandwich)

#Calculate the robust covariance matrix

vcov_y1 <- vcovHC(lm1, type = "HC1")

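# Note: this call passes lm (the Y0 model) instead of lm1, so the table below
# lists the Y0 coefficient names and estimates, not those of the Y1 model.
# The intended call for Y1 is coeftest(lm1, vcov. = vcov_y1).
coeftest(lm, vcov. = vcov_y1)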
## 
## t test of coefficients:
## 
##                                               Estimate  Std. Error  t value
## (Intercept)                                -1.7421e+03  3.4100e+02  -5.1086
## alert.reason.category2                     -1.8685e+01  2.4176e+00  -7.7286
## alert.reason.category3                     -1.6990e+01  1.4315e+00 -11.8681
## alert.reason.category4                      3.4795e+00  4.3607e+00   0.7979
## alert.reason.category5                      1.8047e+01  8.2063e+00   2.1991
## alert.reason.category6                      1.2656e+01  2.3447e+00   5.3978
## alert.reason.category7                      2.6938e+01  7.9098e+00   3.4056
## alert.reason.category8                      7.0385e+00  1.0666e+01   0.6599
## alert.reason.category9                      3.7669e+00  2.1490e+00   1.7529
## longitude.intervention                      2.4304e+01  3.8166e+00   6.3680
## latitude.intervention                       3.8031e+01  6.9665e+00   5.4591
## delta.status.preceding.selection.selection  6.2062e-05  8.5934e-06   7.2220
## OSRM.estimated.distance                    -2.4948e-03  1.6595e-03  -1.5033
## OSRM.estimated.duration                     4.4547e-02  1.5425e-02   2.8879
## OSRM.estimated.speed                        1.1241e-01  1.0981e-01   1.0237
## hours01                                     1.5025e+01  2.4279e+00   6.1884
## hours02                                     2.7699e+01  2.5871e+00  10.7067
## hours03                                     3.2377e+01  2.6714e+00  12.1197
## hours04                                     3.5306e+01  2.8289e+00  12.4805
## hours05                                     3.5755e+01  2.8958e+00  12.3472
## hours06                                     1.8906e+01  2.7097e+00   6.9770
## hours07                                    -1.9443e+01  2.4860e+00  -7.8212
## hours08                                    -3.1067e+01  2.3630e+00 -13.1474
## hours09                                    -2.6510e+01  2.2702e+00 -11.6776
## hours10                                    -3.6849e+01  2.1661e+00 -17.0111
## hours11                                    -3.9176e+01  2.1373e+00 -18.3292
## hours12                                    -4.0035e+01  2.0786e+00 -19.2611
## hours13                                    -3.5994e+01  2.0945e+00 -17.1850
## hours14                                    -3.6671e+01  2.1299e+00 -17.2172
## hours15                                    -3.5741e+01  2.1729e+00 -16.4491
## hours16                                    -3.7403e+01  2.1440e+00 -17.4454
## hours17                                    -3.4015e+01  2.1808e+00 -15.5970
## hours18                                    -2.9320e+01  2.1560e+00 -13.5990
## hours19                                    -3.3562e+01  2.0890e+00 -16.0666
## hours20                                    -3.3535e+01  2.0969e+00 -15.9927
## hours21                                    -3.2491e+01  2.0905e+00 -15.5424
## hours22                                    -3.0684e+01  2.1357e+00 -14.3671
## hours23                                    -1.8318e+01  2.1904e+00  -8.3627
## emergency.vehicle.type.regroup             -7.7724e+00  1.5459e+00  -5.0279
## rescue.center.regroup                       9.6752e+00  1.2649e+00   7.6490
##                                             Pr(>|t|)    
## (Intercept)                                3.248e-07 ***
## alert.reason.category2                     1.094e-14 ***
## alert.reason.category3                     < 2.2e-16 ***
## alert.reason.category4                     0.4249280    
## alert.reason.category5                     0.0278690 *  
## alert.reason.category6                     6.757e-08 ***
## alert.reason.category7                     0.0006603 ***
## alert.reason.category8                     0.5093228    
## alert.reason.category9                     0.0796239 .  
## longitude.intervention                     1.920e-10 ***
## latitude.intervention                      4.794e-08 ***
## delta.status.preceding.selection.selection 5.144e-13 ***
## OSRM.estimated.distance                    0.1327596    
## OSRM.estimated.duration                    0.0038791 ** 
## OSRM.estimated.speed                       0.3059961    
## hours01                                    6.091e-10 ***
## hours02                                    < 2.2e-16 ***
## hours03                                    < 2.2e-16 ***
## hours04                                    < 2.2e-16 ***
## hours05                                    < 2.2e-16 ***
## hours06                                    3.027e-12 ***
## hours07                                    5.262e-15 ***
## hours08                                    < 2.2e-16 ***
## hours09                                    < 2.2e-16 ***
## hours10                                    < 2.2e-16 ***
## hours11                                    < 2.2e-16 ***
## hours12                                    < 2.2e-16 ***
## hours13                                    < 2.2e-16 ***
## hours14                                    < 2.2e-16 ***
## hours15                                    < 2.2e-16 ***
## hours16                                    < 2.2e-16 ***
## hours17                                    < 2.2e-16 ***
## hours18                                    < 2.2e-16 ***
## hours19                                    < 2.2e-16 ***
## hours20                                    < 2.2e-16 ***
## hours21                                    < 2.2e-16 ***
## hours22                                    < 2.2e-16 ***
## hours23                                    < 2.2e-16 ***
## emergency.vehicle.type.regroup             4.964e-07 ***
## rescue.center.regroup                      2.036e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient estimates are unchanged, and most coefficients remain significant with this model.

7.2.3.4 Are the residuals correlated ?
  1. First model, using Newey-West standard errors, which are robust to both heteroskedasticity and autocorrelation
library(sandwich)

#Calculate the robust covariance matrix

vcov_y1_2 <- NeweyWest(lm1)

coeftest(lm1, vcov. = vcov_y1_2)
## 
## t test of coefficients:
## 
##                                               Estimate  Std. Error  t value
## (Intercept)                                -8.3937e+03  3.4466e+02 -24.3538
## n2                                         -2.3983e+01  1.6322e+00 -14.6940
## n3                                         -3.4513e+01  2.5354e+00 -13.6122
## n4                                         -4.0549e+01  4.8942e+00  -8.2852
## n5                                         -2.8647e+01  7.2386e+00  -3.9576
## n6                                         -1.6739e+01  8.3987e+00  -1.9931
## n7                                         -9.8787e-02  8.3199e+00  -0.0119
## n8                                          2.9284e+01  2.4819e+01   1.1799
## n10                                         9.9059e+01  3.4526e+01   2.8691
## n11                                         1.4139e+01  2.5688e+01   0.5504
## n12                                         4.7163e+01  4.9617e+01   0.9505
## n14                                        -4.3307e+01  3.8384e+01  -1.1283
## alert.reason.category2                     -3.4441e+01  2.4043e+00 -14.3250
## alert.reason.category3                     -2.7220e+01  1.4337e+00 -18.9863
## alert.reason.category4                      5.7721e+00  4.3378e+00   1.3306
## alert.reason.category5                      7.4002e+01  8.2640e+00   8.9548
## alert.reason.category6                      3.6699e+01  2.3381e+00  15.6959
## alert.reason.category7                      7.9337e+01  7.9784e+00   9.9440
## alert.reason.category8                      6.1634e+01  1.0676e+01   5.7733
## alert.reason.category9                      8.4869e-01  2.1946e+00   0.3867
## intervention.on.public.roads1              -7.3918e+00  1.0015e+00  -7.3806
## floor-1                                     6.4313e+00  5.0683e+00   1.2689
## floor0                                      1.6445e+01  4.6849e+00   3.5102
## floor1                                      1.7446e+01  4.8070e+00   3.6293
## floor2                                      2.2062e+01  4.8516e+00   4.5474
## floor3                                      2.1271e+01  4.8851e+00   4.3543
## floor4                                      2.2677e+01  4.9151e+00   4.6139
## floor5                                      2.2869e+01  4.9865e+00   4.5862
## floor6                                      2.5045e+01  5.1395e+00   4.8731
## floor7                                      1.9702e+01  5.4958e+00   3.5850
## floor8                                      3.3456e+01  5.9426e+00   5.6298
## floor9                                      4.0799e+01  6.7665e+00   6.0296
## floor10                                     2.4712e+01  6.8837e+00   3.5899
## floor11                                     2.5868e+01  7.5053e+00   3.4466
## floor12                                     2.2015e+01  8.3297e+00   2.6429
## floor13                                     2.5296e+01  8.7284e+00   2.8982
## floor14                                     4.3515e+01  1.1507e+01   3.7817
## floor15                                     2.0228e+01  1.2042e+01   1.6798
## floor16                                     7.7191e+00  1.2224e+01   0.6315
## floor17                                     3.5776e+01  9.3643e+00   3.8205
## longitude.intervention                     -5.7841e+01  3.7733e+00 -15.3293
## latitude.intervention                       1.7592e+02  7.0420e+00  24.9822
## delta.status.preceding.selection.selection  4.3553e-05  8.7107e-06   4.9999
## OSRM.estimated.distance                     1.6084e-02  1.6395e-03   9.8102
## OSRM.estimated.duration                     5.8277e-01  1.5192e-02  38.3612
## OSRM.estimated.speed                        2.5145e+00  1.0942e-01  22.9806
## month2                                      1.0905e+01  1.4728e+00   7.4046
## month3                                      3.1297e+00  1.4355e+00   2.1802
## month4                                     -3.2340e+00  1.4559e+00  -2.2212
## month5                                     -2.8598e+00  1.4199e+00  -2.0141
## month6                                      4.6272e+00  1.4426e+00   3.2076
## month7                                      3.1767e+00  1.4115e+00   2.2505
## month8                                     -1.0168e+01  1.4589e+00  -6.9697
## month10                                     9.8865e+00  1.4550e+00   6.7950
## month11                                     1.6068e+01  1.4423e+00  11.1403
## month12                                     1.1947e+01  1.4177e+00   8.4266
## weekdaysjeudi                               1.6307e+01  1.1947e+00  13.6493
## weekdayslundi                               1.1009e+01  1.1815e+00   9.3172
## weekdaysmardi                               1.7064e+01  1.1997e+00  14.2237
## weekdaysmercredi                            1.7372e+01  1.1760e+00  14.7712
## weekdayssamedi                              5.4657e+00  1.1899e+00   4.5933
## weekdaysvendredi                            1.6909e+01  1.1882e+00  14.2314
## hours01                                     1.0796e+01  2.4133e+00   4.4736
## hours02                                     1.6051e+01  2.5992e+00   6.1751
## hours03                                     1.9761e+01  2.6616e+00   7.4247
## hours04                                     2.4673e+01  2.8772e+00   8.5753
## hours05                                     2.6569e+01  2.8667e+00   9.2682
## hours06                                     2.0536e+01  2.7127e+00   7.5704
## hours07                                     1.5391e+01  2.5025e+00   6.1502
## hours08                                     2.2293e+01  2.3568e+00   9.4587
## hours09                                     2.7140e+01  2.2702e+00  11.9548
## hours10                                     1.0016e+01  2.1807e+00   4.5930
## hours11                                     8.1477e+00  2.1549e+00   3.7810
## hours12                                    -4.2055e+00  2.0920e+00  -2.0103
## hours13                                    -4.5227e+00  2.0872e+00  -2.1669
## hours14                                     1.5978e+00  2.1512e+00   0.7428
## hours15                                     9.7961e+00  2.2087e+00   4.4352
## hours16                                     8.5334e+00  2.1426e+00   3.9828
## hours17                                     2.1954e+01  2.2034e+00   9.9634
## hours18                                     2.3269e+01  2.1767e+00  10.6900
## hours19                                     9.4924e+00  2.0881e+00   4.5460
## hours20                                    -5.0956e+00  2.0868e+00  -2.4419
## hours21                                    -1.3548e+01  2.0773e+00  -6.5222
## hours22                                    -1.2043e+01  2.1409e+00  -5.6250
## hours23                                    -8.4270e+00  2.1788e+00  -3.8678
## emergency.vehicle.type.regroup              6.5173e+00  1.5492e+00   4.2069
## rescue.center.regroup                      -5.4685e+00  1.2975e+00  -4.2147
## location.of.the.event.regroup               1.5491e+01  7.7500e-01  19.9884
##                                             Pr(>|t|)    
## (Intercept)                                < 2.2e-16 ***
## n2                                         < 2.2e-16 ***
## n3                                         < 2.2e-16 ***
## n4                                         < 2.2e-16 ***
## n5                                         7.574e-05 ***
## n6                                         0.0462563 *  
## n7                                         0.9905264    
## n8                                         0.2380499    
## n10                                        0.0041171 ** 
## n11                                        0.5820385    
## n12                                        0.3418390    
## n14                                        0.2592041    
## alert.reason.category2                     < 2.2e-16 ***
## alert.reason.category3                     < 2.2e-16 ***
## alert.reason.category4                     0.1833083    
## alert.reason.category5                     < 2.2e-16 ***
## alert.reason.category6                     < 2.2e-16 ***
## alert.reason.category7                     < 2.2e-16 ***
## alert.reason.category8                     7.788e-09 ***
## alert.reason.category9                     0.6989664    
## intervention.on.public.roads1              1.584e-13 ***
## floor-1                                    0.2044681    
## floor0                                     0.0004479 ***
## floor1                                     0.0002843 ***
## floor2                                     5.435e-06 ***
## floor3                                     1.336e-05 ***
## floor4                                     3.956e-06 ***
## floor5                                     4.517e-06 ***
## floor6                                     1.100e-06 ***
## floor7                                     0.0003372 ***
## floor8                                     1.807e-08 ***
## floor9                                     1.647e-09 ***
## floor10                                    0.0003309 ***
## floor11                                    0.0005678 ***
## floor12                                    0.0082201 ** 
## floor13                                    0.0037541 ** 
## floor14                                    0.0001558 ***
## floor15                                    0.0930038 .  
## floor16                                    0.5277364    
## floor17                                    0.0001332 ***
## longitude.intervention                     < 2.2e-16 ***
## latitude.intervention                      < 2.2e-16 ***
## delta.status.preceding.selection.selection 5.741e-07 ***
## OSRM.estimated.distance                    < 2.2e-16 ***
## OSRM.estimated.duration                    < 2.2e-16 ***
## OSRM.estimated.speed                       < 2.2e-16 ***
## month2                                     1.322e-13 ***
## month3                                     0.0292475 *  
## month4                                     0.0263359 *  
## month5                                     0.0439966 *  
## month6                                     0.0013386 ** 
## month7                                     0.0244171 *  
## month8                                     3.189e-12 ***
## month10                                    1.087e-11 ***
## month11                                    < 2.2e-16 ***
## month12                                    < 2.2e-16 ***
## weekdaysjeudi                              < 2.2e-16 ***
## weekdayslundi                              < 2.2e-16 ***
## weekdaysmardi                              < 2.2e-16 ***
## weekdaysmercredi                           < 2.2e-16 ***
## weekdayssamedi                             4.366e-06 ***
## weekdaysvendredi                           < 2.2e-16 ***
## hours01                                    7.696e-06 ***
## hours02                                    6.628e-10 ***
## hours03                                    1.136e-13 ***
## hours04                                    < 2.2e-16 ***
## hours05                                    < 2.2e-16 ***
## hours06                                    3.741e-14 ***
## hours07                                    7.758e-10 ***
## hours08                                    < 2.2e-16 ***
## hours09                                    < 2.2e-16 ***
## hours10                                    4.373e-06 ***
## hours11                                    0.0001563 ***
## hours12                                    0.0444024 *  
## hours13                                    0.0302411 *  
## hours14                                    0.4576334    
## hours15                                    9.203e-06 ***
## hours16                                    6.813e-05 ***
## hours17                                    < 2.2e-16 ***
## hours18                                    < 2.2e-16 ***
## hours19                                    5.473e-06 ***
## hours20                                    0.0146132 *  
## hours21                                    6.950e-11 ***
## hours22                                    1.858e-08 ***
## hours23                                    0.0001099 ***
## emergency.vehicle.type.regroup             2.591e-05 ***
## rescue.center.regroup                      2.502e-05 ***
## location.of.the.event.regroup              < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient estimates are unchanged, since the Newey-West correction only adjusts the standard errors; most coefficients remain significant with this model.
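To illustrate why a robust covariance matrix changes the standard errors and t values but not the estimates, here is a minimal base-R sketch on simulated data. It uses the simpler White (HC0) sandwich estimator, not the Newey-West estimator applied above, and every name in it (x, y, fit, vcov_hc0) is an assumption made for the illustration.

```r
# Simulated data whose error variance grows with x (heteroskedasticity)
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)   # each row of X is scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

# Same estimates, different standard errors and t values
se_classic <- sqrt(diag(vcov(fit)))
se_robust <- sqrt(diag(vcov_hc0))
t_robust <- coef(fit) / se_robust
cbind(estimate = coef(fit), se_classic, se_robust, t_robust)
```

The estimate column comes straight from `coef(fit)` in both cases; only the covariance matrix passed to the test differs.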

7.2.4 Y1 : Prediction

pred_reg_y1 <- predict(lm1,testset1.regroup)
pred_reg_y1 %>% head
##        1        2        3        4        5        6 
## 407.3334 298.8634 397.5383 374.2344 381.2628 386.3228
testset$delta.departure.presentation %>% head
## [1] 376 268 409 678 432 236

Compute RMSE, R-squared and MAE

postResample(pred = pred_reg_y1, obs = testset1.regroup$delta.departure.presentation)
##        RMSE    Rsquared         MAE 
## 128.1085800   0.3756295  92.7618495
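As a rough check of what these three metrics mean, they can be recomputed by hand. The sketch below uses only the six predictions and observations printed above, so its values will differ from the full test-set figures. It assumes that the Rsquared reported by postResample is the squared correlation between predictions and observations, which is our understanding of caret's default.

```r
# First six predictions and observed values, as printed above
pred <- c(407.3334, 298.8634, 397.5383, 374.2344, 381.2628, 386.3228)
obs <- c(376, 268, 409, 678, 432, 236)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae <- mean(abs(pred - obs))        # mean absolute error
rsq <- cor(pred, obs)^2             # squared correlation (assumed definition)

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

RMSE penalises large misses such as the fourth observation (678 against a prediction of about 374) more heavily than MAE does.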

Prediction for x_test1

testset1.regroup %>% str
## Classes 'data.table' and 'data.frame':   40998 obs. of  17 variables:
##  $ n                                         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 3 3 1 3 3 3 3 3 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 2 1 1 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 6 3 4 3 10 6 3 3 3 3 ...
##  $ longitude.intervention                    : num  2.28 2.3 2.2 2.5 2.4 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 49 48.9 ...
##  $ delta.status.preceding.selection.selection: int  16251 606 4693 86 1485 917 17 161 2061 1022 ...
##  $ OSRM.estimated.distance                   : num  2347 1812 2586 2442 2314 ...
##  $ OSRM.estimated.duration                   : num  218 198 280 247 315 ...
##  $ OSRM.estimated.speed                      : num  38.8 33 33.2 35.5 26.4 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 10 1 2 1 1 1 2 2 2 2 ...
##  $ emergency.vehicle.type.regroup            : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ rescue.center.regroup                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ location.of.the.event.regroup             : num  0 0 1 0 1 0 0 0 1 1 ...
##  $ delta.departure.presentation              : int  376 268 409 678 432 236 302 225 404 294 ...
##  - attr(*, ".internal.selfref")=<externalptr>
x_test1.regroup %>% str
## Classes 'data.table' and 'data.frame':   108033 obs. of  16 variables:
##  $ n                                         : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.28 2.34 2.41 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ delta.status.preceding.selection.selection: int  2636 16243 597 1834 1341 2197 16 1312 263 437 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1078 1791 1451 ...
##  $ OSRM.estimated.duration                   : num  214 218 120 250 199 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 32.4 25.7 26.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
##  $ emergency.vehicle.type.regroup            : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ rescue.center.regroup                     : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ location.of.the.event.regroup             : num  1 0 1 0 0 0 0 0 0 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#pred_reg_final_y1 <- predict(lm1,x_test1.regroup)

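New levels in x_test: the factor n has levels "9" and "13" in x_test that the model was not fitted on, so predict() cannot be applied directly (hence the commented-out call above).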

testset1.regroup$n %>% levels
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "10" "11" "12" "14"
x_test1.regroup$n %>% levels
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
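The comparison of the two level sets can be automated with setdiff. This is a small sketch that copies the level vectors printed above; the vector n_test and the other variable names are hypothetical.

```r
# Levels of n printed above for the fitted data and for x_test
train_levels <- c("1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "14")
test_levels <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13")

# Levels that appear in x_test but were never seen by the model
new_levels <- setdiff(test_levels, train_levels)
new_levels

# Hypothetical factor column: flag the rows that carry an unseen level
n_test <- factor(c("1", "9", "2", "13"), levels = test_levels)
unseen <- as.character(n_test) %in% new_levels
which(unseen)
```

Rows flagged this way are the ones for which a model containing n has no coefficient to apply.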

Retrain the model without the variable n

#Linear Regression,

options(max.print = 10000)

lm1 <- lm(delta.departure.presentation ~.,data = trainset1.regroup[,-c("n")])


summary(lm1)
## 
## Call:
## lm(formula = delta.departure.presentation ~ ., data = trainset1.regroup[, 
##     -c("n")])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -667.10  -81.01  -23.86   52.64  876.52 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -8.438e+03  3.273e+02 -25.782
## alert.reason.category2                     -2.348e+01  2.420e+00  -9.701
## alert.reason.category3                     -1.525e+01  1.397e+00 -10.917
## alert.reason.category4                      1.739e+01  4.381e+00   3.968
## alert.reason.category5                      8.633e+01  6.140e+00  14.059
## alert.reason.category6                      4.437e+01  2.562e+00  17.320
## alert.reason.category7                      8.787e+01  5.999e+00  14.648
## alert.reason.category8                      6.160e+01  1.046e+01   5.890
## alert.reason.category9                      1.235e+01  2.137e+00   5.780
## intervention.on.public.roads1              -6.499e+00  1.008e+00  -6.447
## floor-1                                     7.882e+00  5.373e+00   1.467
## floor0                                      1.967e+01  4.873e+00   4.037
## floor1                                      2.022e+01  4.984e+00   4.056
## floor2                                      2.480e+01  5.010e+00   4.950
## floor3                                      2.392e+01  5.037e+00   4.748
## floor4                                      2.517e+01  5.080e+00   4.955
## floor5                                      2.528e+01  5.168e+00   4.891
## floor6                                      2.749e+01  5.299e+00   5.188
## floor7                                      2.217e+01  5.583e+00   3.970
## floor8                                      3.566e+01  6.013e+00   5.931
## floor9                                      4.360e+01  6.537e+00   6.669
## floor10                                     2.747e+01  7.080e+00   3.880
## floor11                                     2.870e+01  7.725e+00   3.715
## floor12                                     2.525e+01  8.451e+00   2.988
## floor13                                     2.732e+01  9.382e+00   2.912
## floor14                                     4.616e+01  1.042e+01   4.428
## floor15                                     2.384e+01  1.157e+01   2.062
## floor16                                     1.061e+01  1.336e+01   0.794
## floor17                                     3.850e+01  8.693e+00   4.429
## longitude.intervention                     -5.811e+01  3.826e+00 -15.189
## latitude.intervention                       1.764e+02  6.693e+00  26.354
## delta.status.preceding.selection.selection  3.819e-05  8.007e-06   4.769
## OSRM.estimated.distance                     1.618e-02  1.367e-03  11.832
## OSRM.estimated.duration                     5.829e-01  1.309e-02  44.537
## OSRM.estimated.speed                        2.514e+00  9.949e-02  25.273
## month2                                      1.074e+01  1.498e+00   7.166
## month3                                      3.002e+00  1.462e+00   2.053
## month4                                     -3.308e+00  1.488e+00  -2.223
## month5                                     -2.923e+00  1.464e+00  -1.996
## month6                                      4.654e+00  1.459e+00   3.190
## month7                                      3.321e+00  1.446e+00   2.297
## month8                                     -1.029e+01  1.520e+00  -6.770
## month10                                     9.880e+00  1.458e+00   6.776
## month11                                     1.602e+01  1.461e+00  10.966
## month12                                     1.217e+01  1.449e+00   8.398
## weekdaysjeudi                               1.642e+01  1.198e+00  13.706
## weekdayslundi                               1.111e+01  1.183e+00   9.388
## weekdaysmardi                               1.713e+01  1.200e+00  14.277
## weekdaysmercredi                            1.752e+01  1.199e+00  14.616
## weekdayssamedi                              5.480e+00  1.209e+00   4.533
## weekdaysvendredi                            1.695e+01  1.189e+00  14.255
## hours01                                     1.089e+01  2.437e+00   4.466
## hours02                                     1.621e+01  2.592e+00   6.256
## hours03                                     1.940e+01  2.660e+00   7.293
## hours04                                     2.445e+01  2.776e+00   8.810
## hours05                                     2.640e+01  2.815e+00   9.376
## hours06                                     2.025e+01  2.729e+00   7.423
## hours07                                     1.508e+01  2.542e+00   5.935
## hours08                                     2.235e+01  2.338e+00   9.560
## hours09                                     2.706e+01  2.237e+00  12.101
## hours10                                     1.002e+01  2.200e+00   4.557
## hours11                                     8.252e+00  2.175e+00   3.795
## hours12                                    -4.245e+00  2.141e+00  -1.983
## hours13                                    -4.487e+00  2.145e+00  -2.092
## hours14                                     1.603e+00  2.163e+00   0.741
## hours15                                     9.746e+00  2.181e+00   4.469
## hours16                                     8.646e+00  2.193e+00   3.942
## hours17                                     2.194e+01  2.182e+00  10.054
## hours18                                     2.330e+01  2.166e+00  10.756
## hours19                                     9.505e+00  2.136e+00   4.450
## hours20                                    -5.059e+00  2.144e+00  -2.359
## hours21                                    -1.364e+01  2.172e+00  -6.280
## hours22                                    -1.189e+01  2.200e+00  -5.404
## hours23                                    -8.281e+00  2.258e+00  -3.668
## emergency.vehicle.type.regroup              1.252e+01  1.507e+00   8.308
## rescue.center.regroup                      -5.379e+00  1.280e+00  -4.201
## location.of.the.event.regroup               1.563e+01  7.817e-01  20.001
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## alert.reason.category2                      < 2e-16 ***
## alert.reason.category3                      < 2e-16 ***
## alert.reason.category4                     7.24e-05 ***
## alert.reason.category5                      < 2e-16 ***
## alert.reason.category6                      < 2e-16 ***
## alert.reason.category7                      < 2e-16 ***
## alert.reason.category8                     3.86e-09 ***
## alert.reason.category9                     7.48e-09 ***
## intervention.on.public.roads1              1.14e-10 ***
## floor-1                                    0.142399    
## floor0                                     5.41e-05 ***
## floor1                                     4.99e-05 ***
## floor2                                     7.42e-07 ***
## floor3                                     2.05e-06 ***
## floor4                                     7.22e-07 ***
## floor5                                     1.00e-06 ***
## floor6                                     2.12e-07 ***
## floor7                                     7.19e-05 ***
## floor8                                     3.01e-09 ***
## floor9                                     2.58e-11 ***
## floor10                                    0.000105 ***
## floor11                                    0.000203 ***
## floor12                                    0.002812 ** 
## floor13                                    0.003587 ** 
## floor14                                    9.52e-06 ***
## floor15                                    0.039231 *  
## floor16                                    0.426916    
## floor17                                    9.47e-06 ***
## longitude.intervention                      < 2e-16 ***
## latitude.intervention                       < 2e-16 ***
## delta.status.preceding.selection.selection 1.85e-06 ***
## OSRM.estimated.distance                     < 2e-16 ***
## OSRM.estimated.duration                     < 2e-16 ***
## OSRM.estimated.speed                        < 2e-16 ***
## month2                                     7.73e-13 ***
## month3                                     0.040057 *  
## month4                                     0.026251 *  
## month5                                     0.045951 *  
## month6                                     0.001425 ** 
## month7                                     0.021639 *  
## month8                                     1.29e-11 ***
## month10                                    1.24e-11 ***
## month11                                     < 2e-16 ***
## month12                                     < 2e-16 ***
## weekdaysjeudi                               < 2e-16 ***
## weekdayslundi                               < 2e-16 ***
## weekdaysmardi                               < 2e-16 ***
## weekdaysmercredi                            < 2e-16 ***
## weekdayssamedi                             5.83e-06 ***
## weekdaysvendredi                            < 2e-16 ***
## hours01                                    7.97e-06 ***
## hours02                                    3.95e-10 ***
## hours03                                    3.05e-13 ***
## hours04                                     < 2e-16 ***
## hours05                                     < 2e-16 ***
## hours06                                    1.15e-13 ***
## hours07                                    2.95e-09 ***
## hours08                                     < 2e-16 ***
## hours09                                     < 2e-16 ***
## hours10                                    5.20e-06 ***
## hours11                                    0.000148 ***
## hours12                                    0.047361 *  
## hours13                                    0.036482 *  
## hours14                                    0.458627    
## hours15                                    7.88e-06 ***
## hours16                                    8.08e-05 ***
## hours17                                     < 2e-16 ***
## hours18                                     < 2e-16 ***
## hours19                                    8.59e-06 ***
## hours20                                    0.018311 *  
## hours21                                    3.39e-10 ***
## hours22                                    6.52e-08 ***
## hours23                                    0.000245 ***
## emergency.vehicle.type.regroup              < 2e-16 ***
## rescue.center.regroup                      2.66e-05 ***
## location.of.the.event.regroup               < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 128 on 163912 degrees of freedom
## Multiple R-squared:  0.3719, Adjusted R-squared:  0.3716 
## F-statistic:  1277 on 76 and 163912 DF,  p-value: < 2.2e-16
pred_reg_final_y1 <- predict(lm1,x_test1.regroup[,-c("n")])

7.4 Ys : Compute global

pred_reg_ys=pred_reg_y0+pred_reg_y1
pred_reg=data.frame(pred_reg_y0,pred_reg_y1,pred_reg_ys)

pred_reg %>% head
##   pred_reg_y0 pred_reg_y1 pred_reg_ys
## 1    155.5281    407.3334    562.8615
## 2    153.3825    298.8634    452.2458
## 3    167.5792    397.5383    565.1175
## 4    186.7568    374.2344    560.9912
## 5    169.1919    381.2628    550.4547
## 6    165.1531    386.3228    551.4758
testset$delta.selection.presentation %>% head
## [1] 423 417 506 791 556 381
postResample(pred = pred_reg_ys, obs = testset$delta.departure.presentation)
##       RMSE   Rsquared        MAE 
## 188.733277   0.357252 166.213062

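Not too bad accuracy! Note, however, that the call above scores the global prediction pred_reg_ys against delta.departure.presentation; its matching target is delta.selection.presentation, so these metrics should be read with caution.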
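Since pred_reg_ys is the sum of the Y0 and Y1 predictions, its natural target is delta.selection.presentation. The sketch below compares the mean absolute error against both targets, using only the six rows printed above; the helper mae and the variable names are assumptions, and six rows are of course not representative of the whole test set.

```r
# Global predictions and observed targets for the first six rows printed above
pred_ys <- c(562.8615, 452.2458, 565.1175, 560.9912, 550.4547, 551.4758)
obs_y1 <- c(376, 268, 409, 678, 432, 236)  # delta.departure.presentation
obs_ys <- c(423, 417, 506, 791, 556, 381)  # delta.selection.presentation

# Hypothetical helper: mean absolute error
mae <- function(pred, obs) mean(abs(pred - obs))

mae_vs_y1 <- mae(pred_ys, obs_y1)  # global prediction against Y1 only
mae_vs_ys <- mae(pred_ys, obs_ys)  # global prediction against the global target
c(against_y1 = mae_vs_y1, against_ys = mae_vs_ys)
```

On these six rows the error is noticeably smaller against the global target, which is what one would expect if the Y0 part of the prediction is otherwise counted as pure bias.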

8. Actual predict : Compute Global for x_test

pred_reg_final_ys=pred_reg_final_y0+pred_reg_final_y1
pred_final=data.frame(pred_reg_final_y0,pred_reg_final_y1,pred_reg_final_ys)
pred_final %>% head
##   pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys
## 1          150.7865          322.0503          472.8368
## 2          163.3000          381.8173          545.1174
## 3          151.4585          270.3706          421.8291
## 4          166.5357          319.1771          485.7128
## 5          172.7904          283.1569          455.9473
## 6          172.5030          282.4379          454.9409
pred_final$id<-x_test$emergency.vehicle.selection

Change column order so that id comes first

pred_final <- pred_final[, c(4, 1, 2, 3)]

Convert into integer

# as.integer() truncates toward zero (e.g. 472.8368 becomes 472)
pred_final$pred_reg_final_y0 <- as.integer(pred_final$pred_reg_final_y0)
pred_final$pred_reg_final_y1 <- as.integer(pred_final$pred_reg_final_y1)
pred_final$pred_reg_final_ys <- as.integer(pred_final$pred_reg_final_ys)
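Because as.integer truncates rather than rounds, the integer columns do not always add up. A short sketch of the effect: the first pair of values is the first row of pred_final printed above, while the second pair is hypothetical, chosen so that the two parts and their sum truncate differently.

```r
# Row 1: values printed above; row 2: hypothetical values for illustration
y0 <- c(150.7865, 117.6)
y1 <- c(322.0503, 427.7)
ys <- y0 + y1

# Truncation, as done in the document
truncated <- data.frame(y0 = as.integer(y0), y1 = as.integer(y1), ys = as.integer(ys))

# Alternative: round to the nearest integer before converting
rounded <- data.frame(y0 = as.integer(round(y0)),
                      y1 = as.integer(round(y1)),
                      ys = as.integer(round(ys)))

# Do the integer parts still sum to the integer total?
parts_match <- truncated$y0 + truncated$y1 == truncated$ys
parts_match
```

Rounding does not guarantee consistent sums either. If that matters, the total can be recomputed from the integer parts instead of being converted on its own.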

Restore the original row order of x_test

pred_final %>% head
##        id pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys
## 1 4715068               150               322               472
## 2 4714816               163               381               545
## 3 4713710               151               270               421
## 4 4713748               166               319               485
## 5 4713778               172               283               455
## 6 4713812               172               282               454
id_order %>% head
##   x_test.emergency.vehicle.selection
## 1                            5271704
## 2                            5092931
## 3                            5153756
## 4                            5355572
## 5                            5178915
## 6                            5206885
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
##        id pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys  rank
## 1 4713710               151               270               421 14082
## 2 4713748               166               319               485 25965
## 3 4713778               172               283               455 14465
## 4 4713812               172               282               454 79530
## 5 4713821               195               583               779 47824
## 6 4713863               166               250               417  2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
##            id pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys rank
## 77897 5271704               117               427               545    1
## 59352 5092931               119               373               493    2
## 68483 5153756               119               216               336    3
## 91242 5355572               187               285               472    4
## 72358 5178915               131               278               410    5
## 77062 5206885               120               263               384    6
y_final %>% setDT
y_final=y_final[,-c("rank")]
sum(is.na(y_final))
## [1] 2
which(is.na(y_final))
## [1] 298933 406966
y_final[is.na(y_final)] <- 0
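The merge-then-sort-by-rank steps above restore the original row order. A shorter route to the same ordering is match(), sketched here on toy data; `pred_toy` and `id_toy` are hypothetical stand-ins for pred_final and the original id order, and the values are made up.

```r
# Toy stand-ins (hypothetical values) for pred_final and the original id order.
pred_toy <- data.frame(id = c(30L, 10L, 20L), pred = c(421L, 472L, 545L))
id_toy   <- c(10L, 20L, 30L, 40L)   # id 40 has no prediction

# match() gives, for each original id, its row position in pred_toy (NA if absent).
pos <- match(id_toy, pred_toy$id)

# Indexing with pos reorders the predictions to follow id_toy.
restored <- data.frame(id = id_toy, pred = pred_toy$pred[pos])
print(restored)
```

Unlike merge(..., all = FALSE), which drops ids without a match, this variant keeps every original id and leaves NA where no prediction exists, so missing rows stay visible.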

Write the CSV file, then use the Python script to generate the properly formatted CSV.

fwrite(y_final, "y_test.csv",sep=",")
First R2 score from public score


9. Xgboost : x_test

9.1 reduced x_test

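We keep only a subset of the variables: the OSRM data (estimated distance, duration and speed), intervention.on.public.roads, floor, alert.reason.category, longitude.intervention, latitude.intervention, delta.status.preceding.selection.selection, weekdays, month, hours and departed.from.its.rescue.center.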

data.fe %>% str
## Classes 'data.table' and 'data.frame':   204987 obs. of  22 variables:
##  $ n                                         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ emergency.vehicle.selection               : int  4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event                     : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type                    : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                             : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ delta.departure.presentation              : int  174 376 214 268 409 678 98 187 623 181 ...
##  $ delta.selection.presentation              : int  413 423 332 417 506 791 162 307 757 275 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
x_test %>% str
## Classes 'data.table' and 'data.frame':   108033 obs. of  19 variables:
##  $ n                                         : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ emergency.vehicle.selection               : int  4715068 4714816 4713710 4713748 4713778 4713812 4713821 4713863 4713872 4713878 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
##  $ location.of.the.event                     : Factor w/ 196 levels "100","101","102",..: 35 20 38 48 48 1 47 155 196 38 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.28 2.34 2.41 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 708 levels "1815","1823",..: 72 351 678 59 646 681 127 460 557 421 ...
##  $ emergency.vehicle.type                    : Factor w/ 66 levels "AR","BEAA BSPP",..: 21 53 62 9 25 62 25 62 62 62 ...
##  $ rescue.center                             : Factor w/ 91 levels "2418","2434",..: 42 3 15 42 17 70 65 30 35 58 ...
##  $ delta.status.preceding.selection.selection: int  2636 16243 597 1834 1341 2197 16 1312 263 437 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1078 1791 1451 ...
##  $ OSRM.estimated.duration                   : num  214 218 120 250 199 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 32.4 25.7 26.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
data.fe.reduced.0<-data.fe[,-c("delta.departure.presentation","delta.selection.presentation","rescue.center","emergency.vehicle.type","emergency.vehicle","n","location.of.the.event","emergency.vehicle.selection")]
x_test.reduced<-x_test[,-c("rescue.center","emergency.vehicle.type","emergency.vehicle","n","location.of.the.event","emergency.vehicle.selection")]
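The str outputs above show that some factors have different level sets in data.fe and x_test (for example emergency.vehicle.type: 41 levels versus 66). Because data.matrix, used below to build the xgb.DMatrix objects, replaces each factor by its internal integer code, mismatched levels can assign different codes to the same category. The following sketch illustrates the effect on toy factors; the labels and variable names are hypothetical.

```r
# Toy factors (hypothetical labels) with different level sets,
# mimicking a factor that has more levels in x_test than in data.fe.
train_type <- factor(c("AR", "VSAV", "PSE"))
test_type  <- factor(c("AR", "BEAA BSPP", "VSAV", "PSE"))

# data.matrix() replaces each factor by its internal integer code.
train_codes <- data.matrix(data.frame(type = train_type))[, "type"]
test_codes  <- data.matrix(data.frame(type = test_type))[, "type"]
print(train_codes)  # codes follow the training level order
print(test_codes)   # the extra level shifts the codes of "PSE" and "VSAV"

# Re-using the training levels keeps codes aligned; unseen levels become NA.
aligned <- as.integer(factor(as.character(test_type), levels = levels(train_type)))
print(aligned)
```

The factors kept in the reduced tables (alert.reason.category, floor, month, weekdays, hours) show the same number of levels in both str outputs, so they are less exposed to this problem.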

Y0

Validation

set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.0$delta.selection.departure, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.0[trainIndex,]
valid=data.fe.reduced.0[-trainIndex,]
data.fe.reduced.0 %>% str
## Classes 'data.table' and 'data.frame':   204987 obs. of  14 variables:
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
x_test.reduced %>% str
## Classes 'data.table' and 'data.frame':   108033 obs. of  13 variables:
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.28 2.34 2.41 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ delta.status.preceding.selection.selection: int  2636 16243 597 1834 1341 2197 16 1312 263 437 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1078 1791 1451 ...
##  $ OSRM.estimated.duration                   : num  214 218 120 250 199 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 32.4 25.7 26.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
foo <- train %>% select(-delta.selection.departure)
bar <- valid %>% select(-delta.selection.departure)

dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.selection.departure)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.selection.departure)
dtest <- xgb.DMatrix(data.matrix(x_test.reduced))
gb_params_final <- list(colsample_bytree = 0.7, #variables per tree 
                   subsample = 0.7, #data subset per tree 
                   booster = "gbtree",
                   max_depth = 5, #tree levels
                   eta = 0.3, #shrinkage
                   eval_metric = "rmse", 
                   objective = "reg:linear",
                   seed = 4321
                   )

watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
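# Note: this call uses xgb_params (defined earlier in the document),
# not the gb_params_final list built just above.
gb_dt_final <- xgb.train(params = xgb_params,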
                   data = dtrain,
                   print_every_n = 50,
                   watchlist = watchlist,
                   nrounds = 300)
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 50, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:36:02] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:110.043587   valid-rmse:110.268089 
## [51] train-rmse:47.278492    valid-rmse:47.751270 
## [101]    train-rmse:46.977036    valid-rmse:47.623722 
## [151]    train-rmse:46.771313    valid-rmse:47.570621 
## [201]    train-rmse:46.627319    valid-rmse:47.560299 
## [251]    train-rmse:46.464745    valid-rmse:47.514023 
## [300]    train-rmse:46.327545    valid-rmse:47.511181

After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. The run is capped at 200 rounds, and the early-stopping parameter (10 rounds) stops the CV fitting once the test RMSE no longer improves with additional steps.

xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=200)
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:109.420416+0.495150  test-rmse:109.430615+0.488056 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:84.520085+0.587442   test-rmse:84.538504+0.502799 
## [3]  train-rmse:69.349644+0.893923   test-rmse:69.372398+0.827324 
## [4]  train-rmse:60.300552+1.127178   test-rmse:60.337537+0.937264 
## [5]  train-rmse:54.740454+0.546901   test-rmse:54.771004+0.331087 
## [6]  train-rmse:51.905637+0.651854   test-rmse:51.932079+0.465446 
## [7]  train-rmse:50.213223+0.412485   test-rmse:50.245101+0.387258 
## [8]  train-rmse:49.340348+0.339408   test-rmse:49.366396+0.407458 
## [9]  train-rmse:48.797874+0.160656   test-rmse:48.818684+0.373239 
## [10] train-rmse:48.417364+0.083016   test-rmse:48.444783+0.350240 
## [11] train-rmse:48.228524+0.120281   test-rmse:48.266144+0.313282 
## [12] train-rmse:48.087869+0.124237   test-rmse:48.133390+0.329473 
## [13] train-rmse:47.986265+0.089676   test-rmse:48.033182+0.353658 
## [14] train-rmse:47.892427+0.079926   test-rmse:47.947552+0.375428 
## [15] train-rmse:47.832483+0.083574   test-rmse:47.895967+0.375571 
## [16] train-rmse:47.790473+0.080415   test-rmse:47.863293+0.379066 
## [17] train-rmse:47.751828+0.081082   test-rmse:47.831606+0.380974 
## [18] train-rmse:47.727420+0.086136   test-rmse:47.814307+0.383756 
## [19] train-rmse:47.691869+0.078602   test-rmse:47.785546+0.387505 
## [20] train-rmse:47.646491+0.112707   test-rmse:47.743008+0.361829 
## [21] train-rmse:47.611533+0.127790   test-rmse:47.718449+0.351909 
## [22] train-rmse:47.581001+0.129603   test-rmse:47.690350+0.355777 
## [23] train-rmse:47.566931+0.128581   test-rmse:47.679336+0.358482 
## [24] train-rmse:47.552747+0.128625   test-rmse:47.670377+0.360645 
## [25] train-rmse:47.533769+0.126592   test-rmse:47.653292+0.359117 
## [26] train-rmse:47.517961+0.129046   test-rmse:47.642613+0.353710 
## [27] train-rmse:47.504838+0.130470   test-rmse:47.633099+0.350346 
## [28] train-rmse:47.488182+0.128669   test-rmse:47.623568+0.352864 
## [29] train-rmse:47.478754+0.128524   test-rmse:47.618341+0.354121 
## [30] train-rmse:47.464933+0.125357   test-rmse:47.608382+0.358899 
## [31] train-rmse:47.447864+0.128427   test-rmse:47.598231+0.354538 
## [32] train-rmse:47.437128+0.127135   test-rmse:47.593299+0.353727 
## [33] train-rmse:47.424672+0.128209   test-rmse:47.584892+0.352185 
## [34] train-rmse:47.412052+0.128769   test-rmse:47.579828+0.351704 
## [35] train-rmse:47.395421+0.135913   test-rmse:47.570622+0.346784 
## [36] train-rmse:47.386800+0.135271   test-rmse:47.567609+0.352190 
## [37] train-rmse:47.377648+0.135101   test-rmse:47.569548+0.354002 
## [38] train-rmse:47.366985+0.133217   test-rmse:47.563750+0.356662 
## [39] train-rmse:47.354906+0.133437   test-rmse:47.554203+0.356047 
## [40] train-rmse:47.347677+0.132827   test-rmse:47.550345+0.357539 
## [41] train-rmse:47.340453+0.131505   test-rmse:47.549147+0.359843 
## [42] train-rmse:47.324946+0.130718   test-rmse:47.540761+0.359252 
## [43] train-rmse:47.309410+0.125738   test-rmse:47.532930+0.362823 
## [44] train-rmse:47.298247+0.123099   test-rmse:47.528691+0.364755 
## [45] train-rmse:47.286745+0.120426   test-rmse:47.520033+0.368465 
## [46] train-rmse:47.280306+0.119586   test-rmse:47.518392+0.366660 
## [47] train-rmse:47.271176+0.116132   test-rmse:47.512527+0.368884 
## [48] train-rmse:47.264273+0.116093   test-rmse:47.513126+0.369346 
## [49] train-rmse:47.254261+0.115051   test-rmse:47.507814+0.370803 
## [50] train-rmse:47.244592+0.113336   test-rmse:47.501458+0.372071 
## [51] train-rmse:47.238686+0.114972   test-rmse:47.499967+0.370398 
## [52] train-rmse:47.227146+0.119079   test-rmse:47.490928+0.366837 
## [53] train-rmse:47.222829+0.117205   test-rmse:47.490529+0.367030 
## [54] train-rmse:47.219044+0.117141   test-rmse:47.488692+0.365355 
## [55] train-rmse:47.212611+0.115741   test-rmse:47.485481+0.366452 
## [56] train-rmse:47.207226+0.116960   test-rmse:47.485974+0.363875 
## [57] train-rmse:47.201480+0.116104   test-rmse:47.486849+0.365250 
## [58] train-rmse:47.196257+0.114605   test-rmse:47.487884+0.364872 
## [59] train-rmse:47.190012+0.115193   test-rmse:47.484396+0.360961 
## [60] train-rmse:47.185732+0.114597   test-rmse:47.484931+0.361209 
## [61] train-rmse:47.181973+0.113139   test-rmse:47.487944+0.361337 
## [62] train-rmse:47.179081+0.112625   test-rmse:47.488368+0.360251 
## [63] train-rmse:47.172899+0.113509   test-rmse:47.486959+0.361472 
## [64] train-rmse:47.166243+0.114017   test-rmse:47.484556+0.356938 
## [65] train-rmse:47.161686+0.114000   test-rmse:47.483697+0.356843 
## [66] train-rmse:47.156447+0.114541   test-rmse:47.484074+0.357742 
## [67] train-rmse:47.150737+0.114146   test-rmse:47.483385+0.357810 
## [68] train-rmse:47.144992+0.113556   test-rmse:47.479279+0.357175 
## [69] train-rmse:47.141460+0.113787   test-rmse:47.478683+0.354675 
## [70] train-rmse:47.132628+0.113724   test-rmse:47.473325+0.357615 
## [71] train-rmse:47.125207+0.114018   test-rmse:47.472322+0.357278 
## [72] train-rmse:47.112384+0.122296   test-rmse:47.461309+0.348387 
## [73] train-rmse:47.108363+0.121871   test-rmse:47.461949+0.351203 
## [74] train-rmse:47.097643+0.118741   test-rmse:47.454028+0.355086 
## [75] train-rmse:47.086817+0.121844   test-rmse:47.446861+0.352881 
## [76] train-rmse:47.083521+0.121895   test-rmse:47.448671+0.352664 
## [77] train-rmse:47.079192+0.121195   test-rmse:47.450802+0.352935 
## [78] train-rmse:47.070845+0.118291   test-rmse:47.448384+0.357312 
## [79] train-rmse:47.067351+0.118595   test-rmse:47.449017+0.358714 
## [80] train-rmse:47.063701+0.120803   test-rmse:47.447142+0.357918 
## [81] train-rmse:47.057875+0.123965   test-rmse:47.442821+0.355277 
## [82] train-rmse:47.052542+0.122954   test-rmse:47.443634+0.355643 
## [83] train-rmse:47.047643+0.121431   test-rmse:47.441110+0.357339 
## [84] train-rmse:47.040910+0.118579   test-rmse:47.436395+0.360167 
## [85] train-rmse:47.036364+0.118335   test-rmse:47.433598+0.361451 
## [86] train-rmse:47.030484+0.115679   test-rmse:47.430134+0.364509 
## [87] train-rmse:47.026946+0.114941   test-rmse:47.431322+0.363012 
## [88] train-rmse:47.023735+0.115903   test-rmse:47.432223+0.362066 
## [89] train-rmse:47.020015+0.115515   test-rmse:47.432826+0.366036 
## [90] train-rmse:47.015485+0.117276   test-rmse:47.432697+0.364722 
## [91] train-rmse:47.010017+0.119581   test-rmse:47.433095+0.362567 
## [92] train-rmse:47.007132+0.119785   test-rmse:47.431719+0.363207 
## [93] train-rmse:47.003623+0.122485   test-rmse:47.431351+0.360890 
## [94] train-rmse:46.998022+0.120368   test-rmse:47.429740+0.361636 
## [95] train-rmse:46.991802+0.119639   test-rmse:47.429117+0.363401 
## [96] train-rmse:46.988226+0.120151   test-rmse:47.427172+0.362112 
## [97] train-rmse:46.983573+0.118707   test-rmse:47.426582+0.361713 
## [98] train-rmse:46.979624+0.117132   test-rmse:47.425566+0.363019 
## [99] train-rmse:46.972733+0.116868   test-rmse:47.422822+0.362955 
## [100]    train-rmse:46.967827+0.117163   test-rmse:47.423319+0.364356 
## [101]    train-rmse:46.962444+0.117891   test-rmse:47.418350+0.364594 
## [102]    train-rmse:46.957264+0.116050   test-rmse:47.418996+0.364431 
## [103]    train-rmse:46.954103+0.115924   test-rmse:47.419227+0.362320 
## [104]    train-rmse:46.950755+0.117457   test-rmse:47.418195+0.362387 
## [105]    train-rmse:46.944431+0.115788   test-rmse:47.418155+0.362797 
## [106]    train-rmse:46.939888+0.116501   test-rmse:47.418310+0.360590 
## [107]    train-rmse:46.927883+0.113873   test-rmse:47.408280+0.362281 
## [108]    train-rmse:46.921961+0.112111   test-rmse:47.406422+0.362239 
## [109]    train-rmse:46.918815+0.112391   test-rmse:47.407284+0.362622 
## [110]    train-rmse:46.914907+0.112182   test-rmse:47.407691+0.362662 
## [111]    train-rmse:46.912280+0.112097   test-rmse:47.407097+0.361399 
## [112]    train-rmse:46.907418+0.111966   test-rmse:47.406506+0.362504 
## [113]    train-rmse:46.899776+0.110950   test-rmse:47.406089+0.362396 
## [114]    train-rmse:46.894140+0.111513   test-rmse:47.405962+0.361742 
## [115]    train-rmse:46.887976+0.117877   test-rmse:47.402118+0.355749 
## [116]    train-rmse:46.883464+0.120656   test-rmse:47.401146+0.352255 
## [117]    train-rmse:46.878951+0.121676   test-rmse:47.400244+0.349570 
## [118]    train-rmse:46.872687+0.122247   test-rmse:47.396672+0.350451 
## [119]    train-rmse:46.869488+0.122241   test-rmse:47.397435+0.350093 
## [120]    train-rmse:46.865553+0.121685   test-rmse:47.396568+0.349413 
## [121]    train-rmse:46.862210+0.121903   test-rmse:47.394488+0.348301 
## [122]    train-rmse:46.856686+0.120815   test-rmse:47.391507+0.348383 
## [123]    train-rmse:46.848929+0.120998   test-rmse:47.389822+0.350380 
## [124]    train-rmse:46.844705+0.120386   test-rmse:47.391432+0.350353 
## [125]    train-rmse:46.839776+0.117789   test-rmse:47.390014+0.351678 
## [126]    train-rmse:46.831003+0.114048   test-rmse:47.381560+0.354198 
## [127]    train-rmse:46.825771+0.115639   test-rmse:47.382244+0.351833 
## [128]    train-rmse:46.821711+0.114687   test-rmse:47.382464+0.352091 
## [129]    train-rmse:46.817688+0.112751   test-rmse:47.381277+0.354069 
## [130]    train-rmse:46.815636+0.111954   test-rmse:47.383003+0.352803 
## [131]    train-rmse:46.811386+0.112408   test-rmse:47.381155+0.354327 
## [132]    train-rmse:46.807091+0.113886   test-rmse:47.383242+0.352926 
## [133]    train-rmse:46.805931+0.114077   test-rmse:47.384218+0.352648 
## [134]    train-rmse:46.802422+0.114779   test-rmse:47.383825+0.353306 
## [135]    train-rmse:46.799052+0.114900   test-rmse:47.382378+0.351551 
## [136]    train-rmse:46.793046+0.114889   test-rmse:47.378210+0.349983 
## [137]    train-rmse:46.787291+0.118686   test-rmse:47.375133+0.347623 
## [138]    train-rmse:46.783862+0.118465   test-rmse:47.375242+0.346744 
## [139]    train-rmse:46.778656+0.117090   test-rmse:47.371983+0.348917 
## [140]    train-rmse:46.772878+0.116186   test-rmse:47.370814+0.350075 
## [141]    train-rmse:46.768692+0.114241   test-rmse:47.370957+0.353833 
## [142]    train-rmse:46.765939+0.114770   test-rmse:47.370976+0.354461 
## [143]    train-rmse:46.759020+0.114503   test-rmse:47.367326+0.354104 
## [144]    train-rmse:46.754312+0.114235   test-rmse:47.365516+0.354381 
## [145]    train-rmse:46.751513+0.113673   test-rmse:47.365415+0.353358 
## [146]    train-rmse:46.748609+0.112894   test-rmse:47.366879+0.353079 
## [147]    train-rmse:46.744514+0.113231   test-rmse:47.365620+0.353462 
## [148]    train-rmse:46.741552+0.113294   test-rmse:47.364305+0.355637 
## [149]    train-rmse:46.738111+0.113935   test-rmse:47.366345+0.354083 
## [150]    train-rmse:46.734845+0.113496   test-rmse:47.366410+0.355514 
## [151]    train-rmse:46.732237+0.113066   test-rmse:47.365012+0.355768 
## [152]    train-rmse:46.729189+0.113296   test-rmse:47.368489+0.355938 
## [153]    train-rmse:46.726626+0.114061   test-rmse:47.370652+0.356829 
## [154]    train-rmse:46.723274+0.113612   test-rmse:47.373837+0.358335 
## [155]    train-rmse:46.720256+0.113988   test-rmse:47.372818+0.360519 
## [156]    train-rmse:46.716094+0.114141   test-rmse:47.374747+0.362379 
## [157]    train-rmse:46.713908+0.113619   test-rmse:47.377541+0.364460 
## [158]    train-rmse:46.709586+0.115113   test-rmse:47.376457+0.364729 
## Stopping. Best iteration:
## [148]    train-rmse:46.741552+0.113294   test-rmse:47.364305+0.355637
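The stopping rule visible in the log above (best iteration 148, run halted at round 158 with early_stopping_rounds = 10) can be replayed on any score vector. The sketch below is a simplified base-R illustration of a patience-based rule, not xgboost's actual implementation; `early_stop` and the toy RMSE values are hypothetical.

```r
# Hypothetical helper: replay a patience-based early-stopping rule.
# Stops as soon as `patience` rounds have passed without a new best score.
early_stop <- function(scores, patience) {
  best <- 1L
  for (i in seq_along(scores)) {
    if (scores[i] < scores[best]) best <- i          # new best (lower is better)
    if (i - best >= patience) {
      return(list(best = best, stopped = i))         # patience exhausted
    }
  }
  list(best = best, stopped = length(scores))        # ran through all rounds
}

# Toy RMSE curve (made-up values): improves until round 3, then stalls.
toy_rmse <- c(50, 48, 47, 47.2, 47.1, 47.3)
res <- early_stop(toy_rmse, patience = 2)
print(res)
```

With a patience of 10, the same rule applied to the CV log would stop at round 158 for a best round of 148, which matches the message printed by xgb.cv.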
importance_matrix_y0 <- xgb.importance(model = gb_dt_final)
xgb.plot.importance(head(importance_matrix_y0, 30))

head(importance_matrix_y0, 30)
##                                        Feature        Gain       Cover   Frequency
##  1:                                      hours 0.467613973 0.077723717 0.079628725
##  2: delta.status.preceding.selection.selection 0.238602999 0.177045631 0.161700049
##  3:                      latitude.intervention 0.056169580 0.155942974 0.144113337
##  4:                      alert.reason.category 0.052726177 0.033613103 0.038104543
##  5:                     longitude.intervention 0.050120650 0.133622233 0.130923302
##  6:                       OSRM.estimated.speed 0.033660641 0.113168119 0.113336590
##  7:                    OSRM.estimated.duration 0.031845807 0.118622759 0.098680997
##  8:                    OSRM.estimated.distance 0.028495466 0.119362661 0.117733268
##  9:                                      month 0.013864727 0.019638328 0.040058622
## 10:                                   weekdays 0.011525886 0.022758270 0.029311187
## 11:                                      floor 0.006960137 0.017369730 0.032242306
## 12:            departed.from.its.rescue.center 0.006033114 0.007403547 0.007327797
## 13:               intervention.on.public.roads 0.002380842 0.003728929 0.006839277
yo_preds <- predict(gb_dt_final,dtest)

It works!

yo_preds %>% head
## [1] 139.9449 136.5910 120.4235 158.9106 152.7684 163.7142

Y1

data.fe.reduced.1<-data.fe[,-c("delta.selection.departure","delta.selection.presentation","rescue.center","emergency.vehicle.type","emergency.vehicle","n","location.of.the.event","emergency.vehicle.selection")]

Validation

set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.1$delta.departure.presentation, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.1[trainIndex,]
valid=data.fe.reduced.1[-trainIndex,]
foo <- train %>% select(-delta.departure.presentation)
bar <- valid %>% select(-delta.departure.presentation)

dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.departure.presentation)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.departure.presentation)
gb_params_final <- list(colsample_bytree = 0.7, #variables per tree 
                   subsample = 0.7, #data subset per tree 
                   booster = "gbtree",
                   max_depth = 5, #tree levels
                   eta = 0.3, #shrinkage
                   eval_metric = "rmse", 
                   objective = "reg:linear",
                   seed = 4321
                   )

watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
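# Note: this call uses xgb_params (defined earlier in the document),
# not the gb_params_final list built just above.
gb_dt_final <- xgb.train(params = xgb_params,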
                   data = dtrain,
                   print_every_n = 5,
                   watchlist = watchlist,
                   nrounds = 300)
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 5, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:37:21] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:281.581818   valid-rmse:281.096405 
## [6]  train-rmse:137.590942   valid-rmse:137.396545 
## [11] train-rmse:129.172226   valid-rmse:129.130997 
## [16] train-rmse:128.197983   valid-rmse:128.248779 
## [21] train-rmse:127.834221   valid-rmse:127.959030 
## [26] train-rmse:127.529221   valid-rmse:127.764519 
## [31] train-rmse:127.313118   valid-rmse:127.610802 
## [36] train-rmse:127.108261   valid-rmse:127.415588 
## [41] train-rmse:126.971756   valid-rmse:127.301468 
## [46] train-rmse:126.766457   valid-rmse:127.153137 
## [51] train-rmse:126.650963   valid-rmse:127.076904 
## [56] train-rmse:126.551292   valid-rmse:127.025551 
## [61] train-rmse:126.418633   valid-rmse:126.859032 
## [66] train-rmse:126.260788   valid-rmse:126.729599 
## [71] train-rmse:126.178810   valid-rmse:126.681183 
## [76] train-rmse:126.072586   valid-rmse:126.593849 
## [81] train-rmse:126.011520   valid-rmse:126.566483 
## [86] train-rmse:125.930672   valid-rmse:126.507385 
## [91] train-rmse:125.846268   valid-rmse:126.471138 
## [96] train-rmse:125.752846   valid-rmse:126.437157 
## [101]    train-rmse:125.673805   valid-rmse:126.377579 
## [106]    train-rmse:125.605606   valid-rmse:126.320541 
## [111]    train-rmse:125.545403   valid-rmse:126.266739 
## [116]    train-rmse:125.478317   valid-rmse:126.240616 
## [121]    train-rmse:125.421829   valid-rmse:126.227203 
## [126]    train-rmse:125.356369   valid-rmse:126.183113 
## [131]    train-rmse:125.319839   valid-rmse:126.145607 
## [136]    train-rmse:125.266129   valid-rmse:126.131760 
## [141]    train-rmse:125.217155   valid-rmse:126.135361 
## [146]    train-rmse:125.175476   valid-rmse:126.117546 
## [151]    train-rmse:125.113052   valid-rmse:126.080421 
## [156]    train-rmse:125.049934   valid-rmse:126.043907 
## [161]    train-rmse:125.003387   valid-rmse:126.022049 
## [166]    train-rmse:124.929024   valid-rmse:126.002174 
## [171]    train-rmse:124.896721   valid-rmse:126.006470 
## [176]    train-rmse:124.847977   valid-rmse:126.005798 
## [181]    train-rmse:124.791206   valid-rmse:125.986488 
## [186]    train-rmse:124.735176   valid-rmse:125.968216 
## [191]    train-rmse:124.691116   valid-rmse:125.955620 
## [196]    train-rmse:124.627815   valid-rmse:125.921906 
## [201]    train-rmse:124.601868   valid-rmse:125.902725 
## [206]    train-rmse:124.550392   valid-rmse:125.891167 
## [211]    train-rmse:124.520157   valid-rmse:125.899101 
## [216]    train-rmse:124.463165   valid-rmse:125.874977 
## [221]    train-rmse:124.418488   valid-rmse:125.842552 
## [226]    train-rmse:124.382301   valid-rmse:125.846573 
## [231]    train-rmse:124.358192   valid-rmse:125.866714 
## [236]    train-rmse:124.314766   valid-rmse:125.854462 
## [241]    train-rmse:124.280281   valid-rmse:125.850349 
## [246]    train-rmse:124.253571   valid-rmse:125.844170 
## [251]    train-rmse:124.199898   valid-rmse:125.850151 
## [256]    train-rmse:124.154457   valid-rmse:125.849022 
## [261]    train-rmse:124.123108   valid-rmse:125.879662 
## [266]    train-rmse:124.091957   valid-rmse:125.901802 
## [271]    train-rmse:124.066185   valid-rmse:125.906349 
## [276]    train-rmse:124.027557   valid-rmse:125.895996 
## [281]    train-rmse:123.982994   valid-rmse:125.882622 
## [286]    train-rmse:123.938988   valid-rmse:125.872154 
## [291]    train-rmse:123.906761   valid-rmse:125.844421 
## [296]    train-rmse:123.872627   valid-rmse:125.858078 
## [300]    train-rmse:123.847771   valid-rmse:125.851135

After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. A larger number of rounds would exceed the Kaggle run-time limit, so the CV is capped here at 150 rounds to demonstrate the principle. Use at least a few hundred rounds in a full analysis, depending on your XGBoost parameters. The early_stopping_rounds parameter stops the CV fit once the test RMSE has not improved for 10 consecutive rounds.
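
To make the CV log easier to read, here is a minimal base R sketch of what a 5-fold cross-validation computes. It is an illustration under assumptions, not the notebook's method: the data are synthetic, `lm()` is swapped in for the boosted tree, and the names `toy`, `fold` and `fold_rmse` are hypothetical. Each fold is held out once, and the held-out RMSE values are summarised as a mean and a standard deviation, which is the shape of the `test-rmse:mean+sd` columns printed by `xgb.cv`.

```r
# Minimal k-fold CV sketch on synthetic data (lm() stands in for XGBoost).
set.seed(4321)
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.5)

k <- 5
# Assign every row to one of k folds of equal size, in random order.
fold <- sample(rep(seq_len(k), length.out = n))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit on k - 1 folds, score on the held-out fold, repeat for each fold.
fold_rmse <- vapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = toy[fold != i, ])
  rmse(toy$y[fold == i], predict(fit, newdata = toy[fold == i, ]))
}, numeric(1))

# Summarise across folds in the same "mean+sd" layout as the CV log.
cat(sprintf("test-rmse:%f+%f\n", mean(fold_rmse), sd(fold_rmse)))
```

A small spread across folds, as in the log below, suggests that the score does not depend much on which rows were held out.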

xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=150)
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:281.523804+0.414619  test-rmse:281.544348+0.724674 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:219.081528+1.673523  test-rmse:219.131796+1.536183 
## [3]  train-rmse:179.514246+1.093519  test-rmse:179.601877+1.104861 
## [4]  train-rmse:156.713403+0.879909  test-rmse:156.862521+0.670584 
## [5]  train-rmse:144.033722+0.454121  test-rmse:144.197638+0.724830 
## [6]  train-rmse:137.140607+0.336102  test-rmse:137.323874+0.934584 
## [7]  train-rmse:133.426883+0.241605  test-rmse:133.607410+0.751621 
## [8]  train-rmse:131.365689+0.216542  test-rmse:131.546515+0.666350 
## [9]  train-rmse:130.265457+0.190730  test-rmse:130.463403+0.659083 
## [10] train-rmse:129.582663+0.219886  test-rmse:129.809027+0.613893 
## [11] train-rmse:129.167074+0.236014  test-rmse:129.406039+0.602272 
## [12] train-rmse:128.885593+0.230563  test-rmse:129.133887+0.602641 
## [13] train-rmse:128.619315+0.219849  test-rmse:128.877890+0.613587 
## [14] train-rmse:128.430792+0.220800  test-rmse:128.709418+0.614585 
## [15] train-rmse:128.271570+0.204786  test-rmse:128.550174+0.623793 
## [16] train-rmse:128.143535+0.196471  test-rmse:128.434557+0.630546 
## [17] train-rmse:128.022392+0.179914  test-rmse:128.328053+0.639083 
## [18] train-rmse:127.944290+0.166722  test-rmse:128.267088+0.650849 
## [19] train-rmse:127.849617+0.174530  test-rmse:128.181015+0.649259 
## [20] train-rmse:127.775432+0.177344  test-rmse:128.119374+0.644230 
## [21] train-rmse:127.717990+0.177802  test-rmse:128.076634+0.648400 
## [22] train-rmse:127.653108+0.184064  test-rmse:128.018414+0.642460 
## [23] train-rmse:127.581145+0.175926  test-rmse:127.952537+0.653128 
## [24] train-rmse:127.511911+0.166312  test-rmse:127.899674+0.654002 
## [25] train-rmse:127.467285+0.164262  test-rmse:127.861557+0.655209 
## [26] train-rmse:127.411963+0.153371  test-rmse:127.827664+0.664448 
## [27] train-rmse:127.357196+0.160077  test-rmse:127.780997+0.658795 
## [28] train-rmse:127.295271+0.182832  test-rmse:127.730334+0.637080 
## [29] train-rmse:127.252402+0.181873  test-rmse:127.709767+0.640776 
## [30] train-rmse:127.216164+0.179629  test-rmse:127.683695+0.632499 
## [31] train-rmse:127.172676+0.186846  test-rmse:127.651163+0.633243 
## [32] train-rmse:127.133798+0.180836  test-rmse:127.622701+0.637106 
## [33] train-rmse:127.089110+0.177977  test-rmse:127.591965+0.629584 
## [34] train-rmse:127.058234+0.185618  test-rmse:127.570425+0.620966 
## [35] train-rmse:127.022231+0.179270  test-rmse:127.537398+0.628395 
## [36] train-rmse:126.977962+0.182050  test-rmse:127.499119+0.632896 
## [37] train-rmse:126.953825+0.178533  test-rmse:127.479462+0.647071 
## [38] train-rmse:126.914813+0.191309  test-rmse:127.444212+0.639214 
## [39] train-rmse:126.872835+0.185809  test-rmse:127.411391+0.651816 
## [40] train-rmse:126.833978+0.178700  test-rmse:127.380124+0.661537 
## [41] train-rmse:126.796616+0.176480  test-rmse:127.353807+0.668091 
## [42] train-rmse:126.767587+0.175553  test-rmse:127.331142+0.666778 
## [43] train-rmse:126.737492+0.179754  test-rmse:127.313023+0.666859 
## [44] train-rmse:126.713512+0.175745  test-rmse:127.294949+0.670462 
## [45] train-rmse:126.689593+0.167029  test-rmse:127.281517+0.681222 
## [46] train-rmse:126.651088+0.179210  test-rmse:127.260617+0.673548 
## [47] train-rmse:126.624667+0.169826  test-rmse:127.252069+0.685589 
## [48] train-rmse:126.601259+0.171284  test-rmse:127.239157+0.676947 
## [49] train-rmse:126.568806+0.174550  test-rmse:127.220949+0.669404 
## [50] train-rmse:126.553667+0.179115  test-rmse:127.215584+0.667684 
## [51] train-rmse:126.526750+0.180720  test-rmse:127.192891+0.672631 
## [52] train-rmse:126.505202+0.174346  test-rmse:127.176755+0.677513 
## [53] train-rmse:126.490799+0.179564  test-rmse:127.173900+0.674205 
## [54] train-rmse:126.460991+0.176703  test-rmse:127.157059+0.673474 
## [55] train-rmse:126.437930+0.175104  test-rmse:127.141040+0.670218 
## [56] train-rmse:126.410649+0.167153  test-rmse:127.129221+0.676371 
## [57] train-rmse:126.393768+0.168644  test-rmse:127.124442+0.677805 
## [58] train-rmse:126.373703+0.176651  test-rmse:127.115929+0.666780 
## [59] train-rmse:126.347380+0.174560  test-rmse:127.097687+0.667820 
## [60] train-rmse:126.330393+0.174953  test-rmse:127.090921+0.673389 
## [61] train-rmse:126.320430+0.178680  test-rmse:127.089604+0.672929 
## [62] train-rmse:126.306044+0.182754  test-rmse:127.087215+0.673506 
## [63] train-rmse:126.286730+0.181635  test-rmse:127.079835+0.673497 
## [64] train-rmse:126.259903+0.191361  test-rmse:127.066576+0.670915 
## [65] train-rmse:126.246524+0.198716  test-rmse:127.056656+0.666946 
## [66] train-rmse:126.226643+0.193043  test-rmse:127.045021+0.678763 
## [67] train-rmse:126.215774+0.194754  test-rmse:127.039389+0.683730 
## [68] train-rmse:126.196167+0.196994  test-rmse:127.025725+0.679628 
## [69] train-rmse:126.176482+0.198582  test-rmse:127.013895+0.679368 
## [70] train-rmse:126.151678+0.192077  test-rmse:127.006059+0.688230 
## [71] train-rmse:126.134698+0.195938  test-rmse:127.001901+0.681984 
## [72] train-rmse:126.108124+0.196697  test-rmse:126.991725+0.678874 
## [73] train-rmse:126.089851+0.198317  test-rmse:126.982728+0.680452 
## [74] train-rmse:126.072423+0.197131  test-rmse:126.972639+0.679232 
## [75] train-rmse:126.056810+0.190282  test-rmse:126.962967+0.686312 
## [76] train-rmse:126.024744+0.198343  test-rmse:126.939092+0.675856 
## [77] train-rmse:126.000297+0.192010  test-rmse:126.919551+0.688342 
## [78] train-rmse:125.983716+0.192478  test-rmse:126.911766+0.688660 
## [79] train-rmse:125.968008+0.190795  test-rmse:126.903821+0.691191 
## [80] train-rmse:125.951915+0.188925  test-rmse:126.898318+0.691721 
## [81] train-rmse:125.938568+0.189416  test-rmse:126.891689+0.686983 
## [82] train-rmse:125.927505+0.190296  test-rmse:126.893272+0.689856 
## [83] train-rmse:125.911993+0.195796  test-rmse:126.887067+0.686134 
## [84] train-rmse:125.885525+0.191317  test-rmse:126.872099+0.690821 
## [85] train-rmse:125.875354+0.192467  test-rmse:126.874057+0.689955 
## [86] train-rmse:125.858795+0.192809  test-rmse:126.867731+0.696176 
## [87] train-rmse:125.844838+0.193044  test-rmse:126.872694+0.697327 
## [88] train-rmse:125.831219+0.192605  test-rmse:126.870186+0.697128 
## [89] train-rmse:125.815009+0.193552  test-rmse:126.864787+0.696106 
## [90] train-rmse:125.798550+0.195410  test-rmse:126.852177+0.694531 
## [91] train-rmse:125.789036+0.196389  test-rmse:126.855126+0.688573 
## [92] train-rmse:125.771703+0.196521  test-rmse:126.847206+0.693832 
## [93] train-rmse:125.761838+0.196215  test-rmse:126.843057+0.691604 
## [94] train-rmse:125.748201+0.197899  test-rmse:126.841521+0.697092 
## [95] train-rmse:125.731314+0.196684  test-rmse:126.838306+0.696396 
## [96] train-rmse:125.716672+0.195217  test-rmse:126.828497+0.697162 
## [97] train-rmse:125.702783+0.194351  test-rmse:126.824394+0.700524 
## [98] train-rmse:125.689615+0.193632  test-rmse:126.818106+0.700841 
## [99] train-rmse:125.674144+0.196865  test-rmse:126.812309+0.697225 
## [100]    train-rmse:125.658310+0.198158  test-rmse:126.807248+0.700850 
## [101]    train-rmse:125.622958+0.191371  test-rmse:126.783844+0.708908 
## [102]    train-rmse:125.610875+0.189268  test-rmse:126.783942+0.708402 
## [103]    train-rmse:125.602299+0.187095  test-rmse:126.778578+0.711936 
## [104]    train-rmse:125.587270+0.183929  test-rmse:126.769690+0.714653 
## [105]    train-rmse:125.573859+0.185173  test-rmse:126.763026+0.716550 
## [106]    train-rmse:125.555656+0.182182  test-rmse:126.760545+0.720544 
## [107]    train-rmse:125.542724+0.186330  test-rmse:126.755475+0.718355 
## [108]    train-rmse:125.531265+0.185470  test-rmse:126.757770+0.717427 
## [109]    train-rmse:125.520441+0.183724  test-rmse:126.755988+0.719492 
## [110]    train-rmse:125.502649+0.185467  test-rmse:126.749359+0.715524 
## [111]    train-rmse:125.487388+0.189911  test-rmse:126.750496+0.709536 
## [112]    train-rmse:125.465802+0.182059  test-rmse:126.743807+0.713259 
## [113]    train-rmse:125.444939+0.184126  test-rmse:126.736319+0.708900 
## [114]    train-rmse:125.429964+0.178903  test-rmse:126.731203+0.709725 
## [115]    train-rmse:125.418277+0.177989  test-rmse:126.730649+0.715230 
## [116]    train-rmse:125.406596+0.178234  test-rmse:126.722790+0.717361 
## [117]    train-rmse:125.392983+0.175356  test-rmse:126.710738+0.723223 
## [118]    train-rmse:125.380408+0.172372  test-rmse:126.705493+0.727824 
## [119]    train-rmse:125.367438+0.176569  test-rmse:126.701080+0.724513 
## [120]    train-rmse:125.350941+0.178758  test-rmse:126.695539+0.722733 
## [121]    train-rmse:125.340952+0.180879  test-rmse:126.691469+0.720498 
## [122]    train-rmse:125.331378+0.181274  test-rmse:126.684915+0.722644 
## [123]    train-rmse:125.321429+0.181038  test-rmse:126.684486+0.717763 
## [124]    train-rmse:125.306696+0.179692  test-rmse:126.670096+0.720394 
## [125]    train-rmse:125.287822+0.172014  test-rmse:126.661996+0.728285 
## [126]    train-rmse:125.277008+0.169052  test-rmse:126.655484+0.728027 
## [127]    train-rmse:125.266467+0.167109  test-rmse:126.652710+0.732913 
## [128]    train-rmse:125.247157+0.170217  test-rmse:126.639624+0.729516 
## [129]    train-rmse:125.238914+0.169107  test-rmse:126.641750+0.728962 
## [130]    train-rmse:125.230951+0.169467  test-rmse:126.638780+0.725616 
## [131]    train-rmse:125.213278+0.169051  test-rmse:126.636588+0.725191 
## [132]    train-rmse:125.198699+0.167815  test-rmse:126.632719+0.721715 
## [133]    train-rmse:125.188091+0.168962  test-rmse:126.628691+0.716397 
## [134]    train-rmse:125.174799+0.172753  test-rmse:126.626431+0.711312 
## [135]    train-rmse:125.162801+0.176972  test-rmse:126.627159+0.711524 
## [136]    train-rmse:125.150217+0.177558  test-rmse:126.627708+0.709469 
## [137]    train-rmse:125.140280+0.173189  test-rmse:126.624734+0.712262 
## [138]    train-rmse:125.130373+0.168764  test-rmse:126.624202+0.714721 
## [139]    train-rmse:125.120879+0.166527  test-rmse:126.627045+0.716779 
## [140]    train-rmse:125.110635+0.164745  test-rmse:126.623828+0.717670 
## [141]    train-rmse:125.102263+0.161268  test-rmse:126.618884+0.717977 
## [142]    train-rmse:125.092708+0.158324  test-rmse:126.621751+0.715553 
## [143]    train-rmse:125.082019+0.155436  test-rmse:126.617722+0.718351 
## [144]    train-rmse:125.070909+0.159067  test-rmse:126.614075+0.716664 
## [145]    train-rmse:125.059610+0.156242  test-rmse:126.611324+0.711578 
## [146]    train-rmse:125.050227+0.154634  test-rmse:126.607147+0.715556 
## [147]    train-rmse:125.034125+0.151428  test-rmse:126.596477+0.715755 
## [148]    train-rmse:125.009386+0.147483  test-rmse:126.583774+0.720781 
## [149]    train-rmse:124.998654+0.148988  test-rmse:126.578406+0.716137 
## [150]    train-rmse:124.989412+0.150454  test-rmse:126.579814+0.712278
y1_preds <- predict(gb_dt_final,dtest)
y1_preds %>% head
## [1] 313.6901 397.7925 262.8186 314.9897 281.3379 263.2837

Compute global

pred_ys=yo_preds+y1_preds
pred_final=data.frame(yo_preds,y1_preds,pred_ys)
pred_final %>% head
##   yo_preds y1_preds  pred_ys
## 1 139.9449 313.6901 453.6350
## 2 136.5910 397.7925 534.3835
## 3 120.4235 262.8186 383.2422
## 4 158.9106 314.9897 473.9003
## 5 152.7684 281.3379 434.1063
## 6 163.7142 263.2837 426.9979
pred_final$id<-x_test$emergency.vehicle.selection

Reorder the columns so that id comes first

pred_final <- pred_final[, c(4, 1, 2, 3)]

Restore the original row order of x_test

pred_final %>% head
##        id yo_preds y1_preds  pred_ys
## 1 4715068 139.9449 313.6901 453.6350
## 2 4714816 136.5910 397.7925 534.3835
## 3 4713710 120.4235 262.8186 383.2422
## 4 4713748 158.9106 314.9897 473.9003
## 5 4713778 152.7684 281.3379 434.1063
## 6 4713812 163.7142 263.2837 426.9979
id_order %>% head
##   x_test.emergency.vehicle.selection rank
## 1                            5271704    1
## 2                            5092931    2
## 3                            5153756    3
## 4                            5355572    4
## 5                            5178915    5
## 6                            5206885    6
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
id_order %>% head
##   x_test.emergency.vehicle.selection rank rank
## 1                            5271704    1    1
## 2                            5092931    2    2
## 3                            5153756    3    3
## 4                            5355572    4    4
## 5                            5178915    5    5
## 6                            5206885    6    6
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
##        id yo_preds y1_preds  pred_ys  rank rank.1
## 1 4713710 120.4235 262.8186 383.2422 14082  14082
## 2 4713748 158.9106 314.9897 473.9003 25965  25965
## 3 4713778 152.7684 281.3379 434.1063 14465  14465
## 4 4713812 163.7142 263.2837 426.9979 79530  79530
## 5 4713821 133.7922 488.3494 622.1416 47824  47824
## 6 4713863 144.9620 233.3339 378.2959  2072   2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
##            id yo_preds y1_preds  pred_ys rank rank.1
## 77897 5271704 116.8953 414.5863 531.4816    1      1
## 59352 5092931 118.0171 374.9510 492.9681    2      2
## 68483 5153756 116.5670 207.0299 323.5969    3      3
## 91242 5355572 210.9534 302.8663 513.8197    4      4
## 72358 5178915 131.1625 273.5441 404.7067    5      5
## 77062 5206885 116.8149 273.0493 389.8642    6      6
y_final %>% setDT
y_final=y_final[,-c("rank.1")]
y_final %>% head
##         id yo_preds y1_preds  pred_ys rank
## 1: 5271704 116.8953 414.5863 531.4816    1
## 2: 5092931 118.0171 374.9510 492.9681    2
## 3: 5153756 116.5670 207.0299 323.5969    3
## 4: 5355572 210.9534 302.8663 513.8197    4
## 5: 5178915 131.1625 273.5441 404.7067    5
## 6: 5206885 116.8149 273.0493 389.8642    6
sum(is.na(y_final))
## [1] 0
which(is.na(y_final))
## integer(0)
y_final[is.na(y_final)] <- 0
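
The merge, sort and column-drop steps above can also be written with `match()`. The following is a minimal sketch on toy data, not the notebook's code: `preds` and `wanted_ids` are hypothetical stand-ins for `pred_final` and the id column of `id_order`, and the sketch assumes every id appears exactly once on each side.

```r
# Toy predictions in an arbitrary order (stand-in for pred_final).
preds <- data.frame(id = c(30L, 10L, 20L),
                    pred_ys = c(3.3, 1.1, 2.2))

# Ids in the order required for the submission (stand-in for id_order).
wanted_ids <- c(10L, 20L, 30L)

# match() returns, for each wanted id, its row position in preds.
idx <- match(wanted_ids, preds$id)

# Fail loudly if an id is missing rather than filling NA with 0 later.
stopifnot(!anyNA(idx))

restored <- preds[idx, ]
print(restored)
```

This avoids the temporary rank columns, and the explicit `anyNA()` check surfaces an unmatched id instead of silently replacing it.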

Write the CSV file. A separate Python script is then used to generate the correctly formatted CSV.

fwrite(y_final, "y_test.csv",sep=",")
R2

9.2 all x_test

Keeping some factors:

OSRM data, intervention floor, alert.reason.category, intervention longitude, intervention latitude, delta.status, weekdays, month, hours, departed from

data.fe %>% str
## Classes 'data.table' and 'data.frame':   204987 obs. of  22 variables:
##  $ n                                         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ emergency.vehicle.selection               : int  4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event                     : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type                    : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                             : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ delta.departure.presentation              : int  174 376 214 268 409 678 98 187 623 181 ...
##  $ delta.selection.presentation              : int  413 423 332 417 506 791 162 307 757 275 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
x_test %>% str
## Classes 'data.table' and 'data.frame':   108033 obs. of  19 variables:
##  $ n                                         : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ emergency.vehicle.selection               : int  4715068 4714816 4713710 4713748 4713778 4713812 4713821 4713863 4713872 4713878 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
##  $ location.of.the.event                     : Factor w/ 196 levels "100","101","102",..: 35 20 38 48 48 1 47 155 196 38 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.28 2.34 2.41 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 708 levels "1815","1823",..: 72 351 678 59 646 681 127 460 557 421 ...
##  $ emergency.vehicle.type                    : Factor w/ 66 levels "AR","BEAA BSPP",..: 21 53 62 9 25 62 25 62 62 62 ...
##  $ rescue.center                             : Factor w/ 91 levels "2418","2434",..: 42 3 15 42 17 70 65 30 35 58 ...
##  $ delta.status.preceding.selection.selection: int  2636 16243 597 1834 1341 2197 16 1312 263 437 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1078 1791 1451 ...
##  $ OSRM.estimated.duration                   : num  214 218 120 250 199 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 32.4 25.7 26.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
data.fe.reduced.0<-data.fe[,-c("delta.departure.presentation","delta.selection.presentation","emergency.vehicle.selection")]
x_test.reduced<-x_test[,-c("emergency.vehicle.selection")]
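
The `str()` output above shows that `data.fe` and `x_test` do not share the same factor levels (for example, `emergency.vehicle.type` has 41 levels in `data.fe` and 66 in `x_test`). Since `data.matrix()` replaces each factor by its internal integer codes, the same code can then stand for different categories in the two tables. The sketch below illustrates the effect on toy data and one possible way to align the levels; the data frames `train_toy` and `test_toy` are hypothetical, and whether this matters for the models here has not been checked.

```r
# Toy factors with different level sets (hypothetical vehicle types).
train_toy <- data.frame(type = factor(c("AR", "PSE", "VSAV")))
test_toy  <- data.frame(type = factor(c("PSE", "VSAV")))

# Codes before alignment: "PSE" is 2 in train_toy but 1 in test_toy.
codes_train_before <- as.vector(data.matrix(train_toy))
codes_test_before  <- as.vector(data.matrix(test_toy))

# Align both factors on the union of their levels.
all_levels <- union(levels(train_toy$type), levels(test_toy$type))
train_toy$type <- factor(train_toy$type, levels = all_levels)
test_toy$type  <- factor(test_toy$type,  levels = all_levels)

# Codes after alignment: each category maps to the same integer in both.
codes_train_after <- as.vector(data.matrix(train_toy))
codes_test_after  <- as.vector(data.matrix(test_toy))
print(rbind(before = codes_test_before, after = codes_test_after))
```

Levels that exist only in the test set still get a code the model never saw during training, so aligning levels makes the encoding consistent but does not by itself make those rows predictable.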

Y0

Validation

set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.0$delta.selection.departure, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.0[trainIndex,]
valid=data.fe.reduced.0[-trainIndex,]
data.fe.reduced.0 %>% str
## Classes 'data.table' and 'data.frame':   204987 obs. of  19 variables:
##  $ n                                         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
##  $ location.of.the.event                     : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.33 2.3 2.2 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
##  $ emergency.vehicle.type                    : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
##  $ rescue.center                             : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
##  $ delta.status.preceding.selection.selection: int  8293 16251 875 606 4693 86 7 1382 2062 968 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1525 1812 2586 ...
##  $ OSRM.estimated.duration                   : num  214 218 173 198 280 ...
##  $ delta.selection.departure                 : int  239 47 118 149 97 113 64 120 134 94 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 31.7 33 33.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
x_test.reduced %>% str
## Classes 'data.table' and 'data.frame':   108033 obs. of  18 variables:
##  $ n                                         : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alert.reason.category                     : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
##  $ intervention.on.public.roads              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
##  $ floor                                     : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
##  $ location.of.the.event                     : Factor w/ 196 levels "100","101","102",..: 35 20 38 48 48 1 47 155 196 38 ...
##  $ longitude.intervention                    : num  2.34 2.28 2.28 2.34 2.41 ...
##  $ latitude.intervention                     : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ emergency.vehicle                         : Factor w/ 708 levels "1815","1823",..: 72 351 678 59 646 681 127 460 557 421 ...
##  $ emergency.vehicle.type                    : Factor w/ 66 levels "AR","BEAA BSPP",..: 21 53 62 9 25 62 25 62 62 62 ...
##  $ rescue.center                             : Factor w/ 91 levels "2418","2434",..: 42 3 15 42 17 70 65 30 35 58 ...
##  $ delta.status.preceding.selection.selection: int  2636 16243 597 1834 1341 2197 16 1312 263 437 ...
##  $ departed.from.its.rescue.center           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ OSRM.estimated.distance                   : num  1283 2347 1078 1791 1451 ...
##  $ OSRM.estimated.duration                   : num  214 218 120 250 199 ...
##  $ OSRM.estimated.speed                      : num  21.6 38.8 32.4 25.7 26.2 ...
##  $ month                                     : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekdays                                  : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hours                                     : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
foo <- train %>% select(-delta.selection.departure)
bar <- valid %>% select(-delta.selection.departure)

dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.selection.departure)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.selection.departure)
dtest <- xgb.DMatrix(data.matrix(x_test.reduced))
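# Parameters for the final model. Note: this list is defined here, but the
# xgb.train() call below passes xgb_params (defined earlier), not gb_params_final.
gb_params_final <- list(colsample_bytree = 0.7, # fraction of variables sampled per tree
                   subsample = 0.7, # fraction of rows sampled per tree
                   booster = "gbtree",
                   max_depth = 5, # maximum tree depth
                   eta = 0.3, # learning rate (shrinkage)
                   eval_metric = "rmse", 
                   objective = "reg:linear", # deprecated in favor of "reg:squarederror" (see warning below)
                   seed = 4321 # ignored by the R package; set.seed() is used instead
                   )

# Report RMSE on both the training and the validation set during fitting
watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
gb_dt_final <- xgb.train(params = xgb_params,
                   data = dtrain,
                   print_every_n = 100,
                   watchlist = watchlist,
                   nrounds = 300)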
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 100, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:38:34] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:108.764305   valid-rmse:108.917198 
## [101]    train-rmse:45.565170    valid-rmse:46.235214 
## [201]    train-rmse:45.151196    valid-rmse:46.161350 
## [300]    train-rmse:44.833618    valid-rmse:46.085770

After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. A larger number of rounds would exceed the Kaggle run-time limit, so the CV is capped here at 300 rounds. The right number of rounds depends on your XGBoost parameters. The early_stopping_rounds parameter stops the CV fit once the test RMSE has not improved for 10 consecutive rounds.
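
The early-stopping rule itself is simple enough to sketch in base R. The function below is a hypothetical illustration of the rule as described in the log ("will train until test_rmse hasn't improved in 10 rounds"), not XGBoost's implementation; `early_stop()` and the `scores` vector are made up for the example, and a patience of 3 is used to keep it short.

```r
# Return the best round and the round at which training would stop,
# given a vector of per-round validation scores (lower is better).
early_stop <- function(scores, patience) {
  best <- Inf
  best_iter <- 0L
  for (i in seq_along(scores)) {
    if (scores[i] < best) {
      # New best score: remember it and reset the waiting period.
      best <- scores[i]
      best_iter <- i
    } else if (i - best_iter >= patience) {
      # No improvement for `patience` consecutive rounds: stop here.
      return(list(best_iter = best_iter, stopped_at = i))
    }
  }
  # The score kept improving often enough: all rounds were used.
  list(best_iter = best_iter, stopped_at = length(scores))
}

scores <- c(10, 8, 7, 6.5, 6.6, 6.7, 6.4, 6.5, 6.6, 6.7)
res <- early_stop(scores, patience = 3)
print(res)
```

In this toy series the best score is reached at round 7, and the run stops at round 10 after three rounds without improvement.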

xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=300)
## [01:39:00] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:00] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:01] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:01] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:02] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:109.438702+0.559431  test-rmse:109.440683+0.619834 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:84.894603+0.718813   test-rmse:84.902556+0.904187 
## [3]  train-rmse:69.485803+1.021398   test-rmse:69.501114+1.134613 
## [4]  train-rmse:59.894737+0.571888   test-rmse:59.907986+0.585138 
## [5]  train-rmse:54.314114+0.344400   test-rmse:54.325171+0.397595 
## [6]  train-rmse:51.589322+0.310478   test-rmse:51.613922+0.443222 
## [7]  train-rmse:49.777984+0.278846   test-rmse:49.819305+0.389955 
## [8]  train-rmse:48.825447+0.318711   test-rmse:48.873666+0.379909 
## [9]  train-rmse:48.303604+0.389986   test-rmse:48.357178+0.307877 
## [10] train-rmse:47.881085+0.246625   test-rmse:47.935491+0.347936 
## [11] train-rmse:47.613717+0.145970   test-rmse:47.675505+0.423193 
## [12] train-rmse:47.447366+0.118329   test-rmse:47.516430+0.447960 
## [13] train-rmse:47.315423+0.147158   test-rmse:47.392815+0.430028 
## [14] train-rmse:47.224849+0.149517   test-rmse:47.312397+0.425389 
## [15] train-rmse:47.139268+0.137270   test-rmse:47.236465+0.429832 
## [16] train-rmse:47.072858+0.107285   test-rmse:47.175311+0.446108 
## [17] train-rmse:47.012226+0.104342   test-rmse:47.124238+0.448828 
## [18] train-rmse:46.952984+0.088795   test-rmse:47.069958+0.459836 
## [19] train-rmse:46.878547+0.099424   test-rmse:46.995417+0.444797 
## [20] train-rmse:46.829338+0.092866   test-rmse:46.949340+0.446384 
## [21] train-rmse:46.792914+0.090098   test-rmse:46.918788+0.449279 
## [22] train-rmse:46.741657+0.092691   test-rmse:46.875304+0.449124 
## [23] train-rmse:46.704360+0.102778   test-rmse:46.846227+0.443588 
## [24] train-rmse:46.672100+0.105100   test-rmse:46.825277+0.445101 
## [25] train-rmse:46.635297+0.107106   test-rmse:46.797864+0.437146 
## [26] train-rmse:46.596321+0.094237   test-rmse:46.762611+0.439128 
## [27] train-rmse:46.560267+0.094102   test-rmse:46.733967+0.432636 
## [28] train-rmse:46.534351+0.094057   test-rmse:46.713133+0.433678 
## [29] train-rmse:46.513647+0.091461   test-rmse:46.698022+0.437434 
## [30] train-rmse:46.481928+0.091837   test-rmse:46.662455+0.428521 
## [31] train-rmse:46.458511+0.087206   test-rmse:46.644716+0.432821 
## [32] train-rmse:46.436820+0.086933   test-rmse:46.629818+0.434153 
## [33] train-rmse:46.410554+0.080262   test-rmse:46.604999+0.438643 
## [34] train-rmse:46.375812+0.099476   test-rmse:46.576962+0.415402 
## [35] train-rmse:46.350275+0.111620   test-rmse:46.560944+0.400779 
## [36] train-rmse:46.325918+0.119989   test-rmse:46.540186+0.394581 
## [37] train-rmse:46.307871+0.118590   test-rmse:46.526142+0.393106 
## [38] train-rmse:46.279480+0.113912   test-rmse:46.502864+0.389999 
## [39] train-rmse:46.263217+0.111436   test-rmse:46.495633+0.391198 
## [40] train-rmse:46.247558+0.115388   test-rmse:46.485359+0.385311 
## [41] train-rmse:46.218600+0.113302   test-rmse:46.460198+0.388392 
## [42] train-rmse:46.201891+0.111633   test-rmse:46.448116+0.391217 
## [43] train-rmse:46.177470+0.114666   test-rmse:46.426009+0.385887 
## [44] train-rmse:46.169138+0.115109   test-rmse:46.423231+0.383833 
## [45] train-rmse:46.152372+0.115892   test-rmse:46.414399+0.378098 
## [46] train-rmse:46.138344+0.108031   test-rmse:46.405788+0.385344 
## [47] train-rmse:46.119640+0.108681   test-rmse:46.387619+0.386052 
## [48] train-rmse:46.096558+0.104960   test-rmse:46.367865+0.393450 
## [49] train-rmse:46.070429+0.089682   test-rmse:46.343572+0.406688 
## [50] train-rmse:46.053349+0.084951   test-rmse:46.328449+0.409164 
## [51] train-rmse:46.033454+0.096135   test-rmse:46.310957+0.398398 
## [52] train-rmse:46.005325+0.102338   test-rmse:46.289720+0.393850 
## [53] train-rmse:45.996831+0.102862   test-rmse:46.285611+0.391098 
## [54] train-rmse:45.981652+0.096508   test-rmse:46.272963+0.394636 
## [55] train-rmse:45.972262+0.096199   test-rmse:46.268758+0.395324 
## [56] train-rmse:45.955748+0.096589   test-rmse:46.265627+0.393099 
## [57] train-rmse:45.943023+0.103073   test-rmse:46.256781+0.388549 
## [58] train-rmse:45.930495+0.102585   test-rmse:46.247333+0.389796 
## [59] train-rmse:45.921631+0.101210   test-rmse:46.243295+0.392078 
## [60] train-rmse:45.910559+0.099743   test-rmse:46.233862+0.393873 
## [61] train-rmse:45.897163+0.095237   test-rmse:46.225727+0.397722 
## [62] train-rmse:45.890328+0.095729   test-rmse:46.227280+0.398735 
## [63] train-rmse:45.882620+0.095649   test-rmse:46.227895+0.402963 
## [64] train-rmse:45.874048+0.099698   test-rmse:46.226121+0.398159 
## [65] train-rmse:45.867832+0.100948   test-rmse:46.224503+0.398578 
## [66] train-rmse:45.858866+0.096844   test-rmse:46.219268+0.402189 
## [67] train-rmse:45.851366+0.094488   test-rmse:46.214951+0.402451 
## [68] train-rmse:45.835413+0.094237   test-rmse:46.203290+0.401946 
## [69] train-rmse:45.826470+0.091611   test-rmse:46.198814+0.404163 
## [70] train-rmse:45.817159+0.093187   test-rmse:46.193200+0.404034 
## [71] train-rmse:45.804537+0.095164   test-rmse:46.188594+0.403349 
## [72] train-rmse:45.793393+0.092860   test-rmse:46.183155+0.406127 
## [73] train-rmse:45.782345+0.098632   test-rmse:46.175396+0.397158 
## [74] train-rmse:45.770280+0.094341   test-rmse:46.168652+0.401127 
## [75] train-rmse:45.765028+0.094839   test-rmse:46.167232+0.397421 
## [76] train-rmse:45.753767+0.092104   test-rmse:46.160619+0.397639 
## [77] train-rmse:45.746474+0.093464   test-rmse:46.157053+0.398226 
## [78] train-rmse:45.740511+0.094070   test-rmse:46.157086+0.395058 
## [79] train-rmse:45.727780+0.087412   test-rmse:46.145632+0.401226 
## [80] train-rmse:45.722180+0.090232   test-rmse:46.142101+0.400225 
## [81] train-rmse:45.706887+0.087424   test-rmse:46.132990+0.403812 
## [82] train-rmse:45.694583+0.087184   test-rmse:46.122985+0.403369 
## [83] train-rmse:45.687182+0.088302   test-rmse:46.124319+0.405468 
## [84] train-rmse:45.677532+0.095221   test-rmse:46.118461+0.398551 
## [85] train-rmse:45.669465+0.092271   test-rmse:46.113824+0.399925 
## [86] train-rmse:45.660502+0.090982   test-rmse:46.110171+0.402120 
## [87] train-rmse:45.649854+0.093473   test-rmse:46.102174+0.399998 
## [88] train-rmse:45.642434+0.095430   test-rmse:46.095586+0.397447 
## [89] train-rmse:45.635644+0.097198   test-rmse:46.089239+0.394088 
## [90] train-rmse:45.628621+0.098191   test-rmse:46.084952+0.390051 
## [91] train-rmse:45.621129+0.096414   test-rmse:46.084302+0.396479 
## [92] train-rmse:45.614175+0.094711   test-rmse:46.081956+0.397250 
## [93] train-rmse:45.608479+0.096503   test-rmse:46.075993+0.397114 
## [94] train-rmse:45.600951+0.096525   test-rmse:46.073662+0.396759 
## [95] train-rmse:45.593455+0.099667   test-rmse:46.070815+0.394864 
## [96] train-rmse:45.588517+0.101237   test-rmse:46.070628+0.395557 
## [97] train-rmse:45.580947+0.102931   test-rmse:46.068772+0.394096 
## [98] train-rmse:45.575790+0.104361   test-rmse:46.067573+0.394005 
## [99] train-rmse:45.567061+0.106474   test-rmse:46.062012+0.388595 
## [100]    train-rmse:45.560835+0.105843   test-rmse:46.061743+0.387334 
## [101]    train-rmse:45.553590+0.107416   test-rmse:46.063715+0.382370 
## [102]    train-rmse:45.545368+0.109430   test-rmse:46.059541+0.379140 
## [103]    train-rmse:45.537080+0.110874   test-rmse:46.058841+0.380323 
## [104]    train-rmse:45.523985+0.106184   test-rmse:46.047968+0.384215 
## [105]    train-rmse:45.519158+0.106633   test-rmse:46.047930+0.384796 
## [106]    train-rmse:45.511317+0.105058   test-rmse:46.045711+0.384642 
## [107]    train-rmse:45.502442+0.110509   test-rmse:46.041018+0.383337 
## [108]    train-rmse:45.495631+0.111530   test-rmse:46.040226+0.381890 
## [109]    train-rmse:45.489730+0.111894   test-rmse:46.038902+0.382750 
## [110]    train-rmse:45.480850+0.112727   test-rmse:46.035070+0.384782 
## [111]    train-rmse:45.472446+0.111709   test-rmse:46.030100+0.387983 
## [112]    train-rmse:45.467838+0.111339   test-rmse:46.029602+0.389600 
## [113]    train-rmse:45.464172+0.110385   test-rmse:46.030675+0.390175 
## [114]    train-rmse:45.461120+0.110866   test-rmse:46.029289+0.389483 
## [115]    train-rmse:45.453600+0.114780   test-rmse:46.024248+0.384036 
## [116]    train-rmse:45.447788+0.115955   test-rmse:46.021204+0.380277 
## [117]    train-rmse:45.442711+0.116291   test-rmse:46.017592+0.378985 
## [118]    train-rmse:45.436595+0.116952   test-rmse:46.016781+0.377853 
## [119]    train-rmse:45.431622+0.118288   test-rmse:46.014191+0.377601 
## [120]    train-rmse:45.424943+0.117829   test-rmse:46.010490+0.378689 
## [121]    train-rmse:45.415434+0.119387   test-rmse:46.012794+0.375269 
## [122]    train-rmse:45.408988+0.122145   test-rmse:46.008726+0.372006 
## [123]    train-rmse:45.403700+0.121970   test-rmse:46.005849+0.374101 
## [124]    train-rmse:45.396910+0.121602   test-rmse:46.005533+0.374079 
## [125]    train-rmse:45.391816+0.121186   test-rmse:46.009264+0.376061 
## [126]    train-rmse:45.384869+0.119488   test-rmse:46.007458+0.375526 
## [127]    train-rmse:45.380136+0.117342   test-rmse:46.006281+0.376821 
## [128]    train-rmse:45.374454+0.116486   test-rmse:46.004682+0.374953 
## [129]    train-rmse:45.367886+0.116091   test-rmse:46.000495+0.376947 
## [130]    train-rmse:45.362262+0.116303   test-rmse:46.002694+0.377098 
## [131]    train-rmse:45.356953+0.114891   test-rmse:46.000604+0.378858 
## [132]    train-rmse:45.349868+0.113813   test-rmse:45.995538+0.379066 
## [133]    train-rmse:45.345124+0.113126   test-rmse:45.997368+0.376386 
## [134]    train-rmse:45.339609+0.112229   test-rmse:45.994669+0.373599 
## [135]    train-rmse:45.334460+0.113855   test-rmse:45.994750+0.373977 
## [136]    train-rmse:45.329550+0.112310   test-rmse:45.992655+0.374772 
## [137]    train-rmse:45.325485+0.111417   test-rmse:45.997994+0.376721 
## [138]    train-rmse:45.318647+0.112062   test-rmse:45.993290+0.374061 
## [139]    train-rmse:45.313020+0.112085   test-rmse:45.994011+0.373814 
## [140]    train-rmse:45.309931+0.111983   test-rmse:45.992916+0.377432 
## [141]    train-rmse:45.303262+0.109068   test-rmse:45.989247+0.379591 
## [142]    train-rmse:45.297079+0.107220   test-rmse:45.981874+0.382213 
## [143]    train-rmse:45.293496+0.107134   test-rmse:45.984377+0.386339 
## [144]    train-rmse:45.288324+0.107523   test-rmse:45.984695+0.384871 
## [145]    train-rmse:45.284608+0.108439   test-rmse:45.984770+0.384042 
## [146]    train-rmse:45.280708+0.107661   test-rmse:45.985532+0.382651 
## [147]    train-rmse:45.277760+0.107509   test-rmse:45.982391+0.381723 
## [148]    train-rmse:45.271954+0.109191   test-rmse:45.982878+0.382488 
## [149]    train-rmse:45.265923+0.107376   test-rmse:45.981250+0.380495 
## [150]    train-rmse:45.262325+0.106957   test-rmse:45.980341+0.379965 
## [151]    train-rmse:45.257532+0.107106   test-rmse:45.982121+0.379787 
## [152]    train-rmse:45.253326+0.105689   test-rmse:45.979599+0.381825 
## [153]    train-rmse:45.245923+0.103926   test-rmse:45.976024+0.385324 
## [154]    train-rmse:45.238992+0.104069   test-rmse:45.979707+0.388182 
## [155]    train-rmse:45.232452+0.106430   test-rmse:45.976478+0.384663 
## [156]    train-rmse:45.228419+0.106536   test-rmse:45.976254+0.383833 
## [157]    train-rmse:45.220896+0.107421   test-rmse:45.974044+0.385273 
## [158]    train-rmse:45.215640+0.104557   test-rmse:45.969373+0.385891 
## [159]    train-rmse:45.211986+0.104489   test-rmse:45.972313+0.384611 
## [160]    train-rmse:45.206991+0.104936   test-rmse:45.970402+0.381735 
## [161]    train-rmse:45.203828+0.105907   test-rmse:45.971630+0.381604 
## [162]    train-rmse:45.200187+0.106739   test-rmse:45.972697+0.381342 
## [163]    train-rmse:45.194403+0.104078   test-rmse:45.970814+0.381737 
## [164]    train-rmse:45.190692+0.106271   test-rmse:45.969705+0.379677 
## [165]    train-rmse:45.185835+0.105541   test-rmse:45.969140+0.379014 
## [166]    train-rmse:45.183518+0.105773   test-rmse:45.966428+0.377351 
## [167]    train-rmse:45.178522+0.107149   test-rmse:45.963833+0.377128 
## [168]    train-rmse:45.171958+0.108809   test-rmse:45.956678+0.379922 
## [169]    train-rmse:45.167024+0.107581   test-rmse:45.955903+0.381604 
## [170]    train-rmse:45.162596+0.108407   test-rmse:45.954861+0.380690 
## [171]    train-rmse:45.156151+0.105399   test-rmse:45.954975+0.382678 
## [172]    train-rmse:45.151859+0.106585   test-rmse:45.959971+0.383140 
## [173]    train-rmse:45.147759+0.105442   test-rmse:45.957592+0.383399 
## [174]    train-rmse:45.143932+0.105284   test-rmse:45.956418+0.383989 
## [175]    train-rmse:45.140186+0.104551   test-rmse:45.953863+0.384547 
## [176]    train-rmse:45.132823+0.103931   test-rmse:45.954833+0.382546 
## [177]    train-rmse:45.128508+0.103835   test-rmse:45.954583+0.382270 
## [178]    train-rmse:45.124426+0.104018   test-rmse:45.954285+0.382404 
## [179]    train-rmse:45.118513+0.104674   test-rmse:45.952400+0.383345 
## [180]    train-rmse:45.114346+0.102893   test-rmse:45.950305+0.388413 
## [181]    train-rmse:45.110950+0.102957   test-rmse:45.949064+0.390227 
## [182]    train-rmse:45.105926+0.103420   test-rmse:45.944723+0.390256 
## [183]    train-rmse:45.101180+0.103310   test-rmse:45.947958+0.388871 
## [184]    train-rmse:45.096868+0.102308   test-rmse:45.947850+0.389898 
## [185]    train-rmse:45.094074+0.102612   test-rmse:45.949082+0.387970 
## [186]    train-rmse:45.090502+0.102335   test-rmse:45.950592+0.384964 
## [187]    train-rmse:45.087356+0.102407   test-rmse:45.950372+0.387534 
## [188]    train-rmse:45.082444+0.101214   test-rmse:45.952331+0.387627 
## [189]    train-rmse:45.077558+0.100186   test-rmse:45.949451+0.388117 
## [190]    train-rmse:45.072720+0.102014   test-rmse:45.950515+0.387894 
## [191]    train-rmse:45.069019+0.103195   test-rmse:45.949208+0.386652 
## [192]    train-rmse:45.065781+0.102559   test-rmse:45.950189+0.386796 
## Stopping. Best iteration:
## [182]    train-rmse:45.105926+0.103420   test-rmse:45.944723+0.390256
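The log above stops at round 192 and reports round 182 as the best: with `early_stopping_rounds = 10`, training ends once 10 rounds pass without a new best test RMSE. The stopping rule can be sketched in base R as follows. This is an illustrative re-implementation, not xgboost's own code; the function name `early_stop` and the toy scores are hypothetical.

```r
# Illustrative early-stopping rule (hypothetical helper, not part of xgboost).
early_stop <- function(test_rmse, patience = 10) {
  best_iter <- 1
  best_score <- test_rmse[1]
  for (i in seq_along(test_rmse)) {
    if (test_rmse[i] < best_score) {
      best_score <- test_rmse[i]
      best_iter <- i
    }
    # Stop once `patience` rounds have passed without a new best score.
    if (i - best_iter >= patience) {
      return(list(best_iteration = best_iter, best_score = best_score, stopped_at = i))
    }
  }
  # The scores ran out before the patience limit was reached.
  list(best_iteration = best_iter, best_score = best_score, stopped_at = length(test_rmse))
}

# Toy scores (made up): improvement until round 3, then no further gain.
scores <- c(50, 48, 47, 47.2, 47.1, 47.3, 47.5)
res <- early_stop(scores, patience = 3)
str(res)
```

With a patience of 3, the toy run stops three rounds after its best round, mirroring the 182/192 pattern in the log.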
yo_preds <- predict(gb_dt_final,dtest)

The prediction step works:

yo_preds %>% head
## [1] 135.6260 155.6035 161.2986 121.7719 152.7737 200.6128

Y1

data.fe.reduced.1<-data.fe[,-c("delta.selection.departure","delta.selection.presentation","emergency.vehicle.selection")]

Validation

set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.1$delta.departure.presentation, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.1[trainIndex,]
valid=data.fe.reduced.1[-trainIndex,]
foo <- train %>% select(-delta.departure.presentation)
bar <- valid %>% select(-delta.departure.presentation)

dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.departure.presentation)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.departure.presentation)
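gb_params_final <- list(colsample_bytree = 0.7, # fraction of variables sampled per tree
                   subsample = 0.7, # fraction of rows sampled per tree
                   booster = "gbtree",
                   max_depth = 5, # maximum tree depth
                   eta = 0.3, # learning rate (shrinkage)
                   eval_metric = "rmse", 
                   objective = "reg:linear", # deprecated alias of "reg:squarederror"
                   seed = 4321 # ignored by the R package; set.seed() is used instead
                   )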
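One point worth checking in the `data.matrix()` calls above: `data.matrix()` turns factor columns into their internal integer codes rather than one-hot columns, so the trees see an arbitrary ordering of the levels. A minimal sketch, assuming a small made-up data frame (the column names and level labels are hypothetical and not taken from the project data):

```r
# Hypothetical toy data: one factor column and one numeric column.
df <- data.frame(
  vehicle  = factor(c("VSAV", "PSE", "VSAV")),
  distance = c(1.2, 3.4, 0.8)
)

# data.matrix() replaces each factor by its integer level code.
# Levels are sorted alphabetically by default: "PSE" = 1, "VSAV" = 2.
m <- data.matrix(df)
print(levels(df$vehicle))
print(m)
```

If the ordering of the levels carries no meaning, grouping or one-hot encoding the factor beforehand may be preferable.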

watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
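# Note: xgb_params is passed here, not the gb_params_final list defined above
# (the call echoed in the warning below confirms this).
gb_dt_final <- xgb.train(params = xgb_params,
                   data = dtrain,
                   print_every_n = 500,
                   watchlist = watchlist,
                   nrounds = 300)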
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 500, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:39:59] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:281.581818   valid-rmse:281.096405 
## [300]    train-rmse:121.854195   valid-rmse:123.892708
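The `train-rmse` and `valid-rmse` figures in the log are root mean squared errors, in the same unit as the target. As a reminder of what is being measured, here is a minimal base R sketch; the helper name `rmse` and the three toy values are assumptions for illustration only.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values (made up), in the same unit as the target.
actual    <- c(100, 150, 200)
predicted <- c(110, 140, 200)
toy_rmse  <- rmse(actual, predicted)
print(toy_rmse)
```

A validation RMSE that stays close to the training RMSE, as in the log above, suggests limited overfitting at 300 rounds.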

After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. The call below allows up to 1500 rounds; the early-stopping parameter stops the CV fit once the test RMSE has not improved for 10 consecutive rounds.
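To make the `mean+sd` figures in the CV log easier to read, here is a rough base R sketch of 5-fold cross-validation. It is only an illustration under stated assumptions: it uses the built-in `mtcars` data and a linear model instead of the project data and XGBoost, and the standard deviation convention may differ from the one xgboost prints.

```r
set.seed(4321)
k <- 5

# Randomly assign each row of the built-in mtcars data to one of k folds.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold.
fold_rmse <- sapply(seq_len(k), function(f) {
  train <- mtcars[fold_id != f, ]
  test  <- mtcars[fold_id == f, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

# Report the mean and spread across folds, in the style of the CV log.
cat(sprintf("test-rmse:%f+%f\n", mean(fold_rmse), sd(fold_rmse)))
```

Each line of the `xgb.cv` output summarises the per-fold scores in this way for one boosting round.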

xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=1500)
## [01:40:25] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:40:26] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:40:26] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:40:26] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:40:27] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:282.703515+2.044958  test-rmse:282.702063+1.886475 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:218.929465+1.463350  test-rmse:218.955466+1.366829 
## [3]  train-rmse:179.388782+0.922182  test-rmse:179.439578+1.124537 
## [4]  train-rmse:156.524942+0.601086  test-rmse:156.619220+1.412488 
## [5]  train-rmse:143.643131+0.368277  test-rmse:143.771579+1.294501 
## [6]  train-rmse:136.784097+0.374349  test-rmse:136.925977+1.336595 
## [7]  train-rmse:133.034320+0.245124  test-rmse:133.173413+1.127329 
## [8]  train-rmse:131.106418+0.238518  test-rmse:131.271310+1.080970 
## [9]  train-rmse:129.921249+0.203774  test-rmse:130.094870+0.971673 
## [10] train-rmse:129.221500+0.212954  test-rmse:129.418523+0.924257 
## [11] train-rmse:128.797443+0.186786  test-rmse:129.030980+0.916174 
## [12] train-rmse:128.471445+0.166120  test-rmse:128.712482+0.925961 
## [13] train-rmse:128.235010+0.168303  test-rmse:128.484287+0.915703 
## [14] train-rmse:128.029650+0.155856  test-rmse:128.283845+0.936030 
## [15] train-rmse:127.870262+0.168881  test-rmse:128.135141+0.928278 
## [16] train-rmse:127.731982+0.178498  test-rmse:128.017876+0.908642 
## [17] train-rmse:127.591865+0.195014  test-rmse:127.892786+0.891753 
## [18] train-rmse:127.458642+0.161967  test-rmse:127.761642+0.934695 
## [19] train-rmse:127.325392+0.184251  test-rmse:127.630139+0.902832 
## [20] train-rmse:127.199688+0.200809  test-rmse:127.518605+0.897174 
## [21] train-rmse:127.122252+0.198951  test-rmse:127.448506+0.898978 
## [22] train-rmse:127.042662+0.207185  test-rmse:127.381604+0.887989 
## [23] train-rmse:126.952083+0.203202  test-rmse:127.308566+0.892925 
## [24] train-rmse:126.841679+0.223267  test-rmse:127.190930+0.865984 
## [25] train-rmse:126.771108+0.234352  test-rmse:127.133368+0.847293 
## [26] train-rmse:126.697027+0.245487  test-rmse:127.072899+0.835085 
## [27] train-rmse:126.624141+0.256402  test-rmse:127.001717+0.816570 
## [28] train-rmse:126.558125+0.242914  test-rmse:126.950719+0.814846 
## [29] train-rmse:126.486377+0.241149  test-rmse:126.893077+0.808848 
## [30] train-rmse:126.433842+0.225320  test-rmse:126.861282+0.817011 
## [31] train-rmse:126.374431+0.221572  test-rmse:126.801906+0.818437 
## [32] train-rmse:126.326585+0.226684  test-rmse:126.765683+0.801218 
## [33] train-rmse:126.267470+0.211786  test-rmse:126.715581+0.818008 
## [34] train-rmse:126.203683+0.230844  test-rmse:126.670879+0.806378 
## [35] train-rmse:126.155077+0.238875  test-rmse:126.640695+0.798791 
## [36] train-rmse:126.115436+0.241824  test-rmse:126.611569+0.797617 
## [37] train-rmse:126.061203+0.230213  test-rmse:126.565912+0.816225 
## [38] train-rmse:126.010155+0.229980  test-rmse:126.525464+0.818805 
## [39] train-rmse:125.970544+0.229664  test-rmse:126.497513+0.818607 
## [40] train-rmse:125.923646+0.213149  test-rmse:126.461865+0.833388 
## [41] train-rmse:125.880949+0.203930  test-rmse:126.427788+0.833575 
## [42] train-rmse:125.838098+0.207238  test-rmse:126.383000+0.842389 
## [43] train-rmse:125.785091+0.195117  test-rmse:126.340544+0.838548 
## [44] train-rmse:125.747461+0.181158  test-rmse:126.309015+0.844894 
## [45] train-rmse:125.691655+0.192677  test-rmse:126.259523+0.830306 
## [46] train-rmse:125.639765+0.188534  test-rmse:126.218234+0.832650 
## [47] train-rmse:125.605250+0.197371  test-rmse:126.191342+0.827306 
## [48] train-rmse:125.569556+0.193846  test-rmse:126.163077+0.831638 
## [49] train-rmse:125.534003+0.199433  test-rmse:126.138173+0.829639 
## [50] train-rmse:125.498117+0.198589  test-rmse:126.116321+0.836811 
## [51] train-rmse:125.469231+0.193087  test-rmse:126.093979+0.837198 
## [52] train-rmse:125.424967+0.188518  test-rmse:126.067302+0.838145 
## [53] train-rmse:125.386867+0.205427  test-rmse:126.031917+0.811170 
## [54] train-rmse:125.345363+0.199397  test-rmse:126.001122+0.814600 
## [55] train-rmse:125.305042+0.203959  test-rmse:125.961874+0.802213 
## [56] train-rmse:125.277211+0.201368  test-rmse:125.943977+0.805934 
## [57] train-rmse:125.226792+0.190045  test-rmse:125.901520+0.826776 
## [58] train-rmse:125.190959+0.189146  test-rmse:125.875571+0.834384 
## [59] train-rmse:125.169673+0.185201  test-rmse:125.861639+0.835608 
## [60] train-rmse:125.144896+0.195900  test-rmse:125.841919+0.826982 
## [61] train-rmse:125.106685+0.193277  test-rmse:125.820224+0.831770 
## [62] train-rmse:125.081052+0.189589  test-rmse:125.802141+0.834579 
## [63] train-rmse:125.050079+0.197859  test-rmse:125.788423+0.834169 
## [64] train-rmse:125.010675+0.210347  test-rmse:125.762556+0.830253 
## [65] train-rmse:124.986939+0.216104  test-rmse:125.749217+0.819989 
## [66] train-rmse:124.960171+0.217233  test-rmse:125.733715+0.820735 
## [67] train-rmse:124.939699+0.215158  test-rmse:125.716045+0.819372 
## [68] train-rmse:124.915331+0.219360  test-rmse:125.696565+0.822305 
## [69] train-rmse:124.897646+0.217097  test-rmse:125.687781+0.827521 
## [70] train-rmse:124.876439+0.210361  test-rmse:125.676044+0.838766 
## [71] train-rmse:124.856493+0.214544  test-rmse:125.663767+0.836737 
## [72] train-rmse:124.818239+0.209000  test-rmse:125.633214+0.846154 
## [73] train-rmse:124.782941+0.198264  test-rmse:125.595021+0.858412 
## [74] train-rmse:124.749162+0.206003  test-rmse:125.573341+0.852352 
## [75] train-rmse:124.716267+0.216702  test-rmse:125.548451+0.838944 
## [76] train-rmse:124.697811+0.209454  test-rmse:125.536493+0.845908 
## [77] train-rmse:124.673810+0.203714  test-rmse:125.520654+0.854118 
## [78] train-rmse:124.652530+0.201940  test-rmse:125.503529+0.857069 
## [79] train-rmse:124.638992+0.202696  test-rmse:125.497584+0.853056 
## [80] train-rmse:124.616330+0.207956  test-rmse:125.490674+0.838532 
## [81] train-rmse:124.601074+0.207526  test-rmse:125.484125+0.840142 
## [82] train-rmse:124.574916+0.206030  test-rmse:125.470949+0.845956 
## [83] train-rmse:124.554855+0.211355  test-rmse:125.467580+0.836485 
## [84] train-rmse:124.532333+0.228419  test-rmse:125.455859+0.817522 
## [85] train-rmse:124.516373+0.226317  test-rmse:125.446210+0.815057 
## [86] train-rmse:124.489075+0.216611  test-rmse:125.431657+0.820862 
## [87] train-rmse:124.460928+0.223182  test-rmse:125.411769+0.815325 
## [88] train-rmse:124.441197+0.220566  test-rmse:125.404826+0.820964 
## [89] train-rmse:124.414539+0.223098  test-rmse:125.388025+0.821629 
## [90] train-rmse:124.390385+0.209885  test-rmse:125.379727+0.833791 
## [91] train-rmse:124.368117+0.202071  test-rmse:125.366159+0.839067 
## [92] train-rmse:124.346083+0.205148  test-rmse:125.357337+0.845494 
## [93] train-rmse:124.329237+0.201805  test-rmse:125.348677+0.844546 
## [94] train-rmse:124.300706+0.208897  test-rmse:125.334479+0.834123 
## [95] train-rmse:124.280487+0.203953  test-rmse:125.321832+0.844934 
## [96] train-rmse:124.268350+0.203069  test-rmse:125.318417+0.842680 
## [97] train-rmse:124.244969+0.212763  test-rmse:125.296135+0.834938 
## [98] train-rmse:124.219591+0.212004  test-rmse:125.273372+0.834482 
## [99] train-rmse:124.198354+0.207942  test-rmse:125.254198+0.841005 
## [100]    train-rmse:124.169476+0.213721  test-rmse:125.235422+0.838064 
## [101]    train-rmse:124.150229+0.210763  test-rmse:125.230492+0.842126 
## [102]    train-rmse:124.136728+0.209839  test-rmse:125.226146+0.848534 
## [103]    train-rmse:124.114752+0.209701  test-rmse:125.215111+0.854951 
## [104]    train-rmse:124.094069+0.212273  test-rmse:125.192407+0.854568 
## [105]    train-rmse:124.071657+0.221118  test-rmse:125.190526+0.851523 
## [106]    train-rmse:124.056296+0.217983  test-rmse:125.184055+0.853253 
## [107]    train-rmse:124.037360+0.214577  test-rmse:125.179395+0.856461 
## [108]    train-rmse:124.023198+0.217073  test-rmse:125.174644+0.856415 
## [109]    train-rmse:123.994272+0.215885  test-rmse:125.152083+0.858963 
## [110]    train-rmse:123.977367+0.223131  test-rmse:125.146132+0.849085 
## [111]    train-rmse:123.954877+0.232329  test-rmse:125.129021+0.843463 
## [112]    train-rmse:123.928111+0.236565  test-rmse:125.118430+0.835341 
## [113]    train-rmse:123.907115+0.238853  test-rmse:125.112030+0.834479 
## [114]    train-rmse:123.890954+0.240151  test-rmse:125.107970+0.828642 
## [115]    train-rmse:123.873549+0.233621  test-rmse:125.103181+0.836711 
## [116]    train-rmse:123.859955+0.236917  test-rmse:125.102930+0.839444 
## [117]    train-rmse:123.848044+0.237979  test-rmse:125.097391+0.837442 
## [118]    train-rmse:123.836887+0.231834  test-rmse:125.093994+0.843417 
## [119]    train-rmse:123.815416+0.237891  test-rmse:125.080812+0.839786 
## [120]    train-rmse:123.795982+0.235941  test-rmse:125.077800+0.839567 
## [121]    train-rmse:123.778194+0.230650  test-rmse:125.071588+0.841475 
## [122]    train-rmse:123.755693+0.222981  test-rmse:125.055562+0.850295 
## [123]    train-rmse:123.736705+0.219992  test-rmse:125.050154+0.843424 
## [124]    train-rmse:123.718704+0.216090  test-rmse:125.043178+0.845905 
## [125]    train-rmse:123.700588+0.210073  test-rmse:125.027925+0.853678 
## [126]    train-rmse:123.675894+0.210793  test-rmse:125.007852+0.849176 
## [127]    train-rmse:123.664406+0.210704  test-rmse:125.004581+0.848155 
## [128]    train-rmse:123.641428+0.213283  test-rmse:124.986501+0.850398 
## [129]    train-rmse:123.622099+0.215967  test-rmse:124.979970+0.854711 
## [130]    train-rmse:123.609581+0.211986  test-rmse:124.983769+0.860617 
## [131]    train-rmse:123.593855+0.214852  test-rmse:124.979529+0.859937 
## [132]    train-rmse:123.582040+0.213258  test-rmse:124.976103+0.857515 
## [133]    train-rmse:123.561973+0.217257  test-rmse:124.961093+0.855444 
## [134]    train-rmse:123.549158+0.217683  test-rmse:124.952711+0.858713 
## [135]    train-rmse:123.536905+0.216968  test-rmse:124.951578+0.862451 
## [136]    train-rmse:123.520061+0.217543  test-rmse:124.943797+0.860185 
## [137]    train-rmse:123.508148+0.219179  test-rmse:124.939552+0.860109 
## [138]    train-rmse:123.491881+0.223549  test-rmse:124.936464+0.851047 
## [139]    train-rmse:123.476079+0.231766  test-rmse:124.931679+0.840744 
## [140]    train-rmse:123.460965+0.237687  test-rmse:124.926697+0.838108 
## [141]    train-rmse:123.446237+0.243254  test-rmse:124.922507+0.835648 
## [142]    train-rmse:123.432312+0.241219  test-rmse:124.918463+0.836149 
## [143]    train-rmse:123.418791+0.241527  test-rmse:124.913319+0.835665 
## [144]    train-rmse:123.406230+0.240045  test-rmse:124.911740+0.837834 
## [145]    train-rmse:123.386075+0.243063  test-rmse:124.905865+0.834632 
## [146]    train-rmse:123.373619+0.248370  test-rmse:124.900865+0.832215 
## [147]    train-rmse:123.357594+0.248850  test-rmse:124.890523+0.831822 
## [148]    train-rmse:123.340611+0.247387  test-rmse:124.877173+0.828970 
## [149]    train-rmse:123.329089+0.244405  test-rmse:124.875403+0.829551 
## [150]    train-rmse:123.315657+0.244560  test-rmse:124.868761+0.826650 
## [151]    train-rmse:123.299795+0.239596  test-rmse:124.860689+0.826716 
## [152]    train-rmse:123.286383+0.232553  test-rmse:124.858412+0.826691 
## [153]    train-rmse:123.275934+0.238177  test-rmse:124.856793+0.825227 
## [154]    train-rmse:123.260172+0.238620  test-rmse:124.853363+0.823940 
## [155]    train-rmse:123.251682+0.237573  test-rmse:124.846019+0.825654 
## [156]    train-rmse:123.236238+0.243332  test-rmse:124.833725+0.814806 
## [157]    train-rmse:123.223924+0.240597  test-rmse:124.831728+0.815028 
## [158]    train-rmse:123.211646+0.234858  test-rmse:124.827231+0.815885 
## [159]    train-rmse:123.198674+0.235413  test-rmse:124.826764+0.817849 
## [160]    train-rmse:123.186759+0.237144  test-rmse:124.826462+0.815489 
## [161]    train-rmse:123.176599+0.237559  test-rmse:124.820580+0.814578 
## [162]    train-rmse:123.162610+0.237635  test-rmse:124.815803+0.816051 
## [163]    train-rmse:123.150671+0.240134  test-rmse:124.815997+0.816690 
## [164]    train-rmse:123.134381+0.235918  test-rmse:124.805919+0.818836 
## [165]    train-rmse:123.123396+0.237924  test-rmse:124.802463+0.817056 
## [166]    train-rmse:123.113959+0.237513  test-rmse:124.803487+0.819214 
## [167]    train-rmse:123.102180+0.238519  test-rmse:124.796207+0.815642 
## [168]    train-rmse:123.088808+0.236735  test-rmse:124.790979+0.813379 
## [169]    train-rmse:123.077006+0.237550  test-rmse:124.791929+0.821566 
## [170]    train-rmse:123.068076+0.235357  test-rmse:124.794110+0.821590 
## [171]    train-rmse:123.056482+0.241593  test-rmse:124.794679+0.819583 
## [172]    train-rmse:123.046887+0.243126  test-rmse:124.793047+0.816887 
## [173]    train-rmse:123.032726+0.241747  test-rmse:124.786713+0.813126 
## [174]    train-rmse:123.013573+0.236626  test-rmse:124.775322+0.816170 
## [175]    train-rmse:123.004489+0.239485  test-rmse:124.770921+0.819393 
## [176]    train-rmse:122.988657+0.243757  test-rmse:124.763905+0.823959 
## [177]    train-rmse:122.977400+0.247604  test-rmse:124.766617+0.818279 
## [178]    train-rmse:122.965173+0.249178  test-rmse:124.765701+0.818603 
## [179]    train-rmse:122.947548+0.247722  test-rmse:124.751733+0.817962 
## [180]    train-rmse:122.930499+0.249827  test-rmse:124.754097+0.817266 
## [181]    train-rmse:122.918335+0.249450  test-rmse:124.747942+0.818912 
## [182]    train-rmse:122.910393+0.250046  test-rmse:124.752158+0.815445 
## [183]    train-rmse:122.896339+0.254423  test-rmse:124.747087+0.809309 
## [184]    train-rmse:122.881923+0.253999  test-rmse:124.737135+0.805836 
## [185]    train-rmse:122.866689+0.253076  test-rmse:124.732588+0.806601 
## [186]    train-rmse:122.855983+0.253636  test-rmse:124.730281+0.808322 
## [187]    train-rmse:122.843066+0.254177  test-rmse:124.725697+0.812037 
## [188]    train-rmse:122.833920+0.257494  test-rmse:124.719104+0.806738 
## [189]    train-rmse:122.824617+0.256561  test-rmse:124.718858+0.802002 
## [190]    train-rmse:122.808149+0.263662  test-rmse:124.711884+0.795042 
## [191]    train-rmse:122.797507+0.261261  test-rmse:124.703532+0.798595 
## [192]    train-rmse:122.786642+0.264055  test-rmse:124.704935+0.801269 
## [193]    train-rmse:122.778674+0.262511  test-rmse:124.703882+0.804858 
## [194]    train-rmse:122.771313+0.261746  test-rmse:124.704663+0.800943 
## [195]    train-rmse:122.762102+0.262220  test-rmse:124.705229+0.799998 
## [196]    train-rmse:122.751213+0.261713  test-rmse:124.703248+0.803419 
## [197]    train-rmse:122.742424+0.257413  test-rmse:124.707344+0.809145 
## [198]    train-rmse:122.729915+0.252411  test-rmse:124.704372+0.812882 
## [199]    train-rmse:122.718350+0.249875  test-rmse:124.698671+0.822985 
## [200]    train-rmse:122.706691+0.251270  test-rmse:124.695094+0.814665 
## [201]    train-rmse:122.696220+0.252974  test-rmse:124.686610+0.814117 
## [202]    train-rmse:122.684596+0.252984  test-rmse:124.688141+0.814081 
## [203]    train-rmse:122.671252+0.250797  test-rmse:124.685690+0.815344 
## [204]    train-rmse:122.660486+0.249585  test-rmse:124.685730+0.815407 
## [205]    train-rmse:122.648993+0.250871  test-rmse:124.680603+0.817994 
## [206]    train-rmse:122.638028+0.251300  test-rmse:124.676453+0.818420 
## [207]    train-rmse:122.626321+0.248848  test-rmse:124.672256+0.822444 
## [208]    train-rmse:122.613440+0.250855  test-rmse:124.669411+0.823279 
## [209]    train-rmse:122.606397+0.250544  test-rmse:124.665654+0.820984 
## [210]    train-rmse:122.595605+0.251522  test-rmse:124.667932+0.818849 
## [211]    train-rmse:122.582585+0.251372  test-rmse:124.661792+0.816192 
## [212]    train-rmse:122.572668+0.250903  test-rmse:124.660811+0.814114 
## [213]    train-rmse:122.558652+0.251068  test-rmse:124.658711+0.814343 
## [214]    train-rmse:122.548652+0.250852  test-rmse:124.653522+0.820415 
## [215]    train-rmse:122.536250+0.252358  test-rmse:124.651098+0.824412 
## [216]    train-rmse:122.519780+0.253022  test-rmse:124.641399+0.824554 
## [217]    train-rmse:122.510254+0.251972  test-rmse:124.637253+0.824870 
## [218]    train-rmse:122.491554+0.253490  test-rmse:124.629941+0.823772 
## [219]    train-rmse:122.480890+0.250413  test-rmse:124.630271+0.823457 
## [220]    train-rmse:122.470212+0.249207  test-rmse:124.632410+0.824940 
## [221]    train-rmse:122.461147+0.249431  test-rmse:124.628890+0.825197 
## [222]    train-rmse:122.448788+0.252908  test-rmse:124.625536+0.822702 
## [223]    train-rmse:122.435811+0.256006  test-rmse:124.625096+0.825732 
## [224]    train-rmse:122.423877+0.258507  test-rmse:124.616591+0.824761 
## [225]    train-rmse:122.406162+0.258602  test-rmse:124.602289+0.826366 
## [226]    train-rmse:122.393289+0.256130  test-rmse:124.599245+0.827483 
## [227]    train-rmse:122.385548+0.255267  test-rmse:124.595605+0.824456 
## [228]    train-rmse:122.377731+0.254012  test-rmse:124.597319+0.827041 
## [229]    train-rmse:122.364464+0.255340  test-rmse:124.587914+0.825821 
## [230]    train-rmse:122.352924+0.251895  test-rmse:124.586563+0.830881 
## [231]    train-rmse:122.340653+0.251020  test-rmse:124.579807+0.839325 
## [232]    train-rmse:122.326439+0.251651  test-rmse:124.573164+0.840034 
## [233]    train-rmse:122.312049+0.252558  test-rmse:124.571487+0.846668 
## [234]    train-rmse:122.305028+0.252084  test-rmse:124.571297+0.844620 
## [235]    train-rmse:122.293327+0.251807  test-rmse:124.562296+0.843526 
## [236]    train-rmse:122.282838+0.251742  test-rmse:124.564496+0.845161 
## [237]    train-rmse:122.272408+0.250095  test-rmse:124.559239+0.849590 
## [238]    train-rmse:122.257593+0.253910  test-rmse:124.553262+0.846872 
## [239]    train-rmse:122.244575+0.252006  test-rmse:124.548497+0.843821 
## [240]    train-rmse:122.236809+0.253924  test-rmse:124.551383+0.840719 
## [241]    train-rmse:122.227168+0.250516  test-rmse:124.552479+0.847094 
## [242]    train-rmse:122.216953+0.252109  test-rmse:124.551027+0.845204 
## [243]    train-rmse:122.206653+0.251993  test-rmse:124.551749+0.846246 
## [244]    train-rmse:122.198625+0.250489  test-rmse:124.547787+0.853991 
## [245]    train-rmse:122.191705+0.249759  test-rmse:124.548543+0.854197 
## [246]    train-rmse:122.184835+0.251527  test-rmse:124.549274+0.855004 
## [247]    train-rmse:122.175757+0.250663  test-rmse:124.547740+0.853821 
## [248]    train-rmse:122.167264+0.249044  test-rmse:124.541863+0.851845 
## [249]    train-rmse:122.160173+0.249516  test-rmse:124.543366+0.849160 
## [250]    train-rmse:122.150351+0.249389  test-rmse:124.534178+0.847989 
## [251]    train-rmse:122.143443+0.246605  test-rmse:124.530742+0.847613 
## [252]    train-rmse:122.132130+0.248239  test-rmse:124.525398+0.847810 
## [253]    train-rmse:122.120404+0.252135  test-rmse:124.525087+0.844577 
## [254]    train-rmse:122.111832+0.251076  test-rmse:124.527005+0.846833 
## [255]    train-rmse:122.102901+0.252172  test-rmse:124.527394+0.844938 
## [256]    train-rmse:122.094453+0.250904  test-rmse:124.524959+0.848794 
## [257]    train-rmse:122.080596+0.252071  test-rmse:124.521792+0.857851 
## [258]    train-rmse:122.066800+0.253042  test-rmse:124.516440+0.857129 
## [259]    train-rmse:122.055365+0.254517  test-rmse:124.514583+0.853380 
## [260]    train-rmse:122.044949+0.255907  test-rmse:124.516748+0.854487 
## [261]    train-rmse:122.036412+0.257214  test-rmse:124.518443+0.850077 
## [262]    train-rmse:122.025475+0.257732  test-rmse:124.514955+0.845710 
## [263]    train-rmse:122.018311+0.258120  test-rmse:124.513653+0.847132 
## [264]    train-rmse:122.010173+0.255542  test-rmse:124.512569+0.850095 
## [265]    train-rmse:122.001848+0.251652  test-rmse:124.510217+0.851443 
## [266]    train-rmse:121.989041+0.250583  test-rmse:124.502580+0.856601 
## [267]    train-rmse:121.979224+0.254724  test-rmse:124.499126+0.855581 
## [268]    train-rmse:121.965256+0.250613  test-rmse:124.493661+0.860872 
## [269]    train-rmse:121.955314+0.250351  test-rmse:124.493501+0.857668 
## [270]    train-rmse:121.943658+0.249594  test-rmse:124.492421+0.860339 
## [271]    train-rmse:121.934070+0.246624  test-rmse:124.489186+0.861841 
## [272]    train-rmse:121.926930+0.245577  test-rmse:124.487033+0.860466 
## [273]    train-rmse:121.916984+0.244214  test-rmse:124.486890+0.864847 
## [274]    train-rmse:121.908665+0.244890  test-rmse:124.488899+0.866079 
## [275]    train-rmse:121.899449+0.246839  test-rmse:124.488402+0.862195 
## [276]    train-rmse:121.889856+0.246889  test-rmse:124.487773+0.870973 
## [277]    train-rmse:121.881825+0.248588  test-rmse:124.488286+0.869108 
## [278]    train-rmse:121.874648+0.249380  test-rmse:124.485077+0.871577 
## [279]    train-rmse:121.862349+0.249979  test-rmse:124.478630+0.876256 
## [280]    train-rmse:121.853583+0.245501  test-rmse:124.474550+0.879420 
## [281]    train-rmse:121.843787+0.245976  test-rmse:124.477582+0.877575 
## [282]    train-rmse:121.831903+0.245720  test-rmse:124.479935+0.882763 
## [283]    train-rmse:121.825481+0.248103  test-rmse:124.480264+0.876562 
## [284]    train-rmse:121.816826+0.249538  test-rmse:124.483537+0.876783 
## [285]    train-rmse:121.809236+0.250408  test-rmse:124.486954+0.872295 
## [286]    train-rmse:121.802962+0.251660  test-rmse:124.486531+0.870028 
## [287]    train-rmse:121.791768+0.249469  test-rmse:124.481787+0.869768 
## [288]    train-rmse:121.784113+0.250063  test-rmse:124.478834+0.865037 
## [289]    train-rmse:121.773612+0.251047  test-rmse:124.475958+0.866396 
## [290]    train-rmse:121.759785+0.247887  test-rmse:124.471977+0.865374 
## [291]    train-rmse:121.749579+0.247080  test-rmse:124.472113+0.866959 
## [292]    train-rmse:121.742268+0.242696  test-rmse:124.472479+0.868902 
## [293]    train-rmse:121.733507+0.239906  test-rmse:124.470796+0.873836 
## [294]    train-rmse:121.724316+0.239329  test-rmse:124.471655+0.875208 
## [295]    train-rmse:121.716275+0.239074  test-rmse:124.475843+0.874817 
## [296]    train-rmse:121.706418+0.235611  test-rmse:124.474968+0.874960 
## [297]    train-rmse:121.697996+0.235786  test-rmse:124.475877+0.878080 
## [298]    train-rmse:121.690814+0.237220  test-rmse:124.476650+0.878388 
## [299]    train-rmse:121.681207+0.236652  test-rmse:124.478523+0.876216 
## [300]    train-rmse:121.672929+0.235002  test-rmse:124.480212+0.877682 
## [301]    train-rmse:121.662836+0.234977  test-rmse:124.480942+0.882108 
## [302]    train-rmse:121.652294+0.235627  test-rmse:124.480798+0.883366 
## [303]    train-rmse:121.643468+0.234400  test-rmse:124.483426+0.884490 
## Stopping. Best iteration:
## [293]    train-rmse:121.733507+0.239906  test-rmse:124.470796+0.873836
y1_preds <- predict(gb_dt_final,dtest)
y1_preds %>% head
## [1] 356.4333 361.2431 286.2780 266.6343 268.5483 345.5236
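
The cross-validation log above stops at round 303 and reports round 293 as the best one, which is what you would expect from early stopping with a patience of 10 rounds. The exact `early_stopping_rounds` value is not shown in this section, so treat that number as an assumption. Below is a minimal base-R sketch of the stopping rule; `best_iteration` and `toy_rmse` are hypothetical names and the error values are made up for illustration.

```r
# Return the index of the lowest validation error, scanning until
# `patience` consecutive rounds pass without a new best (hypothetical helper).
best_iteration <- function(valid_error, patience = 10) {
  best <- 1
  for (i in seq_along(valid_error)) {
    if (valid_error[i] < valid_error[best]) {
      best <- i            # new best round found
    } else if (i - best >= patience) {
      break                # no improvement for `patience` rounds: stop
    }
  }
  best
}

# Illustrative error curve (made-up numbers, not taken from the log).
toy_rmse <- c(130, 127, 125.5, 125.1, 125.3, 125.2, 125.4, 125.6)
best_iteration(toy_rmse, patience = 3)
```

With this rule the scan ends `patience` rounds after the best round, which matches the 293/303 pattern in the log.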

Compute global

pred_ys=yo_preds+y1_preds
pred_final=data.frame(yo_preds,y1_preds,pred_ys)
pred_final %>% head
##   yo_preds y1_preds  pred_ys
## 1 135.6260 356.4333 492.0593
## 2 155.6035 361.2431 516.8466
## 3 161.2986 286.2780 447.5766
## 4 121.7719 266.6343 388.4062
## 5 152.7737 268.5483 421.3221
## 6 200.6128 345.5236 546.1363
pred_final$id<-x_test$emergency.vehicle.selection

Change columns order

pred_final <- pred_final[, c(4, 1, 2, 3)]

Retrieve order from original

pred_final %>% head
##        id yo_preds y1_preds  pred_ys
## 1 4715068 135.6260 356.4333 492.0593
## 2 4714816 155.6035 361.2431 516.8466
## 3 4713710 161.2986 286.2780 447.5766
## 4 4713748 121.7719 266.6343 388.4062
## 5 4713778 152.7737 268.5483 421.3221
## 6 4713812 200.6128 345.5236 546.1363
id_order %>% head
##   x_test.emergency.vehicle.selection rank rank
## 1                            5271704    1    1
## 2                            5092931    2    2
## 3                            5153756    3    3
## 4                            5355572    4    4
## 5                            5178915    5    5
## 6                            5206885    6    6
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
id_order %>% head
##   x_test.emergency.vehicle.selection rank rank rank
## 1                            5271704    1    1    1
## 2                            5092931    2    2    2
## 3                            5153756    3    3    3
## 4                            5355572    4    4    4
## 5                            5178915    5    5    5
## 6                            5206885    6    6    6
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
##        id yo_preds y1_preds  pred_ys  rank rank.1 rank.2
## 1 4713710 161.2986 286.2780 447.5766 14082  14082  14082
## 2 4713748 121.7719 266.6343 388.4062 25965  25965  25965
## 3 4713778 152.7737 268.5483 421.3221 14465  14465  14465
## 4 4713812 200.6128 345.5236 546.1363 79530  79530  79530
## 5 4713821 135.7495 531.8088 667.5584 47824  47824  47824
## 6 4713863 152.5117 234.3796 386.8913  2072   2072   2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
##            id yo_preds y1_preds  pred_ys rank rank.1 rank.2
## 77897 5271704 116.7321 407.9039 524.6360    1      1      1
## 59352 5092931 134.7929 401.6206 536.4135    2      2      2
## 68483 5153756 108.1999 238.5139 346.7138    3      3      3
## 91242 5355572 220.5143 296.4528 516.9671    4      4      4
## 72358 5178915 126.3587 364.0570 490.4157    5      5      5
## 77062 5206885 125.2095 376.1835 501.3930    6      6      6
y_final %>% setDT
y_final=y_final[,-c("rank")]
y_final %>% head
##         id yo_preds y1_preds  pred_ys rank.1 rank.2
## 1: 5271704 116.7321 407.9039 524.6360      1      1
## 2: 5092931 134.7929 401.6206 536.4135      2      2
## 3: 5153756 108.1999 238.5139 346.7138      3      3
## 4: 5355572 220.5143 296.4528 516.9671      4      4
## 5: 5178915 126.3587 364.0570 490.4157      5      5
## 6: 5206885 125.2095 376.1835 501.3930      6      6
sum(is.na(y_final))
## [1] 0
which(is.na(y_final))
## integer(0)
y_final[is.na(y_final)] <- 0
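
The output above shows that every re-run of `cbind(id_order, rank = 1:len)` appends one more `rank` column, which then has to be dropped by hand. As a possible alternative, here is a minimal sketch that restores the original row order with `match()` and needs no helper column. `pred_toy` and `original_ids` are hypothetical stand-ins for `pred_final` and the original `x_test` id order, and the values are illustrative.

```r
# Toy stand-ins for pred_final and the original x_test id order
# (values are illustrative, not the real predictions).
pred_toy <- data.frame(
  id      = c(30, 10, 20),
  pred_ys = c(3.3, 1.1, 2.2)
)
original_ids <- c(10, 20, 30)

# match() gives, for each original id, its row position in pred_toy.
row_idx <- match(original_ids, pred_toy$id)

# Fail loudly if an id has no prediction instead of silently dropping it.
stopifnot(!anyNA(row_idx))

restored <- pred_toy[row_idx, ]
rownames(restored) <- NULL
restored
```

Because the reordering is a pure index lookup, running it several times gives the same result and never adds columns.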

Write the CSV file. A separate Python script is then used to generate the final, correctly formatted CSV.

fwrite(y_final, "y_test.csv",sep=",")
[Figure: R2]

Drop in R2

10. Neural Net

Y0

library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
## 
##     hour, month, week, year
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 21 minutes 
##     H2O cluster timezone:       Europe/Paris 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    1 month and 5 days  
##     H2O cluster name:           H2O_started_from_R_swp_kqp692 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.55 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.3 (2020-10-10)
h2o.train <- as.h2o(data.fe.reduced.0)
h2o.test <- as.h2o(x_test.reduced)
# Note: `rate` is ignored while adaptive_rate is enabled (the H2O default), hence the warning below.
h2o.model <- h2o.deeplearning(
  x = setdiff(names(data.fe.reduced.0), "delta.selection.departure"),
  y = "delta.selection.departure",
  training_frame = h2o.train,
  standardize = TRUE,
  hidden = c(100, 100, 100),
  rate = 0.01,
  epochs = 1000,
  seed = 1234
)
## Warning in .h2o.processResponseWarnings(res): rate cannot be specified if adaptive_rate is enabled..
h2o.prediction.y0 <- as.data.frame(h2o.predict(h2o.model, h2o.test))
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle' has levels not trained on: ["1862", "1933",
## "1935", "1976", "2000", "2046", "2061", "2162", "2163", "2214", "2299", "2490",
## "2497", "2510", "2516", "2530", "2538", "2553", "2559", "2564", "2603", "2606",
## "2610", "2617", "2620", "2667", "2702", "2706", "2707", "2713", "2714", "2725",
## "2827", "2858", "2937", "3013", "3023", "3035", "3061", "3062", "3063", "3072",
## "3116", "3120", "3124", "3297", "3391", "3392", "3407", "3417", "3425", "3427",
## "3548", "3552", "4217", "4352", "4358", "4385", "4431", "4440", "4459", "4472",
## "4487", "4490", "4500", "4531", "4561", "4562", "4864", "4881", "4885", "4925",
## "5261", "5262", "5271", "5278", "5655", "5656", "5668", "5769", "5777", "5782",
## "5784", "5786", "5790", "5818", "5824", "5826", "5832", "5835", "5838", "5841",
## "5844", "5847", "5855", "5856", "5868", "5872", "5873", "5874", "5878", "5922",
## "5926", "5939", "5946", "5950", "5961", "5983", "5993", "5994", "6034", "6066"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'location.of.the.event' has levels not trained on: ["152", "253",
## "302", "314"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'rescue.center' has levels not trained on: ["2453", "266267",
## "266268", "266269", "266270", "266276", "266278", "266279", "266295", "266298",
## "266320", "266322", "266326"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle.type' has levels not trained on: ["CESD",
## "CSP", "ESAVI", "SP BALLON", "UMH", "UMH 75", "UMH 92", "UMH 93", "UMH 94", "UMH
## BEAUJ", "UMH BOBI", "UMH DEBREPED", "UMH DIEU", "UMH GARC", "UMH LARIB", "UMH
## MONDOR", "UMH NECK", "UMH PITIE", "VE2I", "VEC", "VELD", "VES", "VIGI", "VIMP",
## "VPB", "VPC GFIS", "VPC GIS", "VRCH BSPP", "VRCP", "VRSD", "VSTI"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'n' has levels not trained on: ["13", "9"]
h2o.prediction.y0 %>% head
##    predict
## 1 263.7359
## 2 112.7732
## 3 130.5915
## 4 166.9196
## 5 131.5314
## 6 165.8459
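
The warnings above mean that the test set contains categories the model never saw during training. Those rows are still scored, but without any learned information for the unseen categories. A quick check before training can list them. This is a minimal base-R sketch: `unseen_levels`, `train_toy` and `test_toy` are hypothetical names, and the toy values are illustrative. It is meant for categorical columns only; on a numeric column it would simply list unseen numeric values.

```r
# Toy train/test columns (illustrative values only).
train_toy <- data.frame(vehicle_type = c("VSAV", "PSE", "VSAV"),
                        stringsAsFactors = FALSE)
test_toy  <- data.frame(vehicle_type = c("VSAV", "UMH", "CSP"),
                        stringsAsFactors = FALSE)

# For every shared column, list the values present in test but not in train.
unseen_levels <- function(train, test) {
  cols <- intersect(names(train), names(test))
  out <- lapply(cols, function(col) {
    setdiff(unique(as.character(test[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- cols
  out[lengths(out) > 0]   # keep only columns that have unseen values
}

unseen_levels(train_toy, test_toy)
```

Depending on how many rows are affected, the unseen categories could be grouped into an "other" level or replaced by a coarser feature.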

Y1

h2o.train.1 <- as.h2o(data.fe.reduced.1)
# Note: `rate` is ignored while adaptive_rate is enabled (the H2O default), hence the warning below.
h2o.model <- h2o.deeplearning(
  x = setdiff(names(data.fe.reduced.1), "delta.departure.presentation"),
  y = "delta.departure.presentation",
  training_frame = h2o.train.1,
  standardize = TRUE,
  hidden = c(100, 100, 100),
  rate = 0.01,
  epochs = 1000,
  seed = 1234
)
## Warning in .h2o.processResponseWarnings(res): rate cannot be specified if adaptive_rate is enabled..
h2o.prediction.y1 <- as.data.frame(h2o.predict(h2o.model, h2o.test))
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle' has levels not trained on: ["1862", "1933",
## "1935", "1976", "2000", "2046", "2061", "2162", "2163", "2214", "2299", "2490",
## "2497", "2510", "2516", "2530", "2538", "2553", "2559", "2564", "2603", "2606",
## "2610", "2617", "2620", "2667", "2702", "2706", "2707", "2713", "2714", "2725",
## "2827", "2858", "2937", "3013", "3023", "3035", "3061", "3062", "3063", "3072",
## "3116", "3120", "3124", "3297", "3391", "3392", "3407", "3417", "3425", "3427",
## "3548", "3552", "4217", "4352", "4358", "4385", "4431", "4440", "4459", "4472",
## "4487", "4490", "4500", "4531", "4561", "4562", "4864", "4881", "4885", "4925",
## "5261", "5262", "5271", "5278", "5655", "5656", "5668", "5769", "5777", "5782",
## "5784", "5786", "5790", "5818", "5824", "5826", "5832", "5835", "5838", "5841",
## "5844", "5847", "5855", "5856", "5868", "5872", "5873", "5874", "5878", "5922",
## "5926", "5939", "5946", "5950", "5961", "5983", "5993", "5994", "6034", "6066"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'location.of.the.event' has levels not trained on: ["152", "253",
## "302", "314"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'rescue.center' has levels not trained on: ["2453", "266267",
## "266268", "266269", "266270", "266276", "266278", "266279", "266295", "266298",
## "266320", "266322", "266326"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle.type' has levels not trained on: ["CESD",
## "CSP", "ESAVI", "SP BALLON", "UMH", "UMH 75", "UMH 92", "UMH 93", "UMH 94", "UMH
## BEAUJ", "UMH BOBI", "UMH DEBREPED", "UMH DIEU", "UMH GARC", "UMH LARIB", "UMH
## MONDOR", "UMH NECK", "UMH PITIE", "VE2I", "VEC", "VELD", "VES", "VIGI", "VIMP",
## "VPB", "VPC GFIS", "VPC GIS", "VRCH BSPP", "VRCP", "VRSD", "VSTI"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'n' has levels not trained on: ["13", "9"]

Compute global

h20_ys=h2o.prediction.y0+h2o.prediction.y1
pred_final=data.frame(h2o.prediction.y0,h2o.prediction.y1,h20_ys)
pred_final %>% head
##    predict predict.1 predict.2
## 1 263.7359  346.5793  610.3153
## 2 112.7732  413.5207  526.2940
## 3 130.5915  246.1286  376.7201
## 4 166.9196  283.9014  450.8210
## 5 131.5314  348.4830  480.0144
## 6 165.8459  285.0595  450.9053
pred_final$id<-x_test$emergency.vehicle.selection

Change columns order

pred_final <- pred_final[, c(4, 1, 2, 3)]

Retrieve order from original

pred_final %>% head
##        id  predict predict.1 predict.2
## 1 4715068 263.7359  346.5793  610.3153
## 2 4714816 112.7732  413.5207  526.2940
## 3 4713710 130.5915  246.1286  376.7201
## 4 4713748 166.9196  283.9014  450.8210
## 5 4713778 131.5314  348.4830  480.0144
## 6 4713812 165.8459  285.0595  450.9053
id_order %>% head
##   x_test.emergency.vehicle.selection rank rank rank
## 1                            5271704    1    1    1
## 2                            5092931    2    2    2
## 3                            5153756    3    3    3
## 4                            5355572    4    4    4
## 5                            5178915    5    5    5
## 6                            5206885    6    6    6
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
id_order %>% head
##   x_test.emergency.vehicle.selection rank rank rank rank
## 1                            5271704    1    1    1    1
## 2                            5092931    2    2    2    2
## 3                            5153756    3    3    3    3
## 4                            5355572    4    4    4    4
## 5                            5178915    5    5    5    5
## 6                            5206885    6    6    6    6
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
##        id  predict predict.1 predict.2  rank rank.1 rank.2 rank.3
## 1 4713710 130.5915  246.1286  376.7201 14082  14082  14082  14082
## 2 4713748 166.9196  283.9014  450.8210 25965  25965  25965  25965
## 3 4713778 131.5314  348.4830  480.0144 14465  14465  14465  14465
## 4 4713812 165.8459  285.0595  450.9053 79530  79530  79530  79530
## 5 4713821 151.7333  572.5734  724.3068 47824  47824  47824  47824
## 6 4713863 124.6063  148.1335  272.7398  2072   2072   2072   2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
##            id  predict predict.1 predict.2 rank rank.1 rank.2 rank.3
## 77897 5271704 112.6084  411.4832  524.0916    1      1      1      1
## 59352 5092931 120.0579  409.7404  529.7983    2      2      2      2
## 68483 5153756 122.7251  238.9925  361.7176    3      3      3      3
## 91242 5355572 247.5476  221.9852  469.5328    4      4      4      4
## 72358 5178915 140.4392  264.6780  405.1172    5      5      5      5
## 77062 5206885 130.5567  258.7095  389.2663    6      6      6      6
y_final %>% setDT
y_final=y_final[,-c("rank.1")]
y_final %>% head
##         id  predict predict.1 predict.2 rank rank.2 rank.3
## 1: 5271704 112.6084  411.4832  524.0916    1      1      1
## 2: 5092931 120.0579  409.7404  529.7983    2      2      2
## 3: 5153756 122.7251  238.9925  361.7176    3      3      3
## 4: 5355572 247.5476  221.9852  469.5328    4      4      4
## 5: 5178915 140.4392  264.6780  405.1172    5      5      5
## 6: 5206885 130.5567  258.7095  389.2663    6      6      6
sum(is.na(y_final))
## [1] 0
which(is.na(y_final))
## integer(0)
y_final[is.na(y_final)] <- 0

Write the CSV file. A separate Python script is then used to generate the final, correctly formatted CSV.

fwrite(y_final, "y_test.csv",sep=",")
[Figure: R2]

11. To Follow …

What could be done to increase our R2?

  • Increase feature building by adding the supplementary files from the data:
    • Coordinates (a lot to do): direction, km from Paris center (traffic issue)
    • Mean speed per vehicle type
    • Mean speed per rescue center
    • Holidays or not (traffic issue)
    • GPS data: we removed it because of NAs, but it may be important
    • Vehicle type

  • Add a proper cross-validation and model selection step.

  • Once the best model is identified, move on to hyperparameter tuning.
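
As an illustration of the "km from Paris center" idea in the list above, here is a minimal base-R sketch of a great-circle (haversine) distance feature. The center coordinates (48.8566, 2.3522) are an approximate, assumed reference point, `haversine_km` is a hypothetical helper, and the toy rows are not from the dataset.

```r
# Great-circle distance in km between two (lat, lon) points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Assumed reference point for "Paris center" (approximate coordinates).
paris_lat <- 48.8566
paris_lon <- 2.3522

# Toy intervention coordinates (illustrative, not from the dataset).
toy <- data.frame(lat = c(48.8566, 48.9000), lon = c(2.3522, 2.3522))
toy$km_from_center <- haversine_km(toy$lat, toy$lon, paris_lat, paris_lon)
toy
```

The function is vectorised, so it can be applied to whole latitude and longitude columns at once.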


Acknowledgements and references

The following is a list of helpful contributors and resources.

  • Professor Nadine Galy - Econometrics and Statistical Models [U1 - MSc AI and BA - TBS] - 2020
  • Github https://github.com/quachn/X_PFB
  • Medium https://medium.com/crim/predicting-the-response-times-of-firefighters-using-data-science-da79f6965f93
  • Kaggle https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious
  • Stack Overflow community

Thank you for reading!