This group project responds to the instructions of Professor Nadine Galy, reproduced below:
The project involves identifying a real-world business problem or opportunity and designing and implementing an analysis plan to address it using at least one of the modelling methods studied in the course. You are free to choose any business problem or opportunity or public policy issue that you consider challenging and useful to address using business analytics. The data that you use should be readily available and verifiable.
This is a notebook for the Paris Fire Brigade data challenge 2020 with ENS and College de France.
The goal of this playground challenge is to predict the response times of the Paris Fire Brigade vehicles, that is, the delay between:
* the selection of a rescue vehicle (the time when a rescue team is warned)
* and the arrival of the rescue team at the scene of the request (information sent manually via portable radio)
This measurement is composed of the following two periods of time:
* the activation period of the rescue team
* the transit time of the rescue team
The prediction is based on features such as trip coordinates, selection date, type of destination, vehicles, etc.
The data cover the entire year 2018, with inoperable records filtered out, and come as 219,337 training observations and 108,033 test observations. Each row contains one Paris Fire Brigade intervention.
“Response time is one of the most important factors for firefighters because their ability to save lives and rescue people depends on it. Every fire department in the world seeks strategies to decrease their response time, and several analyses have been conducted in the past years to determine what could impact response time. In the meantime, fire departments have been collecting data on their interventions; yet, few of them actually use data science to develop a data-driven decision making approach.” (https://medium.com/crim/predicting-the-response-times-of-firefighters-using-data-science-da79f6965f93)
“A lot of fire departments and emergency services rely on geographic information systems tools, such as ESRI ARCGis or Network Analyst, to obtain estimations about the response time. These tools rely on computing the shortest route using a graphical representation of the road network, which usually gives an accurate estimate of the travel time. Their drawback is that they cannot always take into consideration external dynamic factors such as the weather, traffic or type of units or intervention. Hence, there is an opportunity for machine learning tools to be used here.”
In this notebook, we first study and visualize the original data, engineer new features, and examine potential outliers. Then we implement a boosted tree as our first model, reduce the dimension of the qualitative features, and implement a linear regression. Finally, we create a final prediction and upload it to the data platform.
We hope that this notebook achieves good results in the challenge and fully meets Nadine Galy’s requirements. As always, any feedback, questions, or constructive criticism are much appreciated.
Input parameters (x_train.csv and x_test.csv):
emergency vehicle selection: identifier of the selection instance of an emergency vehicle for an intervention
intervention: identifier of the intervention
alert reason category (category): alert reason category
alert reason (category): alert reason
intervention on public roads (boolean): 1 when it concerns an intervention on public roads, 0 otherwise
floor (int): floor of the intervention
location of the event (category): qualifies the location of the emergency request, for example: entrance hall, boiler room, motorway, etc.
longitude intervention (float): approximate longitude of the intervention address. ATTENTION: intervention_longitude !
latitude intervention (float): approximate latitude of the intervention address. ATTENTION: intervention_latitude !
emergency vehicle: identifier of the emergency vehicle
emergency vehicle type (category): type of the emergency vehicle
rescue center (category): identifier of the rescue center to which the vehicle belongs (parking spot of the emergency vehicle)
selection time (datetime): selection time of the emergency vehicle
date key selection (int): selection date in YYYYMMDD format
time key selection (int): selection time in HHMMSS format
status preceding selection (category): status of the emergency vehicle prior to selection. An emergency vehicle is in various statuses during an intervention.
delta status preceding selection-selection (int): number of seconds before the vehicle was selected when its previous status was entered
departed from its rescue center (boolean): 1 when the vehicle departed from its rescue center (emergency vehicle parking spot), 0 otherwise
longitude before departure (float): longitude of the position of the vehicle preceding its departure. ATTENTION: departure_longitude !
latitude before departure (float): latitude of the position of the vehicle preceding its departure. ATTENTION: departure_latitude !
delta position gps previous departure-departure (int): number of seconds before the selection of the vehicle where its GPS position was recorded (when not parked at its emergency center)
GPS tracks departure-presentation (float pair list): successive GPS positions (longitude,latitude;longitude,latitude, etc.) of the vehicle between departure and presentation. This information is for informational purposes to study vehicle behaviors. (The beacons, emitting the GPS positions of vehicles, are currently not always lit)
GPS tracks departure-presentation datetime (datetime list): datetime associated with successive GPS positions between the departure and the presentation of the vehicle.
OSRM estimated route (json object): service route response of an OSRM instance (http://project-osrm.org/docs/v5.15.2/api/#route-service) setup with the Ile-de-France OpenStreetMap data
OSRM estimated distance (float): distance calculated by the OSRM route service
OSRM estimated duration (float): transit delay calculated by the OSRM route service
Output parameters (y_train.csv and y_test.csv):
emergency vehicle selection: identifier of the selection instance of an emergency vehicle for an intervention
delta selection-departure (int): elapsed time in seconds between the selection and the departure of the emergency vehicle
delta departure-presentation (int): elapsed time in seconds between the departure of the emergency vehicle and its presentation on the intervention scene
delta selection-presentation (int): elapsed time in seconds between the selection of the emergency vehicle and its presentation on the intervention scene (delta selection-departure + delta departure-presentation)
Supplementary files (x_train_additional_file.csv and x_test_additional_file.csv):
emergency vehicle selection: identifier of the selection instance of an emergency vehicle for an intervention
OSRM estimate from last observed GPS position (json object): service route response from last observed GPS position of an OSRM instance (http://project-osrm.org/docs/v5.15.2/api/#route-service) setup with the Ile-de-France OpenStreetMap data
OSRM estimated distance from last observed GPS position (float): distance (in meters) calculated by the OSRM route service from last observed GPS position
OSRM estimated duration from last observed GPS position (float): transit delay (in seconds) calculated by the OSRM route service from last observed GPS position
time elapsed between selection and last observed GPS position (float): in seconds
updated OSRM estimated duration (float): time elapsed (in seconds) between selection and last observed GPS position + OSRM estimated duration from last observed GPS position
Enjoy the read!
library(magrittr)
library(data.table)
library(sandwich)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(hms)
library(imputeTS)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(leaflet)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(FactoMineR)
library(corrplot)
## corrplot 0.84 loaded
library(Matrix)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(DataExplorer)
x_train <- read.csv("x_train.csv") %>% setDT
y_train <- read.csv("y_train.csv") %>% setDT
x_test <- read.csv("x_test.csv") %>% setDT
#View(x_train)
data <- cbind(x_train,y_train[,-1]) # drop the ID column of y_train so that it is not duplicated
# Rename the two columns whose names contain a special character
c<- colnames(data)
c[14] <- "date.key.selection"
c[15] <- "time.key.selection"
colnames(data) <- c
#Same for x_test
c<- colnames(x_test)
c[14] <- "date.key.selection"
c[15] <- "time.key.selection"
colnames(x_test) <- c
Store the ID order of x_test
id_order<-data.frame(x_test$emergency.vehicle.selection)
Let’s get an overview of the data sets using the plot_intro and glimpse tools. First the training data:
plot_intro(data)
glimpse(data)
## Rows: 219,337
## Columns: 29
## $ emergency.vehicle.selection <int> 5105452, 4720915, 5...
## $ intervention <int> 13264186, 12663715,...
## $ alert.reason.category <int> 3, 3, 3, 3, 3, 3, 9...
## $ alert.reason <int> 2162, 2124, 2163, 2...
## $ intervention.on.public.roads <int> 0, 0, 0, 0, 0, 0, 0...
## $ floor <int> 0, 1, 2, 0, 3, 1, 4...
## $ location.of.the.event <dbl> 148, 136, 139, 136,...
## $ longitude.intervention <dbl> 2.284796, 2.247464,...
## $ latitude.intervention <dbl> 48.87967, 48.81819,...
## $ emergency.vehicle <int> 4511, 4327, 4509, 5...
## $ emergency.vehicle.type <chr> "VSAV BSPP", "PSE",...
## $ rescue.center <int> 2447, 2464, 2438, 2...
## $ selection.time <chr> "2018-07-08 19:02:4...
## $ date.key.selection <int> 20180708, 20180104,...
## $ time.key.selection <int> 190243, 90259, 1011...
## $ status.preceding.selection <chr> "Rentré", "Rentré...
## $ delta.status.preceding.selection.selection <int> 2027, 28233, 1981, ...
## $ departed.from.its.rescue.center <int> 1, 1, 0, 1, 1, 1, 1...
## $ longitude.before.departure <dbl> 2.288053, 2.268519,...
## $ latitude.before.departure <dbl> 48.88470, 48.82396,...
## $ delta.position.gps.previous.departure.departure <dbl> NA, NA, 33, NA, NA,...
## $ GPS.tracks.departure.presentation <chr> "2.289000,48.885113...
## $ GPS.tracks.datetime.departure.presentation <chr> "2018-07-08 19:04:4...
## $ OSRM.response <chr> "{\"code\":\"Ok\",\...
## $ OSRM.estimated.distance <dbl> 952.5, 2238.5, 3026...
## $ OSRM.estimated.duration <dbl> 105.8, 243.2, 295.4...
## $ delta.selection.departure <int> 86, 164, 125, 168, ...
## $ delta.departure.presentation <int> 324, 297, 365, 160,...
## $ delta.selection.presentation <int> 410, 461, 490, 328,...
plot_intro(x_test)
glimpse(x_test)
## Rows: 108,033
## Columns: 26
## $ emergency.vehicle.selection <int> 5271704, 5092931, 5...
## $ intervention <int> 13535032, 13244794,...
## $ alert.reason.category <int> 3, 3, 3, 3, 3, 3, 1...
## $ alert.reason <int> 2113, 2113, 2112, 2...
## $ intervention.on.public.roads <int> 0, 0, 1, 0, 1, 0, 0...
## $ floor <int> 2, 0, 0, 0, 0, 15, ...
## $ location.of.the.event <dbl> 136, 228, 148, 201,...
## $ longitude.intervention <dbl> 2.464084, 2.325948,...
## $ latitude.intervention <dbl> 48.81844, 48.92520,...
## $ emergency.vehicle <int> 5755, 3100, 3538, 6...
## $ emergency.vehicle.type <chr> "VSAV BSPP", "VSAV ...
## $ rescue.center <int> 2483, 2462, 2482, 2...
## $ selection.time <chr> "2018-10-02 12:41:2...
## $ date.key.selection <int> 20181002, 20180703,...
## $ time.key.selection <int> 124122, 131447, 134...
## $ status.preceding.selection <chr> "Rentré", "Rentré...
## $ delta.status.preceding.selection.selection <int> 953, 1906, 654, 108...
## $ departed.from.its.rescue.center <int> 1, 1, 1, 1, 1, 1, 1...
## $ longitude.before.departure <dbl> 2.481148, 2.301399,...
## $ latitude.before.departure <dbl> 48.84103, 48.92930,...
## $ delta.position.gps.previous.departure.departure <dbl> NA, NA, NA, NA, NA,...
## $ GPS.tracks.departure.presentation <chr> "", "2.309139,48.92...
## $ GPS.tracks.datetime.departure.presentation <chr> "", "2018-07-03 13:...
## $ OSRM.response <chr> "{\"code\":\"Ok\",\...
## $ OSRM.estimated.distance <dbl> 3266.8, 2710.3, 914...
## $ OSRM.estimated.duration <dbl> 336.5, 218.4, 85.1,...
We find:
- We have a good mix of qualitative and quantitative data.
- Some qualitative variables are character strings, such as status.preceding.selection; others are dummy variables coded 0/1, such as intervention.on.public.roads.
- We have NA values.
- We have ID variables that we can remove.
- We will have to deal with outliers.
# visualize missing data
introduce(data)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1: 219337 29 6 23 0
## total_missing_values complete_rows total_observations memory_usage
## 1: 227144 4553 6360773 167533280
plot_missing(data)
introduce(x_train)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1: 219337 26 6 20 0
## total_missing_values complete_rows total_observations memory_usage
## 1: 227144 4553 5702762 164900528
plot_missing(x_train)
We remove the useless columns, then the rows with missing values
data <- data[,-24] # delete the OSRM response column (JSON object)
data <- data[,-21] # delete the delta position GPS column
Same with the x_test data
x_test<-x_test[,-24]
x_test<-x_test[,-21]
Let’s check how many NA values remain
sum(is.na(data))
## [1] 12710
sum(is.na(x_test))
## [1] 6331
This is a small proportion (12,710 of 219,337 training rows, about 5.8%), so it is acceptable to omit those rows.
data <- na.omit(data)
which(is.na(data))
## integer(0)
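Before dropping rows, it can help to measure how many rows are affected and which columns carry the missing values. The following is a minimal sketch in base R on an invented toy data frame (`toy` and its values are assumptions for illustration, not the challenge data):

```r
# Toy data frame standing in for `data`; the values are invented for illustration
toy <- data.frame(
  id       = 1:6,
  distance = c(952.5, NA, 3026.0, 1934.0, NA, 1500.0),
  duration = c(105.8, 243.2, NA, 167.0, 263.0, 120.0)
)

# Number and share of rows with at least one missing value
rows_with_na  <- sum(!complete.cases(toy))
share_with_na <- rows_with_na / nrow(toy)

# Missing values per column, to see which feature drives the loss
na_per_column <- colSums(is.na(toy))

# Drop the incomplete rows once the share is judged acceptable
toy_clean <- na.omit(toy)
```

Looking at `na_per_column` first makes it easier to decide whether to drop a column (as done above for the GPS delta) or to drop rows.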
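(We apply the same cleaning to x_train and y_train as a precaution. Note that x_train still contains the mostly empty delta position GPS column, so na.omit() keeps only its 4,553 complete rows, and x_train no longer lines up row by row with y_train.)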
y_train <- na.omit(y_train)
x_train<-na.omit(x_train)
Convert the qualitative variables into factors
data$alert.reason.category<- as.factor(data$alert.reason.category)
data$alert.reason<-as.factor(data$alert.reason)
data$location.of.the.event <- as.factor(data$location.of.the.event)
data$intervention.on.public.roads <- as.factor(data$intervention.on.public.roads)
data$emergency.vehicle.type <- as.factor(data$emergency.vehicle.type)
data$rescue.center <- as.factor(data$rescue.center)
data$status.preceding.selection <- as.factor(data$status.preceding.selection)
data$departed.from.its.rescue.center <- as.factor(data$departed.from.its.rescue.center)
data$floor<-as.factor(data$floor)
data$emergency.vehicle<-as.factor(data$emergency.vehicle)
Apply the same to x_test
x_test$alert.reason.category<- as.factor(x_test$alert.reason.category)
x_test$alert.reason<-as.factor(x_test$alert.reason)
x_test$location.of.the.event <- as.factor(x_test$location.of.the.event)
x_test$intervention.on.public.roads <- as.factor(x_test$intervention.on.public.roads )
x_test$emergency.vehicle.type <- as.factor(x_test$emergency.vehicle.type)
x_test$rescue.center <- as.factor(x_test$rescue.center)
x_test$status.preceding.selection <- as.factor(x_test$status.preceding.selection)
x_test$departed.from.its.rescue.center <- as.factor(x_test$departed.from.its.rescue.center)
x_test$floor<-as.factor(x_test$floor)
x_test$emergency.vehicle<-as.factor(x_test$emergency.vehicle)
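Converting the training and test columns separately can give the two tables different factor level sets (for example, a floor that only appears in the test data), which may cause problems at prediction time. One possible way to avoid this, sketched below in base R, is to build each factor from the union of the values seen in both tables. The helper name `to_shared_factors` and the toy tables are assumptions for illustration only.

```r
# Hypothetical helper: convert the given columns to factors that share
# the same level set in both tables
to_shared_factors <- function(train, test, cols) {
  for (col in cols) {
    # Union of the values seen in either table
    lv <- sort(unique(c(as.character(train[[col]]), as.character(test[[col]]))))
    train[[col]] <- factor(as.character(train[[col]]), levels = lv)
    test[[col]]  <- factor(as.character(test[[col]]),  levels = lv)
  }
  list(train = train, test = test)
}

# Toy tables; the values are illustrative only
toy_train <- data.frame(floor = c(0, 1, 2), rescue.center = c(2447, 2464, 2438))
toy_test  <- data.frame(floor = c(0, 15),   rescue.center = c(2483, 2447))

out <- to_shared_factors(toy_train, toy_test, c("floor", "rescue.center"))
```

The same loop also replaces the ten repeated `as.factor()` lines per table with a single call.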
Manage selection time
selection time (datetime): selection time of the emergency vehicle
date key selection (int): selection date in YYYYMMDD format
time key selection (int): selection time in HHMMSS format
We can delete selection time. We keep the numeric format for date key selection and time key selection for now (more manageable for correlation analysis and regression). For the EDA part, we could convert them into date and time formats.
data<-data[,-c("selection.time")]
x_test<-x_test[,-c("selection.time")]
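As a sketch of the conversion mentioned above, the integer keys can be turned into a `Date` and into seconds since midnight with base R only. The three sample values are taken from the outputs shown earlier in this notebook; the variable names are assumptions for illustration.

```r
# Sample keys copied from the notebook output (YYYYMMDD and HHMMSS integers)
date_key <- c(20180708L, 20180104L, 20180115L)
time_key <- c(190243L, 90259L, 3846L)   # leading zeros are lost in the integer form

# YYYYMMDD integer -> Date
sel_date <- as.Date(as.character(date_key), format = "%Y%m%d")

# HHMMSS integer -> hours, minutes, seconds via integer arithmetic,
# which is robust to the missing leading zeros
hours   <- time_key %/% 10000L
minutes <- (time_key %/% 100L) %% 100L
seconds <- time_key %% 100L
secs_since_midnight <- hours * 3600L + minutes * 60L + seconds

# Calendar features that could be useful for the EDA
sel_month <- as.integer(format(sel_date, "%m"))
```

Seconds since midnight is a continuous quantity, whereas the raw HHMMSS integer jumps (for example from 95959 to 100000), which can distort correlations and regressions.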
Store the qualitative variables, the quantitative variables, and the quantitative variables together with the Ys
str(data)
## Classes 'data.table' and 'data.frame': 206627 obs. of 26 variables:
## $ emergency.vehicle.selection : int 5105452 4720915 5365374 4741586 5381209 4731603 5196431 4774057 5277444 5277017 ...
## $ intervention : int 13264186 12663715 13675521 12695745 13698743 12680636 13415648 12744692 13544161 13543445 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 3 3 3 3 3 3 9 3 3 3 ...
## $ alert.reason : Factor w/ 122 levels "1911","1912",..: 60 41 61 60 60 31 101 60 32 31 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ floor : Factor w/ 45 levels "-10","-6","-5",..: 8 9 10 8 11 9 12 8 8 8 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 48 36 39 36 5 56 36 36 48 94 ...
## $ longitude.intervention : num 2.28 2.25 2.26 2.39 2.46 ...
## $ latitude.intervention : num 48.9 48.8 48.8 48.8 48.9 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 398 334 396 517 473 330 293 575 398 606 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 37 24 37 37 37 37 24 37 37 37 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 15 30 6 72 42 55 42 16 15 50 ...
## $ date.key.selection : int 20180708 20180104 20181116 20180115 20181124 20180109 20180824 20180130 20181005 20181004 ...
## $ time.key.selection : int 190243 90259 101147 3846 3426 222327 81800 74243 82648 5918 ...
## $ status.preceding.selection : Factor w/ 2 levels "Disponible","Rentré": 2 2 1 2 2 2 2 2 2 2 ...
## $ delta.status.preceding.selection.selection: int 2027 28233 1981 1842 2716 5592 37282 5661 31361 304 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
## $ longitude.before.departure : num 2.29 2.27 2.27 2.39 2.44 ...
## $ latitude.before.departure : num 48.9 48.8 48.9 48.8 48.9 ...
## $ GPS.tracks.departure.presentation : chr "2.289000,48.885113;2.288861,48.884998;2.288000,48.883335;2.284444,48.878582;2.286250,48.880196" "" "2.272972,48.850498;2.269056,48.847443;2.262611,48.839554;2.260306,48.836887;2.257917,48.836250" "2.394278,48.782112;2.393639,48.776833" ...
## $ GPS.tracks.datetime.departure.presentation: chr "2018-07-08 19:04:43;2018-07-08 19:05:55;2018-07-08 19:07:07;2018-07-08 19:08:19;2018-07-08 19:09:18" "" "2018-11-16 10:14:31;2018-11-16 10:15:43;2018-11-16 10:16:57;2018-11-16 10:18:07;2018-11-16 10:19:19" "2018-01-15 00:42:46;2018-01-15 00:43:58" ...
## $ OSRM.estimated.distance : num 952 2238 3026 1934 2707 ...
## $ OSRM.estimated.duration : num 106 243 295 167 263 ...
## $ delta.selection.departure : int 86 164 125 168 138 124 104 118 103 129 ...
## $ delta.departure.presentation : int 324 297 365 160 523 419 452 404 411 237 ...
## $ delta.selection.presentation : int 410 461 490 328 661 543 556 522 514 366 ...
## - attr(*, ".internal.selfref")=<externalptr>
colnames(data)
## [1] "emergency.vehicle.selection"
## [2] "intervention"
## [3] "alert.reason.category"
## [4] "alert.reason"
## [5] "intervention.on.public.roads"
## [6] "floor"
## [7] "location.of.the.event"
## [8] "longitude.intervention"
## [9] "latitude.intervention"
## [10] "emergency.vehicle"
## [11] "emergency.vehicle.type"
## [12] "rescue.center"
## [13] "date.key.selection"
## [14] "time.key.selection"
## [15] "status.preceding.selection"
## [16] "delta.status.preceding.selection.selection"
## [17] "departed.from.its.rescue.center"
## [18] "longitude.before.departure"
## [19] "latitude.before.departure"
## [20] "GPS.tracks.departure.presentation"
## [21] "GPS.tracks.datetime.departure.presentation"
## [22] "OSRM.estimated.distance"
## [23] "OSRM.estimated.duration"
## [24] "delta.selection.departure"
## [25] "delta.departure.presentation"
## [26] "delta.selection.presentation"
data.quali<-data[,c(3,4,5,6,7,10,11,12,15,17)]
data.quanti.y<-data[,-c(3,4,5,6,7,10,11,12,15,17)] #with Ys
data.quanti<-data[,-c(3,4,5,6,7,10,11,12,15,17,24,25,26)]
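Selecting columns by position is fragile: the indices silently point to the wrong variables as soon as a column is added or removed. A possible alternative, sketched here on a toy data.frame, is to select by column type and by name. The object names and values are assumptions for illustration; the notebook's objects are data.tables, where the selection syntax differs slightly.

```r
# Toy table mixing factors, numeric features, and one target; values are illustrative
toy <- data.frame(
  alert.reason.category     = factor(c(3, 3, 9)),
  emergency.vehicle.type    = factor(c("VSAV BSPP", "PSE", "VSAV BSPP")),
  OSRM.estimated.distance   = c(952.5, 2238.5, 3026.0),
  delta.selection.departure = c(86L, 164L, 125L)
)
target_cols <- c("delta.selection.departure")

# Logical masks by column type
is_quali  <- vapply(toy, is.factor, logical(1))
is_quanti <- vapply(toy, is.numeric, logical(1))

toy.quali    <- toy[, is_quali, drop = FALSE]
toy.quanti.y <- toy[, is_quanti, drop = FALSE]            # numeric columns, with the Ys
toy.quanti   <- toy.quanti.y[, setdiff(names(toy.quanti.y), target_cols), drop = FALSE]
```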
We start with a map of Paris and overlay a manageable number of coordinates to get a general overview of the locations and distances in question. For this visualization we use the leaflet package, which includes a variety of nice tools for interactive maps. In this map you can zoom and pan through the intervention locations:
set.seed(1234)
foo <- sample_n(data, 8e3)
leaflet(data = foo) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircleMarkers(~ longitude.intervention, ~latitude.intervention, radius = 1,
color = "blue", fillOpacity = 0.3)
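Since the map is about locations and distances, one feature that could be derived from the coordinates is the straight-line (great-circle) distance between the vehicle position before departure and the intervention address. This is not part of the original notebook; the sketch below uses the haversine formula with an assumed mean Earth radius of 6,371 km, and the function name `haversine_m` is hypothetical. The sample coordinates are those of the first training row shown earlier.

```r
# Great-circle distance in meters between two points given in decimal degrees
haversine_m <- function(lon1, lat1, lon2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# First training row: position before departure -> intervention address
d <- haversine_m(2.288053, 48.88470, 2.284796, 48.87967)
```

The function is vectorized, so it can be applied to whole columns. The straight-line distance is a lower bound on the road distance estimated by OSRM for the same trip.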
We have two dependent variables and one global variable, which is their sum
summary(y_train[,-1])
## delta.selection.departure delta.departure.presentation
## Min. : 0.0 Min. : 1.0
## 1st Qu.: 100.0 1st Qu.: 231.0
## Median : 131.0 Median : 319.0
## Mean : 138.8 Mean : 356.2
## 3rd Qu.: 168.0 3rd Qu.: 434.0
## Max. :17758.0 Max. :22722.0
## delta.selection.presentation
## Min. : 4.0
## 1st Qu.: 363.0
## Median : 458.0
## Mean : 494.9
## 3rd Qu.: 581.0
## Max. :22934.0
The Ys have extreme values, up to more than 6 hours (22,934 seconds)!
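According to the data description, the global variable is the sum of the two partial delays. A quick consistency check could look like the sketch below, which uses the first four training rows shown in the glimpse output above (the object name `y_toy` is an assumption for illustration):

```r
# First four rows of the Ys, copied from the glimpse() output
y_toy <- data.frame(
  delta.selection.departure    = c(86L, 164L, 125L, 168L),
  delta.departure.presentation = c(324L, 297L, 365L, 160L),
  delta.selection.presentation = c(410L, 461L, 490L, 328L)
)

# Recompute the global delay and list the rows where it disagrees
recomputed <- y_toy$delta.selection.departure + y_toy$delta.departure.presentation
mismatch   <- which(recomputed != y_toy$delta.selection.presentation)
```

If `mismatch` is empty on the full data, it is enough to model the two partial delays and add the predictions.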
Plot the density distribution of the time between vehicle selection and departure
data %>%
ggplot(aes(delta.selection.departure)) +
geom_density(fill = "red") +
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 5 rows containing non-finite values (stat_density).
Note the logarithmic x-axis and square-root y-axis.
We find:
The majority of vehicle activation times follow a rather smooth distribution that looks almost log-normal, with a peak just short of 200 seconds, i.e. about 3 minutes.
There are several suspiciously short activation times of less than 10 seconds.
Additionally, there is a long right tail: the maximum is 17,758 seconds, almost 5 hours.
Plot the density distribution of the time between departure and presentation at the scene
data %>%
ggplot(aes(delta.departure.presentation)) +
geom_density(fill = "red") +
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
We find:
The same smooth distribution that looks almost log-normal, with a peak around 300 to 500 seconds, i.e. roughly 5 to 8 minutes (the median is 319 seconds).
Global variable: selection to arrival takes around 8 minutes (mean 494.9 seconds).
In many fire departments the measurement of turnout time and travel time are done manually. An officer presses a button located in the vehicle to signal his departure and his arrival. This process introduces irregularities and variation in the data, which will need to be cleaned.
Box plots Ys
For delta.selection.departure
data %>%
ggplot(aes(delta.selection.departure)) +
geom_boxplot()
For delta.departure.presentation
data %>%
ggplot(aes(delta.departure.presentation)) +
geom_boxplot()
Correlation between the Ys
corr<-rcorr(as.matrix(y_train[,-c(1,4)])) # drop the ID and the global variable, which is correlated with both Ys by construction
y_train_cor= corr$r
corr
## delta.selection.departure
## delta.selection.departure 1.00
## delta.departure.presentation 0.03
## delta.departure.presentation
## delta.selection.departure 0.03
## delta.departure.presentation 1.00
##
## n= 219337
##
##
## P
## delta.selection.departure
## delta.selection.departure
## delta.departure.presentation 0
## delta.departure.presentation
## delta.selection.departure 0
## delta.departure.presentation
corrplot(y_train_cor, method="square",type="upper", order="hclust", tl.col="black", tl.srt=45)
The two Ys are essentially uncorrelated with each other (r = 0.03).
Plot delta.selection.departure
y_train %>%
mutate(
delta.selection.departure_minutes = round(delta.selection.departure / 60, 0),
duration_grp = case_when(
between(delta.selection.departure_minutes, 0, 2) ~ "Less than 2 minutes",
between(delta.selection.departure_minutes, 2, 4) ~ "2 to 4 minutes",
between(delta.selection.departure_minutes, 4, 8) ~ "4 to 8 minutes",
delta.selection.departure_minutes >= 8 ~ "8 or more minutes"
),
duration_grp = factor(duration_grp,
levels = c("Less than 2 minutes", "2 to 4 minutes", "4 to 8 minutes", "8 or more minutes"))
) %>%
group_by(duration_grp) %>%
ggplot(aes(x=duration_grp,group=duration_grp)) +
geom_bar(fill="#E41A1C") +
labs(x="Group", y="count") +
theme_minimal()
Most delta.selection.departure values are a few minutes at most (the third quartile is 168 seconds).
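Duration groups like these can also be built with base R's `cut()`, which keeps the break points and the labels in one place and produces an ordered set of factor levels directly. This is a sketch of an alternative, not the code used for the plot; the sample durations are invented except for the maximum of 17,758 seconds reported in the summary.

```r
# Durations in seconds; illustrative values around the bin edges
secs <- c(0, 86, 119, 120, 239, 240, 479, 480, 17758)

grp <- cut(
  secs / 60,                         # minutes, without rounding
  breaks = c(0, 2, 4, 8, Inf),
  labels = c("Less than 2 minutes", "2 to 4 minutes",
             "4 to 8 minutes", "8 or more minutes"),
  right = FALSE                      # intervals are [lower, upper)
)

counts <- table(grp)
```

Because the intervals are half-open, each duration falls into exactly one group and none becomes `NA`.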
Departure to presentation
y_train %>%
mutate(
delta.departure.presentation_minutes = round(delta.departure.presentation / 60, 0),
duration2_grp = case_when(
between(delta.departure.presentation_minutes, 0, 5) ~ "Less than 5 minutes",
between(delta.departure.presentation_minutes, 5, 10) ~ "5 to 10 minutes",
between(delta.departure.presentation_minutes, 10, 15) ~ "10 to 15 minutes",
between(delta.departure.presentation_minutes, 15, 20) ~ "15 to 20 minutes",
delta.departure.presentation_minutes >= 20 ~ "20 or more minutes"
),
duration2_grp = factor(duration2_grp,
levels = c("Less than 5 minutes", "5 to 10 minutes", "10 to 15 minutes", "15 to 20 minutes","20 or more minutes" ))
) %>%
group_by(duration2_grp) %>%
ggplot(aes(x=duration2_grp,group=duration2_grp)) +
geom_bar(fill="#E41A1C") +
labs(x="Group", y="count") +
theme_minimal()
Most values are below 20 minutes.
Selection to presentation
y_train %>%
mutate(
delta.selection.presentation_minutes = round(delta.selection.presentation / 60, 0),
duration3_grp = case_when(
between(delta.selection.presentation_minutes, 0, 5) ~ "Less than 5 minutes",
between(delta.selection.presentation_minutes, 5, 10) ~ "5 to 10 minutes",
between(delta.selection.presentation_minutes, 10, 15) ~ "10 to 15 minutes",
between(delta.selection.presentation_minutes, 15, 20) ~ "15 to 20 minutes",
delta.selection.presentation_minutes >= 20 ~ "20 or more minutes"
),
duration3_grp = factor(duration3_grp,
levels = c("Less than 5 minutes", "5 to 10 minutes", "10 to 15 minutes", "15 to 20 minutes","20 or more minutes" ))
) %>%
group_by(duration3_grp) %>%
ggplot(aes(x=duration3_grp,group=duration3_grp)) +
geom_bar(fill="#E41A1C") +
labs(x="Group", y="count") +
theme_minimal()
=> Most durations are between 0 and 20 minutes (1,200 seconds). We can use this threshold for our outlier cleaning:
data<- data[data$delta.selection.departure < 1200,]
data<-data[data$delta.departure.presentation < 1200,]
data<-data[data$delta.selection.presentation < 1200,]
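The three filters can also be applied in one step while recording how many rows are dropped, which makes the effect of the threshold explicit. The sketch below uses an invented toy table; `max_seconds`, `y_cols`, and the values are assumptions for illustration.

```r
# Toy Ys in seconds; two rows contain an extreme value
toy <- data.frame(
  delta.selection.departure    = c(86, 164, 17758, 125),
  delta.departure.presentation = c(324, 297, 400, 22722),
  delta.selection.presentation = c(410, 461, 18158, 22847)
)

max_seconds <- 1200   # 20 minutes
y_cols <- c("delta.selection.departure",
            "delta.departure.presentation",
            "delta.selection.presentation")

# Keep a row only if none of its Ys reaches the threshold
keep <- rowSums(toy[, y_cols] >= max_seconds) == 0
toy_filtered <- toy[keep, ]
removed <- nrow(toy) - nrow(toy_filtered)
```

Since the global delay is the sum of two non-negative delays, requiring it to be below the threshold already implies that both parts are below it, so the third condition alone would select the same rows.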
In this part, we will first check whether there is a relation between our dependent variables and the independent variables.
Let’s study Y0 ‘selection-departure’, Y1 ‘departure-presentation’, and Y2 ‘selection-presentation’ against the feature variables.
Data and Ys without the GPS tracks
data.nogps<-data[,-c(21,20)]
We draw a sample of 3,000 observations for the scatter plots
set.seed(1234)
foocor.all <- sample_n(data.nogps, 3000)
Let’s use scatter plots here. A scatter plot places the points along two axes, and the pattern of the resulting points reveals any correlation present. One pattern of special interest is a linear pattern, where the data have the general look of a line going uphill or downhill.
For instance, let’s plot Y0 against OSRM.estimated.duration
car::scatterplot(delta.selection.departure ~ OSRM.estimated.duration, data = foocor.all,
smooth = TRUE, grid = TRUE) # the argument is `smooth`; `smoother` is not recognized and only triggers warnings
Here we can see that there is no clear linear correlation, at most a slight smooth trend. Now let’s look at delta.departure.presentation against the OSRM estimated duration.
car::scatterplot(delta.departure.presentation ~ OSRM.estimated.duration, data = foocor.all,
smooth = TRUE, grid = TRUE) # the argument is `smooth`; `smoother` is not recognized and only triggers warnings
We can observe a clear positive linear correlation. This is expected, since distance and time to presentation are correlated.
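A visual impression like this one can be backed by a correlation coefficient and a simple linear fit. The sketch below runs on simulated data only: the slope of 1.2, the intercept, and the noise level are invented assumptions, not estimates from the challenge data.

```r
set.seed(1234)

# Simulated stand-ins for OSRM.estimated.duration and delta.departure.presentation
n <- 500
osrm_duration <- runif(n, min = 60, max = 900)
transit <- 60 + 1.2 * osrm_duration + rnorm(n, sd = 60)

# Strength of the linear relation
pearson <- cor(osrm_duration, transit)

# Simple linear regression of the transit time on the OSRM estimate
fit   <- lm(transit ~ osrm_duration)
slope <- unname(coef(fit)[2])
```

On the real sample, the same two calls on `foocor.all` would quantify how much of the transit time the OSRM estimate explains.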
Let’s draw scatter plot matrices. For now we only look at the first row, the Ys against the other variables, and try to identify visually whether any independent variable has strong explanatory power.
Let’s start with Y0 ‘selection-departure’
First 5 features
pairs(delta.selection.departure~.,data=foocor.all[,c(1,2,3,4,5,22)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
We find:
There seems to be no clear linear correlation. Let’s plot the next features.
Next features
pairs(delta.selection.departure~.,data=foocor.all[,c(6,7,8,9,10,11,22)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
Same conclusion.
Next features
pairs(delta.selection.departure~.,data=foocor.all[,c(12,13,14,15,16,17,22)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
Same conclusion.
Next features
pairs(delta.selection.departure~.,data=foocor.all[,c(18,19,20,21,22,23,24)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
Same conclusion.
We find no clear visual linear relation between Y0 and the other features.
We will deepen our correlation analysis in part 4.
Let’s quickly check the scatter plot matrices for delta.departure.presentation.
pairs(delta.departure.presentation~.,data=foocor.all[,c(1,2,3,4,5,23)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
Same conclusion.
pairs(delta.departure.presentation~.,data=foocor.all[,c(6,7,8,9,10,11,23)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
Same conclusion
pairs(delta.departure.presentation~.,data=foocor.all[,c(12,13,14,15,16,17,23)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
Same conclusion.
pairs(delta.departure.presentation~.,data=foocor.all[,c(18,19,20,21,22,23)],
main="Simple Scatterplot Matrix",lower.panel = NULL)
We find a linear correlation between Y1 and the OSRM estimated duration and distance (which is to be expected).
We will deepen our correlation analysis in part 4 of the EDA.
But first, let’s manage outliers with some EDA on the independent variables!
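One common way to flag candidate outliers numerically is the 1.5 × IQR rule, which is also the default rule R box plots use to draw individual points. The sketch below is an illustration, not the notebook's cleaning rule; the helper name `iqr_outliers` is hypothetical, and the sample mixes OSRM durations shown earlier with one invented extreme value (5000).

```r
# Flag values lying more than k * IQR outside the quartiles
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

# OSRM durations in seconds; the last value is invented to show the flag
durations <- c(105.8, 243.2, 295.4, 167.0, 263.0, 218.4, 336.5, 85.1, 5000)
flags <- iqr_outliers(durations)
```

A rule like this gives a data-driven starting point, which can then be compared with fixed thresholds chosen from the histograms and box plots.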
which(is.na(data.quanti)) #check for NA values before starting
## integer(0)
hist(data.quanti$longitude.intervention, col="blue",main="Longitude intervention")
hist(data.quanti$latitude.intervention, col="blue",main="Latitude intervention")
hist(data.quanti$delta.status.preceding.selection.selection, col="blue",main="delta status preceding selection-selection")
hist(data.quanti$longitude.before.departure, col="blue",main="longitude before departure ")
hist(data.quanti$latitude.before.departure, col="blue",main="Latitude before departure ")
hist(data.quanti$OSRM.estimated.distance, col="blue",main="OSRM estimated distance")
hist(data.quanti$OSRM.estimated.duration, col="blue",main="OSRM estimated duration")
data.quanti %>%
ggplot(aes(longitude.intervention)) +
geom_density(fill = "red") +
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
data.quanti %>%
ggplot(aes(latitude.intervention)) +
geom_density(fill = "red") +
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
data.quanti %>%
ggplot(aes(delta.status.preceding.selection.selection)) +
geom_density(fill = "red") + # geom_density() has no bins argument
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing non-finite values (stat_density).
data.quanti %>%
ggplot(aes(longitude.before.departure)) +
geom_density(fill = "red") + # geom_density() has no bins argument
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
data.quanti %>%
ggplot(aes(latitude.before.departure)) +
geom_density(fill = "red") + # geom_density() has no bins argument
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
data.quanti %>%
ggplot(aes(OSRM.estimated.distance)) +
geom_density(fill = "red") + # geom_density() has no bins argument
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
data.quanti %>%
ggplot(aes(OSRM.estimated.duration)) +
geom_density(fill = "red") + # geom_density() has no bins argument
scale_x_log10() +
scale_y_sqrt() +
theme_minimal()
boxplot(data.quanti$longitude.intervention, col="blue",main="Longitude intervention")
boxplot(data.quanti$latitude.intervention, col="blue",main="Latitude intervention")
boxplot(data.quanti$delta.status.preceding.selection.selection , col="blue",main="delta status preceding selection-selection")
boxplot(data.quanti$longitude.before.departure, col="blue",main="longitude before departure")
boxplot(data.quanti$latitude.before.departure, col="blue",main="latitude before departure")
boxplot(data.quanti$OSRM.estimated.distance, col="blue",main="OSRM estimated distance")
boxplot(data.quanti$OSRM.estimated.duration, col="blue",main="OSRM estimated duration")
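The points drawn individually on these boxplots can also be listed numerically, which may help when choosing outlier cut-offs. The following is only a minimal sketch on toy values (not the real data), assuming the default 1.5 × IQR whisker rule that boxplot() uses; the vector x merely stands in for a column such as OSRM.estimated.duration.

```r
# Toy values standing in for a column such as OSRM.estimated.duration (not real data)
x <- c(120, 150, 180, 200, 220, 250, 300, 5000)

# boxplot.stats() applies the same rule as boxplot(): observations lying more
# than coef * IQR beyond the hinges are returned in $out
stats <- boxplot.stats(x, coef = 1.5)
outliers <- stats$out

# Upper whisker end: the largest observation that is not flagged as an outlier
upper_whisker <- stats$stats[5]

print(outliers)
print(upper_whisker)
```

A threshold placed just above the upper whisker end would keep every point the boxplot treats as regular; whether that is too strict for heavily skewed variables is a judgment call.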
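# Apply both outlier filters in sequence; the second filter must start from data.clean,
# otherwise it overwrites the first one
data.clean <- data[data$delta.status.preceding.selection.selection < 100000,]
data.clean <- data.clean[data.clean$OSRM.estimated.duration < 1000,]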
#Assign the new value to data.quanti
colnames(data.clean)
## [1] "emergency.vehicle.selection"
## [2] "intervention"
## [3] "alert.reason.category"
## [4] "alert.reason"
## [5] "intervention.on.public.roads"
## [6] "floor"
## [7] "location.of.the.event"
## [8] "longitude.intervention"
## [9] "latitude.intervention"
## [10] "emergency.vehicle"
## [11] "emergency.vehicle.type"
## [12] "rescue.center"
## [13] "date.key.selection"
## [14] "time.key.selection"
## [15] "status.preceding.selection"
## [16] "delta.status.preceding.selection.selection"
## [17] "departed.from.its.rescue.center"
## [18] "longitude.before.departure"
## [19] "latitude.before.departure"
## [20] "GPS.tracks.departure.presentation"
## [21] "GPS.tracks.datetime.departure.presentation"
## [22] "OSRM.estimated.distance"
## [23] "OSRM.estimated.duration"
## [24] "delta.selection.departure"
## [25] "delta.departure.presentation"
## [26] "delta.selection.presentation"
data.quanti<-data.clean[,-c(3,4,5,6,7,10,11,12,15,17,24,25,26)]
data.quanti.y<-data.clean[,-c(3,4,5,6,7,10,11,12,15,17)]
str(data.quali)
## Classes 'data.table' and 'data.frame': 206627 obs. of 10 variables:
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 3 3 3 3 3 3 9 3 3 3 ...
## $ alert.reason : Factor w/ 122 levels "1911","1912",..: 60 41 61 60 60 31 101 60 32 31 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ floor : Factor w/ 45 levels "-10","-6","-5",..: 8 9 10 8 11 9 12 8 8 8 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 48 36 39 36 5 56 36 36 48 94 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 398 334 396 517 473 330 293 575 398 606 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 37 24 37 37 37 37 24 37 37 37 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 15 30 6 72 42 55 42 16 15 50 ...
## $ status.preceding.selection : Factor w/ 2 levels "Disponible","Rentré": 2 2 1 2 2 2 2 2 2 2 ...
## $ departed.from.its.rescue.center: Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
table(data.quali$alert.reason.category)
##
## 1 2 3 4 5 6 7 8 9
## 15032 5402 172294 1185 614 4019 715 188 7178
prop.table(table(data.quali$alert.reason.category))
##
## 1 2 3 4 5 6
## 0.0727494471 0.0261437276 0.8338406888 0.0057349717 0.0029715381 0.0194505074
## 7 8 9
## 0.0034603416 0.0009098521 0.0347389257
table(data.quali$alert.reason)
##
## 1911 1912 1914 1918 1920 1922 1923 1924 1926 1927 1929 1932 1934
## 111 699 977 1305 3 8426 261 264 5 24 575 3 2
## 1940 1941 1942 1944 1951 1952 2011 2012 2014 2015 2017 2018 2020
## 433 1911 6 2 23 2 1914 22 3 21 1490 1934 4
## 2021 2022 2026 2028 2112 2113 2115 2116 2118 2119 2120 2121 2122
## 3 3 5 3 36679 22790 82 511 1722 2164 3327 62 11
## 2123 2124 2126 2127 2128 2129 2131 2132 2133 2134 2135 2136 2137
## 17 5693 184 1 1 73 33 791 6 3362 7498 1661 43
## 2141 2142 2143 2144 2145 2146 2147 2162 2163 2210 2211 2212 2213
## 457 18 341 10 1 8 66 66311 15052 4 28 556 17
## 2214 2215 2216 2218 2221 2311 2312 2313 2314 2317 2411 2412 2413
## 358 3 212 1 9 162 143 229 79 1 163 93 1
## 2414 2416 2421 2422 2423 2424 2426 2430 2431 2432 2511 2514 2519
## 3 6 2350 164 28 19 621 2 240 45 21 30 1
## 2523 2524 2525 2532 2612 2613 2614 2623 2624 2711 2712 2714 2715
## 79 467 55 62 1 2 10 165 10 5744 13 5 2
## 2716 2720 2724 2725 2727 2752 2753 4926 5061 5062 5811 7912 7913
## 1185 84 16 18 1 56 10 12 369 2934 1 57 162
## 7914 10821 10971 11021 93529
## 65 1 1 5 37
prop.table(table(data.quali$alert.reason))
##
## 1911 1912 1914 1918 1920 1922
## 5.371999e-04 3.382907e-03 4.728327e-03 6.315728e-03 1.451892e-05 4.077879e-02
## 1923 1924 1926 1927 1929 1932
## 1.263146e-03 1.277665e-03 2.419819e-05 1.161513e-04 2.782792e-03 1.451892e-05
## 1934 1940 1941 1942 1944 1951
## 9.679277e-06 2.095564e-03 9.248549e-03 2.903783e-05 9.679277e-06 1.113117e-04
## 1952 2011 2012 2014 2015 2017
## 9.679277e-06 9.263068e-03 1.064720e-04 1.451892e-05 1.016324e-04 7.211061e-03
## 2018 2020 2021 2022 2026 2028
## 9.359861e-03 1.935855e-05 1.451892e-05 1.451892e-05 2.419819e-05 1.451892e-05
## 2112 2113 2115 2116 2118 2119
## 1.775131e-01 1.102954e-01 3.968504e-04 2.473055e-03 8.333858e-03 1.047298e-02
## 2120 2121 2122 2123 2124 2126
## 1.610148e-02 3.000576e-04 5.323602e-05 8.227386e-05 2.755206e-02 8.904935e-04
## 2127 2128 2129 2131 2132 2133
## 4.839639e-06 4.839639e-06 3.532936e-04 1.597081e-04 3.828154e-03 2.903783e-05
## 2134 2135 2136 2137 2141 2142
## 1.627086e-02 3.628761e-02 8.038640e-03 2.081045e-04 2.211715e-03 8.711349e-05
## 2143 2144 2145 2146 2147 2162
## 1.650317e-03 4.839639e-05 4.839639e-06 3.871711e-05 3.194161e-04 3.209213e-01
## 2163 2210 2211 2212 2213 2214
## 7.284624e-02 1.935855e-05 1.355099e-04 2.690839e-03 8.227386e-05 1.732591e-03
## 2215 2216 2218 2221 2311 2312
## 1.451892e-05 1.026003e-03 4.839639e-06 4.355675e-05 7.840214e-04 6.920683e-04
## 2313 2314 2317 2411 2412 2413
## 1.108277e-03 3.823314e-04 4.839639e-06 7.888611e-04 4.500864e-04 4.839639e-06
## 2414 2416 2421 2422 2423 2424
## 1.451892e-05 2.903783e-05 1.137315e-02 7.937007e-04 1.355099e-04 9.195313e-05
## 2426 2430 2431 2432 2511 2514
## 3.005416e-03 9.679277e-06 1.161513e-03 2.177837e-04 1.016324e-04 1.451892e-04
## 2519 2523 2524 2525 2532 2612
## 4.839639e-06 3.823314e-04 2.260111e-03 2.661801e-04 3.000576e-04 4.839639e-06
## 2613 2614 2623 2624 2711 2712
## 9.679277e-06 4.839639e-05 7.985404e-04 4.839639e-05 2.779888e-02 6.291530e-05
## 2714 2715 2716 2720 2724 2725
## 2.419819e-05 9.679277e-06 5.734972e-03 4.065296e-04 7.743422e-05 8.711349e-05
## 2727 2752 2753 4926 5061 5062
## 4.839639e-06 2.710198e-04 4.839639e-05 5.807566e-05 1.785827e-03 1.419950e-02
## 5811 7912 7913 7914 10821 10971
## 4.839639e-06 2.758594e-04 7.840214e-04 3.145765e-04 4.839639e-06 4.839639e-06
## 11021 93529
## 2.419819e-05 1.790666e-04
The previous results show that many levels of the alert reason variable are very rare, so several categories should be combined. However, we don’t have enough information about the meaning of the codes to group the levels of that variable.
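When the meaning of the codes is unknown, one possible fallback is to group levels purely by frequency. The sketch below is an assumption, not the method used in this notebook: lump_rare_levels is a hypothetical helper, the 5% threshold is arbitrary, and the factor is toy data rather than the real alert.reason column.

```r
# Hypothetical helper: merge every level whose share is below min_prop into one level
lump_rare_levels <- function(f, min_prop = 0.01, other = "Other") {
  f <- as.factor(f)
  shares <- prop.table(table(f))
  rare <- names(shares)[shares < min_prop]
  lv <- levels(f)
  lv[lv %in% rare] <- other
  levels(f) <- lv  # assigning duplicated labels merges the corresponding levels
  f
}

# Toy factor (not the real data): two frequent codes and two rare ones
toy <- factor(c(rep("2162", 60), rep("2112", 35), rep("1920", 3), rep("2127", 2)))
lumped <- lump_rare_levels(toy, min_prop = 0.05)
print(table(lumped))
```

The threshold trades information for stability: a higher min_prop leaves fewer, better populated levels but hides differences between rare alert reasons.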
table(data.quali$intervention.on.public.roads)
##
## 0 1
## 174494 32133
prop.table(table(data.quali$intervention.on.public.roads))
##
## 0 1
## 0.8444879 0.1555121
table(data.quali$floor)
##
## -10 -6 -5 -4 -3 -2 -1 0 1 2 3
## 1 5 5 113 141 612 3702 127325 17707 14725 12298
## 4 5 6 7 8 9 10 11 12 13 14
## 9770 6892 4708 2756 1691 1121 789 564 418 316 250
## 15 16 17 18 19 20 21 22 23 24 25
## 182 131 134 73 25 14 23 23 11 12 8
## 26 27 28 29 30 31 32 33 37 52 79
## 15 17 3 12 10 1 7 2 4 1 1
## 100
## 9
prop.table(table(data.quali$floor))
##
## -10 -6 -5 -4 -3 -2
## 4.839639e-06 2.419819e-05 2.419819e-05 5.468792e-04 6.823890e-04 2.961859e-03
## -1 0 1 2 3 4
## 1.791634e-02 6.162070e-01 8.569548e-02 7.126368e-02 5.951788e-02 4.728327e-02
## 5 6 7 8 9 10
## 3.335479e-02 2.278502e-02 1.333804e-02 8.183829e-03 5.425235e-03 3.818475e-03
## 11 12 13 14 15 16
## 2.729556e-03 2.022969e-03 1.529326e-03 1.209910e-03 8.808142e-04 6.339927e-04
## 17 18 19 20 21 22
## 6.485116e-04 3.532936e-04 1.209910e-04 6.775494e-05 1.113117e-04 1.113117e-04
## 23 24 25 26 27 28
## 5.323602e-05 5.807566e-05 3.871711e-05 7.259458e-05 8.227386e-05 1.451892e-05
## 29 30 31 32 33 37
## 5.807566e-05 4.839639e-05 4.839639e-06 3.387747e-05 9.679277e-06 1.935855e-05
## 52 79 100
## 4.839639e-06 4.839639e-06 4.355675e-05
The previous results show that the lowest and highest floors contain very few observations, so several categories of the floor variable should be combined.
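Because floor is ordered, its levels can be combined into ranges. The following is a minimal sketch on toy values; the break points and labels are assumptions chosen for illustration, not values taken from this analysis.

```r
# Toy floor values; in the notebook floor is a factor, so convert through character
# to recover the numeric floor instead of the internal level index
floor_f <- factor(c(-2, -1, 0, 0, 1, 3, 5, 6, 10, 11, 37, 100))
floor_num <- as.integer(as.character(floor_f))

# Assumed groups: basement (< 0), ground floor (0), floors 1-5, 6-10, 11 and above.
# cut() uses right-closed intervals by default: (-Inf,-1], (-1,0], (0,5], (5,10], (10,Inf]
floor_group <- cut(floor_num,
                   breaks = c(-Inf, -1, 0, 5, 10, Inf),
                   labels = c("basement", "ground", "1-5", "6-10", "11+"))

print(table(floor_group))
```

Converting with as.integer(as.character(...)) matters here: calling as.integer() directly on a factor would return level positions, not floor numbers.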
table(data.quali$location.of.the.event)
##
## 100 101 102 103 104 105 106 107 108 109 110 111 112
## 2357 388 6 93 2687 1354 161 4869 350 2 50 498 579
## 113 114 115 116 118 119 120 121 122 123 124 125 126
## 146 677 228 14 315 26 94 181 16 61 2 124 24
## 127 128 129 130 131 132 133 134 135 136 137 138 139
## 308 513 13 401 3130 54 120 304 1796 44400 4679 516 51031
## 140 141 142 143 144 145 146 147 148 149 150 151 153
## 5237 235 90 562 8 9 320 1024 31997 9905 5 1 2
## 154 155 156 157 158 159 160 162 163 164 165 166 167
## 3 18 18 149 30 2 1 15 11 709 100 72 364
## 168 169 170 171 172 174 175 177 178 179 180 181 182
## 74 6 14 21 12 6 4 12 9 142 28 74 8
## 183 184 186 187 188 189 190 191 192 193 194 195 196
## 2 19 111 7 16 115 17 260 34 18 107 70 1776
## 197 198 199 200 201 202 203 204 205 206 207 208 209
## 11 292 785 1 747 1297 165 281 465 478 199 92 135
## 210 211 212 213 214 215 216 217 218 219 220 221 222
## 1040 763 194 1 65 1 622 81 842 1866 289 98 391
## 223 224 225 226 227 228 229 230 231 232 233 234 235
## 3 945 78 727 274 2142 321 382 330 706 366 35 63
## 236 237 238 239 240 241 242 243 244 245 246 247 248
## 93 186 27 12 3 1932 193 34 384 2 18 1 1
## 250 252 254 255 256 257 258 259 260 261 262 263 264
## 3 11 137 29 49 749 544 2349 134 226 13 103 201
## 265 267 268 269 270 271 272 274 275 276 277 279 280
## 1 6 10 41 238 34 31 617 17 127 10 2 5
## 282 284 285 286 287 288 289 290 291 292 293 294 295
## 3 1 3 189 6 48 4 1 12 2 1 34 17
## 296 297 298 299 300 301 302 303 305 306 307 308 309
## 20 2 3 2 32 3 1 3 7 3 12 80 31
## 310 311 312 313 315 316 317 318 319 320 321 322 323
## 6 2 38 2 2 135 34 8 21 110 41 143 589
## 324 325
## 1 2413
prop.table(table(data.quali$location.of.the.event))
##
## 100 101 102 103 104 105
## 1.140703e-02 1.877780e-03 2.903783e-05 4.500864e-04 1.300411e-02 6.552871e-03
## 106 107 108 109 110 111
## 7.791818e-04 2.356420e-02 1.693874e-03 9.679277e-06 2.419819e-04 2.410140e-03
## 112 113 114 115 116 118
## 2.802151e-03 7.065872e-04 3.276435e-03 1.103438e-03 6.775494e-05 1.524486e-03
## 119 120 121 122 123 124
## 1.258306e-04 4.549260e-04 8.759746e-04 7.743422e-05 2.952180e-04 9.679277e-06
## 125 126 127 128 129 130
## 6.001152e-04 1.161513e-04 1.490609e-03 2.482735e-03 6.291530e-05 1.940695e-03
## 131 132 133 134 135 136
## 1.514807e-02 2.613405e-04 5.807566e-04 1.471250e-03 8.691991e-03 2.148800e-01
## 137 138 139 140 141 142
## 2.264467e-02 2.497254e-03 2.469716e-01 2.534519e-02 1.137315e-03 4.355675e-04
## 143 144 145 146 147 148
## 2.719877e-03 3.871711e-05 4.355675e-05 1.548684e-03 4.955790e-03 1.548539e-01
## 149 150 151 153 154 155
## 4.793662e-02 2.419819e-05 4.839639e-06 9.679277e-06 1.451892e-05 8.711349e-05
## 156 157 158 159 160 162
## 8.711349e-05 7.211061e-04 1.451892e-04 9.679277e-06 4.839639e-06 7.259458e-05
## 163 164 165 166 167 168
## 5.323602e-05 3.431304e-03 4.839639e-04 3.484540e-04 1.761628e-03 3.581333e-04
## 169 170 171 172 174 175
## 2.903783e-05 6.775494e-05 1.016324e-04 5.807566e-05 2.903783e-05 1.935855e-05
## 177 178 179 180 181 182
## 5.807566e-05 4.355675e-05 6.872287e-04 1.355099e-04 3.581333e-04 3.871711e-05
## 183 184 186 187 188 189
## 9.679277e-06 9.195313e-05 5.371999e-04 3.387747e-05 7.743422e-05 5.565584e-04
## 190 191 192 193 194 195
## 8.227386e-05 1.258306e-03 1.645477e-04 8.711349e-05 5.178413e-04 3.387747e-04
## 196 197 198 199 200 201
## 8.595198e-03 5.323602e-05 1.413174e-03 3.799116e-03 4.839639e-06 3.615210e-03
## 202 203 204 205 206 207
## 6.277011e-03 7.985404e-04 1.359938e-03 2.250432e-03 2.313347e-03 9.630881e-04
## 208 209 210 211 212 213
## 4.452467e-04 6.533512e-04 5.033224e-03 3.692644e-03 9.388899e-04 4.839639e-06
## 214 215 216 217 218 219
## 3.145765e-04 4.839639e-06 3.010255e-03 3.920107e-04 4.074976e-03 9.030766e-03
## 220 221 222 223 224 225
## 1.398656e-03 4.742846e-04 1.892299e-03 1.451892e-05 4.573458e-03 3.774918e-04
## 226 227 228 229 230 231
## 3.518417e-03 1.326061e-03 1.036651e-02 1.553524e-03 1.848742e-03 1.597081e-03
## 232 233 234 235 236 237
## 3.416785e-03 1.771308e-03 1.693874e-04 3.048972e-04 4.500864e-04 9.001728e-04
## 238 239 240 241 242 243
## 1.306702e-04 5.807566e-05 1.451892e-05 9.350182e-03 9.340502e-04 1.645477e-04
## 244 245 246 247 248 250
## 1.858421e-03 9.679277e-06 8.711349e-05 4.839639e-06 4.839639e-06 1.451892e-05
## 252 254 255 256 257 258
## 5.323602e-05 6.630305e-04 1.403495e-04 2.371423e-04 3.624889e-03 2.632763e-03
## 259 260 261 262 263 264
## 1.136831e-02 6.485116e-04 1.093758e-03 6.291530e-05 4.984828e-04 9.727674e-04
## 265 267 268 269 270 271
## 4.839639e-06 2.903783e-05 4.839639e-05 1.984252e-04 1.151834e-03 1.645477e-04
## 272 274 275 276 277 279
## 1.500288e-04 2.986057e-03 8.227386e-05 6.146341e-04 4.839639e-05 9.679277e-06
## 280 282 284 285 286 287
## 2.419819e-05 1.451892e-05 4.839639e-06 1.451892e-05 9.146917e-04 2.903783e-05
## 288 289 290 291 292 293
## 2.323027e-04 1.935855e-05 4.839639e-06 5.807566e-05 9.679277e-06 4.839639e-06
## 294 295 296 297 298 299
## 1.645477e-04 8.227386e-05 9.679277e-05 9.679277e-06 1.451892e-05 9.679277e-06
## 300 301 302 303 305 306
## 1.548684e-04 1.451892e-05 4.839639e-06 1.451892e-05 3.387747e-05 1.451892e-05
## 307 308 309 310 311 312
## 5.807566e-05 3.871711e-04 1.500288e-04 2.903783e-05 9.679277e-06 1.839063e-04
## 313 315 316 317 318 319
## 9.679277e-06 9.679277e-06 6.533512e-04 1.645477e-04 3.871711e-05 1.016324e-04
## 320 321 322 323 324 325
## 5.323602e-04 1.984252e-04 6.920683e-04 2.850547e-03 4.839639e-06 1.167805e-02
The previous results show that many levels of the location of the event variable are very rare, so several categories should be combined. However, we don’t have enough information about the meaning of the codes to group the levels of that variable.
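Before deciding how to combine levels, it can help to measure how concentrated the variable is. The sketch below counts how many of the most frequent levels are needed to cover a target share of the observations; levels_for_coverage is a hypothetical helper, the 90% target is an assumption, and the factor is toy data rather than the real location.of.the.event column.

```r
# Hypothetical helper: number of most frequent levels needed to reach a target share
levels_for_coverage <- function(f, target = 0.90) {
  shares <- sort(prop.table(table(f)), decreasing = TRUE)
  unname(which(cumsum(shares) >= target)[1])
}

# Toy factor (not the real data): three frequent codes and two rare ones
toy_loc <- factor(c(rep("139", 50), rep("136", 30), rep("148", 15),
                    rep("102", 3), rep("109", 2)))
n_keep <- levels_for_coverage(toy_loc, target = 0.90)
print(n_keep)
```

If a handful of levels already covers most rows, keeping those and merging the remainder loses little information.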
table(data.quali$emergency.vehicle)
##
## 1815 1823 1828 1832 1834 1843 1844 1847 1856 1859 1867 1877 1879 1880 1893 1895
## 61 62 210 281 272 11 96 221 92 157 17 2 503 103 99 2
## 1901 1910 1913 1914 1926 1933 1941 1943 1952 1969 1980 1985 1986 1991 1992 1994
## 328 328 716 240 59 1 130 52 494 287 8 264 191 34 457 160
## 1996 2004 2008 2019 2021 2022 2047 2053 2056 2062 2065 2067 2068 2094 2098 2100
## 39 36 494 261 157 72 307 46 150 19 608 468 30 247 5 496
## 2105 2115 2118 2125 2132 2144 2145 2151 2154 2171 2174 2207 2216 2219 2221 2231
## 107 288 339 380 127 128 6 15 239 325 19 4 199 243 104 46
## 2241 2242 2244 2252 2253 2255 2276 2280 2281 2289 2290 2291 2292 2293 2294 2296
## 378 155 204 33 543 284 27 300 60 662 265 15 51 277 341 39
## 2297 2298 2302 2303 2304 2307 2308 2311 2312 2319 2321 2324 2326 2327 2329 2339
## 278 53 50 407 10 39 269 60 225 31 60 284 406 61 91 57
## 2348 2349 2364 2403 2407 2414 2417 2418 2419 2420 2422 2426 2428 2429 2432 2434
## 185 354 169 47 3 32 25 435 36 6 11 147 217 175 177 34
## 2456 2473 2476 2477 2478 2485 2493 2496 2499 2500 2501 2515 2517 2518 2525 2526
## 3 199 76 40 144 105 306 7 129 31 86 1 90 206 4 8
## 2527 2532 2535 2536 2537 2539 2540 2541 2543 2545 2552 2555 2557 2560 2561 2567
## 4 420 320 43 48 71 268 242 650 35 142 42 39 264 230 100
## 2569 2572 2575 2577 2581 2582 2583 2586 2587 2588 2590 2592 2615 2626 2629 2634
## 47 30 68 84 17 1 68 2 25 108 649 52 1 106 42 662
## 2635 2638 2639 2641 2663 2664 2669 2670 2671 2672 2699 2705 2734 2766 2896 2902
## 402 427 811 584 34 2 430 131 5 48 124 76 18 28 60 2
## 2903 2940 2958 2959 3032 3033 3041 3042 3051 3053 3054 3063 3065 3067 3073 3075
## 4 489 1 20 26 49 80 24 60 1 30 1 142 414 44 30
## 3076 3077 3078 3081 3082 3083 3084 3085 3086 3087 3089 3091 3093 3094 3095 3097
## 33 475 499 78 46 80 23 44 1 74 64 64 22 187 64 424
## 3099 3100 3101 3103 3105 3107 3113 3114 3121 3122 3123 3133 3135 3136 3218 3219
## 574 692 508 626 273 23 2 1 6 7 7 225 14 210 62 13
## 3221 3223 3228 3229 3230 3231 3233 3234 3235 3236 3251 3257 3258 3291 3293 3294
## 1 1 1 1 23 4 47 1 58 11 1 8 12 13 1 12
## 3295 3296 3298 3300 3301 3304 3305 3440 3441 3442 3520 3521 3524 3528 3530 3532
## 14 57 63 39 14 4 30 13 1 1 533 250 326 71 38 867
## 3533 3536 3537 3538 3539 3540 3542 3543 3544 3550 3551 3603 3902 3903 3904 3905
## 652 473 48 707 831 321 908 384 889 241 472 28 45 62 61 51
## 4114 4175 4176 4177 4178 4179 4180 4181 4184 4185 4186 4187 4188 4189 4190 4207
## 1 334 433 208 719 289 273 673 6 158 182 154 116 298 160 1
## 4209 4210 4211 4214 4215 4216 4219 4220 4223 4226 4227 4228 4229 4230 4231 4235
## 451 429 195 246 609 4 2 3 330 195 215 562 150 269 209 202
## 4238 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4293 4304 4305 4307
## 512 583 604 441 1 54 90 112 11 99 60 66 1 415 451 458
## 4308 4309 4311 4313 4314 4315 4316 4317 4318 4319 4320 4322 4323 4327 4328 4333
## 658 909 309 476 945 716 1020 887 821 709 419 289 1070 334 1 4
## 4335 4336 4340 4342 4343 4345 4349 4351 4353 4359 4362 4363 4366 4367 4370 4372
## 492 1 437 38 38 104 560 2 6 87 35 312 137 609 3 3
## 4375 4376 4377 4379 4386 4387 4388 4389 4391 4399 4400 4401 4406 4407 4415 4430
## 194 617 216 1 374 912 676 654 2 59 699 775 55 349 259 135
## 4432 4436 4438 4442 4444 4453 4454 4455 4456 4458 4460 4461 4462 4466 4467 4469
## 6 80 2 1197 326 25 7 32 34 18 674 385 2 744 32 1187
## 4471 4475 4485 4486 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513
## 770 577 21 17 727 1203 1185 544 1087 512 662 1228 1256 1291 613 873
## 4515 4516 4518 4519 4520 4522 4523 4524 4530 4531 4533 4534 4536 4537 4538 4539
## 1015 386 17 1160 818 63 1133 948 3 2 1 174 1198 1024 493 1305
## 4540 4541 4542 4543 4544 4545 4546 4565 4648 4653 4844 4870 4871 4872 4873 4874
## 1281 1337 724 1121 743 669 4 1 594 1 1 665 717 591 1055 643
## 4875 4876 4877 4878 4879 4891 4893 4894 4899 4900 4902 4903 4904 4905 4906 4907
## 785 819 96 220 1268 30 55 1 2 1109 11 1196 755 755 864 23
## 4908 4909 4910 4911 4912 4913 4914 4915 4920 4923 4924 4926 4927 4928 4929 4930
## 1291 1039 447 709 718 2 86 22 16 23 1 711 549 930 71 876
## 4931 4940 5059 5060 5264 5269 5621 5622 5623 5624 5625 5633 5635 5637 5642 5645
## 1122 48 3 40 1 1 1287 1255 1615 933 980 912 4 990 1 61
## 5646 5657 5661 5665 5666 5670 5671 5673 5683 5684 5685 5686 5695 5697 5699 5700
## 16 4 1708 3 1 687 912 1155 97 36 42 55 52 415 17 61
## 5701 5702 5703 5708 5710 5712 5713 5715 5717 5718 5719 5720 5721 5723 5725 5726
## 1 1524 54 24 45 46 94 1248 116 92 1466 616 1224 956 82 1390
## 5727 5728 5729 5730 5731 5732 5734 5742 5745 5746 5749 5750 5751 5752 5754 5755
## 325 1070 1368 323 1024 1027 1123 853 856 947 1228 37 50 966 41 974
## 5757 5758 5760 5761 5762 5763 5764 5765 5766 5767 5772 5774 5775 5776 5778 5779
## 3 85 1455 1257 772 674 356 870 587 28 63 57 1394 55 69 735
## 5780 5781 5783 5785 5787 5788 5789 5795 5798 5799 5803 5804 5805 5806 5807 5808
## 700 1343 542 125 49 2 1 372 3 38 88 131 27 4 108 4
## 5810 5811 5812 5813 5814 5816 5817 5819 5820 5821 5822 5823 5881 5882 5883 5885
## 29 116 99 135 6 467 42 105 1898 1384 384 316 826 1334 1272 1184
## 5886 5887 5888 5889 5890 5894 5896 5917 5921 5923 5924 5930 5934 5936 5940 5948
## 941 1285 591 651 13 632 1 110 1 94 64 3 47 48 3 7
## 5949 5962 5963 5964 5965 5966 5968 5969 5975 5982 5995 5996 5997 5998 6000 6001
## 3 2 53 210 27 277 247 373 1 66 1 684 725 1801 1378 836
## 6002 6003 6004 6005 6008 6030 6032 6033 6035 6037 6038 6039 6041 6042 6043 6044
## 1580 439 1308 1273 1 37 150 4 138 619 548 652 361 719 1 4
## 6045 6046 6047 6048 6049 6051 6052 6053 6055 6056 6064 6065 6067 6080 6092
## 36 38 38 22 229 49 53 42 41 5 322 657 603 24 3
prop.table(table(data.quali$emergency.vehicle))
##
## 1815 1823 1828 1832 1834 1843
## 2.952180e-04 3.000576e-04 1.016324e-03 1.359938e-03 1.316382e-03 5.323602e-05
## 1844 1847 1856 1859 1867 1877
## 4.646053e-04 1.069560e-03 4.452467e-04 7.598233e-04 8.227386e-05 9.679277e-06
## 1879 1880 1893 1895 1901 1910
## 2.434338e-03 4.984828e-04 4.791242e-04 9.679277e-06 1.587401e-03 1.587401e-03
## 1913 1914 1926 1933 1941 1943
## 3.465181e-03 1.161513e-03 2.855387e-04 4.839639e-06 6.291530e-04 2.516612e-04
## 1952 1969 1980 1985 1986 1991
## 2.390781e-03 1.388976e-03 3.871711e-05 1.277665e-03 9.243710e-04 1.645477e-04
## 1992 1994 1996 2004 2008 2019
## 2.211715e-03 7.743422e-04 1.887459e-04 1.742270e-04 2.390781e-03 1.263146e-03
## 2021 2022 2047 2053 2056 2062
## 7.598233e-04 3.484540e-04 1.485769e-03 2.226234e-04 7.259458e-04 9.195313e-05
## 2065 2067 2068 2094 2098 2100
## 2.942500e-03 2.264951e-03 1.451892e-04 1.195391e-03 2.419819e-05 2.400461e-03
## 2105 2115 2118 2125 2132 2144
## 5.178413e-04 1.393816e-03 1.640637e-03 1.839063e-03 6.146341e-04 6.194737e-04
## 2145 2151 2154 2171 2174 2207
## 2.903783e-05 7.259458e-05 1.156674e-03 1.572883e-03 9.195313e-05 1.935855e-05
## 2216 2219 2221 2231 2241 2242
## 9.630881e-04 1.176032e-03 5.033224e-04 2.226234e-04 1.829383e-03 7.501440e-04
## 2244 2252 2253 2255 2276 2280
## 9.872863e-04 1.597081e-04 2.627924e-03 1.374457e-03 1.306702e-04 1.451892e-03
## 2281 2289 2290 2291 2292 2293
## 2.903783e-04 3.203841e-03 1.282504e-03 7.259458e-05 2.468216e-04 1.340580e-03
## 2294 2296 2297 2298 2302 2303
## 1.650317e-03 1.887459e-04 1.345420e-03 2.565008e-04 2.419819e-04 1.969733e-03
## 2304 2307 2308 2311 2312 2319
## 4.839639e-05 1.887459e-04 1.301863e-03 2.903783e-04 1.088919e-03 1.500288e-04
## 2321 2324 2326 2327 2329 2339
## 2.903783e-04 1.374457e-03 1.964893e-03 2.952180e-04 4.404071e-04 2.758594e-04
## 2348 2349 2364 2403 2407 2414
## 8.953331e-04 1.713232e-03 8.178989e-04 2.274630e-04 1.451892e-05 1.548684e-04
## 2417 2418 2419 2420 2422 2426
## 1.209910e-04 2.105243e-03 1.742270e-04 2.903783e-05 5.323602e-05 7.114269e-04
## 2428 2429 2432 2434 2456 2473
## 1.050202e-03 8.469368e-04 8.566160e-04 1.645477e-04 1.451892e-05 9.630881e-04
## 2476 2477 2478 2485 2493 2496
## 3.678125e-04 1.935855e-04 6.969080e-04 5.081621e-04 1.480929e-03 3.387747e-05
## 2499 2500 2501 2515 2517 2518
## 6.243134e-04 1.500288e-04 4.162089e-04 4.839639e-06 4.355675e-04 9.969655e-04
## 2525 2526 2527 2532 2535 2536
## 1.935855e-05 3.871711e-05 1.935855e-05 2.032648e-03 1.548684e-03 2.081045e-04
## 2537 2539 2540 2541 2543 2545
## 2.323027e-04 3.436143e-04 1.297023e-03 1.171193e-03 3.145765e-03 1.693874e-04
## 2552 2555 2557 2560 2561 2567
## 6.872287e-04 2.032648e-04 1.887459e-04 1.277665e-03 1.113117e-03 4.839639e-04
## 2569 2572 2575 2577 2581 2582
## 2.274630e-04 1.451892e-04 3.290954e-04 4.065296e-04 8.227386e-05 4.839639e-06
## 2583 2586 2587 2588 2590 2592
## 3.290954e-04 9.679277e-06 1.209910e-04 5.226810e-04 3.140925e-03 2.516612e-04
## 2615 2626 2629 2634 2635 2638
## 4.839639e-06 5.130017e-04 2.032648e-04 3.203841e-03 1.945535e-03 2.066526e-03
## 2639 2641 2663 2664 2669 2670
## 3.924947e-03 2.826349e-03 1.645477e-04 9.679277e-06 2.081045e-03 6.339927e-04
## 2671 2672 2699 2705 2734 2766
## 2.419819e-05 2.323027e-04 6.001152e-04 3.678125e-04 8.711349e-05 1.355099e-04
## 2896 2902 2903 2940 2958 2959
## 2.903783e-04 9.679277e-06 1.935855e-05 2.366583e-03 4.839639e-06 9.679277e-05
## 3032 3033 3041 3042 3051 3053
## 1.258306e-04 2.371423e-04 3.871711e-04 1.161513e-04 2.903783e-04 4.839639e-06
## 3054 3063 3065 3067 3073 3075
## 1.451892e-04 4.839639e-06 6.872287e-04 2.003610e-03 2.129441e-04 1.451892e-04
## 3076 3077 3078 3081 3082 3083
## 1.597081e-04 2.298828e-03 2.414980e-03 3.774918e-04 2.226234e-04 3.871711e-04
## 3084 3085 3086 3087 3089 3091
## 1.113117e-04 2.129441e-04 4.839639e-06 3.581333e-04 3.097369e-04 3.097369e-04
## 3093 3094 3095 3097 3099 3100
## 1.064720e-04 9.050124e-04 3.097369e-04 2.052007e-03 2.777953e-03 3.349030e-03
## 3101 3103 3105 3107 3113 3114
## 2.458536e-03 3.029614e-03 1.321221e-03 1.113117e-04 9.679277e-06 4.839639e-06
## 3121 3122 3123 3133 3135 3136
## 2.903783e-05 3.387747e-05 3.387747e-05 1.088919e-03 6.775494e-05 1.016324e-03
## 3218 3219 3221 3223 3228 3229
## 3.000576e-04 6.291530e-05 4.839639e-06 4.839639e-06 4.839639e-06 4.839639e-06
## 3230 3231 3233 3234 3235 3236
## 1.113117e-04 1.935855e-05 2.274630e-04 4.839639e-06 2.806990e-04 5.323602e-05
## 3251 3257 3258 3291 3293 3294
## 4.839639e-06 3.871711e-05 5.807566e-05 6.291530e-05 4.839639e-06 5.807566e-05
## 3295 3296 3298 3300 3301 3304
## 6.775494e-05 2.758594e-04 3.048972e-04 1.887459e-04 6.775494e-05 1.935855e-05
## 3305 3440 3441 3442 3520 3521
## 1.451892e-04 6.291530e-05 4.839639e-06 4.839639e-06 2.579527e-03 1.209910e-03
## 3524 3528 3530 3532 3533 3536
## 1.577722e-03 3.436143e-04 1.839063e-04 4.195967e-03 3.155444e-03 2.289149e-03
## 3537 3538 3539 3540 3542 3543
## 2.323027e-04 3.421624e-03 4.021740e-03 1.553524e-03 4.394392e-03 1.858421e-03
## 3544 3550 3551 3603 3902 3903
## 4.302439e-03 1.166353e-03 2.284309e-03 1.355099e-04 2.177837e-04 3.000576e-04
## 3904 3905 4114 4175 4176 4177
## 2.952180e-04 2.468216e-04 4.839639e-06 1.616439e-03 2.095564e-03 1.006645e-03
## 4178 4179 4180 4181 4184 4185
## 3.479700e-03 1.398656e-03 1.321221e-03 3.257077e-03 2.903783e-05 7.646629e-04
## 4186 4187 4188 4189 4190 4207
## 8.808142e-04 7.453043e-04 5.613981e-04 1.442212e-03 7.743422e-04 4.839639e-06
## 4209 4210 4211 4214 4215 4216
## 2.182677e-03 2.076205e-03 9.437295e-04 1.190551e-03 2.947340e-03 1.935855e-05
## 4219 4220 4223 4226 4227 4228
## 9.679277e-06 1.451892e-05 1.597081e-03 9.437295e-04 1.040522e-03 2.719877e-03
## 4229 4230 4231 4235 4238 4272
## 7.259458e-04 1.301863e-03 1.011484e-03 9.776070e-04 2.477895e-03 2.821509e-03
## 4273 4274 4275 4276 4277 4278
## 2.923142e-03 2.134281e-03 4.839639e-06 2.613405e-04 4.355675e-04 5.420395e-04
## 4279 4280 4281 4282 4293 4304
## 5.323602e-05 4.791242e-04 2.903783e-04 3.194161e-04 4.839639e-06 2.008450e-03
## 4305 4307 4308 4309 4311 4313
## 2.182677e-03 2.216554e-03 3.184482e-03 4.399231e-03 1.495448e-03 2.303668e-03
## 4314 4315 4316 4317 4318 4319
## 4.573458e-03 3.465181e-03 4.936431e-03 4.292759e-03 3.973343e-03 3.431304e-03
## 4320 4322 4323 4327 4328 4333
## 2.027809e-03 1.398656e-03 5.178413e-03 1.616439e-03 4.839639e-06 1.935855e-05
## 4335 4336 4340 4342 4343 4345
## 2.381102e-03 4.839639e-06 2.114922e-03 1.839063e-04 1.839063e-04 5.033224e-04
## 4349 4351 4353 4359 4362 4363
## 2.710198e-03 9.679277e-06 2.903783e-05 4.210486e-04 1.693874e-04 1.509967e-03
## 4366 4367 4370 4372 4375 4376
## 6.630305e-04 2.947340e-03 1.451892e-05 1.451892e-05 9.388899e-04 2.986057e-03
## 4377 4379 4386 4387 4388 4389
## 1.045362e-03 4.839639e-06 1.810025e-03 4.413750e-03 3.271596e-03 3.165124e-03
## 4391 4399 4400 4401 4406 4407
## 9.679277e-06 2.855387e-04 3.382907e-03 3.750720e-03 2.661801e-04 1.689034e-03
## 4415 4430 4432 4436 4438 4442
## 1.253466e-03 6.533512e-04 2.903783e-05 3.871711e-04 9.679277e-06 5.793047e-03
## 4444 4453 4454 4455 4456 4458
## 1.577722e-03 1.209910e-04 3.387747e-05 1.548684e-04 1.645477e-04 8.711349e-05
## 4460 4461 4462 4466 4467 4469
## 3.261916e-03 1.863261e-03 9.679277e-06 3.600691e-03 1.548684e-04 5.744651e-03
## 4471 4475 4485 4486 4502 4503
## 3.726522e-03 2.792471e-03 1.016324e-04 8.227386e-05 3.518417e-03 5.822085e-03
## 4504 4505 4506 4507 4508 4509
## 5.734972e-03 2.632763e-03 5.260687e-03 2.477895e-03 3.203841e-03 5.943076e-03
## 4510 4511 4512 4513 4515 4516
## 6.078586e-03 6.247973e-03 2.966698e-03 4.225004e-03 4.912233e-03 1.868100e-03
## 4518 4519 4520 4522 4523 4524
## 8.227386e-05 5.613981e-03 3.958824e-03 3.048972e-04 5.483311e-03 4.587977e-03
## 4530 4531 4533 4534 4536 4537
## 1.451892e-05 9.679277e-06 4.839639e-06 8.420971e-04 5.797887e-03 4.955790e-03
## 4538 4539 4540 4541 4542 4543
## 2.385942e-03 6.315728e-03 6.199577e-03 6.470597e-03 3.503898e-03 5.425235e-03
## 4544 4545 4546 4565 4648 4653
## 3.595851e-03 3.237718e-03 1.935855e-05 4.839639e-06 2.874745e-03 4.839639e-06
## 4844 4870 4871 4872 4873 4874
## 4.839639e-06 3.218360e-03 3.470021e-03 2.860226e-03 5.105819e-03 3.111888e-03
## 4875 4876 4877 4878 4879 4891
## 3.799116e-03 3.963664e-03 4.646053e-04 1.064720e-03 6.136662e-03 1.451892e-04
## 4893 4894 4899 4900 4902 4903
## 2.661801e-04 4.839639e-06 9.679277e-06 5.367159e-03 5.323602e-05 5.788208e-03
## 4904 4905 4906 4907 4908 4909
## 3.653927e-03 3.653927e-03 4.181448e-03 1.113117e-04 6.247973e-03 5.028384e-03
## 4910 4911 4912 4913 4914 4915
## 2.163318e-03 3.431304e-03 3.474860e-03 9.679277e-06 4.162089e-04 1.064720e-04
## 4920 4923 4924 4926 4927 4928
## 7.743422e-05 1.113117e-04 4.839639e-06 3.440983e-03 2.656962e-03 4.500864e-03
## 4929 4930 4931 4940 5059 5060
## 3.436143e-04 4.239523e-03 5.430074e-03 2.323027e-04 1.451892e-05 1.935855e-04
## 5264 5269 5621 5622 5623 5624
## 4.839639e-06 4.839639e-06 6.228615e-03 6.073746e-03 7.816016e-03 4.515383e-03
## 5625 5633 5635 5637 5642 5645
## 4.742846e-03 4.413750e-03 1.935855e-05 4.791242e-03 4.839639e-06 2.952180e-04
## 5646 5657 5661 5665 5666 5670
## 7.743422e-05 1.935855e-05 8.266103e-03 1.451892e-05 4.839639e-06 3.324832e-03
## 5671 5673 5683 5684 5685 5686
## 4.413750e-03 5.589783e-03 4.694449e-04 1.742270e-04 2.032648e-04 2.661801e-04
## 5695 5697 5699 5700 5701 5702
## 2.516612e-04 2.008450e-03 8.227386e-05 2.952180e-04 4.839639e-06 7.375609e-03
## 5703 5708 5710 5712 5713 5715
## 2.613405e-04 1.161513e-04 2.177837e-04 2.226234e-04 4.549260e-04 6.039869e-03
## 5717 5718 5719 5720 5721 5723
## 5.613981e-04 4.452467e-04 7.094910e-03 2.981217e-03 5.923718e-03 4.626694e-03
## 5725 5726 5727 5728 5729 5730
## 3.968504e-04 6.727098e-03 1.572883e-03 5.178413e-03 6.620626e-03 1.563203e-03
## 5731 5732 5734 5742 5745 5746
## 4.955790e-03 4.970309e-03 5.434914e-03 4.128212e-03 4.142731e-03 4.583138e-03
## 5749 5750 5751 5752 5754 5755
## 5.943076e-03 1.790666e-04 2.419819e-04 4.675091e-03 1.984252e-04 4.713808e-03
## 5757 5758 5760 5761 5762 5763
## 1.451892e-05 4.113693e-04 7.041674e-03 6.083426e-03 3.736201e-03 3.261916e-03
## 5764 5765 5766 5767 5772 5774
## 1.722911e-03 4.210486e-03 2.840868e-03 1.355099e-04 3.048972e-04 2.758594e-04
## 5775 5776 5778 5779 5780 5781
## 6.746456e-03 2.661801e-04 3.339351e-04 3.557134e-03 3.387747e-03 6.499635e-03
## 5783 5785 5787 5788 5789 5795
## 2.623084e-03 6.049548e-04 2.371423e-04 9.679277e-06 4.839639e-06 1.800346e-03
## 5798 5799 5803 5804 5805 5806
## 1.451892e-05 1.839063e-04 4.258882e-04 6.339927e-04 1.306702e-04 1.935855e-05
## 5807 5808 5810 5811 5812 5813
## 5.226810e-04 1.935855e-05 1.403495e-04 5.613981e-04 4.791242e-04 6.533512e-04
## 5814 5816 5817 5819 5820 5821
## 2.903783e-05 2.260111e-03 2.032648e-04 5.081621e-04 9.185634e-03 6.698060e-03
## 5822 5823 5881 5882 5883 5885
## 1.858421e-03 1.529326e-03 3.997541e-03 6.456078e-03 6.156020e-03 5.730132e-03
## 5886 5887 5888 5889 5890 5894
## 4.554100e-03 6.218936e-03 2.860226e-03 3.150605e-03 6.291530e-05 3.058652e-03
## 5896 5917 5921 5923 5924 5930
## 4.839639e-06 5.323602e-04 4.839639e-06 4.549260e-04 3.097369e-04 1.451892e-05
## 5934 5936 5940 5948 5949 5962
## 2.274630e-04 2.323027e-04 1.451892e-05 3.387747e-05 1.451892e-05 9.679277e-06
## 5963 5964 5965 5966 5968 5969
## 2.565008e-04 1.016324e-03 1.306702e-04 1.340580e-03 1.195391e-03 1.805185e-03
## 5975 5982 5995 5996 5997 5998
## 4.839639e-06 3.194161e-04 4.839639e-06 3.310313e-03 3.508738e-03 8.716189e-03
## 6000 6001 6002 6003 6004 6005
## 6.669022e-03 4.045938e-03 7.646629e-03 2.124601e-03 6.330247e-03 6.160860e-03
## 6008 6030 6032 6033 6035 6037
## 4.839639e-06 1.790666e-04 7.259458e-04 1.935855e-05 6.678701e-04 2.995736e-03
## 6038 6039 6041 6042 6043 6044
## 2.652122e-03 3.155444e-03 1.747110e-03 3.479700e-03 4.839639e-06 1.935855e-05
## 6045 6046 6047 6048 6049 6051
## 1.742270e-04 1.839063e-04 1.839063e-04 1.064720e-04 1.108277e-03 2.371423e-04
## 6052 6053 6055 6056 6064 6065
## 2.565008e-04 2.032648e-04 1.984252e-04 2.419819e-05 1.558364e-03 3.179643e-03
## 6067 6080 6092
## 2.918302e-03 1.161513e-04 1.451892e-05
table(data.quali$emergency.vehicle.type)
##
## AR BEAA BSPP CA CCR BSPP CD BSPP CFS CRAC
## 766 307 6 1285 21 8 21
## CRF DAP DEP EPA BSPP EPAN EPSA ESAV
## 1417 1 4 1322 1247 339 1
## FA FFSS FMOGP BSPP FNPC FPT BSPP FPT SSLIA FPTL BSPP
## 2228 126 1 856 3607 1 288
## OHFOM PEV PSE PST SFCB SP SPTT
## 83 7 30149 17 50 3 1
## UMPS VID VLHP VLR BSPP VPS VRA VRM
## 29 666 37 4154 151 3 6
## VSAV BALA VSAV BSPP VSAV SDIS VSAV SSLIA VSIS VSTI
## 1 157393 2 20 2 1
prop.table(table(data.quali$emergency.vehicle.type))
##
## AR BEAA BSPP CA CCR BSPP CD BSPP CFS
## 3.707163e-03 1.485769e-03 2.903783e-05 6.218936e-03 1.016324e-04 3.871711e-05
## CRAC CRF DAP DEP EPA BSPP EPAN
## 1.016324e-04 6.857768e-03 4.839639e-06 1.935855e-05 6.398002e-03 6.035029e-03
## EPSA ESAV FA FFSS FMOGP BSPP FNPC
## 1.640637e-03 4.839639e-06 1.078271e-02 6.097945e-04 4.839639e-06 4.142731e-03
## FPT BSPP FPT SSLIA FPTL BSPP OHFOM PEV PSE
## 1.745658e-02 4.839639e-06 1.393816e-03 4.016900e-04 3.387747e-05 1.459103e-01
## PST SFCB SP SPTT UMPS VID
## 8.227386e-05 2.419819e-04 1.451892e-05 4.839639e-06 1.403495e-04 3.223199e-03
## VLHP VLR BSPP VPS VRA VRM VSAV BALA
## 1.790666e-04 2.010386e-02 7.307854e-04 1.451892e-05 2.903783e-05 4.839639e-06
## VSAV BSPP VSAV SDIS VSAV SSLIA VSIS VSTI
## 7.617252e-01 9.679277e-06 9.679277e-05 9.679277e-06 4.839639e-06
According to the results, we need to combine several categories for the emergency vehicle type variable. However, we don’t have enough information for grouping levels of that variable.
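When no domain knowledge is available for grouping, one possible workaround is to lump rare levels together by frequency. The sketch below only illustrates that idea on toy data: the helper `lump_rare` and the 5% threshold are assumptions, not part of the original analysis.

```r
# Hypothetical helper: merge every level whose share is below `min_prop`
# into a single "Other" level. The threshold is an arbitrary assumption.
lump_rare <- function(f, min_prop = 0.01, other = "Other") {
  f <- as.factor(f)
  props <- prop.table(table(f))
  rare <- names(props)[props < min_prop]
  lv <- levels(f)
  lv[lv %in% rare] <- other
  levels(f) <- lv   # duplicated names merge the corresponding levels
  f
}

# Toy data standing in for emergency.vehicle.type (counts are made up)
toy <- factor(c(rep("VSAV BSPP", 150), rep("PSE", 40), rep("FA", 9), "DAP"))
lumped <- lump_rare(toy, min_prop = 0.05)
table(lumped)
```

A frequency-based grouping ignores what the vehicle types actually are, so it should be treated as a fallback rather than a substitute for expert knowledge.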
table(data.quali$rescue.center)
##
## 2418 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443
## 15 2325 1906 3481 3088 3079 3793 1321 4143 4219 3556
## 2444 2445 2446 2447 2448 2449 2450 2451 2452 2454 2455
## 3270 2851 4973 4825 3029 3713 3276 2512 2053 3123 2614
## 2456 2457 2458 2459 2460 2462 2463 2464 2465 2467 2469
## 2739 2489 2586 2030 4061 2630 4591 1848 2399 2756 4511
## 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480
## 2968 1076 2930 2257 4127 5607 1383 5637 3911 3201 3123
## 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491
## 2856 1416 2182 2799 2825 4472 1563 5122 2506 2186 3932
## 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2503
## 2242 3136 1759 1235 1588 4443 1265 2588 2185 3037 1088
## 2505 2506 2507 2508 2509 2510 266281 266290 266294 266296 266321
## 622 2175 2874 1768 2620 2952 2 1150 3 1 3
## 266323 266324
## 6 1
prop.table(table(data.quali$rescue.center))
##
## 2418 2434 2435 2436 2437 2438
## 7.259458e-05 1.125216e-02 9.224351e-03 1.684678e-02 1.494480e-02 1.490125e-02
## 2439 2440 2441 2442 2443 2444
## 1.835675e-02 6.393163e-03 2.005062e-02 2.041844e-02 1.720975e-02 1.582562e-02
## 2445 2446 2447 2448 2449 2450
## 1.379781e-02 2.406752e-02 2.335126e-02 1.465927e-02 1.796958e-02 1.585466e-02
## 2451 2452 2454 2455 2456 2457
## 1.215717e-02 9.935778e-03 1.511419e-02 1.265082e-02 1.325577e-02 1.204586e-02
## 2458 2459 2460 2462 2463 2464
## 1.251531e-02 9.824466e-03 1.965377e-02 1.272825e-02 2.221878e-02 8.943652e-03
## 2465 2467 2469 2470 2471 2472
## 1.161029e-02 1.333804e-02 2.183161e-02 1.436405e-02 5.207451e-03 1.418014e-02
## 2473 2474 2475 2476 2477 2478
## 1.092306e-02 1.997319e-02 2.713585e-02 6.693220e-03 2.728104e-02 1.892783e-02
## 2479 2480 2481 2482 2483 2484
## 1.549168e-02 1.511419e-02 1.382201e-02 6.852928e-03 1.056009e-02 1.354615e-02
## 2485 2486 2487 2488 2489 2490
## 1.367198e-02 2.164286e-02 7.564355e-03 2.478863e-02 1.212813e-02 1.057945e-02
## 2491 2492 2493 2494 2495 2496
## 1.902946e-02 1.085047e-02 1.517711e-02 8.512924e-03 5.976954e-03 7.685346e-03
## 2497 2498 2499 2500 2501 2503
## 2.150251e-02 6.122143e-03 1.252498e-02 1.057461e-02 1.469798e-02 5.265527e-03
## 2505 2506 2507 2508 2509 2510
## 3.010255e-03 1.052621e-02 1.390912e-02 8.556481e-03 1.267985e-02 1.428661e-02
## 266281 266290 266294 266296 266321 266323
## 9.679277e-06 5.565584e-03 1.451892e-05 4.839639e-06 1.451892e-05 2.903783e-05
## 266324
## 4.839639e-06
There are several levels that could be considered outliers, and we need to remove them.
table(data.quali$status.preceding.selection)
##
## Disponible Rentré
## 4553 202074
prop.table(table(data.quali$status.preceding.selection))
##
## Disponible Rentré
## 0.02203487 0.97796513
According to the results, the status preceding selection variable has only two levels and is strongly imbalanced: about 98% of the observations are Rentré. With only two levels, there is nothing to group here.
table(data.quali$departed.from.its.rescue.center)
##
## 0 1
## 4553 202074
prop.table(table(data.quali$departed.from.its.rescue.center))
##
## 0 1
## 0.02203487 0.97796513
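The two tables above show identical counts (4553 and 202074), which suggests that status.preceding.selection and departed.from.its.rescue.center may carry the same information. A cross-tabulation is one way to check this. The sketch below uses made-up toy vectors, with the accent dropped from the label for simplicity.

```r
# Toy vectors standing in for the two columns (values are made up)
status   <- factor(c("Rentre", "Rentre", "Disponible", "Rentre", "Disponible"))
departed <- factor(c(1, 1, 0, 1, 0))

# Cross-tabulate: if every row and every column has a single non-zero cell,
# each level of one variable maps to exactly one level of the other.
ct <- table(status, departed)
redundant <- all(rowSums(ct > 0) == 1) && all(colSums(ct > 0) == 1)
print(ct)
redundant
```

If the same check returns TRUE on the real columns, one of the two variables can be dropped without losing information.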
barplot(table(data.quali$alert.reason.category),horiz = FALSE,cex.names=0.8,col="blue",main="Alert Reason Category",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$alert.reason),horiz = FALSE,cex.names=0.8,col="blue",main="Alert Reason",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$intervention.on.public.roads),horiz = FALSE,cex.names=0.8,col="blue",main="Intervention on public road",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$floor),horiz = FALSE,cex.names=0.8,col="blue",main="Floors",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$location.of.the.event),horiz = FALSE,cex.names=0.8,col="blue",main="Location of the event",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$emergency.vehicle),horiz = FALSE,cex.names=0.8,col="blue",main="Emergency Vehicle",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$emergency.vehicle.type),horiz = FALSE,cex.names=0.8,col="blue",main="Emergency Vehicle Type",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$rescue.center),horiz = FALSE,cex.names=0.8,col="blue",main="Rescue center variable",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$status.preceding.selection),horiz = FALSE,cex.names=0.8,col="blue",main="Status preceding selection",ylab="Frequency", plot=TRUE)
barplot(table(data.quali$departed.from.its.rescue.center),horiz = FALSE,cex.names=0.8,col="blue",main="Departed from its rescue center",ylab="Frequency", plot=TRUE)
For the floor variable, we group every floor at or below -2 into a single level "-2" and every floor at or above 17 into a single level "17".
table(data.clean$floor)
##
## -10 -6 -5 -4 -3 -2 -1 0 1 2 3
## 1 5 5 112 140 605 3672 126288 17577 14629 12193
## 4 5 6 7 8 9 10 11 12 13 14
## 9705 6846 4667 2728 1673 1117 777 561 415 314 247
## 15 16 17 18 19 20 21 22 23 24 25
## 179 129 133 73 24 14 23 23 11 12 8
## 26 27 28 29 30 31 32 33 37 52 79
## 15 16 3 12 10 1 7 2 4 1 1
## 100
## 9
levels(data.clean$floor)<-list("-2"=c("-10","-9","-6","-5","-4","-3", "-2"),"-1"=c("-1"),"0"=c("0"),"1"=c("1"),"2"=c("2"),"3"=c("3"),"4"=c("4"),"5"=c("5"),"6"=c("6"),"7"=c("7"),"8"=c("8"),"9"=c("9"),"10"=c("10"),"11"=c("11"),"12"=c("12"),"13"=c("13"),"14"=c("14"),"15"=c("15"),"16"=c("16"),"17"=c("17","18","19","20","21","22","23","24","25","26","27","28","29", "30", "31", "32", "33", "37","52","79","100"))
table(data.clean$floor)
##
## -2 -1 0 1 2 3 4 5 6 7 8
## 868 3672 126288 17577 14629 12193 9705 6846 4667 2728 1673
## 9 10 11 12 13 14 15 16 17
## 1117 777 561 415 314 247 179 129 402
barplot(table(data.clean$floor),horiz = F,cex.names=0.8,col="blue",main="Floors",ylab="Frequency", plot=TRUE)
data.quali<-data.clean[,c(3,4,5,6,7,10,11,12,15,17)]
Apply the same grouping to x_test
levels(x_test$floor)<-list("-2"=c("-10","-9","-6","-5","-4","-3", "-2"),"-1"=c("-1"),"0"=c("0"),"1"=c("1"),"2"=c("2"),"3"=c("3"),"4"=c("4"),"5"=c("5"),"6"=c("6"),"7"=c("7"),"8"=c("8"),"9"=c("9"),"10"=c("10"),"11"=c("11"),"12"=c("12"),"13"=c("13"),"14"=c("14"),"15"=c("15"),"16"=c("16"),"17"=c("17","18","19","20","21","22","23","24","25","26","27","28","29", "30", "31", "32", "33", "37","52","79","100"))
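The explicit level list above has to name every value that occurs in the data. As an alternative sketch, the same grouping can be expressed by clamping the numeric floor value. The helper `clamp_floor` is hypothetical, and it assumes that every level label of the factor is a valid integer.

```r
# Hypothetical helper: convert the factor to its numeric floor value,
# clamp it to [lower, upper], and turn it back into a factor.
clamp_floor <- function(f, lower = -2, upper = 17) {
  x <- as.integer(as.character(f))   # level labels -> numbers
  x <- pmin(pmax(x, lower), upper)   # values outside the range are pulled in
  factor(x, levels = lower:upper)
}

toy_floor <- factor(c(-10, -3, -2, 0, 5, 17, 18, 100))
clamped <- clamp_floor(toy_floor)
as.character(clamped)
nlevels(clamped)
```

Because the clamp does not depend on which floors were observed, the same call would handle a value in x_test that never appeared in the training data.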
table(data.quali$rescue.center)
##
## 2418 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443
## 15 2314 1900 3466 3051 3060 3783 1306 4112 4169 3503
## 2444 2445 2446 2447 2448 2449 2450 2451 2452 2454 2455
## 3233 2840 4958 4780 2988 3696 3265 2487 2038 3085 2590
## 2456 2457 2458 2459 2460 2462 2463 2464 2465 2467 2469
## 2711 2457 2542 2010 4025 2586 4572 1819 2381 2740 4491
## 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480
## 2933 1058 2915 2207 4105 5575 1372 5607 3872 3184 3078
## 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491
## 2848 1403 2171 2787 2819 4456 1547 5113 2457 2178 3911
## 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2503
## 2221 3110 1748 1213 1569 4401 1252 2577 2159 3013 1083
## 2505 2506 2507 2508 2509 2510 266281 266290 266294 266296 266321
## 615 2167 2845 1742 2608 2938 2 1146 2 1 2
## 266323 266324
## 3 1
We remove some levels of the data.clean$rescue.center variable that are not useful to us because they contain very few observations.
table(data.clean$rescue.center)
levels(data.clean$rescue.center)
rare.centers <- c("266281","266294","266296","266321","266323","266324")
data.clean <- data.clean[!data.clean$rescue.center %in% rare.centers,]
table(droplevels(data.clean$rescue.center))
table(data.clean$rescue.center)
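Note that `table(droplevels(...))` only prints a table without the empty levels; the factor stored in the data frame keeps them, which is consistent with the str() output further down still reporting 79 levels for rescue.center. A minimal sketch on toy data (values are made up) shows the difference:

```r
# Toy data frame standing in for data.clean (values are made up)
toy <- data.frame(rescue.center = factor(c("2434", "2435", "266281", "2434")),
                  y = c(10, 20, 30, 40))

rare <- c("266281")                       # levels to discard
toy <- toy[!toy$rescue.center %in% rare, ]

nlevels(toy$rescue.center)                # still 3: the empty level remains
toy$rescue.center <- droplevels(toy$rescue.center)
nlevels(toy$rescue.center)                # 2 after reassigning
```

Whether to drop the empty levels is a design choice: keeping them yields all-zero columns after one-hot encoding, while dropping them means the test set must be treated the same way.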
Let’s compute the difference between the actual transit time (delta.departure.presentation) and the OSRM estimated duration (predicted). This difference cannot be used as a feature in our future model, because it is built from one of the target variables.
time.diff<-data.clean$delta.departure.presentation-data.clean$OSRM.estimated.duration
time.diff %>% head
## [1] 218.2 53.8 69.6 -6.6 260.4 231.3
hist(time.diff, col="blue",main="Time diff")
summary(time.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -640.6 27.1 89.0 110.8 169.2 1008.9
Here we can see that the median difference between the actual transit time and the OSRM estimated duration is about 90 s (1 min 30 s), with a maximum of roughly 1000 s. The minimum is negative (about -640 s), which means that the brigade can also drive faster than OSRM predicted.
In this section we build new features from the existing ones, trying to find better predictors for our target variable. We prefer to define all these new features in a single code block below and then study them in the following subsections.
Estimated speed from OSRM: distance / time, converted from m/s to km/h
data.fe<-data.clean
data.fe$OSRM.estimated.speed<-(data.clean$OSRM.estimated.distance/1000)/(data.clean$OSRM.estimated.duration/60^2)
data.fe$OSRM.estimated.speed %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.661 26.154 30.780 31.369 35.601 69.382
hist(data.fe$OSRM.estimated.speed , col="blue",main="Estimated Speed km/h")
Same for x_test
x_test$OSRM.estimated.speed<-(x_test$OSRM.estimated.distance/1000)/(x_test$OSRM.estimated.duration/60^2)
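As a quick check of the unit conversion, the formula can be wrapped in a small function and applied to one observation. The function name `osrm_speed_kmh` is hypothetical; the input values are the rounded distance and duration of the first row shown in the str() output of data.fe later in the notebook.

```r
# Convert an OSRM distance (metres) and duration (seconds) to km/h
osrm_speed_kmh <- function(distance_m, duration_s) {
  (distance_m / 1000) / (duration_s / 3600)
}

# 1283 m in 214 s
osrm_speed_kmh(1283, 214)   # about 21.6 km/h
```

A duration of zero would produce Inf, so it may be worth checking for such rows before using the feature.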
data.fe.sample <- sample_n(data.fe, 100)
data.fe.sample %>%
ggplot(aes(x=alert.reason, y=OSRM.estimated.speed,group=alert.reason)) +
geom_boxplot()
data.fe.sample %>%
ggplot(aes(x=rescue.center, y=OSRM.estimated.speed)) +
geom_boxplot()
data.fe.sample %>%
ggplot(aes(x=alert.reason.category, y=OSRM.estimated.speed)) +
geom_boxplot()
data.fe$month <-month(as.Date(as.character(data.fe$date.key.selection), "%Y%m%d"))
data.fe$weekdays <-weekdays(as.Date(as.character(data.fe$date.key.selection), "%Y%m%d"))
Same for x_test
x_test$month <-month(as.Date(as.character(x_test$date.key.selection), "%Y%m%d"))
x_test$weekdays <-weekdays(as.Date(as.character(x_test$date.key.selection), "%Y%m%d"))
data.fe$hours <-as.hms(formatC(as.integer(data.fe$time.key.selection), big.mark = ":", big.interval = 2L))
## Warning: as.hms() is deprecated, please use as_hms().
## This warning is displayed once per session.
## Warning: Lossy cast from <character> to <hms> at position(s) 4, 5, 10, 84,
## 103, ... (and 7415 more)
data.fe$hours <- replace(data.fe$hours, is.na(data.fe$hours), "00:00:00")
Same for x_test
x_test$hours <-as.hms(formatC(as.integer(x_test$time.key.selection), big.mark = ":", big.interval = 2L))
## Warning: Lossy cast from <character> to <hms> at position(s) 36, 40, 54, 130,
## 138, ... (and 3880 more)
x_test$hours <- replace(x_test$hours, is.na(x_test$hours), "00:00:00")
data.fe$hours <- substr(data.fe$hours, 1, 2)
x_test$hours <- substr(x_test$hours, 1, 2)
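The lossy-cast warnings above presumably come from time keys with fewer than five digits, which do not form a complete h:m:s string; these become NA and are then replaced by "00:00:00". Assuming time.key.selection is an integer of the form HHMMSS, the hour can also be obtained directly by integer division, which avoids the string parsing altogether. The helper `hour_from_key` is hypothetical.

```r
# Assumption: the time key is an integer of the form HHMMSS
# (e.g. 93045 for 09:30:45, 3045 for 00:30:45).
hour_from_key <- function(time_key) {
  sprintf("%02d", as.integer(time_key) %/% 10000L)
}

hour_from_key(c(93045, 3045, 45, 235959))
```

Under that assumption the result is the same two-character hour string that `substr()` extracts above, without relying on the NA replacement step.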
table(data.fe$month)
##
## 1 2 3 4 5 6 7 8 10 11 12
## 19483 17123 18860 17661 18879 19098 19792 16327 19134 19065 19565
prop.table(table(data.fe$month))
##
## 1 2 3 4 5 6 7
## 0.09504505 0.08353213 0.09200583 0.08615668 0.09209852 0.09316688 0.09655246
## 8 10 11 12
## 0.07964895 0.09334250 0.09300590 0.09544508
barplot(table(data.fe$month),horiz = F,cex.names=0.8,col="blue",main="Month",ylab="Frequency", plot=TRUE)
table(data.fe$weekdays)
##
## dimanche jeudi lundi mardi mercredi samedi vendredi
## 28066 29257 31000 29163 29178 28093 30230
prop.table(table(data.fe$weekdays))
##
## dimanche jeudi lundi mardi mercredi samedi vendredi
## 0.1369160 0.1427261 0.1512291 0.1422676 0.1423407 0.1370477 0.1474728
barplot(table(data.fe$weekdays),horiz = F,cex.names=0.8,col="blue",main="Day",ylab="Frequency", plot=TRUE)
table(data.fe$hours)
##
## 00 01 02 03 04 05 06 07 08 09 10 11 12
## 7420 6346 5153 4706 4142 3953 4376 5562 7596 9230 9871 10546 11326
## 13 14 15 16 17 18 19 20 21 22 23
## 11159 10737 10307 10159 10282 10684 11349 11122 10462 9767 8732
prop.table(table(data.fe$hours))
##
## 00 01 02 03 04 05 06
## 0.03619742 0.03095806 0.02513818 0.02295755 0.02020616 0.01928415 0.02134770
## 07 08 09 10 11 12 13
## 0.02713343 0.03705601 0.04502725 0.04815427 0.05144716 0.05525228 0.05443760
## 14 15 16 17 18 19 20
## 0.05237893 0.05028124 0.04955924 0.05015928 0.05212038 0.05536449 0.05425710
## 21 22 23
## 0.05103738 0.04764692 0.04259782
barplot(table(data.fe$hours),horiz = F,cex.names=0.8,col="blue",main="Hours",ylab="Frequency", plot=TRUE)
We can now remove date.key.selection and time.key.selection
data.fe<-data.fe[,-c(13,14)]
Same for x_test
x_test<-x_test[,-c(13,14)]
Convert the derived date and time variables into factors
data.fe$hours <- as.factor(data.fe$hours)
data.fe$weekdays <- as.factor(data.fe$weekdays)
data.fe$month <- as.factor(data.fe$month)
x_test$hours <- as.factor(x_test$hours)
x_test$weekdays <- as.factor(x_test$weekdays)
x_test$month <- as.factor(x_test$month)
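Calling `as.factor()` separately on the two data sets gives each factor only the levels observed in that set. The month table above, for example, has no level 9 in the training data. One way to keep the encodings aligned is to build both factors on a shared level set. The vectors below are toy values, not the real data.

```r
# Toy month vectors: here the training values lack month 9 and the
# test values contain it (made-up example)
train_month <- c(1, 2, 8, 10)
test_month  <- c(2, 9, 10)

# Build both factors on the union of observed values so codes line up
all_levels <- sort(union(train_month, test_month))
train_f <- factor(train_month, levels = all_levels)
test_f  <- factor(test_month,  levels = all_levels)

identical(levels(train_f), levels(test_f))
```

With identical levels, a later one-hot encoding produces the same columns in the same order for both sets.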
Several vehicles can be dispatched to the same intervention. It is worth taking into account that there might be a correlation between the number of vehicles sent to an intervention and the preparation time. Once we have counted the vehicles per intervention and stored that count as a feature, we will be able to remove the intervention ID.
data.fe$intervention %>% as.factor() %>% str
## Factor w/ 196480 levels "12649492","12650159",..: 111710 2131 168556 8491 173225 5521 137030 17551 143501 143369 ...
data.fe$emergency.vehicle.selection %>% str
## int [1:204987] 5105452 4720915 5365374 4741586 5381209 4731603 5196431 4774057 5277444 5277017 ...
204987-196480
## [1] 8507
x_test$intervention %>% as.factor() %>% str
## Factor w/ 102466 levels "12649492","12650159",..: 73849 56296 64950 86532 68610 73049 784 52060 18501 92121 ...
Count the number of rows (dispatched vehicles) per intervention ID.
a<-data.fe %>%
group_by(intervention)%>%
tally()
b<-x_test %>%
group_by(intervention)%>%
tally()
a %>% setDT
a %>% str
## Classes 'data.table' and 'data.frame': 196480 obs. of 2 variables:
## $ intervention: int 12649492 12650159 12651734 12651765 12651790 12651791 12651829 12651860 12651907 12651960 ...
## $ n : int 1 1 1 1 1 1 1 1 1 1 ...
b %>% setDT
data.fe<-merge(a,data.fe, by.x = 'intervention', by.y = 'intervention', all=FALSE)
x_test<-merge(b,x_test, by.x = 'intervention', by.y = 'intervention', all=FALSE)
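The tally-then-merge pattern above can also be written without a join. As an alternative sketch in base R, `ave()` computes the group size and returns it aligned with the original rows. The toy data frame and its IDs are made up.

```r
# Toy data: one row per dispatched vehicle (IDs are made up)
toy <- data.frame(intervention = c(101, 102, 102, 103, 103, 103),
                  vehicle      = c("a", "b", "c", "d", "e", "f"))

# ave() applies length() within each intervention group and returns a
# vector aligned with the original rows, so no merge is needed.
toy$n <- ave(seq_along(toy$intervention), toy$intervention, FUN = length)
toy
```

Unlike `merge()`, this keeps the original row order, which can matter if other objects are indexed by row position.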
Convert the count n into a factor
data.fe$n<-data.fe$n %>% as.factor
x_test$n<-x_test$n %>% as.factor
Now that we have intervention frequency as factor, we can remove intervention ID
data.fe<-data.fe[,-c(1)]
x_test<-x_test[,-c(1)]
We decide to remove the GPS columns, since the information they contain is already available in other variables.
data.fe<-data.fe[,-c(18,19)]
x_test<-x_test[,-c(18,19)]
Update data.quanti and data.quali
colnames(data.fe)
## [1] "n"
## [2] "emergency.vehicle.selection"
## [3] "alert.reason.category"
## [4] "alert.reason"
## [5] "intervention.on.public.roads"
## [6] "floor"
## [7] "location.of.the.event"
## [8] "longitude.intervention"
## [9] "latitude.intervention"
## [10] "emergency.vehicle"
## [11] "emergency.vehicle.type"
## [12] "rescue.center"
## [13] "status.preceding.selection"
## [14] "delta.status.preceding.selection.selection"
## [15] "departed.from.its.rescue.center"
## [16] "longitude.before.departure"
## [17] "latitude.before.departure"
## [18] "OSRM.estimated.distance"
## [19] "OSRM.estimated.duration"
## [20] "delta.selection.departure"
## [21] "delta.departure.presentation"
## [22] "delta.selection.presentation"
## [23] "OSRM.estimated.speed"
## [24] "month"
## [25] "weekdays"
## [26] "hours"
str(data.fe)
## Classes 'data.table' and 'data.frame': 204987 obs. of 26 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ emergency.vehicle.selection : int 4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ alert.reason : Factor w/ 122 levels "1911","1912",..: 6 7 60 32 60 3 4 60 3 31 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ status.preceding.selection : Factor w/ 2 levels "Disponible","Rentré": 2 2 2 2 2 2 1 2 2 2 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ longitude.before.departure : num 2.33 2.28 2.34 2.29 2.18 ...
## $ latitude.before.departure : num 48.9 48.9 48.9 48.9 48.9 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ delta.departure.presentation : int 174 376 214 268 409 678 98 187 623 181 ...
## $ delta.selection.presentation : int 413 423 332 417 506 791 162 307 757 275 ...
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
data.quanti<-data.fe[,c(23,22,19,18,17,16,14,9,8,2)]
data.quanti.y<-data.fe[,c(23,22,21,20,19,18,17,16,14,9,8,2)]
data.quali<-data.fe[,-c(23,22,21,20,19,18,17,16,14,9,8,2)]
data.quali.y<-data.fe[,-c(23,22,19,18,17,16,14,9,8,2)]
str(data.quanti)
## Classes 'data.table' and 'data.frame': 204987 obs. of 10 variables:
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ delta.selection.presentation : int 413 423 332 417 506 791 162 307 757 275 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ latitude.before.departure : num 48.9 48.9 48.9 48.9 48.9 ...
## $ longitude.before.departure : num 2.33 2.28 2.34 2.29 2.18 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ emergency.vehicle.selection : int 4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(data.quanti.y)
## Classes 'data.table' and 'data.frame': 204987 obs. of 12 variables:
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ delta.selection.presentation : int 413 423 332 417 506 791 162 307 757 275 ...
## $ delta.departure.presentation : int 174 376 214 268 409 678 98 187 623 181 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ latitude.before.departure : num 48.9 48.9 48.9 48.9 48.9 ...
## $ longitude.before.departure : num 2.33 2.28 2.34 2.29 2.18 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ emergency.vehicle.selection : int 4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(data.quali)
## Classes 'data.table' and 'data.frame': 204987 obs. of 14 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ alert.reason : Factor w/ 122 levels "1911","1912",..: 6 7 60 32 60 3 4 60 3 31 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ status.preceding.selection : Factor w/ 2 levels "Disponible","Rentré": 2 2 2 2 2 2 1 2 2 2 ...
## $ departed.from.its.rescue.center: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(data.quali.y)
## Classes 'data.table' and 'data.frame': 204987 obs. of 16 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ alert.reason : Factor w/ 122 levels "1911","1912",..: 6 7 60 32 60 3 4 60 3 31 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ status.preceding.selection : Factor w/ 2 levels "Disponible","Rentré": 2 2 2 2 2 2 1 2 2 2 ...
## $ departed.from.its.rescue.center: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ delta.departure.presentation : int 174 376 214 268 409 678 98 187 623 181 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
After engineering new features and before starting the modeling, we have to visualize the relations between our parameters using a correlation matrix. For this, we need to change all the factor features into a numerical format. The visualization uses the corrplot function from the eponymous package. Corrplot gives us great flexibility in manipulating the style of our plot.
What we see below, are the color-coded correlation coefficients for each combination of two features. In simplest terms: this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation (dark blue and dark red in the plots below).
Here, we are of course interested in whether and how strongly our features correlate with the target variables (the ys). But we also want to know whether our potential predictors are correlated with each other, so that we can reduce the collinearity in our data set and improve the robustness of our prediction.
data.fe %>%
select(-emergency.vehicle.selection) %>%
mutate(n = as.integer(n),
alert.reason = as.integer(alert.reason),
floor = as.integer(floor),
emergency.vehicle = as.integer(emergency.vehicle),
rescue.center = as.integer(rescue.center),
delta.selection.presentation = as.integer(delta.selection.presentation),
month = as.integer(month),
hours = as.integer(hours),
weekdays = as.integer(weekdays),
alert.reason.category = as.integer(alert.reason.category),
intervention.on.public.roads = as.integer(intervention.on.public.roads),
location.of.the.event = as.integer(location.of.the.event),
emergency.vehicle.type = as.integer(emergency.vehicle.type),
status.preceding.selection = as.integer(status.preceding.selection),
departed.from.its.rescue.center = as.integer(departed.from.its.rescue.center))%>%
cor(use="complete.obs", method = "spearman") %>%
corrplot(type="lower", method="pie",order="hclust",
diag=FALSE)
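One caveat about the conversion used here: `as.integer()` on a factor returns the internal level codes, not the numbers written in the labels, and the levels are sorted as text. The toy factor below illustrates the difference.

```r
f <- factor(c("2", "10", "4", "33"))
levels(f)                          # sorted as text: "10" "2" "33" "4"
as.integer(f)                      # internal level codes
as.integer(as.character(f))        # the numeric values of the labels
```

For purely nominal variables such as identifiers, the codes follow an arbitrary ordering, so the corresponding correlation coefficients should be read with caution.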
Add coefficients
data.fe %>%
select(-emergency.vehicle.selection) %>%
mutate(n = as.integer(n),
alert.reason = as.integer(alert.reason),
floor = as.integer(floor),
emergency.vehicle = as.integer(emergency.vehicle),
rescue.center = as.integer(rescue.center),
delta.selection.presentation = as.integer(delta.selection.presentation),
month = as.integer(month),
hours = as.integer(hours),
weekdays = as.integer(weekdays),
alert.reason.category = as.integer(alert.reason.category),
intervention.on.public.roads = as.integer(intervention.on.public.roads),
location.of.the.event = as.integer(location.of.the.event),
emergency.vehicle.type = as.integer(emergency.vehicle.type),
status.preceding.selection = as.integer(status.preceding.selection),
departed.from.its.rescue.center = as.integer(departed.from.its.rescue.center))%>%
cor(use="complete.obs", method = "spearman") %>%
corrplot(type="lower", method="square",order="hclust",
addCoef.col = "black", diag=FALSE)
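The calls above use `method = "spearman"`, which measures monotone rather than strictly linear association. A small sketch on simulated data (not the project data) shows that the Spearman coefficient is the Pearson correlation computed on ranks:

```r
set.seed(1)
x <- rnorm(50)
y <- exp(x) + rnorm(50, sd = 0.1)   # monotone but non-linear relation

spearman <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y))
all.equal(spearman, pearson_on_ranks)
```

Working on ranks makes the coefficient less sensitive to extreme values, such as the long right tail of the response times.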
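Based on the correlation matrix, we find several strongly correlated predictors. To reduce collinearity, we remove the following variables: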
data.fe<-data.fe[,-c("alert.reason","status.preceding.selection","longitude.before.departure","latitude.before.departure")]
x_test<-x_test[,-c("alert.reason","status.preceding.selection","longitude.before.departure","latitude.before.departure")]
Boosted Tree aka XGBoost. XGBoost is a well-known and efficient open source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). Training proceeds iteratively, adding new trees that predict the residuals or errors of the previous trees, which are then combined with the previous trees to make the final prediction.
In other words, we build a tree, look at which observations are predicted poorly, and fit the next tree to those errors so that they receive more attention in the final prediction.
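The residual-fitting idea can be illustrated with a deliberately simplified sketch in base R. It uses one-split regression stumps on simulated data and a squared-error loss. It omits the regularization and other refinements that XGBoost adds, and the helper names are hypothetical.

```r
# Fit a one-split regression stump to (x, r) by least squares
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2      # midpoints between values
  sse <- sapply(cuts, function(s) {
    left <- r[x <= s]; right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- cuts[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}
predict_stump <- function(m, x) ifelse(x <= m$split, m$left, m$right)

set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

eta <- 0.3                        # shrinkage applied to each new tree
pred <- rep(mean(y), length(y))   # start from a constant prediction
rmse <- numeric(50)
for (k in 1:50) {
  res <- y - pred                         # errors of the current ensemble
  m <- fit_stump(x, res)                  # new tree targets those errors
  pred <- pred + eta * predict_stump(m, x)
  rmse[k] <- sqrt(mean((y - pred)^2))
}
round(rmse[c(1, 10, 50)], 3)
```

The training RMSE shrinks with every round because each stump is fitted to what the current ensemble still gets wrong. On real data, a held-out set is needed to see when additional rounds stop helping.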
Let us train our model with a maximum of 100 boosting iterations and then use it for prediction.
data.fe %>% str
## Classes 'data.table' and 'data.frame': 204987 obs. of 22 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ emergency.vehicle.selection : int 4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ delta.departure.presentation : int 174 376 214 268 409 678 98 187 623 181 ...
## $ delta.selection.presentation : int 413 423 332 417 506 791 162 307 757 275 ...
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Store final dataset
dataset<-data.fe
Split the data into a training set (80%) and a test set (20%)
set.seed(123)
n_train <- floor(0.8 * nrow(dataset))
train_indices <- sample(1:nrow(dataset), n_train)
trainset <- dataset[train_indices, ]
testset <- dataset[-train_indices, ]
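The split can be sanity-checked on a toy data frame. The sketch below, which uses made-up data, makes the training size an explicit whole number and verifies that the two subsets are disjoint and together cover all rows.

```r
set.seed(123)
toy <- data.frame(id = 1:13, y = rnorm(13))

n_train <- floor(0.8 * nrow(toy))            # whole number of training rows
train_idx <- sample(seq_len(nrow(toy)), n_train)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

c(nrow(train), nrow(test))                   # 10 and 3
length(intersect(train$id, test$id))         # 0: the sets are disjoint
```

Fixing the seed before sampling, as done above, makes the split reproducible across runs.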
trainset %>% head
## n emergency.vehicle.selection alert.reason.category
## 1: 1 5387610 3
## 2: 1 5410147 3
## 3: 1 5167483 3
## 4: 1 5132649 3
## 5: 3 5315418 1
## 6: 1 5058200 3
## intervention.on.public.roads floor location.of.the.event
## 1: 0 0 136
## 2: 0 0 107
## 3: 1 0 228
## 4: 1 0 149
## 5: 0 2 140
## 6: 1 0 148
## longitude.intervention latitude.intervention emergency.vehicle
## 1: 2.358910 48.88494 4316
## 2: 2.328205 48.86534 4305
## 3: 2.229676 48.91456 4931
## 4: 2.436133 48.85821 6065
## 5: 2.390189 48.94845 1834
## 6: 2.374644 48.88676 5885
## emergency.vehicle.type rescue.center
## 1: VSAV BSPP 2469
## 2: PSE 2493
## 3: VSAV BSPP 2455
## 4: VSAV BSPP 2449
## 5: PSE 2467
## 6: VSAV BSPP 2439
## delta.status.preceding.selection.selection departed.from.its.rescue.center
## 1: 5681 1
## 2: 1625 1
## 3: 1074 1
## 4: 2365 1
## 5: 5046 1
## 6: 748 1
## OSRM.estimated.distance OSRM.estimated.duration delta.selection.departure
## 1: 1533.9 166.9 166
## 2: 1800.5 219.8 156
## 3: 2899.7 293.5 65
## 4: 2953.5 370.9 143
## 5: 2755.3 248.9 126
## 6: 2086.2 295.8 126
## delta.departure.presentation delta.selection.presentation
## 1: 292 458
## 2: 592 748
## 3: 354 419
## 4: 467 610
## 5: 512 638
## 6: 138 264
## OSRM.estimated.speed month weekdays hours
## 1: 33.08592 11 lundi 23
## 2: 29.48954 12 jeudi 17
## 3: 35.56702 8 lundi 18
## 4: 28.66703 7 vendredi 17
## 5: 39.85167 10 lundi 23
## 6: 25.38986 6 mardi 02
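The 80/20 split above can be illustrated on a toy table. This is a minimal sketch in base R; `toy`, `toy_train` and `toy_test` are hypothetical names standing in for `dataset`, `trainset` and `testset`.

```r
# Minimal sketch of an 80/20 random split (base R only).
# `toy` is a hypothetical stand-in for `dataset`.
set.seed(123)
toy <- data.frame(id = 1:10, y = rnorm(10))

n_train <- floor(0.8 * nrow(toy))                     # integer number of training rows
train_indices <- sample(seq_len(nrow(toy)), n_train)  # row numbers drawn without replacement

toy_train <- toy[train_indices, ]
toy_test  <- toy[-train_indices, ]                    # every row not drawn for training

# The two parts are disjoint and together cover all rows
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
```

Negative indexing keeps exactly the rows that were not sampled, so no observation can end up in both sets.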
Store only y0 target value
trainset0_y0<-dataset[train_indices,delta.selection.departure]
trainset0<-trainset[,-c("delta.selection.departure","delta.departure.presentation","delta.selection.presentation")]
testset0_y0<-dataset[-train_indices,delta.selection.departure]
testset0<-testset[,-c("delta.selection.departure","delta.departure.presentation","delta.selection.presentation")]
trainset$delta.selection.departure %>% head
## [1] 166 156 65 143 126 126
dataset[train_indices,delta.selection.departure] %>% head
## [1] 166 156 65 143 126 126
Create a one-hot (dummy) encoding of the categorical variables, stored as a sparse matrix
sparse_matrix <- sparse.model.matrix( ~ ., data = trainset0)[,-1]
Store into xgb Matrix
dtrain1 <- xgb.DMatrix(sparse_matrix,label = trainset$delta.selection.departure)
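To make the encoding step concrete, here is a small sketch with base R's `model.matrix`, the dense counterpart of `sparse.model.matrix`. The data frame `toy` and its values are hypothetical. Note that with the default contrasts the first level of each factor is the reference and gets no column, which is why `hours00` does not appear among the features later on.

```r
# Sketch: dummy encoding of a factor with model.matrix (base R).
# `toy` is hypothetical; the notebook applies the sparse equivalent to trainset0.
toy <- data.frame(
  distance = c(1283, 2347, 1525, 1812),
  weekday  = factor(c("lundi", "mardi", "lundi", "jeudi"),
                    levels = c("jeudi", "lundi", "mardi"))
)

# `~ .` expands every column; [, -1] drops the intercept column
X <- model.matrix(~ ., data = toy)[, -1]

colnames(X)
# "distance" "weekdaylundi" "weekdaymardi"  ("jeudi" is the reference level)
```

Because the columns are derived from the factor levels rather than from the values present, a train and a test table that share the same factor levels produce matrices with the same columns.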
Set parameters for our xgb tree
xgb_params <- list(colsample_bytree = 0.7, #variables per tree
subsample = 0.7, #data subset per tree
booster = "gbtree",
max_depth = 5, #tree levels
eta = 0.3, #shrinkage
eval_metric = "rmse",
objective = "reg:linear"
)
Train our model
set.seed(4321)
gb_0_dt <- xgb.train(params = xgb_params,
data = dtrain1,
print_every_n = 100,
nrounds = 100)
## [01:26:25] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
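The warning above says that `reg:linear` is deprecated in favor of `reg:squarederror`. A possible updated parameter list is sketched below; the name `xgb_params_updated` is hypothetical, and all other values are copied from the list used for this model.

```r
# Sketch: same parameters as above with the non-deprecated objective name.
# `xgb_params_updated` is a hypothetical name, not used elsewhere in the notebook.
xgb_params_updated <- list(
  colsample_bytree = 0.7,          # share of variables sampled per tree
  subsample        = 0.7,          # share of rows sampled per tree
  booster          = "gbtree",
  max_depth        = 5,            # tree depth
  eta              = 0.3,          # shrinkage (learning rate)
  eval_metric      = "rmse",
  objective        = "reg:squarederror"  # replaces the deprecated "reg:linear"
)
```

Both names refer to squared-error regression, so the change should only silence the warning.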
Plot the 30 most important features in our model
importance_matrix <- xgb.importance(model = gb_0_dt)
xgb.plot.importance(importance_matrix[1:30,])
We find:
- Delta status preceding selection
- Hours
- Latitude of the intervention
- Longitude of the intervention
- Emergency vehicle type VSAV$VD (group or not)
- OSRM data
- Rescue centers 2474, 2477, 2475, 2439 (group or not)
- alert.reason.category3
- departed.from.its.rescue.center1
Predict with testset0
sparse_matrix2 <- sparse.model.matrix( ~ ., data =testset0 )[,-1]
Store into xgb Matrix
dtest <- xgb.DMatrix(sparse_matrix2,label=testset$delta.selection.departure)
Predict
pred_xgboost_y0 <- predict(gb_0_dt,dtest)
pred_xgboost_y0 %>% head
## [1] 140.30222 96.39616 154.99208 115.06355 85.18041 126.75983
testset$delta.selection.departure %>% head
## [1] 47 149 97 113 124 145
Compute RMSE, R2 and MAE
postResample(pred = pred_xgboost_y0, obs = testset$delta.selection.departure)
## RMSE Rsquared MAE
## 46.4920917 0.2753537 34.1598139
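For readers who want to see what these three numbers mean, the sketch below recomputes them by hand on the six predictions and observations printed above. It assumes that `postResample` reports R2 as the squared Pearson correlation between predictions and observations (its default behaviour to our understanding); `regression_metrics` is a hypothetical helper.

```r
# Sketch: RMSE, R2 (squared correlation) and MAE computed by hand (base R).
# `regression_metrics` is a hypothetical helper, not part of caret.
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared Pearson correlation
    MAE      = mean(abs(pred - obs)))       # mean absolute error
}

# First six predictions and observations printed above
pred <- c(140.30222, 96.39616, 154.99208, 115.06355, 85.18041, 126.75983)
obs  <- c(47, 149, 97, 113, 124, 145)

regression_metrics(pred, obs)
```

Six rows are far too few to say anything about the model; the point is only to show the formulas.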
Store only y1 target value
trainset1_y1<-trainset$delta.departure.presentation
trainset1<-trainset[,-c("delta.selection.departure","delta.selection.presentation","delta.departure.presentation")]
testset1_y1<-testset$delta.departure.presentation
testset1<-testset[,-c("delta.selection.departure","delta.selection.presentation","delta.departure.presentation")]
Create one-hot matrix for cat variable. with sparse-matrix
sparse_matrix <- sparse.model.matrix( ~ ., data = trainset1)[,-1]
Store into xgb Matrix
dtrain2 <- xgb.DMatrix(sparse_matrix,label = trainset$delta.departure.presentation)
Set parameters for our xgb tree
xgb_params <- list(colsample_bytree = 0.5, #variables per tree
subsample = 0.5, #data subset per tree
booster = "gbtree",
max_depth = 3, #tree levels
eta = 0.3, #shrinkage
eval_metric = "rmse",
objective = "reg:linear",
seed = 4321
)
Train our model
set.seed(4321)
gb_1_dt <- xgb.train(params = xgb_params,
data = dtrain2,
print_every_n = 100,
nrounds = 100)
## Warning in xgb.train(params = xgb_params, data = dtrain2, print_every_n = 100, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:26:38] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
Plot the 30 most important features in our model
importance_matrix <- xgb.importance(model = gb_1_dt)
xgb.plot.importance(importance_matrix[1:30,])
We find:
- OSRM data
- Latitude of the intervention
- Longitude of the intervention
- Delta status preceding selection
- Emergency vehicle type VSAV$VD (group or not)
Predict with testset1
sparse_matrix2 <- sparse.model.matrix( ~ ., data =testset1 )[,-1]
Store into xgb Matrix
dtest <- xgb.DMatrix(sparse_matrix2,label=testset$delta.departure.presentation)
Predict
pred_xgboost_y1 <- predict(gb_1_dt,dtest)
pred_xgboost_y1 %>% head
## [1] 415.4552 306.6591 403.4247 377.4039 363.4391 397.6378
testset$delta.departure.presentation %>% head
## [1] 376 268 409 678 432 236
postResample(pred = pred_xgboost_y1, obs = testset$delta.departure.presentation)
## RMSE Rsquared MAE
## 125.2175578 0.4038202 89.9848882
pred_xgboost_ys=pred_xgboost_y0+pred_xgboost_y1
pred_xgboost=data.frame(pred_xgboost_y0,pred_xgboost_y1,pred_xgboost_ys)
pred_xgboost %>% head
## pred_xgboost_y0 pred_xgboost_y1 pred_xgboost_ys
## 1 140.30222 415.4552 555.7574
## 2 96.39616 306.6591 403.0553
## 3 154.99208 403.4247 558.4167
## 4 115.06355 377.4039 492.4674
## 5 85.18041 363.4391 448.6195
## 6 126.75983 397.6378 524.3976
testset$delta.selection.presentation %>% head
## [1] 423 417 506 791 556 381
postResample(pred = pred_xgboost_ys, obs = testset$delta.departure.presentation)
## RMSE Rsquared MAE
## 186.759072 0.384372 165.226319
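Not too bad! Note, however, that the call above compares the summed prediction with delta.departure.presentation (y1 only). The total response time is delta.selection.presentation, so these metrics should be recomputed against that column before drawing conclusions.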
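The total response time is the sum of the two parts (in the rows printed above, 47 + 376 = 423), so the summed prediction is naturally evaluated against delta.selection.presentation. The sketch below uses only the six test rows printed above; the helper `mae` and the variable names are hypothetical.

```r
# Sketch: evaluate the summed prediction against the total response time.
# Values are the six test rows printed above; names are hypothetical.
pred_y0   <- c(140.30222, 96.39616, 154.99208, 115.06355, 85.18041, 126.75983)
pred_y1   <- c(415.4552, 306.6591, 403.4247, 377.4039, 363.4391, 397.6378)
obs_y0    <- c(47, 149, 97, 113, 124, 145)     # delta.selection.departure
obs_y1    <- c(376, 268, 409, 678, 432, 236)   # delta.departure.presentation
obs_total <- c(423, 417, 506, 791, 556, 381)   # delta.selection.presentation

# The total is exactly the sum of the two parts
stopifnot(all(obs_y0 + obs_y1 == obs_total))

pred_total <- pred_y0 + pred_y1
mae <- function(pred, obs) mean(abs(pred - obs))

mae_vs_total <- mae(pred_total, obs_total)  # the comparison we want
mae_vs_y1    <- mae(pred_total, obs_y1)     # the comparison made in the call above

c(vs_total = mae_vs_total, vs_y1 = mae_vs_y1)
```

On these six rows the error against the total is smaller than the error against y1 alone, which is expected because comparing with y1 leaves the whole y0 prediction as an offset.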
For Y0
importance_matrix_y0 <- xgb.importance(model = gb_0_dt)
xgb.plot.importance(importance_matrix_y0[1:30,])
importance_matrix_y0[1:30,]
## Feature Gain Cover
## 1: delta.status.preceding.selection.selection 0.156960966 0.0353014695
## 2: hours03 0.072181643 0.0134732246
## 3: hours04 0.070018506 0.0130817159
## 4: hours05 0.069297197 0.0128591216
## 5: hours02 0.068886010 0.0132399501
## 6: hours01 0.060420820 0.0125660520
## 7: hours06 0.037428929 0.0131693354
## 8: emergency.vehicle.typeVSAV BSPP 0.033491678 0.0108941036
## 9: alert.reason.category3 0.029556868 0.0038505879
## 10: emergency.vehicle.selection 0.025238333 0.0288528568
## 11: latitude.intervention 0.023163862 0.0152647111
## 12: rescue.center2474 0.017850762 0.0104794051
## 13: longitude.intervention 0.014201064 0.0083155053
## 14: OSRM.estimated.duration 0.013281196 0.0141529769
## 15: rescue.center2475 0.012587701 0.0090285526
## 16: rescue.center2477 0.012335731 0.0106209831
## 17: OSRM.estimated.distance 0.008561129 0.0026964813
## 18: OSRM.estimated.speed 0.008227623 0.0005511891
## 19: emergency.vehicle.typeVID 0.007142448 0.0054117444
## 20: rescue.center2439 0.007037138 0.0099891437
## 21: emergency.vehicle.typeCRF 0.006013882 0.0049193400
## 22: hours23 0.005624843 0.0062668227
## 23: hours12 0.005465856 0.0154358905
## 24: hours13 0.005420156 0.0127640800
## 25: hours11 0.005342021 0.0160590534
## 26: rescue.center2456 0.004569964 0.0105871828
## 27: departed.from.its.rescue.center1 0.004407864 0.0054555106
## 28: hours10 0.004372949 0.0143542630
## 29: rescue.center2493 0.003969963 0.0085054840
## 30: hours14 0.003918268 0.0123038207
## Feature Gain Cover
## Frequency
## 1: 0.096484375
## 2: 0.006640625
## 3: 0.005468750
## 4: 0.008203125
## 5: 0.006640625
## 6: 0.009765625
## 7: 0.006640625
## 8: 0.007031250
## 9: 0.009765625
## 10: 0.095703125
## 11: 0.058984375
## 12: 0.003125000
## 13: 0.052343750
## 14: 0.032031250
## 15: 0.002343750
## 16: 0.003515625
## 17: 0.042187500
## 18: 0.037500000
## 19: 0.004296875
## 20: 0.004296875
## 21: 0.001562500
## 22: 0.003906250
## 23: 0.004296875
## 24: 0.007031250
## 25: 0.006250000
## 26: 0.002734375
## 27: 0.006640625
## 28: 0.005078125
## 29: 0.001953125
## 30: 0.004687500
## Frequency
We find:
- Delta status preceding selection
- Hours
- Latitude of the intervention
- Longitude of the intervention
- Emergency vehicle type VSAV$VD (group or not)
- emergency.vehicle.typePSE
- emergency.vehicle.typeVID
- emergency.vehicle.typeCRF
- emergency.vehicle.typeFNPC
- OSRM data
- Rescue centers 2474, 2477, 2475, 2439, 2435, 2464, 2488, 2493 (group or not)
- alert.reason.category 3 and 2
- departed.from.its.rescue.center1
We regroup the emergency vehicle types and the rescue centers into binary indicators (1 for the levels listed above, 0 otherwise); the alert reason category is kept as is.
trainset0.regroup <- trainset0
testset0.regroup <- testset0
trainset0.regroup = trainset0.regroup %>%
mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~ 1,
emergency.vehicle.type =="PSE" ~ 1,
emergency.vehicle.type == "VID" ~ 1,
emergency.vehicle.type == "CRF" ~ 1,
emergency.vehicle.type == "FNPC" ~ 1,
TRUE ~ 0))
trainset0.regroup = trainset0.regroup %>%
mutate(rescue.center.regroup = case_when(rescue.center == "2474" ~ 1,
rescue.center =="2477" ~ 1,
rescue.center == "2475" ~ 1,
rescue.center == "2439" ~ 1,
rescue.center == "2435" ~ 1,
rescue.center == "2464" ~ 1,
rescue.center == "2488" ~ 1,
rescue.center == "2493" ~ 1,
TRUE ~ 0))
testset0.regroup = testset0.regroup %>%
mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~ 1,
emergency.vehicle.type =="PSE" ~ 1,
emergency.vehicle.type == "VID" ~ 1,
emergency.vehicle.type == "CRF" ~ 1,
emergency.vehicle.type == "FNPC" ~ 1,
TRUE ~ 0))
testset0.regroup = testset0.regroup %>%
mutate(rescue.center.regroup = case_when(rescue.center == "2474" ~ 1,
rescue.center =="2477" ~ 1,
rescue.center == "2475" ~ 1,
rescue.center == "2439" ~ 1,
rescue.center == "2435" ~ 1,
rescue.center == "2464" ~ 1,
rescue.center == "2488" ~ 1,
rescue.center == "2493" ~ 1,
TRUE ~ 0))
Drop unimportant features from trainset0 and testset0
trainset0.regroup$rescue.center<-NULL
trainset0.regroup$n<-NULL
trainset0.regroup$emergency.vehicle.type<-NULL
trainset0.regroup$floor<-NULL
trainset0.regroup$emergency.vehicle<-NULL
trainset0.regroup$weekdays<-NULL
trainset0.regroup$intervention.on.public.roads<-NULL
trainset0.regroup$location.of.the.event<-NULL
trainset0.regroup$month<-NULL
testset0.regroup$rescue.center<-NULL
testset0.regroup$emergency.vehicle.type<-NULL
testset0.regroup$n<-NULL
testset0.regroup$floor<-NULL
testset0.regroup$emergency.vehicle<-NULL
testset0.regroup$weekdays<-NULL
testset0.regroup$intervention.on.public.roads<-NULL
testset0.regroup$location.of.the.event<-NULL
testset0.regroup$month<-NULL
Same for x_test
x_test0.regroup <- x_test
x_test0.regroup = x_test0.regroup %>%
mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~ 1,
emergency.vehicle.type =="PSE" ~ 1,
emergency.vehicle.type == "VID" ~ 1,
emergency.vehicle.type == "CRF" ~ 1,
emergency.vehicle.type == "FNPC" ~ 1,
TRUE ~ 0))
x_test0.regroup = x_test0.regroup %>%
mutate(rescue.center.regroup = case_when(rescue.center == "2474" ~ 1,
rescue.center =="2477" ~ 1,
rescue.center == "2475" ~ 1,
rescue.center == "2439" ~ 1,
rescue.center == "2435" ~ 1,
rescue.center == "2464" ~ 1,
rescue.center == "2488" ~ 1,
rescue.center == "2493" ~ 1,
TRUE ~ 0))
x_test0.regroup$rescue.center<-NULL
x_test0.regroup$emergency.vehicle.type<-NULL
x_test0.regroup$n<-NULL
x_test0.regroup$floor<-NULL
x_test0.regroup$emergency.vehicle<-NULL
x_test0.regroup$weekdays<-NULL
x_test0.regroup$intervention.on.public.roads<-NULL
x_test0.regroup$location.of.the.event<-NULL
x_test0.regroup$month<-NULL
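The same regrouping is written out three times (training set, test set, x_test). One way to avoid the repetition is a small helper based on `%in%`. This is only a sketch: `flag_levels`, `vehicle_types_y0` and `toy` are hypothetical names, and the level list is the one used in the `case_when` calls above.

```r
# Sketch: binary regrouping with %in% instead of repeated case_when() branches.
# `flag_levels` is a hypothetical helper; missing values are flagged 0,
# like the TRUE ~ 0 fallback of case_when().
flag_levels <- function(x, levels_to_flag) {
  as.numeric(as.character(x) %in% levels_to_flag)
}

vehicle_types_y0 <- c("VSAV BSPP", "PSE", "VID", "CRF", "FNPC")

toy <- data.frame(
  emergency.vehicle.type = factor(c("VSAV BSPP", "AR", "PSE", "VID"))
)
toy$emergency.vehicle.type.regroup <-
  flag_levels(toy$emergency.vehicle.type, vehicle_types_y0)

toy
```

Keeping each level list in a single vector means the training set, the test set and x_test are guaranteed to be regrouped with the same levels.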
For Y1
importance_matrix_y1 <- xgb.importance(model = gb_1_dt)
xgb.plot.importance(importance_matrix_y1[1:30,])
importance_matrix_y1[1:30,]
## Feature Gain Cover
## 1: OSRM.estimated.distance 0.709412629 0.058331801
## 2: OSRM.estimated.duration 0.126576169 0.065559114
## 3: latitude.intervention 0.015988622 0.037607944
## 4: longitude.intervention 0.011153806 0.027810403
## 5: OSRM.estimated.speed 0.010997592 0.016156794
## 6: delta.status.preceding.selection.selection 0.006750294 0.011458927
## 7: emergency.vehicle.selection 0.006270702 0.030779022
## 8: emergency.vehicle.typeVID 0.005569648 0.009361946
## 9: rescue.center2506 0.004225152 0.015055506
## 10: emergency.vehicle.typePSE 0.003136839 0.012488449
## 11: emergency.vehicle.typeVSAV BSPP 0.003132041 0.003476509
## 12: location.of.the.event139 0.002581180 0.013239570
## 13: alert.reason.category6 0.002392792 0.012834226
## 14: hours21 0.002281380 0.011122909
## 15: location.of.the.event136 0.002224477 0.012620189
## 16: n2 0.002174565 0.017383398
## 17: rescue.center2481 0.002140234 0.005750444
## 18: intervention.on.public.roads1 0.002021465 0.002513100
## 19: emergency.vehicle.typeCRF 0.001972039 0.010239683
## 20: rescue.center2507 0.001919102 0.013192729
## 21: hours22 0.001869608 0.011406353
## 22: hours09 0.001794345 0.009540648
## 23: rescue.center2479 0.001601613 0.012979384
## 24: rescue.center2491 0.001499905 0.013078067
## 25: rescue.center2485 0.001403521 0.009811000
## 26: rescue.center2442 0.001329027 0.006739388
## 27: rescue.center2450 0.001323689 0.004130654
## 28: rescue.center2455 0.001275787 0.002079660
## 29: hours18 0.001251206 0.009525889
## 30: rescue.center2498 0.001221759 0.009782415
## Feature Gain Cover
## Frequency
## 1: 0.093567251
## 2: 0.089181287
## 3: 0.068713450
## 4: 0.059941520
## 5: 0.036549708
## 6: 0.033625731
## 7: 0.036549708
## 8: 0.007309942
## 9: 0.007309942
## 10: 0.005847953
## 11: 0.007309942
## 12: 0.005847953
## 13: 0.008771930
## 14: 0.005847953
## 15: 0.005847953
## 16: 0.011695906
## 17: 0.004385965
## 18: 0.004385965
## 19: 0.007309942
## 20: 0.005847953
## 21: 0.007309942
## 22: 0.007309942
## 23: 0.007309942
## 24: 0.005847953
## 25: 0.005847953
## 26: 0.008771930
## 27: 0.002923977
## 28: 0.002923977
## 29: 0.004385965
## 30: 0.005847953
## Frequency
We find:
- OSRM data
- Latitude of the intervention
- Longitude of the intervention
- Delta status preceding selection
- emergency.vehicle.typeVID
- emergency.vehicle.typePSE
- emergency.vehicle.typeCRF
- rescue.center 2506, 2507, 2485, 2481, 2498, 2500
- emergency.vehicle.typeVSAV BSPP
- emergency.vehicle.typeFPT BSPP
- alert.reason.category6
- location.of.the.event 139, 136, 259
trainset1.regroup <- trainset1
testset1.regroup <- testset1
trainset1.regroup = trainset1.regroup %>%
mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~ 1,
emergency.vehicle.type =="FPT BSPP" ~ 1,
emergency.vehicle.type == "VID" ~ 1,
emergency.vehicle.type == "CRF" ~ 1,
emergency.vehicle.type == "PSE" ~ 1,
TRUE ~ 0))
trainset1.regroup = trainset1.regroup %>%
mutate(rescue.center.regroup = case_when(rescue.center == "2506" ~ 1,
rescue.center =="2507" ~ 1,
rescue.center == "2485" ~ 1,
rescue.center == "2481" ~ 1,
rescue.center == "2498" ~ 1,
rescue.center == "2500" ~ 1,
TRUE ~ 0))
trainset1.regroup = trainset1.regroup %>%
# Flag the event locations singled out by the importance analysis (139, 136, 259)
mutate(location.of.the.event.regroup = case_when(location.of.the.event == "139" ~ 1,
location.of.the.event == "136" ~ 1,
location.of.the.event == "259" ~ 1,
TRUE ~ 0))
testset1.regroup = testset1.regroup %>%
mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~ 1,
emergency.vehicle.type =="FPT BSPP" ~ 1,
emergency.vehicle.type == "VID" ~ 1,
emergency.vehicle.type == "CRF" ~ 1,
emergency.vehicle.type == "PSE" ~ 1,
TRUE ~ 0))
testset1.regroup = testset1.regroup %>%
mutate(rescue.center.regroup = case_when(rescue.center == "2506" ~ 1,
rescue.center =="2507" ~ 1,
rescue.center == "2485" ~ 1,
rescue.center == "2481" ~ 1,
rescue.center == "2498" ~ 1,
rescue.center == "2500" ~ 1,
TRUE ~ 0))
testset1.regroup = testset1.regroup%>%
# Flag the event locations singled out by the importance analysis (139, 136, 259)
mutate(location.of.the.event.regroup = case_when(location.of.the.event == "139" ~ 1,
location.of.the.event == "136" ~ 1,
location.of.the.event == "259" ~ 1,
TRUE ~ 0))
Drop unimportant features
trainset1.regroup$rescue.center<-NULL
trainset1.regroup$emergency.vehicle.type<-NULL
trainset1.regroup$location.of.the.event<-NULL
trainset1.regroup$emergency.vehicle<-NULL
trainset1.regroup$departed.from.its.rescue.center<-NULL
testset1.regroup$rescue.center<-NULL
testset1.regroup$emergency.vehicle.type<-NULL
testset1.regroup$location.of.the.event<-NULL
testset1.regroup$emergency.vehicle<-NULL
testset1.regroup$departed.from.its.rescue.center<-NULL
Same for x_test
x_test1.regroup <- x_test
x_test1.regroup = x_test1.regroup %>%
mutate(emergency.vehicle.type.regroup = case_when(emergency.vehicle.type == "VSAV BSPP" ~ 1,
emergency.vehicle.type =="FPT BSPP" ~ 1,
emergency.vehicle.type == "VID" ~ 1,
emergency.vehicle.type == "CRF" ~ 1,
emergency.vehicle.type == "PSE" ~ 1,
TRUE ~ 0))
x_test1.regroup = x_test1.regroup %>%
mutate(rescue.center.regroup = case_when(rescue.center == "2506" ~ 1,
rescue.center =="2507" ~ 1,
rescue.center == "2485" ~ 1,
rescue.center == "2481" ~ 1,
rescue.center == "2498" ~ 1,
rescue.center == "2500" ~ 1,
TRUE ~ 0))
x_test1.regroup = x_test1.regroup %>%
# Flag the event locations singled out by the importance analysis (139, 136, 259)
mutate(location.of.the.event.regroup = case_when(location.of.the.event == "139" ~ 1,
location.of.the.event == "136" ~ 1,
location.of.the.event == "259" ~ 1,
TRUE ~ 0))
Drop unimportant features
x_test1.regroup$rescue.center<-NULL
x_test1.regroup$emergency.vehicle.type<-NULL
x_test1.regroup$location.of.the.event<-NULL
x_test1.regroup$emergency.vehicle<-NULL
x_test1.regroup$departed.from.its.rescue.center<-NULL
Remove id variable
trainset0.regroup<-trainset0.regroup[,-c("emergency.vehicle.selection")]
trainset0.regroup$delta.selection.departure<-trainset$delta.selection.departure
testset0.regroup<-testset0.regroup[,-c("emergency.vehicle.selection")]
testset0.regroup$delta.selection.departure<-testset$delta.selection.departure
Same for x_test
x_test0.regroup<-x_test0.regroup[,-c("emergency.vehicle.selection")]
# Linear regression
options(max.print = 10000)
lm <- lm(delta.selection.departure ~.,data = trainset0.regroup)
summary(lm)
##
## Call:
## lm(formula = delta.selection.departure ~ ., data = trainset0.regroup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -219.10 -31.19 -4.02 26.07 901.11
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.742e+03 1.242e+02 -14.027
## alert.reason.category2 -1.868e+01 9.088e-01 -20.559
## alert.reason.category3 -1.699e+01 5.410e-01 -31.404
## alert.reason.category4 3.479e+00 1.664e+00 2.091
## alert.reason.category5 1.805e+01 2.334e+00 7.733
## alert.reason.category6 1.266e+01 9.736e-01 13.000
## alert.reason.category7 2.694e+01 2.279e+00 11.819
## alert.reason.category8 7.039e+00 3.975e+00 1.771
## alert.reason.category9 3.767e+00 8.048e-01 4.681
## longitude.intervention 2.430e+01 1.432e+00 16.976
## latitude.intervention 3.803e+01 2.540e+00 14.976
## delta.status.preceding.selection.selection 6.206e-05 3.028e-06 20.495
## departed.from.its.rescue.center1 -2.555e+00 8.275e-01 -3.087
## OSRM.estimated.distance -2.495e-03 5.199e-04 -4.799
## OSRM.estimated.duration 4.455e-02 4.977e-03 8.950
## OSRM.estimated.speed 1.124e-01 3.795e-02 2.962
## hours01 1.502e+01 9.266e-01 16.215
## hours02 2.770e+01 9.852e-01 28.115
## hours03 3.238e+01 1.011e+00 32.019
## hours04 3.531e+01 1.055e+00 33.459
## hours05 3.576e+01 1.070e+00 33.412
## hours06 1.891e+01 1.037e+00 18.230
## hours07 -1.944e+01 9.657e-01 -20.134
## hours08 -3.107e+01 8.879e-01 -34.991
## hours09 -2.651e+01 8.491e-01 -31.220
## hours10 -3.685e+01 8.353e-01 -44.115
## hours11 -3.918e+01 8.255e-01 -47.457
## hours12 -4.004e+01 8.124e-01 -49.282
## hours13 -3.599e+01 8.142e-01 -44.210
## hours14 -3.667e+01 8.209e-01 -44.671
## hours15 -3.574e+01 8.277e-01 -43.181
## hours16 -3.740e+01 8.326e-01 -44.923
## hours17 -3.401e+01 8.289e-01 -41.034
## hours18 -2.932e+01 8.229e-01 -35.628
## hours19 -3.356e+01 8.113e-01 -41.367
## hours20 -3.354e+01 8.150e-01 -41.147
## hours21 -3.249e+01 8.255e-01 -39.359
## hours22 -3.068e+01 8.364e-01 -36.686
## hours23 -1.832e+01 8.584e-01 -21.340
## emergency.vehicle.type.regroup -7.772e+00 5.581e-01 -13.926
## rescue.center.regroup 9.675e+00 3.503e-01 27.620
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## alert.reason.category2 < 2e-16 ***
## alert.reason.category3 < 2e-16 ***
## alert.reason.category4 0.03656 *
## alert.reason.category5 1.06e-14 ***
## alert.reason.category6 < 2e-16 ***
## alert.reason.category7 < 2e-16 ***
## alert.reason.category8 0.07661 .
## alert.reason.category9 2.86e-06 ***
## longitude.intervention < 2e-16 ***
## latitude.intervention < 2e-16 ***
## delta.status.preceding.selection.selection < 2e-16 ***
## departed.from.its.rescue.center1 0.00202 **
## OSRM.estimated.distance 1.60e-06 ***
## OSRM.estimated.duration < 2e-16 ***
## OSRM.estimated.speed 0.00306 **
## hours01 < 2e-16 ***
## hours02 < 2e-16 ***
## hours03 < 2e-16 ***
## hours04 < 2e-16 ***
## hours05 < 2e-16 ***
## hours06 < 2e-16 ***
## hours07 < 2e-16 ***
## hours08 < 2e-16 ***
## hours09 < 2e-16 ***
## hours10 < 2e-16 ***
## hours11 < 2e-16 ***
## hours12 < 2e-16 ***
## hours13 < 2e-16 ***
## hours14 < 2e-16 ***
## hours15 < 2e-16 ***
## hours16 < 2e-16 ***
## hours17 < 2e-16 ***
## hours18 < 2e-16 ***
## hours19 < 2e-16 ***
## hours20 < 2e-16 ***
## hours21 < 2e-16 ***
## hours22 < 2e-16 ***
## hours23 < 2e-16 ***
## emergency.vehicle.type.regroup < 2e-16 ***
## rescue.center.regroup < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.68 on 163948 degrees of freedom
## Multiple R-squared: 0.1989, Adjusted R-squared: 0.1987
## F-statistic: 1018 on 40 and 163948 DF, p-value: < 2.2e-16
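Almost all our coefficients are significant at the 5% level; the exception is alert.reason.category8 (p = 0.077), which is significant only at the 10% level. We will check the model assumptions in 7.2.3.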
library(sandwich)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(lm)
## GVIF Df GVIF^(1/(2*Df))
## alert.reason.category 1.537152 8 1.027235
## longitude.intervention 1.051810 1 1.025578
## latitude.intervention 1.068139 1 1.033508
## delta.status.preceding.selection.selection 1.181290 1 1.086872
## departed.from.its.rescue.center 1.010514 1 1.005243
## OSRM.estimated.distance 32.339705 1 5.686801
## OSRM.estimated.duration 23.118015 1 4.808120
## OSRM.estimated.speed 4.662221 1 2.159218
## hours 1.029844 23 1.000639
## emergency.vehicle.type.regroup 1.540721 1 1.241258
## rescue.center.regroup 1.093090 1 1.045509
Except for the OSRM variables (GVIF of 32.3 for distance and 23.1 for duration), no predictor shows strong multicollinearity. We decide to keep the OSRM variables because they are important for our model.
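As a reminder of what the VIF measures: for a predictor j, VIF_j = 1 / (1 - R2_j), where R2_j is the R2 of the regression of predictor j on all the other predictors. For a term with one degree of freedom, the GVIF reported by `car::vif` is this usual VIF, to our understanding. The sketch below computes it by hand on simulated data (all names and values are hypothetical), with `x1` and `x2` built to be nearly collinear, much like OSRM distance and duration.

```r
# Sketch: variance inflation factors computed by hand (base R, simulated data).
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # almost a copy of x1: strong collinearity
x3 <- rnorm(n)                  # unrelated to the other two
X  <- data.frame(x1, x2, x3)

vif_manual <- sapply(names(X), function(v) {
  # Regress predictor v on all the other predictors
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 1)
```

The two collinear predictors get a large VIF while the independent one stays close to 1.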
Study the residuals of the selected model
What we have done so far is not enough to validate the model. We need to study the residuals: if the model assumptions are not satisfied (see course), the tests on the coefficients are not valid.
#mean of residuals
summary(lm$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -219.100 -31.192 -4.021 0.000 26.069 901.115
The mean of the residuals is zero, as expected for a least-squares fit with an intercept.
Normality test: Jarque-Bera test. H0: the residuals are normally distributed; H1: they are not.
library(tseries)
##
## Attaching package: 'tseries'
## The following object is masked from 'package:imputeTS':
##
## na.remove
jarque.bera.test(lm$residuals)
##
## Jarque Bera Test
##
## data: lm$residuals
## X-squared = 290909, df = 2, p-value < 2.2e-16
The p-value is far below 5%, so we reject H0: the residuals are not normally distributed. NB: non-normality can be caused by outliers.
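The Jarque-Bera statistic combines the sample skewness S and kurtosis K of the residuals: JB = n * (S^2 / 6 + (K - 3)^2 / 24), compared with a chi-squared distribution with 2 degrees of freedom. The sketch below computes it by hand on simulated, deliberately skewed data; `jb_manual` is a hypothetical helper, not the `tseries` implementation.

```r
# Sketch: Jarque-Bera statistic computed by hand (base R, simulated data).
# `jb_manual` is a hypothetical helper.
jb_manual <- function(x) {
  n    <- length(x)
  d    <- x - mean(x)
  m2   <- mean(d^2)               # second central moment
  skew <- mean(d^3) / m2^1.5      # sample skewness (0 for a normal law)
  kurt <- mean(d^4) / m2^2        # sample kurtosis (3 for a normal law)
  stat <- n * (skew^2 / 6 + (kurt - 3)^2 / 24)
  c(statistic = stat,
    p.value   = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(42)
skewed <- rexp(2000)   # exponential draws: strongly right-skewed
jb_manual(skewed)
```

With a sample as large as ours, even mild departures from normality give a very small p-value, so the QQ-plot is a useful complement to the test.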
qqnorm(lm$residuals)
qqline(lm$residuals)
plot(lm$residuals~lm$fitted)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:imputeTS':
##
## na.locf
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Breusch-Pagan test. H0: homoskedasticity; H1: heteroskedasticity
bptest(lm)
##
## studentized Breusch-Pagan test
##
## data: lm
## BP = 2239.7, df = 40, p-value < 2.2e-16
The p-value is far below 5%, so we reject H0: the residuals are heteroskedastic.
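The studentized Breusch-Pagan statistic can be read as n times the R2 of an auxiliary regression of the squared residuals on the regressors, compared with a chi-squared distribution whose degrees of freedom equal the number of regressors. The sketch below reproduces this idea by hand on simulated data where the error spread grows with `x`. All names are hypothetical.

```r
# Sketch: studentized (Koenker) Breusch-Pagan statistic by hand (base R).
set.seed(7)
n <- 1000
x <- runif(n, 1, 5)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error standard deviation grows with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the regressors
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # 1 regressor -> 1 df

c(BP = bp_stat, p.value = bp_p)
```

A small p-value means the squared residuals are explained by the regressors, which is what heteroskedasticity looks like.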
In case of heteroskedasticity, we have to use a robust standard error estimator. Otherwise the standard errors, and therefore all our t-tests, are unreliable.
We calculate the robust covariance matrix
library(sandwich)
vcov_y0 <- vcovHC(lm, type = "HC1")
coeftest(lm, vcov. = vcov_y0)
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.7421e+03 1.2710e+02 -13.7064
## alert.reason.category2 -1.8685e+01 9.2106e-01 -20.2861
## alert.reason.category3 -1.6990e+01 6.2786e-01 -27.0595
## alert.reason.category4 3.4795e+00 1.6322e+00 2.1318
## alert.reason.category5 1.8047e+01 3.7930e+00 4.7579
## alert.reason.category6 1.2656e+01 1.0128e+00 12.4965
## alert.reason.category7 2.6938e+01 3.7618e+00 7.1609
## alert.reason.category8 7.0385e+00 3.8429e+00 1.8315
## alert.reason.category9 3.7669e+00 8.6163e-01 4.3719
## longitude.intervention 2.4304e+01 1.4147e+00 17.1794
## latitude.intervention 3.8031e+01 2.5966e+00 14.6462
## delta.status.preceding.selection.selection 6.2062e-05 4.9377e-06 12.5691
## departed.from.its.rescue.center1 -2.5549e+00 9.3812e-01 -2.7234
## OSRM.estimated.distance -2.4948e-03 5.2900e-04 -4.7161
## OSRM.estimated.duration 4.4547e-02 5.0658e-03 8.7937
## OSRM.estimated.speed 1.1241e-01 3.8333e-02 2.9324
## hours01 1.5025e+01 1.0488e+00 14.3262
## hours02 2.7699e+01 1.1384e+00 24.3308
## hours03 3.2377e+01 1.1847e+00 27.3287
## hours04 3.5306e+01 1.2587e+00 28.0499
## hours05 3.5755e+01 1.2618e+00 28.3375
## hours06 1.8906e+01 1.2225e+00 15.4646
## hours07 -1.9443e+01 1.0544e+00 -18.4395
## hours08 -3.1067e+01 9.4702e-01 -32.8052
## hours09 -2.6510e+01 9.3268e-01 -28.4236
## hours10 -3.6849e+01 8.7700e-01 -42.0165
## hours11 -3.9176e+01 8.6615e-01 -45.2301
## hours12 -4.0035e+01 8.2627e-01 -48.4532
## hours13 -3.5994e+01 8.4550e-01 -42.5711
## hours14 -3.6671e+01 8.4597e-01 -43.3476
## hours15 -3.5741e+01 8.5382e-01 -41.8607
## hours16 -3.7403e+01 8.6637e-01 -43.1715
## hours17 -3.4015e+01 8.5510e-01 -39.7786
## hours18 -2.9320e+01 8.6035e-01 -34.0791
## hours19 -3.3562e+01 8.3594e-01 -40.1494
## hours20 -3.3535e+01 8.3792e-01 -40.0219
## hours21 -3.2491e+01 8.4999e-01 -38.2249
## hours22 -3.0684e+01 8.6388e-01 -35.5189
## hours23 -1.8318e+01 9.2811e-01 -19.7363
## emergency.vehicle.type.regroup -7.7724e+00 7.4798e-01 -10.3913
## rescue.center.regroup 9.6752e+00 3.7435e-01 25.8451
## Pr(>|t|)
## (Intercept) < 2.2e-16 ***
## alert.reason.category2 < 2.2e-16 ***
## alert.reason.category3 < 2.2e-16 ***
## alert.reason.category4 0.033025 *
## alert.reason.category5 1.958e-06 ***
## alert.reason.category6 < 2.2e-16 ***
## alert.reason.category7 8.050e-13 ***
## alert.reason.category8 0.067020 .
## alert.reason.category9 1.232e-05 ***
## longitude.intervention < 2.2e-16 ***
## latitude.intervention < 2.2e-16 ***
## delta.status.preceding.selection.selection < 2.2e-16 ***
## departed.from.its.rescue.center1 0.006462 **
## OSRM.estimated.distance 2.406e-06 ***
## OSRM.estimated.duration < 2.2e-16 ***
## OSRM.estimated.speed 0.003364 **
## hours01 < 2.2e-16 ***
## hours02 < 2.2e-16 ***
## hours03 < 2.2e-16 ***
## hours04 < 2.2e-16 ***
## hours05 < 2.2e-16 ***
## hours06 < 2.2e-16 ***
## hours07 < 2.2e-16 ***
## hours08 < 2.2e-16 ***
## hours09 < 2.2e-16 ***
## hours10 < 2.2e-16 ***
## hours11 < 2.2e-16 ***
## hours12 < 2.2e-16 ***
## hours13 < 2.2e-16 ***
## hours14 < 2.2e-16 ***
## hours15 < 2.2e-16 ***
## hours16 < 2.2e-16 ***
## hours17 < 2.2e-16 ***
## hours18 < 2.2e-16 ***
## hours19 < 2.2e-16 ***
## hours20 < 2.2e-16 ***
## hours21 < 2.2e-16 ***
## hours22 < 2.2e-16 ***
## hours23 < 2.2e-16 ***
## emergency.vehicle.type.regroup < 2.2e-16 ***
## rescue.center.regroup < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates are unchanged (only the standard errors differ), and the coefficients remain significant with robust standard errors, except alert.reason.category8 (p = 0.067).
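To show why the estimates stay the same while the standard errors move, here is a sketch of the HC1 covariance matrix computed by hand: the "sandwich" (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k). It runs on simulated heteroskedastic data; all names are hypothetical, and it is meant as an illustration of the formula rather than a replacement for `vcovHC`.

```r
# Sketch: HC1 robust covariance matrix by hand (base R, simulated data).
set.seed(7)
n <- 1000
x <- runif(n, 1, 5)
y <- 1 + 2 * x + rnorm(n, sd = x)   # heteroskedastic errors

fit <- lm(y ~ x)
X   <- model.matrix(fit)            # n x k design matrix
e   <- residuals(fit)
k   <- ncol(X)

bread <- solve(crossprod(X))        # (X'X)^-1
meat  <- crossprod(X * e)           # X' diag(e^2) X (row i of X scaled by e_i)
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit))) # usual OLS standard errors
se_hc1     <- sqrt(diag(vcov_hc1))  # robust standard errors

cbind(estimate = coef(fit), se_classic, se_hc1)
```

Only the covariance matrix is replaced; `coef(fit)` is untouched, which is why the Estimate column is identical in both tables above.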
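There are several tests for autocorrelation; the Durbin-Watson test is one of the most commonly used.
H0: the residuals are not autocorrelated. H1: the residuals are autocorrelated.
We cannot run the Durbin-Watson test here because of the size of our regression. Instead, we use the Newey-West estimator, whose standard errors are robust to both heteroskedasticity and autocorrelation.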
library(sandwich)
#Calculate the robust covariance matrix
vcov_y0_2 <- NeweyWest(lm)
coeftest(lm, vcov. = vcov_y0_2)
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.7421e+03 1.2711e+02 -13.7049
## alert.reason.category2 -1.8685e+01 9.1994e-01 -20.3107
## alert.reason.category3 -1.6990e+01 6.2878e-01 -27.0201
## alert.reason.category4 3.4795e+00 1.6303e+00 2.1342
## alert.reason.category5 1.8047e+01 3.7978e+00 4.7519
## alert.reason.category6 1.2656e+01 1.0120e+00 12.5056
## alert.reason.category7 2.6938e+01 3.7616e+00 7.1614
## alert.reason.category8 7.0385e+00 3.8426e+00 1.8317
## alert.reason.category9 3.7669e+00 8.6374e-01 4.3612
## longitude.intervention 2.4304e+01 1.4121e+00 17.2107
## latitude.intervention 3.8031e+01 2.5973e+00 14.6421
## delta.status.preceding.selection.selection 6.2062e-05 4.9610e-06 12.5100
## departed.from.its.rescue.center1 -2.5549e+00 9.3620e-01 -2.7290
## OSRM.estimated.distance -2.4948e-03 5.2772e-04 -4.7274
## OSRM.estimated.duration 4.4547e-02 5.0594e-03 8.8047
## OSRM.estimated.speed 1.1241e-01 3.8229e-02 2.9404
## hours01 1.5025e+01 1.0435e+00 14.3980
## hours02 2.7699e+01 1.1367e+00 24.3676
## hours03 3.2377e+01 1.1792e+00 27.4571
## hours04 3.5306e+01 1.2448e+00 28.3635
## hours05 3.5755e+01 1.2565e+00 28.4552
## hours06 1.8906e+01 1.2160e+00 15.5467
## hours07 -1.9443e+01 1.0524e+00 -18.4759
## hours08 -3.1067e+01 9.4355e-01 -32.9260
## hours09 -2.6510e+01 9.2218e-01 -28.7474
## hours10 -3.6849e+01 8.7357e-01 -42.1815
## hours11 -3.9176e+01 8.5723e-01 -45.7004
## hours12 -4.0035e+01 8.2020e-01 -48.8118
## hours13 -3.5994e+01 8.3903e-01 -42.8994
## hours14 -3.6671e+01 8.3854e-01 -43.7320
## hours15 -3.5741e+01 8.4643e-01 -42.2260
## hours16 -3.7403e+01 8.6307e-01 -43.3366
## hours17 -3.4015e+01 8.4964e-01 -40.0342
## hours18 -2.9320e+01 8.5407e-01 -34.3298
## hours19 -3.3562e+01 8.2919e-01 -40.4762
## hours20 -3.3535e+01 8.3080e-01 -40.3648
## hours21 -3.2491e+01 8.3965e-01 -38.6958
## hours22 -3.0684e+01 8.6078e-01 -35.6467
## hours23 -1.8318e+01 9.1927e-01 -19.9262
## emergency.vehicle.type.regroup -7.7724e+00 7.5135e-01 -10.3446
## rescue.center.regroup 9.6752e+00 3.7590e-01 25.7388
## Pr(>|t|)
## (Intercept) < 2.2e-16 ***
## alert.reason.category2 < 2.2e-16 ***
## alert.reason.category3 < 2.2e-16 ***
## alert.reason.category4 0.032827 *
## alert.reason.category5 2.017e-06 ***
## alert.reason.category6 < 2.2e-16 ***
## alert.reason.category7 8.022e-13 ***
## alert.reason.category8 0.066993 .
## alert.reason.category9 1.294e-05 ***
## longitude.intervention < 2.2e-16 ***
## latitude.intervention < 2.2e-16 ***
## delta.status.preceding.selection.selection < 2.2e-16 ***
## departed.from.its.rescue.center1 0.006354 **
## OSRM.estimated.distance 2.276e-06 ***
## OSRM.estimated.duration < 2.2e-16 ***
## OSRM.estimated.speed 0.003279 **
## hours01 < 2.2e-16 ***
## hours02 < 2.2e-16 ***
## hours03 < 2.2e-16 ***
## hours04 < 2.2e-16 ***
## hours05 < 2.2e-16 ***
## hours06 < 2.2e-16 ***
## hours07 < 2.2e-16 ***
## hours08 < 2.2e-16 ***
## hours09 < 2.2e-16 ***
## hours10 < 2.2e-16 ***
## hours11 < 2.2e-16 ***
## hours12 < 2.2e-16 ***
## hours13 < 2.2e-16 ***
## hours14 < 2.2e-16 ***
## hours15 < 2.2e-16 ***
## hours16 < 2.2e-16 ***
## hours17 < 2.2e-16 ***
## hours18 < 2.2e-16 ***
## hours19 < 2.2e-16 ***
## hours20 < 2.2e-16 ***
## hours21 < 2.2e-16 ***
## hours22 < 2.2e-16 ***
## hours23 < 2.2e-16 ***
## emergency.vehicle.type.regroup < 2.2e-16 ***
## rescue.center.regroup < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates are unchanged, and the coefficients remain significant with Newey-West standard errors, except alert.reason.category8 (p = 0.067).
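For reference, the Durbin-Watson statistic mentioned above is DW = sum((e_t - e_{t-1})^2) / sum(e_t^2). Values near 2 suggest no first-order autocorrelation and values well below 2 suggest positive autocorrelation. The sketch below computes the statistic only (not the test's p-value) on simulated series; `dw_stat` is a hypothetical helper. The statistic is only meaningful when the rows have a natural order, such as time.

```r
# Sketch: Durbin-Watson statistic by hand (base R, simulated residuals).
# `dw_stat` is a hypothetical helper; it returns the statistic, not a p-value.
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(3)
white <- rnorm(2000)   # uncorrelated noise
# AR(1) series with coefficient 0.8: each value depends on the previous one
ar1 <- as.numeric(stats::filter(white, filter = 0.8, method = "recursive"))

c(white_noise = dw_stat(white), ar1 = dw_stat(ar1))
```

The uncorrelated series gives a value close to 2, while the autocorrelated one falls well below it.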
pred_reg_y0 <- predict(lm,testset0.regroup)
pred_reg_y0 %>% head
## 1 2 3 4 5 6
## 155.5281 153.3825 167.5792 186.7568 169.1919 165.1531
testset$delta.selection.departure %>% head
## [1] 47 149 97 113 124 145
Compute RMSE, R2 and MAE
postResample(pred = pred_reg_y0, obs = testset0.regroup$delta.selection.departure)
## RMSE Rsquared MAE
## 49.0785462 0.1924098 36.3520945
Predict y0 for x_test
pred_reg_final_y0 <- predict(lm,x_test0.regroup)
Remove id variable
trainset1.regroup<-trainset1.regroup[,-c("emergency.vehicle.selection")]
trainset1.regroup$delta.departure.presentation<-trainset$delta.departure.presentation
testset1.regroup<-testset1.regroup[,-c("emergency.vehicle.selection")]
testset1.regroup$delta.departure.presentation<-testset$delta.departure.presentation
Same for x_test
x_test1.regroup<-x_test1.regroup[,-c("emergency.vehicle.selection")]
#Linear Regression,
options(max.print = 10000)
lm1 <- lm(delta.departure.presentation ~.,data = trainset1.regroup)
summary(lm1)
##
## Call:
## lm(formula = delta.departure.presentation ~ ., data = trainset1.regroup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -672.05 -80.89 -23.92 52.61 876.20
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.394e+03 3.271e+02 -25.660
## n2 -2.398e+01 1.626e+00 -14.753
## n3 -3.451e+01 2.954e+00 -11.685
## n4 -4.055e+01 5.405e+00 -7.502
## n5 -2.865e+01 7.910e+00 -3.622
## n6 -1.674e+01 9.024e+00 -1.855
## n7 -9.879e-02 1.125e+01 -0.009
## n8 2.928e+01 3.866e+01 0.757
## n10 9.906e+01 4.842e+01 2.046
## n11 1.414e+01 2.866e+01 0.493
## n12 4.716e+01 2.867e+01 1.645
## n14 -4.331e+01 3.555e+01 -1.218
## alert.reason.category2 -3.444e+01 2.513e+00 -13.707
## alert.reason.category3 -2.722e+01 1.580e+00 -17.232
## alert.reason.category4 5.772e+00 4.437e+00 1.301
## alert.reason.category5 7.400e+01 6.180e+00 11.974
## alert.reason.category6 3.670e+01 2.629e+00 13.957
## alert.reason.category7 7.934e+01 6.020e+00 13.179
## alert.reason.category8 6.163e+01 1.046e+01 5.894
## alert.reason.category9 8.487e-01 2.253e+00 0.377
## intervention.on.public.roads1 -7.392e+00 1.009e+00 -7.328
## floor-1 6.431e+00 5.373e+00 1.197
## floor0 1.644e+01 4.877e+00 3.372
## floor1 1.745e+01 4.987e+00 3.498
## floor2 2.206e+01 5.014e+00 4.401
## floor3 2.127e+01 5.041e+00 4.220
## floor4 2.268e+01 5.083e+00 4.462
## floor5 2.287e+01 5.171e+00 4.423
## floor6 2.505e+01 5.301e+00 4.725
## floor7 1.970e+01 5.586e+00 3.527
## floor8 3.346e+01 6.014e+00 5.563
## floor9 4.080e+01 6.537e+00 6.241
## floor10 2.471e+01 7.079e+00 3.491
## floor11 2.587e+01 7.723e+00 3.349
## floor12 2.201e+01 8.448e+00 2.606
## floor13 2.530e+01 9.374e+00 2.698
## floor14 4.352e+01 1.042e+01 4.177
## floor15 2.023e+01 1.156e+01 1.750
## floor16 7.719e+00 1.335e+01 0.578
## floor17 3.578e+01 8.688e+00 4.118
## longitude.intervention -5.784e+01 3.823e+00 -15.130
## latitude.intervention 1.759e+02 6.690e+00 26.298
## delta.status.preceding.selection.selection 4.355e-05 8.013e-06 5.436
## OSRM.estimated.distance 1.608e-02 1.366e-03 11.773
## OSRM.estimated.duration 5.828e-01 1.308e-02 44.565
## OSRM.estimated.speed 2.515e+00 9.941e-02 25.295
## month2 1.091e+01 1.497e+00 7.283
## month3 3.130e+00 1.461e+00 2.142
## month4 -3.234e+00 1.487e+00 -2.175
## month5 -2.860e+00 1.463e+00 -1.955
## month6 4.627e+00 1.458e+00 3.174
## month7 3.177e+00 1.445e+00 2.198
## month8 -1.017e+01 1.519e+00 -6.695
## month10 9.886e+00 1.457e+00 6.787
## month11 1.607e+01 1.460e+00 11.007
## month12 1.195e+01 1.448e+00 8.251
## weekdaysjeudi 1.631e+01 1.197e+00 13.623
## weekdayslundi 1.101e+01 1.182e+00 9.314
## weekdaysmardi 1.706e+01 1.199e+00 14.238
## weekdaysmercredi 1.737e+01 1.198e+00 14.503
## weekdayssamedi 5.466e+00 1.208e+00 4.525
## weekdaysvendredi 1.691e+01 1.188e+00 14.234
## hours01 1.080e+01 2.435e+00 4.434
## hours02 1.605e+01 2.589e+00 6.199
## hours03 1.976e+01 2.658e+00 7.434
## hours04 2.467e+01 2.773e+00 8.896
## hours05 2.657e+01 2.813e+00 9.446
## hours06 2.054e+01 2.726e+00 7.533
## hours07 1.539e+01 2.540e+00 6.060
## hours08 2.229e+01 2.336e+00 9.545
## hours09 2.714e+01 2.234e+00 12.147
## hours10 1.002e+01 2.198e+00 4.557
## hours11 8.148e+00 2.173e+00 3.750
## hours12 -4.206e+00 2.138e+00 -1.967
## hours13 -4.523e+00 2.143e+00 -2.110
## hours14 1.598e+00 2.161e+00 0.739
## hours15 9.796e+00 2.179e+00 4.496
## hours16 8.533e+00 2.191e+00 3.894
## hours17 2.195e+01 2.180e+00 10.069
## hours18 2.327e+01 2.164e+00 10.752
## hours19 9.492e+00 2.134e+00 4.449
## hours20 -5.096e+00 2.142e+00 -2.378
## hours21 -1.355e+01 2.170e+00 -6.244
## hours22 -1.204e+01 2.198e+00 -5.478
## hours23 -8.427e+00 2.256e+00 -3.736
## emergency.vehicle.type.regroup 6.517e+00 1.550e+00 4.205
## rescue.center.regroup -5.469e+00 1.280e+00 -4.274
## location.of.the.event.regroup 1.549e+01 7.811e-01 19.831
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## n2 < 2e-16 ***
## n3 < 2e-16 ***
## n4 6.32e-14 ***
## n5 0.000293 ***
## n6 0.063603 .
## n7 0.992993
## n8 0.448771
## n10 0.040778 *
## n11 0.621810
## n12 0.099993 .
## n14 0.223106
## alert.reason.category2 < 2e-16 ***
## alert.reason.category3 < 2e-16 ***
## alert.reason.category4 0.193300
## alert.reason.category5 < 2e-16 ***
## alert.reason.category6 < 2e-16 ***
## alert.reason.category7 < 2e-16 ***
## alert.reason.category8 3.78e-09 ***
## alert.reason.category9 0.706379
## intervention.on.public.roads1 2.34e-13 ***
## floor-1 0.231304
## floor0 0.000747 ***
## floor1 0.000468 ***
## floor2 1.08e-05 ***
## floor3 2.45e-05 ***
## floor4 8.14e-06 ***
## floor5 9.74e-06 ***
## floor6 2.31e-06 ***
## floor7 0.000420 ***
## floor8 2.66e-08 ***
## floor9 4.35e-10 ***
## floor10 0.000482 ***
## floor11 0.000810 ***
## floor12 0.009161 **
## floor13 0.006966 **
## floor14 2.96e-05 ***
## floor15 0.080091 .
## floor16 0.563094
## floor17 3.83e-05 ***
## longitude.intervention < 2e-16 ***
## latitude.intervention < 2e-16 ***
## delta.status.preceding.selection.selection 5.47e-08 ***
## OSRM.estimated.distance < 2e-16 ***
## OSRM.estimated.duration < 2e-16 ***
## OSRM.estimated.speed < 2e-16 ***
## month2 3.27e-13 ***
## month3 0.032179 *
## month4 0.029646 *
## month5 0.050628 .
## month6 0.001504 **
## month7 0.027930 *
## month8 2.16e-11 ***
## month10 1.15e-11 ***
## month11 < 2e-16 ***
## month12 < 2e-16 ***
## weekdaysjeudi < 2e-16 ***
## weekdayslundi < 2e-16 ***
## weekdaysmardi < 2e-16 ***
## weekdaysmercredi < 2e-16 ***
## weekdayssamedi 6.05e-06 ***
## weekdaysvendredi < 2e-16 ***
## hours01 9.25e-06 ***
## hours02 5.69e-10 ***
## hours03 1.06e-13 ***
## hours04 < 2e-16 ***
## hours05 < 2e-16 ***
## hours06 4.99e-14 ***
## hours07 1.36e-09 ***
## hours08 < 2e-16 ***
## hours09 < 2e-16 ***
## hours10 5.19e-06 ***
## hours11 0.000177 ***
## hours12 0.049232 *
## hours13 0.034819 *
## hours14 0.459723
## hours15 6.93e-06 ***
## hours16 9.85e-05 ***
## hours17 < 2e-16 ***
## hours18 < 2e-16 ***
## hours19 8.64e-06 ***
## hours20 0.017385 *
## hours21 4.26e-10 ***
## hours22 4.30e-08 ***
## hours23 0.000187 ***
## emergency.vehicle.type.regroup 2.61e-05 ***
## rescue.center.regroup 1.92e-05 ***
## location.of.the.event.regroup < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127.9 on 163901 degrees of freedom
## Multiple R-squared: 0.3732, Adjusted R-squared: 0.3728
## F-statistic: 1122 on 87 and 163901 DF, p-value: < 2.2e-16
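Most of our coefficients are significant at the 5% level. The exceptions are n6, n7, n8, n11, n12, n14, alert.reason.category4, alert.reason.category9, floor-1, floor15, floor16, month5 and hours14.
#### 7.2.2 Y1: Multicollinearity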
library(sandwich)
library(car)
vif(lm1)
## GVIF Df GVIF^(1/(2*Df))
## n 1.714904 11 1.024819
## alert.reason.category 2.084703 8 1.046985
## intervention.on.public.roads 1.342675 1 1.158739
## floor 1.418573 19 1.009244
## longitude.intervention 1.086565 1 1.042384
## latitude.intervention 1.073770 1 1.036229
## delta.status.preceding.selection.selection 1.198188 1 1.094618
## OSRM.estimated.distance 32.351598 1 5.687847
## OSRM.estimated.duration 23.116192 1 4.807930
## OSRM.estimated.speed 4.634243 1 2.152729
## month 1.030278 10 1.001493
## weekdays 1.023036 6 1.001900
## hours 1.079047 23 1.001655
## emergency.vehicle.type.regroup 1.452641 1 1.205255
## rescue.center.regroup 1.055779 1 1.027511
## location.of.the.event.regroup 1.519869 1 1.232830
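# qualitative variables
Except for the OSRM variables (GVIF of about 32.4 for distance, 23.1 for duration and 4.6 for speed), no predictor shows strong collinearity. We decide to keep the OSRM variables as they are important for our model.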
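To see why the OSRM variables stand out, recall that for a single-coefficient term the VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below illustrates this on simulated data in base R; the names `distance`, `duration` and `other` are hypothetical and are not the project's columns.

```r
# Minimal sketch on simulated data: VIF computed by hand as 1 / (1 - R^2).
# All names below are hypothetical and unrelated to the project's columns.
set.seed(1)
n <- 1000
distance <- runif(n, 500, 5000)                 # simulated route length
duration <- distance / 10 + rnorm(n, sd = 20)   # almost proportional to distance
other    <- rnorm(n)                            # unrelated predictor

vif_by_hand <- function(target, others) {
  # Regress the target predictor on all the other predictors
  d <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vif_distance <- vif_by_hand(distance, data.frame(duration, other))
vif_other    <- vif_by_hand(other, data.frame(distance, duration))
print(c(distance = vif_distance, other = vif_other))
```

In this toy setting the two nearly proportional variables inflate each other's VIF, while the unrelated predictor stays close to 1. The same mechanism is plausible for the OSRM distance and duration.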
Study the residuals of the selected model
What we have done so far is not enough to validate the model. We also need to study the residuals: if the model hypotheses are not satisfied (see course), the tests on the coefficients are not valid.
#mean of residuals
summary(lm1$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -672.05 -80.89 -23.92 0.00 52.61 876.20
The mean of the residuals is zero, as expected.
Normality tests (Jarque-Bera and Anderson-Darling): H0: normality against H1: non-normality
library(tseries)
jarque.bera.test(lm1$residuals)
##
## Jarque Bera Test
##
## data: lm1$residuals
## X-squared = 165384, df = 2, p-value < 2.2e-16
#install.packages("nortest")
library(nortest)
ad.test(lm1$residuals)
##
## Anderson-Darling normality test
##
## data: lm1$residuals
## A = 3622.5, p-value < 2.2e-16
Both p-values are far below 5%, so we reject H0: the residuals are not normally distributed. NB: the non-normality could be caused by outliers.
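As a reminder of what the Jarque-Bera statistic measures, the sketch below computes it by hand from the sample skewness S and kurtosis K, JB = n/6 · (S² + (K − 3)²/4), which is approximately chi-squared with 2 degrees of freedom under H0. It uses simulated residuals, not the project's data, and `jb_stat` is a helper written for this illustration.

```r
# Minimal sketch: Jarque-Bera statistic computed by hand on simulated residuals.
set.seed(42)

jb_stat <- function(x) {
  n  <- length(x)
  z  <- x - mean(x)
  m2 <- mean(z^2)
  s  <- mean(z^3) / m2^1.5       # sample skewness
  k  <- mean(z^4) / m2^2         # sample kurtosis
  n / 6 * (s^2 + (k - 3)^2 / 4)
}

res_normal <- rnorm(5000)        # residuals that really are normal
res_skewed <- rexp(5000) - 1     # right-skewed residuals with mean zero

jb_normal <- jb_stat(res_normal)
jb_skewed <- jb_stat(res_skewed)

# p-values from the chi-squared distribution with 2 degrees of freedom
p_normal <- pchisq(jb_normal, df = 2, lower.tail = FALSE)
p_skewed <- pchisq(jb_skewed, df = 2, lower.tail = FALSE)
print(c(jb_normal = jb_normal, jb_skewed = jb_skewed))
```

Because the statistic grows with n, even moderate skewness or heavy tails lead to a rejection on a sample as large as ours.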
qqnorm(lm1$residuals)
qqline(lm1$residuals)
# Residuals against fitted values of the same model (lm1)
plot(lm1$residuals ~ lm1$fitted.values)
library(lmtest)
# Breusch-Pagan test H0 : homoskedasticity against H1 : heteroskedasticity
bptest(lm1)
##
## studentized Breusch-Pagan test
##
## data: lm1
## BP = 1115.7, df = 87, p-value < 2.2e-16
The p-value is far below 5%: we reject H0, so the residuals are heteroskedastic.
In case of heteroskedasticity, we have to use a robust standard error estimator. Otherwise, all our t-tests will be wrong.
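For intuition about what a heteroskedasticity-robust estimator of type HC1 computes, here is a base-R sketch of the sandwich formula (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹ scaled by n / (n − k). The data are simulated and the variable names are illustrative assumptions; in practice the `sandwich` package does this for us.

```r
# Minimal sketch: HC1 robust covariance matrix computed by hand on simulated data.
set.seed(123)
n <- 2000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)    # error variance grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)               # design matrix (intercept and x)
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))         # (X'X)^-1
meat  <- crossprod(X * e)            # X' diag(e^2) X, each row of X scaled by its residual
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))  # usual standard errors
se_robust  <- sqrt(diag(vcov_hc1))   # robust standard errors
print(rbind(classic = se_classic, robust = se_robust))
```

The coefficient estimates are not touched by this correction: only the covariance matrix, and therefore the standard errors and t-tests, change.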
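library(sandwich)
# Calculate the heteroskedasticity-robust (HC1) covariance matrix of lm1
vcov_y1 <- vcovHC(lm1, type = "HC1")
# Test the lm1 coefficients with the matching covariance matrix
# (the output below came from a call on the `lm` object, i.e. the Y0 model)
coeftest(lm1, vcov. = vcov_y1)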
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.7421e+03 3.4100e+02 -5.1086
## alert.reason.category2 -1.8685e+01 2.4176e+00 -7.7286
## alert.reason.category3 -1.6990e+01 1.4315e+00 -11.8681
## alert.reason.category4 3.4795e+00 4.3607e+00 0.7979
## alert.reason.category5 1.8047e+01 8.2063e+00 2.1991
## alert.reason.category6 1.2656e+01 2.3447e+00 5.3978
## alert.reason.category7 2.6938e+01 7.9098e+00 3.4056
## alert.reason.category8 7.0385e+00 1.0666e+01 0.6599
## alert.reason.category9 3.7669e+00 2.1490e+00 1.7529
## longitude.intervention 2.4304e+01 3.8166e+00 6.3680
## latitude.intervention 3.8031e+01 6.9665e+00 5.4591
## delta.status.preceding.selection.selection 6.2062e-05 8.5934e-06 7.2220
## OSRM.estimated.distance -2.4948e-03 1.6595e-03 -1.5033
## OSRM.estimated.duration 4.4547e-02 1.5425e-02 2.8879
## OSRM.estimated.speed 1.1241e-01 1.0981e-01 1.0237
## hours01 1.5025e+01 2.4279e+00 6.1884
## hours02 2.7699e+01 2.5871e+00 10.7067
## hours03 3.2377e+01 2.6714e+00 12.1197
## hours04 3.5306e+01 2.8289e+00 12.4805
## hours05 3.5755e+01 2.8958e+00 12.3472
## hours06 1.8906e+01 2.7097e+00 6.9770
## hours07 -1.9443e+01 2.4860e+00 -7.8212
## hours08 -3.1067e+01 2.3630e+00 -13.1474
## hours09 -2.6510e+01 2.2702e+00 -11.6776
## hours10 -3.6849e+01 2.1661e+00 -17.0111
## hours11 -3.9176e+01 2.1373e+00 -18.3292
## hours12 -4.0035e+01 2.0786e+00 -19.2611
## hours13 -3.5994e+01 2.0945e+00 -17.1850
## hours14 -3.6671e+01 2.1299e+00 -17.2172
## hours15 -3.5741e+01 2.1729e+00 -16.4491
## hours16 -3.7403e+01 2.1440e+00 -17.4454
## hours17 -3.4015e+01 2.1808e+00 -15.5970
## hours18 -2.9320e+01 2.1560e+00 -13.5990
## hours19 -3.3562e+01 2.0890e+00 -16.0666
## hours20 -3.3535e+01 2.0969e+00 -15.9927
## hours21 -3.2491e+01 2.0905e+00 -15.5424
## hours22 -3.0684e+01 2.1357e+00 -14.3671
## hours23 -1.8318e+01 2.1904e+00 -8.3627
## emergency.vehicle.type.regroup -7.7724e+00 1.5459e+00 -5.0279
## rescue.center.regroup 9.6752e+00 1.2649e+00 7.6490
## Pr(>|t|)
## (Intercept) 3.248e-07 ***
## alert.reason.category2 1.094e-14 ***
## alert.reason.category3 < 2.2e-16 ***
## alert.reason.category4 0.4249280
## alert.reason.category5 0.0278690 *
## alert.reason.category6 6.757e-08 ***
## alert.reason.category7 0.0006603 ***
## alert.reason.category8 0.5093228
## alert.reason.category9 0.0796239 .
## longitude.intervention 1.920e-10 ***
## latitude.intervention 4.794e-08 ***
## delta.status.preceding.selection.selection 5.144e-13 ***
## OSRM.estimated.distance 0.1327596
## OSRM.estimated.duration 0.0038791 **
## OSRM.estimated.speed 0.3059961
## hours01 6.091e-10 ***
## hours02 < 2.2e-16 ***
## hours03 < 2.2e-16 ***
## hours04 < 2.2e-16 ***
## hours05 < 2.2e-16 ***
## hours06 3.027e-12 ***
## hours07 5.262e-15 ***
## hours08 < 2.2e-16 ***
## hours09 < 2.2e-16 ***
## hours10 < 2.2e-16 ***
## hours11 < 2.2e-16 ***
## hours12 < 2.2e-16 ***
## hours13 < 2.2e-16 ***
## hours14 < 2.2e-16 ***
## hours15 < 2.2e-16 ***
## hours16 < 2.2e-16 ***
## hours17 < 2.2e-16 ***
## hours18 < 2.2e-16 ***
## hours19 < 2.2e-16 ***
## hours20 < 2.2e-16 ***
## hours21 < 2.2e-16 ***
## hours22 < 2.2e-16 ***
## hours23 < 2.2e-16 ***
## emergency.vehicle.type.regroup 4.964e-07 ***
## rescue.center.regroup 2.036e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
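Robust standard errors leave the coefficient estimates unchanged and only affect the standard errors and p-values. The estimates in the table above are those of the Y0 model, not lm1, so this table is not a valid test for Y1; the Newey-West results below are computed on lm1.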
library(sandwich)
#Calculate the robust covariance matrix
vcov_y1_2 <- NeweyWest(lm1)
coeftest(lm1, vcov. = vcov_y1_2)
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -8.3937e+03 3.4466e+02 -24.3538
## n2 -2.3983e+01 1.6322e+00 -14.6940
## n3 -3.4513e+01 2.5354e+00 -13.6122
## n4 -4.0549e+01 4.8942e+00 -8.2852
## n5 -2.8647e+01 7.2386e+00 -3.9576
## n6 -1.6739e+01 8.3987e+00 -1.9931
## n7 -9.8787e-02 8.3199e+00 -0.0119
## n8 2.9284e+01 2.4819e+01 1.1799
## n10 9.9059e+01 3.4526e+01 2.8691
## n11 1.4139e+01 2.5688e+01 0.5504
## n12 4.7163e+01 4.9617e+01 0.9505
## n14 -4.3307e+01 3.8384e+01 -1.1283
## alert.reason.category2 -3.4441e+01 2.4043e+00 -14.3250
## alert.reason.category3 -2.7220e+01 1.4337e+00 -18.9863
## alert.reason.category4 5.7721e+00 4.3378e+00 1.3306
## alert.reason.category5 7.4002e+01 8.2640e+00 8.9548
## alert.reason.category6 3.6699e+01 2.3381e+00 15.6959
## alert.reason.category7 7.9337e+01 7.9784e+00 9.9440
## alert.reason.category8 6.1634e+01 1.0676e+01 5.7733
## alert.reason.category9 8.4869e-01 2.1946e+00 0.3867
## intervention.on.public.roads1 -7.3918e+00 1.0015e+00 -7.3806
## floor-1 6.4313e+00 5.0683e+00 1.2689
## floor0 1.6445e+01 4.6849e+00 3.5102
## floor1 1.7446e+01 4.8070e+00 3.6293
## floor2 2.2062e+01 4.8516e+00 4.5474
## floor3 2.1271e+01 4.8851e+00 4.3543
## floor4 2.2677e+01 4.9151e+00 4.6139
## floor5 2.2869e+01 4.9865e+00 4.5862
## floor6 2.5045e+01 5.1395e+00 4.8731
## floor7 1.9702e+01 5.4958e+00 3.5850
## floor8 3.3456e+01 5.9426e+00 5.6298
## floor9 4.0799e+01 6.7665e+00 6.0296
## floor10 2.4712e+01 6.8837e+00 3.5899
## floor11 2.5868e+01 7.5053e+00 3.4466
## floor12 2.2015e+01 8.3297e+00 2.6429
## floor13 2.5296e+01 8.7284e+00 2.8982
## floor14 4.3515e+01 1.1507e+01 3.7817
## floor15 2.0228e+01 1.2042e+01 1.6798
## floor16 7.7191e+00 1.2224e+01 0.6315
## floor17 3.5776e+01 9.3643e+00 3.8205
## longitude.intervention -5.7841e+01 3.7733e+00 -15.3293
## latitude.intervention 1.7592e+02 7.0420e+00 24.9822
## delta.status.preceding.selection.selection 4.3553e-05 8.7107e-06 4.9999
## OSRM.estimated.distance 1.6084e-02 1.6395e-03 9.8102
## OSRM.estimated.duration 5.8277e-01 1.5192e-02 38.3612
## OSRM.estimated.speed 2.5145e+00 1.0942e-01 22.9806
## month2 1.0905e+01 1.4728e+00 7.4046
## month3 3.1297e+00 1.4355e+00 2.1802
## month4 -3.2340e+00 1.4559e+00 -2.2212
## month5 -2.8598e+00 1.4199e+00 -2.0141
## month6 4.6272e+00 1.4426e+00 3.2076
## month7 3.1767e+00 1.4115e+00 2.2505
## month8 -1.0168e+01 1.4589e+00 -6.9697
## month10 9.8865e+00 1.4550e+00 6.7950
## month11 1.6068e+01 1.4423e+00 11.1403
## month12 1.1947e+01 1.4177e+00 8.4266
## weekdaysjeudi 1.6307e+01 1.1947e+00 13.6493
## weekdayslundi 1.1009e+01 1.1815e+00 9.3172
## weekdaysmardi 1.7064e+01 1.1997e+00 14.2237
## weekdaysmercredi 1.7372e+01 1.1760e+00 14.7712
## weekdayssamedi 5.4657e+00 1.1899e+00 4.5933
## weekdaysvendredi 1.6909e+01 1.1882e+00 14.2314
## hours01 1.0796e+01 2.4133e+00 4.4736
## hours02 1.6051e+01 2.5992e+00 6.1751
## hours03 1.9761e+01 2.6616e+00 7.4247
## hours04 2.4673e+01 2.8772e+00 8.5753
## hours05 2.6569e+01 2.8667e+00 9.2682
## hours06 2.0536e+01 2.7127e+00 7.5704
## hours07 1.5391e+01 2.5025e+00 6.1502
## hours08 2.2293e+01 2.3568e+00 9.4587
## hours09 2.7140e+01 2.2702e+00 11.9548
## hours10 1.0016e+01 2.1807e+00 4.5930
## hours11 8.1477e+00 2.1549e+00 3.7810
## hours12 -4.2055e+00 2.0920e+00 -2.0103
## hours13 -4.5227e+00 2.0872e+00 -2.1669
## hours14 1.5978e+00 2.1512e+00 0.7428
## hours15 9.7961e+00 2.2087e+00 4.4352
## hours16 8.5334e+00 2.1426e+00 3.9828
## hours17 2.1954e+01 2.2034e+00 9.9634
## hours18 2.3269e+01 2.1767e+00 10.6900
## hours19 9.4924e+00 2.0881e+00 4.5460
## hours20 -5.0956e+00 2.0868e+00 -2.4419
## hours21 -1.3548e+01 2.0773e+00 -6.5222
## hours22 -1.2043e+01 2.1409e+00 -5.6250
## hours23 -8.4270e+00 2.1788e+00 -3.8678
## emergency.vehicle.type.regroup 6.5173e+00 1.5492e+00 4.2069
## rescue.center.regroup -5.4685e+00 1.2975e+00 -4.2147
## location.of.the.event.regroup 1.5491e+01 7.7500e-01 19.9884
## Pr(>|t|)
## (Intercept) < 2.2e-16 ***
## n2 < 2.2e-16 ***
## n3 < 2.2e-16 ***
## n4 < 2.2e-16 ***
## n5 7.574e-05 ***
## n6 0.0462563 *
## n7 0.9905264
## n8 0.2380499
## n10 0.0041171 **
## n11 0.5820385
## n12 0.3418390
## n14 0.2592041
## alert.reason.category2 < 2.2e-16 ***
## alert.reason.category3 < 2.2e-16 ***
## alert.reason.category4 0.1833083
## alert.reason.category5 < 2.2e-16 ***
## alert.reason.category6 < 2.2e-16 ***
## alert.reason.category7 < 2.2e-16 ***
## alert.reason.category8 7.788e-09 ***
## alert.reason.category9 0.6989664
## intervention.on.public.roads1 1.584e-13 ***
## floor-1 0.2044681
## floor0 0.0004479 ***
## floor1 0.0002843 ***
## floor2 5.435e-06 ***
## floor3 1.336e-05 ***
## floor4 3.956e-06 ***
## floor5 4.517e-06 ***
## floor6 1.100e-06 ***
## floor7 0.0003372 ***
## floor8 1.807e-08 ***
## floor9 1.647e-09 ***
## floor10 0.0003309 ***
## floor11 0.0005678 ***
## floor12 0.0082201 **
## floor13 0.0037541 **
## floor14 0.0001558 ***
## floor15 0.0930038 .
## floor16 0.5277364
## floor17 0.0001332 ***
## longitude.intervention < 2.2e-16 ***
## latitude.intervention < 2.2e-16 ***
## delta.status.preceding.selection.selection 5.741e-07 ***
## OSRM.estimated.distance < 2.2e-16 ***
## OSRM.estimated.duration < 2.2e-16 ***
## OSRM.estimated.speed < 2.2e-16 ***
## month2 1.322e-13 ***
## month3 0.0292475 *
## month4 0.0263359 *
## month5 0.0439966 *
## month6 0.0013386 **
## month7 0.0244171 *
## month8 3.189e-12 ***
## month10 1.087e-11 ***
## month11 < 2.2e-16 ***
## month12 < 2.2e-16 ***
## weekdaysjeudi < 2.2e-16 ***
## weekdayslundi < 2.2e-16 ***
## weekdaysmardi < 2.2e-16 ***
## weekdaysmercredi < 2.2e-16 ***
## weekdayssamedi 4.366e-06 ***
## weekdaysvendredi < 2.2e-16 ***
## hours01 7.696e-06 ***
## hours02 6.628e-10 ***
## hours03 1.136e-13 ***
## hours04 < 2.2e-16 ***
## hours05 < 2.2e-16 ***
## hours06 3.741e-14 ***
## hours07 7.758e-10 ***
## hours08 < 2.2e-16 ***
## hours09 < 2.2e-16 ***
## hours10 4.373e-06 ***
## hours11 0.0001563 ***
## hours12 0.0444024 *
## hours13 0.0302411 *
## hours14 0.4576334
## hours15 9.203e-06 ***
## hours16 6.813e-05 ***
## hours17 < 2.2e-16 ***
## hours18 < 2.2e-16 ***
## hours19 5.473e-06 ***
## hours20 0.0146132 *
## hours21 6.950e-11 ***
## hours22 1.858e-08 ***
## hours23 0.0001099 ***
## emergency.vehicle.type.regroup 2.591e-05 ***
## rescue.center.regroup 2.502e-05 ***
## location.of.the.event.regroup < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates are unchanged, and the coefficients that were significant at the 5% level with the classical standard errors remain significant with the Newey-West standard errors.
pred_reg_y1 <- predict(lm1,testset1.regroup)
pred_reg_y1 %>% head
## 1 2 3 4 5 6
## 407.3334 298.8634 397.5383 374.2344 381.2628 386.3228
testset$delta.departure.presentation %>% head
## [1] 376 268 409 678 432 236
Compute RMSE, R² and MAE on the test set
postResample(pred = pred_reg_y1, obs = testset1.regroup$delta.departure.presentation)
## RMSE Rsquared MAE
## 128.1085800 0.3756295 92.7618495
testset1.regroup %>% str
## Classes 'data.table' and 'data.frame': 40998 obs. of 17 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 3 3 1 3 3 3 3 3 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 2 1 1 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 6 3 4 3 10 6 3 3 3 3 ...
## $ longitude.intervention : num 2.28 2.3 2.2 2.5 2.4 ...
## $ latitude.intervention : num 48.9 48.9 48.9 49 48.9 ...
## $ delta.status.preceding.selection.selection: int 16251 606 4693 86 1485 917 17 161 2061 1022 ...
## $ OSRM.estimated.distance : num 2347 1812 2586 2442 2314 ...
## $ OSRM.estimated.duration : num 218 198 280 247 315 ...
## $ OSRM.estimated.speed : num 38.8 33 33.2 35.5 26.4 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 10 1 2 1 1 1 2 2 2 2 ...
## $ emergency.vehicle.type.regroup : num 1 1 1 0 1 1 1 1 1 1 ...
## $ rescue.center.regroup : num 0 0 0 0 0 0 0 0 0 0 ...
## $ location.of.the.event.regroup : num 0 0 1 0 1 0 0 0 1 1 ...
## $ delta.departure.presentation : int 376 268 409 678 432 236 302 225 404 294 ...
## - attr(*, ".internal.selfref")=<externalptr>
x_test1.regroup %>% str
## Classes 'data.table' and 'data.frame': 108033 obs. of 16 variables:
## $ n : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
## $ longitude.intervention : num 2.34 2.28 2.28 2.34 2.41 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ delta.status.preceding.selection.selection: int 2636 16243 597 1834 1341 2197 16 1312 263 437 ...
## $ OSRM.estimated.distance : num 1283 2347 1078 1791 1451 ...
## $ OSRM.estimated.duration : num 214 218 120 250 199 ...
## $ OSRM.estimated.speed : num 21.6 38.8 32.4 25.7 26.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
## $ emergency.vehicle.type.regroup : num 1 0 1 1 1 1 1 1 1 1 ...
## $ rescue.center.regroup : num 0 0 0 0 0 1 1 0 0 0 ...
## $ location.of.the.event.regroup : num 1 0 1 0 0 0 0 0 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
#pred_reg_final_y1 <- predict(lm1,x_test1.regroup)
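The prediction above is commented out because x_test contains levels of n (9 and 13) that are not present in the training data: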
testset1.regroup$n %>% levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "10" "11" "12" "14"
x_test1.regroup$n %>% levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
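The level mismatch printed above can be detected programmatically before calling `predict`. A minimal sketch with toy factors, built by hand to mirror the levels printed above (the vector names are hypothetical):

```r
# Minimal sketch: find factor levels present in the new data but unseen in training.
train_n <- factor(c("1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "14"))
test_n  <- factor(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"))

# Levels of the new data that the model has never seen
unseen <- setdiff(levels(test_n), levels(train_n))
print(unseen)
```

`predict` on an `lm` object stops with an error when a factor contains levels that were absent at fitting time, so such levels have to be dropped, regrouped, or the variable removed.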
Retrain the model without the variable n
#Linear Regression,
options(max.print = 10000)
lm1 <- lm(delta.departure.presentation ~.,data = trainset1.regroup[,-c("n")])
summary(lm1)
##
## Call:
## lm(formula = delta.departure.presentation ~ ., data = trainset1.regroup[,
## -c("n")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -667.10 -81.01 -23.86 52.64 876.52
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.438e+03 3.273e+02 -25.782
## alert.reason.category2 -2.348e+01 2.420e+00 -9.701
## alert.reason.category3 -1.525e+01 1.397e+00 -10.917
## alert.reason.category4 1.739e+01 4.381e+00 3.968
## alert.reason.category5 8.633e+01 6.140e+00 14.059
## alert.reason.category6 4.437e+01 2.562e+00 17.320
## alert.reason.category7 8.787e+01 5.999e+00 14.648
## alert.reason.category8 6.160e+01 1.046e+01 5.890
## alert.reason.category9 1.235e+01 2.137e+00 5.780
## intervention.on.public.roads1 -6.499e+00 1.008e+00 -6.447
## floor-1 7.882e+00 5.373e+00 1.467
## floor0 1.967e+01 4.873e+00 4.037
## floor1 2.022e+01 4.984e+00 4.056
## floor2 2.480e+01 5.010e+00 4.950
## floor3 2.392e+01 5.037e+00 4.748
## floor4 2.517e+01 5.080e+00 4.955
## floor5 2.528e+01 5.168e+00 4.891
## floor6 2.749e+01 5.299e+00 5.188
## floor7 2.217e+01 5.583e+00 3.970
## floor8 3.566e+01 6.013e+00 5.931
## floor9 4.360e+01 6.537e+00 6.669
## floor10 2.747e+01 7.080e+00 3.880
## floor11 2.870e+01 7.725e+00 3.715
## floor12 2.525e+01 8.451e+00 2.988
## floor13 2.732e+01 9.382e+00 2.912
## floor14 4.616e+01 1.042e+01 4.428
## floor15 2.384e+01 1.157e+01 2.062
## floor16 1.061e+01 1.336e+01 0.794
## floor17 3.850e+01 8.693e+00 4.429
## longitude.intervention -5.811e+01 3.826e+00 -15.189
## latitude.intervention 1.764e+02 6.693e+00 26.354
## delta.status.preceding.selection.selection 3.819e-05 8.007e-06 4.769
## OSRM.estimated.distance 1.618e-02 1.367e-03 11.832
## OSRM.estimated.duration 5.829e-01 1.309e-02 44.537
## OSRM.estimated.speed 2.514e+00 9.949e-02 25.273
## month2 1.074e+01 1.498e+00 7.166
## month3 3.002e+00 1.462e+00 2.053
## month4 -3.308e+00 1.488e+00 -2.223
## month5 -2.923e+00 1.464e+00 -1.996
## month6 4.654e+00 1.459e+00 3.190
## month7 3.321e+00 1.446e+00 2.297
## month8 -1.029e+01 1.520e+00 -6.770
## month10 9.880e+00 1.458e+00 6.776
## month11 1.602e+01 1.461e+00 10.966
## month12 1.217e+01 1.449e+00 8.398
## weekdaysjeudi 1.642e+01 1.198e+00 13.706
## weekdayslundi 1.111e+01 1.183e+00 9.388
## weekdaysmardi 1.713e+01 1.200e+00 14.277
## weekdaysmercredi 1.752e+01 1.199e+00 14.616
## weekdayssamedi 5.480e+00 1.209e+00 4.533
## weekdaysvendredi 1.695e+01 1.189e+00 14.255
## hours01 1.089e+01 2.437e+00 4.466
## hours02 1.621e+01 2.592e+00 6.256
## hours03 1.940e+01 2.660e+00 7.293
## hours04 2.445e+01 2.776e+00 8.810
## hours05 2.640e+01 2.815e+00 9.376
## hours06 2.025e+01 2.729e+00 7.423
## hours07 1.508e+01 2.542e+00 5.935
## hours08 2.235e+01 2.338e+00 9.560
## hours09 2.706e+01 2.237e+00 12.101
## hours10 1.002e+01 2.200e+00 4.557
## hours11 8.252e+00 2.175e+00 3.795
## hours12 -4.245e+00 2.141e+00 -1.983
## hours13 -4.487e+00 2.145e+00 -2.092
## hours14 1.603e+00 2.163e+00 0.741
## hours15 9.746e+00 2.181e+00 4.469
## hours16 8.646e+00 2.193e+00 3.942
## hours17 2.194e+01 2.182e+00 10.054
## hours18 2.330e+01 2.166e+00 10.756
## hours19 9.505e+00 2.136e+00 4.450
## hours20 -5.059e+00 2.144e+00 -2.359
## hours21 -1.364e+01 2.172e+00 -6.280
## hours22 -1.189e+01 2.200e+00 -5.404
## hours23 -8.281e+00 2.258e+00 -3.668
## emergency.vehicle.type.regroup 1.252e+01 1.507e+00 8.308
## rescue.center.regroup -5.379e+00 1.280e+00 -4.201
## location.of.the.event.regroup 1.563e+01 7.817e-01 20.001
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## alert.reason.category2 < 2e-16 ***
## alert.reason.category3 < 2e-16 ***
## alert.reason.category4 7.24e-05 ***
## alert.reason.category5 < 2e-16 ***
## alert.reason.category6 < 2e-16 ***
## alert.reason.category7 < 2e-16 ***
## alert.reason.category8 3.86e-09 ***
## alert.reason.category9 7.48e-09 ***
## intervention.on.public.roads1 1.14e-10 ***
## floor-1 0.142399
## floor0 5.41e-05 ***
## floor1 4.99e-05 ***
## floor2 7.42e-07 ***
## floor3 2.05e-06 ***
## floor4 7.22e-07 ***
## floor5 1.00e-06 ***
## floor6 2.12e-07 ***
## floor7 7.19e-05 ***
## floor8 3.01e-09 ***
## floor9 2.58e-11 ***
## floor10 0.000105 ***
## floor11 0.000203 ***
## floor12 0.002812 **
## floor13 0.003587 **
## floor14 9.52e-06 ***
## floor15 0.039231 *
## floor16 0.426916
## floor17 9.47e-06 ***
## longitude.intervention < 2e-16 ***
## latitude.intervention < 2e-16 ***
## delta.status.preceding.selection.selection 1.85e-06 ***
## OSRM.estimated.distance < 2e-16 ***
## OSRM.estimated.duration < 2e-16 ***
## OSRM.estimated.speed < 2e-16 ***
## month2 7.73e-13 ***
## month3 0.040057 *
## month4 0.026251 *
## month5 0.045951 *
## month6 0.001425 **
## month7 0.021639 *
## month8 1.29e-11 ***
## month10 1.24e-11 ***
## month11 < 2e-16 ***
## month12 < 2e-16 ***
## weekdaysjeudi < 2e-16 ***
## weekdayslundi < 2e-16 ***
## weekdaysmardi < 2e-16 ***
## weekdaysmercredi < 2e-16 ***
## weekdayssamedi 5.83e-06 ***
## weekdaysvendredi < 2e-16 ***
## hours01 7.97e-06 ***
## hours02 3.95e-10 ***
## hours03 3.05e-13 ***
## hours04 < 2e-16 ***
## hours05 < 2e-16 ***
## hours06 1.15e-13 ***
## hours07 2.95e-09 ***
## hours08 < 2e-16 ***
## hours09 < 2e-16 ***
## hours10 5.20e-06 ***
## hours11 0.000148 ***
## hours12 0.047361 *
## hours13 0.036482 *
## hours14 0.458627
## hours15 7.88e-06 ***
## hours16 8.08e-05 ***
## hours17 < 2e-16 ***
## hours18 < 2e-16 ***
## hours19 8.59e-06 ***
## hours20 0.018311 *
## hours21 3.39e-10 ***
## hours22 6.52e-08 ***
## hours23 0.000245 ***
## emergency.vehicle.type.regroup < 2e-16 ***
## rescue.center.regroup 2.66e-05 ***
## location.of.the.event.regroup < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128 on 163912 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1277 on 76 and 163912 DF, p-value: < 2.2e-16
pred_reg_final_y1 <- predict(lm1,x_test1.regroup[,-c("n")])
pred_reg_ys=pred_reg_y0+pred_reg_y1
pred_reg=data.frame(pred_reg_y0,pred_reg_y1,pred_reg_ys)
pred_reg %>% head
## pred_reg_y0 pred_reg_y1 pred_reg_ys
## 1 155.5281 407.3334 562.8615
## 2 153.3825 298.8634 452.2458
## 3 167.5792 397.5383 565.1175
## 4 186.7568 374.2344 560.9912
## 5 169.1919 381.2628 550.4547
## 6 165.1531 386.3228 551.4758
testset$delta.selection.presentation %>% head
## [1] 423 417 506 791 556 381
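# The summed prediction targets the total response time (delta.selection.presentation);
# the metrics printed below were obtained with delta.departure.presentation as reference
postResample(pred = pred_reg_ys, obs = testset$delta.selection.presentation)
## RMSE Rsquared MAE
## 188.733277 0.357252 166.213062
The MAE of about 166 s is close to a typical Y0 prediction (roughly 150 to 190 s), which reflects the mismatch between the summed prediction and the Y1 reference. These metrics should be recomputed against the total response time before judging the accuracy of the summed prediction.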
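As a quick check of the summed prediction against the total response time, RMSE and MAE can be computed by hand on the six rows printed above. The values are copied from the `head` outputs; six rows are far too few to draw conclusions, so this only illustrates the calculation.

```r
# Minimal sketch: RMSE and MAE by hand on the six rows shown above.
pred_total <- c(562.8615, 452.2458, 565.1175, 560.9912, 550.4547, 551.4758)  # pred_reg_ys head
obs_total  <- c(423, 417, 506, 791, 556, 381)   # delta.selection.presentation head

err  <- pred_total - obs_total
rmse <- sqrt(mean(err^2))    # root mean squared error
mae  <- mean(abs(err))       # mean absolute error
print(c(RMSE = rmse, MAE = mae))
```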
pred_reg_final_ys=pred_reg_final_y0+pred_reg_final_y1
pred_final=data.frame(pred_reg_final_y0,pred_reg_final_y1,pred_reg_final_ys)
pred_final %>% head
## pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys
## 1 150.7865 322.0503 472.8368
## 2 163.3000 381.8173 545.1174
## 3 151.4585 270.3706 421.8291
## 4 166.5357 319.1771 485.7128
## 5 172.7904 283.1569 455.9473
## 6 172.5030 282.4379 454.9409
pred_final$id<-x_test$emergency.vehicle.selection
Change columns order
pred_final <- pred_final[, c(4, 1, 2, 3)]
Convert into integer
pred_final$pred_reg_final_y0=pred_final$pred_reg_final_y0 %>% as.integer()
pred_final$pred_reg_final_y1=pred_final$pred_reg_final_y1 %>% as.integer()
pred_final$pred_reg_final_ys=pred_final$pred_reg_final_ys %>% as.integer()
Restore the row order of the original x_test file
pred_final %>% head
## id pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys
## 1 4715068 150 322 472
## 2 4714816 163 381 545
## 3 4713710 151 270 421
## 4 4713748 166 319 485
## 5 4713778 172 283 455
## 6 4713812 172 282 454
id_order %>% head
## x_test.emergency.vehicle.selection
## 1 5271704
## 2 5092931
## 3 5153756
## 4 5355572
## 5 5178915
## 6 5206885
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
## id pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys rank
## 1 4713710 151 270 421 14082
## 2 4713748 166 319 485 25965
## 3 4713778 172 283 455 14465
## 4 4713812 172 282 454 79530
## 5 4713821 195 583 779 47824
## 6 4713863 166 250 417 2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
## id pred_reg_final_y0 pred_reg_final_y1 pred_reg_final_ys rank
## 77897 5271704 117 427 545 1
## 59352 5092931 119 373 493 2
## 68483 5153756 119 216 336 3
## 91242 5355572 187 285 472 4
## 72358 5178915 131 278 410 5
## 77062 5206885 120 263 384 6
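The merge-then-sort step above restores the original row order. An equivalent base-R sketch uses `match`, shown here on toy data (the ids and values are invented for illustration):

```r
# Minimal sketch: reorder predictions to follow a reference vector of ids.
pred <- data.frame(id = c(101, 102, 103, 104), y = c(10, 20, 30, 40))
original_ids <- c(103, 101, 104, 102)    # order expected in the submission file

# match() gives, for each original id, its row position in pred
pred_ordered <- pred[match(original_ids, pred$id), ]
rownames(pred_ordered) <- NULL
print(pred_ordered)
```

Unlike `merge(..., all = FALSE)`, which silently drops ids missing from one table, `match` returns `NA` for them and produces rows of `NA`, which makes missing predictions easier to spot.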
y_final %>% setDT
y_final=y_final[,-c("rank")]
sum(is.na(y_final))
## [1] 2
which(is.na(y_final))
## [1] 298933 406966
y_final[is.na(y_final)] <- 0
Write the CSV file. A Python script is then used to generate the final submission CSV.
fwrite(y_final, "y_test.csv",sep=",")
Keeping some factors
OSRM data, intervention floor, alert.reason.category, longitude intervention, latitude intervention, delta.status, weekdays, month, hours, departed from
data.fe %>% str
## Classes 'data.table' and 'data.frame': 204987 obs. of 22 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ emergency.vehicle.selection : int 4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ delta.departure.presentation : int 174 376 214 268 409 678 98 187 623 181 ...
## $ delta.selection.presentation : int 413 423 332 417 506 791 162 307 757 275 ...
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
x_test %>% str
## Classes 'data.table' and 'data.frame': 108033 obs. of 19 variables:
## $ n : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ emergency.vehicle.selection : int 4715068 4714816 4713710 4713748 4713778 4713812 4713821 4713863 4713872 4713878 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
## $ location.of.the.event : Factor w/ 196 levels "100","101","102",..: 35 20 38 48 48 1 47 155 196 38 ...
## $ longitude.intervention : num 2.34 2.28 2.28 2.34 2.41 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 708 levels "1815","1823",..: 72 351 678 59 646 681 127 460 557 421 ...
## $ emergency.vehicle.type : Factor w/ 66 levels "AR","BEAA BSPP",..: 21 53 62 9 25 62 25 62 62 62 ...
## $ rescue.center : Factor w/ 91 levels "2418","2434",..: 42 3 15 42 17 70 65 30 35 58 ...
## $ delta.status.preceding.selection.selection: int 2636 16243 597 1834 1341 2197 16 1312 263 437 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1078 1791 1451 ...
## $ OSRM.estimated.duration : num 214 218 120 250 199 ...
## $ OSRM.estimated.speed : num 21.6 38.8 32.4 25.7 26.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
data.fe.reduced.0<-data.fe[,-c("delta.departure.presentation","delta.selection.presentation","rescue.center","emergency.vehicle.type","emergency.vehicle","n","location.of.the.event","emergency.vehicle.selection")]
x_test.reduced<-x_test[,-c("rescue.center","emergency.vehicle.type","emergency.vehicle","n","location.of.the.event","emergency.vehicle.selection")]
Validation
set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.0$delta.selection.departure, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.0[trainIndex,]
valid=data.fe.reduced.0[-trainIndex,]
data.fe.reduced.0 %>% str
## Classes 'data.table' and 'data.frame': 204987 obs. of 14 variables:
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
x_test.reduced %>% str
## Classes 'data.table' and 'data.frame': 108033 obs. of 13 variables:
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
## $ longitude.intervention : num 2.34 2.28 2.28 2.34 2.41 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ delta.status.preceding.selection.selection: int 2636 16243 597 1834 1341 2197 16 1312 263 437 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1078 1791 1451 ...
## $ OSRM.estimated.duration : num 214 218 120 250 199 ...
## $ OSRM.estimated.speed : num 21.6 38.8 32.4 25.7 26.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
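Because the model is fitted on a plain numeric matrix, it is worth confirming that the training predictors and the test columns are the same and in the same order. The sketch below does this check on hypothetical column-name vectors; in the notebook the names would come from `names(data.fe.reduced.0)` and `names(x_test.reduced)`.

```r
# Hypothetical column names mirroring the reduced train and test tables
train_cols <- c("floor", "hours", "month", "delta.selection.departure")
test_cols  <- c("floor", "hours", "month")
target     <- "delta.selection.departure"

# Predictors are the training columns minus the target
predictors <- setdiff(train_cols, target)

# identical() checks both the set of names and their order
same_columns <- identical(predictors, test_cols)

# Columns present on one side only (both should be empty)
only_in_train <- setdiff(predictors, test_cols)
only_in_test  <- setdiff(test_cols, predictors)
print(same_columns)
```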
foo <- train %>% select(-delta.selection.departure)
bar <- valid %>% select(-delta.selection.departure)
dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.selection.departure)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.selection.departure)
dtest <- xgb.DMatrix(data.matrix(x_test.reduced))
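`data.matrix()` turns every factor column into its internal integer codes before the data reach XGBoost. The toy example below (hypothetical values) shows the effect on a factor shaped like `hours`.

```r
# Toy frame with a factor column similar to `hours`
toy <- data.frame(hours = factor(c("03", "09", "00")),
                  dist  = c(1283, 2347, 1525))

# Levels are sorted alphabetically: "00", "03", "09"
print(levels(toy$hours))

# data.matrix() replaces each factor value by the index of its level
m <- data.matrix(toy)
print(m)
```

For `hours` the codes follow the natural order of the hours, so tree splits on them remain meaningful. For a factor such as `weekdays`, whose levels are sorted alphabetically ("dimanche", "jeudi", ...), the codes do not follow the calendar order, which may be worth keeping in mind when reading the importance results.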
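# Note: this list is defined but not used; the xgb.train() call below passes xgb_params.
gb_params_final <- list(colsample_bytree = 0.7, # share of variables sampled per tree
                        subsample = 0.7,        # share of rows sampled per tree
                        booster = "gbtree",
                        max_depth = 5,          # maximum tree depth
                        eta = 0.3,              # learning rate (shrinkage)
                        eval_metric = "rmse",
                        objective = "reg:linear",
                        seed = 4321
)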
watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
gb_dt_final <- xgb.train(params = xgb_params,
data = dtrain,
print_every_n = 50,
watchlist = watchlist,
nrounds = 300)
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 50, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:36:02] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:110.043587 valid-rmse:110.268089
## [51] train-rmse:47.278492 valid-rmse:47.751270
## [101] train-rmse:46.977036 valid-rmse:47.623722
## [151] train-rmse:46.771313 valid-rmse:47.570621
## [201] train-rmse:46.627319 valid-rmse:47.560299
## [251] train-rmse:46.464745 valid-rmse:47.514023
## [300] train-rmse:46.327545 valid-rmse:47.511181
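The `rmse` values in the log are root mean squared errors, expressed in the same unit as the target. As a reminder of what is being reported, here is a minimal sketch of the metric on hypothetical values; `rmse()` is a helper defined here for illustration, not a function from the packages used above.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values (hypothetical), in the same unit as the target
actual    <- c(239, 47, 118, 149)
predicted <- c(200, 60, 120, 150)
print(rmse(actual, predicted))
```

Applied to `predict(gb_dt_final, dvalid)` and the validation labels, a helper like this should reproduce the last `valid-rmse` value of the log.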
After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. The CV is capped at 200 rounds, and early stopping (early_stopping_rounds = 10) ends it once the test RMSE has not improved for 10 consecutive rounds. In the run below it stops at round 158, with the best iteration at round 148.
xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=200)
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:36:37] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:109.420416+0.495150 test-rmse:109.430615+0.488056
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:84.520085+0.587442 test-rmse:84.538504+0.502799
## [3] train-rmse:69.349644+0.893923 test-rmse:69.372398+0.827324
## [4] train-rmse:60.300552+1.127178 test-rmse:60.337537+0.937264
## [5] train-rmse:54.740454+0.546901 test-rmse:54.771004+0.331087
## [6] train-rmse:51.905637+0.651854 test-rmse:51.932079+0.465446
## [7] train-rmse:50.213223+0.412485 test-rmse:50.245101+0.387258
## [8] train-rmse:49.340348+0.339408 test-rmse:49.366396+0.407458
## [9] train-rmse:48.797874+0.160656 test-rmse:48.818684+0.373239
## [10] train-rmse:48.417364+0.083016 test-rmse:48.444783+0.350240
## [11] train-rmse:48.228524+0.120281 test-rmse:48.266144+0.313282
## [12] train-rmse:48.087869+0.124237 test-rmse:48.133390+0.329473
## [13] train-rmse:47.986265+0.089676 test-rmse:48.033182+0.353658
## [14] train-rmse:47.892427+0.079926 test-rmse:47.947552+0.375428
## [15] train-rmse:47.832483+0.083574 test-rmse:47.895967+0.375571
## [16] train-rmse:47.790473+0.080415 test-rmse:47.863293+0.379066
## [17] train-rmse:47.751828+0.081082 test-rmse:47.831606+0.380974
## [18] train-rmse:47.727420+0.086136 test-rmse:47.814307+0.383756
## [19] train-rmse:47.691869+0.078602 test-rmse:47.785546+0.387505
## [20] train-rmse:47.646491+0.112707 test-rmse:47.743008+0.361829
## [21] train-rmse:47.611533+0.127790 test-rmse:47.718449+0.351909
## [22] train-rmse:47.581001+0.129603 test-rmse:47.690350+0.355777
## [23] train-rmse:47.566931+0.128581 test-rmse:47.679336+0.358482
## [24] train-rmse:47.552747+0.128625 test-rmse:47.670377+0.360645
## [25] train-rmse:47.533769+0.126592 test-rmse:47.653292+0.359117
## [26] train-rmse:47.517961+0.129046 test-rmse:47.642613+0.353710
## [27] train-rmse:47.504838+0.130470 test-rmse:47.633099+0.350346
## [28] train-rmse:47.488182+0.128669 test-rmse:47.623568+0.352864
## [29] train-rmse:47.478754+0.128524 test-rmse:47.618341+0.354121
## [30] train-rmse:47.464933+0.125357 test-rmse:47.608382+0.358899
## [31] train-rmse:47.447864+0.128427 test-rmse:47.598231+0.354538
## [32] train-rmse:47.437128+0.127135 test-rmse:47.593299+0.353727
## [33] train-rmse:47.424672+0.128209 test-rmse:47.584892+0.352185
## [34] train-rmse:47.412052+0.128769 test-rmse:47.579828+0.351704
## [35] train-rmse:47.395421+0.135913 test-rmse:47.570622+0.346784
## [36] train-rmse:47.386800+0.135271 test-rmse:47.567609+0.352190
## [37] train-rmse:47.377648+0.135101 test-rmse:47.569548+0.354002
## [38] train-rmse:47.366985+0.133217 test-rmse:47.563750+0.356662
## [39] train-rmse:47.354906+0.133437 test-rmse:47.554203+0.356047
## [40] train-rmse:47.347677+0.132827 test-rmse:47.550345+0.357539
## [41] train-rmse:47.340453+0.131505 test-rmse:47.549147+0.359843
## [42] train-rmse:47.324946+0.130718 test-rmse:47.540761+0.359252
## [43] train-rmse:47.309410+0.125738 test-rmse:47.532930+0.362823
## [44] train-rmse:47.298247+0.123099 test-rmse:47.528691+0.364755
## [45] train-rmse:47.286745+0.120426 test-rmse:47.520033+0.368465
## [46] train-rmse:47.280306+0.119586 test-rmse:47.518392+0.366660
## [47] train-rmse:47.271176+0.116132 test-rmse:47.512527+0.368884
## [48] train-rmse:47.264273+0.116093 test-rmse:47.513126+0.369346
## [49] train-rmse:47.254261+0.115051 test-rmse:47.507814+0.370803
## [50] train-rmse:47.244592+0.113336 test-rmse:47.501458+0.372071
## [51] train-rmse:47.238686+0.114972 test-rmse:47.499967+0.370398
## [52] train-rmse:47.227146+0.119079 test-rmse:47.490928+0.366837
## [53] train-rmse:47.222829+0.117205 test-rmse:47.490529+0.367030
## [54] train-rmse:47.219044+0.117141 test-rmse:47.488692+0.365355
## [55] train-rmse:47.212611+0.115741 test-rmse:47.485481+0.366452
## [56] train-rmse:47.207226+0.116960 test-rmse:47.485974+0.363875
## [57] train-rmse:47.201480+0.116104 test-rmse:47.486849+0.365250
## [58] train-rmse:47.196257+0.114605 test-rmse:47.487884+0.364872
## [59] train-rmse:47.190012+0.115193 test-rmse:47.484396+0.360961
## [60] train-rmse:47.185732+0.114597 test-rmse:47.484931+0.361209
## [61] train-rmse:47.181973+0.113139 test-rmse:47.487944+0.361337
## [62] train-rmse:47.179081+0.112625 test-rmse:47.488368+0.360251
## [63] train-rmse:47.172899+0.113509 test-rmse:47.486959+0.361472
## [64] train-rmse:47.166243+0.114017 test-rmse:47.484556+0.356938
## [65] train-rmse:47.161686+0.114000 test-rmse:47.483697+0.356843
## [66] train-rmse:47.156447+0.114541 test-rmse:47.484074+0.357742
## [67] train-rmse:47.150737+0.114146 test-rmse:47.483385+0.357810
## [68] train-rmse:47.144992+0.113556 test-rmse:47.479279+0.357175
## [69] train-rmse:47.141460+0.113787 test-rmse:47.478683+0.354675
## [70] train-rmse:47.132628+0.113724 test-rmse:47.473325+0.357615
## [71] train-rmse:47.125207+0.114018 test-rmse:47.472322+0.357278
## [72] train-rmse:47.112384+0.122296 test-rmse:47.461309+0.348387
## [73] train-rmse:47.108363+0.121871 test-rmse:47.461949+0.351203
## [74] train-rmse:47.097643+0.118741 test-rmse:47.454028+0.355086
## [75] train-rmse:47.086817+0.121844 test-rmse:47.446861+0.352881
## [76] train-rmse:47.083521+0.121895 test-rmse:47.448671+0.352664
## [77] train-rmse:47.079192+0.121195 test-rmse:47.450802+0.352935
## [78] train-rmse:47.070845+0.118291 test-rmse:47.448384+0.357312
## [79] train-rmse:47.067351+0.118595 test-rmse:47.449017+0.358714
## [80] train-rmse:47.063701+0.120803 test-rmse:47.447142+0.357918
## [81] train-rmse:47.057875+0.123965 test-rmse:47.442821+0.355277
## [82] train-rmse:47.052542+0.122954 test-rmse:47.443634+0.355643
## [83] train-rmse:47.047643+0.121431 test-rmse:47.441110+0.357339
## [84] train-rmse:47.040910+0.118579 test-rmse:47.436395+0.360167
## [85] train-rmse:47.036364+0.118335 test-rmse:47.433598+0.361451
## [86] train-rmse:47.030484+0.115679 test-rmse:47.430134+0.364509
## [87] train-rmse:47.026946+0.114941 test-rmse:47.431322+0.363012
## [88] train-rmse:47.023735+0.115903 test-rmse:47.432223+0.362066
## [89] train-rmse:47.020015+0.115515 test-rmse:47.432826+0.366036
## [90] train-rmse:47.015485+0.117276 test-rmse:47.432697+0.364722
## [91] train-rmse:47.010017+0.119581 test-rmse:47.433095+0.362567
## [92] train-rmse:47.007132+0.119785 test-rmse:47.431719+0.363207
## [93] train-rmse:47.003623+0.122485 test-rmse:47.431351+0.360890
## [94] train-rmse:46.998022+0.120368 test-rmse:47.429740+0.361636
## [95] train-rmse:46.991802+0.119639 test-rmse:47.429117+0.363401
## [96] train-rmse:46.988226+0.120151 test-rmse:47.427172+0.362112
## [97] train-rmse:46.983573+0.118707 test-rmse:47.426582+0.361713
## [98] train-rmse:46.979624+0.117132 test-rmse:47.425566+0.363019
## [99] train-rmse:46.972733+0.116868 test-rmse:47.422822+0.362955
## [100] train-rmse:46.967827+0.117163 test-rmse:47.423319+0.364356
## [101] train-rmse:46.962444+0.117891 test-rmse:47.418350+0.364594
## [102] train-rmse:46.957264+0.116050 test-rmse:47.418996+0.364431
## [103] train-rmse:46.954103+0.115924 test-rmse:47.419227+0.362320
## [104] train-rmse:46.950755+0.117457 test-rmse:47.418195+0.362387
## [105] train-rmse:46.944431+0.115788 test-rmse:47.418155+0.362797
## [106] train-rmse:46.939888+0.116501 test-rmse:47.418310+0.360590
## [107] train-rmse:46.927883+0.113873 test-rmse:47.408280+0.362281
## [108] train-rmse:46.921961+0.112111 test-rmse:47.406422+0.362239
## [109] train-rmse:46.918815+0.112391 test-rmse:47.407284+0.362622
## [110] train-rmse:46.914907+0.112182 test-rmse:47.407691+0.362662
## [111] train-rmse:46.912280+0.112097 test-rmse:47.407097+0.361399
## [112] train-rmse:46.907418+0.111966 test-rmse:47.406506+0.362504
## [113] train-rmse:46.899776+0.110950 test-rmse:47.406089+0.362396
## [114] train-rmse:46.894140+0.111513 test-rmse:47.405962+0.361742
## [115] train-rmse:46.887976+0.117877 test-rmse:47.402118+0.355749
## [116] train-rmse:46.883464+0.120656 test-rmse:47.401146+0.352255
## [117] train-rmse:46.878951+0.121676 test-rmse:47.400244+0.349570
## [118] train-rmse:46.872687+0.122247 test-rmse:47.396672+0.350451
## [119] train-rmse:46.869488+0.122241 test-rmse:47.397435+0.350093
## [120] train-rmse:46.865553+0.121685 test-rmse:47.396568+0.349413
## [121] train-rmse:46.862210+0.121903 test-rmse:47.394488+0.348301
## [122] train-rmse:46.856686+0.120815 test-rmse:47.391507+0.348383
## [123] train-rmse:46.848929+0.120998 test-rmse:47.389822+0.350380
## [124] train-rmse:46.844705+0.120386 test-rmse:47.391432+0.350353
## [125] train-rmse:46.839776+0.117789 test-rmse:47.390014+0.351678
## [126] train-rmse:46.831003+0.114048 test-rmse:47.381560+0.354198
## [127] train-rmse:46.825771+0.115639 test-rmse:47.382244+0.351833
## [128] train-rmse:46.821711+0.114687 test-rmse:47.382464+0.352091
## [129] train-rmse:46.817688+0.112751 test-rmse:47.381277+0.354069
## [130] train-rmse:46.815636+0.111954 test-rmse:47.383003+0.352803
## [131] train-rmse:46.811386+0.112408 test-rmse:47.381155+0.354327
## [132] train-rmse:46.807091+0.113886 test-rmse:47.383242+0.352926
## [133] train-rmse:46.805931+0.114077 test-rmse:47.384218+0.352648
## [134] train-rmse:46.802422+0.114779 test-rmse:47.383825+0.353306
## [135] train-rmse:46.799052+0.114900 test-rmse:47.382378+0.351551
## [136] train-rmse:46.793046+0.114889 test-rmse:47.378210+0.349983
## [137] train-rmse:46.787291+0.118686 test-rmse:47.375133+0.347623
## [138] train-rmse:46.783862+0.118465 test-rmse:47.375242+0.346744
## [139] train-rmse:46.778656+0.117090 test-rmse:47.371983+0.348917
## [140] train-rmse:46.772878+0.116186 test-rmse:47.370814+0.350075
## [141] train-rmse:46.768692+0.114241 test-rmse:47.370957+0.353833
## [142] train-rmse:46.765939+0.114770 test-rmse:47.370976+0.354461
## [143] train-rmse:46.759020+0.114503 test-rmse:47.367326+0.354104
## [144] train-rmse:46.754312+0.114235 test-rmse:47.365516+0.354381
## [145] train-rmse:46.751513+0.113673 test-rmse:47.365415+0.353358
## [146] train-rmse:46.748609+0.112894 test-rmse:47.366879+0.353079
## [147] train-rmse:46.744514+0.113231 test-rmse:47.365620+0.353462
## [148] train-rmse:46.741552+0.113294 test-rmse:47.364305+0.355637
## [149] train-rmse:46.738111+0.113935 test-rmse:47.366345+0.354083
## [150] train-rmse:46.734845+0.113496 test-rmse:47.366410+0.355514
## [151] train-rmse:46.732237+0.113066 test-rmse:47.365012+0.355768
## [152] train-rmse:46.729189+0.113296 test-rmse:47.368489+0.355938
## [153] train-rmse:46.726626+0.114061 test-rmse:47.370652+0.356829
## [154] train-rmse:46.723274+0.113612 test-rmse:47.373837+0.358335
## [155] train-rmse:46.720256+0.113988 test-rmse:47.372818+0.360519
## [156] train-rmse:46.716094+0.114141 test-rmse:47.374747+0.362379
## [157] train-rmse:46.713908+0.113619 test-rmse:47.377541+0.364460
## [158] train-rmse:46.709586+0.115113 test-rmse:47.376457+0.364729
## Stopping. Best iteration:
## [148] train-rmse:46.741552+0.113294 test-rmse:47.364305+0.355637
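The best iteration reported above is simply the round with the lowest mean test RMSE, and the run stops 10 rounds later because no further improvement was seen. The sketch below reproduces that selection on a small excerpt of the log; the data frame and its column names are assumptions made for illustration, not the object returned by `xgb.cv`.

```r
# Excerpt of the CV log above (mean test RMSE for rounds 146 to 150)
cv_log <- data.frame(
  iter           = 146:150,
  test_rmse_mean = c(47.366879, 47.365620, 47.364305, 47.366345, 47.366410)
)

# Best round = round with the lowest mean test RMSE
best_row  <- which.min(cv_log$test_rmse_mean)
best_iter <- cv_log$iter[best_row]
best_rmse <- cv_log$test_rmse_mean[best_row]
print(c(best_iter = best_iter, best_rmse = best_rmse))
```

The selected round could then be used as `nrounds` when refitting the final model, instead of the fixed 300 rounds used above.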
importance_matrix_y0 <- xgb.importance(model = gb_dt_final)
# Only 13 features are available; head() avoids the NA rows that indexing with [1:30,] produces
xgb.plot.importance(head(importance_matrix_y0, 30))
head(importance_matrix_y0, 30)
##                                        Feature        Gain       Cover   Frequency
##  1:                                      hours 0.467613973 0.077723717 0.079628725
##  2: delta.status.preceding.selection.selection 0.238602999 0.177045631 0.161700049
##  3:                      latitude.intervention 0.056169580 0.155942974 0.144113337
##  4:                      alert.reason.category 0.052726177 0.033613103 0.038104543
##  5:                     longitude.intervention 0.050120650 0.133622233 0.130923302
##  6:                       OSRM.estimated.speed 0.033660641 0.113168119 0.113336590
##  7:                    OSRM.estimated.duration 0.031845807 0.118622759 0.098680997
##  8:                    OSRM.estimated.distance 0.028495466 0.119362661 0.117733268
##  9:                                      month 0.013864727 0.019638328 0.040058622
## 10:                                   weekdays 0.011525886 0.022758270 0.029311187
## 11:                                      floor 0.006960137 0.017369730 0.032242306
## 12:            departed.from.its.rescue.center 0.006033114 0.007403547 0.007327797
## 13:               intervention.on.public.roads 0.002380842 0.003728929 0.006839277
yo_preds <- predict(gb_dt_final,dtest)
The prediction step works:
yo_preds %>% head
## [1] 139.9449 136.5910 120.4235 158.9106 152.7684 163.7142
data.fe.reduced.1<-data.fe[,-c("delta.selection.departure","delta.selection.presentation","rescue.center","emergency.vehicle.type","emergency.vehicle","n","location.of.the.event","emergency.vehicle.selection")]
Validation
set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.1$delta.departure.presentation, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.1[trainIndex,]
valid=data.fe.reduced.1[-trainIndex,]
foo <- train %>% select(-delta.departure.presentation)
bar <- valid %>% select(-delta.departure.presentation)
dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.departure.presentation)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.departure.presentation)
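# Note: this list is defined but not used; the xgb.train() call below passes xgb_params.
gb_params_final <- list(colsample_bytree = 0.7, # share of variables sampled per tree
                        subsample = 0.7,        # share of rows sampled per tree
                        booster = "gbtree",
                        max_depth = 5,          # maximum tree depth
                        eta = 0.3,              # learning rate (shrinkage)
                        eval_metric = "rmse",
                        objective = "reg:linear",
                        seed = 4321
)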
watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
gb_dt_final <- xgb.train(params = xgb_params,
data = dtrain,
print_every_n = 5,
watchlist = watchlist,
nrounds = 300)
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 5, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:37:21] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:281.581818 valid-rmse:281.096405
## [6] train-rmse:137.590942 valid-rmse:137.396545
## [11] train-rmse:129.172226 valid-rmse:129.130997
## [16] train-rmse:128.197983 valid-rmse:128.248779
## [21] train-rmse:127.834221 valid-rmse:127.959030
## [26] train-rmse:127.529221 valid-rmse:127.764519
## [31] train-rmse:127.313118 valid-rmse:127.610802
## [36] train-rmse:127.108261 valid-rmse:127.415588
## [41] train-rmse:126.971756 valid-rmse:127.301468
## [46] train-rmse:126.766457 valid-rmse:127.153137
## [51] train-rmse:126.650963 valid-rmse:127.076904
## [56] train-rmse:126.551292 valid-rmse:127.025551
## [61] train-rmse:126.418633 valid-rmse:126.859032
## [66] train-rmse:126.260788 valid-rmse:126.729599
## [71] train-rmse:126.178810 valid-rmse:126.681183
## [76] train-rmse:126.072586 valid-rmse:126.593849
## [81] train-rmse:126.011520 valid-rmse:126.566483
## [86] train-rmse:125.930672 valid-rmse:126.507385
## [91] train-rmse:125.846268 valid-rmse:126.471138
## [96] train-rmse:125.752846 valid-rmse:126.437157
## [101] train-rmse:125.673805 valid-rmse:126.377579
## [106] train-rmse:125.605606 valid-rmse:126.320541
## [111] train-rmse:125.545403 valid-rmse:126.266739
## [116] train-rmse:125.478317 valid-rmse:126.240616
## [121] train-rmse:125.421829 valid-rmse:126.227203
## [126] train-rmse:125.356369 valid-rmse:126.183113
## [131] train-rmse:125.319839 valid-rmse:126.145607
## [136] train-rmse:125.266129 valid-rmse:126.131760
## [141] train-rmse:125.217155 valid-rmse:126.135361
## [146] train-rmse:125.175476 valid-rmse:126.117546
## [151] train-rmse:125.113052 valid-rmse:126.080421
## [156] train-rmse:125.049934 valid-rmse:126.043907
## [161] train-rmse:125.003387 valid-rmse:126.022049
## [166] train-rmse:124.929024 valid-rmse:126.002174
## [171] train-rmse:124.896721 valid-rmse:126.006470
## [176] train-rmse:124.847977 valid-rmse:126.005798
## [181] train-rmse:124.791206 valid-rmse:125.986488
## [186] train-rmse:124.735176 valid-rmse:125.968216
## [191] train-rmse:124.691116 valid-rmse:125.955620
## [196] train-rmse:124.627815 valid-rmse:125.921906
## [201] train-rmse:124.601868 valid-rmse:125.902725
## [206] train-rmse:124.550392 valid-rmse:125.891167
## [211] train-rmse:124.520157 valid-rmse:125.899101
## [216] train-rmse:124.463165 valid-rmse:125.874977
## [221] train-rmse:124.418488 valid-rmse:125.842552
## [226] train-rmse:124.382301 valid-rmse:125.846573
## [231] train-rmse:124.358192 valid-rmse:125.866714
## [236] train-rmse:124.314766 valid-rmse:125.854462
## [241] train-rmse:124.280281 valid-rmse:125.850349
## [246] train-rmse:124.253571 valid-rmse:125.844170
## [251] train-rmse:124.199898 valid-rmse:125.850151
## [256] train-rmse:124.154457 valid-rmse:125.849022
## [261] train-rmse:124.123108 valid-rmse:125.879662
## [266] train-rmse:124.091957 valid-rmse:125.901802
## [271] train-rmse:124.066185 valid-rmse:125.906349
## [276] train-rmse:124.027557 valid-rmse:125.895996
## [281] train-rmse:123.982994 valid-rmse:125.882622
## [286] train-rmse:123.938988 valid-rmse:125.872154
## [291] train-rmse:123.906761 valid-rmse:125.844421
## [296] train-rmse:123.872627 valid-rmse:125.858078
## [300] train-rmse:123.847771 valid-rmse:125.851135
After fitting, we again run a 5-fold cross-validation (CV) to estimate the model's performance, this time for the second target. The CV is capped at 150 rounds, and early stopping (early_stopping_rounds = 10) ends it once the test RMSE has not improved for 10 consecutive rounds.
xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=150)
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:37:45] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:281.523804+0.414619 test-rmse:281.544348+0.724674
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:219.081528+1.673523 test-rmse:219.131796+1.536183
## [3] train-rmse:179.514246+1.093519 test-rmse:179.601877+1.104861
## [4] train-rmse:156.713403+0.879909 test-rmse:156.862521+0.670584
## [5] train-rmse:144.033722+0.454121 test-rmse:144.197638+0.724830
## [6] train-rmse:137.140607+0.336102 test-rmse:137.323874+0.934584
## [7] train-rmse:133.426883+0.241605 test-rmse:133.607410+0.751621
## [8] train-rmse:131.365689+0.216542 test-rmse:131.546515+0.666350
## [9] train-rmse:130.265457+0.190730 test-rmse:130.463403+0.659083
## [10] train-rmse:129.582663+0.219886 test-rmse:129.809027+0.613893
## [11] train-rmse:129.167074+0.236014 test-rmse:129.406039+0.602272
## [12] train-rmse:128.885593+0.230563 test-rmse:129.133887+0.602641
## [13] train-rmse:128.619315+0.219849 test-rmse:128.877890+0.613587
## [14] train-rmse:128.430792+0.220800 test-rmse:128.709418+0.614585
## [15] train-rmse:128.271570+0.204786 test-rmse:128.550174+0.623793
## [16] train-rmse:128.143535+0.196471 test-rmse:128.434557+0.630546
## [17] train-rmse:128.022392+0.179914 test-rmse:128.328053+0.639083
## [18] train-rmse:127.944290+0.166722 test-rmse:128.267088+0.650849
## [19] train-rmse:127.849617+0.174530 test-rmse:128.181015+0.649259
## [20] train-rmse:127.775432+0.177344 test-rmse:128.119374+0.644230
## [21] train-rmse:127.717990+0.177802 test-rmse:128.076634+0.648400
## [22] train-rmse:127.653108+0.184064 test-rmse:128.018414+0.642460
## [23] train-rmse:127.581145+0.175926 test-rmse:127.952537+0.653128
## [24] train-rmse:127.511911+0.166312 test-rmse:127.899674+0.654002
## [25] train-rmse:127.467285+0.164262 test-rmse:127.861557+0.655209
## [26] train-rmse:127.411963+0.153371 test-rmse:127.827664+0.664448
## [27] train-rmse:127.357196+0.160077 test-rmse:127.780997+0.658795
## [28] train-rmse:127.295271+0.182832 test-rmse:127.730334+0.637080
## [29] train-rmse:127.252402+0.181873 test-rmse:127.709767+0.640776
## [30] train-rmse:127.216164+0.179629 test-rmse:127.683695+0.632499
## [31] train-rmse:127.172676+0.186846 test-rmse:127.651163+0.633243
## [32] train-rmse:127.133798+0.180836 test-rmse:127.622701+0.637106
## [33] train-rmse:127.089110+0.177977 test-rmse:127.591965+0.629584
## [34] train-rmse:127.058234+0.185618 test-rmse:127.570425+0.620966
## [35] train-rmse:127.022231+0.179270 test-rmse:127.537398+0.628395
## [36] train-rmse:126.977962+0.182050 test-rmse:127.499119+0.632896
## [37] train-rmse:126.953825+0.178533 test-rmse:127.479462+0.647071
## [38] train-rmse:126.914813+0.191309 test-rmse:127.444212+0.639214
## [39] train-rmse:126.872835+0.185809 test-rmse:127.411391+0.651816
## [40] train-rmse:126.833978+0.178700 test-rmse:127.380124+0.661537
## [41] train-rmse:126.796616+0.176480 test-rmse:127.353807+0.668091
## [42] train-rmse:126.767587+0.175553 test-rmse:127.331142+0.666778
## [43] train-rmse:126.737492+0.179754 test-rmse:127.313023+0.666859
## [44] train-rmse:126.713512+0.175745 test-rmse:127.294949+0.670462
## [45] train-rmse:126.689593+0.167029 test-rmse:127.281517+0.681222
## [46] train-rmse:126.651088+0.179210 test-rmse:127.260617+0.673548
## [47] train-rmse:126.624667+0.169826 test-rmse:127.252069+0.685589
## [48] train-rmse:126.601259+0.171284 test-rmse:127.239157+0.676947
## [49] train-rmse:126.568806+0.174550 test-rmse:127.220949+0.669404
## [50] train-rmse:126.553667+0.179115 test-rmse:127.215584+0.667684
## [51] train-rmse:126.526750+0.180720 test-rmse:127.192891+0.672631
## [52] train-rmse:126.505202+0.174346 test-rmse:127.176755+0.677513
## [53] train-rmse:126.490799+0.179564 test-rmse:127.173900+0.674205
## [54] train-rmse:126.460991+0.176703 test-rmse:127.157059+0.673474
## [55] train-rmse:126.437930+0.175104 test-rmse:127.141040+0.670218
## [56] train-rmse:126.410649+0.167153 test-rmse:127.129221+0.676371
## [57] train-rmse:126.393768+0.168644 test-rmse:127.124442+0.677805
## [58] train-rmse:126.373703+0.176651 test-rmse:127.115929+0.666780
## [59] train-rmse:126.347380+0.174560 test-rmse:127.097687+0.667820
## [60] train-rmse:126.330393+0.174953 test-rmse:127.090921+0.673389
## [61] train-rmse:126.320430+0.178680 test-rmse:127.089604+0.672929
## [62] train-rmse:126.306044+0.182754 test-rmse:127.087215+0.673506
## [63] train-rmse:126.286730+0.181635 test-rmse:127.079835+0.673497
## [64] train-rmse:126.259903+0.191361 test-rmse:127.066576+0.670915
## [65] train-rmse:126.246524+0.198716 test-rmse:127.056656+0.666946
## [66] train-rmse:126.226643+0.193043 test-rmse:127.045021+0.678763
## [67] train-rmse:126.215774+0.194754 test-rmse:127.039389+0.683730
## [68] train-rmse:126.196167+0.196994 test-rmse:127.025725+0.679628
## [69] train-rmse:126.176482+0.198582 test-rmse:127.013895+0.679368
## [70] train-rmse:126.151678+0.192077 test-rmse:127.006059+0.688230
## [71] train-rmse:126.134698+0.195938 test-rmse:127.001901+0.681984
## [72] train-rmse:126.108124+0.196697 test-rmse:126.991725+0.678874
## [73] train-rmse:126.089851+0.198317 test-rmse:126.982728+0.680452
## [74] train-rmse:126.072423+0.197131 test-rmse:126.972639+0.679232
## [75] train-rmse:126.056810+0.190282 test-rmse:126.962967+0.686312
## [76] train-rmse:126.024744+0.198343 test-rmse:126.939092+0.675856
## [77] train-rmse:126.000297+0.192010 test-rmse:126.919551+0.688342
## [78] train-rmse:125.983716+0.192478 test-rmse:126.911766+0.688660
## [79] train-rmse:125.968008+0.190795 test-rmse:126.903821+0.691191
## [80] train-rmse:125.951915+0.188925 test-rmse:126.898318+0.691721
## [81] train-rmse:125.938568+0.189416 test-rmse:126.891689+0.686983
## [82] train-rmse:125.927505+0.190296 test-rmse:126.893272+0.689856
## [83] train-rmse:125.911993+0.195796 test-rmse:126.887067+0.686134
## [84] train-rmse:125.885525+0.191317 test-rmse:126.872099+0.690821
## [85] train-rmse:125.875354+0.192467 test-rmse:126.874057+0.689955
## [86] train-rmse:125.858795+0.192809 test-rmse:126.867731+0.696176
## [87] train-rmse:125.844838+0.193044 test-rmse:126.872694+0.697327
## [88] train-rmse:125.831219+0.192605 test-rmse:126.870186+0.697128
## [89] train-rmse:125.815009+0.193552 test-rmse:126.864787+0.696106
## [90] train-rmse:125.798550+0.195410 test-rmse:126.852177+0.694531
## [91] train-rmse:125.789036+0.196389 test-rmse:126.855126+0.688573
## [92] train-rmse:125.771703+0.196521 test-rmse:126.847206+0.693832
## [93] train-rmse:125.761838+0.196215 test-rmse:126.843057+0.691604
## [94] train-rmse:125.748201+0.197899 test-rmse:126.841521+0.697092
## [95] train-rmse:125.731314+0.196684 test-rmse:126.838306+0.696396
## [96] train-rmse:125.716672+0.195217 test-rmse:126.828497+0.697162
## [97] train-rmse:125.702783+0.194351 test-rmse:126.824394+0.700524
## [98] train-rmse:125.689615+0.193632 test-rmse:126.818106+0.700841
## [99] train-rmse:125.674144+0.196865 test-rmse:126.812309+0.697225
## [100] train-rmse:125.658310+0.198158 test-rmse:126.807248+0.700850
## [101] train-rmse:125.622958+0.191371 test-rmse:126.783844+0.708908
## [102] train-rmse:125.610875+0.189268 test-rmse:126.783942+0.708402
## [103] train-rmse:125.602299+0.187095 test-rmse:126.778578+0.711936
## [104] train-rmse:125.587270+0.183929 test-rmse:126.769690+0.714653
## [105] train-rmse:125.573859+0.185173 test-rmse:126.763026+0.716550
## [106] train-rmse:125.555656+0.182182 test-rmse:126.760545+0.720544
## [107] train-rmse:125.542724+0.186330 test-rmse:126.755475+0.718355
## [108] train-rmse:125.531265+0.185470 test-rmse:126.757770+0.717427
## [109] train-rmse:125.520441+0.183724 test-rmse:126.755988+0.719492
## [110] train-rmse:125.502649+0.185467 test-rmse:126.749359+0.715524
## [111] train-rmse:125.487388+0.189911 test-rmse:126.750496+0.709536
## [112] train-rmse:125.465802+0.182059 test-rmse:126.743807+0.713259
## [113] train-rmse:125.444939+0.184126 test-rmse:126.736319+0.708900
## [114] train-rmse:125.429964+0.178903 test-rmse:126.731203+0.709725
## [115] train-rmse:125.418277+0.177989 test-rmse:126.730649+0.715230
## [116] train-rmse:125.406596+0.178234 test-rmse:126.722790+0.717361
## [117] train-rmse:125.392983+0.175356 test-rmse:126.710738+0.723223
## [118] train-rmse:125.380408+0.172372 test-rmse:126.705493+0.727824
## [119] train-rmse:125.367438+0.176569 test-rmse:126.701080+0.724513
## [120] train-rmse:125.350941+0.178758 test-rmse:126.695539+0.722733
## [121] train-rmse:125.340952+0.180879 test-rmse:126.691469+0.720498
## [122] train-rmse:125.331378+0.181274 test-rmse:126.684915+0.722644
## [123] train-rmse:125.321429+0.181038 test-rmse:126.684486+0.717763
## [124] train-rmse:125.306696+0.179692 test-rmse:126.670096+0.720394
## [125] train-rmse:125.287822+0.172014 test-rmse:126.661996+0.728285
## [126] train-rmse:125.277008+0.169052 test-rmse:126.655484+0.728027
## [127] train-rmse:125.266467+0.167109 test-rmse:126.652710+0.732913
## [128] train-rmse:125.247157+0.170217 test-rmse:126.639624+0.729516
## [129] train-rmse:125.238914+0.169107 test-rmse:126.641750+0.728962
## [130] train-rmse:125.230951+0.169467 test-rmse:126.638780+0.725616
## [131] train-rmse:125.213278+0.169051 test-rmse:126.636588+0.725191
## [132] train-rmse:125.198699+0.167815 test-rmse:126.632719+0.721715
## [133] train-rmse:125.188091+0.168962 test-rmse:126.628691+0.716397
## [134] train-rmse:125.174799+0.172753 test-rmse:126.626431+0.711312
## [135] train-rmse:125.162801+0.176972 test-rmse:126.627159+0.711524
## [136] train-rmse:125.150217+0.177558 test-rmse:126.627708+0.709469
## [137] train-rmse:125.140280+0.173189 test-rmse:126.624734+0.712262
## [138] train-rmse:125.130373+0.168764 test-rmse:126.624202+0.714721
## [139] train-rmse:125.120879+0.166527 test-rmse:126.627045+0.716779
## [140] train-rmse:125.110635+0.164745 test-rmse:126.623828+0.717670
## [141] train-rmse:125.102263+0.161268 test-rmse:126.618884+0.717977
## [142] train-rmse:125.092708+0.158324 test-rmse:126.621751+0.715553
## [143] train-rmse:125.082019+0.155436 test-rmse:126.617722+0.718351
## [144] train-rmse:125.070909+0.159067 test-rmse:126.614075+0.716664
## [145] train-rmse:125.059610+0.156242 test-rmse:126.611324+0.711578
## [146] train-rmse:125.050227+0.154634 test-rmse:126.607147+0.715556
## [147] train-rmse:125.034125+0.151428 test-rmse:126.596477+0.715755
## [148] train-rmse:125.009386+0.147483 test-rmse:126.583774+0.720781
## [149] train-rmse:124.998654+0.148988 test-rmse:126.578406+0.716137
## [150] train-rmse:124.989412+0.150454 test-rmse:126.579814+0.712278
y1_preds <- predict(gb_dt_final,dtest)
y1_preds %>% head
## [1] 313.6901 397.7925 262.8186 314.9897 281.3379 263.2837
pred_ys=yo_preds+y1_preds
pred_final=data.frame(yo_preds,y1_preds,pred_ys)
pred_final %>% head
## yo_preds y1_preds pred_ys
## 1 139.9449 313.6901 453.6350
## 2 136.5910 397.7925 534.3835
## 3 120.4235 262.8186 383.2422
## 4 158.9106 314.9897 473.9003
## 5 152.7684 281.3379 434.1063
## 6 163.7142 263.2837 426.9979
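The total predicted response time above is the sum of the two component predictions. As a minimal sketch, assuming toy numbers that are hypothetical and not taken from the data, such a summed prediction could be scored against an observed total delay with the same RMSE metric reported in the training logs. The helper name rmse and all values below are illustrative assumptions.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical component predictions (seconds): departure delay and transit time.
departure_pred <- c(140, 130, 120)
transit_pred   <- c(310, 400, 260)

# Total response time is the sum of both components.
total_pred <- departure_pred + transit_pred

# Hypothetical observed total delays used only for illustration.
observed_total <- c(460, 520, 390)
score <- rmse(observed_total, total_pred)
print(score)
```

Scoring the sum directly is useful because the errors of the two component models do not simply add up.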
pred_final$id<-x_test$emergency.vehicle.selection
Change the column order so that id comes first
pred_final <- pred_final[, c(4, 1, 2, 3)]
Restore the original row order
pred_final %>% head
## id yo_preds y1_preds pred_ys
## 1 4715068 139.9449 313.6901 453.6350
## 2 4714816 136.5910 397.7925 534.3835
## 3 4713710 120.4235 262.8186 383.2422
## 4 4713748 158.9106 314.9897 473.9003
## 5 4713778 152.7684 281.3379 434.1063
## 6 4713812 163.7142 263.2837 426.9979
id_order %>% head
## x_test.emergency.vehicle.selection rank
## 1 5271704 1
## 2 5092931 2
## 3 5153756 3
## 4 5355572 4
## 5 5178915 5
## 6 5206885 6
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
id_order %>% head
## x_test.emergency.vehicle.selection rank rank
## 1 5271704 1 1
## 2 5092931 2 2
## 3 5153756 3 3
## 4 5355572 4 4
## 5 5178915 5 5
## 6 5206885 6 6
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
## id yo_preds y1_preds pred_ys rank rank.1
## 1 4713710 120.4235 262.8186 383.2422 14082 14082
## 2 4713748 158.9106 314.9897 473.9003 25965 25965
## 3 4713778 152.7684 281.3379 434.1063 14465 14465
## 4 4713812 163.7142 263.2837 426.9979 79530 79530
## 5 4713821 133.7922 488.3494 622.1416 47824 47824
## 6 4713863 144.9620 233.3339 378.2959 2072 2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
## id yo_preds y1_preds pred_ys rank rank.1
## 77897 5271704 116.8953 414.5863 531.4816 1 1
## 59352 5092931 118.0171 374.9510 492.9681 2 2
## 68483 5153756 116.5670 207.0299 323.5969 3 3
## 91242 5355572 210.9534 302.8663 513.8197 4 4
## 72358 5178915 131.1625 273.5441 404.7067 5 5
## 77062 5206885 116.8149 273.0493 389.8642 6 6
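The merge followed by a sort on rank restores the original id order. A possible alternative, sketched here on hypothetical toy data rather than on pred_final and id_order, is to index the predictions directly with match(), which avoids creating and then dropping the helper rank columns. The names pred_toy and original_ids are assumptions made for this illustration.

```r
# Hypothetical toy data standing in for the predictions and the original id order.
pred_toy <- data.frame(id = c(30, 10, 20), pred_ys = c(3.3, 1.1, 2.2))
original_ids <- c(10, 20, 30)

# match() returns, for each original id, its row position in pred_toy.
row_index <- match(original_ids, pred_toy$id)

# Fail early if any original id has no prediction.
stopifnot(!anyNA(row_index))

# Reorder the predictions and reset the row names.
restored <- pred_toy[row_index, ]
rownames(restored) <- NULL
print(restored)
```

Unlike merge() with all = FALSE, this approach makes a missing id an explicit error instead of silently dropping the row.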
y_final %>% setDT
y_final=y_final[,-c("rank.1")]
y_final %>% head
## id yo_preds y1_preds pred_ys rank
## 1: 5271704 116.8953 414.5863 531.4816 1
## 2: 5092931 118.0171 374.9510 492.9681 2
## 3: 5153756 116.5670 207.0299 323.5969 3
## 4: 5355572 210.9534 302.8663 513.8197 4
## 5: 5178915 131.1625 273.5441 404.7067 5
## 6: 5206885 116.8149 273.0493 389.8642 6
sum(is.na(y_final))
## [1] 0
which(is.na(y_final))
## integer(0)
y_final[is.na(y_final)] <- 0
Write the CSV file. A separate Python script is then used to generate the correctly formatted CSV.
fwrite(y_final, "y_test.csv",sep=",")
Keeping some factors
OSRM data, intervention floor, alert.reason.category, intervention longitude, intervention latitude, delta.status, weekdays, month, hours, departed from
data.fe %>% str
## Classes 'data.table' and 'data.frame': 204987 obs. of 22 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ emergency.vehicle.selection : int 4714126 4714817 4713701 4713715 4713916 4713754 4713742 4713752 4713762 4713791 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ delta.departure.presentation : int 174 376 214 268 409 678 98 187 623 181 ...
## $ delta.selection.presentation : int 413 423 332 417 506 791 162 307 757 275 ...
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
x_test %>% str
## Classes 'data.table' and 'data.frame': 108033 obs. of 19 variables:
## $ n : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ emergency.vehicle.selection : int 4715068 4714816 4713710 4713748 4713778 4713812 4713821 4713863 4713872 4713878 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
## $ location.of.the.event : Factor w/ 196 levels "100","101","102",..: 35 20 38 48 48 1 47 155 196 38 ...
## $ longitude.intervention : num 2.34 2.28 2.28 2.34 2.41 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 708 levels "1815","1823",..: 72 351 678 59 646 681 127 460 557 421 ...
## $ emergency.vehicle.type : Factor w/ 66 levels "AR","BEAA BSPP",..: 21 53 62 9 25 62 25 62 62 62 ...
## $ rescue.center : Factor w/ 91 levels "2418","2434",..: 42 3 15 42 17 70 65 30 35 58 ...
## $ delta.status.preceding.selection.selection: int 2636 16243 597 1834 1341 2197 16 1312 263 437 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1078 1791 1451 ...
## $ OSRM.estimated.duration : num 214 218 120 250 199 ...
## $ OSRM.estimated.speed : num 21.6 38.8 32.4 25.7 26.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
data.fe.reduced.0<-data.fe[,-c("delta.departure.presentation","delta.selection.presentation","emergency.vehicle.selection")]
x_test.reduced<-x_test[,-c("emergency.vehicle.selection")]
Validation
set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.0$delta.selection.departure, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.0[trainIndex,]
valid=data.fe.reduced.0[-trainIndex,]
data.fe.reduced.0 %>% str
## Classes 'data.table' and 'data.frame': 204987 obs. of 19 variables:
## $ n : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 3 1 1 3 1 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 2 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 10 3 4 3 3 3 3 3 ...
## $ location.of.the.event : Factor w/ 210 levels "100","101","102",..: 36 21 39 48 39 65 48 1 48 49 ...
## $ longitude.intervention : num 2.34 2.28 2.33 2.3 2.2 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 639 levels "1815","1823",..: 318 92 488 398 442 598 308 445 69 125 ...
## $ emergency.vehicle.type : Factor w/ 41 levels "AR","BEAA BSPP",..: 24 24 37 37 37 15 24 37 24 8 ...
## $ rescue.center : Factor w/ 79 levels "2418","2434",..: 41 3 63 15 58 5 72 27 28 6 ...
## $ delta.status.preceding.selection.selection: int 8293 16251 875 606 4693 86 7 1382 2062 968 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1525 1812 2586 ...
## $ OSRM.estimated.duration : num 214 218 173 198 280 ...
## $ delta.selection.departure : int 239 47 118 149 97 113 64 120 134 94 ...
## $ OSRM.estimated.speed : num 21.6 38.8 31.7 33 33.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 4 10 1 1 2 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
x_test.reduced %>% str
## Classes 'data.table' and 'data.frame': 108033 obs. of 18 variables:
## $ n : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alert.reason.category : Factor w/ 9 levels "1","2","3","4",..: 1 1 3 3 1 3 1 3 3 3 ...
## $ intervention.on.public.roads : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
## $ floor : Factor w/ 20 levels "-2","-1","0",..: 8 6 7 3 3 3 3 4 3 8 ...
## $ location.of.the.event : Factor w/ 196 levels "100","101","102",..: 35 20 38 48 48 1 47 155 196 38 ...
## $ longitude.intervention : num 2.34 2.28 2.28 2.34 2.41 ...
## $ latitude.intervention : num 48.9 48.9 48.9 48.9 48.9 ...
## $ emergency.vehicle : Factor w/ 708 levels "1815","1823",..: 72 351 678 59 646 681 127 460 557 421 ...
## $ emergency.vehicle.type : Factor w/ 66 levels "AR","BEAA BSPP",..: 21 53 62 9 25 62 25 62 62 62 ...
## $ rescue.center : Factor w/ 91 levels "2418","2434",..: 42 3 15 42 17 70 65 30 35 58 ...
## $ delta.status.preceding.selection.selection: int 2636 16243 597 1834 1341 2197 16 1312 263 437 ...
## $ departed.from.its.rescue.center : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ OSRM.estimated.distance : num 1283 2347 1078 1791 1451 ...
## $ OSRM.estimated.duration : num 214 218 120 250 199 ...
## $ OSRM.estimated.speed : num 21.6 38.8 32.4 25.7 26.2 ...
## $ month : Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekdays : Factor w/ 7 levels "dimanche","jeudi",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hours : Factor w/ 24 levels "00","01","02",..: 12 10 1 1 1 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
foo <- train %>% select(-delta.selection.departure)
bar <- valid %>% select(-delta.selection.departure)
dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.selection.departure)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.selection.departure)
dtest <- xgb.DMatrix(data.matrix(x_test.reduced))
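The str output above shows that some factors have different numbers of levels in the training and test data (for example, emergency.vehicle.type has 41 levels in one and 66 in the other). Since data.matrix() encodes a factor by its internal integer code, the same label may then map to different integers in the two matrices. The following is a minimal sketch on hypothetical toy data, not part of the original notebook, of one way to align the levels before encoding; the frames train_toy and test_toy and their labels are illustrative assumptions.

```r
# Hypothetical toy frames with a factor whose level sets differ.
train_toy <- data.frame(vehicle.type = factor(c("PSE", "VSAV", "VSAV")))
test_toy  <- data.frame(vehicle.type = factor(c("AR", "PSE", "VSAV")))

# Encoded independently, "VSAV" gets code 2 in train but code 3 in test.
print(data.matrix(train_toy)[, "vehicle.type"])
print(data.matrix(test_toy)[, "vehicle.type"])

# Re-declare both factors on the union of their levels so the codes agree.
shared_levels <- union(levels(train_toy$vehicle.type),
                       levels(test_toy$vehicle.type))
train_toy$vehicle.type <- factor(train_toy$vehicle.type, levels = shared_levels)
test_toy$vehicle.type  <- factor(test_toy$vehicle.type,  levels = shared_levels)

train_codes <- data.matrix(train_toy)[, "vehicle.type"]
test_codes  <- data.matrix(test_toy)[, "vehicle.type"]
print(train_codes)
print(test_codes)
```

After alignment each label has a single code in both matrices, although levels that appear only in the test data still carry no information learned during training.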
gb_params_final <- list(
  colsample_bytree = 0.7,   # fraction of variables sampled per tree
  subsample = 0.7,          # fraction of rows sampled per tree
  booster = "gbtree",
  max_depth = 5,            # maximum tree depth
  eta = 0.3,                # learning rate (shrinkage)
  eval_metric = "rmse",
  objective = "reg:linear",
  seed = 4321
)
watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
# Note: this call passes xgb_params, not the gb_params_final list defined above.
gb_dt_final <- xgb.train(params = xgb_params,
                         data = dtrain,
                         print_every_n = 100,
                         watchlist = watchlist,
                         nrounds = 300)
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 100, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:38:34] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:108.764305 valid-rmse:108.917198
## [101] train-rmse:45.565170 valid-rmse:46.235214
## [201] train-rmse:45.151196 valid-rmse:46.161350
## [300] train-rmse:44.833618 valid-rmse:46.085770
After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. The CV runs for up to 300 rounds; the early-stopping parameter (early_stopping_rounds = 10) stops it once the test RMSE has not improved for 10 consecutive rounds.
xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=300)
## [01:39:00] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:00] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:01] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:01] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [01:39:02] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:109.438702+0.559431 test-rmse:109.440683+0.619834
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:84.894603+0.718813 test-rmse:84.902556+0.904187
## [3] train-rmse:69.485803+1.021398 test-rmse:69.501114+1.134613
## [4] train-rmse:59.894737+0.571888 test-rmse:59.907986+0.585138
## [5] train-rmse:54.314114+0.344400 test-rmse:54.325171+0.397595
## [6] train-rmse:51.589322+0.310478 test-rmse:51.613922+0.443222
## [7] train-rmse:49.777984+0.278846 test-rmse:49.819305+0.389955
## [8] train-rmse:48.825447+0.318711 test-rmse:48.873666+0.379909
## [9] train-rmse:48.303604+0.389986 test-rmse:48.357178+0.307877
## [10] train-rmse:47.881085+0.246625 test-rmse:47.935491+0.347936
## [11] train-rmse:47.613717+0.145970 test-rmse:47.675505+0.423193
## [12] train-rmse:47.447366+0.118329 test-rmse:47.516430+0.447960
## [13] train-rmse:47.315423+0.147158 test-rmse:47.392815+0.430028
## [14] train-rmse:47.224849+0.149517 test-rmse:47.312397+0.425389
## [15] train-rmse:47.139268+0.137270 test-rmse:47.236465+0.429832
## [16] train-rmse:47.072858+0.107285 test-rmse:47.175311+0.446108
## [17] train-rmse:47.012226+0.104342 test-rmse:47.124238+0.448828
## [18] train-rmse:46.952984+0.088795 test-rmse:47.069958+0.459836
## [19] train-rmse:46.878547+0.099424 test-rmse:46.995417+0.444797
## [20] train-rmse:46.829338+0.092866 test-rmse:46.949340+0.446384
## [21] train-rmse:46.792914+0.090098 test-rmse:46.918788+0.449279
## [22] train-rmse:46.741657+0.092691 test-rmse:46.875304+0.449124
## [23] train-rmse:46.704360+0.102778 test-rmse:46.846227+0.443588
## [24] train-rmse:46.672100+0.105100 test-rmse:46.825277+0.445101
## [25] train-rmse:46.635297+0.107106 test-rmse:46.797864+0.437146
## [26] train-rmse:46.596321+0.094237 test-rmse:46.762611+0.439128
## [27] train-rmse:46.560267+0.094102 test-rmse:46.733967+0.432636
## [28] train-rmse:46.534351+0.094057 test-rmse:46.713133+0.433678
## [29] train-rmse:46.513647+0.091461 test-rmse:46.698022+0.437434
## [30] train-rmse:46.481928+0.091837 test-rmse:46.662455+0.428521
## [31] train-rmse:46.458511+0.087206 test-rmse:46.644716+0.432821
## [32] train-rmse:46.436820+0.086933 test-rmse:46.629818+0.434153
## [33] train-rmse:46.410554+0.080262 test-rmse:46.604999+0.438643
## [34] train-rmse:46.375812+0.099476 test-rmse:46.576962+0.415402
## [35] train-rmse:46.350275+0.111620 test-rmse:46.560944+0.400779
## [36] train-rmse:46.325918+0.119989 test-rmse:46.540186+0.394581
## [37] train-rmse:46.307871+0.118590 test-rmse:46.526142+0.393106
## [38] train-rmse:46.279480+0.113912 test-rmse:46.502864+0.389999
## [39] train-rmse:46.263217+0.111436 test-rmse:46.495633+0.391198
## [40] train-rmse:46.247558+0.115388 test-rmse:46.485359+0.385311
## [41] train-rmse:46.218600+0.113302 test-rmse:46.460198+0.388392
## [42] train-rmse:46.201891+0.111633 test-rmse:46.448116+0.391217
## [43] train-rmse:46.177470+0.114666 test-rmse:46.426009+0.385887
## [44] train-rmse:46.169138+0.115109 test-rmse:46.423231+0.383833
## [45] train-rmse:46.152372+0.115892 test-rmse:46.414399+0.378098
## [46] train-rmse:46.138344+0.108031 test-rmse:46.405788+0.385344
## [47] train-rmse:46.119640+0.108681 test-rmse:46.387619+0.386052
## [48] train-rmse:46.096558+0.104960 test-rmse:46.367865+0.393450
## [49] train-rmse:46.070429+0.089682 test-rmse:46.343572+0.406688
## [50] train-rmse:46.053349+0.084951 test-rmse:46.328449+0.409164
## [51] train-rmse:46.033454+0.096135 test-rmse:46.310957+0.398398
## [52] train-rmse:46.005325+0.102338 test-rmse:46.289720+0.393850
## [53] train-rmse:45.996831+0.102862 test-rmse:46.285611+0.391098
## [54] train-rmse:45.981652+0.096508 test-rmse:46.272963+0.394636
## [55] train-rmse:45.972262+0.096199 test-rmse:46.268758+0.395324
## [56] train-rmse:45.955748+0.096589 test-rmse:46.265627+0.393099
## [57] train-rmse:45.943023+0.103073 test-rmse:46.256781+0.388549
## [58] train-rmse:45.930495+0.102585 test-rmse:46.247333+0.389796
## [59] train-rmse:45.921631+0.101210 test-rmse:46.243295+0.392078
## [60] train-rmse:45.910559+0.099743 test-rmse:46.233862+0.393873
## [61] train-rmse:45.897163+0.095237 test-rmse:46.225727+0.397722
## [62] train-rmse:45.890328+0.095729 test-rmse:46.227280+0.398735
## [63] train-rmse:45.882620+0.095649 test-rmse:46.227895+0.402963
## [64] train-rmse:45.874048+0.099698 test-rmse:46.226121+0.398159
## [65] train-rmse:45.867832+0.100948 test-rmse:46.224503+0.398578
## [66] train-rmse:45.858866+0.096844 test-rmse:46.219268+0.402189
## [67] train-rmse:45.851366+0.094488 test-rmse:46.214951+0.402451
## [68] train-rmse:45.835413+0.094237 test-rmse:46.203290+0.401946
## [69] train-rmse:45.826470+0.091611 test-rmse:46.198814+0.404163
## [70] train-rmse:45.817159+0.093187 test-rmse:46.193200+0.404034
## [71] train-rmse:45.804537+0.095164 test-rmse:46.188594+0.403349
## [72] train-rmse:45.793393+0.092860 test-rmse:46.183155+0.406127
## [73] train-rmse:45.782345+0.098632 test-rmse:46.175396+0.397158
## [74] train-rmse:45.770280+0.094341 test-rmse:46.168652+0.401127
## [75] train-rmse:45.765028+0.094839 test-rmse:46.167232+0.397421
## [76] train-rmse:45.753767+0.092104 test-rmse:46.160619+0.397639
## [77] train-rmse:45.746474+0.093464 test-rmse:46.157053+0.398226
## [78] train-rmse:45.740511+0.094070 test-rmse:46.157086+0.395058
## [79] train-rmse:45.727780+0.087412 test-rmse:46.145632+0.401226
## [80] train-rmse:45.722180+0.090232 test-rmse:46.142101+0.400225
## [81] train-rmse:45.706887+0.087424 test-rmse:46.132990+0.403812
## [82] train-rmse:45.694583+0.087184 test-rmse:46.122985+0.403369
## [83] train-rmse:45.687182+0.088302 test-rmse:46.124319+0.405468
## [84] train-rmse:45.677532+0.095221 test-rmse:46.118461+0.398551
## [85] train-rmse:45.669465+0.092271 test-rmse:46.113824+0.399925
## [86] train-rmse:45.660502+0.090982 test-rmse:46.110171+0.402120
## [87] train-rmse:45.649854+0.093473 test-rmse:46.102174+0.399998
## [88] train-rmse:45.642434+0.095430 test-rmse:46.095586+0.397447
## [89] train-rmse:45.635644+0.097198 test-rmse:46.089239+0.394088
## [90] train-rmse:45.628621+0.098191 test-rmse:46.084952+0.390051
## [91] train-rmse:45.621129+0.096414 test-rmse:46.084302+0.396479
## [92] train-rmse:45.614175+0.094711 test-rmse:46.081956+0.397250
## [93] train-rmse:45.608479+0.096503 test-rmse:46.075993+0.397114
## [94] train-rmse:45.600951+0.096525 test-rmse:46.073662+0.396759
## [95] train-rmse:45.593455+0.099667 test-rmse:46.070815+0.394864
## [96] train-rmse:45.588517+0.101237 test-rmse:46.070628+0.395557
## [97] train-rmse:45.580947+0.102931 test-rmse:46.068772+0.394096
## [98] train-rmse:45.575790+0.104361 test-rmse:46.067573+0.394005
## [99] train-rmse:45.567061+0.106474 test-rmse:46.062012+0.388595
## [100] train-rmse:45.560835+0.105843 test-rmse:46.061743+0.387334
## [101] train-rmse:45.553590+0.107416 test-rmse:46.063715+0.382370
## [102] train-rmse:45.545368+0.109430 test-rmse:46.059541+0.379140
## [103] train-rmse:45.537080+0.110874 test-rmse:46.058841+0.380323
## [104] train-rmse:45.523985+0.106184 test-rmse:46.047968+0.384215
## [105] train-rmse:45.519158+0.106633 test-rmse:46.047930+0.384796
## [106] train-rmse:45.511317+0.105058 test-rmse:46.045711+0.384642
## [107] train-rmse:45.502442+0.110509 test-rmse:46.041018+0.383337
## [108] train-rmse:45.495631+0.111530 test-rmse:46.040226+0.381890
## [109] train-rmse:45.489730+0.111894 test-rmse:46.038902+0.382750
## [110] train-rmse:45.480850+0.112727 test-rmse:46.035070+0.384782
## [111] train-rmse:45.472446+0.111709 test-rmse:46.030100+0.387983
## [112] train-rmse:45.467838+0.111339 test-rmse:46.029602+0.389600
## [113] train-rmse:45.464172+0.110385 test-rmse:46.030675+0.390175
## [114] train-rmse:45.461120+0.110866 test-rmse:46.029289+0.389483
## [115] train-rmse:45.453600+0.114780 test-rmse:46.024248+0.384036
## [116] train-rmse:45.447788+0.115955 test-rmse:46.021204+0.380277
## [117] train-rmse:45.442711+0.116291 test-rmse:46.017592+0.378985
## [118] train-rmse:45.436595+0.116952 test-rmse:46.016781+0.377853
## [119] train-rmse:45.431622+0.118288 test-rmse:46.014191+0.377601
## [120] train-rmse:45.424943+0.117829 test-rmse:46.010490+0.378689
## [121] train-rmse:45.415434+0.119387 test-rmse:46.012794+0.375269
## [122] train-rmse:45.408988+0.122145 test-rmse:46.008726+0.372006
## [123] train-rmse:45.403700+0.121970 test-rmse:46.005849+0.374101
## [124] train-rmse:45.396910+0.121602 test-rmse:46.005533+0.374079
## [125] train-rmse:45.391816+0.121186 test-rmse:46.009264+0.376061
## [126] train-rmse:45.384869+0.119488 test-rmse:46.007458+0.375526
## [127] train-rmse:45.380136+0.117342 test-rmse:46.006281+0.376821
## [128] train-rmse:45.374454+0.116486 test-rmse:46.004682+0.374953
## [129] train-rmse:45.367886+0.116091 test-rmse:46.000495+0.376947
## [130] train-rmse:45.362262+0.116303 test-rmse:46.002694+0.377098
## [131] train-rmse:45.356953+0.114891 test-rmse:46.000604+0.378858
## [132] train-rmse:45.349868+0.113813 test-rmse:45.995538+0.379066
## [133] train-rmse:45.345124+0.113126 test-rmse:45.997368+0.376386
## [134] train-rmse:45.339609+0.112229 test-rmse:45.994669+0.373599
## [135] train-rmse:45.334460+0.113855 test-rmse:45.994750+0.373977
## [136] train-rmse:45.329550+0.112310 test-rmse:45.992655+0.374772
## [137] train-rmse:45.325485+0.111417 test-rmse:45.997994+0.376721
## [138] train-rmse:45.318647+0.112062 test-rmse:45.993290+0.374061
## [139] train-rmse:45.313020+0.112085 test-rmse:45.994011+0.373814
## [140] train-rmse:45.309931+0.111983 test-rmse:45.992916+0.377432
## [141] train-rmse:45.303262+0.109068 test-rmse:45.989247+0.379591
## [142] train-rmse:45.297079+0.107220 test-rmse:45.981874+0.382213
## [143] train-rmse:45.293496+0.107134 test-rmse:45.984377+0.386339
## [144] train-rmse:45.288324+0.107523 test-rmse:45.984695+0.384871
## [145] train-rmse:45.284608+0.108439 test-rmse:45.984770+0.384042
## [146] train-rmse:45.280708+0.107661 test-rmse:45.985532+0.382651
## [147] train-rmse:45.277760+0.107509 test-rmse:45.982391+0.381723
## [148] train-rmse:45.271954+0.109191 test-rmse:45.982878+0.382488
## [149] train-rmse:45.265923+0.107376 test-rmse:45.981250+0.380495
## [150] train-rmse:45.262325+0.106957 test-rmse:45.980341+0.379965
## [151] train-rmse:45.257532+0.107106 test-rmse:45.982121+0.379787
## [152] train-rmse:45.253326+0.105689 test-rmse:45.979599+0.381825
## [153] train-rmse:45.245923+0.103926 test-rmse:45.976024+0.385324
## [154] train-rmse:45.238992+0.104069 test-rmse:45.979707+0.388182
## [155] train-rmse:45.232452+0.106430 test-rmse:45.976478+0.384663
## [156] train-rmse:45.228419+0.106536 test-rmse:45.976254+0.383833
## [157] train-rmse:45.220896+0.107421 test-rmse:45.974044+0.385273
## [158] train-rmse:45.215640+0.104557 test-rmse:45.969373+0.385891
## [159] train-rmse:45.211986+0.104489 test-rmse:45.972313+0.384611
## [160] train-rmse:45.206991+0.104936 test-rmse:45.970402+0.381735
## [161] train-rmse:45.203828+0.105907 test-rmse:45.971630+0.381604
## [162] train-rmse:45.200187+0.106739 test-rmse:45.972697+0.381342
## [163] train-rmse:45.194403+0.104078 test-rmse:45.970814+0.381737
## [164] train-rmse:45.190692+0.106271 test-rmse:45.969705+0.379677
## [165] train-rmse:45.185835+0.105541 test-rmse:45.969140+0.379014
## [166] train-rmse:45.183518+0.105773 test-rmse:45.966428+0.377351
## [167] train-rmse:45.178522+0.107149 test-rmse:45.963833+0.377128
## [168] train-rmse:45.171958+0.108809 test-rmse:45.956678+0.379922
## [169] train-rmse:45.167024+0.107581 test-rmse:45.955903+0.381604
## [170] train-rmse:45.162596+0.108407 test-rmse:45.954861+0.380690
## [171] train-rmse:45.156151+0.105399 test-rmse:45.954975+0.382678
## [172] train-rmse:45.151859+0.106585 test-rmse:45.959971+0.383140
## [173] train-rmse:45.147759+0.105442 test-rmse:45.957592+0.383399
## [174] train-rmse:45.143932+0.105284 test-rmse:45.956418+0.383989
## [175] train-rmse:45.140186+0.104551 test-rmse:45.953863+0.384547
## [176] train-rmse:45.132823+0.103931 test-rmse:45.954833+0.382546
## [177] train-rmse:45.128508+0.103835 test-rmse:45.954583+0.382270
## [178] train-rmse:45.124426+0.104018 test-rmse:45.954285+0.382404
## [179] train-rmse:45.118513+0.104674 test-rmse:45.952400+0.383345
## [180] train-rmse:45.114346+0.102893 test-rmse:45.950305+0.388413
## [181] train-rmse:45.110950+0.102957 test-rmse:45.949064+0.390227
## [182] train-rmse:45.105926+0.103420 test-rmse:45.944723+0.390256
## [183] train-rmse:45.101180+0.103310 test-rmse:45.947958+0.388871
## [184] train-rmse:45.096868+0.102308 test-rmse:45.947850+0.389898
## [185] train-rmse:45.094074+0.102612 test-rmse:45.949082+0.387970
## [186] train-rmse:45.090502+0.102335 test-rmse:45.950592+0.384964
## [187] train-rmse:45.087356+0.102407 test-rmse:45.950372+0.387534
## [188] train-rmse:45.082444+0.101214 test-rmse:45.952331+0.387627
## [189] train-rmse:45.077558+0.100186 test-rmse:45.949451+0.388117
## [190] train-rmse:45.072720+0.102014 test-rmse:45.950515+0.387894
## [191] train-rmse:45.069019+0.103195 test-rmse:45.949208+0.386652
## [192] train-rmse:45.065781+0.102559 test-rmse:45.950189+0.386796
## Stopping. Best iteration:
## [182] train-rmse:45.105926+0.103420 test-rmse:45.944723+0.390256
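The log above stops at round 192 and reports round 182 as the best iteration, that is, 10 rounds without improvement. The rule can be sketched in plain R as follows. This is an illustrative reimplementation under simple assumptions (strict improvement, a single metric), not the code xgboost itself uses; the function name early_stop and the toy RMSE values are hypothetical.

```r
# Given per-round test RMSE values and a patience (the role played by
# early_stopping_rounds), return the best round and the round where we stop.
early_stop <- function(test_rmse, patience) {
  best_round <- 1
  for (i in seq_along(test_rmse)) {
    if (test_rmse[i] < test_rmse[best_round]) {
      # Strict improvement: remember this round as the new best.
      best_round <- i
    } else if (i - best_round >= patience) {
      # No improvement for `patience` rounds: stop here.
      return(list(best_round = best_round, stopped_at = i))
    }
  }
  # Ran out of rounds without triggering the stopping rule.
  list(best_round = best_round, stopped_at = length(test_rmse))
}

# Hypothetical RMSE curve: the minimum so far is at round 3.
toy_rmse <- c(50, 48, 47, 47.5, 47.2, 47.1, 46.9, 47.0)
result <- early_stop(toy_rmse, patience = 3)
print(result)
```

In the toy curve the later value 46.9 is never reached, which illustrates that a small patience can stop before a further improvement.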
yo_preds <- predict(gb_dt_final,dtest)
The prediction step runs without error.
yo_preds %>% head
## [1] 135.6260 155.6035 161.2986 121.7719 152.7737 200.6128
data.fe.reduced.1<-data.fe[,-c("delta.selection.departure","delta.selection.presentation","emergency.vehicle.selection")]
Validation
set.seed(4321)
trainIndex <- createDataPartition(data.fe.reduced.1$delta.departure.presentation, p = 0.8, list= FALSE, times = 1)
train=data.fe.reduced.1[trainIndex,]
valid=data.fe.reduced.1[-trainIndex,]
foo <- train %>% select(-delta.departure.presentation)
bar <- valid %>% select(-delta.departure.presentation)
dtrain <- xgb.DMatrix(data.matrix(foo),label = train$delta.departure.presentation)
dvalid <- xgb.DMatrix(data.matrix(bar),label = valid$delta.departure.presentation)
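# Note: this list is defined here, but the xgb.train() call below passes xgb_params instead.
gb_params_final <- list(colsample_bytree = 0.7, # fraction of variables sampled per tree
                        subsample = 0.7, # fraction of rows sampled per tree
                        booster = "gbtree",
                        max_depth = 5, # tree levels
                        eta = 0.3, # shrinkage
                        eval_metric = "rmse",
                        objective = "reg:squarederror" # replaces the deprecated "reg:linear"
)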
watchlist <- list(train=dtrain, valid=dvalid)
set.seed(4321)
gb_dt_final <- xgb.train(params = xgb_params,
data = dtrain,
print_every_n = 500,
watchlist = watchlist,
nrounds = 300)
## Warning in xgb.train(params = xgb_params, data = dtrain, print_every_n = 500, :
## xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.
## [01:39:59] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:281.581818 valid-rmse:281.096405
## [300] train-rmse:121.854195 valid-rmse:123.892708
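The train-rmse and valid-rmse figures in the log are root mean squared errors, expressed in the same unit as the target. As a minimal sketch of how such a hold-out score is computed, assuming toy values that are not taken from the data set and a hypothetical helper named `rmse`:

```r
# Hypothetical helper: root mean squared error between two numeric vectors.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy values (made up for illustration, not from the fire brigade data).
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 310)

toy_score <- rmse(actual, predicted)
print(toy_score)
```

On the real data, the same helper could be applied to the validation labels and to the predictions of the fitted model on the validation matrix.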
After fitting, we run a 5-fold cross-validation (CV) to estimate the model's performance. The call below allows up to 1500 boosting rounds, and the early-stopping parameter ends the CV once the test RMSE has not improved for 10 consecutive rounds.
xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 10, nfold = 5, nrounds=1500)
## [01:40:25] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:282.703515+2.044958 test-rmse:282.702063+1.886475
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:218.929465+1.463350 test-rmse:218.955466+1.366829
## [3] train-rmse:179.388782+0.922182 test-rmse:179.439578+1.124537
## [4] train-rmse:156.524942+0.601086 test-rmse:156.619220+1.412488
## [5] train-rmse:143.643131+0.368277 test-rmse:143.771579+1.294501
## [6] train-rmse:136.784097+0.374349 test-rmse:136.925977+1.336595
## [7] train-rmse:133.034320+0.245124 test-rmse:133.173413+1.127329
## [8] train-rmse:131.106418+0.238518 test-rmse:131.271310+1.080970
## [9] train-rmse:129.921249+0.203774 test-rmse:130.094870+0.971673
## [10] train-rmse:129.221500+0.212954 test-rmse:129.418523+0.924257
## [11] train-rmse:128.797443+0.186786 test-rmse:129.030980+0.916174
## [12] train-rmse:128.471445+0.166120 test-rmse:128.712482+0.925961
## [13] train-rmse:128.235010+0.168303 test-rmse:128.484287+0.915703
## [14] train-rmse:128.029650+0.155856 test-rmse:128.283845+0.936030
## [15] train-rmse:127.870262+0.168881 test-rmse:128.135141+0.928278
## [16] train-rmse:127.731982+0.178498 test-rmse:128.017876+0.908642
## [17] train-rmse:127.591865+0.195014 test-rmse:127.892786+0.891753
## [18] train-rmse:127.458642+0.161967 test-rmse:127.761642+0.934695
## [19] train-rmse:127.325392+0.184251 test-rmse:127.630139+0.902832
## [20] train-rmse:127.199688+0.200809 test-rmse:127.518605+0.897174
## [21] train-rmse:127.122252+0.198951 test-rmse:127.448506+0.898978
## [22] train-rmse:127.042662+0.207185 test-rmse:127.381604+0.887989
## [23] train-rmse:126.952083+0.203202 test-rmse:127.308566+0.892925
## [24] train-rmse:126.841679+0.223267 test-rmse:127.190930+0.865984
## [25] train-rmse:126.771108+0.234352 test-rmse:127.133368+0.847293
## [26] train-rmse:126.697027+0.245487 test-rmse:127.072899+0.835085
## [27] train-rmse:126.624141+0.256402 test-rmse:127.001717+0.816570
## [28] train-rmse:126.558125+0.242914 test-rmse:126.950719+0.814846
## [29] train-rmse:126.486377+0.241149 test-rmse:126.893077+0.808848
## [30] train-rmse:126.433842+0.225320 test-rmse:126.861282+0.817011
## [31] train-rmse:126.374431+0.221572 test-rmse:126.801906+0.818437
## [32] train-rmse:126.326585+0.226684 test-rmse:126.765683+0.801218
## [33] train-rmse:126.267470+0.211786 test-rmse:126.715581+0.818008
## [34] train-rmse:126.203683+0.230844 test-rmse:126.670879+0.806378
## [35] train-rmse:126.155077+0.238875 test-rmse:126.640695+0.798791
## [36] train-rmse:126.115436+0.241824 test-rmse:126.611569+0.797617
## [37] train-rmse:126.061203+0.230213 test-rmse:126.565912+0.816225
## [38] train-rmse:126.010155+0.229980 test-rmse:126.525464+0.818805
## [39] train-rmse:125.970544+0.229664 test-rmse:126.497513+0.818607
## [40] train-rmse:125.923646+0.213149 test-rmse:126.461865+0.833388
## [41] train-rmse:125.880949+0.203930 test-rmse:126.427788+0.833575
## [42] train-rmse:125.838098+0.207238 test-rmse:126.383000+0.842389
## [43] train-rmse:125.785091+0.195117 test-rmse:126.340544+0.838548
## [44] train-rmse:125.747461+0.181158 test-rmse:126.309015+0.844894
## [45] train-rmse:125.691655+0.192677 test-rmse:126.259523+0.830306
## [46] train-rmse:125.639765+0.188534 test-rmse:126.218234+0.832650
## [47] train-rmse:125.605250+0.197371 test-rmse:126.191342+0.827306
## [48] train-rmse:125.569556+0.193846 test-rmse:126.163077+0.831638
## [49] train-rmse:125.534003+0.199433 test-rmse:126.138173+0.829639
## [50] train-rmse:125.498117+0.198589 test-rmse:126.116321+0.836811
## [51] train-rmse:125.469231+0.193087 test-rmse:126.093979+0.837198
## [52] train-rmse:125.424967+0.188518 test-rmse:126.067302+0.838145
## [53] train-rmse:125.386867+0.205427 test-rmse:126.031917+0.811170
## [54] train-rmse:125.345363+0.199397 test-rmse:126.001122+0.814600
## [55] train-rmse:125.305042+0.203959 test-rmse:125.961874+0.802213
## [56] train-rmse:125.277211+0.201368 test-rmse:125.943977+0.805934
## [57] train-rmse:125.226792+0.190045 test-rmse:125.901520+0.826776
## [58] train-rmse:125.190959+0.189146 test-rmse:125.875571+0.834384
## [59] train-rmse:125.169673+0.185201 test-rmse:125.861639+0.835608
## [60] train-rmse:125.144896+0.195900 test-rmse:125.841919+0.826982
## [61] train-rmse:125.106685+0.193277 test-rmse:125.820224+0.831770
## [62] train-rmse:125.081052+0.189589 test-rmse:125.802141+0.834579
## [63] train-rmse:125.050079+0.197859 test-rmse:125.788423+0.834169
## [64] train-rmse:125.010675+0.210347 test-rmse:125.762556+0.830253
## [65] train-rmse:124.986939+0.216104 test-rmse:125.749217+0.819989
## [66] train-rmse:124.960171+0.217233 test-rmse:125.733715+0.820735
## [67] train-rmse:124.939699+0.215158 test-rmse:125.716045+0.819372
## [68] train-rmse:124.915331+0.219360 test-rmse:125.696565+0.822305
## [69] train-rmse:124.897646+0.217097 test-rmse:125.687781+0.827521
## [70] train-rmse:124.876439+0.210361 test-rmse:125.676044+0.838766
## [71] train-rmse:124.856493+0.214544 test-rmse:125.663767+0.836737
## [72] train-rmse:124.818239+0.209000 test-rmse:125.633214+0.846154
## [73] train-rmse:124.782941+0.198264 test-rmse:125.595021+0.858412
## [74] train-rmse:124.749162+0.206003 test-rmse:125.573341+0.852352
## [75] train-rmse:124.716267+0.216702 test-rmse:125.548451+0.838944
## [76] train-rmse:124.697811+0.209454 test-rmse:125.536493+0.845908
## [77] train-rmse:124.673810+0.203714 test-rmse:125.520654+0.854118
## [78] train-rmse:124.652530+0.201940 test-rmse:125.503529+0.857069
## [79] train-rmse:124.638992+0.202696 test-rmse:125.497584+0.853056
## [80] train-rmse:124.616330+0.207956 test-rmse:125.490674+0.838532
## [81] train-rmse:124.601074+0.207526 test-rmse:125.484125+0.840142
## [82] train-rmse:124.574916+0.206030 test-rmse:125.470949+0.845956
## [83] train-rmse:124.554855+0.211355 test-rmse:125.467580+0.836485
## [84] train-rmse:124.532333+0.228419 test-rmse:125.455859+0.817522
## [85] train-rmse:124.516373+0.226317 test-rmse:125.446210+0.815057
## [86] train-rmse:124.489075+0.216611 test-rmse:125.431657+0.820862
## [87] train-rmse:124.460928+0.223182 test-rmse:125.411769+0.815325
## [88] train-rmse:124.441197+0.220566 test-rmse:125.404826+0.820964
## [89] train-rmse:124.414539+0.223098 test-rmse:125.388025+0.821629
## [90] train-rmse:124.390385+0.209885 test-rmse:125.379727+0.833791
## [91] train-rmse:124.368117+0.202071 test-rmse:125.366159+0.839067
## [92] train-rmse:124.346083+0.205148 test-rmse:125.357337+0.845494
## [93] train-rmse:124.329237+0.201805 test-rmse:125.348677+0.844546
## [94] train-rmse:124.300706+0.208897 test-rmse:125.334479+0.834123
## [95] train-rmse:124.280487+0.203953 test-rmse:125.321832+0.844934
## [96] train-rmse:124.268350+0.203069 test-rmse:125.318417+0.842680
## [97] train-rmse:124.244969+0.212763 test-rmse:125.296135+0.834938
## [98] train-rmse:124.219591+0.212004 test-rmse:125.273372+0.834482
## [99] train-rmse:124.198354+0.207942 test-rmse:125.254198+0.841005
## [100] train-rmse:124.169476+0.213721 test-rmse:125.235422+0.838064
## [101] train-rmse:124.150229+0.210763 test-rmse:125.230492+0.842126
## [102] train-rmse:124.136728+0.209839 test-rmse:125.226146+0.848534
## [103] train-rmse:124.114752+0.209701 test-rmse:125.215111+0.854951
## [104] train-rmse:124.094069+0.212273 test-rmse:125.192407+0.854568
## [105] train-rmse:124.071657+0.221118 test-rmse:125.190526+0.851523
## [106] train-rmse:124.056296+0.217983 test-rmse:125.184055+0.853253
## [107] train-rmse:124.037360+0.214577 test-rmse:125.179395+0.856461
## [108] train-rmse:124.023198+0.217073 test-rmse:125.174644+0.856415
## [109] train-rmse:123.994272+0.215885 test-rmse:125.152083+0.858963
## [110] train-rmse:123.977367+0.223131 test-rmse:125.146132+0.849085
## [111] train-rmse:123.954877+0.232329 test-rmse:125.129021+0.843463
## [112] train-rmse:123.928111+0.236565 test-rmse:125.118430+0.835341
## [113] train-rmse:123.907115+0.238853 test-rmse:125.112030+0.834479
## [114] train-rmse:123.890954+0.240151 test-rmse:125.107970+0.828642
## [115] train-rmse:123.873549+0.233621 test-rmse:125.103181+0.836711
## [116] train-rmse:123.859955+0.236917 test-rmse:125.102930+0.839444
## [117] train-rmse:123.848044+0.237979 test-rmse:125.097391+0.837442
## [118] train-rmse:123.836887+0.231834 test-rmse:125.093994+0.843417
## [119] train-rmse:123.815416+0.237891 test-rmse:125.080812+0.839786
## [120] train-rmse:123.795982+0.235941 test-rmse:125.077800+0.839567
## [121] train-rmse:123.778194+0.230650 test-rmse:125.071588+0.841475
## [122] train-rmse:123.755693+0.222981 test-rmse:125.055562+0.850295
## [123] train-rmse:123.736705+0.219992 test-rmse:125.050154+0.843424
## [124] train-rmse:123.718704+0.216090 test-rmse:125.043178+0.845905
## [125] train-rmse:123.700588+0.210073 test-rmse:125.027925+0.853678
## [126] train-rmse:123.675894+0.210793 test-rmse:125.007852+0.849176
## [127] train-rmse:123.664406+0.210704 test-rmse:125.004581+0.848155
## [128] train-rmse:123.641428+0.213283 test-rmse:124.986501+0.850398
## [129] train-rmse:123.622099+0.215967 test-rmse:124.979970+0.854711
## [130] train-rmse:123.609581+0.211986 test-rmse:124.983769+0.860617
## [131] train-rmse:123.593855+0.214852 test-rmse:124.979529+0.859937
## [132] train-rmse:123.582040+0.213258 test-rmse:124.976103+0.857515
## [133] train-rmse:123.561973+0.217257 test-rmse:124.961093+0.855444
## [134] train-rmse:123.549158+0.217683 test-rmse:124.952711+0.858713
## [135] train-rmse:123.536905+0.216968 test-rmse:124.951578+0.862451
## [136] train-rmse:123.520061+0.217543 test-rmse:124.943797+0.860185
## [137] train-rmse:123.508148+0.219179 test-rmse:124.939552+0.860109
## [138] train-rmse:123.491881+0.223549 test-rmse:124.936464+0.851047
## [139] train-rmse:123.476079+0.231766 test-rmse:124.931679+0.840744
## [140] train-rmse:123.460965+0.237687 test-rmse:124.926697+0.838108
## [141] train-rmse:123.446237+0.243254 test-rmse:124.922507+0.835648
## [142] train-rmse:123.432312+0.241219 test-rmse:124.918463+0.836149
## [143] train-rmse:123.418791+0.241527 test-rmse:124.913319+0.835665
## [144] train-rmse:123.406230+0.240045 test-rmse:124.911740+0.837834
## [145] train-rmse:123.386075+0.243063 test-rmse:124.905865+0.834632
## [146] train-rmse:123.373619+0.248370 test-rmse:124.900865+0.832215
## [147] train-rmse:123.357594+0.248850 test-rmse:124.890523+0.831822
## [148] train-rmse:123.340611+0.247387 test-rmse:124.877173+0.828970
## [149] train-rmse:123.329089+0.244405 test-rmse:124.875403+0.829551
## [150] train-rmse:123.315657+0.244560 test-rmse:124.868761+0.826650
## [151] train-rmse:123.299795+0.239596 test-rmse:124.860689+0.826716
## [152] train-rmse:123.286383+0.232553 test-rmse:124.858412+0.826691
## [153] train-rmse:123.275934+0.238177 test-rmse:124.856793+0.825227
## [154] train-rmse:123.260172+0.238620 test-rmse:124.853363+0.823940
## [155] train-rmse:123.251682+0.237573 test-rmse:124.846019+0.825654
## [156] train-rmse:123.236238+0.243332 test-rmse:124.833725+0.814806
## [157] train-rmse:123.223924+0.240597 test-rmse:124.831728+0.815028
## [158] train-rmse:123.211646+0.234858 test-rmse:124.827231+0.815885
## [159] train-rmse:123.198674+0.235413 test-rmse:124.826764+0.817849
## [160] train-rmse:123.186759+0.237144 test-rmse:124.826462+0.815489
## [161] train-rmse:123.176599+0.237559 test-rmse:124.820580+0.814578
## [162] train-rmse:123.162610+0.237635 test-rmse:124.815803+0.816051
## [163] train-rmse:123.150671+0.240134 test-rmse:124.815997+0.816690
## [164] train-rmse:123.134381+0.235918 test-rmse:124.805919+0.818836
## [165] train-rmse:123.123396+0.237924 test-rmse:124.802463+0.817056
## [166] train-rmse:123.113959+0.237513 test-rmse:124.803487+0.819214
## [167] train-rmse:123.102180+0.238519 test-rmse:124.796207+0.815642
## [168] train-rmse:123.088808+0.236735 test-rmse:124.790979+0.813379
## [169] train-rmse:123.077006+0.237550 test-rmse:124.791929+0.821566
## [170] train-rmse:123.068076+0.235357 test-rmse:124.794110+0.821590
## [171] train-rmse:123.056482+0.241593 test-rmse:124.794679+0.819583
## [172] train-rmse:123.046887+0.243126 test-rmse:124.793047+0.816887
## [173] train-rmse:123.032726+0.241747 test-rmse:124.786713+0.813126
## [174] train-rmse:123.013573+0.236626 test-rmse:124.775322+0.816170
## [175] train-rmse:123.004489+0.239485 test-rmse:124.770921+0.819393
## [176] train-rmse:122.988657+0.243757 test-rmse:124.763905+0.823959
## [177] train-rmse:122.977400+0.247604 test-rmse:124.766617+0.818279
## [178] train-rmse:122.965173+0.249178 test-rmse:124.765701+0.818603
## [179] train-rmse:122.947548+0.247722 test-rmse:124.751733+0.817962
## [180] train-rmse:122.930499+0.249827 test-rmse:124.754097+0.817266
## [181] train-rmse:122.918335+0.249450 test-rmse:124.747942+0.818912
## [182] train-rmse:122.910393+0.250046 test-rmse:124.752158+0.815445
## [183] train-rmse:122.896339+0.254423 test-rmse:124.747087+0.809309
## [184] train-rmse:122.881923+0.253999 test-rmse:124.737135+0.805836
## [185] train-rmse:122.866689+0.253076 test-rmse:124.732588+0.806601
## [186] train-rmse:122.855983+0.253636 test-rmse:124.730281+0.808322
## [187] train-rmse:122.843066+0.254177 test-rmse:124.725697+0.812037
## [188] train-rmse:122.833920+0.257494 test-rmse:124.719104+0.806738
## [189] train-rmse:122.824617+0.256561 test-rmse:124.718858+0.802002
## [190] train-rmse:122.808149+0.263662 test-rmse:124.711884+0.795042
## [191] train-rmse:122.797507+0.261261 test-rmse:124.703532+0.798595
## [192] train-rmse:122.786642+0.264055 test-rmse:124.704935+0.801269
## [193] train-rmse:122.778674+0.262511 test-rmse:124.703882+0.804858
## [194] train-rmse:122.771313+0.261746 test-rmse:124.704663+0.800943
## [195] train-rmse:122.762102+0.262220 test-rmse:124.705229+0.799998
## [196] train-rmse:122.751213+0.261713 test-rmse:124.703248+0.803419
## [197] train-rmse:122.742424+0.257413 test-rmse:124.707344+0.809145
## [198] train-rmse:122.729915+0.252411 test-rmse:124.704372+0.812882
## [199] train-rmse:122.718350+0.249875 test-rmse:124.698671+0.822985
## [200] train-rmse:122.706691+0.251270 test-rmse:124.695094+0.814665
## [201] train-rmse:122.696220+0.252974 test-rmse:124.686610+0.814117
## [202] train-rmse:122.684596+0.252984 test-rmse:124.688141+0.814081
## [203] train-rmse:122.671252+0.250797 test-rmse:124.685690+0.815344
## [204] train-rmse:122.660486+0.249585 test-rmse:124.685730+0.815407
## [205] train-rmse:122.648993+0.250871 test-rmse:124.680603+0.817994
## [206] train-rmse:122.638028+0.251300 test-rmse:124.676453+0.818420
## [207] train-rmse:122.626321+0.248848 test-rmse:124.672256+0.822444
## [208] train-rmse:122.613440+0.250855 test-rmse:124.669411+0.823279
## [209] train-rmse:122.606397+0.250544 test-rmse:124.665654+0.820984
## [210] train-rmse:122.595605+0.251522 test-rmse:124.667932+0.818849
## [211] train-rmse:122.582585+0.251372 test-rmse:124.661792+0.816192
## [212] train-rmse:122.572668+0.250903 test-rmse:124.660811+0.814114
## [213] train-rmse:122.558652+0.251068 test-rmse:124.658711+0.814343
## [214] train-rmse:122.548652+0.250852 test-rmse:124.653522+0.820415
## [215] train-rmse:122.536250+0.252358 test-rmse:124.651098+0.824412
## [216] train-rmse:122.519780+0.253022 test-rmse:124.641399+0.824554
## [217] train-rmse:122.510254+0.251972 test-rmse:124.637253+0.824870
## [218] train-rmse:122.491554+0.253490 test-rmse:124.629941+0.823772
## [219] train-rmse:122.480890+0.250413 test-rmse:124.630271+0.823457
## [220] train-rmse:122.470212+0.249207 test-rmse:124.632410+0.824940
## [221] train-rmse:122.461147+0.249431 test-rmse:124.628890+0.825197
## [222] train-rmse:122.448788+0.252908 test-rmse:124.625536+0.822702
## [223] train-rmse:122.435811+0.256006 test-rmse:124.625096+0.825732
## [224] train-rmse:122.423877+0.258507 test-rmse:124.616591+0.824761
## [225] train-rmse:122.406162+0.258602 test-rmse:124.602289+0.826366
## [226] train-rmse:122.393289+0.256130 test-rmse:124.599245+0.827483
## [227] train-rmse:122.385548+0.255267 test-rmse:124.595605+0.824456
## [228] train-rmse:122.377731+0.254012 test-rmse:124.597319+0.827041
## [229] train-rmse:122.364464+0.255340 test-rmse:124.587914+0.825821
## [230] train-rmse:122.352924+0.251895 test-rmse:124.586563+0.830881
## [231] train-rmse:122.340653+0.251020 test-rmse:124.579807+0.839325
## [232] train-rmse:122.326439+0.251651 test-rmse:124.573164+0.840034
## [233] train-rmse:122.312049+0.252558 test-rmse:124.571487+0.846668
## [234] train-rmse:122.305028+0.252084 test-rmse:124.571297+0.844620
## [235] train-rmse:122.293327+0.251807 test-rmse:124.562296+0.843526
## [236] train-rmse:122.282838+0.251742 test-rmse:124.564496+0.845161
## [237] train-rmse:122.272408+0.250095 test-rmse:124.559239+0.849590
## [238] train-rmse:122.257593+0.253910 test-rmse:124.553262+0.846872
## [239] train-rmse:122.244575+0.252006 test-rmse:124.548497+0.843821
## [240] train-rmse:122.236809+0.253924 test-rmse:124.551383+0.840719
## [241] train-rmse:122.227168+0.250516 test-rmse:124.552479+0.847094
## [242] train-rmse:122.216953+0.252109 test-rmse:124.551027+0.845204
## [243] train-rmse:122.206653+0.251993 test-rmse:124.551749+0.846246
## [244] train-rmse:122.198625+0.250489 test-rmse:124.547787+0.853991
## [245] train-rmse:122.191705+0.249759 test-rmse:124.548543+0.854197
## [246] train-rmse:122.184835+0.251527 test-rmse:124.549274+0.855004
## [247] train-rmse:122.175757+0.250663 test-rmse:124.547740+0.853821
## [248] train-rmse:122.167264+0.249044 test-rmse:124.541863+0.851845
## [249] train-rmse:122.160173+0.249516 test-rmse:124.543366+0.849160
## [250] train-rmse:122.150351+0.249389 test-rmse:124.534178+0.847989
## [251] train-rmse:122.143443+0.246605 test-rmse:124.530742+0.847613
## [252] train-rmse:122.132130+0.248239 test-rmse:124.525398+0.847810
## [253] train-rmse:122.120404+0.252135 test-rmse:124.525087+0.844577
## [254] train-rmse:122.111832+0.251076 test-rmse:124.527005+0.846833
## [255] train-rmse:122.102901+0.252172 test-rmse:124.527394+0.844938
## [256] train-rmse:122.094453+0.250904 test-rmse:124.524959+0.848794
## [257] train-rmse:122.080596+0.252071 test-rmse:124.521792+0.857851
## [258] train-rmse:122.066800+0.253042 test-rmse:124.516440+0.857129
## [259] train-rmse:122.055365+0.254517 test-rmse:124.514583+0.853380
## [260] train-rmse:122.044949+0.255907 test-rmse:124.516748+0.854487
## [261] train-rmse:122.036412+0.257214 test-rmse:124.518443+0.850077
## [262] train-rmse:122.025475+0.257732 test-rmse:124.514955+0.845710
## [263] train-rmse:122.018311+0.258120 test-rmse:124.513653+0.847132
## [264] train-rmse:122.010173+0.255542 test-rmse:124.512569+0.850095
## [265] train-rmse:122.001848+0.251652 test-rmse:124.510217+0.851443
## [266] train-rmse:121.989041+0.250583 test-rmse:124.502580+0.856601
## [267] train-rmse:121.979224+0.254724 test-rmse:124.499126+0.855581
## [268] train-rmse:121.965256+0.250613 test-rmse:124.493661+0.860872
## [269] train-rmse:121.955314+0.250351 test-rmse:124.493501+0.857668
## [270] train-rmse:121.943658+0.249594 test-rmse:124.492421+0.860339
## [271] train-rmse:121.934070+0.246624 test-rmse:124.489186+0.861841
## [272] train-rmse:121.926930+0.245577 test-rmse:124.487033+0.860466
## [273] train-rmse:121.916984+0.244214 test-rmse:124.486890+0.864847
## [274] train-rmse:121.908665+0.244890 test-rmse:124.488899+0.866079
## [275] train-rmse:121.899449+0.246839 test-rmse:124.488402+0.862195
## [276] train-rmse:121.889856+0.246889 test-rmse:124.487773+0.870973
## [277] train-rmse:121.881825+0.248588 test-rmse:124.488286+0.869108
## [278] train-rmse:121.874648+0.249380 test-rmse:124.485077+0.871577
## [279] train-rmse:121.862349+0.249979 test-rmse:124.478630+0.876256
## [280] train-rmse:121.853583+0.245501 test-rmse:124.474550+0.879420
## [281] train-rmse:121.843787+0.245976 test-rmse:124.477582+0.877575
## [282] train-rmse:121.831903+0.245720 test-rmse:124.479935+0.882763
## [283] train-rmse:121.825481+0.248103 test-rmse:124.480264+0.876562
## [284] train-rmse:121.816826+0.249538 test-rmse:124.483537+0.876783
## [285] train-rmse:121.809236+0.250408 test-rmse:124.486954+0.872295
## [286] train-rmse:121.802962+0.251660 test-rmse:124.486531+0.870028
## [287] train-rmse:121.791768+0.249469 test-rmse:124.481787+0.869768
## [288] train-rmse:121.784113+0.250063 test-rmse:124.478834+0.865037
## [289] train-rmse:121.773612+0.251047 test-rmse:124.475958+0.866396
## [290] train-rmse:121.759785+0.247887 test-rmse:124.471977+0.865374
## [291] train-rmse:121.749579+0.247080 test-rmse:124.472113+0.866959
## [292] train-rmse:121.742268+0.242696 test-rmse:124.472479+0.868902
## [293] train-rmse:121.733507+0.239906 test-rmse:124.470796+0.873836
## [294] train-rmse:121.724316+0.239329 test-rmse:124.471655+0.875208
## [295] train-rmse:121.716275+0.239074 test-rmse:124.475843+0.874817
## [296] train-rmse:121.706418+0.235611 test-rmse:124.474968+0.874960
## [297] train-rmse:121.697996+0.235786 test-rmse:124.475877+0.878080
## [298] train-rmse:121.690814+0.237220 test-rmse:124.476650+0.878388
## [299] train-rmse:121.681207+0.236652 test-rmse:124.478523+0.876216
## [300] train-rmse:121.672929+0.235002 test-rmse:124.480212+0.877682
## [301] train-rmse:121.662836+0.234977 test-rmse:124.480942+0.882108
## [302] train-rmse:121.652294+0.235627 test-rmse:124.480798+0.883366
## [303] train-rmse:121.643468+0.234400 test-rmse:124.483426+0.884490
## Stopping. Best iteration:
## [293] train-rmse:121.733507+0.239906 test-rmse:124.470796+0.873836
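The log above stops at round 303 and reports round 293 as the best iteration, which matches a patience of 10 rounds without improvement. The sketch below replays that kind of rule on a made-up score vector. The helper `early_stop` is hypothetical and is not part of xgboost; it only illustrates the principle under the assumption that lower scores are better.

```r
# Hypothetical helper (not an xgboost function): replay an early-stopping
# rule on a vector of per-round validation scores, lower being better.
early_stop <- function(scores, patience) {
  best_iter <- 1L
  for (i in seq_along(scores)) {
    if (scores[i] < scores[best_iter]) {
      best_iter <- i
    } else if (i - best_iter >= patience) {
      # No improvement during the last `patience` rounds: stop here.
      return(list(best_iteration = best_iter, stopped_at = i))
    }
  }
  # The rule never triggered: all rounds were used.
  list(best_iteration = best_iter, stopped_at = length(scores))
}

# Toy scores (made up for illustration): the minimum is at round 4.
toy_rmse <- c(10.0, 9.0, 8.5, 8.2, 8.3, 8.4, 8.25, 8.3)
res <- early_stop(toy_rmse, patience = 3)
print(res$best_iteration)
print(res$stopped_at)
```

With a patience of 3, the toy run stops three rounds after its best score, in the same way that the CV above stops ten rounds after round 293.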
y1_preds <- predict(gb_dt_final,dtest)
y1_preds %>% head
## [1] 356.4333 361.2431 286.2780 266.6343 268.5483 345.5236
pred_ys=yo_preds+y1_preds
pred_final=data.frame(yo_preds,y1_preds,pred_ys)
pred_final %>% head
## yo_preds y1_preds pred_ys
## 1 135.6260 356.4333 492.0593
## 2 155.6035 361.2431 516.8466
## 3 161.2986 286.2780 447.5766
## 4 121.7719 266.6343 388.4062
## 5 152.7737 268.5483 421.3221
## 6 200.6128 345.5236 546.1363
pred_final$id<-x_test$emergency.vehicle.selection
Move the id column to the front.
pred_final <- pred_final[, c("id", "yo_preds", "y1_preds", "pred_ys")]
Restore the row order of the original test set.
pred_final %>% head
## id yo_preds y1_preds pred_ys
## 1 4715068 135.6260 356.4333 492.0593
## 2 4714816 155.6035 361.2431 516.8466
## 3 4713710 161.2986 286.2780 447.5766
## 4 4713748 121.7719 266.6343 388.4062
## 5 4713778 152.7737 268.5483 421.3221
## 6 4713812 200.6128 345.5236 546.1363
id_order %>% head
## x_test.emergency.vehicle.selection rank rank
## 1 5271704 1 1
## 2 5092931 2 2
## 3 5153756 3 3
## 4 5355572 4 4
## 5 5178915 5 5
## 6 5206885 6 6
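len <- nrow(id_order)
# Note: cbind() appends another rank column every time this chunk is re-run,
# which is why the printed output shows duplicated rank columns.
id_order <- cbind(id_order, rank = seq_len(len))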
id_order %>% head
## x_test.emergency.vehicle.selection rank rank rank
## 1 5271704 1 1 1
## 2 5092931 2 2 2
## 3 5153756 3 3 3
## 4 5355572 4 4 4
## 5 5178915 5 5 5
## 6 5206885 6 6 6
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
## id yo_preds y1_preds pred_ys rank rank.1 rank.2
## 1 4713710 161.2986 286.2780 447.5766 14082 14082 14082
## 2 4713748 121.7719 266.6343 388.4062 25965 25965 25965
## 3 4713778 152.7737 268.5483 421.3221 14465 14465 14465
## 4 4713812 200.6128 345.5236 546.1363 79530 79530 79530
## 5 4713821 135.7495 531.8088 667.5584 47824 47824 47824
## 6 4713863 152.5117 234.3796 386.8913 2072 2072 2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
## id yo_preds y1_preds pred_ys rank rank.1 rank.2
## 77897 5271704 116.7321 407.9039 524.6360 1 1 1
## 59352 5092931 134.7929 401.6206 536.4135 2 2 2
## 68483 5153756 108.1999 238.5139 346.7138 3 3 3
## 91242 5355572 220.5143 296.4528 516.9671 4 4 4
## 72358 5178915 126.3587 364.0570 490.4157 5 5 5
## 77062 5206885 125.2095 376.1835 501.3930 6 6 6
y_final %>% setDT
y_final=y_final[,-c("rank")]
y_final %>% head
## id yo_preds y1_preds pred_ys rank.1 rank.2
## 1: 5271704 116.7321 407.9039 524.6360 1 1
## 2: 5092931 134.7929 401.6206 536.4135 2 2
## 3: 5153756 108.1999 238.5139 346.7138 3 3
## 4: 5355572 220.5143 296.4528 516.9671 4 4
## 5: 5178915 126.3587 364.0570 490.4157 5 5
## 6: 5206885 125.2095 376.1835 501.3930 6 6
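The steps above restore the original row order by merging on the id and sorting on a rank column. As an alternative sketch, and not the method used in this notebook, base R's `match()` can reorder the predictions directly. The data frame and ids below are toy values made up for illustration.

```r
# Toy data (made up): predictions keyed by id, in arbitrary order.
pred <- data.frame(id = c(30, 10, 20), pred_ys = c(3.3, 1.1, 2.2))

# The order in which the ids must appear in the submission file.
wanted_ids <- c(10, 20, 30)

# match() returns, for each wanted id, its row position in `pred`.
idx <- match(wanted_ids, pred$id)
stopifnot(!anyNA(idx))  # fail early if an id has no prediction

ordered <- pred[idx, ]
rownames(ordered) <- NULL
print(ordered)
```

This avoids the temporary rank column, at the cost of requiring every wanted id to be present exactly once in the predictions.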
sum(is.na(y_final))
## [1] 0
which(is.na(y_final))
## integer(0)
y_final[is.na(y_final)] <- 0
Write the CSV file. The final submission CSV is then generated with a separate Python script.
fwrite(y_final, "y_test.csv",sep=",")
Drop in R2
10. Neural Net
library(h2o)
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 21 minutes
## H2O cluster timezone: Europe/Paris
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 1 month and 5 days
## H2O cluster name: H2O_started_from_R_swp_kqp692
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.55 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.3 (2020-10-10)
h2o.train <- as.h2o(data.fe.reduced.0)
h2o.test <- as.h2o(x_test.reduced)
# Note: rate is ignored here because adaptive_rate is enabled by default (see the warning below).
h2o.model <- h2o.deeplearning(
  x = setdiff(names(data.fe.reduced.0), "delta.selection.departure"),
  y = "delta.selection.departure",
  training_frame = h2o.train,
  standardize = TRUE,
  hidden = c(100, 100, 100),
  rate = 0.01,
  epochs = 1000,
  seed = 1234
)
## Warning in .h2o.processResponseWarnings(res): rate cannot be specified if adaptive_rate is enabled..
h2o.prediction.y0 <- as.data.frame(h2o.predict(h2o.model, h2o.test))
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle' has levels not trained on: ["1862", "1933",
## "1935", "1976", "2000", "2046", "2061", "2162", "2163", "2214", "2299", "2490",
## "2497", "2510", "2516", "2530", "2538", "2553", "2559", "2564", "2603", "2606",
## "2610", "2617", "2620", "2667", "2702", "2706", "2707", "2713", "2714", "2725",
## "2827", "2858", "2937", "3013", "3023", "3035", "3061", "3062", "3063", "3072",
## "3116", "3120", "3124", "3297", "3391", "3392", "3407", "3417", "3425", "3427",
## "3548", "3552", "4217", "4352", "4358", "4385", "4431", "4440", "4459", "4472",
## "4487", "4490", "4500", "4531", "4561", "4562", "4864", "4881", "4885", "4925",
## "5261", "5262", "5271", "5278", "5655", "5656", "5668", "5769", "5777", "5782",
## "5784", "5786", "5790", "5818", "5824", "5826", "5832", "5835", "5838", "5841",
## "5844", "5847", "5855", "5856", "5868", "5872", "5873", "5874", "5878", "5922",
## "5926", "5939", "5946", "5950", "5961", "5983", "5993", "5994", "6034", "6066"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'location.of.the.event' has levels not trained on: ["152", "253",
## "302", "314"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'rescue.center' has levels not trained on: ["2453", "266267",
## "266268", "266269", "266270", "266276", "266278", "266279", "266295", "266298",
## "266320", "266322", "266326"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle.type' has levels not trained on: ["CESD",
## "CSP", "ESAVI", "SP BALLON", "UMH", "UMH 75", "UMH 92", "UMH 93", "UMH 94", "UMH
## BEAUJ", "UMH BOBI", "UMH DEBREPED", "UMH DIEU", "UMH GARC", "UMH LARIB", "UMH
## MONDOR", "UMH NECK", "UMH PITIE", "VE2I", "VEC", "VELD", "VES", "VIGI", "VIMP",
## "VPB", "VPC GFIS", "VPC GIS", "VRCH BSPP", "VRCP", "VRSD", "VSTI"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'n' has levels not trained on: ["13", "9"]
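The warnings above report categorical levels that occur in the test frame but were never seen during training. A quick way to list such levels before fitting is sketched below in base R. The helper `unseen_levels`, the column name, and the toy values are assumptions made for illustration; they are not part of h2o or of the notebook's data.

```r
# Hypothetical helper: for each listed column, return the values that occur
# in `test` but not in `train`.
unseen_levels <- function(train, test, cols) {
  out <- lapply(cols, function(col) {
    setdiff(unique(as.character(test[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- cols
  out
}

# Toy frames (made up for illustration).
train_toy <- data.frame(vehicle_type = c("A", "B", "A"), stringsAsFactors = FALSE)
test_toy  <- data.frame(vehicle_type = c("A", "C", "D"), stringsAsFactors = FALSE)

unseen <- unseen_levels(train_toy, test_toy, "vehicle_type")
print(unseen)
```

Running such a check on the categorical columns of the training and test tables would show in advance which levels the model cannot have learned from.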
h2o.prediction.y0 %>% head
## predict
## 1 263.7359
## 2 112.7732
## 3 130.5915
## 4 166.9196
## 5 131.5314
## 6 165.8459
h2o.train.1 <- as.h2o(data.fe.reduced.1)
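# Note: `rate` only takes effect when adaptive_rate = FALSE. With the default
# adaptive_rate = TRUE it is ignored, as the warning below reports.
h2o.model <- h2o.deeplearning(
  x = setdiff(names(data.fe.reduced.1), "delta.departure.presentation"),
  y = "delta.departure.presentation",
  training_frame = h2o.train.1,
  standardize = TRUE,
  hidden = c(100, 100, 100),
  rate = 0.01,
  epochs = 1000,
  seed = 1234
)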
## Warning in .h2o.processResponseWarnings(res): rate cannot be specified if adaptive_rate is enabled..
h2o.prediction.y1 <- as.data.frame(h2o.predict(h2o.model, h2o.test))
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle' has levels not trained on: ["1862", "1933",
## "1935", "1976", "2000", "2046", "2061", "2162", "2163", "2214", "2299", "2490",
## "2497", "2510", "2516", "2530", "2538", "2553", "2559", "2564", "2603", "2606",
## "2610", "2617", "2620", "2667", "2702", "2706", "2707", "2713", "2714", "2725",
## "2827", "2858", "2937", "3013", "3023", "3035", "3061", "3062", "3063", "3072",
## "3116", "3120", "3124", "3297", "3391", "3392", "3407", "3417", "3425", "3427",
## "3548", "3552", "4217", "4352", "4358", "4385", "4431", "4440", "4459", "4472",
## "4487", "4490", "4500", "4531", "4561", "4562", "4864", "4881", "4885", "4925",
## "5261", "5262", "5271", "5278", "5655", "5656", "5668", "5769", "5777", "5782",
## "5784", "5786", "5790", "5818", "5824", "5826", "5832", "5835", "5838", "5841",
## "5844", "5847", "5855", "5856", "5868", "5872", "5873", "5874", "5878", "5922",
## "5926", "5939", "5946", "5950", "5961", "5983", "5993", "5994", "6034", "6066"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'location.of.the.event' has levels not trained on: ["152", "253",
## "302", "314"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'rescue.center' has levels not trained on: ["2453", "266267",
## "266268", "266269", "266270", "266276", "266278", "266279", "266295", "266298",
## "266320", "266322", "266326"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'emergency.vehicle.type' has levels not trained on: ["CESD",
## "CSP", "ESAVI", "SP BALLON", "UMH", "UMH 75", "UMH 92", "UMH 93", "UMH 94", "UMH
## BEAUJ", "UMH BOBI", "UMH DEBREPED", "UMH DIEU", "UMH GARC", "UMH LARIB", "UMH
## MONDOR", "UMH NECK", "UMH PITIE", "VE2I", "VEC", "VELD", "VES", "VIGI", "VIMP",
## "VPB", "VPC GFIS", "VPC GIS", "VRCH BSPP", "VRCP", "VRSD", "VSTI"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'n' has levels not trained on: ["13", "9"]
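The warnings above report categorical levels that appear in the test set but not in the training set. A minimal base R sketch for listing such levels before prediction is given here; the two toy data frames and the helper name `unseen_levels` are assumptions made for illustration, not part of the original pipeline.

```r
# Toy data standing in for the training and test sets (hypothetical values).
train <- data.frame(
  emergency.vehicle.type = c("VSAV BSPP", "PSE", "VSAV BSPP"),
  n = c("1", "2", "1"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  emergency.vehicle.type = c("VSAV BSPP", "UMH", "VIGI"),
  n = c("1", "9", "13"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each categorical column, list the test levels
# that never occur in the training data.
unseen_levels <- function(train, test, cols) {
  out <- lapply(cols, function(col) {
    setdiff(unique(as.character(test[[col]])), unique(as.character(train[[col]])))
  })
  names(out) <- cols
  out
}

unseen <- unseen_levels(train, test, c("emergency.vehicle.type", "n"))
print(unseen)

# Share of test rows that carry at least one unseen level.
affected <- Reduce(`|`, lapply(names(unseen), function(col) test[[col]] %in% unseen[[col]]))
mean(affected)
```

Knowing how many test rows are affected helps decide whether the unseen levels can be tolerated or whether they should be grouped into a broader category before training.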
# Sum the two model predictions to obtain the total predicted delay.
h20_ys <- h2o.prediction.y0 + h2o.prediction.y1
pred_final <- data.frame(h2o.prediction.y0, h2o.prediction.y1, h20_ys)
pred_final %>% head
## predict predict.1 predict.2
## 1 263.7359 346.5793 610.3153
## 2 112.7732 413.5207 526.2940
## 3 130.5915 246.1286 376.7201
## 4 166.9196 283.9014 450.8210
## 5 131.5314 348.4830 480.0144
## 6 165.8459 285.0595 450.9053
pred_final$id<-x_test$emergency.vehicle.selection
Change the column order so that the id comes first.
pred_final <- pred_final[, c(4, 1, 2, 3)]
Retrieve the row order of the original test set.
pred_final %>% head
## id predict predict.1 predict.2
## 1 4715068 263.7359 346.5793 610.3153
## 2 4714816 112.7732 413.5207 526.2940
## 3 4713710 130.5915 246.1286 376.7201
## 4 4713748 166.9196 283.9014 450.8210
## 5 4713778 131.5314 348.4830 480.0144
## 6 4713812 165.8459 285.0595 450.9053
id_order %>% head
## x_test.emergency.vehicle.selection rank rank rank
## 1 5271704 1 1 1
## 2 5092931 2 2 2
## 3 5153756 3 3 3
## 4 5355572 4 4 4
## 5 5178915 5 5 5
## 6 5206885 6 6 6
len <- dim(id_order)[1]
id_order <- cbind(id_order, rank=1:len)
id_order %>% head
## x_test.emergency.vehicle.selection rank rank rank rank
## 1 5271704 1 1 1 1
## 2 5092931 2 2 2 2
## 3 5153756 3 3 3 3
## 4 5355572 4 4 4 4
## 5 5178915 5 5 5 5
## 6 5206885 6 6 6 6
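The repeated `rank` columns in the output above suggest that the `cbind()` chunk was executed several times, each run appending one more column. A small sketch of an idempotent alternative follows; the helper name `add_rank` is hypothetical and the toy frame reuses the first three ids printed above.

```r
# Toy stand-in for id_order, using the first three ids shown above.
id_order <- data.frame(x_test.emergency.vehicle.selection = c(5271704, 5092931, 5153756))

# Hypothetical helper: assigning by name overwrites an existing `rank` column
# instead of appending a new one, so the chunk can be re-run safely.
add_rank <- function(df) {
  df$rank <- seq_len(nrow(df))
  df
}

id_order <- add_rank(id_order)
id_order <- add_rank(id_order)  # a second run leaves the frame unchanged
names(id_order)
```

With a single `rank` column, the later merge would not produce `rank.1`, `rank.2` and `rank.3`, and no column would need to be dropped afterwards.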
y_final=merge(pred_final,id_order, by.x = 'id', by.y = 'x_test.emergency.vehicle.selection', all = FALSE)
y_final %>% head
## id predict predict.1 predict.2 rank rank.1 rank.2 rank.3
## 1 4713710 130.5915 246.1286 376.7201 14082 14082 14082 14082
## 2 4713748 166.9196 283.9014 450.8210 25965 25965 25965 25965
## 3 4713778 131.5314 348.4830 480.0144 14465 14465 14465 14465
## 4 4713812 165.8459 285.0595 450.9053 79530 79530 79530 79530
## 5 4713821 151.7333 572.5734 724.3068 47824 47824 47824 47824
## 6 4713863 124.6063 148.1335 272.7398 2072 2072 2072 2072
y_final=y_final[order(y_final[,'rank']),]
y_final %>% head
## id predict predict.1 predict.2 rank rank.1 rank.2 rank.3
## 77897 5271704 112.6084 411.4832 524.0916 1 1 1 1
## 59352 5092931 120.0579 409.7404 529.7983 2 2 2 2
## 68483 5153756 122.7251 238.9925 361.7176 3 3 3 3
## 91242 5355572 247.5476 221.9852 469.5328 4 4 4 4
## 72358 5178915 140.4392 264.6780 405.1172 5 5 5 5
## 77062 5206885 130.5567 258.7095 389.2663 6 6 6 6
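The merge followed by a sort on `rank` restores the order of the original test file. As an alternative sketch, `match()` can do the same in one step without an auxiliary rank column, assuming that ids are unique. The toy values below are taken from the outputs above and `original_ids` is a hypothetical name for the ids in their original order.

```r
# Toy predictions keyed by id, in arbitrary order.
pred_final <- data.frame(
  id = c(4713710, 5271704, 5092931),
  predict.2 = c(376.7201, 524.0916, 529.7983)
)
# Ids in the order of the original test file (hypothetical vector name).
original_ids <- c(5271704, 5092931, 4713710)

# match() returns, for each original id, its row position in pred_final.
idx <- match(original_ids, pred_final$id)
stopifnot(!anyNA(idx))  # every original id must have a prediction

y_ordered <- pred_final[idx, ]
rownames(y_ordered) <- NULL
y_ordered
```

The `stopifnot()` guard makes a missing id fail loudly instead of silently dropping rows, which is what an inner merge (`all = FALSE`) would do.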
y_final %>% setDT
y_final=y_final[,-c("rank.1")]
y_final %>% head
## id predict predict.1 predict.2 rank rank.2 rank.3
## 1: 5271704 112.6084 411.4832 524.0916 1 1 1
## 2: 5092931 120.0579 409.7404 529.7983 2 2 2
## 3: 5153756 122.7251 238.9925 361.7176 3 3 3
## 4: 5355572 247.5476 221.9852 469.5328 4 4 4
## 5: 5178915 140.4392 264.6780 405.1172 5 5 5
## 6: 5206885 130.5567 258.7095 389.2663 6 6 6
sum(is.na(y_final))
## [1] 0
which(is.na(y_final))
## integer(0)
y_final[is.na(y_final)] <- 0
Write the CSV file. The properly formatted CSV is then generated with the Python script.
fwrite(y_final, "y_test.csv",sep=",")
What could be done to increase our R2?
Improve feature engineering:
* Add the supplementary files from the data.
* Coordinates: a lot remains to be done, e.g. direction and distance in km from the centre of Paris (traffic issue).
* Mean speed per vehicle type.
* Mean speed per rescue center.
* Holidays or not (traffic issue).
* We removed the GPS data because of NAs, but it may be important.
* Vehicle type.
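The coordinate ideas above (direction and distance from the centre of Paris) could be sketched as follows. This is only an illustration: the column names `latitude` and `longitude`, the helper names, and the reference point used for the centre of Paris are assumptions, not values taken from the dataset.

```r
# Assumed reference point: approximate centre of Paris (degrees).
paris_lat <- 48.8566
paris_lon <- 2.3522

# Great-circle distance in km between two points given in degrees (haversine formula).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Initial bearing in degrees (0 = north, 90 = east) from point 1 to point 2.
bearing_deg <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  y <- sin(dlon) * cos(lat2 * to_rad)
  x <- cos(lat1 * to_rad) * sin(lat2 * to_rad) -
    sin(lat1 * to_rad) * cos(lat2 * to_rad) * cos(dlon)
  (atan2(y, x) / to_rad + 360) %% 360
}

# Hypothetical intervention coordinates.
interventions <- data.frame(latitude = c(48.8566, 48.9566), longitude = c(2.3522, 2.3522))
interventions$km.from.center <- haversine_km(paris_lat, paris_lon,
                                             interventions$latitude, interventions$longitude)
interventions$direction <- bearing_deg(paris_lat, paris_lon,
                                       interventions$latitude, interventions$longitude)
interventions
```

Both features are cheap to compute and could stand in for traffic effects that the raw coordinates only encode implicitly.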
Add a proper cross-validation and model selection step: once the best model is identified, move on to hyperparameter tuning.
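A cross-validated comparison of candidate models could look like the sketch below. It uses base R `lm()` on synthetic data purely as a stand-in for the H2O models of this notebook; the data, the column names and the helper `cv_r2` are all hypothetical.

```r
set.seed(1234)

# Synthetic data standing in for the engineered features (hypothetical columns).
n <- 200
toy <- data.frame(distance = runif(n, 0, 10))
toy$response.time <- 120 + 30 * toy$distance + rnorm(n, sd = 20)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Out-of-fold R2: fit on k - 1 folds, predict the held-out fold, then score
# all held-out predictions together.
cv_r2 <- function(formula, data, fold, target) {
  pred <- numeric(nrow(data))
  for (f in sort(unique(fold))) {
    fit <- lm(formula, data = data[fold != f, , drop = FALSE])
    pred[fold == f] <- predict(fit, newdata = data[fold == f, , drop = FALSE])
  }
  y <- data[[target]]
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Candidate models to compare on the same folds.
candidates <- list(
  intercept.only = response.time ~ 1,
  linear = response.time ~ distance
)
scores <- sapply(candidates, cv_r2, data = toy, fold = fold, target = "response.time")
scores
names(which.max(scores))
```

Reusing the same fold assignment for every candidate keeps the comparison fair. Hyperparameter tuning would then be run only on the model that wins this comparison.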
The following is a list of helpful contributors.
Thank you for reading!