BAKERS is a software development company that has been existing since the 1980s. With the current evolution in AI and business analytic, they are far behind the competition in terms of productivity. The CEO believes the issue comes from his employees so he decides to hire a new human resource (HR) manager. The new HR manager is who just arrived seeks to understand the employee structure of the company to determine if the employees could be the reason why the company is not very competitive in today’s software business industry and how to solve this issue if it is the case. To this end, the HR manager learns that there is a database that contains all information on the employees in the company and also learns that there is a business analysis team in the company that provides analytic support to other departments. HR decides to reach out to have their informed opinion on the current situation of employees in the company and what could be done to improve the situation.
You are the team of business analysts working for BAKERS. Based on the HR’s request, you are expected to make at least 15 recommendations on how to improve the state and conditions of employees in the company to make it more competitive in its industry. Be creative - your advice can be anything meaningful for the HR (e.g. hire new staff, increase salaries for a given group of employees, fire some employees, etc.). All your justifications MUST be based on evidence from the database. Each suggestion should be presented in four subsections:
HR has also provided a list of requests that must be completed. You are expected to provide a minimum of 10 recommendations based on the requests of the HR, and a minimum of five recommendations based on queries you conceived as an expert. One query can lead to more than one recommendation.
Follow the instructions carefully. Do not provide irrelevant columns. All your column titles should bare meaningful and well-formatted headings. For example, do not provide employee ID if it is not necessary/required to understand your suggestion. However, your tables should have rich content for the manager. For example, if the list of employees with the lowest salary is requested, it will be better if you provide the name of the employee, the salary earned, and add the department and/or job title or years of experience to help develop interesting insights and arguments.
List of queries that must be completed for the HR. The HR manager will like you to please provide the following information (remember: the queries and the results (provide first 10 rows if the result is a table)):
To meet your objectives, the database administrator has provided you with a MySQL database for employees. Figure 1 presents the database schema. You are expected to query the database provided to obtain the information necessary for you to advise the new HR. To setup the database, it suffices to import the “employees.sql” file in the employees zipped folder provided. DO NOT DELETE ANY FILE IN THE FOLDER PROVIDED! Make sure to use MySQL workbench or MySQL shell to avoid incompatibilities.
The department’s analytic team would like to specify the main hypothesis after receiving this database. All the analyses are carried out at the date 2020. We therefore place ourselves in the situation where the database we received from HR is up to date for this year.
List of employees with the following information: First name, last name, gender, age, number of years spent in the company, department, the year they joined the department, and their current job title.
SELECT first_name, last_name, gender, dept_name, TIMESTAMPDIFF(YEAR,birth_date,NOW()) AS age, TIMESTAMPDIFF(YEAR,hire_date,NOW()) AS years_in_the_company, dept_name, YEAR(de.from_date) AS year_of_entrance_in_the_department, title
FROM employees AS e
INNER JOIN dept_emp AS de
ON e.emp_no = de.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
INNER JOIN titles AS t
ON e.emp_no = t.emp_no
WHERE t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
AND de.to_date > CURDATE() #esclude the people who are now retired
ORDER BY 1, 2
LIMIT 10;
The database given by the database administrator is well structured and seems to be of good quality as we can clearly read on this first query.
This first query provides an overview of the employees by aggregating information from different tables.
If it is not already done, it would be possible to export this table to the HR department in csv or excel so that the department has a database that can be used by them and updated.
SELECT dept_name, COUNT(*) AS nb_of_employes
FROM dept_emp AS de
INNER JOIN departments AS d
ON d.dept_no = de.dept_no
WHERE de.to_date > CURDATE() #exclude the people who are not anymore in the department (retired, who left the company, who changed the department...)
GROUP BY d.dept_no
ORDER BY 2 DESC;
Baker’s HR structure seems to rely primarily on an active pool of developers and producers. Supported by the third largest department in terms of employees, Sales, these two departments deploy and produce software and then sell it.
This seems to be a very classic structure of a software company developing in the 2000s. However, we can notice the very low weight of the research department, essential in the AI sector and the customer service department, crucial in the analytics business (understanding and accompanying the customer needs and not only selling at any price).
Thus, one could recommend more interdepartmental transfers. Encourage developers to go to the research department for some time or foster recruitment in under-represented but strategic departments.
SELECT first_name, last_name, dept_name, salary, title
FROM employees AS e
INNER JOIN dept_emp AS de
ON de.emp_no = e.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
INNER JOIN salaries AS s
ON e.emp_no = s.emp_no
INNER JOIN titles AS t
ON e.emp_no = t.emp_no
WHERE de.to_date > CURDATE() #exclude the people who are not anymore in the department (retired, who left the company, who changed the department...)
AND t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
AND s.to_date > CURDATE() #include only current salary
ORDER BY 3, 2, 1
LIMIT 10;
Make a separate list for each department.
SELECT departments.dept_name as 'Department', GROUP_CONCAT(
CONCAT_WS(', ', employees.first_name, employees.last_name, titles.title, salaries.salary) SEPARATOR " / " )
as 'Employees, their positions and their salaries'
FROM employees
INNER JOIN dept_emp ON dept_emp.emp_no = employees.emp_no
LEFT JOIN departments ON departments.dept_no = dept_emp.dept_no
INNER JOIN titles ON titles.emp_no = employees.emp_no
INNER JOIN salaries ON salaries.emp_no = employees.emp_no AND salaries.to_date = '9999-01-01'
GROUP BY departments.dept_no;
These lists can be sent to the HR department in csv or xls format for further internal analysis.
SELECT title, ROUND(AVG(salary)) AS avg_salary
FROM salaries AS s
INNER JOIN titles AS t
ON s.emp_no = t.emp_no
WHERE s.to_date > CURDATE() #include only current salary
AND t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
GROUP BY 1
ORDER BY 2 DESC;
According to indeed.com data, the average salary for a engineer is $71,220 per year in the United States (november 2020) and $107,376 per year for Software Engineer. Here we can observe a considerable gap between these average figures and the average engineer’s salary. One could assume that salaries are not updated to 2020, but this would contradict our initial hypothesis. In any case this gap seems significant with market prices.
Talent recruitment is a key point to maintain productivity and competitiveness in the software market.
Thus, a first recommendation would be to increase the salary level of engineers and review recruitment strategies in order to attract the most competitive engineers on the market.
For the other positions, the analysis team does not have enough data to truly make definitive recommendations. However, Bakers HR department could review its salary policy to ensure that staff and technical leaders, who are better paid than the average engineer, bring truly productivity among teams.
SELECT AVG(TIMESTAMPDIFF(year,birth_date,NOW())) AS avg_age
FROM employees AS e;
This average age is extremely high for a technology company. According to Business Insider, the average age of IBM employee was 38 in 2017 and 33 years old for Microsoft.
Are “young people just smarter”, as Mark Zuckerberg claimed in 2007, or are young people better able to bring about technological change and increase their company’s competitiveness? We leave this answer to the Strategic teams. It seems clear, however, that Backer has a real need to review its recruitment policy, which for the most part allowed its former employees recruited in the 1980s to end their careers here, closing the doors to younger people.
SELECT dept_name, AVG(TIMESTAMPDIFF(year,birth_date,now())) AS avg_age
FROM employees AS e
INNER JOIN dept_emp AS de
ON e.emp_no = de. emp_no
INNER JOIN departments AS d
ON d.dept_no = de.dept_no
WHERE de.to_date > CURDATE() #exclude the people who are not anymore in the department (retired, who left the company, who changed the department...)
GROUP BY 1
ORDER BY 2;
The average age per department is homogeneous, there are no departments that are really younger or older. The analysis observed in 5.1 is confirmed. Let’s analyse average age by job position.
Same Recommendations 5.1.
SELECT title, AVG(TIMESTAMPDIFF(year,birth_date,now())) AS avg_age
FROM employees AS e
INNER JOIN titles AS t
ON e.emp_no = t. emp_no
WHERE t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
GROUP BY 1
ORDER BY 2;
Managers are on average 1 year younger than other employees of the company. This small difference does not change analysis 5.1, which remains the main analysis. The structure of middle-aged positions remains very old in line with other companies in the sector.
Recruitment of younger people, more trainings between senior and junior,
SELECT dept_name, SUM(CASE WHEN gender = "M" THEN 1 ELSE 0 END)/SUM(CASE WHEN gender = "F" THEN 1 ELSE 0 END) AS men_women_ratio
FROM employees AS e
INNER JOIN dept_emp AS de
ON e.emp_no = de. emp_no
INNER JOIN departments AS d
ON d.dept_no = de.dept_no
WHERE de.to_date > CURDATE() #exclude the people who are not anymore in the department (retired, who left the company, who changed the department...)
GROUP BY 1
ORDER BY 2;
As many technology companies, the ratio of men employed in our company is more important than the ratio of women. However, although it requires greater efforts to erase this gap, we can note that the ratio in our company is not as important as the ratio in GAFAM. Indeed, according to Statistica, in 2020 about 25% of GAFAM’s employees are female and the ratio of men to women in engineering is 5:1. The average ratio within our company seems to be around 1.5.
Technology, and more particularly Artificial Intelligence, cannot be achieved without a true cultural and gender mix. Tomorrow’s AI will have to talk to everyone. Recruitment that is more attentive to gender diversity is essential, as well as support for the training of women developers and will bring key factors of competitiveness to the market.
SELECT title, SUM(CASE WHEN gender = "M" THEN 1 ELSE 0 END)/SUM(CASE WHEN gender = "F" THEN 1 ELSE 0 END) AS men_women_ratio
FROM employees AS e
INNER JOIN titles AS t
ON e.emp_no = t.emp_no
WHERE t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
GROUP BY 1
ORDER BY 2;
We can observe a significant reduction in the ratio of men in management positions.
This decrease is to be congratulated and encouraged for all management positions.
SELECT first_name, last_name, min_salary, gender, TIMESTAMPDIFF(YEAR,birth_date,NOW()) AS age, TIMESTAMPDIFF(YEAR,hire_date,NOW()) AS years_in_the_company
FROM employees AS e
INNER JOIN salaries AS s
ON e.emp_no = s.emp_no
INNER JOIN (SELECT MIN(salary) AS min_salary FROM salaries) AS m
ON s.salary = m.min_salary
WHERE s.to_date > CURDATE() #include only current salary
ORDER BY 3;
SELECT first_name, last_name, max_salary, gender, TIMESTAMPDIFF(YEAR,birth_date,NOW()) AS age, TIMESTAMPDIFF(YEAR,hire_date,NOW()) AS years_in_the_company
FROM employees AS e
INNER JOIN salaries AS s
ON e.emp_no = s.emp_no
INNER JOIN (SELECT MAX(salary) AS max_salary FROM salaries) AS m
ON s.salary = m.max_salary
WHERE s.to_date > CURDATE() #include only current salary
ORDER BY 3 DESC;
SELECT first_name, last_name, salary, gender, TIMESTAMPDIFF(YEAR,birth_date,NOW()) AS age, TIMESTAMPDIFF(YEAR,hire_date,NOW()) AS years_in_the_company
FROM employees AS e
INNER JOIN salaries AS s
ON e.emp_no = s.emp_no
WHERE salary > (SELECT AVG(salary) FROM employees.salaries)
AND s.to_date > CURDATE() #include only current salary
ORDER BY 3 DESC
LIMIT 10
In general, the most important factor influencing the salary is the number of years in the company: in fact, the one with the highest salary in the company has worked 9 years longer in the company than the one with the lowest salary in the company. In addition, those who have the highest salary in the company have worked more than 30 years in the company. Another factor that influences the salary is gender: it can be seen that among those who have the above average salary, most are male.
The proportionality between the number of years worked in the company and the salary seems logical. Be more careful that for the same number of years worked and role, there is no gender difference, so that there are about the same number of males and females in the above-average and below-average salaries.
SELECT first_name, last_name, hire_date,title
FROM employees AS e, titles AS t
WHERE EXTRACT(MONTH FROM hire_date) = 12 AND e.emp_no=t.emp_no
ORDER BY 3 DESC
LIMIT 10;
The majority of the new recruitment concern staff. For two new engineers, one assistant and one senior are recruited with one technique leader. This suggests that there is already a change in the company’s recruitment policy. By recruiting more staff, and not more engineers, the company is relying on its pool of engineers.
Recruitment of more young engineers, decelerate the recruitment of senior engineers, promote assistant and senior training.
SELECT first_name, last_name, dept_name, YEAR(CURDATE())-YEAR(hire_date) AS 'Years in the company'
FROM employees AS e
INNER JOIN dept_emp AS de
ON e.emp_no = de.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
WHERE de.to_date > CURDATE()
AND ((dept_name, (YEAR(CURDATE())-YEAR(hire_date))) IN (
SELECT dept_name, max(YEAR(CURDATE()) - YEAR(hire_date)) AS 'Most experimented'
FROM employees AS e
INNER JOIN dept_emp AS de
ON e.emp_no = de.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
WHERE de.to_date > CURDATE()
GROUP BY 1
))
ORDER BY 4 DESC
LIMIT 10;
All of the most experienced employees belong to the Customer Service department with 35 years of experience in the company. This shows a very high level of loyalty to the company of these employees who most likely have a great deal of valuable expertise.
However, if a change in HR policy is confirmed by recruiting more young developers and focusing more on the AI sectors, a renewal of employees within Customer Service is to be considered. This is indeed justified by the increasing demand for support of companies within the IA and BA.
SELECT first_name, last_name, dept_name, YEAR(CURDATE())-YEAR(hire_date) AS 'Years in the company'
FROM employees AS e
INNER JOIN dept_emp AS de
ON e.emp_no = de.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
WHERE de.to_date > CURDATE() AND YEAR(CURDATE())-YEAR(hire_date) = (
SELECT MIN(YEAR(CURDATE())-YEAR(hire_date))
FROM employees)
ORDER BY 4, 3
LIMIT 10;
No new recruitment since 20 years.
Either the database is good and here our analytic team urgently recommends to hire new engineers. Or our team recommends an improvement in the communication regarding the HR database. It must be urgently updated by the database administrator for a complete and accurate analysis.
SELECT title, COUNT(title) AS nb_of_employees, CONCAT(COUNT(title)*100/(SELECT COUNT(*) FROM titles WHERE to_date > CURDATE()), " %") AS percentage
FROM titles
WHERE to_date > CURDATE()
GROUP BY 1
ORDER BY 1;
Job positions at Bakers are mainly Senior Engineer and Senior Staff (about three quarters of the total).
A first recommendation is to reduce the gap between the number of seniors and the number of engineers and staff in order to rejuvenate the company in terms of age and to have fresh forces ready to work at the top of the senior category. A second recommendation is to increase the number of managers a bit, as only 9 managers running a company with more than 100000 employees seem to be very few.
SELECT title, gender, ROUND(AVG(salary)) AS avg_salary
FROM salaries AS s
INNER JOIN titles AS t
ON s.emp_no = t.emp_no
INNER JOIN employees AS e
ON s.emp_no = e.emp_no
WHERE s.to_date > CURDATE() #include only current salary
AND t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
GROUP BY 1, 2
ORDER BY 1, 2;
The average salary in all job categories is very similar between males and females, except for the managers category.
In order to avoid gender inequalities, it is necessary to try to have the same salary for men and women in every job category, especially managers (if they are at the same level), because a manager who earns less than another just because of gender can negatively affect the whole environment.
SELECT YEAR(CURDATE())-YEAR(hire_date) AS 'Years in the company', ROUND(AVG(salary)) AS avg_salary
FROM employees AS e
INNER JOIN titles AS t
ON e.emp_no = t.emp_no
INNER JOIN salaries AS s
ON s.emp_no = e.emp_no
GROUP BY 1
ORDER BY 1;
There is an almost perfect linear correlation between the years worked in the company and the salary.
It seems to us a logical criterion to base the salary on the number of years worked in the company so we have no recommendations to make in this respect.
SELECT title, d_name.dept_name, ROUND(MIN(salary)) AS min_salary, ROUND(MAX(salary)) AS max_salary, ROUND(AVG(salary)) AS avg_salary
FROM salaries AS s
INNER JOIN titles AS t
ON s.emp_no = t.emp_no
INNER JOIN dept_emp AS d
ON d.emp_no = s.emp_no
INNER JOIN departments AS d_name ON d_name.dept_no = d.dept_no
WHERE d.to_date > CURDATE() #include onlu current d
AND s.to_date > CURDATE() #include only current salary
AND t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
GROUP BY 1, 2
ORDER BY 1, 2 DESC
LIMIT 10;
Here, we can see that a job position does not have the same average salary in all the departments which can increase the inequality between collaborators. In addition, we can see that a job position salary does not have the same distribution (minimum - maximum) in all the departments which can increase the inequality between collaborators.
Redistribute wages for the same position such as the salary does not depend on the department.
SELECT title, d_name.dept_name, gender, ROUND(MIN(salary)) AS min_salary, ROUND(MAX(salary)) AS max_salary , ROUND(AVG(salary)) AS avg_salary
FROM employees AS e INNER JOIN
salaries AS s ON e.emp_no = s.emp_no
INNER JOIN titles AS t
ON s.emp_no = t.emp_no
INNER JOIN dept_emp AS d
ON d.emp_no = s.emp_no
INNER JOIN departments AS d_name ON d_name.dept_no = d.dept_no
WHERE d.to_date > CURDATE() #include only current d
AND s.to_date > CURDATE() #include only current salary
AND t.to_date > CURDATE() #include only current job title for those who had more than one job title in the company
GROUP BY 1, 2, 3
ORDER BY 1, 2 , 3 DESC
LIMIT 10;
Here, we can see that a job position salary per gender and per department does not have the same distribution in all the departments which can increase the inequality between collaborators.
We can see that, for some positions, there is only one gender.
Rebalances wages for both gender on the same position such as the maximum, the minimum salary, the average salary will not depend on the department. Hire some new collaborators and transfer others in order that positions could be equal in terms of gender.
The mayor of New York will like to reduce pollution in the city by increasing accessibility using public bicycles. You are provided with a “trips” collection in the “citybike” database found in your MongoDB server. It contains bike trips data from the New York City Citibike service. The documents are composed of: Bicycle unique identifier, Trip start and stop time and date, Trip start and end stations names and geospatial location, User information such as gender, year of birth, and service type (Customer or Subscriber).
A. Write MongoDB queries that will do the following:
1. Display all documents in the “trips” collection.
2. Display all documents in the collection except the “tripduration” and “usertype.
3. Display the”start station name" and “end station name” of all documents in the collection.
4. Display all the trips to “South St & Whitehall St”.
5. Display the first 5 trips made by “Subscriber” (usertype).
6. Display the time of the day with the highest number of trips. 7. Display trips made to end station locations with latitude values greater than 40.735238.
8. Propose 3 queries that could help understand the behavior of city bikers.
B. Propose 7 additional queries that would help the mayor understand the movement of New York city bikers.
C. Based on all the queries above, make suggestions on how New York city biking can be improved. Results should be presented as follows:
a. Query:
b. Result:
c. Analysis/interpretation:
d. Recommendations:
Shell
db.trips.find().pretty()
Atlas UI No filter
New York City’s citybike database is well-structured. It contains 1 990 273 trips in a Document Store structure encoded into JSON format. The description delivered above of the database corresponds well to the fields in our database.
Document Store in MongoDB is a very good database storage solution. In case New York City is looking to change its database structure, it could consider Amazon DynamoDB and a Key-Value Store structure (key= travel id). In this case, it will have to pay attention to the lack of consistency at the database level (no ACID test).
Shell
db.trips.find({}, {tripduration :0, usertype:0}).pretty()
Atlas UI
PROJECT:
{tripduration :0, usertype:0}
No particular analysis for this query.
Shell
db.trips.find({}, {'start station name' :1, 'end station name':1, _id:0}).pretty()
Atlas UI
PROJECT:
{‘start station name’ :1, ‘end station name’:1, _id:0}
Each start station name and end station name are well written.
It would be interesting to draw up a map of the trips thanks to the coordinates of each station.
Shell
db.trips.find({"end station name":"South St & Whitehall St"}).pretty()
Atlas UI
PROJECT:
{“end station name”: “South St & Whitehall St”}
South St & Whitehall St bike station doesn’t seem to be a very busy station. Compared to the total number of trips, it totals only 3287 i.e 0.16 % of total trips.
New York City Hall should be attentive to underused stations as well as to stations with strong influence during peak hours. Thus, it would be interesting for the town hall to implement more bike transfers from less used stations to the most used stations. This could lead to more savings.
Shell
db.trips.aggregate([{ $match : {usertype : "Subscriber"}}, { $group: { _id : {start_station_name :"$start station name", end_station_name : "$end station name", usertype : "$usertype"}, count:{$sum:1}}}, {$sort : { count : -1 }}, {$limit : 5} ], {allowDiskUse: true})
Shell
db.trips.aggregate([{ $project : { hour : { $hour : "$start time"}}}, { $group: { _id : "$hour", count:{$sum:1}}}, {$sort : { "count" : -1}}, {$limit : 1}])
Atlas UI
In aggregations
Our query gives us the following result: the id here represents the time slot with the highest number of trips. The analysis is therefore as follows: the time slot during which there is the greatest use of the service is between 5:00 p.m. and 6:00 p.m., for 200561 trips (i.e. 10% of the trips). This likely corresponds to the end of the working day.
The bike service must therefore be particularly vigilant during this period of time by maximizing the number of bikes available at each station. The service can also consider increasing its visibility for this particular period. For instance, offering lower prices for this time slot, or partnering with New York City companies to promote cycling for work-to-home commuting.
Shell
db.trips.find({ "end station location.coordinates.1" : { $gt: 40.735238}}).pretty()
Atlas UI
FILTER: {“end station location.coordinates.1” : { $gt: 40.735238}}
1 106 649 trips ended at a station with a latitude greater than 40.735238 i.e. more than 55% of all trips. This corresponds loosely to the neighborhoods above Brooklyn and one half of the Queens. We can analyze that there is a weak correlation between the number of inhabitants and the number of trips. Indeed, Manhattan, Queens and Bronx represent more than 2 /3 of the population of New York and here only half of the trips.
Shell
db.trips.aggregate([{ $group: { _id : {start_station_name :"$start station name", end_station_name : "$end station name"}, count:{$sum:1}}}, {$sort : { count : -1 }} ])
The first and most frequent circuit is around Central Park. Pastoral circuit, it is probably used more for leisure purposes than for commuting to and from work.
Concerning the second most frequent circuit, we notice the departure from Pershing Square which is located in front of the Grand Central Terminal to go to W33 St and 7 AVE. This could concern more newyorkers commuters.
In general, we notice that the bike trips cross the major avenues that New York knows well.
Because of the great use in Central park, it would be interesting to encourage the installation of bike station within and around the park. A more general recommendation here regarding the use of bicycles on major avenues : a special bike lane installation on the major avenues raised here would be a great advantage for cyclists and could reinforce its use. This could be a priority for the mayor of New York.
Shell
db.trips.aggregate([{ $group: { _id : {start_station_name :"$start station name", end_station_name : "$end station name", usertype : "$usertype"}, count:{$sum:1}}}, {$sort : { count : -1 }} ], {allowDiskUse: true})
This query confirms 8.1 conclusion about Central Park circuit. We can indeed note that the circuit around Central park is mostly used by customers, either tourists or non regular users.
In order to promote the use of bicycles for tourism or leisure, New York City could promote the advertising of the service by tourist agencies or hotels. Propose for example other circuit to make by bike, special 1-week offer, reduced prices with museums etc…
db.trips.aggregate([{ $group: { _id : {start_station_name :"$start station name", end_station_name : "$end station name", hour : { $hour : "$start time"}}, count:{$sum:1}}}, {$sort : { count : -1 }} ], {allowDiskUse: true})
From this query, we can decompose two main types of start :
First, departures from Pershing Square North and Penn Station Valet stations are close to major New York City train stations (respectively Central Station and Pennsylvania Station). This concerns both morning and evening departures.
Then, departures such as the one from E7 st and Avenue A. These departures are made by the inhabitants of the southern district of Manhattan who leave for working in the morning. Just like the departure from Broadway and 24 st, we will call these departures, “direct departures”. In contrast to “indirect departures” from bicycle stations close to train stations .
First of all, concerning indirect departures, our team would like to stress out that this is a good sign of mutli-mobility service : in order to relieve congestion in the city center, use public transportation for long trips and bicycles for short trips. Thus, the main aim is to strengthen the bicycle supply around New York City’s main central stations and encourage people to use bicycles for short trips in the city.
For direct departures from residential areas, such as morning departures, it is important to reinforce the presence of stations and available bicycles in different departure areas.
Let’s strengthen our analysis now with additional queries !
db.trips.aggregate([{ $group: { _id:0, tripAvg: {$avg: { $divide : [{$subtract: ["$stop time", "$start time"]}, 3600000]}}}}])
The average trip time is 0.26 h i.e 15.6 minutes. This is relatively low usage time. Indeed, compared to the average usage time of the Velib’ service used in Paris which is 36 minutes, this shows a low use of the service.
If we take the average speed of 16km/h of a cyclist. This corresponds to an average distance traveled of 4 km which is very low.
Several recommendations :
in addition to the increase of stations number, the price could have a dissuasive effect on the time of use. For instance, offering 30 minutes of free ride like the Velib service could have a real effect on the time of use and thus the distance cycled in NY City.
db.trips.aggregate([{ $group: { _id : {start_station_name :"$start station name", end_station_name : "$end station name", gender : "$gender"}, count:{$sum:1}}}, {$sort : { count : -1 }} ], {allowDiskUse: true})
For our most frequent tour, gender unknown. This seems normal at this stage: we had already analyzed that this type of circuit is mainly used by Customers (tourists or non-regular users) who are not registered in the platform and who do not inform their gender.
For the following circuits, very interesting point: the gender 1, male is over-represented. This could indicate an under-use of the service by women.
New query to do to confirm this analysis in order to count the number of women using the network. If the analysis is confirmed, the city hall must make it a priority to foster the number of women using the service.
This could be due in part to the lack of bicycle safety in New York City. The city should implement more bike-friendly routes and green mobility.
db.trips.aggregate([{ $group: { _id : {usertype :"$usertype", gender : "$gender"}, count:{$sum:1}}}, {$sort : { count : -1 }} ])
The first user of the New York citibyke service is a male subscriber to the service (1 430 749). They represent more than 70% of the users. The previous analysis of the under-representation of women is therefore confirmed here. Women, in total, account for only 20% of users. With occasional users, they represent only 7% of the service.
db.trips.aggregate([{ $match : {"birth year": {$type:"int"}}}, {$group: { _id : 0, AvgAge : {$avg : { $subtract : [ {$convert : {input : { $year : "$$NOW"}, to : "long"}}, {$convert : {input : "$birth year", to : "long"}}]}}}}])
The average age is 43. Our average user of the service is therefore a 43-year-old male subscriber to the service.
db.trips.distinct( "bikeid" ).length
In total, over the period studied, 7 771 bicycles were used. For 1 990 273 trips, everything else being equal, this represents a rate of use per bike of 256 trips per bike during the 3 month-period or 3 trips per bike per day.
Although this utilization rate per bike appears to be quite low, our analytic team recommends to the city council not to give into the pressure to reduce the number of available bicycles. On the contrary, the more bikes available (and particularly at the hot spots identified above), the time of usage and therefore the higher the rate of use per bike.
db.trips.distinct( "start station id" ).length
cf Results below.
db.trips.distinct( "end station id" ).length
We must be careful in our analysis. There are no more stations on one side than on the other since there is a fixed number of stations.
This request, however, gives us the following information: there is more point of arrival than point of departure in the use of the service.
In other words, cyclists leave more from the same place (probably from places close to railway stations) and disperse more on arrival (probably to go home, to work or to their meeting point).
This confirms the use of the service identified above for work - home commuters and reinforces our team’s recommendation to promote the multi-mobility of users (departures from major stations) to workplaces in the city. This, notably by promoting the more dispersed installation of bicycle stations in the city.
db.trips.aggregate([{ $group: { _id : {day : { $dayOfWeek : "$start time"}}, count:{$sum:1}}}, {$sort : { count : -1 }} ])
We consider the standard ISO 8601 for the number of the day.
From this request, we can establish the following classification in terms of use per day: Friday comes first, then Thursday, Wednesday, Saturday, Tuesday, Sunday and then Monday .
So we can make the analysis that the bike is much more used at the end of the work week. Maybe the workers feel more stressed, or have more time, or are in better shape to get on the bike.
For weekends, we can see a low number of users, which reflects low usage outside of commuting.
For the first point, one recommendation would be to install more electric bicycles. Indeed, perhaps installing more of this type of bike would encourage users to use it every day, having more confidence in this means of transportation in terms of the time needed for their trip or because it would require less energy in the morning or when returning home in the evening.
As for the second point, it seems clear now that New York City should encourage more cycling outside the home-work circuits.
Thank you for reading !