
This Python project is the final capstone in Google’s Advanced Data Analytics Course. It focused on developing a predictive model to identify employees at risk of leaving Salifort Motors (employee churn). Using Exploratory Data Analysis (EDA) and a Random Forest classifier, the analysis revealed key factors that influence employee attrition, such as satisfaction, workload, and tenure.

This project was originally created in a GitHub Repository. If you experience any formatting issues, please view the original GitHub project.

Table of Contents

  1. Define - Overview
  2. Method - Dataset
  3. Analyze - Exploratory Data Analysis
  4. Improve - Model Creation
  5. Control - Conclusion and Stakeholder Suggestions

Define - Overview

The objective of this project was to develop a predictive model to identify employees at risk of leaving Salifort Motors. Leveraging exploratory data analysis (EDA) and a Random Forest algorithm, the analysis focused on internal HR and employee survey data to uncover the key drivers of turnover.

The project followed the DMAIC framework—Define, Measure, Analyze, Improve, and Control—to ensure a structured and data-driven approach. The final model achieved strong performance, highlighting employee satisfaction, tenure, and workload as the top predictors of turnover. These insights offer actionable guidance for Salifort Motors leadership to proactively identify retention risks.

All code used to create this project was written by me and can be found here, or in the GitHub repository. The code offers deeper EDA insights and additional graphs.


Method - Dataset

The dataset provided contains 14,999 rows of employees’ self-reported information. Its 10 columns include data on satisfaction level, time spent with the company, promotions, department, salary, and whether the employee left. The full dataset can be found on Kaggle.

image
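A first look at a dataset like this usually checks its size, missing values, duplicate rows, and the overall churn rate. The sketch below shows one way to do that with pandas. The column names and the four sample rows are made-up stand-ins based on the description above, not the real Kaggle file, and `summarize` is a hypothetical helper.

```python
import pandas as pd

# Made-up sample rows with assumed column names; the real dataset has
# 14,999 rows and may name its columns differently.
sample = pd.DataFrame(
    {
        "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
        "time_spend_company": [3, 6, 4, 5],
        "promotion_last_5years": [0, 0, 0, 1],
        "department": ["sales", "sales", "technical", "support"],
        "salary": ["low", "medium", "medium", "high"],
        "left": [1, 0, 1, 0],  # 1 = employee left, 0 = employee stayed
    }
)


def summarize(df: pd.DataFrame) -> dict:
    """Return basic size and data-quality figures for an HR dataset."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "churn_rate": float(df["left"].mean()),
    }


summary = summarize(sample)
print(summary)
```

On the real data, the same helper could be applied to the frame returned by `pd.read_csv` for the downloaded file.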


Analyze - Exploratory Data Analysis

EDA was conducted fully within Python. This section will highlight the most important insights gathered during the data exploration phase.

1. Correlation Heatmap

A correlation heatmap was created to identify which features should be investigated further. It suggests that Last Evaluation, Number of Projects, Average Monthly Hours, and Satisfaction Level all potentially correlate with employees leaving the company.

image
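The numbers behind a heatmap like this come from a pairwise correlation matrix. The following is a minimal sketch using synthetic data, not the project's dataset: the column names are assumptions, and the link between low satisfaction and leaving is built into the fake data on purpose so the correlation is visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; these values are NOT from the real dataset.
satisfaction = rng.uniform(0, 1, n)
hours = rng.normal(200, 50, n)
# By construction, employees with low satisfaction are more likely to leave.
left = (satisfaction + rng.normal(0, 0.2, n) < 0.4).astype(int)

df = pd.DataFrame(
    {
        "satisfaction_level": satisfaction,
        "average_monthly_hours": hours,
        "left": left,
    }
)

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of each feature with the target, sorted from most negative
# to most positive.
left_corr = corr["left"].drop("left").sort_values()
print(left_corr)
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would render the matrix as a colored grid.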

2. Monthly Hours compared to Number of Projects Boxplot

  • Strong correlation between high workload (measured by monthly hours and number of projects) and employee churn.
  • Burnout risk increases with a high project count (6-7) and high working hours (over 200 hours per month).
  • Employees tend to leave the company when assigned under 2 projects per month or over 5 projects per month.

image

3. Monthly Hours Worked Compared to Satisfaction Level

  • Employees who worked under 175 hours per month and employees who worked more than 230 hours per month both have very low satisfaction ratings and high churn rates.

image
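One way to check a pattern like this numerically is to bin monthly hours into bands and compare churn rate and satisfaction per band. The sketch below uses the 175 and 230 hour cut-offs mentioned above. The eight rows are made up for illustration and the column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; NOT the project's data.
df = pd.DataFrame(
    {
        "average_monthly_hours": [150, 160, 190, 200, 210, 220, 250, 280],
        "satisfaction_level": [0.40, 0.35, 0.70, 0.80, 0.75, 0.60, 0.10, 0.15],
        "left": [1, 1, 0, 0, 0, 1, 1, 1],  # 1 = left, 0 = stayed
    }
)

# Left-closed bins: [0, 175), [175, 230), [230, inf)
df["hours_band"] = pd.cut(
    df["average_monthly_hours"],
    bins=[0, 175, 230, float("inf")],
    labels=["under 175", "175-230", "over 230"],
    right=False,
)

# Head count, churn rate, and mean satisfaction for each band.
band_stats = df.groupby("hours_band", observed=True).agg(
    employees=("left", "size"),
    churn_rate=("left", "mean"),
    mean_satisfaction=("satisfaction_level", "mean"),
)
print(band_stats)
```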

4. Satisfaction Level compared to Tenure

  • Employees who have worked at the company for over 7 years are unlikely to leave.
  • Employees who have been with the company for 1 - 4 years are more likely to churn.
  • Employees with 4 to 5 years of tenure have the lowest satisfaction scores. This could be worth further analysis.

image

5. Latest evaluation compared to Average Monthly Hours

  • Employees who work over 260 hours per month are more likely to churn, despite high evaluation ratings.
  • Evaluation ratings tend to be higher the longer an employee works each month.
  • Employees who work less than 150 hours per month have lower evaluations and a higher churn rate than employees working in the 150 - 220 hour range.

image


Improve - Model Creation

A random forest model comprising 100 decision trees was used to determine which features matter most in predicting whether an employee churns (leaves the company). The plot below shows that satisfaction level, tenure, number of projects, average monthly hours, and last evaluation were the most important features in determining whether an employee is likely to leave the company.

image
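As a rough illustration of this step, the sketch below fits a 100-tree random forest with scikit-learn and ranks features by importance. It is not the project's code. The data is synthetic, the column names and the churn rule are assumptions, and a deliberately irrelevant `noise` column is included to show how an uninformative feature ends up at the bottom of the ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic features; NOT the real HR data.
X = pd.DataFrame(
    {
        "satisfaction_level": rng.uniform(0, 1, n),
        "average_monthly_hours": rng.normal(200, 50, n),
        "noise": rng.normal(0, 1, n),  # unrelated to the target
    }
)

# Assumed churn rule for the fake data: low satisfaction or very long hours.
y = ((X["satisfaction_level"] < 0.3) | (X["average_monthly_hours"] > 260)).astype(int)

# Hold out 20% of the rows for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 100 trees, matching the forest size described above.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
test_accuracy = model.score(X_test, y_test)

print(importances)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Impurity-based importances sum to 1 across features, so each value can be read as a share of the model's total split quality.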

The model performed very well, with an accuracy score of 97% and a precision score of 98%. It was verified using a confusion matrix, which showed 10 false positives and 49 false negatives, compared to 2,491 true positives and 448 true negatives.

image
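For reference, accuracy, precision, and recall can all be derived from the four cells of a confusion matrix. The sketch below shows the arithmetic with a hypothetical helper, `classification_metrics`. The counts in the usage line are small made-up values chosen for easy checking, not the project's results, and which class counts as "positive" depends on how churn is encoded.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # Share of predicted positives that were actually positive.
        "precision": tp / (tp + fp),
        # Share of actual positives that the model found.
        "recall": tp / (tp + fn),
    }


# Illustrative counts only (20 predictions in total).
example = classification_metrics(tp=8, fp=2, fn=1, tn=9)
print(example)
```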


Control - Conclusion and Stakeholder Suggestions

Our model and feature importance results confirm that employees are currently being overworked, leading to higher churn rates for the company. To reduce churn, the company's stakeholders should consider the following:

  1. Cap the number of projects an individual can work on at a maximum of 5 per month.
  2. Employees who work over 250 hours per month have a high churn rate. Reward employees who take on more projects and hours. Most employees in this company worked over 167 hours per month.
  3. Consider promoting or rewarding employees in years 1 - 4, as they have a higher churn rate. Investigate why employees with 4 - 5 years of tenure are disproportionately dissatisfied.
  4. High evaluation scores are currently reserved almost exclusively for employees who work more than 200 hours per month. Evaluation scores should be based on quality of work rather than on rewarding 250 hours or more.
  5. If expectations regarding high workload and time off have not been clearly communicated, take steps to clarify them.
  6. If extended work hours are expected, ensure they are appropriately compensated. If not, avoid making them a requirement.