Salifort Motors Project Showcase

Data-Driven Insights for HR Improvement

Project Overview

Welcome to the Salifort Motors Project Showcase! My name is Tobin Zolkowski, and this project is a deep dive into understanding what makes employees stay or leave a company. Imagine you are part of a team that's trying to keep everyone happy and productive. This project helps us figure out how to do that using data.

This project includes exploratory data analysis (EDA), feature engineering, and predictive modeling to identify key factors influencing employee retention and turnover. The project is structured into several stages following the PACE framework: Plan, Analyze, Construct, and Execute.

Understanding the Business Scenario and Problem

Salifort Motors is facing a significant challenge with high employee turnover rates. High turnover increases the cost of recruiting and training new employees and disrupts team dynamics and productivity. The HR department sought a data-driven approach to identify key factors that influence employees' decisions to leave, with the goal of improving retention strategies and enhancing overall employee satisfaction.

Data Exploration and Visualization

Our journey starts with exploring the data. Think of it like getting to know the team better by looking at different aspects of their work life.

Data Preparation

The data preparation stage involved cleaning the dataset and encoding categorical variables using one-hot encoding. This process ensures that our data is ready for analysis and modeling. Categorical variables like 'Department' and 'salary' were transformed into numerical values that models can understand.
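As a rough illustration of what one-hot encoding does, the sketch below expands a categorical column into one 0/1 indicator column per category. It uses plain Python and a few made-up rows rather than the project's actual code; in pandas the same step is commonly done with `pandas.get_dummies`. The function name `one_hot_encode` and the sample values are assumptions for illustration only.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows; only the column names come from the text above.
employees = [
    {"Department": "sales", "salary": "low"},
    {"Department": "technical", "salary": "high"},
    {"Department": "sales", "salary": "medium"},
]

encoded = one_hot_encode(one_hot_encode(employees, "Department"), "salary")
print(encoded[0])
```

Each original column becomes as many indicator columns as it has distinct values, so every row ends up with exactly one 1 per encoded variable.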

Data Introduction

The dataset provided by the HR department contains 15,000 rows and 10 columns, including:

Initial Exploratory Data Analysis (EDA) and Data Cleaning

The first step was to understand the dataset by performing EDA. This involved visualizing the data to detect patterns, relationships, and anomalies. We also cleaned the data to handle any missing values and outliers.

Descriptive Statistics

Here are some key statistics about the dataset:

Reflections on Data Exploration

During the initial exploration, several interesting patterns emerged. For example, employees with lower satisfaction levels and those who were not promoted in the last 5 years showed higher turnover rates. Anomalies, such as very high or low average monthly hours, were also examined to understand their impact on turnover.

Boxplot of Tenure
Shows the distribution of employee tenure.

This boxplot shows us how long employees have been with the company. Most have been here between 3 and 4 years, but some have been with us much longer. The median tenure is around 3 years, indicating a relatively young workforce.
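A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers. The sketch below applies that rule to a small, made-up list of tenure values. The data, the function name `iqr_outliers`, and the choice of quartile method are illustrative assumptions, not values or code from the project.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical tenure values in years.
tenure_years = [2, 3, 3, 3, 4, 4, 5, 6, 10]
outliers = iqr_outliers(tenure_years)
print(outliers)
```

Plotting libraries may compute quartiles with a slightly different interpolation rule, so the exact cutoffs can differ a little from what a chart shows.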

Correlation Heatmap
Displays correlation between various factors.

This heatmap reveals how different factors like satisfaction, evaluation, and workload relate to each other. For example, we see that satisfaction has a moderate negative correlation (-0.35) with employees leaving the company, indicating that less satisfied employees are more likely to leave. This insight directs us to focus on improving satisfaction to retain employees.
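Each cell of a correlation heatmap is typically a Pearson correlation coefficient between two columns. The sketch below computes one such coefficient by hand, assuming a numeric satisfaction score and a 0/1 indicator for leaving. The six sample values are invented for illustration and do not reproduce the -0.35 reported above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical values: satisfaction score and whether the employee left (1) or stayed (0).
satisfaction = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9]
left = [1, 1, 0, 0, 1, 0]

r = pearson(satisfaction, left)
print(round(r, 2))
```

A negative value means the two columns tend to move in opposite directions. It describes an association and does not by itself show that low satisfaction causes departures.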

Employees Leaving by Department
Shows the distribution of employees leaving by department.

Here we see which departments have higher turnover rates. For instance, the sales and technical departments have the highest number of employees who left. Specifically, 30% of employees from the sales department and 25% from the technical department have left the company. This helps us know where to focus our efforts to improve retention.
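Because departments differ in size, the number of leavers and the share of leavers can rank departments differently. The sketch below computes both per group. The field name `left` (1 for an employee who left, 0 otherwise) and the sample records are assumptions for illustration, not figures from the dataset.

```python
from collections import defaultdict


def turnover_by_group(records, group_key, left_key="left"):
    """Return {group: (leavers, headcount, turnover_rate)}."""
    headcount = defaultdict(int)
    leavers = defaultdict(int)
    for record in records:
        headcount[record[group_key]] += 1
        leavers[record[group_key]] += record[left_key]
    return {
        group: (leavers[group], headcount[group], leavers[group] / headcount[group])
        for group in headcount
    }


# Hypothetical records: sales is larger, technical is smaller.
records = (
    [{"Department": "sales", "left": 1}] * 2
    + [{"Department": "sales", "left": 0}] * 4
    + [{"Department": "technical", "left": 1}]
    + [{"Department": "technical", "left": 0}]
)

stats = turnover_by_group(records, "Department")
for department, (n_left, n_total, rate) in stats.items():
    print(f"{department}: {n_left} of {n_total} left ({rate:.0%})")
```

In this toy data, sales has more leavers in absolute terms while technical has the higher rate, which is why it helps to report both.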

Employees Leaving by Salary Level
Illustrates the distribution of employees leaving by salary level.

Salary is a big factor in job satisfaction. This chart shows how departures vary by salary level. Employees in the low salary bracket leave most often (41%), indicating the need to review and possibly adjust our compensation strategies. In contrast, only 7% of employees with high salaries have left.

Distribution of Satisfaction Levels
Displays the distribution of employee satisfaction levels.

Understanding satisfaction levels helps us know how happy our team is. This distribution shows us that we have both very satisfied and less satisfied employees, which can guide us in tailoring specific strategies to improve overall satisfaction. The average satisfaction level is 0.629, with a notable portion of employees having satisfaction levels below 0.5.

Distribution of Last Evaluation Scores
Shows the distribution of employee last evaluation scores.

Performance evaluations tell us how employees are doing at their jobs. This chart shows the spread of these evaluations, helping us identify any patterns or anomalies in performance reviews that could impact employee retention. Most employees have evaluation scores between 0.55 and 0.75, with a mean score of 0.717.

Number of Projects Undertaken by Employees
Illustrates the number of projects employees are involved in.

Project involvement can indicate workload and stress. This chart shows the number of projects employees are handling. Employees with too many projects might feel overwhelmed, while those with too few might feel underutilized. The average number of projects per employee is 3.8.

Distribution of Average Monthly Hours
Displays the distribution of average monthly hours worked by employees.

How many hours employees work each month can impact their satisfaction and performance. This distribution gives us a clear picture, showing that most employees work between 150 and 250 hours per month. Outliers on either end could indicate overwork or under-engagement. The average is 200 hours per month.

Number of Years Employees Have Spent with the Company
Shows the number of years employees have spent with the company.

Tenure shows us how long employees stay with the company. Most employees have been with us for 3 years. This information helps us understand the typical employee lifecycle and identify when interventions might be needed to retain staff. The median tenure is 3 years, and employees with 5 years or more are more likely to leave.

Promotions in the Last 5 Years vs. Leaving
Illustrates the relationship between promotions and leaving.

Promotions can influence an employee's decision to stay or leave. This chart shows the impact of promotions over the past 5 years. Employees who have not been promoted are more likely to leave, suggesting that career advancement opportunities are crucial for retention. Only 2% of employees who received promotions left, compared to 26% of those who didn't receive promotions.

Further Analysis and Insights

After exploring the data, we dig deeper to uncover more insights.

PACE Framework

The Plan stage involved understanding the business scenario and problem. The HR department at Salifort Motors wanted to improve employee satisfaction levels and reduce turnover. They asked for data-driven suggestions to identify factors influencing employees' decisions to leave.

In the Analyze stage, I conducted an in-depth exploratory data analysis (EDA). This involved visualizing relationships between variables, detecting patterns, and identifying key factors influencing employee retention. For example, by looking at the correlation heatmap, we saw that lower satisfaction levels were associated with employees leaving.

During the Construct stage, I built predictive models to forecast employee turnover. Several models were considered, including Logistic Regression, Random Forest, and XGBoost. Logistic Regression is a simple model that helps us understand the impact of each feature on the outcome. Random Forest and XGBoost are more complex models that can capture intricate patterns in the data. These models were chosen for their balance between interpretability and predictive power.

Logistic Regression: This model is like a more advanced version of linear regression but used for classification. It's great for understanding the influence of different features but might not capture complex patterns.
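To make the link to linear regression concrete, the sketch below shows the core of a logistic regression prediction: a weighted sum of features passed through the sigmoid function to give a probability between 0 and 1. The feature names, weights, and bias are hypothetical values chosen for illustration. They are not coefficients fitted in this project.

```python
from math import exp


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_leave_probability(features, weights, bias):
    """Linear combination of the features, squashed into a probability."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical coefficients: a negative weight lowers the predicted probability of leaving.
weights = {"satisfaction_level": -4.0, "number_project": 0.3}
bias = 0.5

unhappy_and_busy = predict_leave_probability(
    {"satisfaction_level": 0.2, "number_project": 6}, weights, bias
)
happy_and_balanced = predict_leave_probability(
    {"satisfaction_level": 0.9, "number_project": 3}, weights, bias
)
print(round(unhappy_and_busy, 2), round(happy_and_balanced, 2))
```

The sign and size of each weight is what makes the model easy to interpret: each feature pushes the probability up or down by a fixed amount on the log-odds scale.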

Random Forest: Imagine a forest with many trees, where each tree makes a prediction. The forest combines these predictions for a more accurate result. It helps capture complex relationships in the data.
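The "many trees, one answer" idea can be sketched as a simple majority vote. The individual tree predictions below are made up for illustration, and the function name `majority_vote` is an assumption. Some implementations average the trees' predicted probabilities instead of counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one employee (1 = leaves, 0 = stays).
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```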

XGBoost: This is a powerful gradient boosting model that builds decision trees one after another, with each new tree trained to correct the errors of the trees before it. It's like a very diligent student who learns quickly from every error.

In the Execute stage, I interpreted the model results and made actionable recommendations. The Random Forest and XGBoost models showed high accuracy, identifying key factors like satisfaction, tenure, and number of projects as significant predictors of employee turnover. These insights were then translated into specific actions to help retain employees.

Model Evaluation

Accuracy: The proportion of correct predictions out of all predictions made.

Precision: The proportion of true positive predictions out of all positive predictions made. This tells us how many of the predicted leavers actually left.

Recall: The proportion of true positive predictions out of all actual positives. This tells us how many of the actual leavers were correctly identified.

F1-Score: The harmonic mean of precision and recall, providing a balance between the two metrics.
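The four definitions above can be computed directly from the counts of true and false positives and negatives. The sketch below does so for a small, invented set of labels, where 1 means the employee left. The labels and the function name are illustrative assumptions, not project results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # leavers correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # stayers wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # leavers missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers correctly cleared

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for ten employees.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy case recall (0.75) is higher than precision (0.60): most leavers are caught, but some of the flagged employees actually stayed.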

Here are the evaluation metrics for each model:

Logistic Regression

Random Forest

XGBoost

Hyperparameter Tuning

We applied GridSearchCV to fine-tune the Random Forest model. Hyperparameters are settings that can be adjusted to optimize the model's performance. GridSearchCV tests different combinations of these settings to find the best configuration. After tuning, the Random Forest model's performance improved, showing better accuracy and recall.
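The exhaustive search at the heart of a grid search can be sketched in a few lines: try every combination of settings and keep the one with the best score. In the sketch below, `toy_score` is a hypothetical stand-in for the cross-validated score that GridSearchCV would compute by training the model, and the grid values are assumptions rather than the settings used in this project.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical scores; a real search would train and validate a model here.
    depth_part = {3: 0.90, 5: 0.95, None: 0.93}[params["max_depth"]]
    trees_part = {100: 0.00, 300: 0.02, 500: 0.01}[params["n_estimators"]]
    return depth_part + trees_part


param_grid = {"n_estimators": [100, 300, 500], "max_depth": [3, 5, None]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The number of combinations grows multiplicatively with each added hyperparameter (here 3 x 3 = 9), which is why grids are usually kept small.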

Feature Importance Analysis

Identifying the most important features helps us understand what factors contribute most to employee churn. Both the Random Forest and XGBoost models indicated that satisfaction level, number of projects, and last evaluation scores were the most influential features.

Cross-Validation

To ensure our model's performance is reliable, we employed k-fold cross-validation. This technique splits the data into k subsets and trains the model k times, each time using a different subset as the test set. This helps in obtaining a more accurate estimate of the model's performance.
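The splitting step described above can be sketched as follows: the row indices are divided into k folds, and each fold takes one turn as the held-out set while the rest are used for training. This is a minimal version without shuffling or stratification, and the function name `k_fold_indices` is an assumption for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row is held out exactly once, so averaging the k scores uses all of the data for evaluation without ever testing on rows the model was trained on.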

Underfitting and Overfitting

In predictive modeling, it's crucial to strike a balance between underfitting and overfitting:

To determine if a model is underfitting or overfitting, we can look at the performance metrics on both training and validation datasets:

In this project, we used techniques like cross-validation and hyperparameter tuning to mitigate underfitting and overfitting:

Distribution of Salary Levels
Shows the distribution of salary levels among employees.

This chart breaks down the distribution of salary levels among our employees. We see that a large share of employees fall into the 'low' salary bracket, with fewer in the medium and high brackets. The proportion of employees who leave is significantly higher in the low salary bracket (41%) than in the medium (17%) and high (7%) brackets.

Work Accidents vs. Leaving
Illustrates the relationship between work accidents and leaving.

We also look at the impact of work accidents on employee turnover. Safety is crucial for job satisfaction and retention. Most employees have not experienced work accidents. A smaller proportion of those who experienced work accidents (14%) have left the company compared to those who haven't (23%).

Top 10 Important Features - Random Forest
Displays the top 10 important features in the Random Forest model.

Using a Random Forest model, we identify the top factors that predict employee turnover. Satisfaction, tenure, and projects are key indicators. This tells us that how satisfied employees are, how long they've been with us, and how many projects they handle are crucial in understanding why they might leave. For instance, satisfaction level has a feature importance score of 0.34, making it the most significant factor.

Top 10 Important Features - XGBoost
Displays the top 10 important features in the XGBoost model.

The XGBoost model provides similar insights, highlighting the importance of projects and satisfaction. This reinforces our understanding of the factors influencing employee turnover. Satisfaction level also has the highest importance in the XGBoost model, followed by number of projects and last evaluation scores.

ROC Curve
Shows the ROC curve for Random Forest and XGBoost models.

The ROC curve plots the true positive rate against the false positive rate across classification thresholds, showing how well our models separate employees who leave from those who stay. Both models perform exceptionally well: the ROC AUC (Area Under the Curve) for both is approximately 0.98, indicating strong predictive power.
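One way to read an AUC value is as the probability that the model gives a randomly chosen leaver a higher score than a randomly chosen stayer. The sketch below computes AUC from that definition on four invented predictions. The labels, scores, and function name are illustrative assumptions and are unrelated to the 0.98 reported above.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = left) and predicted probabilities of leaving.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. This pairwise version is quadratic in the number of rows, so it is meant for illustration rather than large datasets.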

Model Comparison and Best Model Selection

Among the three models, XGBoost showed the highest accuracy (97.96%) and F1-Score (94%). It balances precision and recall well, making it the best choice for predicting employee churn. While Random Forest also performed well, XGBoost's slight edge in predictive power and its ability to learn from previous errors make it the most suitable model for this task.

Conclusion and Recommendations

Our analysis reveals that employee satisfaction, workload, tenure, and evaluation scores are critical in predicting turnover. To improve retention, Salifort Motors should focus on enhancing employee satisfaction, balancing workloads, and recognizing and promoting employees in a timely manner.

Recommendations:

Next Steps:

About the Project

This project represents a thorough analysis of employee retention factors at Salifort Motors. It combines data science techniques with practical business insights to provide actionable recommendations for the HR department. For those interested, the complete code and analysis can be found in the accompanying Python code.

Note: This project was part of the Google Advanced Data Analytics Capstone course on Coursera. The dataset used in this project is credited to the course.
