
Building a Burnout Detector: Analyzing Your Git Commit Patterns with Python & Scikit-learn

A developer-centric guide to mental wellness. Learn how to use the GitHub API, Pandas, and Scikit-learn to analyze your commit history and build a machine learning model to identify patterns correlated with potential burnout.

2025-12-11
12 min read

As developers, our lives are written in Git history. Each commit tells a story of a problem solved, a feature built, or a bug fixed. But what if those same commits could tell a different story—one about our well-being? In the tech industry, burnout is a pervasive issue. The "joyless commit," where you're pushing code but feeling no sense of accomplishment, is a real phenomenon.

This tutorial takes a unique, developer-centric approach to mental wellness. We'll build a Python script that uses your GitHub commit history to identify patterns that might indicate a risk of burnout. By analyzing data like commit frequency, commit message length, and the time of day you're pushing code, we can create a model that learns your "normal" and flags potential deviations.

Prerequisites:

  • Python 3.7+ installed.
  • Familiarity with Python, including installing packages with pip.
  • A GitHub account with some commit history.
  • Basic understanding of Pandas and machine learning concepts is helpful but not required.

Why this matters to developers:

This isn't about creating a perfect predictor of burnout. It's about using the tools we use every day to foster self-awareness. By turning a lens on our own data, we can start to ask important questions about our work-life balance and mental health.

Understanding the Problem

Developer burnout is often a gradual process. It can manifest as a change in work patterns long before it's consciously recognized. Some potential indicators in Git data could be:

  • Late-night commits: A sudden increase in commits outside of normal working hours.
  • Erratic commit frequency: Swinging from periods of intense activity to radio silence.
  • Short, uninformative commit messages: A potential sign of disengagement or rushing.
  • A decrease in commit volume: While not always negative, a significant drop could indicate a lack of motivation.

We'll treat this as a classification problem. We'll label historical data (e.g., "normal" vs. "burnout" periods, based on our own recollection) and train a machine learning model to recognize the patterns associated with each.

Setting Up Your Environment

First, let's set up our environment. You'll need a GitHub personal access token to use their API.

  1. Generate a GitHub Personal Access Token:

    • Go to your GitHub Settings > Developer settings > Personal access tokens.
    • Click "Generate new token".
    • Give it a descriptive name (e.g., "Burnout Detector").
    • Under "Scopes," select the repo scope so the script can read your repositories (for public repositories only, the narrower public_repo scope is enough).
    • Click "Generate token" and copy it somewhere safe. You won't be able to see it again.
  2. Install Python libraries:

    code
    pip install requests pandas scikit-learn matplotlib seaborn
    

Step 1: Fetching Your Commit History

What we're doing

We'll use the requests library to connect to the GitHub API and pull the commit history for a specific repository.

Implementation

Create a new Python file, burnout_detector.py.

code
# burnout_detector.py
import requests
import pandas as pd
import os

# --- Configuration ---
GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN') # Best practice to use environment variables
REPO_OWNER = 'your-github-username'
REPO_NAME = 'your-repo-name'
# ---------------------

def fetch_all_commits(owner, repo, token):
    """Fetches all commits for a given repository."""
    commits = []
    page = 1
    per_page = 100
    headers = {'Authorization': f'token {token}'}
    
    while True:
        url = f'https://api.github.com/repos/{owner}/{repo}/commits?page={page}&per_page={per_page}'
        response = requests.get(url, headers=headers)
        
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.json())
            break
            
        data = response.json()
        if not data:
            break
            
        commits.extend(data)
        page += 1
        
    return commits

# --- Main execution ---
if __name__ == '__main__':
    if not GITHUB_TOKEN:
        raise ValueError("Please set the GITHUB_TOKEN environment variable.")
        
    print(f"Fetching commits for {REPO_OWNER}/{REPO_NAME}...")
    all_commits = fetch_all_commits(REPO_OWNER, REPO_NAME, GITHUB_TOKEN)
    print(f"Found {len(all_commits)} commits.")

    # Convert to DataFrame for easier manipulation
    df = pd.json_normalize(all_commits)
    print(df.head())

Before you run:

  • Replace 'your-github-username' and 'your-repo-name' with your details.
  • Set the GITHUB_TOKEN as an environment variable for security:
    • Mac/Linux: export GITHUB_TOKEN='your_token_here'
    • Windows (Command Prompt): set GITHUB_TOKEN=your_token_here (in PowerShell: $env:GITHUB_TOKEN='your_token_here')

Run the script: python burnout_detector.py

How it works

We make paginated GET requests to the GitHub Commits API endpoint. The while True loop continues to fetch pages of commits until the API returns an empty list. We then use pandas.json_normalize to flatten the nested JSON response into a DataFrame, which is a table-like structure perfect for data analysis.
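
If your repository has a long history, you may eventually run into GitHub's API rate limit. An optional hardening step (not part of the script above, and only needed for large histories) is a small helper that reads the rate-limit headers GitHub sends with every response and sleeps until the limit resets:

code
# Optional robustness: respect GitHub's rate limit by reading its response headers.
import time

def wait_if_rate_limited(response):
    """Sleeps until the rate limit resets when no requests remain."""
    remaining = int(response.headers.get('X-RateLimit-Remaining', 1))
    if remaining == 0:
        reset_at = int(response.headers.get('X-RateLimit-Reset', time.time() + 60))
        pause = max(reset_at - time.time(), 0) + 1
        print(f"Rate limit reached; sleeping {pause:.0f}s...")
        time.sleep(pause)

# Inside fetch_all_commits(), call wait_if_rate_limited(response) after each requests.get().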

Step 2: Feature Engineering with Pandas

What we're doing

"Feature engineering" is the process of creating meaningful input variables (features) for our machine learning model from the raw data. We'll use Pandas to create features that might correlate with burnout.

Implementation

Add the following functions to your script:

code
# ... (keep the previous code)

def create_features(df):
    """Engineers features from the raw commit data."""
    # Select and rename relevant columns
    df = df[['sha', 'commit.author.date', 'commit.message']].rename(columns={
        'commit.author.date': 'timestamp',
        'commit.message': 'message'
    })

    # Convert timestamps to datetime objects, set as index, and sort chronologically.
    # The GitHub API returns commits newest-first, and the date-range slicing in Step 3
    # requires a sorted index. Note: GitHub timestamps are UTC; call .tz_convert()
    # with your local timezone (e.g. 'Europe/London') if you want local hours.
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.set_index('timestamp').sort_index()

    # Feature 1: Hour of the day
    df['hour'] = df.index.hour

    # Feature 2: Day of the week (0=Monday, 6=Sunday)
    df['day_of_week'] = df.index.dayofweek

    # Feature 3: Commit message length
    df['message_length'] = df['message'].apply(len)

    # Feature 4: Commits per day (daily counts joined back onto each commit)
    commits_per_day = df.resample('D').size().rename('commits_on_day')
    df['commit_date'] = df.index.normalize()
    df = df.join(commits_per_day, on='commit_date').drop(columns='commit_date')

    return df

# --- Main execution ---
if __name__ == '__main__':
    # ... (fetching code remains the same)
    
    df_raw = pd.json_normalize(all_commits)
    
    if df_raw.empty:
        print("No commits found.")
    else:
        df_features = create_features(df_raw.copy())
        print("\n--- Features ---")
        print(df_features.head())

How it works

  1. Timestamp Conversion: We convert the timestamp column to proper datetime values, set it as the index, and sort it chronologically (the GitHub API returns commits newest-first). This lets us perform time-based operations and the date-range slicing we need in Step 3.
  2. Time-based Features: We extract the hour and day_of_week directly from the datetime index. This will help us spot patterns like consistent late-night or weekend work.
  3. Message Length: We calculate the length of each commit message.
  4. Resampling: resample('D').size() is a powerful Pandas time-series operation. It groups the commits by calendar day (D) and counts how many fall in each group, giving us the daily commit frequency. We then join those daily counts back onto each commit via its normalized date, as the toy example below illustrates.
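
To make the resampling step concrete, here is a tiny self-contained toy example (with made-up timestamps, not your real history) showing what resample('D').size() produces and how the daily counts attach back to individual commits:

code
# Toy illustration of the daily-count feature (hypothetical timestamps)
import pandas as pd

toy = pd.DataFrame(
    {'message': ['fix bug', 'add tests', 'wip', 'refactor']},
    index=pd.to_datetime([
        '2024-01-01 09:15', '2024-01-01 22:40',
        '2024-01-02 02:05', '2024-01-04 14:30',
    ]),
)

daily = toy.resample('D').size().rename('commits_on_day')
print(daily)  # 2024-01-01 -> 2, 2024-01-02 -> 1, 2024-01-03 -> 0, 2024-01-04 -> 1

# Join each commit back to its day's total, as create_features() does
toy['commit_date'] = toy.index.normalize()
print(toy.join(daily, on='commit_date'))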

Step 3: Labeling the Data (The Human Element)

What we're doing

This is the most subjective but crucial part. We need to create a "target variable"—a label that tells our model which periods represent "burnout" and which are "normal." Since we don't have an external measure of burnout, we'll have to create our own labels based on our memory of past projects.

Implementation

This will be unique to you. The goal is to create a new column, is_burnout, with 1 for periods you remember being stressful or leading to burnout, and 0 for normal periods.

code
# ... (keep previous code)

def label_data(df):
    """Labels data based on known periods of burnout."""
    df['is_burnout'] = 0 # Default to 'normal'
    
    # --- !!! THIS IS THE PART YOU MUST CUSTOMIZE !!! ---
    # Example: Labeling a specific date range as a 'burnout' period
    burnout_start = '2023-10-01'
    burnout_end = '2023-10-31'
    df.loc[burnout_start:burnout_end, 'is_burnout'] = 1

    # Example: Labeling another stressful period
    # burnout_start_2 = '2024-03-15'
    # burnout_end_2 = '2024-04-05'
    # df.loc[burnout_start_2:burnout_end_2, 'is_burnout'] = 1
    
    return df

# --- Main execution ---
if __name__ == '__main__':
    # ...
    else:
        df_features = create_features(df_raw.copy())
        df_labeled = label_data(df_features.copy())
        
        print("\n--- Labeled Data ---")
        # Check the distribution of your labels
        print(df_labeled['is_burnout'].value_counts())
        
        # Display some of the labeled data
        print(df_labeled[df_labeled['is_burnout'] == 1].head())

How it works

Think back to specific projects or deadlines that were particularly draining. Use date-range slicing with .loc to assign a 1 to the is_burnout column for those periods. Be honest! The quality of your model depends entirely on the quality of your labels.

Common Pitfall: Try to have a reasonable balance between 0s and 1s. If you only label a few days as burnout, the model will have a hard time learning.
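
If you have several stressful stretches to label, one convenient pattern (a sketch; the date ranges are hypothetical) is to keep them in a list of (start, end) tuples and loop over it inside label_data:

code
# Variant of label_data() for several burnout windows.
# Replace these hypothetical date ranges with your own.
BURNOUT_PERIODS = [
    ('2023-10-01', '2023-10-31'),
    ('2024-03-15', '2024-04-05'),
]

def label_data(df):
    """Labels commits that fall inside any known burnout window."""
    df['is_burnout'] = 0
    for start, end in BURNOUT_PERIODS:
        df.loc[start:end, 'is_burnout'] = 1
    return df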

Step 4: Training a Classification Model

What we're doing

Now for the machine learning! We'll use scikit-learn to train a RandomForestClassifier. This model is an "ensemble" of decision trees, making it robust and good at finding complex patterns.

Implementation

code
# ... (add imports at the top)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


# ... (add new function)
def train_model(df):
    """Trains a model to predict burnout based on commit features."""
    # Define features (X) and target (y)
    features = ['hour', 'day_of_week', 'message_length', 'commits_on_day']
    X = df[features]
    y = df['is_burnout']

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    # Initialize and train the Random Forest Classifier
    model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
    model.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    print("\n--- Model Evaluation ---")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    # Visualize feature importances
    feature_imp = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
    
    plt.figure(figsize=(10, 6))
    sns.barplot(x=feature_imp, y=feature_imp.index)
    plt.xlabel('Feature Importance Score')
    plt.ylabel('Features')
    plt.title("Visualizing Important Features")
    plt.show()

    return model

# --- Main execution ---
if __name__ == '__main__':
    # ...
    else:
        df_features = create_features(df_raw.copy())
        df_labeled = label_data(df_features.copy())
        
        if df_labeled['is_burnout'].sum() > 0:
            train_model(df_labeled)
        else:
            print("\nSkipping model training: No data labeled as 'burnout'.")

How it works

  1. Splitting Data: We divide our labeled data into a training set (70%) and a testing set (30%). The model learns from the training set and is then evaluated on the unseen testing set to gauge its real-world performance.
  2. Training: model.fit(X_train, y_train) is where the magic happens. The Random Forest algorithm analyzes the features in the training data to find the patterns that best distinguish between "normal" and "burnout" commits. class_weight='balanced' tells the model to pay more attention to the minority class (usually the "burnout" class), which is important for imbalanced datasets.
  3. Evaluation: We use a classification report to see how well the model performed.
  4. Feature Importance: The plot shows us which features the model found most predictive. This is incredibly insightful! You might find that hour and commits_on_day are the strongest indicators in your personal data.
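
Once the model is trained, a natural follow-up (my own addition, not part of the script above) is to score your most recent commits and see what fraction of them look burnout-like. A minimal sketch, assuming the feature columns from Step 2:

code
# Follow-up idea: apply the trained classifier to your most recent commits.
import pandas as pd

def check_recent_commits(model, df, days=14):
    """Reports the share of commits in the last `days` days flagged as burnout-like."""
    features = ['hour', 'day_of_week', 'message_length', 'commits_on_day']
    cutoff = df.index.max() - pd.Timedelta(days=days)
    recent = df[df.index >= cutoff]
    if recent.empty:
        print("No commits in the recent window.")
        return
    share = model.predict(recent[features]).mean()
    print(f"{share:.0%} of commits in the last {days} days look burnout-like.")

# Example usage after training:
# model = train_model(df_labeled)
# check_recent_commits(model, df_labeled)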

Alternative Approaches

  • Logistic Regression: A simpler, more interpretable model. It's a good baseline to compare against. You can swap RandomForestClassifier for LogisticRegression from sklearn.linear_model.
  • Time-Series Anomaly Detection: Instead of classification, you could treat this as an anomaly detection problem, where you model your "normal" behavior and flag any significant deviations (a minimal sketch follows this list).
  • Sentiment Analysis: Analyze the text of commit messages for changes in sentiment over time. Positive sentiment in commit messages can correlate with a healthier work environment.
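
As an illustration of the anomaly-detection route from the list above, here is a minimal sketch (my own example, not part of the tutorial's script) using scikit-learn's IsolationForest. It learns your "normal" pattern from the commits you labeled 0 and flags commits that deviate from it:

code
# Unsupervised alternative: flag commits that deviate from your typical pattern.
from sklearn.ensemble import IsolationForest

def detect_anomalies(df):
    """Flags commits whose features look unusual compared to 'normal' periods."""
    features = ['hour', 'day_of_week', 'message_length', 'commits_on_day']
    normal = df[df['is_burnout'] == 0]  # fit only on periods you consider normal

    iso = IsolationForest(contamination=0.05, random_state=42)
    iso.fit(normal[features])

    # predict() returns 1 for inliers and -1 for outliers
    df['is_anomaly'] = (iso.predict(df[features]) == -1).astype(int)
    print(df['is_anomaly'].value_counts())
    return df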

Conclusion

We've successfully built a tool that leverages our own digital footprint to promote self-awareness and mental well-being. You've learned how to:

  1. Fetch data from the GitHub API using Python.
  2. Perform feature engineering on time-series data with Pandas.
  3. Label data for a supervised machine learning task.
  4. Train and evaluate a classification model with Scikit-learn.

This project is a starting point. The real power comes from adapting it to your own life and work patterns. Use the feature importance chart to understand your personal triggers and habits. Remember, the goal isn't to create a perfect alarm system but to build a data-informed mirror that helps you stay mindful of your mental health.

Next Steps:

  • Analyze commits across all your repositories.
  • Deploy this script to run automatically (e.g., once a week) and send you a summary.
  • Experiment with different features, like the time between commits or the number of files changed (a short sketch of the former follows).
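
For instance, the "time between commits" idea can be added as one more engineered feature inside create_features() (a sketch, assuming the sorted datetime index from Step 2):

code
# Feature idea: hours elapsed since the previous commit.
# Long gaps and frantic rapid-fire bursts can both be informative.
df['hours_since_prev_commit'] = (
    df.index.to_series().diff().dt.total_seconds() / 3600
)
# The first commit has no predecessor, so its value is NaN; fill or drop as needed.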


Article Tags

python, data science, machine learning, mental health

WellAlly's core development team, comprised of healthcare professionals, software engineers, and UX designers committed to revolutionizing digital health management.

Expertise

Healthcare Technology · Software Development · User Experience · AI & Machine Learning
