Discovering User Chronotypes with Python and Unsupervised Learning

A deep dive into using unsupervised clustering (K-Means & DBSCAN) in Python to automatically segment users into chronotypes like 'night owls' and 'early birds' from raw sleep data.

2025-12-14

Key Takeaways

  • Chronotype discovery uses unsupervised clustering on sleep midpoint and duration features
  • K-Means (k=3) effectively segments users into early birds, night owls, and standard sleepers
  • Setup takes ~45 minutes using Python, Pandas, and Scikit-learn
  • DBSCAN adds value by detecting outliers and irregular sleep patterns
  • Feature engineering (sleep midpoint) is more important than algorithm choice

TL;DR: Discover user chronotypes using Python and unsupervised clustering in ~45 minutes. Extract sleep midpoint and duration features, then apply K-Means (k=3) to segment users into early birds, night owls, and standard sleepers. DBSCAN adds outlier detection for irregular patterns.

Are you a night owl who thrives after midnight, or an early bird who's most productive at dawn? These natural tendencies are known as chronotypes, and they significantly impact our lives, from work performance to health. What if we could automatically discover these patterns from user sleep data?

In this deep dive, we'll walk through a complete data science project to uncover user chronotypes from a dataset of sleep timings. We'll use unsupervised clustering algorithms, specifically K-Means and DBSCAN, to segment users into meaningful groups like "early birds," "night owls," and "standard sleepers."

Prerequisites: You should have a basic understanding of Python and familiarity with data manipulation using Pandas. Some prior knowledge of machine learning concepts will be helpful but is not strictly required. You'll need Python 3, along with the Pandas, Scikit-learn, Matplotlib, and Seaborn libraries installed (NumPy, which the feature engineering step imports, is installed as a Pandas dependency).

Why this matters to developers: Understanding user behavior is crucial for building personalized experiences. For health and wellness apps, segmenting users by chronotype can enable tailored recommendations for optimal sleep schedules, workout times, and even productivity hacks.

Note: This example uses synthetic/simulated data for demonstration. In production, ensure all sleep data is anonymized and handled in compliance with HIPAA/GDPR.

Understanding the Problem

Manually identifying chronotypes through questionnaires can be subjective and time-consuming. Unsupervised machine learning offers a data-driven approach to identify these patterns from passively collected sleep data. The main challenge lies in transforming raw sleep timings into meaningful features that clustering algorithms can effectively use to group similar users.

Wearable technology and health apps now provide vast amounts of sleep data. Researchers and companies are increasingly using machine learning to analyze this data for insights into sleep quality and patterns. Instead of relying on self-reported information, we'll use objective sleep timing data to uncover natural groupings of users.

Chronotype Clustering Architecture

The following diagram shows our unsupervised learning pipeline for discovering user chronotypes:

graph TB
    A[Raw Sleep Data] -->|Load CSV| B[Pandas DataFrame]
    B -->|Extract Features| C[Feature Engineering]
    C -->|Midpoint + Duration| D[Feature Matrix X]
    D -->|StandardScaler| E[Normalized Features]
    E -->|K-Means k=3| F[Cluster Assignments]
    E -->|DBSCAN eps=0.5| G[Density-Based Clusters]
    F -->|Interpret Mean Values| H[Early Bird / Night Owl / Standard]
    G -->|Detect Outliers| I[Noise Points -1]
    style C fill:#74c0fc,stroke:#333
    style F fill:#ffd43b,stroke:#333
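
The scaling and K-Means stages of this diagram can also be chained into a single Scikit-learn pipeline. The following is a minimal sketch, not the code used later in this article: the feature values are made up for illustration, and the names X_demo and chronotype_pipeline are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: [weekday midpoint (h), weekend midpoint (h)]
X_demo = np.array([
    [2.5, 3.0], [2.7, 3.2], [2.4, 2.9],   # earlier sleepers
    [4.5, 5.5], [4.6, 5.4], [4.4, 5.6],   # in-between sleepers
    [6.5, 8.0], [6.7, 8.1], [6.4, 7.9],   # later sleepers
])

# StandardScaler runs first, then K-Means clusters the scaled features
chronotype_pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=42, n_init=10),
)
demo_labels = chronotype_pipeline.fit_predict(X_demo)
print(demo_labels)
```

Bundling the scaler with the model means new data passed to the pipeline is scaled with the same statistics that were learned during fitting.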

Prerequisites

Required tools/libraries:

  • Python 3.x
  • Pandas
  • Scikit-learn
  • Matplotlib
  • Seaborn

Version compatibility notes: The code in this tutorial was written using Python 3.8, Pandas 1.3, Scikit-learn 1.0, Matplotlib 3.5, and Seaborn 0.11. Minor version differences should not cause issues.

To install the necessary libraries, run the following command in your terminal:

```bash
pip install pandas scikit-learn matplotlib seaborn
```

Load and Preprocess Sleep Timing Data

What we're doing

First, we need to load our dataset and prepare it for feature engineering. We'll use a synthetic "Student Sleep Patterns" dataset from Kaggle, which suits this analysis because it includes weekday and weekend sleep and wake times. In this step we load the file, inspect its structure, and select the sleep timing columns. Because the times are already stored as float hours, no format conversion is needed here.

Implementation

```python
# src/data_preprocessing.py
import sys

import pandas as pd

# Load the dataset
try:
    df = pd.read_csv('student_sleep_patterns.csv')
except FileNotFoundError:
    print("Please download the dataset from: https://www.kaggle.com/datasets/ssooni/student-sleep-patterns-dataset")
    # Exit with a non-zero status so callers can tell the script failed
    sys.exit(1)

# Display the first few rows and basic info
print("Original DataFrame head:")
print(df.head())
print("\nDataFrame Info:")
df.info()

# For this analysis, we are interested in the sleep timing columns
sleep_timing_df = df[['Student_ID', 'Weekday_Sleep_Start', 'Weekday_Sleep_End', 'Weekend_Sleep_Start', 'Weekend_Sleep_End']].copy()

print("\nSleep Timing DataFrame head:")
print(sleep_timing_df.head())
```

How it works

We load the CSV file into a Pandas DataFrame. The .head() function shows us the first few rows to understand the structure, and .info() gives us an overview of the data types and non-null values. We then create a new DataFrame sleep_timing_df containing only the columns relevant to our chronotype analysis.
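
If the CSV is not at hand, a stand-in DataFrame with the same column names can be generated to exercise the rest of the pipeline. This is a rough sketch only: the function name make_synthetic_sleep_data, the three bedtime groups, and every distribution parameter are assumptions chosen for illustration, and the result does not reproduce the Kaggle dataset.

```python
import numpy as np
import pandas as pd

def make_synthetic_sleep_data(n_students=300, seed=42):
    """Build a hypothetical sleep-timing table (float hours since midnight)."""
    rng = np.random.default_rng(seed)
    # Assumed typical bedtimes for three made-up groups: 22:00, 23:30, 01:30
    typical_bedtime = rng.choice([22.0, 23.5, 1.5], size=n_students)
    weekday_start = typical_bedtime + rng.normal(0.0, 0.5, n_students)
    # Assume people go to bed about an hour later on weekends
    weekend_start = weekday_start + rng.normal(1.0, 0.5, n_students)
    weekday_end = weekday_start + rng.normal(7.0, 0.7, n_students)
    weekend_end = weekend_start + rng.normal(8.0, 0.7, n_students)

    def wrap(hours):
        # Round first, then wrap onto the 24-hour clock
        return np.round(hours, 2) % 24

    return pd.DataFrame({
        'Student_ID': np.arange(1, n_students + 1),
        'Weekday_Sleep_Start': wrap(weekday_start),
        'Weekday_Sleep_End': wrap(weekday_end),
        'Weekend_Sleep_Start': wrap(weekend_start),
        'Weekend_Sleep_End': wrap(weekend_end),
    })

synthetic_df = make_synthetic_sleep_data()
print(synthetic_df.head())
```

Assigning the result to sleep_timing_df would let the later steps run unchanged, although any clusters found would only reflect the assumptions baked into the generator.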

Common pitfalls

  • File Not Found: Make sure the CSV file is in the same directory as your Python script or provide the correct file path.
  • Inconsistent Time Formats: In real-world data, time formats can be messy (e.g., '10:00 PM', '22:00', '10pm'). Our dataset uses a 24-hour float format, which simplifies things. For other formats, you might need to use pd.to_datetime.
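
For messy string formats, one possible approach is to try a short list of explicit formats and convert each value to float hours. The sketch below uses the standard library's datetime.strptime rather than pd.to_datetime so that every accepted format is spelled out. The helper name to_float_hours and the format list are assumptions, and matching of AM/PM text depends on the process locale (it works in the default C/English locale).

```python
from datetime import datetime

# Formats to try, in order: '22:00', '10:00 PM', '10PM', '10 PM'
TIME_FORMATS = ("%H:%M", "%I:%M %p", "%I%p", "%I %p")

def to_float_hours(time_str):
    """Convert a clock-time string to float hours since midnight."""
    cleaned = time_str.strip().upper()
    for fmt in TIME_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
        return parsed.hour + parsed.minute / 60
    raise ValueError(f"Unrecognized time format: {time_str!r}")

for raw in ["10:00 PM", "22:00", "10pm", "7:30 am"]:
    print(raw, "->", to_float_hours(raw))
```

On a Pandas column of strings, the same helper could be applied with Series.map(to_float_hours).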

Engineer Sleep Features for Clustering

What we're doing

To cluster users by chronotype, we need to create features that represent their sleep behavior. Two of the most effective features for this are:

  1. Midpoint of Sleep: The halfway point between bedtime and wake-up time. This is a strong indicator of a person's natural sleep timing.
  2. Sleep Duration: The total time spent asleep.

We will calculate these for both weekdays and weekends to capture any differences in behavior.

Input: DataFrame with columns Weekday_Sleep_Start, Weekday_Sleep_End, Weekend_Sleep_Start, Weekend_Sleep_End
Output: DataFrame with engineered features Weekday_Midpoint, Weekday_Duration, Weekend_Midpoint, Weekend_Duration

Implementation

```python
# src/feature_engineering.py
# Continues from the preprocessing step: expects sleep_timing_df to exist
import numpy as np

def calculate_sleep_features(df):
    """Calculates midpoint of sleep and sleep duration."""
    features_df = df.copy()

    for period in ['Weekday', 'Weekend']:
        start_col = f'{period}_Sleep_Start'
        end_col = f'{period}_Sleep_End'

        # Times are already stored as float hours since midnight
        start_time = features_df[start_col]
        end_time = features_df[end_col]

        # Handle overnight sleep: a wake time earlier than the bedtime
        # means the person woke up on the following day
        duration = np.where(end_time > start_time,
                            end_time - start_time,
                            (24 - start_time) + end_time)

        # Wrap the midpoint back onto the 24-hour clock
        midpoint = (start_time + duration / 2) % 24

        features_df[f'{period}_Duration'] = duration
        features_df[f'{period}_Midpoint'] = midpoint

    return features_df

features_df = calculate_sleep_features(sleep_timing_df)

print("\nDataFrame with Engineered Features:")
print(features_df.head())
```

How it works

  • For both "Weekday" and "Weekend", we calculate the sleep duration. We have to account for sleeping past midnight. If the end_time is less than the start_time, it means the person woke up the next day. In this case, the duration is (24 - start_time) + end_time.
  • The midpoint of sleep is calculated by adding half the duration to the start time. We use the modulo operator (% 24) to ensure the midpoint wraps around a 24-hour clock.
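
To make the arithmetic above concrete, here is a small worked example that applies the same duration and midpoint formulas to three hand-picked nights. The helper name sleep_duration_and_midpoint and the input values are illustrative only.

```python
import numpy as np

def sleep_duration_and_midpoint(start, end):
    """Same formulas as calculate_sleep_features, on plain arrays."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    # Overnight sleep: add the hours before midnight to the hours after it
    duration = np.where(end > start, end - start, (24 - start) + end)
    midpoint = (start + duration / 2) % 24
    return duration, midpoint

# Bedtimes 23:00, 01:00, 22:30 and wake times 07:00, 09:00, 06:00
duration, midpoint = sleep_duration_and_midpoint([23.0, 1.0, 22.5], [7.0, 9.0, 6.0])
print(duration.tolist())  # [8.0, 8.0, 7.5]
print(midpoint.tolist())  # [3.0, 5.0, 2.25]
```

The first two nights have the same duration but midpoints two hours apart, which is the kind of difference the clustering step relies on.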

Common pitfalls

  • Ignoring Overnight Spans: A simple end_time - start_time will give a negative duration for overnight sleep. The np.where function is a clean way to handle this conditional logic.
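
A related wrinkle comes from the % 24 wrap: midpoints of 23.5 and 0.5 are one hour apart on the clock but 23 apart numerically, so a Euclidean method such as K-Means would treat them as distant. One commonly used workaround, which is not part of this article's pipeline, is to encode the hour as a point on a circle. The helper name encode_hour_circular is hypothetical.

```python
import numpy as np

def encode_hour_circular(hours):
    """Map hours on a 24-hour clock to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(hours, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

encoded = encode_hour_circular([23.5, 0.5, 12.0])

# Distance between 23:30 and 00:30 (one hour apart on the clock)
near = np.linalg.norm(encoded[0] - encoded[1])
# Distance between 23:30 and 12:00 (11.5 hours apart on the clock)
far = np.linalg.norm(encoded[0] - encoded[2])
print(round(near, 3), round(far, 3))
```

If most users' midpoints fall well inside the early-morning hours, the raw midpoint may be adequate. The encoding matters mainly when values straddle midnight.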

Cluster Users with K-Means Algorithm

What we're doing

K-Means is a popular clustering algorithm that groups data points into a pre-defined number of clusters (k). It's a good starting point for our chronotype discovery. We will use the "elbow method" to determine the optimal number of clusters and then fit the K-Means model to our engineered features.

Input: Feature matrix X with columns [Weekday_Midpoint, Weekday_Duration, Weekend_Midpoint, Weekend_Duration]
Output: Cluster assignments (0, 1, 2) for each user representing their chronotype

Implementation

```python
# src/kmeans_clustering.py
# Continues from the feature engineering step: expects features_df to exist
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Select features for clustering
X = features_df[['Weekday_Midpoint', 'Weekday_Duration', 'Weekend_Midpoint', 'Weekend_Duration']]

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow method to find the optimal k
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()

# Based on the elbow plot, let's choose k=3
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
features_df['KMeans_Cluster'] = kmeans.fit_predict(X_scaled)

print("\nDataFrame with K-Means Clusters:")
print(features_df.head())

# Visualize the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(data=features_df, x='Weekday_Midpoint', y='Weekend_Midpoint', hue='KMeans_Cluster', palette='viridis', s=100)
plt.title('Chronotype Clusters (K-Means)')
plt.xlabel('Weekday Sleep Midpoint')
plt.ylabel('Weekend Sleep Midpoint')
plt.legend(title='Cluster')
plt.show()
```

How it works

  1. Feature Scaling: We use StandardScaler to normalize our features. This is important for distance-based algorithms like K-Means, as it prevents features with larger scales from dominating the clustering process.
  2. Elbow Method: We calculate the inertia (sum of squared distances of samples to their closest cluster center) for different values of k. The "elbow" in the plot, where the rate of decrease in inertia slows down, suggests the optimal number of clusters. For this data, it's around k=3.
  3. Clustering and Visualization: We fit the K-Means model with k=3 and assign the cluster labels back to our DataFrame. The scatter plot helps us visualize the resulting clusters based on weekday and weekend sleep midpoints.

Common pitfalls

  • Forgetting to Scale Features: This can lead to skewed clusters if the feature scales are very different.
  • Misinterpreting the Elbow: Sometimes the elbow is not very clear. In such cases, you might need to use other methods like the silhouette score or rely on domain knowledge to choose k.
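
When the elbow is ambiguous, the silhouette score mentioned above can be computed for each candidate k. The sketch below runs on made-up, well-separated data (three groups of weekday/weekend midpoints), so the names X_demo, centers, and silhouette_by_k and all numeric values are assumptions. Real sleep data will usually give a less clear-cut answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three groups of [weekday midpoint, weekend midpoint]
rng = np.random.default_rng(0)
centers = np.array([[2.5, 3.0], [4.5, 5.5], [6.5, 8.0]])
X_demo = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in centers])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Silhouette needs at least 2 clusters, so start at k=2
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo_scaled)
    silhouette_by_k[k] = silhouette_score(X_demo_scaled, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```

The score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters.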

Detect Outliers with DBSCAN Clustering

What we're doing

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another powerful clustering algorithm that can identify clusters of arbitrary shapes and is also robust to outliers. Unlike K-Means, it doesn't require us to specify the number of clusters beforehand. This can be very useful when we don't know how many natural groups exist in the data.

Input: Scaled feature matrix X_scaled
Output: Cluster assignments with -1 indicating noise/outlier points

Implementation

```python
# src/dbscan_clustering.py
# Continues from the K-Means step: reuses X_scaled, features_df, plt, and sns
from sklearn.cluster import DBSCAN

# Using the same scaled features
dbscan = DBSCAN(eps=0.5, min_samples=5)
features_df['DBSCAN_Cluster'] = dbscan.fit_predict(X_scaled)

print("\nDataFrame with DBSCAN Clusters:")
print(features_df.head())
print("\nDBSCAN Cluster Counts:")
print(features_df['DBSCAN_Cluster'].value_counts())

# Visualize the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(data=features_df, x='Weekday_Midpoint', y='Weekend_Midpoint', hue='DBSCAN_Cluster', palette='viridis', s=100)
plt.title('Chronotype Clusters (DBSCAN)')
plt.xlabel('Weekday Sleep Midpoint')
plt.ylabel('Weekend Sleep Midpoint')
plt.legend(title='Cluster')
plt.show()
```

How it works

  • DBSCAN Parameters: DBSCAN has two main parameters: eps (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and min_samples (the number of samples in a neighborhood for a point to be considered as a core point). Tuning these can be tricky and may require some experimentation.
  • Noise Detection: DBSCAN labels noise points as -1. This is a key advantage as it can isolate outliers that don't belong to any cluster.
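
The noise label is easiest to see on a toy example. The sketch below builds two dense groups plus one isolated point and uses the same eps and min_samples as above. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense_a = rng.normal([0.0, 0.0], 0.1, size=(30, 2))   # first tight group
dense_b = rng.normal([3.0, 3.0], 0.1, size=(30, 2))   # second tight group
lone_point = np.array([[10.0, 10.0]])                 # far from everything
X_demo = np.vstack([dense_a, dense_b, lone_point])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# Count clusters excluding the noise label (-1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
print("label of the isolated point:", labels[-1])
```

The isolated point has fewer than min_samples neighbors within eps and is not within eps of any core point, so it is left unassigned rather than forced into the nearest group as K-Means would do.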

Common pitfalls

  • Parameter Tuning: The performance of DBSCAN is sensitive to eps and min_samples. Poor choices can result in all points being in one cluster or all points being labeled as noise. You may need to experiment with different values to get meaningful results.
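
A common heuristic for narrowing down eps, offered here as a suggestion rather than part of the original workflow, is the k-distance plot: compute each point's distance to its k-th nearest neighbor, sort those distances, and look for the knee of the curve. The helper name sorted_k_distances and the random demo data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_k_distances(X, min_samples=5):
    """Sorted distance from each point to its min_samples-th neighbor.

    Querying the training data returns each point as its own nearest
    neighbor (distance 0), which matches DBSCAN counting the point
    itself toward min_samples.
    """
    neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = neighbors.kneighbors(X)
    return np.sort(distances[:, -1])

# Hypothetical stand-in for X_scaled: 200 users, 4 standardized features
rng = np.random.default_rng(2)
X_demo = rng.normal(0.0, 1.0, size=(200, 4))
k_dist = sorted_k_distances(X_demo, min_samples=5)
print(k_dist[:5], k_dist[-5:])
```

Plotting k_dist (for example with plt.plot(k_dist)) and reading off the distance where the curve bends sharply upward gives a candidate eps to try. It is a starting point for experimentation, not a guarantee of good clusters.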

Putting It All Together

Let's analyze and interpret the clusters we've found. We can group our DataFrame by the cluster labels and look at the mean values of our features for each group.

```python
# src/analyze_clusters.py
# Continues from the clustering steps: expects features_df with both label columns

# Analyze K-Means clusters
kmeans_summary = features_df.groupby('KMeans_Cluster')[['Weekday_Midpoint', 'Weekend_Midpoint', 'Weekday_Duration', 'Weekend_Duration']].mean()
print("\nK-Means Cluster Summary:")
print(kmeans_summary)

# Analyze DBSCAN clusters (noise points, labeled -1, are excluded)
dbscan_summary = features_df[features_df['DBSCAN_Cluster'] != -1].groupby('DBSCAN_Cluster')[['Weekday_Midpoint', 'Weekend_Midpoint', 'Weekday_Duration', 'Weekend_Duration']].mean()
print("\nDBSCAN Cluster Summary:")
print(dbscan_summary)
```

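Interpreting the K-Means Clusters: K-Means assigns cluster numbers arbitrarily, so the numbering below is illustrative; use the mean midpoints in the summary table to decide which label corresponds to which chronotype. Based on the output of the K-Means summary, we might find three distinct groups: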

  • Cluster 0: Early Birds: Low weekday and weekend midpoints (e.g., around 3-4 AM), suggesting they go to bed and wake up early consistently.
  • Cluster 1: Night Owls: High weekday and weekend midpoints (e.g., around 6-7 AM), indicating a preference for later sleep times.
  • Cluster 2: Standard Sleepers: Midpoints in between the other two groups, representing a more typical sleep schedule.
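
Because the cluster numbers themselves carry no meaning, the names can be assigned programmatically by ordering clusters by their mean midpoint. This is a minimal sketch under two assumptions: exactly three clusters were found, and midpoints do not straddle midnight (a plain mean would be misleading if they did). The helper name name_clusters_by_midpoint and the demo values are hypothetical.

```python
import pandas as pd

def name_clusters_by_midpoint(df, cluster_col='KMeans_Cluster',
                              midpoint_col='Weekday_Midpoint'):
    """Map cluster labels to names, earliest mean midpoint first."""
    ordered = df.groupby(cluster_col)[midpoint_col].mean().sort_values().index
    names = ['Early Bird', 'Standard Sleeper', 'Night Owl']
    if len(ordered) != len(names):
        raise ValueError("This sketch assumes exactly three clusters")
    return dict(zip(ordered, names))

# Made-up example where label 2 happens to be the earliest group
demo_df = pd.DataFrame({
    'KMeans_Cluster': [2, 2, 0, 0, 1, 1],
    'Weekday_Midpoint': [2.5, 2.7, 6.4, 6.6, 4.4, 4.6],
})
name_map = name_clusters_by_midpoint(demo_df)
demo_df['Chronotype'] = demo_df['KMeans_Cluster'].map(name_map)
print(demo_df)
```

Applied to features_df, the same mapping would stay correct even if a different random_state shuffled the cluster numbers.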

Alternative Approaches

  • Hierarchical Clustering: This method creates a tree of clusters and can be useful for visualizing how groups are nested. It doesn't require a predefined number of clusters, but you do need to decide where to "cut" the tree.
  • Gaussian Mixture Models (GMM): GMMs are a probabilistic model that assumes the data points are generated from a mixture of a finite number of Gaussian distributions. This can be more flexible than K-Means as it allows for clusters that are not spherical.
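
As a rough illustration of the GMM alternative, the sketch below fits Scikit-learn's GaussianMixture to made-up midpoint data and reads back both hard labels and per-cluster membership probabilities. The data, the variable names, and the choice of three components are assumptions for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three groups of [weekday midpoint, weekend midpoint]
rng = np.random.default_rng(3)
centers = np.array([[2.5, 3.0], [4.5, 5.5], [6.5, 8.0]])
X_demo = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in centers])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# 'full' covariance lets each component take its own elliptical shape
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X_demo_scaled)

hard_labels = gmm.predict(X_demo_scaled)       # most likely component per user
membership = gmm.predict_proba(X_demo_scaled)  # probability of each component
print(hard_labels[:5])
print(membership[:5].round(3))
```

The membership probabilities are the main practical difference from K-Means: a user who sits between two groups shows up with split probabilities instead of a single hard label.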

Conclusion

  • Summary of achievements: We've successfully built an end-to-end pipeline to discover user chronotypes from sleep timing data. We've cleaned the data, engineered meaningful features, and applied two different unsupervised clustering algorithms, K-Means and DBSCAN, to segment users into distinct groups.
  • Next steps for readers: Try applying this methodology to a different behavioral dataset. You could also experiment with other clustering algorithms or engineer additional features, such as the difference between weekday and weekend sleep patterns (an indicator of "social jetlag").
  • Call to action: Try out the code and see what you can discover in your own data! Share your findings or any interesting modifications you've made in the comments below.


Frequently Asked Questions

How do I determine the optimal number of clusters?

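The elbow method (plotting inertia vs k) is the most common approach. Look for the "elbow" where the rate of decrease slows. Alternatively, use the silhouette score, which ranges from -1 to 1 (as a rough rule of thumb, values above about 0.5 are often read as reasonably well-separated clusters), or rely on domain knowledge: the early bird, standard sleeper, and night owl framing naturally suggests 3 groups.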

What's the difference between K-Means and DBSCAN?

K-Means requires specifying k and creates spherical clusters based on distance. DBSCAN is density-based, finds arbitrarily-shaped clusters, and detects outliers (noise points labeled -1). Use K-Means for main segmentation and DBSCAN to identify irregular sleep patterns.

Can I use this for real-time chronotype detection?

For real-time, you'd need to retrain or update clusters periodically as new data arrives. Consider online clustering algorithms or batch processing nightly. For individual users, compare their current sleep patterns to pre-computed cluster centroids.
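
Comparing a user to pre-computed centroids can be sketched with a fitted scaler and K-Means model: scale the new user's features with the stored scaler, then call predict. Everything below runs on made-up data, and the names demo_scaler, demo_kmeans, and assign_cluster are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical historical data: [weekday midpoint, weekend midpoint]
rng = np.random.default_rng(4)
centers = np.array([[2.5, 3.0], [4.5, 5.5], [6.5, 8.0]])
X_hist = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in centers])

# Fit once (for example in a nightly batch job) and keep both objects
demo_scaler = StandardScaler().fit(X_hist)
demo_kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
demo_kmeans.fit(demo_scaler.transform(X_hist))

def assign_cluster(weekday_midpoint, weekend_midpoint):
    """Return the nearest pre-computed centroid for one user's features."""
    new_row = np.array([[weekday_midpoint, weekend_midpoint]])
    return int(demo_kmeans.predict(demo_scaler.transform(new_row))[0])

early_user_cluster = assign_cluster(2.4, 3.1)
late_user_cluster = assign_cluster(6.6, 8.1)
print(early_user_cluster, late_user_cluster)
```

The new user's features must go through the same scaler that was fitted on the historical data. Fitting a fresh scaler on a single row would erase the information the centroids depend on.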

What other features can improve clustering?

Consider adding: sleep regularity (standard deviation of sleep times), social jetlag (weekend vs weekday difference), sleep efficiency, and self-reported morningness scores. More features can capture nuances but may require dimensionality reduction.
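
Of these, social jetlag can be derived from the columns already engineered in this article. The sketch below takes the weekend midpoint minus the weekday midpoint and wraps the result so that differences across midnight stay small. The helper name circular_hour_difference and the demo values are assumptions. Sleep regularity is not shown because it needs night-by-night records rather than weekday and weekend summaries.

```python
import numpy as np
import pandas as pd

def circular_hour_difference(later, earlier):
    """Signed difference (later - earlier) in hours, wrapped into [-12, 12)."""
    diff = np.asarray(later, dtype=float) - np.asarray(earlier, dtype=float)
    return (diff + 12) % 24 - 12

# Made-up midpoints; the second user's midpoints straddle midnight
demo_df = pd.DataFrame({
    'Weekday_Midpoint': [3.0, 23.5, 5.0],
    'Weekend_Midpoint': [4.5, 0.5, 4.0],
})
demo_df['Social_Jetlag'] = circular_hour_difference(
    demo_df['Weekend_Midpoint'], demo_df['Weekday_Midpoint']
)
print(demo_df['Social_Jetlag'].tolist())  # [1.5, 1.0, -1.0]
```

A positive value means the user sleeps later on weekends than on weekdays, and a negative value means the reverse.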

Is chronotype clustering clinically validated?

Chronotypes are well-established in sleep research (Horne-Östberg Morningness-Eveningness Questionnaire). However, clustering-based discovery from passive data is an emerging approach. Always validate clusters against subjective user reports for product applications.

How do I handle users with inconsistent sleep schedules?

These users often appear as DBSCAN noise points (cluster -1). Consider creating a separate "irregular" chronotype category, or track consistency scores over time. Extremely irregular patterns may indicate sleep disorders requiring clinical attention.

Can I apply this to wearable device data?

Yes. Fitbit, Apple Watch, and Oura all provide sleep timing data. The same feature engineering applies: extract bedtime and wake time, then calculate midpoint and duration. You may get more granular data, but the clustering approach remains the same.

WellAlly's core development team, comprised of healthcare professionals, software engineers, and UX designers committed to revolutionizing digital health management.

Expertise

Healthcare Technology
Software Development
User Experience
AI & Machine Learning
