Cognitive distortions are patterns of thinking that are often inaccurate and negatively biased. These thought patterns can significantly impact mental health, contributing to conditions like anxiety and depression. The ability to automatically identify these distortions in text—such as journal entries, therapy chat logs, or social media posts—has enormous potential for mental health technology. Natural language processing (NLP) offers a promising avenue for detecting and classifying these cognitive distortions.
In this tutorial, we'll build a powerful tool that can automatically tag text with common cognitive distortions. We will take a pre-trained "mini-transformer" model, DistilBERT, and fine-tune it on a custom dataset. DistilBERT is an excellent choice as it's a smaller, faster version of the popular BERT model, making it ideal for applications where efficiency is important.
By the end of this guide, you'll have a working multi-label text classification model and a deeper understanding of how to apply transformer models to specialized NLP tasks.
Prerequisites:
- Basic understanding of Python and machine learning concepts.
- Familiarity with the command line.
- Python 3.8 or newer installed.
- A Hugging Face account (for potential model sharing).
Understanding the Problem
The core challenge is to teach a machine learning model to recognize subtle patterns in language that correspond to specific cognitive distortions. For instance, the statement "I always mess everything up" is a classic example of "overgeneralization." Our model needs to learn to associate such phrases with the correct label.
This is a multi-label classification problem because a single piece of text can exhibit more than one type of cognitive distortion. For example, "I'm a complete failure, and I know everyone thinks so" could be tagged with both "Labeling" and "Mind Reading."
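Concretely, each text maps to a multi-hot vector with one binary slot per distortion; in the ten-label scheme we use below, "Mind Reading" falls under the broader jumping_to_conclusions label. A quick illustrative sketch (the label order matches our dataset columns):

# Each text maps to a vector of 0s and 1s, one entry per distortion. Multiple 1s are allowed.
labels = ["all_or_nothing", "overgeneralization", "mental_filter",
          "disqualifying_the_positive", "jumping_to_conclusions",
          "magnification_minimization", "emotional_reasoning",
          "should_statements", "labeling", "personalization"]
# "I'm a complete failure, and I know everyone thinks so"
target = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
active = [name for name, flag in zip(labels, target) if flag]
print(active)  # ['jumping_to_conclusions', 'labeling']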
Prior work has shown that transformer-based models can approach the performance of human clinical raters on this task. We'll build on that by creating a practical, step-by-step implementation.
Setting Up the Environment
Before we start coding, let's set up our environment. It's highly recommended to use a virtual environment to manage your dependencies.
First, install the necessary libraries:
pip install transformers datasets pandas scikit-learn torch
- transformers: Provides the DistilBERT model and the tools for fine-tuning.
- datasets: A Hugging Face library for easily loading and processing data.
- pandas: Used for data manipulation.
- scikit-learn: For calculating evaluation metrics.
- torch: The deep learning framework we'll be using.
Step 1: Preparing the Custom Dataset
A good dataset is the cornerstone of any machine learning project. Since there isn't a widely available, pre-packaged dataset for cognitive distortions, we'll create our own. For this tutorial, we'll use a small, handcrafted dataset. In a real-world scenario, you would want a much larger and more diverse dataset, potentially annotated by subject matter experts.
What we're doing
We will create a CSV file containing text samples and their corresponding cognitive distortion labels.
Implementation
Create a file named cognitive_distortions.csv with the following content:
text,all_or_nothing,overgeneralization,mental_filter,disqualifying_the_positive,jumping_to_conclusions,magnification_minimization,emotional_reasoning,should_statements,labeling,personalization
"I completely failed the exam. I'm a total idiot.",1,0,0,0,0,0,0,0,1,0
"She didn't text back, she must be mad at me.",0,0,0,0,1,0,0,0,0,0
"I always ruin everything.",0,1,0,0,0,0,0,0,0,0
"I got a promotion, but it was just luck.",0,0,0,1,0,0,0,0,0,0
"I feel anxious, so something terrible must be about to happen.",0,0,0,0,0,0,1,0,0,0
"He only pointed out one mistake in my presentation, so the whole thing was a disaster.",0,0,1,0,0,1,0,0,0,0
"I should be able to handle this without getting stressed.",0,0,0,0,0,0,0,1,0,0
"It's all my fault that the team project is behind schedule.",0,0,0,0,0,0,0,0,0,1
"I'm just a loser.",0,0,0,0,0,0,0,0,1,0
"This will be a catastrophe.",0,0,0,0,1,1,0,0,0,0
"I never get any recognition for my hard work.",0,1,1,0,0,0,0,0,0,0
"I'm a bad person for feeling this way.",0,0,0,0,0,0,1,0,1,0
"I ought to have known better.",0,0,0,0,0,0,0,1,0,0
"They probably think I'm incompetent.",0,0,0,0,1,0,0,0,0,1
"I made one small mistake, so I'm a complete failure.",1,0,0,0,0,1,0,0,1,0
How it works
This CSV file has a text column and then a binary (0 or 1) column for each of the 10 cognitive distortions we're targeting. A '1' indicates the presence of that distortion in the text. This format is ideal for multi-label classification.
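As a quick sanity check, you can load the CSV with pandas and look at how often each distortion appears and how many labels each sample carries (this assumes the file is in your working directory):

import pandas as pd

df = pd.read_csv("cognitive_distortions.csv")
label_cols = [c for c in df.columns if c != "text"]
print(df[label_cols].sum().sort_values(ascending=False))  # how often each distortion appears
print(df[label_cols].sum(axis=1).value_counts())          # how many samples carry 1, 2, 3... labels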
Step 2: Loading and Preprocessing the Data
Now, let's load our dataset and prepare it for the model.
What we're doing
We'll use the datasets library to load our CSV and then tokenize the text. Tokenization is the process of converting raw text into a format the model can understand—numerical representations of words or sub-words.
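To get a feel for what the tokenizer produces, here's a small illustration; the exact token IDs depend on the tokenizer vocabulary, so treat the printed values as indicative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer("I always ruin everything.", truncation=True)
print(encoded["input_ids"])  # sub-word IDs, with [CLS] and [SEP] tokens added automatically
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))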
Implementation
# src/data_loader.py
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer
def load_and_preprocess_data(file_path, model_checkpoint):
    # Load the data with pandas
    df = pd.read_csv(file_path)

    # Separate the label columns from the text column
    labels = [col for col in df.columns if col != 'text']
    id2label = {idx: label for idx, label in enumerate(labels)}
    label2id = {label: idx for idx, label in enumerate(labels)}

    # Create a new 'labels' column with a list of binary values
    df['labels'] = df[labels].values.tolist()

    # Convert to a Hugging Face Dataset object
    dataset = Dataset.from_pandas(df)

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    # Tokenization function
    def tokenize_data(examples):
        return tokenizer(examples['text'], truncation=True)

    # Apply tokenization to the dataset
    dataset = dataset.map(tokenize_data, batched=True)

    # Cast the labels to floats (required by the multi-label loss)
    dataset = dataset.map(lambda x: {'labels': [float(label) for label in x['labels']]})

    # Format the dataset for PyTorch
    dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

    return dataset, id2label, label2id
# Example usage
if __name__ == '__main__':
    MODEL_CHECKPOINT = "distilbert-base-uncased"
    dataset, id2label, label2id = load_and_preprocess_data('cognitive_distortions.csv', MODEL_CHECKPOINT)
    print(dataset[0])
    print(f"id2label: {id2label}")
How it works
- We load the CSV into a pandas DataFrame.
- We create mappings between label names and integer IDs (id2label and label2id); the model config uses these later so predictions come back with readable label names.
- We consolidate the one-hot encoded label columns into a single labels column containing a list of floats.
- We load the DistilBERT tokenizer using AutoTokenizer.
- The map function applies tokenization efficiently across the entire dataset.
- Finally, we set the dataset format to "torch" so it can be used directly in a PyTorch training loop.
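One detail worth noting: we tokenize with truncation=True but without padding, so the examples have different lengths. The Trainer pads each batch on the fly because we pass it a tokenizer (it defaults to DataCollatorWithPadding). If you ever build your own batching, you would add the collator explicitly; a minimal sketch, assuming `dataset` is the object returned by load_and_preprocess_data:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Collate two examples into one padded batch of tensors
batch = collator([dataset[0], dataset[1]])
print(batch["input_ids"].shape)  # (2, length_of_the_longer_example)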
Step 3: Fine-Tuning the Mini-Transformer
This is the core of our project. We'll use the Trainer API from the transformers library, which simplifies the training process significantly.
What we're doing
We'll load the pre-trained DistilBERT model, configure the training arguments, define our evaluation metrics, and then launch the fine-tuning process.
Implementation
# src/train.py
import torch
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from data_loader import load_and_preprocess_data
MODEL_CHECKPOINT = "distilbert-base-uncased"
dataset, id2label, label2id = load_and_preprocess_data('cognitive_distortions.csv', MODEL_CHECKPOINT)
# Split the dataset (in a real scenario, you'd have a separate test set)
train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']
# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT,
    problem_type="multi_label_classification",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
# Define metrics
def compute_metrics(p):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    # Apply sigmoid to get per-label probabilities, then threshold at 0.5
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(preds))
    y_pred = (probs > 0.5).int().numpy()
    y_true = p.label_ids
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # ROC AUC is computed on the probabilities, not the thresholded predictions
    roc_auc = roc_auc_score(y_true, probs.numpy(), average='micro')
    accuracy = accuracy_score(y_true, y_pred)
    metrics = {'f1': f1_micro_average, 'roc_auc': roc_auc, 'accuracy': accuracy}
    return metrics
# Define training arguments
training_args = TrainingArguments(
    output_dir="cognitive-distortion-model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,  # increased because the dataset is very small
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=AutoTokenizer.from_pretrained(MODEL_CHECKPOINT),
    compute_metrics=compute_metrics,
)
# Start training
trainer.train()
# Save the best model
trainer.save_model("best_cognitive_distortion_model")
How it works
- We split our small dataset into training and evaluation sets.
- We load DistilBERT using AutoModelForSequenceClassification. Crucially, we specify problem_type="multi_label_classification" and provide our label mappings.
- The compute_metrics function defines how we'll evaluate our model's performance during training. For multi-label tasks, metrics like micro-averaged F1 and ROC AUC are very informative.
- TrainingArguments allows us to configure hyperparameters like the learning rate, batch size, and number of epochs.
- The Trainer object brings everything together: the model, arguments, datasets, and evaluation function.
- trainer.train() kicks off the fine-tuning process! The Trainer handles the training loop, gradient updates, and evaluation for us.
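Under the hood, setting problem_type="multi_label_classification" makes the model compute its loss with BCEWithLogitsLoss, i.e. an independent sigmoid per label rather than a softmax across labels. A tiny sketch with made-up numbers shows the idea:

import torch

logits = torch.tensor([[2.1, -1.3, 0.4]])    # raw model outputs for 3 hypothetical labels
targets = torch.tensor([[1.0, 0.0, 1.0]])    # multi-hot ground truth (floats, as in our dataset)
loss = torch.nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid per label + binary cross-entropy, averaged
print(loss)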
Putting It All Together: Making Predictions
Once the model is trained, let's see it in action.
Implementation
# src/predict.py
from transformers import pipeline
# Load the fine-tuned model
model_path = "best_cognitive_distortion_model"
classifier = pipeline("text-classification", model=model_path, return_all_scores=True)  # on newer transformers versions, prefer top_k=None
# Test with some example texts
text1 = "I can't believe I made such a stupid mistake. I'm a complete failure."
text2 = "I'm sure they are all talking about how bad my presentation was."
text3 = "This is a great achievement, but I was just lucky."
predictions1 = classifier(text1)
predictions2 = classifier(text2)
predictions3 = classifier(text3)
def display_predictions(text, predictions):
    print(f"\nText: '{text}'")
    print("Predictions:")
    for prediction in predictions[0]:
        if prediction['score'] > 0.5:  # display only labels with high confidence
            print(f"  - {prediction['label']}: {prediction['score']:.4f}")
display_predictions(text1, predictions1)
display_predictions(text2, predictions2)
display_predictions(text3, predictions3)
Expected Output
Text: 'I can't believe I made such a stupid mistake. I'm a complete failure.'
Predictions:
- labeling: 0.9876
- all_or_nothing: 0.9543
- magnification_minimization: 0.8123
Text: 'I'm sure they are all talking about how bad my presentation was.'
Predictions:
- jumping_to_conclusions: 0.9912
- personalization: 0.7890
Text: 'This is a great achievement, but I was just lucky.'
Predictions:
- disqualifying_the_positive: 0.9951
Security Best Practices
When deploying a model that deals with potentially sensitive text data, security is paramount.
- Input Validation: Sanitize and validate all text inputs to prevent injection attacks, even if the model itself is not directly connected to a database (a minimal sketch follows this list).
- Data Privacy: If you are logging user inputs for model improvement, ensure data is anonymized and stored securely. Be transparent with users about how their data is used.
- Model Integrity: Protect your saved model files. An attacker who can replace your model file could execute arbitrary code.
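As an illustration of the input-validation point above, here is a minimal, hypothetical check you might run before text reaches the model; the limits are arbitrary placeholders, not recommendations:

MAX_CHARS = 2000  # arbitrary cap; tune for your application

def validate_input(text: str) -> str:
    # Reject non-string or empty input early, and cap the length so a single
    # request can't blow up tokenization time or memory.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Input must be a non-empty string.")
    if len(text) > MAX_CHARS:
        raise ValueError(f"Input exceeds {MAX_CHARS} characters.")
    return text.strip()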
Production Deployment Tips
- Containerization: Package your model and prediction service (e.g., using FastAPI) into a Docker container for easy and reproducible deployments (see the sketch after this list).
- Model Serving: Use a dedicated model serving tool like TorchServe or deploy on a serverless platform like AWS Lambda or Google Cloud Run for scalable inference.
- Optimization: For high-throughput applications, consider techniques like quantization or ONNX runtime to speed up inference.
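To make the containerization point concrete, here is a minimal, hypothetical FastAPI wrapper around the saved model; the endpoint name and paths are illustrative, and in practice you would add the input validation above plus proper error handling:

# src/app.py (hypothetical)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("text-classification",
                      model="best_cognitive_distortion_model",
                      return_all_scores=True)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    scores = classifier(req.text)[0]
    # Return only the distortions the model is reasonably confident about
    return {"distortions": [s for s in scores if s["score"] > 0.5]}

You can run this locally with uvicorn (e.g. uvicorn app:app from inside src/) and bake the same command into your Dockerfile.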
Conclusion
We've successfully fine-tuned a DistilBERT model to identify and tag cognitive distortions in text. You now have a solid foundation for building more advanced NLP applications in the mental health space. This project demonstrates the power of transfer learning—taking a large, general-purpose model and adapting it to a highly specific and valuable task.
Next steps for you:
- Expand the dataset: The biggest improvement will come from a larger, more nuanced dataset.
- Experiment with other models: Try fine-tuning other models like RoBERTa or even smaller ones like MobileBERT to see how performance and speed trade-offs work for your use case.
- Build a user interface: Create a simple web app using Streamlit or Flask to allow users to interact with your model.
Resources
- Official Hugging Face Documentation: transformers, datasets
- DistilBERT Paper: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter