
A Blueprint for a HIPAA-Compliant AWS Data Lake: From Ingestion to Analytics

Learn to architect a production-grade, HIPAA-compliant data lake on AWS. This step-by-step guide provides a blueprint for a secure, serverless pipeline using S3, Lambda, Lake Formation, and Glue to ingest, govern, and process sensitive health data.

2025-12-11
11 min read

The healthcare industry is sitting on a goldmine of data. From wearables to electronic health records (EHRs), this information holds the key to unlocking personalized health insights, predictive diagnostics, and revolutionary treatments. But there's a huge challenge: this data is highly sensitive and protected by the Health Insurance Portability and Accountability Act (HIPAA). Mishandling Protected Health Information (PHI) can lead to massive fines, loss of reputation, and most importantly, a breach of patient trust.

So, how can we build a platform that harnesses the power of this data while upholding the highest standards of security and compliance?

In this tutorial, we will build a production-grade, HIPAA-compliant data lake on AWS. We'll construct a secure, serverless pipeline from data ingestion to analytics, ensuring every component is locked down according to HIPAA's technical safeguards. We'll build a system that securely ingests patient data, stores it in an encrypted and audited data lake, governs access with precision, and processes it for analysis without exposing sensitive PHI.

This is a blueprint for developers and DevOps engineers tasked with handling sensitive health data in the cloud. We'll focus on infrastructure and security, giving you the confidence to build robust and compliant healthcare applications.

Understanding the Problem

Handling PHI in the cloud isn't just about encrypting data; it's about creating a verifiable chain of custody and enforcing the principle of least privilege at every step. The core challenges we need to solve are:

  • Secure Ingestion: How do we get data from sources (like a patient-facing app or a third-party provider) into our system without exposing it to threats?
  • Immutable & Auditable Storage: How do we store raw PHI so that it's encrypted, tamper-proof, and every access attempt is logged?
  • Granular Access Control: How do we ensure that a data scientist working on a machine learning model can't access a patient's name or social security number, while a compliance officer can?
  • Safe Data Processing: How can we run analytics or ETL jobs on the data without exposing the raw PHI to developers or the processing engine itself?

Our serverless approach using API Gateway, Lambda, S3, Lake Formation, and Glue directly addresses these challenges by layering security controls and abstracting away the underlying infrastructure, reducing the attack surface.

Prerequisites

Before we begin, you'll need the following:

  • An AWS Account: You must have an active AWS account with administrator privileges.
  • AWS Business Associate Addendum (BAA): You must execute a BAA with AWS. You can do this through the AWS Artifact console. This is a legal requirement for handling PHI on AWS.
  • AWS CLI: The AWS Command Line Interface installed and configured on your machine. We'll use this for creating and configuring resources.
  • Basic Knowledge: Familiarity with AWS IAM, S3, Lambda, and API Gateway is recommended.
  • Software: Python 3.9+ (to match the Lambda runtime used later), the AWS SAM CLI (used to deploy the ingestion API in Step 2), and optionally the AWS CDK as an alternative deployment method.

Step 1: Secure and Encrypted Storage with S3

The foundation of our data lake is Amazon S3. But a default S3 bucket is not HIPAA-compliant. We need to enforce several critical security controls: encryption, access logging, versioning, and strict access policies.

What we're doing

We will create two S3 buckets:

  1. phi-raw-data-<your-unique-id>: This is the landing zone where raw, encrypted PHI will be ingested. Access will be extremely restricted.
  2. phi-processed-data-<your-unique-id>: This is the analytics zone where de-identified or pseudonymized data will be stored, ready for querying.

Implementation

Run the following AWS CLI commands. Replace <your-unique-id> with a unique identifier.

code
# A unique suffix for our bucket names
UNIQUE_ID="your-unique-id-12345"

# 1. Create the RAW data bucket
# --object-lock-enabled-for-bucket turns on S3 Object Lock (WORM protection),
# which also enables versioning, so retention or legal holds can be applied later
aws s3api create-bucket \
    --bucket phi-raw-data-${UNIQUE_ID} \
    --region us-east-1 \
    --object-lock-enabled-for-bucket

# 2. Block all public access
aws s3api put-public-access-block \
    --bucket phi-raw-data-${UNIQUE_ID} \
    --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# 3. Enable server-side encryption by default (SSE-S3)
aws s3api put-bucket-encryption \
    --bucket phi-raw-data-${UNIQUE_ID} \
    --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'

# 4. Enable versioning to protect against accidental deletion/modification
aws s3api put-bucket-versioning \
    --bucket phi-raw-data-${UNIQUE_ID} \
    --versioning-configuration Status=Enabled

# 5. Enable server access logging
# First, create a dedicated logging bucket
LOG_BUCKET="phi-access-logs-${UNIQUE_ID}"
aws s3api create-bucket --bucket ${LOG_BUCKET} --region us-east-1

# Buckets created since April 2023 have ACLs disabled by default, so grant the
# S3 log delivery service access via a bucket policy instead of the legacy
# LogDelivery group ACL.
aws s3api put-bucket-policy --bucket ${LOG_BUCKET} --policy "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"S3ServerAccessLogsPolicy\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"logging.s3.amazonaws.com\"},\"Action\":\"s3:PutObject\",\"Resource\":\"arn:aws:s3:::${LOG_BUCKET}/*\"}]}"

# Now apply the logging configuration
aws s3api put-bucket-logging \
    --bucket phi-raw-data-${UNIQUE_ID} \
    --bucket-logging-status "{\"LoggingEnabled\":{\"TargetBucket\":\"${LOG_BUCKET}\",\"TargetPrefix\":\"raw-data-logs/\"}}"

# 6. Apply a strict data retention lifecycle policy
# This policy moves non-current versions to Glacier after 30 days and deletes them after 365 days.
cat > lifecycle.json <<EOL
{
  "Rules": [
    {
      "ID": "HIPAA-Retention-Policy",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
      }
    }
  ]
}
EOL

aws s3api put-bucket-lifecycle-configuration \
    --bucket phi-raw-data-${UNIQUE_ID} \
    --lifecycle-configuration file://lifecycle.json

# Repeat the process for the 'phi-processed-data' bucket
aws s3api create-bucket --bucket phi-processed-data-${UNIQUE_ID} --region us-east-1
aws s3api put-public-access-block --bucket phi-processed-data-${UNIQUE_ID} --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
aws s3api put-bucket-encryption --bucket phi-processed-data-${UNIQUE_ID} --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'
aws s3api put-bucket-versioning --bucket phi-processed-data-${UNIQUE_ID} --versioning-configuration Status=Enabled
aws s3api put-bucket-logging --bucket phi-processed-data-${UNIQUE_ID} --bucket-logging-status "{\"LoggingEnabled\":{\"TargetBucket\":\"${LOG_BUCKET}\",\"TargetPrefix\":\"processed-data-logs/\"}}"

How it works

  • Encryption at Rest: put-bucket-encryption forces all new objects to be encrypted with AES-256 (SSE-S3), satisfying HIPAA's encryption-at-rest safeguard. If you need key-usage auditing and rotation control, SSE-KMS with a customer managed key is a common upgrade.
  • No Public Access: put-public-access-block is a safety mechanism to prevent accidental public exposure of PHI, a common cause of data breaches.
  • Data Integrity: put-bucket-versioning protects against both accidental deletion and malicious modification by keeping a history of every object version.
  • Audit Controls: put-bucket-logging creates an audit trail of every request made to your S3 bucket, which is essential for security analysis and compliance.
  • Lifecycle Policies: The lifecycle policy helps manage storage costs and enforce data retention policies as required by HIPAA, automatically archiving and eventually deleting old data versions.
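
Before moving on, it's worth verifying that each control actually took effect. The following read-only checks (a minimal sketch using the same bucket names as above) should each return the configuration you just applied:

code
# Confirm default encryption, public access block, versioning, logging, and lifecycle
aws s3api get-bucket-encryption --bucket phi-raw-data-${UNIQUE_ID}
aws s3api get-public-access-block --bucket phi-raw-data-${UNIQUE_ID}
aws s3api get-bucket-versioning --bucket phi-raw-data-${UNIQUE_ID}
aws s3api get-bucket-logging --bucket phi-raw-data-${UNIQUE_ID}
aws s3api get-bucket-lifecycle-configuration --bucket phi-raw-data-${UNIQUE_ID}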

Step 2: Secure Data Ingestion via API Gateway and Lambda

Now that we have a secure place to store data, we need a secure way to get it there. We'll create a serverless ingestion endpoint using API Gateway and a Lambda function. This approach is highly scalable, cost-effective, and secure.

What we're doing

  1. API Gateway: Create a REST API that acts as a secure front door. We'll protect it with AWS WAF (Web Application Firewall); a sketch of the WAF association is shown at the end of this step.
  2. IAM Role: Define a least-privilege IAM role for our Lambda function, allowing it only to write to the phi-raw-data bucket and create logs.
  3. Lambda Function: Write a Python function that receives data from the API Gateway, performs basic validation, and writes the raw event to our S3 raw bucket.

Implementation

1. Create the Lambda Execution Role

code
# iam-role-policy.json
cat > iam-role-policy.json <<EOL
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::phi-raw-data-${UNIQUE_ID}/*"
        }
    ]
}
EOL

# trust-policy.json
cat > trust-policy.json <<EOL
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOL

aws iam create-role --role-name PHI-Ingestion-Lambda-Role --assume-role-policy-document file://trust-policy.json
aws iam put-role-policy --role-name PHI-Ingestion-Lambda-Role --policy-name PHI-Ingestion-Lambda-Policy --policy-document file://iam-role-policy.json

2. Create the Lambda Function

code
# ingest_phi_function.py
import json
import boto3
import os
import uuid
from datetime import datetime

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    """
    Handles incoming PHI data from API Gateway.
    - Validates the presence of a patient_id.
    - Stores the raw, timestamped event in the secure S3 raw bucket.
    """
    try:
        # Get the bucket name from environment variables
        raw_bucket = os.environ['RAW_BUCKET_NAME']
        
        # Parse the incoming request body
        body = json.loads(event.get('body', '{}'))

        # --- Basic Validation ---
        if 'patient_id' not in body:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'patient_id is a required field'})
            }

        # --- Prepare data for storage ---
        timestamp = datetime.utcnow()
        event_id = str(uuid.uuid4())
        
        # Add metadata to the record
        body['_metadata'] = {
            'ingestion_timestamp_utc': timestamp.isoformat(),
            'event_id': event_id,
            'source_ip': event.get('requestContext', {}).get('identity', {}).get('sourceIp')
        }

        # --- Store in S3 Raw Bucket ---
        file_key = f"year={timestamp.year}/month={timestamp.month:02}/day={timestamp.day:02}/{event_id}.json"
        
        s3_client.put_object(
            Bucket=raw_bucket,
            Key=file_key,
            Body=json.dumps(body),
            ServerSideEncryption='AES256'
        )
        
        return {
            'statusCode': 200,
            'body': json.dumps({'status': 'success', 'event_id': event_id})
        }

    except json.JSONDecodeError:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'Invalid JSON format in request body'})
        }
    except Exception as e:
        print(f"Error processing request: {e}")
        # Generic error to avoid leaking implementation details
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Internal server error'})
        }


3. Deploy Lambda and API Gateway

We'll use a simple AWS SAM (Serverless Application Model) template, an extension of CloudFormation. Note that the template defines the same least-privilege execution role inline, so if you deploy with SAM you can treat the CLI-created role from part 1 as a reference.

code
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: HIPAA-compliant data ingestion API

Parameters:
  RawDataBucketName:
    Type: String
    Description: The name of the S3 bucket for raw PHI data.

Resources:
  PhiIngestionFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: ingest_phi_function.lambda_handler
      Runtime: python3.9
      CodeUri: .
      Description: Ingests PHI data and stores it in the raw S3 bucket.
      MemorySize: 256
      Timeout: 30
      Role: !GetAtt PhiIngestionLambdaRole.Arn
      Environment:
        Variables:
          RAW_BUCKET_NAME: !Ref RawDataBucketName
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /ingest
            Method: post

  PhiIngestionLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: S3AndCloudWatchPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: 'arn:aws:logs:*:*:*'
              - Effect: Allow
                Action: s3:PutObject
                Resource: !Sub 'arn:aws:s3:::${RawDataBucketName}/*'

Deploy this using the AWS SAM CLI: sam build && sam deploy --guided
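
Once the stack is deployed, SAM prints the API endpoint URL. You can smoke-test the ingestion path with a request like the sketch below; the endpoint URL and payload fields are placeholders, so substitute values from your own deployment:

code
# Hypothetical endpoint URL -- use the one printed by `sam deploy`
API_URL="https://<api-id>.execute-api.us-east-1.amazonaws.com/Prod/ingest"

curl -X POST "${API_URL}" \
    -H "Content-Type: application/json" \
    -d '{"patient_id": "12345", "date_of_birth": "1980-04-02", "medical_condition": "hypertension"}'

# Expected response: {"status": "success", "event_id": "<uuid>"}
# The raw event should now appear under the year=/month=/day= prefixes:
aws s3 ls s3://phi-raw-data-${UNIQUE_ID}/ --recursive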

How it works

  • Principle of Least Privilege: The IAM role is tightly scoped. The Lambda function can only write (s3:PutObject) to the specific raw data bucket and nothing else. It cannot read data or list buckets. This minimizes the "blast radius" if the function's credentials were ever compromised.
  • Serverless Security: API Gateway and Lambda are managed services, which means AWS handles the underlying infrastructure security (patching, OS hardening, etc.) as part of the shared responsibility model.
  • Input Validation: Our Lambda function performs a basic check for patient_id. In a real-world scenario, you would expand this to a full-fledged validation against a defined schema to prevent malformed data from entering your system.
  • Auditability: API Gateway execution logs and Lambda logs (stored in CloudWatch) provide a detailed record of every invocation, which is crucial for HIPAA compliance.
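
One piece of the plan above still needs wiring up: the AWS WAF protection. The SAM template doesn't create a web ACL, so here is a hedged sketch of associating an existing WAFv2 web ACL (for example, one using the AWS managed common rule set) with the deployed API stage. The web ACL ARN, API ID, and stage name are placeholders for your own values:

code
# Attach an existing REGIONAL WAFv2 web ACL to the API Gateway stage
# Replace <web-acl-arn>, <api-id>, and <stage-name> with your own values
aws wafv2 associate-web-acl \
    --web-acl-arn <web-acl-arn> \
    --resource-arn arn:aws:apigateway:us-east-1::/restapis/<api-id>/stages/<stage-name>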

Step 3: Centralized Governance with AWS Lake Formation

At this point, our raw PHI is securely stored, but nobody can access it. This is where AWS Lake Formation comes in. It acts as a centralized security layer on top of our S3 data lake, allowing us to grant fine-grained permissions (down to the column and row level) to different users and services.

What we're doing

  1. Register S3 Buckets: Tell Lake Formation to manage our S3 locations.
  2. Create a Glue Data Catalog: Use an AWS Glue Crawler to scan our raw data and create a metadata table in the Glue Data Catalog. Lake Formation uses this catalog to understand the data's schema.
  3. Set Default Permissions: Lock down Lake Formation's default settings so that access is no longer implicitly granted through IAM alone; every permission must be explicitly granted through Lake Formation. (A sketch of this lockdown follows the implementation below.)
  4. Grant Permissions: Grant our AWS Glue ETL job (which we'll create in the next step) the necessary permissions to read from the raw bucket and write to the processed bucket.

Implementation

code
# 1. Register S3 locations with Lake Formation
# We use the Lake Formation service-linked role here; alternatively, pass --role-arn
# with a role that has access to the buckets
aws lakeformation register-resource --resource-arn arn:aws:s3:::phi-raw-data-${UNIQUE_ID} --use-service-linked-role
aws lakeformation register-resource --resource-arn arn:aws:s3:::phi-processed-data-${UNIQUE_ID} --use-service-linked-role

# 2. Create a database in AWS Glue Data Catalog
aws glue create-database --database-input "{\"Name\":\"phi_database\"}"

# 3. Create a Glue Crawler to catalog the raw data
# Replace <CRAWLER_ROLE_ARN> with an IAM role that Glue can assume to read the raw bucket
aws glue create-crawler --name "phi_raw_data_crawler" \
    --role <CRAWLER_ROLE_ARN> \
    --database-name "phi_database" \
    --targets "{\"S3Targets\":[{\"Path\":\"s3://phi-raw-data-${UNIQUE_ID}/\"}]}" \
    --table-prefix "raw_"

# Run the crawler
aws glue start-crawler --name "phi_raw_data_crawler"

# After the crawler runs, it creates a 'raw_'-prefixed table in 'phi_database'. The crawler
# derives the table name from the S3 path, so check the catalog and substitute your actual
# table name wherever 'raw_data' appears below.

# 4. Grant permissions to the Glue ETL job role
# Replace <YOUR_ACCOUNT_ID> with your AWS account ID (adjust the role name if yours differs)
GLUE_JOB_ROLE_ARN="arn:aws:iam::<YOUR_ACCOUNT_ID>:role/Glue-ETL-Job-Role"

# Allow Glue job to DESCRIBE the database
aws lakeformation grant-permissions --principal "{\"DataLakePrincipalIdentifier\":\"${GLUE_JOB_ROLE_ARN}\"}" \
    --resource "{\"Database\":{\"Name\":\"phi_database\"}}" \
    --permissions "DESCRIBE"

# Allow Glue job to SELECT from the raw table
aws lakeformation grant-permissions --principal "{\"DataLakePrincipalIdentifier\":\"${GLUE_JOB_ROLE_ARN}\"}" \
    --resource "{\"Table\":{\"DatabaseName\":\"phi_database\",\"Name\":\"raw_data\"}}" \
    --permissions "SELECT"

# Allow Glue job to CREATE and WRITE to the processed location
aws lakeformation grant-permissions --principal "{\"DataLakePrincipalIdentifier\":\"${GLUE_JOB_ROLE_ARN}\"}" \
    --resource "{\"DataLocation\":{\"ResourceArn\":\"arn:aws:s3:::phi-processed-data-${UNIQUE_ID}/\"}}" \
    --permissions "DATA_LOCATION_ACCESS"
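
Item 3 of the plan (locking down default permissions) isn't covered by the commands above, so here is a hedged sketch of that lockdown. It assumes <DATA_LAKE_ADMIN_ARN> is the role or user that will administer the data lake; adjust it to your environment:

code
# Make the data lake admin explicit and stop granting default catalog permissions implicitly
aws lakeformation put-data-lake-settings --data-lake-settings "{\"DataLakeAdmins\":[{\"DataLakePrincipalIdentifier\":\"<DATA_LAKE_ADMIN_ARN>\"}],\"CreateDatabaseDefaultPermissions\":[],\"CreateTableDefaultPermissions\":[]}"

# Revoke the legacy "all IAM principals" grant so access flows only through Lake Formation
aws lakeformation revoke-permissions \
    --principal "{\"DataLakePrincipalIdentifier\":\"IAM_ALLOWED_PRINCIPALS\"}" \
    --resource "{\"Database\":{\"Name\":\"phi_database\"}}" \
    --permissions "ALL"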

How it works

Lake Formation sits in front of direct S3 access and brokers every data access request through its own authorization model. When a service like AWS Glue or Amazon Athena tries to read data, it first checks with Lake Formation to see if the principal (the IAM role) has the required permissions on the specific table, column, or row. This gives us a data governance model that is far more granular and easier to audit than relying on IAM and S3 bucket policies alone.
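
To make the column-level control concrete, suppose a data-analyst role should be able to query the raw table but never see direct identifiers. A hedged sketch (the analyst role and the excluded column names are illustrative) grants SELECT on everything except the PHI columns:

code
# Illustrative only: SELECT on raw_data with the PHI columns hidden
ANALYST_ROLE_ARN="arn:aws:iam::<YOUR_ACCOUNT_ID>:role/Data-Analyst-Role"

aws lakeformation grant-permissions \
    --principal "{\"DataLakePrincipalIdentifier\":\"${ANALYST_ROLE_ARN}\"}" \
    --resource "{\"TableWithColumns\":{\"DatabaseName\":\"phi_database\",\"Name\":\"raw_data\",\"ColumnWildcard\":{\"ExcludedColumnNames\":[\"patient_id\",\"first_name\",\"last_name\",\"ssn\",\"address\",\"date_of_birth\"]}}}" \
    --permissions "SELECT"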

Step 4: Secure and Automated Data Processing with AWS Glue

Our final step is to process the raw PHI into a useful, de-identified format for analytics. We'll use an AWS Glue ETL job, which is a serverless Spark environment. This job will read the raw JSON files, mask or remove sensitive fields, and write the result as compressed Parquet files in our processed bucket.

What we're doing

  1. Create a Glue ETL Job: Define a serverless Spark job.
  2. Write the PySpark Script: The script will perform the de-identification logic.
  3. Run the Job: Trigger the job to process the data.

Implementation

PySpark ETL Script (deidentify_phi_job.py)

code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import col, sha2, udf
from pyspark.sql.types import StringType

# UDF that generalizes a date of birth to January 1 of the birth year,
# keeping only the year for analytics
def mask_dob(dob_str):
    if dob_str and isinstance(dob_str, str):
        try:
            return dob_str.split('-')[0] + "-01-01"
        except Exception:
            return None
    return None

mask_dob_udf = udf(mask_dob, StringType())

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read data from the Glue Data Catalog (which is managed by Lake Formation)
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="phi_database",
    table_name="raw_data",
    transformation_ctx="datasource",
)

# Convert to Spark DataFrame for easier manipulation
df = datasource.toDF()

# --- De-identification and Transformation Logic ---
# 1. Pseudonymize patient_id with a SHA-256 hash. In production, use a keyed/salted
#    hash so low-entropy identifiers can't be recovered by brute force.
# 2. Drop direct identifiers such as name, SSN, and address.
# 3. Generalize date of birth to the birth year only.
# 4. Drop ingestion metadata fields.
processed_df = df.withColumn("patient_hash", sha2(col("patient_id"), 256)) \
                 .withColumn("birth_year", mask_dob_udf(col("date_of_birth"))) \
                 .drop("patient_id", "first_name", "last_name", "ssn", "address", "date_of_birth") \
                 .drop("_metadata")

# Convert back to DynamicFrame
processed_dynamic_frame = DynamicFrame.fromDF(
    processed_df, glueContext, "processed_dynamic_frame"
)

# --- Write to Processed S3 Bucket in Parquet format ---
# Parquet is a columnar format, great for analytics.
# Partitioning by a common query field (e.g., condition) improves query performance.
datasink = glueContext.write_dynamic_frame.from_options(
    frame=processed_dynamic_frame,
    connection_type="s3",
    connection_options={
        "path": "s3://phi-processed-data-<your-unique-id>/",
        "partitionKeys": ["medical_condition"],
    },
    format="parquet",
    transformation_ctx="datasink",
)

job.commit()
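
The script has to be uploaded and registered as a Glue job before it can run. A minimal sketch follows; the scripts bucket, Glue version, and worker sizing are assumptions to adapt to your environment:

code
# Assumption: a separate, non-PHI bucket holds the Glue scripts
aws s3 mb s3://phi-glue-scripts-${UNIQUE_ID} --region us-east-1
aws s3 cp deidentify_phi_job.py s3://phi-glue-scripts-${UNIQUE_ID}/deidentify_phi_job.py

# Create the job using the Lake Formation-governed role from Step 3
aws glue create-job \
    --name "deidentify-phi-job" \
    --role ${GLUE_JOB_ROLE_ARN} \
    --glue-version "4.0" \
    --worker-type "G.1X" \
    --number-of-workers 2 \
    --command "{\"Name\":\"glueetl\",\"ScriptLocation\":\"s3://phi-glue-scripts-${UNIQUE_ID}/deidentify_phi_job.py\",\"PythonVersion\":\"3\"}"

# Run it and check the status
aws glue start-job-run --job-name "deidentify-phi-job"
aws glue get-job-runs --job-name "deidentify-phi-job" --max-results 1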

How it works

  • Serverless ETL: AWS Glue provisions and manages the Spark cluster for us. We just provide the script and configuration.
  • Data Catalog Integration: The job reads its source schema and location from the Glue Data Catalog, which is governed by Lake Formation. This ensures the job can only read the raw_data table it has been granted access to.
  • De-identification Techniques: We use common techniques like hashing for pseudonymization and dropping columns to remove direct identifiers. The UDF shows how to apply custom masking logic. HIPAA's Safe Harbor method lists 18 identifiers that must be removed for data to be considered de-identified. Our script is a starting point for implementing this.
  • Analytics-Ready Output: Writing the data in Parquet format and partitioning it makes it highly efficient to query with services like Amazon Athena.

Security Best Practices

Building this pipeline is the first step. Maintaining compliance requires ongoing effort.

  • Comprehensive Logging: Enable AWS CloudTrail for your entire AWS account. CloudTrail logs every single API call, giving you a complete audit history of who did what, and when.
  • Monitoring and Alerting: Use Amazon CloudWatch alarms driven by CloudTrail data events. For example, alarm on any s3:GetObject call against the phi-raw-data bucket that doesn't come from your Glue service role; this can surface unauthorized access attempts in near real time (see the sketch after this list).
  • Threat Detection: Enable Amazon GuardDuty. It's a managed threat detection service that continuously monitors for malicious activity and unauthorized behavior using machine learning.
  • Infrastructure as Code (IaC): Deploy and manage this entire infrastructure using AWS CloudFormation or Terraform. This ensures your configuration is documented, version-controlled, and can be audited and redeployed consistently.
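
For the GetObject alarm described above to be possible, object-level S3 API calls have to be recorded; CloudTrail does not capture them by default. A hedged sketch of enabling S3 data events for the raw bucket, assuming a trail named phi-audit-trail already exists:

code
# Record object-level events (GetObject, PutObject, ...) for the raw PHI bucket
aws cloudtrail put-event-selectors \
    --trail-name phi-audit-trail \
    --event-selectors "[{\"ReadWriteType\":\"All\",\"IncludeManagementEvents\":true,\"DataResources\":[{\"Type\":\"AWS::S3::Object\",\"Values\":[\"arn:aws:s3:::phi-raw-data-${UNIQUE_ID}/\"]}]}]"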

Conclusion

We have successfully designed and implemented a secure, serverless, and HIPAA-compliant data lake on AWS. By leveraging managed services like API Gateway, Lambda, S3, Lake Formation, and Glue, we have built a robust pipeline that enforces security at every layer.

We solved the core challenges of handling PHI: secure ingestion, immutable storage, granular access control through Lake Formation, and safe, automated processing. This architecture provides a strong foundation for building applications that can generate powerful health insights while maintaining the highest standards of patient data privacy and security.

Your next steps could be to connect Amazon Athena to the processed data for interactive SQL querying, or to use Amazon SageMaker to train machine learning models on the de-identified dataset.
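
As a small taste of that next step, here is a hedged sketch of querying the processed data with Athena from the CLI. It assumes you've run a second crawler over the processed bucket so a processed_data table exists in phi_database, that the querying principal has a Lake Formation SELECT grant on it, and that a results bucket exists for Athena output:

code
# Assumptions: 'processed_data' table exists and an Athena results bucket is available
aws athena start-query-execution \
    --query-string "SELECT medical_condition, COUNT(DISTINCT patient_hash) AS patients FROM processed_data GROUP BY medical_condition" \
    --query-execution-context Database=phi_database \
    --result-configuration OutputLocation=s3://phi-athena-results-${UNIQUE_ID}/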
