Building an ML-Powered Notification Router on AWS: A Production Architecture Guide

Yadab Sutradhar

Software Engineer

#AWS #MachineLearning #SageMaker #MLOps #ServerlessArchitecture #XGBoost #EventDriven #InfrastructureAsCode #NotificationEngineering

Introduction

Sending notifications at the wrong time is like knocking on someone's door at 3 AM. Even if you have something important to say, timing matters. In this article, I'll walk you through building a production-grade, ML-powered notification routing engine that predicts the optimal send time for each user based on their historical engagement patterns.

What we'll build:

  • Real-time ML inference system using Amazon SageMaker
  • Event-driven architecture processing millions of events
  • Automated ML training pipeline with feedback loops
  • Infrastructure as Code using AWS CDK
  • Cost-optimized serverless design (~$350/month for 1M+ events/day)

Tech Stack

AWS Lambda (Java 21)
Amazon SageMaker
AWS Glue (PySpark)
Amazon Kinesis
DynamoDB
EventBridge Scheduler
AWS CDK (TypeScript)
XGBoost

GitHub Repository

Full source code available at: github.com/Yadab-Sd/smart-notification-routing-engine

The Problem: Notification Fatigue

The Business Challenge

Modern applications send billions of notifications daily. Email, SMS, push notifications, WhatsApp messages—the channels are endless. But here's the problem:

  • 50–70% of notifications go unread when sent at non-optimal times
  • 30% increase in user churn due to notification fatigue
  • 5–10x variance in engagement rates depending on send time

Traditional Approaches (and Why They Fail)

  • "Send at 9 AM local time": ignores individual user behavior patterns and preferences
  • "Send when user was last active": a single last-seen timestamp is too coarse a signal; it doesn't capture when a user actually engages
  • "Batch and send at fixed intervals": misses each user's optimal windows entirely

The Technical Challenge

Building a smart notification router requires solving several problems:

  1. Real-time prediction: decision latency must be <500ms
  2. Personalization: each user has unique engagement patterns
  3. Scale: handle millions of events per day
  4. Feedback loops: the model must improve with delivery outcomes
  5. Cost optimization: keep AWS costs under $500/month for 1M+ events

Solution Architecture

High-Level Design

The system consists of three main flows:

  • 📥 Event Ingestion Flow (real-time)
  • 🎯 Decision & Scheduling Flow (real-time)
  • 🤖 ML Training Pipeline (batch, daily)
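To make the Decision & Scheduling flow concrete, here is a minimal sketch of the core decision step: given per-hour engagement scores from the model, pick the best whole hour inside an allowed delivery window and return the timestamp to hand to EventBridge Scheduler. The function name and score shape are illustrative, not the production API.

```python
# Hypothetical sketch of the decision step: choose the next occurrence of the
# hour with the highest predicted engagement score within the delivery window.
from datetime import datetime, timedelta, timezone

def pick_send_time(now, hour_scores, window_hours=24):
    """Return the next top-scoring whole hour within `window_hours`."""
    best = None
    for offset in range(1, window_hours + 1):
        candidate = (now + timedelta(hours=offset)).replace(
            minute=0, second=0, microsecond=0)
        score = hour_scores.get(candidate.hour, 0.0)
        if best is None or score > best[0]:
            best = (score, candidate)
    return best[1]

now = datetime(2025, 1, 6, 14, 30, tzinfo=timezone.utc)
scores = {9: 0.82, 14: 0.40, 20: 0.91}   # model output: P(click) per hour
print(pick_send_time(now, scores))        # → 2025-01-06 20:00:00+00:00
```

In production this timestamp would become the `ScheduleExpression` of a one-off EventBridge schedule; the sketch only shows the selection logic.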

Complete Architecture Diagram

[Figure: complete system architecture]

Architecture Principles

Event-Driven & Decoupled

  • Amazon Kinesis for event streaming
  • Services communicate via events, not direct calls
  • Enables independent scaling and deployment

Serverless-First

  • Lambda for compute (auto-scaling, pay-per-use)
  • DynamoDB for state (millisecond latency)
  • No servers to manage

ML Feedback Loop

  • Delivery outcomes feed back into training
  • Model retrains daily on fresh data
  • Continuous improvement without manual intervention

Core Components Deep Dive

1. Event Ingestion: Control Plane Lambda

Purpose: Ingest user events (page views, clicks, notifications sent) and stream to Kinesis.

// services/control-plane/src/main/java/Handler.java
public class Handler implements RequestHandler<APIGatewayV2HTTPEvent,
                                               APIGatewayV2HTTPResponse> {
    private final KinesisClient kinesis;

    @Override
    public APIGatewayV2HTTPResponse handleRequest(...) {
        ObjectMapper json = new ObjectMapper();
        try {
            UserEvent userEvent = json.readValue(
                event.getBody(), UserEvent.class
            );

            // Validate and stream to Kinesis. Note: in AWS SDK v2 the
            // payload goes in data(), not body().
            PutRecordRequest putReq = PutRecordRequest.builder()
                .streamName(streamName)
                .partitionKey(userEvent.getUserId())
                .data(SdkBytes.fromUtf8String(
                    json.writeValueAsString(userEvent)
                ))
                .build();

            kinesis.putRecord(putReq);
            return buildResponse(200, "{\"status\":\"accepted\"}");
        } catch (JsonProcessingException e) {
            // readValue/writeValueAsString throw checked exceptions
            return buildResponse(400, "{\"status\":\"invalid_payload\"}");
        }
    }
}

Key Design Decisions:

  • Java 21 with SnapStart: Cold starts reduced from ~2s to ~200ms
  • Partition by userId: Ensures ordered processing per user
  • Async processing: API responds immediately
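Why partitioning by userId preserves per-user ordering: Kinesis hashes the partition key with MD5 into a 128-bit key space, and each shard owns a contiguous range of that space, so all of one user's events land on the same shard. A small sketch (assuming the default evenly-split shard ranges; `shard_for` is an illustrative helper, not a Kinesis API):

```python
# Illustrative model of Kinesis partition-key routing: MD5(partition key)
# → 128-bit integer → shard owning that slice of the hash space.
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    # Evenly split the 2^128 hash space, as evenly-created shards are.
    return h * num_shards >> 128

# Same key always maps to the same shard, so per-user order is preserved.
print(shard_for("user-42", 4))
```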

ML Pipeline: From Raw Events to Predictions

Pipeline Architecture

S3 Raw Events → Glue ETL → S3 Curated Features → SageMaker Training → SageMaker Endpoint

The ML pipeline transforms raw JSONL events into predictions through automated feature engineering, XGBoost training, and real-time inference endpoints.
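To illustrate the feature-engineering step, here is a simplified, pure-Python sketch of what the Glue job computes; the real job is PySpark, and the event shape and feature names here are illustrative rather than the exact production schema:

```python
# Simplified sketch of the Glue ETL step: turn raw JSONL events into
# per-user feature rows (e.g. a trailing 7-day click rate).
import json
from collections import defaultdict
from datetime import datetime, timedelta

def build_features(jsonl_lines, as_of):
    sends = defaultdict(int)
    clicks = defaultdict(int)
    week_ago = as_of - timedelta(days=7)
    for line in jsonl_lines:
        e = json.loads(line)
        if datetime.fromisoformat(e["ts"]) < week_ago:
            continue  # keep only the trailing 7-day window
        if e["type"] == "NOTIFICATION_SENT":
            sends[e["userId"]] += 1
        elif e["type"] == "NOTIFICATION_CLICKED":
            clicks[e["userId"]] += 1
    return {
        uid: {"sends_7d": n, "click_rate_7d": clicks[uid] / n}
        for uid, n in sends.items()
    }

raw = [
    '{"userId": "u1", "type": "NOTIFICATION_SENT", "ts": "2025-01-05T09:00:00"}',
    '{"userId": "u1", "type": "NOTIFICATION_SENT", "ts": "2025-01-06T09:00:00"}',
    '{"userId": "u1", "type": "NOTIFICATION_CLICKED", "ts": "2025-01-06T09:05:00"}',
]
feats = build_features(raw, as_of=datetime(2025, 1, 7))
print(feats)  # → {'u1': {'sends_7d': 2, 'click_rate_7d': 0.5}}
```

The PySpark version does the same aggregation with `groupBy` over a windowed DataFrame; the logic, not the engine, is the point here.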

Critical Lessons Learned: 5 Bugs You Must Avoid

📝 Content under development: this section is still being written. I'm documenting the remaining production bugs we hit at scale; check back for the complete breakdown of all five, with code examples and prevention strategies.

🐛 Bug #1: Feature Mismatch Between Training and Inference

Training used the features [sends_count_hour, click_rate_7d], but inference sent [hour, dow, days_since_last_seen].

📉 The Result: AUC score plummeted from 0.82 (validation) to 0.51 (production).

💡 Lesson: Always version your feature schemas and validate at deployment time.
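One way to enforce that lesson is to pin the training-time feature schema alongside the model artifact and fail fast when the inference payload drifts. A minimal sketch (feature names taken from the bug above; the helper itself is illustrative):

```python
# Pin the feature schema at training time and validate it at inference time.
# Order matters too: XGBoost sees a positional vector, not named columns.
TRAINING_SCHEMA = ["sends_count_hour", "click_rate_7d"]  # saved with the model

def validate_features(payload: dict, schema=TRAINING_SCHEMA):
    got = list(payload.keys())
    if got != schema:
        raise ValueError(f"feature mismatch: expected {schema}, got {got}")
    return [payload[name] for name in schema]  # ordered vector for the model

print(validate_features({"sends_count_hour": 3, "click_rate_7d": 0.25}))
try:
    validate_features({"hour": 9, "dow": 1, "days_since_last_seen": 4})
except ValueError as e:
    print(e)  # this is exactly the Bug #1 scenario, caught before scoring
```

Running the same check in the deployment pipeline (against a sample payload) would have turned a silent AUC collapse into a failed deploy.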

🐛 Bug #2: Format Mismatch (CSV vs. Parquet)

The Glue job output CSV, but the training job expected Parquet.

💡 Lesson: Use SageMaker built-in algorithms—they handle CSV natively.
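"Handles CSV natively" still comes with a contract: the built-in SageMaker XGBoost algorithm expects the label in the first column and no header row. A small writer makes that layout explicit (the row shape and helper name are illustrative):

```python
# Emit training data in the layout built-in SageMaker XGBoost expects:
# label first, features after, no header row.
import csv, io

def to_sagemaker_csv(rows, label_key, feature_keys):
    buf = io.StringIO()
    w = csv.writer(buf, lineterminator="\n")
    for r in rows:
        w.writerow([r[label_key]] + [r[k] for k in feature_keys])  # no header
    return buf.getvalue()

rows = [{"clicked": 1, "sends_count_hour": 3, "click_rate_7d": 0.25},
        {"clicked": 0, "sends_count_hour": 1, "click_rate_7d": 0.0}]
print(to_sagemaker_csv(rows, "clicked", ["sends_count_hour", "click_rate_7d"]))
# → 1,3,0.25
#   0,1,0.0
```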

🐛 Bug #3: Wrong Event Types in ETL

The ETL job filtered for PLAY_MOVIE events instead of NOTIFICATION_SENT.

💡 Lesson: Validate data pipelines with unit tests.
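A test for this particular bug is tiny: assert that the ETL filter keeps NOTIFICATION_SENT events and drops everything else. The event names mirror the bug description; the filter helper is a stand-in for the real Glue transformation:

```python
# Minimal unit test for the event-type filter behind Bug #3.
import unittest

def filter_events(events, wanted_type="NOTIFICATION_SENT"):
    return [e for e in events if e["type"] == wanted_type]

class FilterEventsTest(unittest.TestCase):
    def test_keeps_only_notification_sent(self):
        events = [{"type": "NOTIFICATION_SENT", "userId": "u1"},
                  {"type": "PLAY_MOVIE", "userId": "u1"}]
        kept = filter_events(events)
        self.assertEqual(len(kept), 1)
        self.assertEqual(kept[0]["type"], "NOTIFICATION_SENT")

if __name__ == "__main__":
    unittest.main()
```

Run against a handful of fixture events in CI, this would have failed the build the moment the wrong event type was wired in.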

Performance & Cost Analysis

Performance Metrics

  • Decision API latency (p50): 180 ms
  • SageMaker inference latency: 45 ms
  • Event throughput: 5,000 events/s

Monthly Cost

Total: ~$350 for 1M events/day

Conclusion

Building a production ML system is 10% modeling and 90% engineering. The hard parts are building reliable pipelines, ensuring feature consistency, and creating feedback loops.

  • Use event-driven architecture for scale
  • Validate that features match between training and inference
  • Prefer SageMaker built-in algorithms
  • Infrastructure as Code makes deployments reproducible
  • Always have a feedback loop
#AWS #MachineLearning #SageMaker #Serverless #MLOps #XGBoost