Building an ML-Powered Notification Router on AWS: A Production Architecture Guide
Yadab Sutradhar
Software Engineer
Introduction
Sending notifications at the wrong time is like knocking on someone's door at 3 AM. Even if you have something important to say, timing matters. In this article, I'll walk you through building a production-grade, ML-powered notification routing engine that predicts the optimal send time for each user based on their historical engagement patterns.
What we'll build:
- Real-time ML inference system using Amazon SageMaker
- Event-driven architecture processing millions of events
- Automated ML training pipeline with feedback loops
- Infrastructure as Code using AWS CDK
- Cost-optimized serverless design (~$350/month for 1M+ events/day)
Tech Stack
GitHub Repository
Full source code available at: github.com/Yadab-Sd/smart-notification-routing-engine
The Problem: Notification Fatigue
The Business Challenge
Modern applications send billions of notifications daily. Email, SMS, push notifications, WhatsApp messages—the channels are endless. But here's the problem:
- Notifications sent at non-optimal times frequently go unread
- Notification fatigue is a measurable driver of user churn
- Engagement rates vary widely depending on send time
Traditional Approaches (and Why They Fail)
- "Send at 9 AM local time": ignores individual user behavior patterns and preferences
- "Send when the user was last active": a single last-seen timestamp is a weak predictor of future engagement
- "Batch and send at fixed intervals": misses each user's optimal window entirely
The Technical Challenge
Building a smart notification router requires solving several problems:
1. Real-time prediction: decision latency must be under 500 ms
2. Personalization: each user has unique engagement patterns
3. Scale: handle millions of events per day
4. Feedback loops: the model must improve with delivery outcomes
5. Cost optimization: keep AWS costs under $500/month for 1M+ events
Solution Architecture
High-Level Design
The system consists of three main flows:
1. Event Ingestion Flow (real-time)
2. Decision & Scheduling Flow (real-time)
3. ML Training Pipeline (batch, daily)
Complete System Architecture
Architecture Principles
Event-Driven & Decoupled
- Amazon Kinesis for event streaming
- Services communicate via events, not direct calls
- Enables independent scaling and deployment
Serverless-First
- Lambda for compute (auto-scaling, pay-per-use)
- DynamoDB for state (millisecond latency)
- No servers to manage
ML Feedback Loop
- Delivery outcomes feed back into training
- Model retrains daily on fresh data
- Continuous improvement without manual intervention
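To make the feedback loop concrete, here is a plain-Java sketch (illustrative, not code from the repo) of its labeling step: each NOTIFICATION_SENT record is joined against the set of notifications the user later engaged with, turning yesterday's delivery outcomes into labeled training rows. All field names are assumptions.

```java
import java.util.*;
import java.util.stream.*;

public class OutcomeLabeler {

    // Hypothetical shapes for a delivered notification and a training row
    public record SentRecord(String notificationId, String userId, int sendHour) {}
    public record LabeledRow(String userId, int sendHour, int engaged) {}

    // openedIds: notification ids the user opened/clicked within the window
    public static List<LabeledRow> label(List<SentRecord> sends, Set<String> openedIds) {
        return sends.stream()
                .map(s -> new LabeledRow(
                        s.userId(),
                        s.sendHour(),
                        openedIds.contains(s.notificationId()) ? 1 : 0))
                .collect(Collectors.toList());
    }
}
```

In the real pipeline this join would run in the daily batch job over the event store; the point is that the label comes from observed outcomes, so no manual annotation is needed.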
Core Components Deep Dive
1. Event Ingestion: Control Plane Lambda
Purpose: Ingest user events (page views, clicks, notifications sent) and stream to Kinesis.
// services/control-plane/src/main/java/Handler.java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2HTTPEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2HTTPResponse;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class Handler implements RequestHandler<APIGatewayV2HTTPEvent,
        APIGatewayV2HTTPResponse> {

    private final KinesisClient kinesis = KinesisClient.create();
    private final ObjectMapper json = new ObjectMapper();
    // Stream name injected via environment (variable name is illustrative)
    private final String streamName = System.getenv("EVENT_STREAM_NAME");

    @Override
    public APIGatewayV2HTTPResponse handleRequest(
            APIGatewayV2HTTPEvent event, Context context) {
        try {
            UserEvent userEvent = json.readValue(event.getBody(), UserEvent.class);

            // Validate and stream to Kinesis; partitioning by userId keeps
            // each user's events ordered within a shard
            PutRecordRequest putReq = PutRecordRequest.builder()
                    .streamName(streamName)
                    .partitionKey(userEvent.getUserId())
                    .data(SdkBytes.fromUtf8String(json.writeValueAsString(userEvent)))
                    .build();
            kinesis.putRecord(putReq);

            return buildResponse(200, "{\"status\":\"accepted\"}");
        } catch (JsonProcessingException e) {
            return buildResponse(400, "{\"status\":\"invalid_payload\"}");
        }
    }
    // buildResponse(int, String) constructs the APIGatewayV2HTTPResponse (omitted)
}

Key Design Decisions:
- ✓ Java 21 with SnapStart: cold starts reduced from ~2 s to ~200 ms
- ✓ Partition by userId: ensures ordered processing per user
- ✓ Async processing: the API responds immediately; heavy work happens downstream of Kinesis
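The UserEvent the handler deserializes isn't shown in the article, so here is a hypothetical minimal shape with the validation the ingestion step needs; field names are assumptions (Jackson would additionally need a no-args constructor or @JsonCreator, omitted here for brevity):

```java
public class UserEvent {
    private final String userId;
    private final String eventType;  // e.g. PAGE_VIEW, CLICK, NOTIFICATION_SENT
    private final long timestampMs;  // event time, epoch milliseconds

    public UserEvent(String userId, String eventType, long timestampMs) {
        this.userId = userId;
        this.eventType = eventType;
        this.timestampMs = timestampMs;
    }

    public String getUserId() { return userId; }
    public String getEventType() { return eventType; }
    public long getTimestampMs() { return timestampMs; }

    // Reject events the router cannot partition (no userId) or order (no time)
    public boolean isValid() {
        return userId != null && !userId.isBlank()
                && eventType != null && !eventType.isBlank()
                && timestampMs > 0;
    }
}
```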
ML Pipeline: From Raw Events to Predictions
Pipeline Architecture
The ML pipeline transforms raw JSONL events into predictions through automated feature engineering, XGBoost training, and real-time inference endpoints.
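To make the feature-engineering step concrete, here is a plain-Java sketch (not code from the repo) of the kind of per-hour aggregate the pipeline derives from raw events before training; names like the (hour, engaged) pair layout are illustrative:

```java
import java.util.*;

public class HourlyEngagement {

    // events: each int[] is {hourOfDay 0-23, engaged 0/1}
    public static double[] engagementRateByHour(List<int[]> events) {
        double[] engaged = new double[24];
        double[] sends = new double[24];
        for (int[] e : events) {
            sends[e[0]]++;
            engaged[e[0]] += e[1];
        }
        double[] rate = new double[24];
        for (int h = 0; h < 24; h++) {
            rate[h] = sends[h] == 0 ? 0.0 : engaged[h] / sends[h];
        }
        return rate;
    }

    // Greedy fallback when no model is available: the best historical hour
    public static int bestHour(double[] rate) {
        int best = 0;
        for (int h = 1; h < 24; h++) {
            if (rate[h] > rate[best]) best = h;
        }
        return best;
    }
}
```

In production this aggregation runs as a batch job over JSONL in S3 (the article's pipeline uses Glue); the in-memory version above just shows the shape of the computation.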
Critical Lessons Learned: 5 Bugs You Must Avoid
📝 Note: this section is still in progress. The complete breakdown of all 5 production bugs, with detailed explanations, code examples, and prevention strategies, is being written; the first three are documented below.
Bug #1: Feature Mismatch Between Training and Inference
Training used features [sends_count_hour, click_rate_7d] but inference sent [hour, dow, days_since_last_seen]
📉 The Result:
AUC score plummeted from 0.82 (validation) to 0.51 (production)
💡 Lesson: Always version your feature schemas and validate at deployment time.
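A minimal sketch of the deployment-time guard this lesson calls for: fail fast if the ordered feature list the inference path sends differs from the one the model was trained on. In practice the training schema would be stored alongside the model artifact; the class and method names here are illustrative, not from the repo.

```java
import java.util.List;

public class FeatureSchemaGuard {

    // Order matters for XGBoost: compare the full ordered lists, not sets
    public static void validate(List<String> trainingSchema, List<String> inferenceSchema) {
        if (!trainingSchema.equals(inferenceSchema)) {
            throw new IllegalStateException(
                    "Feature schema mismatch: trained on " + trainingSchema
                    + " but inference sends " + inferenceSchema);
        }
    }
}
```

Run this check in the deployment pipeline (or at endpoint startup) so a mismatch like the one above aborts the rollout instead of silently degrading AUC to coin-flip territory.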
Bug #2: Format Mismatch (CSV vs Parquet)
The Glue ETL job wrote CSV output, but the training job expected Parquet
💡 Lesson: Use SageMaker built-in algorithms where possible; the built-in XGBoost container handles CSV natively.
Bug #3: Wrong Event Types in ETL
Filtered for PLAY_MOVIE instead of NOTIFICATION_SENT
💡 Lesson: Validate data pipelines with unit tests.
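A unit-testable version of the filter from Bug #3 might look like the sketch below: making the event type an explicit parameter, instead of a constant buried in the ETL script, lets a test catch a PLAY_MOVIE / NOTIFICATION_SENT mix-up before deployment. The record shape is illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;

public class EtlFilter {

    public record RawEvent(String userId, String eventType) {}

    // Keep only events of the type the downstream training job expects
    public static List<RawEvent> keepType(List<RawEvent> events, String wantedType) {
        return events.stream()
                .filter(e -> wantedType.equals(e.eventType()))
                .collect(Collectors.toList());
    }
}
```

A test then asserts on a tiny fixture that filtering for NOTIFICATION_SENT returns exactly the notification rows, which would have flagged the wrong constant immediately.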
Performance & Cost Analysis
Performance Metrics
Monthly cost: ~$350 for 1M+ events/day
Conclusion
Building a production ML system is 10% modeling and 90% engineering. The hard parts are building reliable pipelines, ensuring feature consistency, and creating feedback loops.