Solving the fundamental challenge of balancing system reliability checks with caller experience.
If you’re building customer experiences at scale, gracefully handling backend system failures is critical to maintaining service quality. While working with a client whose contact center handles over 200,000 calls daily across 11 self-service applications connecting to multiple backend systems, we needed to perform health checks on these systems before allowing callers to proceed—if a system was down, callers would be routed directly to agents rather than struggling through broken automated flows. The problem? Health check API response times varied wildly, from 200ms to 8 seconds. At this call volume, real-time health checks would add hours of processing time daily and could overwhelm already-stressed backend systems during peak periods.
That’s where background health monitoring with intelligent caching comes in. By decoupling health checks from contact flow execution, we transformed unpredictable 8-second delays into consistent sub-100ms responses. The solution uses a single AWS Lambda function with dual operational modes—one for background monitoring and another for real-time queries. The result? Callers get near-instantaneous health verification while backend systems experience reduced load, creating a win-win for both customer experience and system stability.
The Technical Solution
We implemented a background health monitoring system using a single AWS Lambda function with two transaction types:
Background Monitoring (PING_ASYNC): EventBridge triggers the Lambda every 60 seconds to ping backend systems and store results in Parameter Store with timestamps. This creates a consistent monitoring rate regardless of call volume, preventing the health check load from contributing to system stress during traffic spikes.
Real-Time Queries (PING): When contact flows need health status, they invoke the same Lambda with a different transaction type. The function uses a multi-tier caching strategy: first checking an in-memory cache for results under 75 seconds old (< 1ms response), then refreshing from Parameter Store when the cached entry is stale (< 100ms), or loading fresh data on a cache miss (< 100ms).
The intelligent caching ensures data freshness while eliminating the variable latency that plagued the original approach. Lambda execution context reuse maximizes cache hit rates, and Parameter Store provides persistence across container recycling.
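The dual-mode handler can be sketched as follows. This is a simplified illustration, not the production code: the function and field names are hypothetical, and the Parameter Store calls are stubbed out (a real deployment would use boto3 `ssm.get_parameter`/`ssm.put_parameter`). The module-level cache dictionary survives across invocations because Lambda reuses the execution context.

```python
import time

# Hypothetical constant matching the freshness window described above.
CACHE_TTL_SECONDS = 75

# Module-level cache: persists across invocations via execution context reuse.
_cache = {}

def ping_backend(system):
    # Placeholder for the real backend health ping; assumes healthy here.
    return {"system": system, "healthy": True, "checked_at": time.time()}

def load_from_parameter_store(system):
    # Placeholder for ssm.get_parameter(...); simplified to a fresh ping.
    return ping_backend(system)

def save_to_parameter_store(system, status):
    # Placeholder for ssm.put_parameter(...) with a timestamped payload.
    pass

def handler(event, context=None):
    txn = event.get("transactionType")
    system = event.get("system", "backend-1")

    if txn == "PING_ASYNC":
        # Background mode: EventBridge fires this every 60 seconds.
        status = ping_backend(system)
        save_to_parameter_store(system, status)
        return status

    if txn == "PING":
        # Real-time mode: serve from the warm in-memory cache when fresh...
        cached = _cache.get(system)
        if cached and time.time() - cached["checked_at"] < CACHE_TTL_SECONDS:
            return {**cached, "source": "memory"}
        # ...otherwise fall back to Parameter Store (stale cache or cold start).
        status = load_from_parameter_store(system)
        _cache[system] = status
        return {**status, "source": "parameter-store"}

    raise ValueError(f"unknown transactionType: {txn}")
```

On a warm container, the first PING for a system populates the cache and every subsequent PING within the TTL returns from memory without touching Parameter Store, which is what drives the sub-millisecond hit path.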

Results and Benefits
The transformation exceeded expectations. Response times dropped from 200ms-8s to consistently under 100ms, with cache hits returning status immediately from Lambda execution context. The solution eliminated the cascade failure risk where health check load could push stressed systems over the edge during peak periods.
While Lambda execution costs decreased, the real savings came from reduced IVR time. The solution eliminated caller wait time during health checks and prevented callers from going through entire self-service flows only to discover the backend system couldn’t complete their request—forcing them into agent queues anyway. Backend systems now experience predictable, constant health check frequency rather than load proportional to call volume.
Operationally, the solution provides excellent visibility through CloudWatch metrics for both background health checks and cache hit rates, enabling proactive monitoring and alerting.
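One low-overhead way to publish those metrics from the Lambda (a sketch, not necessarily how the client's system does it) is CloudWatch Embedded Metric Format, which turns a structured log line into a metric without a `PutMetricData` API call. The namespace and dimension names below are illustrative:

```python
import json
import time

def emit_cache_metric(hit, namespace="HealthMonitor"):
    """Log a cache hit/miss in CloudWatch Embedded Metric Format."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["TransactionType"]],
                "Metrics": [{"Name": "CacheHit", "Unit": "Count"}],
            }],
        },
        # Dimension values must appear as top-level keys in the record.
        "TransactionType": "PING",
        "CacheHit": 1 if hit else 0,
    }
    # In Lambda, stdout goes to CloudWatch Logs, where EMF is extracted.
    print(json.dumps(record))
    return record
```

Averaging the `CacheHit` metric over time gives the cache hit rate directly, which can feed the alerting described above.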
Summary
Background health monitoring with intelligent caching solved the fundamental challenge of balancing system reliability checks with caller experience. The architecture scales effortlessly with call volume while maintaining accurate health status reporting, transforming an operational pain point into a competitive advantage.
Next Steps: Start by auditing your current health check strategy—identify where real-time checks create bottlenecks or contribute to system load. Implement background monitoring for your most critical integrations, using Parameter Store or DynamoDB for state persistence and Lambda execution context for performance optimization. Monitor cache hit rates and adjust timing thresholds based on your specific reliability requirements and acceptable data freshness windows.
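When tuning those freshness windows, it can help to classify cached results into explicit tiers rather than a single TTL. The thresholds below are hypothetical starting points, not values from the production system; adjust them to your own reliability requirements:

```python
import time

# Hypothetical thresholds; tune to your acceptable data freshness windows.
FRESH_SECONDS = 75    # serve directly from the in-memory cache
STALE_SECONDS = 180   # refresh from the store, but still trust the last result
                      # beyond STALE_SECONDS, treat the status as unknown

def classify_age(checked_at, now=None):
    """Return 'fresh', 'stale', or 'expired' for a cached health result."""
    age = (now or time.time()) - checked_at
    if age < FRESH_SECONDS:
        return "fresh"
    if age < STALE_SECONDS:
        return "stale"
    return "expired"
```

An "expired" result is the signal to decide deliberately between failing open (let the caller proceed) and failing safe (route to an agent), rather than silently trusting old data.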