Cluster reviews

A weekly health review from an SRE who never sleeps.

By default, every Monday at 09:00 in your timezone, Radar runs a full review of every cluster in your service map: ECS task health, RDS saturation, EC2 drift, Atlas growth. It's the kind of audit a senior SRE would do, but actually done.

What you'll actually see

cluster review · acme robotics · week of may 11
7 findings · 2 actionable

high · acme-api-pg approaching IOPS limit
Sustained 87% IOPS utilization for 6 days. Burst balance trending toward zero. Recommend bump to db.r6g.2xlarge or move read traffic to a replica.

high · acme-api-pg connection saturation
max_connections at 92% during peak. PgBouncer not deployed. Recommend pgbouncer or raising the cap before the next release.

med · acme-api/web ECS desired count too low
p99 task CPU > 80% for 12 of last 14 weekday peaks. Suggested desired: 8 (currently 6).

med · EC2 acme-redis-2 nearing memory pressure
swap activity climbing daily. r6i.xlarge would clear it.

low · acme-worker/scheduler under-utilized
p95 CPU < 5% for 30 days. Safe to drop to t3.micro and save ~$38/mo.

info · Atlas acme-prod-mongo on M30, fine for 90d
Storage growing 4%/wk. No action needed yet.

info · No drift detected on acme-api log group retention
All groups still on 30-day retention as configured.

What changes for the on-call engineer

Catch slow-burn issues

The boring stuff — saturation, drift, retention — that never trips an alarm but eventually causes the 3 a.m. page.

Triaged severities

Radar tags each finding high / med / low / info, so you know what to fix this sprint vs next quarter.
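
To make the triage idea concrete, here is a minimal Python sketch of sorting findings by severity and pulling out an actionable subset. The `Finding` class and both helper functions are hypothetical names, not Radar's API. Treating "actionable" as "high severity" is an assumption. It happens to match the sample review above (two high findings, two actionable), but Radar may decide this differently.

```python
from dataclasses import dataclass

# Lower rank sorts first. The four levels come from the sample review.
SEVERITY_ORDER = {"high": 0, "med": 1, "low": 2, "info": 3}


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one review finding."""
    severity: str
    resource: str
    title: str


def triage(findings):
    """Return findings ordered most severe first (stable within a level)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])


def actionable(findings, levels=("high",)):
    """Assumption: a finding is actionable when its severity is in `levels`."""
    return [f for f in findings if f.severity in levels]


findings = [
    Finding("info", "acme-prod-mongo", "on M30, fine for 90d"),
    Finding("high", "acme-api-pg", "approaching IOPS limit"),
    Finding("med", "acme-api/web", "ECS desired count too low"),
    Finding("high", "acme-api-pg", "connection saturation"),
    Finding("low", "acme-worker/scheduler", "under-utilized"),
]

ordered = triage(findings)
for f in ordered:
    print(f"{f.severity} · {f.resource} {f.title}")
```

Keeping the severity ranking in one table means a team that also wants `med` findings this sprint can pass `levels=("high", "med")` without touching the sort.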

Posted to Slack

Reviews land in your team channel with the actionable subset pinned. No portal to remember to check.

How it works

step · 01

Schedule

Pick a cadence per service: weekly, biweekly, or monthly. Tier-1 services default to weekly.
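
As a rough illustration of how a cadence could turn into run times, the Python sketch below finds the next Monday 09:00 slot and then steps forward by the chosen cadence. The function names are hypothetical, and treating "monthly" as every four weeks (so runs stay on a Monday) is an assumption, not documented Radar behavior.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "monthly" is approximated as every 4 weeks so runs stay on Monday.
CADENCE_WEEKS = {"weekly": 1, "biweekly": 2, "monthly": 4}


def first_review(now):
    """Next Monday 09:00 strictly after `now`, in `now`'s own timezone."""
    days_ahead = (7 - now.weekday()) % 7  # Monday is weekday 0
    slot = datetime.combine(
        now.date() + timedelta(days=days_ahead), time(9, 0), tzinfo=now.tzinfo
    )
    if slot <= now:  # already past 09:00 on a Monday
        slot += timedelta(days=7)
    return slot


def next_review(previous, cadence):
    """Step from one review slot to the next for the given cadence."""
    return previous + timedelta(weeks=CADENCE_WEEKS[cadence])


now = datetime(2026, 5, 6, 12, 0, tzinfo=timezone.utc)
first = first_review(now)
print(first.isoformat())
print(next_review(first, "biweekly").isoformat())
```

Pass a datetime whose `tzinfo` is your team's zone (for example a `zoneinfo.ZoneInfo`) to get 09:00 local time. Python adds a `timedelta` to an aware datetime in wall-clock terms, so the slot stays at 09:00 across daylight-saving changes.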

step · 02

Collect

Radar pulls ECS, RDS, EC2, Atlas, and CloudWatch metrics over the period and runs them through the review prompt.
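
The review prompt itself isn't described here, but the sample findings suggest the kind of summary that could be computed from raw metrics first. The Python sketch below counts how many daily peaks in a window exceed a threshold, in the spirit of "p99 task CPU > 80% for 12 of last 14 weekday peaks". The function names and the data are invented for illustration, and fetching the metrics is out of scope.

```python
def days_over_threshold(daily_peaks, threshold):
    """Count the days whose peak value is strictly above `threshold`."""
    return sum(1 for value in daily_peaks if value > threshold)


def sustained(daily_peaks, threshold, min_days):
    """True when at least `min_days` in the window exceed `threshold`."""
    return days_over_threshold(daily_peaks, threshold) >= min_days


# Made-up p99 CPU percentages for 14 weekday peaks (not real Radar data).
peaks = [85, 82, 79, 91, 88, 83, 81, 77, 86, 90, 84, 87, 82, 89]

hot_days = days_over_threshold(peaks, 80)
print(f"p99 CPU > 80% for {hot_days} of last {len(peaks)} weekday peaks")
```

Counting days rather than averaging keeps a single quiet day from hiding a pattern that holds on most peaks.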

step · 03

Memory

Past reviews are remembered. Recurring findings get flagged as such — no Groundhog Day.
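
One simple way to picture recurring-finding detection is to fingerprint each finding and count how many earlier reviews contained the same fingerprint. The Python sketch below is hypothetical. The `resource` and `kind` fields, and the choice to match on that pair, are assumptions, not how Radar stores or compares reviews.

```python
def fingerprint(finding):
    """Assumption: a finding is 'the same' if resource and kind both match."""
    return (finding["resource"], finding["kind"])


def flag_recurring(current, past_reviews):
    """Annotate each current finding with how many past reviews contained it."""
    seen = {}
    for review in past_reviews:
        # Use a set so a finding counts at most once per past review.
        for fp in {fingerprint(f) for f in review}:
            seen[fp] = seen.get(fp, 0) + 1
    return [{**f, "seen_before": seen.get(fingerprint(f), 0)} for f in current]


past_reviews = [
    [{"resource": "acme-api-pg", "kind": "iops-limit"}],
    [
        {"resource": "acme-api-pg", "kind": "iops-limit"},
        {"resource": "acme-redis-2", "kind": "memory-pressure"},
    ],
]
current = [
    {"resource": "acme-api-pg", "kind": "iops-limit"},
    {"resource": "acme-api/web", "kind": "desired-count-low"},
]

flagged = flag_recurring(current, past_reviews)
for f in flagged:
    label = f"recurring, seen {f['seen_before']}x" if f["seen_before"] else "new"
    print(f"{f['resource']} {f['kind']}: {label}")
```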

Pairs well with

Service map

Reviews run per mapped service. The service map defines the unit of review.

Error watch

Reviews are the slow loop. Error watch is the fast loop.

Resolve incidents in 30 seconds, not 30 minutes.

Connect your AWS account in read-only mode and let Radar take the next page.