Cluster reviews

A weekly health review from an SRE who never sleeps.

Every Monday at 09:00 in your timezone, Radar runs a full review of every cluster in your service map. ECS task health, RDS saturation, EC2 drift, Atlas growth — the kind of audit a senior SRE would do, but actually done.

What you'll actually see
cluster review · acme robotics · week of may 11 · 7 findings · 2 actionable
  • high · acme-api-pg approaching IOPS limit

    Sustained 87% IOPS utilization for 6 days. Burst balance trending toward zero. Recommend bumping to db.r6g.2xlarge or moving read traffic to a replica.

  • high · acme-api-pg connection saturation

    max_connections at 92% during peak. PgBouncer not deployed. Recommend deploying PgBouncer or raising the cap before the next release.

  • med · acme-api/web ECS desired count too low

    p99 task CPU > 80% for 12 of last 14 weekday peaks. Suggested desired: 8 (currently 6).

  • med · EC2 acme-redis-2 nearing memory pressure

    Swap activity climbing daily. r6i.xlarge would clear it.

  • low · acme-worker/scheduler under-utilized

    p95 CPU < 5% for 30 days. Safe to drop to t3.micro and save ~$38/mo.

  • info · Atlas acme-prod-mongo on M30, fine for 90d

    Storage growing 4%/wk. No action needed yet.

  • info · No drift detected on acme-api log group retention

    All groups still on 30-day retention as configured.

What changes for the on-call engineer

Catch slow-burn issues

The boring stuff — saturation, drift, retention — that never trips an alarm but eventually causes the 3 a.m. page.

Triaged severities

Radar tags each finding high / med / low / info, so you know what to fix this sprint vs next quarter.
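
To make the triage idea concrete, here is a minimal Python sketch. It is not Radar's actual implementation. The Finding class and the triage function are hypothetical names, and treating only high findings as actionable is an assumption based on the sample review, where the two actionable findings match the two high ones.

```python
from dataclasses import dataclass

# Severity order as listed in the text: high / med / low / info.
SEVERITY_ORDER = {"high": 0, "med": 1, "low": 2, "info": 3}


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one review finding."""
    severity: str  # one of the keys in SEVERITY_ORDER
    resource: str
    summary: str


def triage(findings):
    """Return (actionable, rest), each sorted most severe first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    # Assumption: only "high" findings count as actionable.
    actionable = [f for f in ordered if f.severity == "high"]
    rest = [f for f in ordered if f.severity != "high"]
    return actionable, rest


findings = [
    Finding("low", "acme-worker/scheduler", "under-utilized"),
    Finding("high", "acme-api-pg", "approaching IOPS limit"),
    Finding("info", "acme-prod-mongo", "fine for 90d"),
    Finding("high", "acme-api-pg", "connection saturation"),
    Finding("med", "acme-api/web", "ECS desired count too low"),
]

actionable, rest = triage(findings)
for f in actionable:
    print(f"{f.severity} · {f.resource} {f.summary}")
```

The actionable list is what would be pinned; the rest can wait for a later sprint or quarter.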

Posted to Slack

Reviews land in your team channel with the actionable subset pinned. No portal to remember to check.

How it works

step · 01

Schedule

Pick a cadence per service: weekly, biweekly, or monthly. Tier-1 services default to weekly.
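
As a rough illustration of the scheduling step, the Python sketch below resolves a service's cadence and computes the next Monday 09:00 run. The function names are hypothetical, and the text does not say what non-tier-1 services default to, so the sketch returns None in that case rather than guess. Pass a datetime in your own timezone (for example one built with zoneinfo.ZoneInfo) to get 09:00 local time; UTC is used here only to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

CADENCES = ("weekly", "biweekly", "monthly")  # the options named in the text


def resolve_cadence(configured, tier):
    """An explicit per-service setting wins; tier-1 falls back to weekly."""
    if configured is not None:
        if configured not in CADENCES:
            raise ValueError(f"unknown cadence: {configured}")
        return configured
    # The default for other tiers is not stated, so none is assumed here.
    return "weekly" if tier == 1 else None


def next_monday_9am(now):
    """Next Monday 09:00 strictly after `now`, in `now`'s own timezone."""
    days_ahead = (0 - now.weekday()) % 7  # Monday is weekday 0
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=9, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


now = datetime(2024, 5, 15, 12, 0, tzinfo=timezone.utc)  # a Wednesday
print(resolve_cadence(None, tier=1))   # weekly
print(next_monday_9am(now).isoformat())  # 2024-05-20T09:00:00+00:00
```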

step · 02

Collect

Radar pulls ECS, RDS, EC2, Atlas, and CloudWatch metrics over the period and runs them through the review prompt.
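
How the metrics are condensed before they reach the review prompt is not described here. One plausible approach, sketched in Python under that assumption, is to reduce each daily series to a few numbers such as the latest value, the peak, and how many consecutive days it has stayed at or above a threshold. The summarize function, its field names, and the 85% threshold are all hypothetical.

```python
def summarize(series, threshold):
    """Condense daily utilization percentages (oldest first) into a summary."""
    if not series:
        raise ValueError("series must contain at least one datapoint")
    # Count how many of the most recent days sit at or above the threshold.
    streak = 0
    for value in reversed(series):
        if value < threshold:
            break
        streak += 1
    return {
        "latest": series[-1],
        "peak": max(series),
        "days_at_or_over_threshold": streak,
    }


# Made-up daily IOPS utilization for one database, oldest first.
iops_pct = [62, 70, 85, 86, 88, 87, 86, 87]
summary = summarize(iops_pct, threshold=85)
print(summary)
```

A summary like this is small enough to include for every resource, and it carries the "sustained for N days" signal that a single datapoint would miss.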

step · 03

Memory

Past reviews are remembered. Recurring findings get flagged as such — no Groundhog Day.
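
A minimal sketch of how recurring findings could be detected, assuming each finding can be reduced to a stable (resource, issue) key. That keying scheme, the issue labels, and the mark_recurring function are illustrative assumptions, not Radar's documented behavior.

```python
from collections import Counter


def mark_recurring(current, past_reviews):
    """Annotate each current finding with how many past reviews contained it.

    `current` is a list of (resource, issue) keys; `past_reviews` is a list
    of earlier reviews, each itself a list of such keys.
    """
    # set() ensures a key is counted at most once per past review.
    seen = Counter(key for review in past_reviews for key in set(review))
    return [
        {
            "resource": resource,
            "issue": issue,
            "times_seen_before": seen[(resource, issue)],
        }
        for resource, issue in current
    ]


past_reviews = [
    [("acme-api-pg", "iops-limit")],
    [("acme-api-pg", "iops-limit"), ("acme-redis-2", "memory-pressure")],
]
current = [
    ("acme-api-pg", "iops-limit"),
    ("acme-api/web", "desired-count-low"),
]

flagged = mark_recurring(current, past_reviews)
for item in flagged:
    label = "recurring" if item["times_seen_before"] else "new"
    print(f'{item["resource"]} {item["issue"]}: {label}')
```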

Resolve incidents in 30 seconds, not 30 minutes.

Connect your AWS account in read-only mode and let Radar take the next page.