Self-healing message failure management

Messages fail. That part is inevitable.

When messages fail in your event-driven systems, Jetsam Recover salvages them — collecting from dead-letter queues, diagnosing root causes with AI, and intelligently retrying when the underlying problem resolves. Without waking anyone up.

Single container · Go + SQLite · Deploy in minutes

The Problem

You know the ritual.

It's 3:14 AM. Your phone screams. A message queue is backing up. Thousands of events are piling into a dead-letter queue, and the threshold alert just fired.

You open your laptop. Begin the ritual. SSH into the box. Pull a sample of failed messages. Squint at stack traces. Google the error. Realize it's a downstream service that deployed a bad schema twenty minutes ago. The messages themselves are fine.

You write an ad-hoc replay script. Run it. Some messages fail again because the downstream hasn't actually rolled back yet. You wait. You retry. You wait more. By 5:30 AM, you've manually nursed 2,847 messages back through the pipeline. You file a ticket to "improve DLQ handling." You go back to sleep for forty-five minutes before your alarm goes off.

Next Tuesday, it happens again.

73% Are Transient

Nearly three-quarters of message failures resolve themselves when the underlying condition clears — a service restarts, a rate limit resets, a deployment rolls back. You're being paged for problems that fix themselves. You just need someone to retry at the right moment.

47 Minutes Average MTTR

That's how long it takes a human to SSH in, sample the DLQ, diagnose the root cause, write a replay script, test it, wait for conditions to clear, and replay. For a problem that an automated system could resolve in seconds, once it knows what to look for.

Observability Without Recovery

You're not short on observability. You can see the failures. Dashboards light up, alerts fire, Slack channels scroll. What you're short on is automated recovery. The gap isn't detection. It's the part that comes after.

“Every DLQ is an admission that your system gave up.”

Jetsam picks up where your system left off.

How It Works

Three steps. Zero intervention.

Jetsam Recover watches your dead-letter queues, diagnoses why messages failed, and retries them when conditions clear. Automatically. Intelligently. While you sleep.

I. Collect

Jetsam connects to your dead-letter queues across NATS, RabbitMQ, Kafka, SQS, and Azure Service Bus. Every failed message is captured with full context — headers, metadata, timestamps, and the error that caused the failure. Nothing is lost, nothing expires silently.
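
As a rough illustration of what "full context" could look like, here is a minimal Go sketch of a captured message. The DeadLetter type, its field names, and the Capture helper are assumptions made for this example, not Jetsam's actual schema or API.

```go
package main

import (
	"fmt"
	"time"
)

// DeadLetter is a hypothetical envelope for one captured message.
// The field names are illustrative, not Jetsam's actual schema.
type DeadLetter struct {
	Broker     string            // e.g. "kafka", "sqs"
	Queue      string            // the dead-letter queue the message was read from
	Headers    map[string]string // original headers and metadata
	Body       []byte            // original payload, untouched
	Error      string            // the error that caused the failure
	CapturedAt time.Time         // when the message was collected
}

// Capture copies the headers and body so that later changes made by the
// caller cannot alter the stored record.
func Capture(broker, queue string, headers map[string]string, body []byte, reason string) DeadLetter {
	h := make(map[string]string, len(headers))
	for k, v := range headers {
		h[k] = v
	}
	return DeadLetter{
		Broker:     broker,
		Queue:      queue,
		Headers:    h,
		Body:       append([]byte(nil), body...),
		Error:      reason,
		CapturedAt: time.Now().UTC(),
	}
}

func main() {
	dl := Capture("kafka", "orders.dlq",
		map[string]string{"trace-id": "abc123"},
		[]byte(`{"order_id": 42}`),
		"schema validation failed: missing field customer_id")
	fmt.Printf("%s/%s failed with %q at %s\n",
		dl.Broker, dl.Queue, dl.Error, dl.CapturedAt.Format(time.RFC3339))
}
```

Copying the headers and payload at capture time keeps the record stable even if the consumer that handed it over keeps mutating its buffers.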

II. Classify

Corpus-based classification identifies failure patterns through semantic understanding, not brittle regex. AI agents diagnose root causes, cluster related failures, and categorize each as a dependency failure, consumer bug, schema issue, or transient error. The system ships with hundreds of known failure patterns and learns yours.

III. Recover

When the underlying problem resolves — the service recovers, the schema is fixed, the rate limit resets — Jetsam replays failed messages with graduated retry: test one, then 10, then 50, then 100, then the rest. No retry storms. Business-critical messages go first. You wake up to a summary, not a crisis.

Without Jetsam

  • PagerDuty fires at 3 AM for a DLQ threshold breach
  • Manually inspect messages, write ad-hoc replay scripts
  • Replay in wrong order, cause downstream cascades
  • 47-minute average MTTR. Every. Single. Time.

With Jetsam

  • Failed messages intercepted and classified automatically
  • Root cause diagnosed; retry queued for when conditions clear
  • Graduated retry: 1 → 10 → 50 → 100 → rest. No storms.
  • You wake up to a summary: 2,847 messages recovered, zero lost

Features

What makes it different.

Not another monitoring dashboard. Not another alerting rule. Jetsam Recover closes the gap between detecting failures and actually fixing them.

Corpus-Based Classification

Errors are classified by semantic similarity to a learned corpus, not brittle regex patterns. New failure modes are recognized automatically. A new error format, a different service, an unexpected payload — the system adapts without new rules.
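
One common way to classify by semantic similarity is nearest-neighbor search over embedding vectors. The Go sketch below assumes each error has already been converted to a vector by some embedding model (not shown) and picks the closest known pattern by cosine similarity. Pattern, Classify, and the minScore threshold are hypothetical names for this example; the source does not say which technique Jetsam uses internally.

```go
package main

import (
	"fmt"
	"math"
)

// Pattern is a hypothetical corpus entry: a known failure category plus
// the embedding of a representative error message.
type Pattern struct {
	Category  string
	Embedding []float64
}

// cosine returns the cosine similarity of a and b, or 0 when the vectors
// differ in length or either one has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Classify returns the category of the most similar pattern and its score.
// If no pattern scores at least minScore, the category is "unknown".
func Classify(vec []float64, corpus []Pattern, minScore float64) (string, float64) {
	best, bestScore := "unknown", 0.0
	for _, p := range corpus {
		if s := cosine(vec, p.Embedding); s > bestScore {
			best, bestScore = p.Category, s
		}
	}
	if bestScore < minScore {
		return "unknown", bestScore
	}
	return best, bestScore
}

func main() {
	// Toy three-dimensional vectors; real embeddings have far more dimensions.
	corpus := []Pattern{
		{Category: "dependency", Embedding: []float64{1, 0, 0}},
		{Category: "schema", Embedding: []float64{0, 1, 0}},
		{Category: "transient", Embedding: []float64{0, 0, 1}},
	}
	category, score := Classify([]float64{0.9, 0.1, 0}, corpus, 0.8)
	fmt.Printf("%s (similarity %.2f)\n", category, score)
}
```

The threshold is the design choice that matters here: an error that resembles nothing in the corpus is reported as unknown rather than forced into the nearest category.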

Agentic AIOps

AI agents that investigate failures the way a senior SRE would. They correlate errors across queues, check downstream health, identify root causes, and determine the right moment to retry — not on a timer, but when conditions actually improve.

Business-Aware Priority

A failed $50,000 wire transfer and a failed analytics event aren't equal. Jetsam understands business context and recovers high-value messages first. Payment events before marketing emails. Order confirmations before internal telemetry.
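
A minimal Go sketch of priority-ordered recovery, assuming a simple operator-supplied mapping from message kind to business tier. The Priority tiers, the priorityByKind table, and OrderForRecovery are hypothetical; how Jetsam actually derives business context is not described here.

```go
package main

import (
	"fmt"
	"sort"
)

// Priority is a hypothetical business tier; lower values recover first.
type Priority int

const (
	Critical Priority = iota // e.g. payments, wire transfers
	High                     // e.g. order confirmations
	Low                      // e.g. marketing emails, internal telemetry
)

// FailedMessage is a hypothetical record of one message awaiting replay.
type FailedMessage struct {
	ID   string
	Kind string
}

// priorityByKind is an assumed, operator-supplied mapping.
var priorityByKind = map[string]Priority{
	"payment":            Critical,
	"order_confirmation": High,
	"marketing_email":    Low,
	"telemetry":          Low,
}

// priorityOf falls back to Low for kinds that are not in the mapping.
func priorityOf(m FailedMessage) Priority {
	if p, ok := priorityByKind[m.Kind]; ok {
		return p
	}
	return Low
}

// OrderForRecovery returns a sorted copy of msgs, highest priority first.
// The sort is stable, so arrival order is preserved within each tier.
func OrderForRecovery(msgs []FailedMessage) []FailedMessage {
	out := append([]FailedMessage(nil), msgs...)
	sort.SliceStable(out, func(i, j int) bool {
		return priorityOf(out[i]) < priorityOf(out[j])
	})
	return out
}

func main() {
	queue := []FailedMessage{
		{ID: "t1", Kind: "telemetry"},
		{ID: "p1", Kind: "payment"},
		{ID: "m1", Kind: "marketing_email"},
		{ID: "o1", Kind: "order_confirmation"},
	}
	for _, m := range OrderForRecovery(queue) {
		fmt.Println(m.ID, m.Kind)
	}
}
```

A stable sort matters for brokers where ordering within a stream is meaningful: two payments that failed in sequence are still replayed in that sequence.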

Single Container

One container. One Go binary. Embedded SQLite — no external databases to provision. Launches in seconds, runs anywhere containers run. No Kubernetes operator, no Helm chart sprawl. Just docker run and you're operational.

Graduated Retry

No retry storms. Jetsam tests one message first. If it succeeds, it ramps: 10, then 50, then 100, then the rest. If the first message fails, it waits and tries again later. Your recovering downstream service isn't immediately slammed with 10,000 replayed messages.
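
The ramp described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not Jetsam's implementation: Message, ReplayFunc, ErrDownstreamUnhealthy, and GraduatedReplay are hypothetical names, and the waiting between attempts is left to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical stand-in for a captured dead-letter message.
type Message struct {
	ID string
}

// ReplayFunc is a hypothetical hook that re-publishes one message downstream.
type ReplayFunc func(Message) error

// ErrDownstreamUnhealthy is an example error a ReplayFunc might return.
var ErrDownstreamUnhealthy = errors.New("downstream still unhealthy")

// batchSizes mirrors the ramp in the text: 1, 10, 50, 100, then the rest.
var batchSizes = []int{1, 10, 50, 100}

// GraduatedReplay replays msgs in growing batches and stops at the first
// failure. It returns how many messages were replayed successfully, so the
// caller can wait and resume from that offset later.
func GraduatedReplay(msgs []Message, replay ReplayFunc) (int, error) {
	done := 0
	for step := 0; done < len(msgs); step++ {
		size := len(msgs) - done // final step: everything that is left
		if step < len(batchSizes) && batchSizes[step] < size {
			size = batchSizes[step]
		}
		for _, m := range msgs[done : done+size] {
			if err := replay(m); err != nil {
				return done, fmt.Errorf("replay halted at message %q: %w", m.ID, err)
			}
			done++
		}
	}
	return done, nil
}

func main() {
	msgs := make([]Message, 300)
	for i := range msgs {
		msgs[i] = Message{ID: fmt.Sprintf("msg-%d", i)}
	}
	// Simulate a downstream that rejects one message early in the ramp.
	downstream := func(m Message) error {
		if m.ID == "msg-11" {
			return ErrDownstreamUnhealthy
		}
		return nil
	}
	n, err := GraduatedReplay(msgs, downstream)
	fmt.Println("replayed:", n, "error:", err)
}
```

Because the function reports how far it got, a failed probe costs the downstream service at most one small batch, and a later attempt can start from msgs[n:] instead of replaying everything again.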

Pattern Registry

Ships with a curated registry of hundreds of known failure patterns across dependency failures, consumer bugs, data/schema issues, and transient errors. Jetsam starts smart on day one and gets smarter as it learns the specific ways your systems break.

Integrations

Every broker. One tool.

Jetsam connects to NATS, RabbitMQ, Kafka, SQS, and Azure Service Bus out of the box. One deployment covers your entire event-driven architecture.

  • NATS JetStream
  • RabbitMQ
  • Apache Kafka
  • AWS SQS
  • Azure Service Bus

Single container · Written in Go · Embedded SQLite

Early Access

Stop being the retry script.

We built Jetsam Recover because we were tired of being the person who gets paged at 3 AM to do something a machine should handle. If that sounds familiar, join the waitlist.

Early Access — Waitlist Registration

No spam. We'll only reach out when there's something worth sharing.

Turn 3 AM Pages into 9 AM Summaries