Self-Diagnostic

Signs your integration is one bad day from an incident.

None of these show up in a demo. All of them show up the first time real production traffic hits an edge case.

Run This Check
  • A nightly reconciliation or sync job regularly fails partway through and someone manually patches the gap
  • A webhook or API call has silently failed before and nobody noticed until a customer or auditor did
  • Retries either don't exist, or blindly resend without checking whether the original attempt already succeeded
  • A single slow or down upstream dependency can take your whole system down with it
  • Nobody can tell you, with confidence, whether a given message was processed exactly once
  • Integration failures are diagnosed by reading raw logs after the fact, not surfaced by monitoring
Our Approach

Plan for redelivery and partial failure — they're normal operation, not edge cases.

Most integration failures aren't bugs in the happy path — they're missing discipline in the failure path. Azure Service Bus, Event Hub, and most third-party APIs are at-least-once by design, which means redelivery is normal, not exceptional. An integration that doesn't plan for it will eventually double-process a payment, double-file a record, or silently drop one.

We build idempotent handlers keyed on the entity and a revision or sequence number — not the message id — so a genuine retry is recognised as already-handled. Failures that can't be resolved automatically go to a dead-letter queue with the original payload and a structured reason, not a stack trace nobody reads until the customer calls. The concrete pattern, including the idempotency check and dead-letter handler, is published in our guide to replacing batch jobs with event-driven .NET.

Where the upstream dependency itself is the risk — slow, flaky, or rate-limited — we isolate it behind a circuit breaker and a bounded retry policy, so one bad dependency degrades gracefully instead of taking the rest of the system down with it.

Related Engagement

Where this discipline already runs in production.

We don't have a dedicated "integration rescue" case study published yet — but the same idempotency and failure-handling discipline runs underneath the third-party filing integration in our 1099 reporting pipeline, where every filing goes out over an HTTPS API to a partner whose systems we don't control.

An Integration That Keeps
Breaking in Production?

A two-week Discovery Sprint maps the failure modes and delivers a remediation plan — see all three engagement models.