Healthcare

Active/active DR for a Chicago-area academic medical center

Engagement closed 2026-04-29

ZERO

Data loss through complete site failure (RPO)

5-10 min

Full primary site failover (RTO)

0.25-1ms

DCI wave failover, 400G to 100G coherent backup

Epic does not gracefully degrade. PACS image retrieval cannot stall mid-procedure. Infusion pump telemetry does not accept “the network is being failed over right now.” That set the bar.

A Chicago-area academic medical center asked us to build active/active disaster recovery for the clinical infrastructure their hospital actually runs on. Two metro datacenters separated by roughly 40 kilometers of dark fiber. A third campus site. Full automated failover for the production tier. Continuous availability through planned maintenance and unplanned site loss alike. The recovery time objective written into the acceptance criteria, not assumed from a vendor data sheet.

Twelve months of engineering. 75 in-scope commissioning tests, all executed and passed on April 29, 2026. The five additional scenarios in the runbook were documented as out of scope for this engagement (East/West policy enforcement via Cisco Secure Workload, and a customer backup platform standing up in parallel). The runbook went to the customer with a clean recommendation for cutover.

The brief

Two metro datacenters separated by roughly 40 kilometers of dark fiber, plus the main campus. Synchronous replication for the production storage tier. Stretched fabric so that workloads could move between sites without changing IP, without firewall reconfiguration, and without manual intervention from operations staff. Recovery time objective: under ten minutes for complete primary site loss. Recovery point objective: zero. Measured, not modeled.

The metro distance put the inter-site links well beyond what client-side optics can reach without amplification. The transport layer required coherent ZR-class optics on the DWDM waves, both for the 400G primary path and the 100G coherent backup on the diverse route.
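For a sense of scale on what 40 kilometers means to a synchronous write, the propagation delay over the span can be estimated with a few lines of arithmetic. This is a rough sketch, not a measurement from the engagement: the group index of 1.468 is an assumed typical value for standard single-mode fiber, and the result covers propagation only, not transponder, FEC, or switching latency.

```python
# Back-of-the-envelope propagation delay over a metro fiber span.
# Assumption: group index ~1.468 (typical single-mode fiber), not a measured value.

SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_GROUP_INDEX = 1.468              # assumed, see note above


def one_way_delay_us(distance_km: float, group_index: float = FIBER_GROUP_INDEX) -> float:
    """Return the one-way propagation delay in microseconds."""
    speed_in_fiber_km_per_s = SPEED_OF_LIGHT_KM_PER_S / group_index
    return distance_km / speed_in_fiber_km_per_s * 1_000_000


def round_trip_delay_us(distance_km: float, group_index: float = FIBER_GROUP_INDEX) -> float:
    """Return the round-trip propagation delay in microseconds."""
    return 2 * one_way_delay_us(distance_km, group_index)


if __name__ == "__main__":
    span_km = 40  # "roughly 40 kilometers" from the brief
    print(f"one-way:    {one_way_delay_us(span_km):.0f} us")
    print(f"round trip: {round_trip_delay_us(span_km):.0f} us")
```

Under these assumptions the span adds roughly 0.2 ms one way and 0.4 ms per round trip, which is the floor a synchronously replicated write pays before any equipment latency is added.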

The applications above the fabric did not have a tolerance for “best effort.” Epic does not gracefully degrade. PACS image retrieval cannot stall mid-procedure. The infusion pump telemetry that monitors patient drug delivery does not accept “the network is being failed over right now.” That set the bar for the network design and for the storage replication design that sat under it.

The architecture

The fabric is a Cisco ACI Multi-Site deployment, with separate ACI fabrics at the primary site, the secondary/DR site, and the campus. The three fabrics are stitched together at the policy layer by Cisco Nexus Dashboard Orchestrator, deployed in a 2+1+1 topology that survives the loss of any single site without losing cluster quorum.

Between the two metro datacenters runs the Data Center Interconnect: four 400G waves across the primary DWDM path, with 100G backup waves on a physically diverse route for the case where the entire primary DWDM transport is lost. Storage replication runs synchronously over the DWDM transport, with the replication link continuously monitored for state and latency and guarded by split-brain protection.

North/south traffic terminates through a Cisco FTD service chain at each site, with stateful inspection that survives the inter-site failover events the fabric is engineered for. Load balancing for the production VIP set is handled by an F5 multi-data-center cluster, with VIP continuity across the metro pair. Out-of-band management runs through a dedicated firewall HA pair, with AnyConnect VPN and Duo MFA for the on-call operations path. SD-WAN handles the campus-to-datacenter overlay.

The named stack:

  • Cisco ACI Multi-Site
  • Cisco Nexus Dashboard Orchestrator
  • Cisco FTD
  • F5 multi-DC cluster
  • Cisco SD-WAN
  • DWDM optical transport
  • All-flash storage arrays with synchronous replication
  • Cisco AnyConnect with Duo MFA
  • HSRP and vPC for the out-of-band core

This is the same multi-vendor depth we bring to every managed network engagement, scaled up to the stakes of clinical infrastructure.

The build

A twelve-month engineering engagement, principal-engineer-led end to end. The design phase landed on Cisco ACI Multi-Site (not Multi-Pod, not OTV) because the customer needed three distinct fabrics at three distinct sites with independent control planes, stitched together by policy rather than by a single stretched fabric. The decision log on that choice runs about four pages in the design document, and we are happy to walk a technical evaluator through it on a call.

DWDM with 400G primary plus 100G backup was selected over a simpler single-wave design because the synchronous storage replication has a hard requirement on the inter-site link, and the operational cost of losing replication during a transport event was higher than the capital cost of the backup waves. We measured. We sized. We bought.
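The sizing question behind that choice can be sketched as a simple capacity check. This is an illustration only: the backup wave count of two is taken from the "backup pair" wording in the commissioning results, while the peak replication rates and the 30 percent headroom are hypothetical placeholders, not figures from the engagement.

```python
# Sketch of a backup-path sizing check for synchronous replication.
# Hypothetical inputs: the peak replication rates and the headroom fraction.

PRIMARY_WAVES, PRIMARY_WAVE_GBPS = 4, 400  # four 400G primary waves
BACKUP_WAVES, BACKUP_WAVE_GBPS = 2, 100    # assumed pair of 100G backup waves


def aggregate_gbps(waves: int, gbps_per_wave: int) -> int:
    """Return the aggregate line rate of a set of identical waves."""
    return waves * gbps_per_wave


def fits_on_backup(peak_replication_gbps: float, headroom: float = 0.3) -> bool:
    """Return True if peak replication traffic fits on the backup path
    while leaving the given fraction of capacity unused."""
    usable_gbps = aggregate_gbps(BACKUP_WAVES, BACKUP_WAVE_GBPS) * (1 - headroom)
    return peak_replication_gbps <= usable_gbps


if __name__ == "__main__":
    print("primary Gb/s:", aggregate_gbps(PRIMARY_WAVES, PRIMARY_WAVE_GBPS))
    print("backup Gb/s: ", aggregate_gbps(BACKUP_WAVES, BACKUP_WAVE_GBPS))
    for peak in (120, 150):  # hypothetical peak replication rates in Gb/s
        print(f"peak {peak} Gb/s fits on backup:", fits_on_backup(peak))
```

The point of the check is the asymmetry: the backup path carries a fraction of the primary capacity, so it has to be sized against measured replication traffic rather than against the primary line rate.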

The F5 cluster sits behind ACI rather than in front of it, with a specific VIP migration pattern that lets the application owners change their HTTPS endpoints once and not again. The Citrix front-end delivery layer for Epic was a particular focus during the commissioning tests, because Citrix session state behaves differently from stateless HTTP and the failover model has to account for it.

Validation under load

The runbook on the engagement is a 170-page document, version-controlled, with eighty distinct test scenarios. The 75 in-scope tests were executed across two days ending April 29, 2026. Each test had a stated objective, a procedure, an expected outcome, a pass criterion, and an evidence-capture step. The compute, virtualization, and storage observations were captured concurrently inside each network test, because the network does not exist in isolation and the test results are not credible if the higher layers were not measured during the cutover events.
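The five-part structure of each test lends itself to a small record type. The sketch below is one hypothetical way to model a scenario and tally the results; the class name, field names, scenario IDs, and placeholder text are illustrative assumptions and do not reflect the actual runbook format.

```python
# Hypothetical model of a runbook scenario and a pass/fail tally.
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    OUT_OF_SCOPE = "out_of_scope"


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    objective: str
    procedure: str
    expected_outcome: str
    pass_criterion: str
    evidence_ref: str  # pointer to captured evidence (placeholder here)
    status: Status


def summarize(scenarios):
    """Count total, in-scope, passed, and out-of-scope scenarios."""
    counts = Counter(s.status for s in scenarios)
    out_of_scope = counts[Status.OUT_OF_SCOPE]
    return {
        "total": len(scenarios),
        "in_scope": len(scenarios) - out_of_scope,
        "passed": counts[Status.PASSED],
        "out_of_scope": out_of_scope,
    }


def build_placeholder_runbook(in_scope=75, out_of_scope=5):
    """Build placeholder records matching the counts quoted in the text."""
    statuses = [Status.PASSED] * in_scope + [Status.OUT_OF_SCOPE] * out_of_scope
    return [
        Scenario(f"T-{i:03d}", "placeholder objective", "placeholder procedure",
                 "placeholder outcome", "placeholder criterion",
                 f"evidence/T-{i:03d}", status)
        for i, status in enumerate(statuses, start=1)
    ]


if __name__ == "__main__":
    summary = summarize(build_placeholder_runbook())
    print(f"{summary['passed']} / {summary['in_scope']} in-scope tests passed")
```

Keeping the evidence reference on the record itself is the design choice that matters: a result with no evidence pointer cannot be marked as passed without someone noticing.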

Concurrent capture is the same discipline we build into every observability platform we stand up: metrics, logs, and flow data correlated across layers so the evidence trail is complete the first time you ask for it, not the third.

The numbers that came out of the commissioning are the ones that make this case study worth writing. Loss of all four 400G waves with failover to the 100G coherent backup pair: convergence in 0.25 to 1 millisecond, well inside the sub-second budget the application layer can absorb without operator-visible impact. HSRP failover at the out-of-band core: two pings lost. NDO orchestrator failover from primary to secondary site: zero policy drift on the production fabrics after the recovery. Synchronous replication: remained active through single-wave failure, degraded gracefully through partial loss, suspended cleanly through total transport loss, and returned to RPO=0 after the recovery in every scenario.
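The three replication behaviors described above can be read as a mapping from transport health to replication state. The sketch below is a simplified illustration of that mapping, not the array's actual logic: the thresholds, the function name, and the treatment of a backup-only path as "degraded" are all assumptions made for the example.

```python
# Illustrative mapping from surviving waves to replication state.
# Assumptions: thresholds and the backup-only case are hypothetical.
from enum import Enum


class ReplicationState(Enum):
    ACTIVE = "active"
    DEGRADED = "degraded"
    SUSPENDED = "suspended"


def classify_replication(primary_waves_up: int, backup_waves_up: int,
                         primary_total: int = 4) -> ReplicationState:
    """Classify replication state from the number of waves still up."""
    if primary_waves_up < 0 or backup_waves_up < 0:
        raise ValueError("wave counts must be non-negative")
    if primary_waves_up > primary_total:
        raise ValueError("more primary waves up than exist")
    # Total transport loss: nothing left to replicate over.
    if primary_waves_up == 0 and backup_waves_up == 0:
        return ReplicationState.SUSPENDED
    # Full primary path, or a single-wave failure: replication stays active.
    if primary_waves_up >= primary_total - 1:
        return ReplicationState.ACTIVE
    # Anything in between is treated as partial loss (assumption).
    return ReplicationState.DEGRADED


if __name__ == "__main__":
    for primary, backup in [(4, 2), (3, 2), (1, 2), (0, 2), (0, 0)]:
        state = classify_replication(primary, backup)
        print(f"primary={primary} backup={backup} -> {state.value}")
```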

Full primary-site failure simulation: the entire DCI partition was tested at the application layer with Epic production traffic running. The fabric failed over. The F5 VIP set moved. The Citrix front-end re-targeted. The PACS image retrieval queue resumed at the same write-ahead log (WAL) position on the secondary array. The applications above did not notice the geometry change.

The customer’s question was never “will it fail over?” They had heard that pitch from vendors before. The question was “will the radiology PACS queue come back at the same position?” That is what we engineered for, and that is what we measured.

Principal engineer on the engagement

Decisions worth naming

For a technical evaluator reading this case study, the design choices matter more than the outcome metrics. The three that came up repeatedly during the build:

Why ACI Multi-Site, not Multi-Pod. Multi-Pod would have given us a single stretched fabric across the metro pair, with a single control plane. Multi-Site gives us three independent fabrics stitched at the policy layer. We picked Multi-Site because the customer’s failure modes included scenarios where the metro pair was partitioned but each site remained internally operational. Multi-Pod would have created a single failure domain spanning the metro; Multi-Site keeps the failure domains distinct.

Why synchronous replication over DWDM, not a stretched cluster. Stretched storage clusters at metro distances are operationally simpler but constrain the storage vendor and the array placement decisions for the next refresh cycle. Synchronous replication over DWDM gives us the RPO=0 guarantee without locking the storage layer into a specific vendor’s stretched-cluster mechanism. The replication link is monitored continuously, with split-brain protection at the array layer.
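The mechanism behind RPO=0 is that a write is acknowledged to the host only after both arrays hold it. The toy model below illustrates that property under simplifying assumptions: the class names are hypothetical, the "arrays" are in-memory lists, and a write is simply refused when either side is unavailable, whereas a real array would apply its own suspend and split-brain rules.

```python
# Toy model of synchronous replication: acknowledge only after both sides commit.
# Hypothetical classes for illustration; not any vendor's replication API.


class Array:
    def __init__(self, name: str):
        self.name = name
        self.log = []       # ordered list of committed records
        self.online = True


class SyncReplicatedVolume:
    def __init__(self, primary: Array, secondary: Array):
        self.primary = primary
        self.secondary = secondary
        self.acknowledged = []  # records the host has been told are durable

    def write(self, record: str) -> int:
        """Commit to both arrays, then acknowledge. Returns the log position."""
        if not (self.primary.online and self.secondary.online):
            # Simplification: refuse the write rather than model suspend logic.
            raise ConnectionError("replication pair unavailable; write not acknowledged")
        self.primary.log.append(record)
        self.secondary.log.append(record)
        self.acknowledged.append(record)
        return len(self.secondary.log)


if __name__ == "__main__":
    volume = SyncReplicatedVolume(Array("primary"), Array("secondary"))
    for record in ("img-001", "img-002", "img-003"):
        volume.write(record)
    volume.primary.online = False  # simulate complete primary site loss
    lost = [r for r in volume.acknowledged if r not in volume.secondary.log]
    print("acknowledged writes missing on secondary:", len(lost))
```

Because nothing is acknowledged before the secondary holds it, the secondary's log position at the moment of primary loss is exactly the last position the workload was told about.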

Why the F5 cluster sits behind ACI, not in front. Putting the F5 in front of the fabric is the simpler integration, and it is what some reference architectures suggest. We put it behind because the VIP set needs to move with the application during failover, and the policy layer of ACI is where the VIP migration is most cleanly orchestrated. The application owners change their endpoints once, when the architecture is commissioned, and not again on every failover.

Engagement shape

Twelve months from contract signature to commissioning test sign-off. Principal engineer on every working session. Roughly forty percent of the calendar was design (architecture, vendor selection, low-level documentation), forty percent was build (fabric stand-up at all three sites, DCI commissioning, F5 cluster deployment, storage replication validation), and twenty percent was the commissioning test cycle itself.

The customer kept a four-page acceptance criteria sheet from the start of the engagement. Every entry on that sheet was tied to a specific test in the runbook. At sign-off, every entry was checked.

What you get from a build like this

A network that survives a complete site loss in single-digit minutes. Storage that comes back at the same write position the workload was at when the primary site went dark. Applications that do not notice the geometry change. A commissioning runbook that the customer keeps and re-runs on a schedule, with the same pass criteria and the same evidence-capture step.

What you do not get is a sales deck. What you do not get is “we tested it in a lab once.” What you do not get is a vendor pitch that talks about RPO and RTO without naming the mechanism that produces them.

That is the difference engineering-led infrastructure makes at the scale and stakes this work runs at.

Results

Every number was measured, not modeled.

  • ZERO

    Data loss through complete site failure (RPO)

  • 5-10 min

    Full primary site failover (RTO)

  • 0.25-1ms

    DCI wave failover, 400G to 100G coherent backup

  • 75 / 75

    In-scope commissioning tests passed

  • 12 months

    Engineering duration, design through commissioning sign-off

  • 3

    Sites in production active/active

Let's talk

Have a project like this?

Your first call is with an engineer. Short, candid, free.

  • We reply within one business day.
  • Principal engineer on the call.
  • No pitch deck.
