Fallback Routing for Downtime in Rate Parity Automation

Fallback routing is the resilience layer that decides what a distribution pipeline does the moment its property management system stops answering. Without it, a PMS latency spike, a maintenance window, or a hard outage has exactly two bad outcomes: the automation keeps pushing stale rates until an OTA flags a parity violation, or it stops pushing entirely and every channel drifts out of sync until someone notices. The revenue manager sees a room selling on Booking.com at last week’s price; the operations lead fields an overbooking dispute; the on-call engineer is paged to reconstruct which payloads were lost mid-outage. Inside the broader PMS & Channel Manager Architecture Foundations pipeline, this layer is the control plane that converts an unpredictable infrastructure failure into a bounded, auditable degradation mode. This page defines that control plane end to end: the health-probe and circuit-breaker triggers, the cached-baseline synchronization model, the delta-safe reconciliation path, the data contract that guards it, and the verification and troubleshooting practices that keep it honest in production.

The mechanics of the finite-state routing engine (its healthy, degraded, and recovery phases) are covered in depth in Designing Fallback Routes for PMS Outages; this page is the subsystem-level view that ties those routes into the rest of the parity engine.

Architecture & Prerequisites

The fallback router sits between the PMS-facing ingestion layer and the outbound serializer that talks to each channel manager. Its input is the same stream of rate and inventory mutations the healthy pipeline carries; its output is a routing decision — dispatch through the primary path, dispatch a recomputed delta through the fallback path, or quarantine. The decision is driven by a continuous health signal, not by the failure of any single business request, so a transient 503 on one rate push never trips the whole engine and a genuine outage is detected within a bounded number of probes.
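The three-way routing decision described above can be sketched as a small pure function. This is a minimal illustration, not the reference engine: the `Route` enum, the `decide_route` name, and its boolean inputs are hypothetical stand-ins for the health signal, the baseline lookup, and the delta verdict.

```python
from enum import Enum


class Route(str, Enum):
    PRIMARY = "primary"        # circuit closed: normal dispatch path
    FALLBACK = "fallback"      # circuit not closed: dispatch the recomputed delta
    QUARANTINE = "quarantine"  # hold for review, never force-push


def decide_route(circuit_closed: bool, has_baseline: bool, delta_is_safe: bool) -> Route:
    """Map the health signal and the delta verdict to one of the three outcomes.

    Assumption: while a half-open trial probe is in flight the circuit is not
    yet closed, so business traffic keeps using the fallback path.
    """
    if circuit_closed:
        return Route.PRIMARY
    if not has_baseline or not delta_is_safe:
        return Route.QUARANTINE
    return Route.FALLBACK
```

Keeping the decision free of I/O makes it easy to table-test every combination of inputs before wiring it to the probe and the cache.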

Two upstream contracts must already be resolved before fallback logic can run safely. Rate plan identifiers arrive pre-mapped through the rate plan taxonomy so a cached baseline references stable rate_plan_code values rather than fragile PMS-internal keys, and room-type codes are reconciled through the OTA channel mapping so a fallback push never double-counts a single physical room across channels. The baseline itself is a snapshot of the validated canonical objects produced by data schema standardization, which is what lets the fallback path assume its cached data is already well-typed and currency-normalized.

The reference implementation assumes the following environment. Pin these versions — Pydantic is a hard v2 dependency and v1 validator/dict() syntax will not run against this code:

Python 3.11+
pydantic 2.6+ (v2 field_validator / model_dump API)
httpx 0.27+ or requests 2.31+ for the probe and dispatch clients
redis 5.0+ (the redis-py client), pointed at a Redis server with AOF persistence enabled for the baseline cache
structlog 24.1+ for structured, machine-readable logs
A defined per-property SLA for probe latency and failure thresholds

The baseline cache is the load-bearing dependency. It must survive the same outage that trips the circuit, so it cannot live inside the PMS or share its failure domain — a read-replica or an independently hosted Redis node is the minimum. Configure Redis with appendonly yes and an appendfsync everysec policy so baseline snapshots survive a node restart without paying a per-write fsync on the hot read path during an outage window.

Implementation

Build the router in four ordered steps: probe health continuously, trip the circuit deterministically, recompute a safe delta against the cached baseline, then dispatch or quarantine. Each step below is a runnable unit; Step 4 calls the compute_safe_delta helper defined in Step 3.

Step 1 — Probe PMS health on a fixed cadence

The trigger for fallback is a health signal, decoupled from business traffic. A background probe measures response time and status against a lightweight PMS endpoint and feeds a rolling failure counter. Treat both a non-200 status and a latency breach as failures — a PMS that answers slowly is as dangerous to parity as one that does not answer at all, because slow pushes queue up and land out of order.

python

import time
import httpx
import structlog

logger = structlog.get_logger()

class HealthProbe:
    def __init__(self, pms_base_url: str, latency_threshold_ms: float = 2500.0):
        self.pms_base_url = pms_base_url.rstrip("/")
        self.latency_threshold_ms = latency_threshold_ms
        self._client = httpx.Client(timeout=3.0)

    def check(self, property_id: str) -> bool:
        start = time.perf_counter()
        try:
            resp = self._client.get(f"{self.pms_base_url}/health")
            latency_ms = (time.perf_counter() - start) * 1000
            healthy = resp.status_code == 200 and latency_ms < self.latency_threshold_ms
            logger.info(
                "pms_health_probe",
                property_id=property_id,
                status=resp.status_code,
                latency_ms=round(latency_ms, 1),
                healthy=healthy,
            )
            return healthy
        except httpx.RequestError as exc:
            logger.warning("pms_health_probe_failed", property_id=property_id, error=str(exc))
            return False

The probe client carries its own short 3-second timeout independent of the dispatch clients, so a hung probe can never itself become the cause of thread exhaustion during degradation — the failure it is meant to detect.
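The probe class above performs one check; the fixed cadence has to come from a scheduler around it. One possible loop is sketched below, assuming a hypothetical `run_probe_loop` helper and a 5-second interval that the source does not specify. The sleep and clock functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable, Optional


def run_probe_loop(
    check: Callable[[], bool],
    record: Callable[[bool], object],
    interval_s: float = 5.0,            # assumed cadence, tune per property SLA
    max_ticks: Optional[int] = None,    # None means run until interrupted
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call check() every interval_s seconds and feed each result to record().

    Deadlines are scheduled from the loop start rather than from the end of
    each probe, so a slow probe does not stretch the cadence.
    Returns the number of probes executed.
    """
    next_deadline = clock()
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        record(check())
        ticks += 1
        next_deadline += interval_s
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
        else:
            # The probe overran its slot: resynchronize instead of firing
            # a burst of catch-up probes against a struggling PMS.
            next_deadline = clock()
    return ticks
```

With the classes from this page it would be wired roughly as `run_probe_loop(lambda: probe.check(property_id), breaker.record)`, typically on a background thread.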

Step 2 — Trip and reset the circuit deterministically

The circuit breaker turns a run of failed probes into a discrete routing state. It opens after a configurable number of consecutive failures, stays open for a cooldown, then admits a single trial probe in a half-open state before either closing or re-opening. This phased recovery is what prevents a thundering-herd reconnect the instant the PMS comes back.

python

import time
from dataclasses import dataclass
from enum import Enum

import structlog

logger = structlog.get_logger()

class CircuitState(str, Enum):
    CLOSED = "closed"      # primary path active
    OPEN = "open"          # fallback path active
    HALF_OPEN = "half_open"  # single trial probe in flight

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    cooldown_seconds: float = 30.0
    state: CircuitState = CircuitState.CLOSED
    _failures: int = 0
    _opened_at: float = 0.0

    def record(self, healthy: bool) -> CircuitState:
        if healthy:
            # A healthy probe closes the circuit and clears the failure run.
            self.state = CircuitState.CLOSED
            self._failures = 0
            return self.state
        self._failures += 1
        if self.state is CircuitState.HALF_OPEN:
            # Failed trial probe: re-open and restart the cooldown.
            self.state = CircuitState.OPEN
            self._opened_at = time.monotonic()
            logger.warning("circuit_opened", failures=self._failures, trial_failed=True)
        elif self._failures >= self.failure_threshold and self.state is CircuitState.CLOSED:
            self.state = CircuitState.OPEN
            # Monotonic clock: the cooldown is immune to wall-clock adjustments.
            self._opened_at = time.monotonic()
            logger.warning("circuit_opened", failures=self._failures)
        return self.state

    def allow_trial(self) -> bool:
        if self.state is CircuitState.OPEN and time.monotonic() - self._opened_at >= self.cooldown_seconds:
            self.state = CircuitState.HALF_OPEN
            logger.info("circuit_half_open")
            return True
        return False

Separating allow_trial() from record() keeps recovery explicit: the engine must actively decide to send a trial probe after the cooldown, rather than silently flipping back to the primary path on a timer and risking a push against a PMS that is still unhealthy.
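A stricter recovery rule, where a half-open circuit closes only after a second confirming probe, could look like the sketch below. `ConfirmingBreaker` and its `confirmations_required` parameter are hypothetical names for illustration, states are plain strings to keep the example standalone, and the clock is injectable for testing.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConfirmingBreaker:
    """Hypothetical variant: N consecutive healthy probes close a half-open circuit."""
    failure_threshold: int = 3
    cooldown_seconds: float = 30.0
    confirmations_required: int = 2
    clock: Callable[[], float] = time.monotonic
    state: str = "closed"
    _failures: int = 0
    _confirmations: int = 0
    _opened_at: float = 0.0

    def _open(self) -> None:
        self.state = "open"
        self._opened_at = self.clock()
        self._confirmations = 0

    def allow_trial(self) -> bool:
        if self.state == "open" and self.clock() - self._opened_at >= self.cooldown_seconds:
            self.state = "half_open"
            return True
        # Once half-open, further confirming probes are allowed.
        return self.state == "half_open"

    def record(self, healthy: bool) -> str:
        if self.state == "half_open":
            if not healthy:
                self._open()  # failed trial: restart the cooldown
            else:
                self._confirmations += 1
                if self._confirmations >= self.confirmations_required:
                    self.state = "closed"
                    self._failures = 0
                    self._confirmations = 0
        elif self.state == "closed":
            self._failures = 0 if healthy else self._failures + 1
            if self._failures >= self.failure_threshold:
                self._open()
        # While open, probe results are ignored until allow_trial() passes.
        return self.state
```

The trade-off is a slower return to the primary path in exchange for fewer flaps when the PMS recovers unevenly.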

Step 3 — Recompute a safe delta against the cached baseline

When the circuit is open the fallback path does not replay queued messages verbatim — that is how stale rates reach an OTA. It recomputes the outbound payload against a known-good baseline snapshot and enforces a parity-deviation tolerance and contract-rate floors before anything leaves the process. Only the changed, in-tolerance fields are dispatched.

python

import json
from decimal import Decimal

import structlog

logger = structlog.get_logger()

def compute_safe_delta(incoming: dict, baseline: dict, max_deviation: Decimal = Decimal("0.03")) -> dict:
    """Return only in-tolerance rate/inventory changes; flag the payload unsafe otherwise."""
    safe_rates: dict[str, str] = {}
    for rate_plan_code, new_amount in incoming.get("rates", {}).items():
        # Parse via str so a float input never leaks binary rounding into the Decimal.
        new_amount = Decimal(str(new_amount))
        # A plan absent from the baseline falls back to its own value, so it is
        # treated as unchanged and omitted from the delta.
        base_amount = Decimal(str(baseline.get("rates", {}).get(rate_plan_code, new_amount)))
        if base_amount == 0:
            # A relative deviation is undefined against a zero baseline: skip the plan.
            continue
        deviation = abs(new_amount - base_amount) / base_amount
        if deviation > max_deviation:
            logger.warning(
                "parity_deviation_exceeded",
                rate_plan_code=rate_plan_code,
                deviation=float(deviation),
                base_amount=str(base_amount),
                new_amount=str(new_amount),
            )
            # One out-of-tolerance plan marks the whole payload unsafe.
            return {"is_safe": False, "reason": "parity_deviation_exceeded", "rate_plan_code": rate_plan_code}
        if new_amount != base_amount:
            safe_rates[rate_plan_code] = str(new_amount)
    return {
        "is_safe": True,
        # Inventory is passed through as received; only rates are tolerance-checked here.
        "payload": {"rates": safe_rates, "inventory": incoming.get("inventory", {})},
    }

Rates are handled as Decimal from string, never float: a fractional-cent rounding error introduced by binary floating point is exactly the kind of sub-threshold drift that accumulates into a parity breach over a multi-day outage.
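The delta function above enforces the percentage tolerance only; the contract-rate floors mentioned for this step need a separate check. A minimal sketch follows, assuming hypothetical `floors` and `ceilings` mappings keyed by rate_plan_code, which would have to come from your own contract data.

```python
from decimal import Decimal
from typing import Dict, List, Optional


def find_bound_violations(
    rates: Dict[str, str],
    floors: Dict[str, str],
    ceilings: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Return one entry per rate that breaches its contract floor or ceiling.

    Plans without a configured bound are not checked. Amounts are parsed from
    strings so the comparison stays exact to the cent.
    """
    ceilings = ceilings or {}
    violations = []
    for code, amount in rates.items():
        value = Decimal(str(amount))
        floor = floors.get(code)
        ceiling = ceilings.get(code)
        if floor is not None and value < Decimal(str(floor)):
            violations.append({"rate_plan_code": code, "reason": "below_contract_floor"})
        elif ceiling is not None and value > Decimal(str(ceiling)):
            violations.append({"rate_plan_code": code, "reason": "above_contract_ceiling"})
    return violations
```

Any non-empty result would send the payload to quarantine rather than clamping the rate to the bound.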

Step 4 — Dispatch the delta or quarantine the payload

With a safe delta in hand, the router pushes it to the channel manager under a deterministic idempotency key. Anything the delta step flagged unsafe is written to a quarantine store with its reason and never force-pushed — a quarantined payload is a bounded, reviewable event, whereas a forced bad push is an OTA penalty.

python

import json
import time

import httpx
import structlog

logger = structlog.get_logger()

def route_update(incoming: dict, baseline_raw: str | None, idempotency_key: str,
                 client: httpx.Client, redis_client) -> dict | None:
    if baseline_raw is None:
        # No baseline to diff against: refuse to dispatch rather than push blind.
        logger.error("fallback_missing_baseline", idempotency_key=idempotency_key)
        return None

    # compute_safe_delta is the helper defined in Step 3.
    delta = compute_safe_delta(incoming, json.loads(baseline_raw))
    if not delta["is_safe"]:
        # Persist the reason and the offending plan for later review.
        redis_client.set(
            f"quarantine:{idempotency_key}",
            json.dumps({
                "ts": time.time(),
                "reason": delta["reason"],
                "rate_plan_code": delta["rate_plan_code"],
            }),
            ex=86_400,  # 24-hour TTL
        )
        logger.warning("payload_quarantined", idempotency_key=idempotency_key, reason=delta["reason"])
        return None

    resp = client.post(
        "https://api.channelmanager.example/v1/inventory/sync",
        json=delta["payload"],
        headers={"Idempotency-Key": idempotency_key, "Content-Type": "application/json"},
        timeout=5.0,
    )
    resp.raise_for_status()
    # Tag the log line so post-incident queries can isolate fallback traffic.
    logger.info("fallback_sync_success", idempotency_key=idempotency_key,
                source="fallback", rates_changed=len(delta["payload"]["rates"]))
    return delta["payload"]

The quarantine TTL matches the channel manager’s typical 24-hour idempotency window: a payload that cannot be safely dispatched within that window is stale by definition and must be re-derived from live PMS state on recovery rather than replayed.

Schema & Data Contracts

Every payload that enters the router is validated against a strict contract before any routing decision is made. Modeling the baseline snapshot and the incoming mutation with the same Pydantic v2 shape means the delta step compares like against like, and an unexpected field from a degraded PMS surfaces as a local ValidationError rather than an unroutable request.

python

from datetime import date
from decimal import Decimal
from pydantic import BaseModel, Field, field_validator, ConfigDict

class FallbackInventoryUpdate(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")

    property_id: str = Field(pattern=r"^prop_[0-9a-f]{8}$")
    room_type_code: str = Field(min_length=2, max_length=16)
    rate_plan_code: str = Field(min_length=3, max_length=24, pattern=r"^[A-Z0-9_-]+$")
    ota: str  # channel slug, e.g. "booking_com" or "expedia"

    base_amount: Decimal = Field(ge=0, max_digits=10, decimal_places=2)
    currency: str = Field(pattern=r"^[A-Z]{3}$")

    date_from: date
    date_to: date
    available_rooms: int = Field(ge=0)
    stop_sell: bool = False

    source: str = Field(default="fallback")  # provenance for the audit trail

    @field_validator("ota")
    @classmethod
    def known_channel(cls, v: str) -> str:
        allowed = {"booking_com", "expedia", "agoda", "direct"}
        if v not in allowed:
            raise ValueError(f"unknown OTA slug: {v}")
        return v

The explicit source field is what makes an outage auditable after the fact: every record dispatched through the fallback path is tagged fallback, so a post-incident query can isolate exactly which rates were served from cache versus from the live PMS.

Error Handling & Retry Strategy

Fallback routing has two distinct error surfaces, and they demand opposite responses. A failure on the PMS probe feeds the circuit breaker — you do not retry it aggressively, because a retry storm against a degraded PMS deepens the outage; the breaker’s cooldown is the backoff. A failure on the channel-manager dispatch, by contrast, is a normal transient that warrants a bounded retry.

For dispatch, distinguish status classes explicitly. A 429 or a transient 5xx (500, 502, 503, or 504 in the code below) is retryable with exponential backoff and jitter; a 4xx other than 429 is a contract error that must dead-letter, not retry, because replaying it only burns quota. This is the same taxonomy applied throughout the pipeline’s error categorization and retry logic, and it composes with the channel-level OTA API rate limits so a recovery burst does not immediately re-trip a rate ceiling.

python

import random
import time

import httpx
import structlog

logger = structlog.get_logger()

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def dispatch_with_backoff(send, *, max_attempts: int = 4, base_delay: float = 0.5) -> httpx.Response:
    """Call send() up to max_attempts times, backing off on retryable failures."""
    for attempt in range(1, max_attempts + 1):
        # Exponential backoff plus additive jitter, used only if this attempt is retried.
        delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
        try:
            resp = send()
        except httpx.RequestError as exc:
            # Transport-level failure (connect error, timeout): retry until attempts run out.
            if attempt == max_attempts:
                raise
            logger.warning("dispatch_transport_retry", attempt=attempt, error=str(exc), sleep_s=round(delay, 2))
            time.sleep(delay)
            continue
        if resp.status_code in RETRYABLE_STATUS and attempt < max_attempts:
            logger.info("dispatch_retry", attempt=attempt, status=resp.status_code, sleep_s=round(delay, 2))
            time.sleep(delay)
            continue
        # A non-retryable 4xx, or a retryable status on the final attempt, raises here.
        resp.raise_for_status()
        return resp
    raise RuntimeError("unreachable")

The additive random.uniform(0, base_delay) jitter is deliberate: without it, every buffered fallback payload retries on the same doubling schedule and re-converges into a synchronized burst the instant the channel manager returns, which is precisely the herd effect the circuit breaker’s phased recovery works to avoid.
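To make the schedule concrete, the sleep window before each retry can be computed directly from the two parameters. The helper below is an illustrative calculation only; `backoff_bounds` is a hypothetical name and mirrors the formula `base_delay * 2 ** (attempt - 1) + uniform(0, base_delay)` used in the retry loop.

```python
from typing import List, Tuple


def backoff_bounds(max_attempts: int = 4, base_delay: float = 0.5) -> List[Tuple[float, float]]:
    """Return the (minimum, maximum) sleep in seconds before each retry.

    There is one entry per retry, so max_attempts - 1 entries in total:
    no sleep follows the final attempt.
    """
    bounds = []
    for attempt in range(1, max_attempts):
        exponential = base_delay * 2 ** (attempt - 1)
        # Jitter adds anywhere between 0 and base_delay on top of the exponential term.
        bounds.append((exponential, exponential + base_delay))
    return bounds
```

With the defaults this gives windows of 0.5 to 1.0, 1.0 to 1.5, and 2.0 to 2.5 seconds, so a payload that exhausts all four attempts spends between 3.5 and 5.0 seconds sleeping, excluding request time.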

The idempotency key must be deterministic, derived from the business identity of the update — a hash of property_id, room_type_code, rate_plan_code, and the date range — not a random UUID. That way a payload buffered during degradation and re-dispatched on recovery reuses the same key it would have had on the primary path, and the channel manager collapses the duplicate instead of double-applying an inventory deduction.

python

import hashlib

def idempotency_key(update: "FallbackInventoryUpdate") -> str:
    identity = f"{update.property_id}:{update.room_type_code}:{update.rate_plan_code}:{update.date_from}:{update.date_to}"
    return hashlib.sha256(identity.encode()).hexdigest()[:32]

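Deriving the key from business identity rather than payload content means a corrected rate for the same room-night arrives under the same key as the earlier one, instead of stacking two conflicting updates the OTA would apply in arrival order. This assumes the channel manager treats a repeated key as a last-write-wins replacement; if it instead replays the stored response for a repeated key and ignores the new body, a corrected rate needs a version component in the key, so confirm the semantics with your provider.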

Verification & Testing

Fallback routing is only trustworthy if you can prove it behaves before a real outage exercises it. Verify three properties: the circuit trips and recovers on the right signal, the delta step blocks out-of-tolerance rates, and dispatch is genuinely idempotent under replay.

python

def test_circuit_trips_after_threshold():
    cb = CircuitBreaker(failure_threshold=3)
    for _ in range(2):
        assert cb.record(healthy=False) is CircuitState.CLOSED
    assert cb.record(healthy=False) is CircuitState.OPEN

def test_delta_blocks_out_of_tolerance():
    incoming = {"rates": {"BAR_FLEX": "150.00"}}
    baseline = {"rates": {"BAR_FLEX": "100.00"}}  # +50% — well over 3%
    result = compute_safe_delta(incoming, baseline)
    assert result["is_safe"] is False
    assert result["reason"] == "parity_deviation_exceeded"

def test_delta_emits_only_changes():
    incoming = {"rates": {"BAR_FLEX": "101.50", "BAR_NONREF": "90.00"}}
    baseline = {"rates": {"BAR_FLEX": "100.00", "BAR_NONREF": "90.00"}}
    result = compute_safe_delta(incoming, baseline)
    assert result["is_safe"] is True
    assert set(result["payload"]["rates"]) == {"BAR_FLEX"}  # unchanged plan omitted

Beyond unit assertions, confirm behavior against live telemetry: after a simulated outage, the structured logs must show a circuit_opened event, one or more fallback_sync_success lines tagged with source=fallback, and a matching count of dispatched records against the number of in-tolerance mutations buffered. A mismatch between buffered and dispatched counts is the signal that a payload was silently dropped — reconcile it against live PMS state through async polling on recovery, and cross-check the totals with the daily batch reconciliation run.
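The buffered-versus-dispatched comparison can be automated against the log stream. The sketch below assumes structlog is configured to render one JSON object per line with the event name under the `event` key; the function names and the `buffered_in_tolerance` input are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_outage_logs(lines: Iterable[str]) -> Counter:
    """Count structured log events by name, skipping lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and "event" in record:
            counts[record["event"]] += 1
    return counts


def dispatch_gap(counts: Counter, buffered_in_tolerance: int) -> int:
    """Buffered in-tolerance mutations with no matching fallback_sync_success line."""
    return buffered_in_tolerance - counts["fallback_sync_success"]
```

A gap of zero alongside at least one circuit_opened event is the expected result after a simulated outage; any positive gap is the cue to reconcile against live PMS state.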

Troubleshooting

Circuit flaps between open and closed under intermittent latency. Root cause: the failure threshold is too low or the probe cadence too tight, so a short run of slow responses trips the breaker and the next fast response resets it. Fix: raise failure_threshold, require two consecutive healthy probes before closing from half-open, and confirm the latency threshold sits below the probe timeout (2.5 seconds against 3 seconds in the reference probe) so a slow-but-successful response is measured and counted as a failure rather than surfacing only as a timeout.

Stale rates reach an OTA during an outage. Root cause: the fallback path replayed a buffered payload verbatim instead of recomputing a delta against the baseline. Fix: ensure every open-circuit dispatch flows through compute_safe_delta, and verify the baseline snapshot in Redis is fresher than the buffered payload.

Duplicate inventory deductions after recovery. Root cause: idempotency keys were generated as random UUIDs, so the same room-night dispatched on both the fallback and primary paths was applied twice. Fix: derive the key deterministically from business identity as shown above, and confirm the channel manager’s idempotency window covers your maximum outage duration.

Baseline missing when the circuit opens. Root cause: the Redis cache shares a failure domain with the PMS, or AOF persistence was disabled and the node restarted mid-outage. Fix: host the baseline cache independently, enable appendonly yes with appendfsync everysec, and treat a missing baseline as a hard quarantine — never push blind.

Recovery burst re-trips OTA rate limits. Root cause: all buffered payloads flush simultaneously on recovery. Fix: rely on the additive jitter in the backoff and coordinate the flush with the channel-level rate-limit budget so the recovery push stays under the ceiling.
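One way to coordinate the flush with a rate-limit budget is to assign each buffered payload its own time slot and jitter the send within that slot. This is a sketch under assumptions: `schedule_flush` and the `max_per_second` budget are hypothetical, and the real ceiling has to come from the channel's published limits.

```python
import random
from typing import List, Optional


def schedule_flush(
    n_payloads: int,
    max_per_second: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return send offsets in seconds from recovery, one per buffered payload.

    Each payload gets a slot of 1 / max_per_second seconds and is sent at a
    random point inside it, so the flush averages max_per_second without
    every worker firing on the same tick.
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    rng = rng or random.Random()
    slot = 1.0 / max_per_second
    return [(i + rng.random()) * slot for i in range(n_payloads)]
```

Because the slots do not overlap, the offsets come out in send order, and 10 payloads at a budget of 5 per second finish within roughly 2 seconds of recovery.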

FAQ

How many failed probes should open the circuit?

Three consecutive failures is a sound default for a per-property probe on a fixed cadence: long enough to ride out a single transient blip, short enough to detect a real outage within three probe intervals. Tune it against your probe interval and your PMS’s observed p99 latency, not by feel. A property with a chattier PMS may warrant a higher threshold; one on a strict parity SLA may want a lower one.

Why recompute a delta instead of just replaying the queued payloads?

Because a queued payload captures intent at the moment it was created, and during an extended outage that intent goes stale. Recomputing against a known-good baseline lets the router enforce a parity-deviation tolerance and drop unchanged fields, so an OTA only ever receives in-tolerance changes. Verbatim replay is exactly how a rate that was correct an hour ago becomes a parity violation on dispatch.

Should the fallback cache live in the same Redis as the rest of the pipeline?

It can share a Redis deployment, but not one that shares a failure domain with the PMS. The baseline must survive the outage that trips the circuit, so if your primary cache is co-located with or dependent on the PMS host, use an independent node or read-replica for the baseline. Enable AOF persistence with an everysec fsync so a node restart mid-outage does not lose the snapshot.

What deviation tolerance is safe for the delta guardrail?

A 3% tolerance is a common starting point, but it must never override a negotiated contract-rate floor or a hard parity commitment. Treat the percentage as a drift guardrail for ordinary rate movement and layer explicit floors and ceilings on top; if a change breaches either, quarantine it rather than clamp it, because a clamped rate is still a wrong rate.

How does the circuit close again without causing a reconnect storm?

Through the half-open phase. After the cooldown, the breaker admits a single trial probe rather than resuming full traffic. Only when that probe succeeds — and, ideally, a second one confirms it — does the circuit close and the buffered deltas flush under jittered backoff. That phased recovery is what prevents every buffered payload from hitting the PMS and channel manager in one synchronized burst.

Designing Fallback Routes for PMS Outages — the healthy/degraded/recovery state machine and local buffering that this control plane orchestrates.
Data Schema Standardization — the validated canonical objects that seed the baseline snapshot the fallback path diffs against.
Error Categorization & Retry Logic — the 4xx-vs-5xx taxonomy the dispatch retry path reuses.
Handling OTA API Rate Limits — the channel-level budget a recovery burst must stay under.
Async Polling for Inventory Updates — how buffered-vs-dispatched counts are reconciled against live PMS state on recovery.
