Parsing Paginated OTA Responses with Requests

When a polling job pulls a large property’s inventory, rate plans, or restriction calendar from an OTA endpoint, the response almost never arrives in one payload — it is chunked across cursors, offsets, or next_page_url links, and a naive loop will silently read only the first page. This page is the pagination component of the broader async polling workflow: it specifies a deterministic, resumable parser built on Python’s requests library that walks every page of a channel manager or OTA inventory feed, normalizes the divergent pagination styles those platforms use, and hands validated records downstream without dropping rows or looping forever on a stale cursor.

The failure this prevents is quiet and expensive. A partial read looks like a successful sync — no exception, a 200 OK, a plausible batch of records — but the missing pages mean a revenue manager sees stale availability for half a property, and the gap only surfaces as an overbooking or a parity violation days later. Parsing pagination correctly is what makes the polling loop trustworthy.

Prerequisites & environment

This parser is intentionally built on the synchronous requests stack rather than httpx/asyncio, because a single paginated walk is inherently sequential — each request needs the previous page’s cursor — and the simpler stack is easier to reason about inside a per-property polling task.

Python 3.11+
requests 2.31+ and urllib3 2.x (the Retry class lives in urllib3.util.retry; HTTPAdapter comes from requests.adapters)
pydantic 2.6+ (v2 API — model_validate, model_dump, field_validator)
structlog 24.x for structured telemetry (rendered as JSON in Step 4)
redis 5.x (optional) if you want restart-durable pagination state instead of a local file
Read access to a channel manager or OTA inventory endpoint. Manage the credentials through OAuth2 token refresh so a mid-walk 401 never strands the job, and keep your request cadence inside the published OTA rate limits — pagination multiplies request count, so a 40-page property can exhaust a budget that a single call never would.

The records this parser emits must conform to the same canonical shape produced by standardized JSON payloads, and the rate_plan_code values it reads must already match your rate plan taxonomy — otherwise every page yields false deltas downstream.

Step-by-step implementation

The build proceeds in four self-contained steps: a retry-aware session, a validated record contract, durable pagination state, and the normalizing walk that ties them together.

Step 1 — Build a session with a retry-aware adapter

A persistent requests.Session gives you connection pooling and one place to attach a urllib3 retry strategy. Mounting a configured HTTPAdapter means every GET automatically honors Retry-After, applies backoff on throttling, and retries transient gateway errors before your loop ever sees an exception.

python

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_ota_session(max_retries: int = 5, backoff_factor: float = 0.5) -> requests.Session:
    session = requests.Session()
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=backoff_factor,
        # Only retry status codes that are genuinely transient; a 401/403/404
        # is a wiring or auth problem that will never succeed on replay.
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        respect_retry_after_header=True,
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

The status_forcelist deliberately excludes 4xx auth and not-found codes because retrying them just burns your rate budget; that split mirrors the 4xx vs 5xx error taxonomy the rest of the pipeline uses, and the backoff math itself is covered in implementing exponential backoff in Python.
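To see what those defaults can cost in wall-clock time per page, the schedule can be approximated by hand. This is a rough sketch, not urllib3's own code: it assumes the commonly documented shape backoff_factor * 2**n with a cap, and the helper name approx_backoff_schedule and the 120-second cap are illustrative assumptions. The delay before the first retry has differed between urllib3 releases and Retry-After headers override the computed value, so read the result as an estimate rather than an exact trace.

```python
def approx_backoff_schedule(backoff_factor: float, retries: int, cap: float = 120.0) -> list[float]:
    """Approximate sleep in seconds before each retry: factor * 2**n, capped."""
    return [min(cap, backoff_factor * (2 ** n)) for n in range(retries)]


# Defaults from build_ota_session: backoff_factor=0.5, max_retries=5.
schedule = approx_backoff_schedule(backoff_factor=0.5, retries=5)
print(schedule)       # [0.5, 1.0, 2.0, 4.0, 8.0]
print(sum(schedule))  # 15.5
```

Under these assumptions a single page that exhausts all five retries adds roughly 15 seconds of sleep on top of the request timeouts, which is worth knowing before sizing a polling tick.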

Step 2 — Model the canonical record with Pydantic v2

Every parsed record crosses a trust boundary, so validate it at ingestion rather than deep in a reconciliation job where a bad row is far harder to trace. This Pydantic v2 model maps the OTA’s camelCase fields to the internal hospitality identifiers (property_id, room_type_code, rate_plan_code) and normalizes them on the way in.

python

from datetime import date
from pydantic import BaseModel, Field, field_validator


class InventoryRecord(BaseModel):
    property_id: str = Field(alias="propertyId")
    room_type_code: str = Field(alias="roomTypeCode")
    rate_plan_code: str = Field(alias="ratePlanCode")
    base_rate: float = Field(alias="baseRate")
    currency: str = Field(alias="currency")
    availability: int = Field(alias="inventoryCount")
    effective_date: date = Field(alias="effectiveDate")

    model_config = {"populate_by_name": True}

    @field_validator("room_type_code", "rate_plan_code")
    @classmethod
    def _normalize_codes(cls, v: str) -> str:
        # Upstream feeds vary case and padding; normalize here so the same
        # room_type_code from booking_com and expedia diffs cleanly later.
        return v.strip().upper()

Normalizing room_type_code and rate_plan_code at the field level is what lets a delta engine treat a DBL-STD from booking_com and a dbl-std from expedia as the same unit; skip it and every poll reports phantom changes. The mapping between those per-channel codes is defined in OTA channel mapping strategies.
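The phantom-change effect is easy to reproduce without any OTA at all. The sketch below uses plain dictionaries with made-up availability numbers; normalize_code is a stand-alone copy of the validator's rule, and the two snapshots are hypothetical.

```python
def normalize_code(code: str) -> str:
    # Same rule as the field_validator in Step 2: trim padding, force upper case.
    return code.strip().upper()


# Hypothetical availability snapshots keyed by room_type_code, one per channel.
booking_com = {"DBL-STD": 5, "KNG-DLX": 2}
expedia = {"dbl-std ": 5, "kng-dlx": 2}

# Symmetric difference = codes that appear on one channel but not the other.
raw_mismatches = set(booking_com) ^ set(expedia)
normalized_mismatches = (
    {normalize_code(c) for c in booking_com} ^ {normalize_code(c) for c in expedia}
)

print(len(raw_mismatches))         # 4: every code looks like a change
print(len(normalized_mismatches))  # 0: the channels actually agree
```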

Step 3 — Persist pagination state for idempotent resume

A multi-page walk that dies on page 30 must resume from page 30, not restart from zero and re-emit 29 pages of deltas. A tiny state tracker records the last confirmed cursor per OTA so a container restart or a crashed tick picks up where it left off.

python

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional


class PaginationState:
    def __init__(self, state_file: Path = Path("ota_pagination_state.json")):
        self.state_file = state_file
        self._state: dict[str, Any] = self._load()

    def _load(self) -> dict[str, Any]:
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {}

    def save(self, ota_id: str, cursor: Optional[str]) -> None:
        self._state[ota_id] = {
            "cursor": cursor,
            "updated_at": datetime.now(timezone.utc).isoformat(),
        }
        # Write the whole map atomically so a crash mid-write can't corrupt state.
        tmp = self.state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._state, indent=2))
        tmp.replace(self.state_file)

    def get(self, ota_id: str) -> Optional[str]:
        return self._state.get(ota_id, {}).get("cursor")

The write-to-temp-then-replace pattern makes the state file update atomic as long as both files sit on the same filesystem. Without it, a process killed mid-write leaves a truncated file that fails to parse on the next start, stranding the job. For multi-worker deployments, swap this class for a Redis-backed equivalent so the cursor is shared rather than local to one container.
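One possible shape for that Redis-backed equivalent is sketched below. It is an assumption-laden outline rather than a drop-in: the class name RedisPaginationState and the hash key ota:pagination_state are invented for illustration, and the client is injected so the class only relies on the hset and hget calls that redis-py's Redis client provides.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional


class RedisPaginationState:
    """Same save/get surface as PaginationState, backed by one Redis hash."""

    def __init__(self, client: Any, hash_key: str = "ota:pagination_state"):
        # `client` is expected to behave like redis.Redis(...); the hash key
        # name is an illustrative assumption, not a convention of any library.
        self.client = client
        self.hash_key = hash_key

    def save(self, ota_id: str, cursor: Optional[str]) -> None:
        entry = {"cursor": cursor, "updated_at": datetime.now(timezone.utc).isoformat()}
        # Writing one hash field is a single server-side command, so there is
        # no partially written state for another worker to read.
        self.client.hset(self.hash_key, ota_id, json.dumps(entry))

    def get(self, ota_id: str) -> Optional[str]:
        raw = self.client.hget(self.hash_key, ota_id)
        # json.loads accepts both str and bytes, so decode_responses may be on or off.
        return json.loads(raw)["cursor"] if raw else None
```

Because it exposes the same save and get methods, an instance can be passed wherever Step 4 expects a PaginationState. Coordinating which worker owns which OTA is a separate concern this sketch does not address.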

Step 4 — Normalize pagination variants and walk every page

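OTA pagination styles diverge: booking_com tends toward cursor tokens, expedia and agoda lean on offset/limit or an explicit next link. The parser below collapses the token-style variants (next_cursor, next_page_token, next) into one loop, validates each page, saves state after every confirmed page, and, critically, terminates safely when a cursor points at an empty page. A feed that paginates purely by offset/limit needs its offset mapped into the same cursor slot before this loop can walk it.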

python

import structlog
from requests.exceptions import RequestException
from typing import Generator, Optional

structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()


def _next_cursor(pagination: dict) -> Optional[str]:
    # Collapse the three common variants into one value; None means "done".
    return (
        pagination.get("next_cursor")
        or pagination.get("next_page_token")
        or pagination.get("next")
    )


class OTAPaginationParser:
    def __init__(self, ota_id: str, base_url: str, api_key: str, state: PaginationState):
        self.ota_id = ota_id
        self.base_url = base_url
        self.state = state
        self.session = build_ota_session()
        self.session.headers.update(
            {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
        )

    def _fetch_page(self, cursor: Optional[str]) -> dict:
        params: dict[str, object] = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        try:
            resp = self.session.get(f"{self.base_url}/inventory", params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()  # raises ValueError on truncated / non-JSON bodies
        except (RequestException, ValueError) as exc:
            log.error("page_fetch_failed", ota=self.ota_id, cursor=cursor, error=str(exc))
            raise

    def walk(self) -> Generator[InventoryRecord, None, None]:
        cursor = self.state.get(self.ota_id)  # resume from last confirmed page
        pages = 0
        while True:
            payload = self._fetch_page(cursor)
            records = payload.get("data") or []

            # Guard against the classic infinite loop: a cursor that keeps
            # returning an empty page. Stop instead of polling forever.
            if not records:
                log.info("pagination_complete", ota=self.ota_id, pages=pages, last_cursor=cursor)
                self.state.save(self.ota_id, None)
                break

            for raw in records:
                yield InventoryRecord.model_validate(raw)

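            pages += 1
            next_cursor = _next_cursor(payload.get("pagination", {}))
            # A token echoed back unchanged would re-serve the same page forever;
            # treat it as stale and stop, leaving the last saved state in place.
            if next_cursor is not None and next_cursor == cursor:
                log.warning("stale_cursor_detected", ota=self.ota_id, cursor=cursor)
                break
            cursor = next_cursor
            self.state.save(self.ota_id, cursor)  # confirm the page before moving on
            if cursor is None:
                log.info("pagination_complete", ota=self.ota_id, pages=pages)
                break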

Saving the cursor only after a page has been fully yielded, and treating an empty data array as a hard stop, are the two decisions that make this loop resumable and protect it from the stale-cursor loop that OTAs occasionally trigger by returning a token alongside a page with no data. Catching ValueError around resp.json() means a truncated body served with a 200 is logged with its cursor before the exception is re-raised. The walk still stops at that point, but state continues to point at the last confirmed page, so the next walk resumes there instead of starting over.
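The _next_cursor helper in Step 4 only reads token-style fields. For feeds that paginate by offset/limit, one way to fold them into the same cursor slot is sketched below. The function name next_cursor_any_style and the offset, limit, and total field names are assumptions; check them against the specific channel's response schema.

```python
from typing import Optional


def next_cursor_any_style(pagination: dict) -> Optional[str]:
    """Return the next cursor as a string, or None when the feed is exhausted."""
    # Token-style feeds: same precedence as _next_cursor in Step 4.
    token = (
        pagination.get("next_cursor")
        or pagination.get("next_page_token")
        or pagination.get("next")
    )
    if token:
        return str(token)

    # Offset-style feeds (assumed field names): advance by `limit` until `total`.
    offset, limit, total = (pagination.get(k) for k in ("offset", "limit", "total"))
    if offset is None or not limit or total is None:
        return None
    next_offset = offset + limit
    return str(next_offset) if next_offset < total else None


print(next_cursor_any_style({"next_page_token": "abc"}))                   # abc
print(next_cursor_any_style({"offset": 0, "limit": 100, "total": 250}))    # 100
print(next_cursor_any_style({"offset": 200, "limit": 100, "total": 250}))  # None
```

The request side would also have to send the value as an offset parameter rather than cursor; that wiring depends on the channel and is left out here.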

Gotchas & production notes

Empty page with a live cursor. Some OTA gateways return a valid next_cursor alongside a zero-length data array during backend replication lag. If you trust the cursor over the data you loop forever. Always let an empty page win, as walk() does above, and reconcile the suspected-missing tail on the next scheduled tick.

200 OK with a truncated JSON body. Proxies and gateways sometimes flush a partial response with a success status. resp.json() raises a ValueError subclass on that, so _fetch_page catches it alongside RequestException to log the failing cursor before re-raising. The exception still ends the current walk, but every page yielded before the failure has already been confirmed in state, so downstream consumers should treat a raised walk as “resume”, not “restart”.
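On the consumer side, "resume, not restart" can be as simple as re-entering the walk a bounded number of times. The helper below is a hypothetical sketch: consume_with_resume, its parameters, and the attempt limit are not part of the parser. It catches OSError and ValueError because requests' RequestException derives from IOError (an alias of OSError), which keeps the sketch free of third-party imports.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def consume_with_resume(
    start_walk: Callable[[], Iterable[T]],
    handle: Callable[[T], None],
    max_attempts: int = 3,
) -> int:
    """Drain a walk, re-entering it after a failure instead of starting over.

    `start_walk` must return a fresh iterator that resumes from saved state,
    the way OTAPaginationParser.walk() reads its cursor from PaginationState.
    """
    handled = 0
    for attempt in range(1, max_attempts + 1):
        try:
            for record in start_walk():
                handle(record)
                handled += 1
            return handled  # walk finished cleanly
        except (OSError, ValueError):
            if attempt == max_attempts:
                raise  # give up and surface the last error
    return handled
```

A call would look like consume_with_resume(parser.walk, sink). Since the page that failed is replayed from its first record on the next attempt, handle should be idempotent.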

Timezone on effective_date. OTA feeds report the rate calendar in either UTC or property-local time, and mixing them shifts an entire day’s availability by one row. Pin the parser to interpret effective_date in the property’s local timezone (the same convention the batch reconciliation workflow uses) so a midnight rollover does not misfile inventory against the wrong date.
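The one-day shift is easiest to see with a concrete instant. The sketch below assumes a feed that supplies a timezone-aware UTC timestamp (the model in Step 2 receives a plain date, so this applies only where the channel exposes a timestamp); local_effective_date is a hypothetical helper, and the fixed +02:00 offset stands in for a real zone.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo


def local_effective_date(ts_utc: datetime, property_tz: tzinfo) -> date:
    """Calendar date of a UTC instant as seen at the property."""
    if ts_utc.tzinfo is None:
        raise ValueError("expected a timezone-aware UTC timestamp")
    return ts_utc.astimezone(property_tz).date()


# A fixed +02:00 offset keeps the sketch dependency-free; in production a
# zoneinfo.ZoneInfo for the property's IANA zone would handle DST correctly.
property_tz = timezone(timedelta(hours=2))
late_evening_utc = datetime(2026, 7, 10, 23, 30, tzinfo=timezone.utc)

print(late_evening_utc.date())                              # 2026-07-10
print(local_effective_date(late_evening_utc, property_tz))  # 2026-07-11
```

The same instant files under 10 July in UTC and 11 July at the property, which is exactly the off-by-one-row error described above.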

Per-page rate-budget accounting. A single property might be one API call or forty, depending on page size and inventory depth. Budget against pages, not properties: a 100-property portfolio at 40 pages each is 4,000 requests, and pacing that flat against the channel’s window is the difference between a clean sweep and an IP ban.
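The arithmetic behind flat pacing can be sketched in a few lines. The function name pacing_interval and the 5,000-requests-per-hour budget are hypothetical; substitute the window and quota the channel actually publishes.

```python
def pacing_interval(properties: int, pages_per_property: int,
                    window_seconds: float, window_budget: int) -> float:
    """Seconds to wait between requests so one sweep fits the channel's window."""
    total_requests = properties * pages_per_property
    if total_requests <= 0:
        raise ValueError("nothing to schedule")
    if total_requests > window_budget:
        raise ValueError(
            f"sweep needs {total_requests} requests but the window allows {window_budget}"
        )
    return window_seconds / total_requests


# 100 properties x 40 pages against a hypothetical 5,000-requests-per-hour budget.
print(pacing_interval(100, 40, window_seconds=3600, window_budget=5000))  # 0.9
```

Retries are not counted here, so in practice the interval should leave headroom for pages that need more than one attempt.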

Verification snippet

Before promoting the parser, prove it consumes every page, resumes cleanly, and refuses to loop on a stale cursor — using a fake paginator so no live OTA credentials are needed.

python

import tempfile
from pathlib import Path


def test_walk_consumes_all_pages_then_stops():
    # Fake two data pages, then a page that echoes a cursor but carries no data.
    pages = [
        {"data": [{"propertyId": "P1", "roomTypeCode": "dbl-std ", "ratePlanCode": "bar",
                   "baseRate": 149.0, "currency": "EUR", "inventoryCount": 5,
                   "effectiveDate": "2026-07-10"}],
         "pagination": {"next_cursor": "c2"}},
        {"data": [{"propertyId": "P1", "roomTypeCode": "DBL-STD", "ratePlanCode": "BAR",
                   "baseRate": 159.0, "currency": "EUR", "inventoryCount": 3,
                   "effectiveDate": "2026-07-11"}],
         "pagination": {"next_cursor": "c3"}},
        {"data": [], "pagination": {"next_cursor": "c3"}},  # stale cursor trap
    ]
    # Use a throwaway state file so a cursor left by an earlier run can't leak in.
    state = PaginationState(Path(tempfile.mkdtemp()) / "ota_pagination_state.json")
    parser = OTAPaginationParser("booking_com", "https://x", "k", state)
    parser._fetch_page = lambda cursor: pages.pop(0)

    records = list(parser.walk())
    assert len(records) == 2                      # both data pages consumed
    assert records[0].room_type_code == "DBL-STD" # normalized despite "dbl-std "
    assert parser.state.get("booking_com") is None  # state cleared on completion
    print("pagination walk OK")


test_walk_consumes_all_pages_then_stops()

The assertions cover the three properties that matter most: every non-empty page is yielded, the field_validator normalized a messy room_type_code, and the empty-page-with-cursor case terminates and clears state instead of looping. In CI, additionally assert that the count of pagination_complete log events equals the number of OTAs walked per cycle — any shortfall means a walk died silently mid-pagination.
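A minimal way to run that CI check is to scan the captured log output. The sketch assumes logs are rendered as one JSON object per line, as the JSONRenderer configuration in Step 4 produces, with the event name under structlog's default event key; otas_missing_completion and the sample lines are illustrative.

```python
import json


def otas_missing_completion(log_lines: list[str], expected_otas: set[str]) -> set[str]:
    """Return the OTAs that never logged a pagination_complete event."""
    completed = set()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON noise in the captured output
        if isinstance(entry, dict) and entry.get("event") == "pagination_complete":
            completed.add(entry.get("ota"))
    return expected_otas - completed


captured = [
    '{"ota": "booking_com", "pages": 12, "event": "pagination_complete"}',
    '{"ota": "expedia", "cursor": "c7", "error": "timeout", "event": "page_fetch_failed"}',
]
print(otas_missing_completion(captured, {"booking_com", "expedia"}))  # {'expedia'}
```

Returning the set of missing OTAs rather than a bare count tells the on-call engineer which walk to investigate.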

Async Polling for Inventory Updates — the polling loop this parser feeds pages into
Handling OTA API Rate Limits — the request budget that pagination multiplies against
Implementing Exponential Backoff in Python — the backoff math behind the retry adapter
Categorizing 4xx vs 5xx Sync Errors — which statuses the parser should and should not retry
Building Batch Reconciliation Scripts for Daily Syncs — where a full paginated snapshot is diffed against the PMS baseline
