
OpenClaw + Polymarket: Build a Local Trading Agent With Historical L2 Data

A research-hardened guide to using OpenClaw for Polymarket workflows with strict agent contracts, historical L2 execution checks, and reproducible backtests.

PolymarketData Team
Image: developer workstation running local agent workflows and market analysis (credit: Glenn Carstens-Peters)

Most AI trading demos for prediction markets share a failure mode: they look impressive in a notebook and fall apart the moment you try to run them twice. The agent produces different reasoning for the same market state. Fills get assumed at midpoint. The decision log is a pile of unstructured text. There's no way to tell whether the backtest results came from the model or the assumptions baked into the simulation.

OpenClaw combined with historical L2 data solves this if you treat the agent output contract as the first thing you build — not an afterthought. Here's the architecture that actually holds up.

What OpenClaw adds to a Polymarket research pipeline

OpenClaw is a local-first agent execution framework built for tool use and browser automation, with explicit logging and step isolation. For Polymarket research, that means you can control precisely which tools the agent is allowed to call, ensure every decision step is persisted with its feature inputs, and validate agent output before it touches any execution simulation logic.

The alternative — wiring an LLM directly to a trading function with minimal guardrails — produces demos. The goal here is a system you'd be willing to put real money behind.

The output contract: define this before writing any agent code

The single most impactful thing you can do for a Polymarket agent is specify its output schema before you write a single agent prompt. The contract forces the model into a structured decision space and makes every output auditable by inspection.

Here's a contract that works in practice:

You are a Polymarket trading agent. You will receive a JSON object containing
market features. Return a JSON object only — no commentary, no markdown.

Output schema:
{
  "action": "buy_yes" | "buy_no" | "hold",
  "size": <integer, contracts>,
  "confidence": <float, 0.0–1.0>,
  "max_slippage_bps": <integer>,
  "rationale": "<one sentence>"
}

Hard constraints — these are non-negotiable:
- If spread_bps > 60: return {"action": "hold", ...}
- If expected_fill_ratio < 0.85: reduce size by 50%
- Never assume midpoint fills
- Use only the features provided in the input object
- If any required feature is null, return {"action": "hold", ...}

The hard constraints in the prompt aren't a suggestion — they're a fallback in case the model starts hallucinating edge cases. The actual enforcement happens in your pipeline code after you receive the output.
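The wiring between contract, features, and output is mostly plumbing. Here's a minimal sketch of one decision step — `call_agent` is a stub standing in for whatever OpenClaw invocation you configure (its real call signature depends on your setup), and `CONTRACT` abbreviates the prompt above; the stub returns a canned hold so the round trip can be exercised end to end:

```python
import json

# Hypothetical wiring for one decision step. `call_agent` is a placeholder
# for your configured OpenClaw invocation; it returns a canned response so
# the plumbing can be tested without a model in the loop.
CONTRACT = "You are a Polymarket trading agent. Return a JSON object only."

def call_agent(system_prompt: str, features: dict) -> str:
    # Placeholder — swap in the real agent call here.
    return json.dumps({
        "action": "hold",
        "size": 0,
        "confidence": 0.0,
        "max_slippage_bps": 50,
        "rationale": "stub response",
    })

def decision_step(features: dict) -> dict:
    raw = call_agent(CONTRACT, features)
    decision = json.loads(raw)  # the validation layer runs after this point
    # Persist both sides of the exchange before acting on anything
    return {"features": features, "raw_output": raw, "decision": decision}

record = decision_step({"slug": "example-market", "spread_bps": 42.0})
```

The point of returning the full record rather than just the decision: the raw output and the feature object travel together from the first line of code, so auditability isn't bolted on later.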

Feeding the agent real features from the API

The agent is only as good as the features you give it. Build a feature object that includes both the signal-relevant data (price momentum, volume trend) and the execution-relevant data (current spread, top-of-book depth). An agent that can only see price history will inevitably try to trade into illiquid windows.

import os
import requests
import json

API_KEY = os.environ["POLYMARKETDATA_API_KEY"]
BASE = "https://api.polymarketdata.co/v1"
HEADERS = {"X-API-Key": API_KEY}


def build_agent_features(slug: str, signal_ts: str) -> dict:
    """
    Assemble the feature object the agent will reason over.
    All features are explicit and versioned — no hidden state.
    """
    # Prices: recent window for momentum features
    prices_r = requests.get(
        f"{BASE}/markets/{slug}/prices",
        headers=HEADERS,
        params={
            "end_ts": signal_ts,
            "resolution": "5m",
            "limit": 24,  # last 2 hours at 5-min resolution
        },
        timeout=30,
    )
    prices_r.raise_for_status()
    prices = [float(p["p"]) for p in prices_r.json()["data"]]

    # Metrics: spread, liquidity, volume at signal time
    metrics_r = requests.get(
        f"{BASE}/markets/{slug}/metrics",
        headers=HEADERS,
        params={
            "end_ts": signal_ts,
            "resolution": "5m",
            "limit": 1,
        },
        timeout=30,
    )
    metrics_r.raise_for_status()
    metrics_data = metrics_r.json()["data"]
    latest_metrics = metrics_data[-1] if metrics_data else {}

    # Books: current depth snapshot
    books_r = requests.get(
        f"{BASE}/markets/{slug}/books",
        headers=HEADERS,
        params={
            "end_ts": signal_ts,
            "resolution": "5m",
            "limit": 1,
        },
        timeout=30,
    )
    books_r.raise_for_status()
    books_data = books_r.json()["data"]
    latest_book = books_data[-1] if books_data else {}

    # Compute derived features
    if len(prices) >= 13:
        # 1h lookback = 12 five-minute bars
        ret_1h = (prices[-1] - prices[-13]) / prices[-13] if prices[-13] > 0 else None
    else:
        ret_1h = None
    if len(prices) >= 2:
        ret_2h = (prices[-1] - prices[0]) / prices[0] if prices[0] > 0 else None
    else:
        ret_2h = None

    top_ask_depth = float(latest_book["asks"][0][1]) if latest_book.get("asks") else None
    top_bid_depth = float(latest_book["bids"][0][1]) if latest_book.get("bids") else None

    return {
        "slug": slug,
        "signal_ts": signal_ts,
        "current_price": prices[-1] if prices else None,
        "ret_1h": round(ret_1h, 5) if ret_1h is not None else None,
        "ret_2h": round(ret_2h, 5) if ret_2h is not None else None,
        "spread_bps": float(latest_metrics["spread"]) * 10_000 if latest_metrics.get("spread") is not None else None,
        "volume_5m": float(latest_metrics["volume"]) if latest_metrics.get("volume") is not None else None,
        "top_ask_depth": top_ask_depth,
        "top_bid_depth": top_bid_depth,
    }

Persist this feature object alongside every agent decision. If you can't reconstruct exactly what the agent saw when it made a decision, the backtest isn't auditable.
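One way to do that persistence is an append-only JSONL file, one decision per line. The field names below (`prompt_version`, `features`, `raw_output`) are illustrative choices, not an OpenClaw convention:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Append-only audit trail: one JSON object per line, written before any
# parsing or override logic touches the agent output.
LOG_PATH = Path("decision_log.jsonl")

def log_decision(prompt_version: str, features: dict, raw_output: str) -> None:
    entry = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "features": features,      # exactly what the agent saw
        "raw_output": raw_output,  # before parsing and overrides
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

log_decision("v3", {"slug": "example-market", "spread_bps": 42.0}, '{"action": "hold"}')
```

JSONL keeps writes atomic per decision and replayable line by line — no database needed until your run counts get large.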

Validating and enforcing the agent output

The prompt contract gets you most of the way there. The validation layer handles the rest — parsing failures, out-of-range values, constraint violations that slipped through.

import json
from dataclasses import dataclass
from typing import Literal

@dataclass
class AgentDecision:
    action: Literal["buy_yes", "buy_no", "hold"]
    size: int
    confidence: float
    max_slippage_bps: int
    rationale: str


def parse_agent_output(raw_output: str, features: dict) -> AgentDecision:
    """
    Parse and validate agent JSON output. Apply hard constraint overrides.
    Raises ValueError if output is structurally invalid.
    """
    try:
        data = json.loads(raw_output.strip())
    except json.JSONDecodeError as e:
        raise ValueError(f"Agent output not valid JSON: {e}\n\nRaw: {raw_output[:200]}")

    action = data.get("action")
    if action not in ("buy_yes", "buy_no", "hold"):
        raise ValueError(f"Invalid action: {action}")

    size = int(data.get("size", 0))
    if size < 0:
        raise ValueError(f"Invalid size: {size}")
    confidence = float(data.get("confidence", 0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"Confidence out of range: {confidence}")
    max_slippage_bps = int(data.get("max_slippage_bps", 50))
    rationale = str(data.get("rationale", ""))

    # Hard constraint enforcement — override agent if needed
    spread_bps = features.get("spread_bps")
    if spread_bps is not None and spread_bps > 60:
        action = "hold"
        rationale = f"[OVERRIDE: spread {spread_bps:.0f} bps > 60 threshold] " + rationale

    # Null-feature constraint from the contract; which features count as
    # required is a choice you make per strategy
    if any(features.get(k) is None for k in ("current_price", "spread_bps")):
        action = "hold"
        rationale = "[OVERRIDE: required feature null] " + rationale

    return AgentDecision(
        action=action,
        size=size,
        confidence=confidence,
        max_slippage_bps=max_slippage_bps,
        rationale=rationale,
    )

The [OVERRIDE] annotation in the rationale is deliberate. When you review your logs, you want to know which decisions came from the model and which got overridden by the constraint layer.
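That makes post-hoc auditing a one-line scan. A minimal sketch, assuming decisions are stored as dicts with the rationale field shown above:

```python
# Separate model-originated decisions from constraint-layer overrides by
# scanning for the [OVERRIDE prefix the enforcement layer prepends.
def override_rate(decisions: list[dict]) -> float:
    if not decisions:
        return 0.0
    overridden = sum(
        1 for d in decisions if d.get("rationale", "").startswith("[OVERRIDE")
    )
    return overridden / len(decisions)

decisions = [
    {"action": "hold", "rationale": "[OVERRIDE: spread 75 bps > 60 threshold] momentum favors yes"},
    {"action": "buy_yes", "rationale": "momentum favors yes, depth adequate"},
]
```

A persistently high override rate is its own signal: the model isn't internalizing the contract, and the constraint layer is doing the strategy's work.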

L2-aware execution simulation

After the agent produces a validated decision, simulate the fill against the historical book. This is where midpoint-fill assumptions get replaced with reality.

def simulate_fill(decision: AgentDecision, book_snapshot: dict) -> dict:
    """
    Simulate a weighted fill from the historical order book.
    Returns execution result with slippage in bps.
    """
    if decision.action == "hold" or decision.size <= 0:
        return {"status": "hold", "filled": 0, "slippage_bps": None}

    side = "asks" if decision.action == "buy_yes" else "bids"
    levels = book_snapshot.get(side, [])

    if not levels:
        return {"status": "no_depth", "filled": 0, "slippage_bps": None}

    remaining = float(decision.size)
    filled = notional = 0.0

    for price, size in levels:
        take = min(remaining, float(size))
        notional += take * float(price)
        filled += take
        remaining -= take
        if remaining <= 0:
            break

    if filled == 0:
        return {"status": "unfillable", "filled": 0, "slippage_bps": None}

    avg_fill = notional / filled
    ref_price = float(levels[0][0])  # best quote as reference
    if decision.action == "buy_yes":
        slippage = (avg_fill - ref_price) / ref_price * 10_000
    else:
        slippage = (ref_price - avg_fill) / ref_price * 10_000

    fill_ratio = filled / decision.size

    # Risk gate: reject if slippage exceeds agent's stated tolerance
    if slippage > decision.max_slippage_bps:
        return {
            "status": "rejected_slippage",
            "avg_fill": avg_fill,
            "slippage_bps": slippage,
            "filled": 0,
        }

    return {
        "status": "filled",
        "avg_fill": avg_fill,
        "slippage_bps": slippage,
        "filled": filled,
        "fill_ratio": fill_ratio,
    }

The rejected_slippage status is important. It means the agent wanted to trade but the execution cost exceeded its own stated tolerance. Log these — if they're frequent, either the agent's max_slippage_bps parameter is too conservative or the strategy is systematically targeting illiquid windows.
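A status breakdown over a whole run makes that diagnosis quick. A small sketch over the result dicts `simulate_fill` returns (the sample data here is made up):

```python
from collections import Counter

# Summarize fill-simulation outcomes across a backtest run. A large
# rejected_slippage share means either max_slippage_bps is too tight or
# the strategy keeps firing into thin books.
def fill_status_summary(results: list[dict]) -> dict:
    counts = Counter(r["status"] for r in results)
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}

results = [
    {"status": "filled"},
    {"status": "filled"},
    {"status": "rejected_slippage"},
    {"status": "hold"},
]
summary = fill_status_summary(results)
```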

Failure modes worth logging explicitly

Three failure patterns show up consistently in agent-driven Polymarket backtests. The first is non-reproducibility: the agent produces different reasoning for identical feature inputs across runs. Fix this by persisting your prompt version and the exact feature object for every decision — if you can't reproduce a result, it doesn't mean anything. The second is scale blindness: a strategy looks profitable at 500-contract size and collapses at 5,000, because slippage curves are nonlinear and the agent's size parameter never got pressure-tested. Fix this by running your backtest across multiple target-size buckets. The third is conflating decision quality with execution quality: a high signal hit rate doesn't mean the strategy is tradable, and an agent that calls the outcome correctly but triggers in illiquid windows will lose money. Track both KPIs separately in every report.

The operational discipline that makes it auditable

Pin versions of everything: your prompt, your feature extraction code, your OpenClaw configuration. Keep research API keys isolated from any live-trading credentials. For each backtest run, save the full run metadata including the feature object for every decision, the raw agent output before parsing, the parsed decision after validation and overrides, and the fill simulation result. That trail is what separates a system you can trust from a demo that only worked once.

A first useful deployment: pick one market, run it for a week with the full logging stack, and compare midpoint-fill PnL against L2-aware PnL. That single comparison will tell you more than another iteration on the signal.
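The comparison itself is trivial arithmetic — the numbers below are illustrative, not from a real run:

```python
# Same signal, same exit, two fill assumptions. Midpoint accounting books
# the trade at mid; L2 accounting books it at the weighted walked fill.
mid = 0.515         # midpoint at signal time (illustrative)
l2_fill = 0.528     # weighted fill from walking the ask side (illustrative)
exit_price = 0.60
size = 1000

pnl_midpoint = (exit_price - mid) * size
pnl_l2 = (exit_price - l2_fill) * size
overstatement = pnl_midpoint - pnl_l2  # what midpoint accounting hides
```

If that overstatement, summed across a week of trades, is a meaningful fraction of reported PnL, the strategy's edge lives in the fill assumption rather than the signal.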


All data from the polymarketdata.co API. Full endpoint reference at polymarketdata.co/docs. OpenClaw documentation at docs.openclaw.ai.