OpenClaw + Polymarket: Build a Local Trading Agent With Historical L2 Data
A research-hardened guide to using OpenClaw for Polymarket workflows with strict agent contracts, historical L2 execution checks, and reproducible backtests.

Most AI trading demos for prediction markets share a failure mode: they look impressive in a notebook and fall apart the moment you try to run them twice. The agent produces different reasoning for the same market state. Fills get assumed at midpoint. The decision log is a pile of unstructured text. There's no way to tell whether the backtest results came from the model or the assumptions baked into the simulation.
OpenClaw combined with historical L2 data solves this if you treat the agent output contract as the first thing you build — not an afterthought. Here's the architecture that actually holds up.
What OpenClaw adds to a Polymarket research pipeline
OpenClaw is a local-first agent execution framework built for tool use and browser automation, with explicit logging and step isolation. For Polymarket research, that means you can control precisely which tools the agent is allowed to call, ensure every decision step is persisted with its feature inputs, and validate agent output before it touches any execution simulation logic.
The alternative — wiring an LLM directly to a trading function with minimal guardrails — produces demos. The goal here is a system you'd be willing to put real money behind.
The output contract: define this before writing any agent code
The single most impactful thing you can do for a Polymarket agent is specify its output schema before you write a single agent prompt. The contract forces the model into a structured decision space and makes every output auditable by inspection.
Here's a contract that works in practice:
```
You are a Polymarket trading agent. You will receive a JSON object containing
market features. Return a JSON object only — no commentary, no markdown.

Output schema:
{
  "action": "buy_yes" | "buy_no" | "hold",
  "size": <integer, contracts>,
  "confidence": <float, 0.0–1.0>,
  "max_slippage_bps": <integer>,
  "rationale": "<one sentence>"
}

Hard constraints — these are non-negotiable:
- If spread_bps > 60: return {"action": "hold", ...}
- If expected_fill_ratio < 0.85: reduce size by 50%
- Never assume midpoint fills
- Use only the features provided in the input object
- If any required feature is null, return {"action": "hold", ...}
```
The hard constraints in the prompt aren't a suggestion — they're a fallback in case the model starts hallucinating edge cases. The actual enforcement happens in your pipeline code after you receive the output.
Feeding the agent real features from the API
The agent is only as good as the features you give it. Build a feature object that includes both the signal-relevant data (price momentum, volume trend) and the execution-relevant data (current spread, top-of-book depth). An agent that can only see price history will inevitably try to trade into illiquid windows.
```python
import os
import requests

API_KEY = os.environ["POLYMARKETDATA_API_KEY"]
BASE = "https://api.polymarketdata.co/v1"
HEADERS = {"X-API-Key": API_KEY}


def build_agent_features(slug: str, signal_ts: str) -> dict:
    """
    Assemble the feature object the agent will reason over.
    All features are explicit and versioned — no hidden state.
    """
    # Prices: recent window for momentum features
    prices_r = requests.get(
        f"{BASE}/markets/{slug}/prices",
        headers=HEADERS,
        params={
            "end_ts": signal_ts,
            "resolution": "5m",
            "limit": 24,  # last 2 hours at 5-min resolution
        },
        timeout=30,
    )
    prices_r.raise_for_status()
    prices = [float(p["p"]) for p in prices_r.json()["data"]]

    # Metrics: spread, liquidity, volume at signal time
    metrics_r = requests.get(
        f"{BASE}/markets/{slug}/metrics",
        headers=HEADERS,
        params={"end_ts": signal_ts, "resolution": "5m", "limit": 1},
        timeout=30,
    )
    metrics_r.raise_for_status()
    metrics_data = metrics_r.json()["data"]
    latest_metrics = metrics_data[-1] if metrics_data else {}

    # Books: current depth snapshot
    books_r = requests.get(
        f"{BASE}/markets/{slug}/books",
        headers=HEADERS,
        params={"end_ts": signal_ts, "resolution": "5m", "limit": 1},
        timeout=30,
    )
    books_r.raise_for_status()
    books_data = books_r.json()["data"]
    latest_book = books_data[-1] if books_data else {}

    # Compute derived features. The 1h return looks back 12 bars, so it
    # needs at least 13 prices; guard each lookback separately.
    ret_1h = None
    ret_2h = None
    if len(prices) >= 13 and prices[-12] > 0:
        ret_1h = (prices[-1] - prices[-12]) / prices[-12]
    if len(prices) >= 2 and prices[0] > 0:
        ret_2h = (prices[-1] - prices[0]) / prices[0]

    top_ask_depth = float(latest_book["asks"][0][1]) if latest_book.get("asks") else None
    top_bid_depth = float(latest_book["bids"][0][1]) if latest_book.get("bids") else None

    spread = latest_metrics.get("spread")
    return {
        "slug": slug,
        "signal_ts": signal_ts,
        "current_price": prices[-1] if prices else None,
        "ret_1h": round(ret_1h, 5) if ret_1h is not None else None,
        "ret_2h": round(ret_2h, 5) if ret_2h is not None else None,
        "spread_bps": float(spread) * 10_000 if spread else None,
        "volume_5m": float(latest_metrics.get("volume", 0)) if latest_metrics else None,
        "top_ask_depth": top_ask_depth,
        "top_bid_depth": top_bid_depth,
    }
```
Persist this feature object alongside every agent decision. If you can't reconstruct exactly what the agent saw when it made a decision, the backtest isn't auditable.
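A minimal persistence sketch, assuming an append-only JSONL decision log. The path, field names, and `log_decision` helper are illustrative, not an OpenClaw or API convention:

```python
import json
import time
from pathlib import Path

# Hypothetical decision log: one JSON line per decision, pairing the exact
# feature object with the raw and parsed agent output.
LOG_PATH = Path("decision_log.jsonl")


def log_decision(features: dict, raw_output: str, parsed: dict,
                 prompt_version: str) -> None:
    record = {
        "logged_at": time.time(),
        "prompt_version": prompt_version,  # pin so runs are reproducible
        "features": features,              # exactly what the agent saw
        "raw_output": raw_output,          # before parsing and overrides
        "decision": parsed,                # after the validation layer
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")


log_decision(
    {"slug": "example-market", "spread_bps": 42.0},
    '{"action": "hold"}',
    {"action": "hold", "size": 0},
    prompt_version="v3",
)
```

Append-only JSONL keeps every run's trail intact; nothing is ever rewritten, so the log itself becomes the audit record.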
Validating and enforcing the agent output
The prompt contract gets you most of the way there. The validation layer handles the rest — parsing failures, out-of-range values, constraint violations that slipped through.
```python
import json
from dataclasses import dataclass
from typing import Literal


@dataclass
class AgentDecision:
    action: Literal["buy_yes", "buy_no", "hold"]
    size: int
    confidence: float
    max_slippage_bps: int
    rationale: str


def parse_agent_output(raw_output: str, features: dict) -> AgentDecision:
    """
    Parse and validate agent JSON output. Apply hard constraint overrides.
    Raises ValueError if output is structurally invalid.
    """
    try:
        data = json.loads(raw_output.strip())
    except json.JSONDecodeError as e:
        raise ValueError(f"Agent output not valid JSON: {e}\n\nRaw: {raw_output[:200]}")

    action = data.get("action")
    if action not in ("buy_yes", "buy_no", "hold"):
        raise ValueError(f"Invalid action: {action}")

    size = int(data.get("size", 0))
    confidence = float(data.get("confidence", 0))
    max_slippage_bps = int(data.get("max_slippage_bps", 50))
    rationale = str(data.get("rationale", ""))

    # Hard constraint enforcement — override agent if needed
    spread_bps = features.get("spread_bps")
    if spread_bps is not None and spread_bps > 60:
        action = "hold"
        rationale = f"[OVERRIDE: spread {spread_bps:.0f} bps > 60 threshold] " + rationale

    return AgentDecision(
        action=action,
        size=size,
        confidence=confidence,
        max_slippage_bps=max_slippage_bps,
        rationale=rationale,
    )
```
The [OVERRIDE] annotation in the rationale is deliberate. When you review your logs, you want to know which decisions came from the model and which got overridden by the constraint layer.
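A small audit sketch over already-parsed decisions, counting how many rationales carry the override prefix that `parse_agent_output` attaches. The `override_rate` helper and sample records are hypothetical:

```python
# Count how often the constraint layer overrode the model. The "[OVERRIDE"
# prefix is the marker the validation layer prepends to the rationale.
def override_rate(decisions: list[dict]) -> float:
    if not decisions:
        return 0.0
    overridden = sum(
        1 for d in decisions if d.get("rationale", "").startswith("[OVERRIDE")
    )
    return overridden / len(decisions)


sample = [
    {"action": "hold", "rationale": "[OVERRIDE: spread 85 bps > 60 threshold] momentum long"},
    {"action": "buy_yes", "rationale": "2h momentum positive with tight spread"},
    {"action": "hold", "rationale": "required feature was null"},
]
print(override_rate(sample))  # → 0.3333333333333333
```

A rising override rate over time usually means the prompt contract and the market regime have drifted apart; it is worth alerting on.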
L2-aware execution simulation
After the agent produces a validated decision, simulate the fill against the historical book. This is where midpoint-fill assumptions get replaced with reality.
```python
def simulate_fill(decision: AgentDecision, book_snapshot: dict) -> dict:
    """
    Simulate a weighted fill from the historical order book.
    Returns execution result with slippage in bps.
    """
    if decision.action == "hold":
        return {"status": "hold", "filled": 0, "slippage_bps": None}

    side = "asks" if decision.action == "buy_yes" else "bids"
    levels = book_snapshot.get(side, [])
    if not levels:
        return {"status": "no_depth", "filled": 0, "slippage_bps": None}

    remaining = float(decision.size)
    filled = notional = 0.0
    for price, size in levels:
        take = min(remaining, float(size))
        notional += take * float(price)
        filled += take
        remaining -= take
        if remaining <= 0:
            break

    if filled == 0:
        return {"status": "unfillable", "filled": 0, "slippage_bps": None}

    avg_fill = notional / filled
    ref_price = float(levels[0][0])  # best quote as reference
    if decision.action == "buy_yes":
        slippage = (avg_fill - ref_price) / ref_price * 10_000
    else:
        slippage = (ref_price - avg_fill) / ref_price * 10_000
    fill_ratio = filled / decision.size

    # Risk gate: reject if slippage exceeds agent's stated tolerance
    if slippage > decision.max_slippage_bps:
        return {
            "status": "rejected_slippage",
            "avg_fill": avg_fill,
            "slippage_bps": slippage,
            "filled": 0,
        }

    return {
        "status": "filled",
        "avg_fill": avg_fill,
        "slippage_bps": slippage,
        "filled": filled,
        "fill_ratio": fill_ratio,
    }
```
The rejected_slippage status is important. It means the agent wanted to trade but the execution cost exceeded its own stated tolerance. Log these — if they're frequent, either the agent's max_slippage_bps parameter is too conservative or the strategy is systematically targeting illiquid windows.
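One way to surface that pattern is a status breakdown over a run's simulation results. The `status_breakdown` helper is illustrative; the status strings match `simulate_fill` above:

```python
from collections import Counter

# Tally fill-simulation outcomes across a backtest run. A high share of
# "rejected_slippage" means the agent keeps wanting trades whose execution
# cost exceeds its own stated tolerance.
def status_breakdown(results: list[dict]) -> dict:
    counts = Counter(r["status"] for r in results)
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


results = [
    {"status": "filled"}, {"status": "filled"},
    {"status": "rejected_slippage"}, {"status": "hold"},
]
print(status_breakdown(results))  # → {'filled': 0.5, 'rejected_slippage': 0.25, 'hold': 0.25}
```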
Failure modes worth logging explicitly
Three failure patterns show up consistently in agent-driven Polymarket backtests:
- Non-reproducibility: the agent produces different reasoning for identical feature inputs across runs. Fix this by persisting your prompt version and the exact feature object for every decision — if you can't reproduce a result, it doesn't mean anything.
- Scale blindness: a strategy looks profitable at 500-contract size and collapses at 5,000, because slippage curves are nonlinear and the agent's size parameter never got pressure-tested. Fix this by running your backtest across multiple target-size buckets.
- Conflating decision quality with execution quality: a high signal hit rate doesn't mean the strategy is tradable, and an agent that calls the outcome correctly but triggers in illiquid windows will lose money. Track both KPIs separately in every report.
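The scale-blindness point can be sketched directly: walking one illustrative ask book at several target sizes shows slippage growing faster than linearly once size exceeds top-of-book depth. The book levels and the `sweep_slippage` helper are made up for illustration:

```python
# (price, depth) ask levels for a hypothetical market snapshot.
book_asks = [(0.52, 800), (0.54, 1200), (0.57, 3000), (0.62, 5000)]


def sweep_slippage(asks, sizes):
    """Walk the book at each target size; return slippage in bps vs best ask."""
    out = {}
    for target in sizes:
        remaining, filled, notional = float(target), 0.0, 0.0
        for price, depth in asks:
            take = min(remaining, float(depth))
            notional += take * price
            filled += take
            remaining -= take
            if remaining <= 0:
                break
        avg = notional / filled
        out[target] = round((avg - asks[0][0]) / asks[0][0] * 10_000, 1)
    return out


print(sweep_slippage(book_asks, [500, 2000, 5000]))
```

On this book, 500 contracts fill at the touch with zero slippage, 2,000 cost roughly 231 bps, and 5,000 cost roughly 669 bps: a 10x size increase, far more than 10x the cost.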
The operational discipline that makes it auditable
Pin versions of everything: your prompt, your feature extraction code, your OpenClaw configuration. Keep research API keys isolated from any live-trading credentials. For each backtest run, save the full run metadata including the feature object for every decision, the raw agent output before parsing, the parsed decision after validation and overrides, and the fill simulation result. That trail is what separates a system you can trust from a demo that only worked once.
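One possible way to pin a run is to hash the prompt text and configuration into a single fingerprint stored with every decision record. All names here are illustrative, not an OpenClaw feature:

```python
import hashlib
import json

# Collapse prompt text, feature-code version, and agent config into one
# deterministic run fingerprint. If any input changes, the fingerprint
# changes, so results from different configurations can't be conflated.
def run_fingerprint(prompt_text: str, feature_code_version: str,
                    agent_config: dict) -> str:
    payload = json.dumps(
        {
            "prompt_sha": hashlib.sha256(prompt_text.encode()).hexdigest(),
            "feature_code_version": feature_code_version,
            "agent_config": agent_config,
        },
        sort_keys=True,  # stable key order keeps the hash deterministic
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


fp = run_fingerprint(
    "You are a Polymarket trading agent...",
    "features-v3",
    {"tools": ["http_get"], "max_steps": 4},
)
print(len(fp))  # → 16
```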
A first useful deployment: pick one market, run it for a week with the full logging stack, and compare midpoint-fill PnL against L2-aware PnL. That single comparison will tell you more than another iteration on the signal.
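The comparison itself is a few lines once fills are logged. A toy sketch with invented numbers, assuming winning contracts pay out 1.0 at resolution:

```python
# Same trades, two fill models: midpoint price vs L2-simulated average fill.
# All entry prices and quantities are illustrative.
trades = [
    # (midpoint_price, l2_avg_fill, contracts, resolved_yes)
    (0.50, 0.53, 1000, True),
    (0.40, 0.48, 1000, False),
]


def pnl(use_l2: bool) -> float:
    total = 0.0
    for mid, l2, qty, won in trades:
        entry = l2 if use_l2 else mid
        payout = 1.0 if won else 0.0
        total += (payout - entry) * qty
    return total


print(pnl(use_l2=False), pnl(use_l2=True))
```

Here the midpoint model reports +100 while the L2-aware model reports -10: the entire "edge" was an artifact of assumed fills. That is the gap the week-long comparison is meant to expose.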
All data from the polymarketdata.co API. Full endpoint reference at polymarketdata.co/docs. OpenClaw documentation at docs.openclaw.ai.