Trace Design

How we trace agent runs that make sequential tool calls. Two trace shapes share one backend — LangSmith — wired via LANGSMITH_TRACING=true on the container, with the API key resolved from CF Secrets Store at cold-start. The five-phase pipeline (backend/leadgen_agent/pipeline_graph.py) emits one trace per run with phase spans for discover → enrich → contacts → qa → outreach. The three JSON-router agent loops (admin_chat, agentic_search, research_agent._run_task) are each wrapped in agent_run_span + tool_call_span so prompts, completions, tokens, model, latency, tool args, tool results, errors, retries, and per-call USD cost are all on the trace.

5 phase spans · 3 agent-loop graphs · 1 trace per run · USD cost per call · LangSmith-backed
The one-paragraph answer

Two trace shapes share one backend (LangSmith, via LANGSMITH_TRACING=true). The pipeline trace is one root per run with a phase span for each of discover → enrich → contacts → qa → outreach (fan-out phases nest one worker span per company id); every LLM call inside a node is a leaf with prompt, completion, model, input/output/total_tokens, latency_ms, retries, and cost_usd (computed against a pricing table at call time).

The agent-loop trace shape covers the three JSON-router graphs (admin_chat, agentic_search, research_agent._run_task). Each run is wrapped in an agent_run_span (chain run) whose outputs carry answer, steps, total_tokens, total_cost_usd. Every tool dispatch is a tool_call_span with args, result (or error), attempt, and latency_ms — the retries / errors signal the spec asks about. LLM calls inside the loop are auto-traced by LangChain so the parent run nests them with no callback wiring. The API key lives in CF Secrets Store (binding LANGSMITH_API_KEY) and is resolved at container cold-start; rotation is non-destructive.

The five spans

Lead-gen's pipeline LangGraph is — by construction — a five-phase sequence. Each phase becomes one span; fan-out phases nest a worker span per company id. Every LLM call inside a node is its own child span via the LangSmith auto-instrumentation. Phase nodes return a StageReport (processed, created, errors[], duration_ms) which becomes the phase span's attribute bag.
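
A minimal sketch of the StageReport-as-attribute-bag pattern described above. The field names match the ones listed; the discover_companies placeholder and the return keys are illustrative stand-ins, not the real pipeline_graph.py.

# Sketch only — illustrates how a phase node builds the StageReport that
# becomes the phase span's attribute bag. Helper names are hypothetical.
import time
from dataclasses import dataclass, field, asdict

@dataclass
class StageReport:
    processed: int = 0
    created: int = 0
    errors: list[str] = field(default_factory=list)
    duration_ms: int = 0

async def discover_companies(icp_brief: str) -> list[int]:
    return []   # placeholder for the company_discovery subgraph call

async def run_discover(state: dict) -> dict:
    t0 = time.monotonic()
    report = StageReport()
    try:
        ids = await discover_companies(state["icp_brief"])
        report.processed, report.created = 1, len(ids)
    except Exception as exc:          # soft-fail: empty ids, error kept for the span
        ids = []
        report.errors.append(str(exc))
    report.duration_ms = int((time.monotonic() - t0) * 1000)
    # the dict form is what lands on the phase span / in graph_meta
    return {"discovered_ids": ids, "graph_meta": {"discover": asdict(report)}}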

1

discover

phase span

run_discover

Brainstorm or insert seed domains. Invokes the company_discovery subgraph; either pulls from Common Crawl + a domain LLM call, or accepts an explicit DOMAINS= list from the operator.

Input: ICP brief, optional explicit domain list
Output: discovered_ids: number[] (Neon company rows)
Error mode: Soft-fails to an empty discovered_ids; StageReport carries errors[] for downstream visibility.
Span attributes
icp_summary (string, truncated) · discovered_count (int) · duration_ms (int) · llm.calls (int) · llm.cost_usd (float)
backend/leadgen_agent/pipeline_graph.py·line 158
2

enrich

phase span · fan-out

enrich_queue → enrich_one (Send-fanout) → enrich_collect

Per-company LLM enrichment. The queue node emits a Send token per id; each worker invokes company_enrichment_graph in its own subspan; the collector aggregates back into a single StageReport.

Input: _enrich_queue: number[]
Output: enriched companies written to Neon, per-company telemetry merged into graph_meta
Fan-out: LangGraph Send-tokens, one worker span per company id
Error mode: Each worker wraps a try/except; failures land in StageReport.errors keyed by company_id. The phase never aborts on a single bad row.
Span attributes
company_id (int) — on worker spans · tokens_in / tokens_out (int) · model (e.g. deepseek-v4-pro) · cost_usd (float) · errors[] (string[])
backend/leadgen_agent/pipeline_graph.py·line 271, 313, 341
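
A compressed sketch of the queue → worker → collect fan-out shape, using LangGraph's Send API. The node names mirror the ones above; the node bodies, state keys, and wiring comments are illustrative, not the real graph.

# Sketch of the enrich fan-out; bodies and reducer keys are illustrative.
from langgraph.constants import Send

def enrich_queue(state: dict) -> dict:
    # phase entry: decide which company ids to fan out
    return {"_enrich_queue": state["discovered_ids"]}

def fan_out_enrich(state: dict) -> list[Send]:
    # conditional edge: one Send token (= one worker invocation) per company id
    return [Send("enrich_one", {"company_id": cid}) for cid in state["_enrich_queue"]]

async def enrich_one(payload: dict) -> dict:
    cid = payload["company_id"]
    try:
        # stand-in for: await company_enrichment_graph.ainvoke({"company_id": cid})
        result = {"company_id": cid, "enriched": True}
        return {"_enrich_results": [result]}          # reducer appends per-worker results
    except Exception as exc:
        return {"_enrich_errors": [f"{cid}: {exc}"]}   # failure recorded, phase continues

# graph wiring (collector omitted):
# graph.add_node("enrich_queue", enrich_queue)
# graph.add_node("enrich_one", enrich_one)
# graph.add_conditional_edges("enrich_queue", fan_out_enrich, ["enrich_one"])
# graph.add_edge("enrich_one", "enrich_collect")
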
3

contacts

phase span · fan-out

contacts_queue → contact_one (Send-fanout) → contacts_collect

Per-company contact discovery + enrichment. Mirrors enrich's fan-out shape, but invokes both contact_discovery_graph and contact_enrich_graph per worker.

Input: _contacts_queue: number[]
Output: contacts rows in Neon, persona scores cached, per-contact telemetry merged
Fan-out: LangGraph Send-tokens, one worker span per company id
Error mode: Same try/except pattern as enrich. Workers that hit rate limits surface a retry counter in attrs.
Span attributes
company_id (int) · contacts_created (int) · deliverability_verified (bool) · tokens / cost / latency · retries (int) — if any
backend/leadgen_agent/pipeline_graph.py·line 403, 435, 479
4

qa

phase span

run_qa → run_qa_critic

Deterministic SQL row counts followed by an LLM critic that audits a sample of enriched rows for plausibility. Critic is reflection-style: the model rates its own evidence.

Input: Neon snapshot of the just-written enrichment rows
Output: qa_report jsonb (row counts, plausibility scores, critic comments)
Error mode: Critic gracefully short-circuits to an empty sample if the LLM call fails; the deterministic counts always succeed.
Span attributes
row_count (int) · sample_size (int) · critic_pass_rate (float, 0..1) · critic_model (string) · critic_cost_usd (float)
backend/leadgen_agent/pipeline_graph.py·line 513, 619
5

outreach

phase span · fan-out

outreach_queue → outreach_run (with human-gate interrupt)

Compose email drafts for approved contacts. When auto_confirm=False, the queue node calls interrupt() and the trace pauses on a checkpoint; on resume, outreach_run invokes email_outreach_graph per recipient.

Input: approved contact ids, persona scores, template selection
Output: email_drafts rows in Neon, optional send via Resend
Fan-out: Sequential per recipient (no Send) — keeps quota visible per call
Error mode: DB fallback path when the last_outreach_at column is missing (schema drift). Interrupt resumes via thread checkpoint; the full prior trace is preserved.
Span attributes
contact_id (int) · template_key (string) · drafts_created (int) · human_decision (approve|skip) · interrupt_resume_at (timestamp)
backend/leadgen_agent/pipeline_graph.py·line 774, 819, 847
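
A compressed sketch of the human-gate shape — interrupt() pausing the run on a checkpoint and the resume path passing the operator's decision back in. The payload keys, state keys, and thread-id handling are illustrative; only the interrupt/Command mechanics are the point.

# Sketch of the outreach human gate; node internals are illustrative.
from langgraph.types import interrupt, Command

def outreach_queue(state: dict) -> dict:
    if not state.get("auto_confirm", False):
        # Pauses the run on a checkpoint; the value returned here is whatever
        # the operator later passes back in Command(resume=...).
        decision = interrupt({"pending_contacts": state["approved_contact_ids"]})
        if decision != "approve":
            return {"_outreach_queue": []}
    return {"_outreach_queue": state["approved_contact_ids"]}

# Resuming from the API layer (same thread_id as the paused run):
# config = {"configurable": {"thread_id": run_thread_id}}
# pipeline_graph.invoke(Command(resume="approve"), config=config)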

The other shape: JSON-router agent loops

Three graphs predate create_react_agent: they parse {"tool": ..., "args": ...} JSON from the LLM and execute Python tools directly. LangChain's auto-tracer can't see the tool dispatch step in this pattern. The fix: wrap each loop in agent_run_span (parent chain run) and each tool call in tool_call_span. Both are no-ops when LANGSMITH_TRACING is not true — strictly additive.

admin_chat.chat

max 6 steps · agent_run_span

Read-only DB Q&A. Each step parses {tool, args} JSON, dispatches to a Postgres helper (identifier-quoted, read-only transaction, 5-second statement timeout), and feeds the observation back as the next user turn. Used by the admin UI to answer ad-hoc questions like 'how many companies are in the database?' without scaffolding a custom SQL UI.

Tools
count_rows · inspect_schema · query_db
backend/leadgen_agent/admin_chat_graph.py·line 154
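
A sketch of the kind of helper those tools dispatch to — identifier quoting via psycopg's sql module, a read-only transaction, and a local 5-second statement timeout. The connection handling and function name are illustrative, not the real admin_chat helpers.

# Sketch only — shows the quoting + read-only + timeout pattern.
from psycopg import AsyncConnection, sql

async def count_rows(conn: AsyncConnection, table: str) -> str:
    async with conn.transaction():
        await conn.execute("SET TRANSACTION READ ONLY")
        await conn.execute("SET LOCAL statement_timeout = '5s'")
        # Identifier quoting stops the model from smuggling SQL through the table name.
        cur = await conn.execute(
            sql.SQL("SELECT count(*) FROM {}").format(sql.Identifier(table))
        )
        (n,) = await cur.fetchone()
    return f"{table}: {n} rows"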

agentic_search._tool_loop

max 8 steps · agent_run_span

Codebase search worker. Cost-aware tool hierarchy: glob (near-zero, paths only) → grep (file:line matches) → read (full file with line numbers). The decompose flow fans out three workers in parallel, each running its own tool_loop; synthesize_node joins the answers into one report with file:line citations.

Tools
glob · grep · read
backend/leadgen_agent/agentic_search_graph.py·line 432
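
A compact sketch of the cost-ordered tool tier (paths only → file:line matches → full file). Root handling, truncation limits, and output formats are illustrative, not the real agentic_search tools.

# Sketch of the glob → grep → read cost hierarchy; limits are illustrative.
from pathlib import Path

ROOT = Path(".")  # repo root (illustrative)

def tool_glob(pattern: str, limit: int = 200) -> str:
    # Cheapest: file paths only.
    return "\n".join(str(p) for p in sorted(ROOT.glob(pattern))[:limit])

def tool_grep(needle: str, pattern: str = "**/*.py", limit: int = 200) -> str:
    # Mid-cost: file:line matches.
    hits: list[str] = []
    for p in ROOT.glob(pattern):
        try:
            for i, line in enumerate(p.read_text(errors="ignore").splitlines(), 1):
                if needle in line:
                    hits.append(f"{p}:{i}: {line.strip()}")
        except OSError:
            continue
        if len(hits) >= limit:
            break
    return "\n".join(hits[:limit])

def tool_read(path: str) -> str:
    # Most expensive: full file with line numbers, for citation-quality answers.
    text = (ROOT / path).read_text(errors="ignore")
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), 1))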

research_agent._run_task

max 10 steps · agent_run_span

Semantic Scholar paper search. House style: ≥3 search_papers with different queries, then get_paper_detail on the 3–4 most promising papers, weight 2020+ heavily, report confidence honestly. plan_node decomposes the brief into tasks; each task runs its own loop; synthesize_node merges them.

Tools
search_papers · get_paper_detail
backend/leadgen_agent/research_agent_graph.py·line 830
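
A sketch of what the two tools might wrap — the public Semantic Scholar Graph API's paper-search and paper-detail endpoints. The field selection, timeouts, and result formatting are illustrative, not the real research_agent tools.

# Sketch only — endpoint shapes follow the Semantic Scholar Graph API.
import httpx

S2 = "https://api.semanticscholar.org/graph/v1"

async def search_papers(query: str, limit: int = 10) -> str:
    async with httpx.AsyncClient(timeout=20) as client:
        r = await client.get(f"{S2}/paper/search",
                             params={"query": query, "limit": limit,
                                     "fields": "title,year,citationCount,paperId"})
        r.raise_for_status()
    papers = r.json().get("data", [])
    return "\n".join(f"{p['paperId']}  {p.get('year')}  {p.get('citationCount', 0)}  {p['title']}"
                     for p in papers)

async def get_paper_detail(paper_id: str) -> str:
    async with httpx.AsyncClient(timeout=20) as client:
        r = await client.get(f"{S2}/paper/{paper_id}",
                             params={"fields": "title,abstract,year,authors,citationCount"})
        r.raise_for_status()
    return str(r.json())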

Span helpers (langsmith_setup.py)

Two context managers, both strict no-ops when LangSmith is disabled. Caller pattern: wrap the whole loop in agent_run_span; wrap each tool dispatch in tool_call_span and pipe the result (or exception) through the yielded finish(result=…, error=…) callback.

@contextmanager
def agent_run_span(name: str, *, metadata=None) -> Iterator[Run | None]:
    """Wraps an LLM↔tool loop as one parent chain run. The caller calls
    run.end(outputs={answer, steps, total_tokens, total_cost_usd}) at
    loop exit so the trace overview answers 'how many tokens / dollars
    did this agent burn' without having to sum the per-call children."""

@contextmanager
def tool_call_span(tool: str, args: dict, *, attempt: int = 1):
    """Wraps a single tool dispatch as a tool run. Yields
    finish(result=…, error=…). On exit, writes outputs + latency_ms
    metadata; on error, sets level=ERROR + outputs.error. The attempt
    counter is how retries surface in the trace."""

# caller-side (admin_chat_graph.chat), abridged:
acc_tokens, acc_cost = 0, 0.0
with agent_run_span("agent_run:admin_chat", metadata={"max_steps": MAX_STEPS}) as run:
    for step in range(MAX_STEPS):
        result, tel = await ainvoke_json_with_telemetry(llm, msgs)
        acc_tokens += tel["total_tokens"]; acc_cost += tel["cost_usd"]
        if "answer" in result:
            if run is not None:
                run.end(outputs={"answer": result["answer"], "steps": step + 1,
                                 "total_tokens": acc_tokens, "total_cost_usd": acc_cost})
            return {"response": result["answer"]}
        tool, args = result["tool"], result.get("args", {})   # JSON-router dispatch
        with tool_call_span(tool, args, attempt=step + 1) as finish:
            try:
                observation = await _dispatch(tool, args); finish(result=observation[:2000])
            except Exception as exc:
                finish(error=exc); raise
        msgs.append({"role": "user", "content": observation})  # observation becomes the next user turn
backend/leadgen_agent/langsmith_setup.py:52·init_langsmith, langsmith_setup.py:87·agent_run_span, langsmith_setup.py:133·tool_call_span
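
The stubs above show only signatures. A hedged reconstruction of what tool_call_span could look like on top of the LangSmith RunTree client is sketched below; the real langsmith_setup.py may differ, and it would also need to parent the run under the active trace (omitted here).

# Hedged reconstruction — not the real helper; parent linkage is omitted.
import os, time
from contextlib import contextmanager
from langsmith.run_trees import RunTree

@contextmanager
def tool_call_span(tool: str, args: dict, *, attempt: int = 1):
    if os.environ.get("LANGSMITH_TRACING") != "true":
        yield lambda **_: None                      # strict no-op when tracing is off
        return
    run = RunTree(name=f"tool:{tool}", run_type="tool",
                  inputs={"args": args, "attempt": attempt},
                  extra={"metadata": {"tool": tool, "attempt": attempt}})
    run.post()
    t0 = time.monotonic()
    outcome: dict = {}

    def finish(result=None, error=None):
        outcome["result"], outcome["error"] = result, error

    try:
        yield finish
    finally:
        run.extra["metadata"]["latency_ms"] = int((time.monotonic() - t0) * 1000)
        if outcome.get("error") is not None:
            run.end(outputs={"error": str(outcome["error"])}, error=str(outcome["error"]))
        else:
            run.end(outputs={"result": outcome.get("result")})
        run.patch()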

CF Secrets Store → container env

The LangSmith API key lives in Cloudflare Secrets Store, not in plain-text secrets set with wrangler secret put. Three steps land it in os.environ["LANGSMITH_API_KEY"] inside the Python container:

1

Store

One CF account-wide vault holds every shared secret. Store ID ec928f4771fb4577a607a0b122e8087e; secret ID 1977cd3b73b647d4a5658b11562a58b6, scope workers. Rotation is non-destructive — the next container cold-start picks up the new value without a redeploy.

# create / rotate (uses stored OAuth — drop CLOUDFLARE_API_TOKEN)
env -u CLOUDFLARE_API_TOKEN \
  npx wrangler secrets-store secret create ec928f4771fb4577a607a0b122e8087e \
    --name LANGSMITH_API_KEY --value "lsv2_pt_…" --scopes workers --remote
2

Binding (wrangler.jsonc)

Each Worker that needs the secret declares its own async binding. The CF runtime puts an async handle on env.LANGSMITH_API_KEY — calling .get() at runtime fetches the current value from the vault. Same store, both Workers (lead-gen-core and lead-gen-research).

// backend/core/wrangler.jsonc — secrets_store_secrets[]
{
  "binding":     "LANGSMITH_API_KEY",
  "store_id":    "ec928f4771fb4577a607a0b122e8087e",
  "secret_name": "LANGSMITH_API_KEY"
}

// non-secret config flows via vars{}
"LANGSMITH_TRACING":  "true",
"LANGSMITH_ENDPOINT": "https://api.smith.langchain.com",
"LANGSMITH_PROJECT":  "lead-gen"
3

Async resolution (src/index.js)

The Container constructor builds this.envVars from sync values only. start() awaits the Secrets Store handles in parallel, then spreads them into super.start({envVars: …}) so the Container SDK injects them as OS env vars for the uvicorn process.

// backend/core/src/index.js — CoreContainer#start()
async start(startOptions, waitOptions) {
  const [hfToken, langsmithKey] = await Promise.all([
    this.env.HF_TOKEN?.get()          ?? Promise.resolve(""),
    this.env.LANGSMITH_API_KEY?.get() ?? Promise.resolve(""),
  ]);
  return super.start({
    ...startOptions,
    envVars: {
      ...this.envVars,
      LANGSMITH_API_KEY: langsmithKey ?? "",
      /* … */
    },
  }, waitOptions);
}

Deploy gotcha: wrangler deploy with CLOUDFLARE_API_TOKEN set errors with 10021 Secrets store binding authorization failed — the token lacks secrets_store:read. Fix: env -u CLOUDFLARE_API_TOKEN wrangler deploy so wrangler falls back to the stored OAuth (which has full scope).

What every span captures

These are the signals an interviewer is looking for, mapped to where they actually live in this codebase. Nothing here is aspirational — the wrapper is on by default on every LLM call.

Signal | Captured as | Source | File
LLM input / output | prompt + completion on every LLM span | LangChain auto-trace (LANGSMITH_TRACING=true) | backend/leadgen_agent/langsmith_setup.py:52
Token counts | input_tokens, output_tokens, total_tokens | ainvoke_json_with_telemetry telemetry dict + LangSmith usage_metadata | backend/leadgen_agent/llm.py:632
Latency per LLM step | latency_ms (per LLM call) + duration_ms (per phase) | ainvoke_json_with_telemetry + StageReport | backend/leadgen_agent/llm.py:632, pipeline_graph.py:91
Latency per tool step | latency_ms metadata on tool_call_span | tool_call_span wall-clock measurement | backend/leadgen_agent/langsmith_setup.py:133
Tool args / results | inputs={args, attempt} / outputs={result|error} on tool_call_span | tool_call_span yielded finish(result=…, error=…) callback | backend/leadgen_agent/langsmith_setup.py:133
Errors & retries | level=ERROR + outputs.error; attempt counter on every tool span | tool_call_span finish(error=…) path; exponential backoff in LLM wrapper | langsmith_setup.py:133, llm.py:259
Model used | model field on every LLM run (auto) + telemetry entry | make_llm() / deepseek_model_name(tier) | backend/leadgen_agent/llm.py
Total cost / e2e latency | outputs={total_tokens, total_cost_usd} on the parent agent_run_span | loop accumulator across ainvoke_json_with_telemetry calls | langsmith_setup.py:87, admin_chat_graph.py:154
Run / thread correlation | lg_run_id, lg_thread_id, x-request-id | product_intel_runs columns + observability middleware | src/db/schema.ts:1196, backend/leadgen_agent/observability.py
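
A sketch of how the per-call telemetry dict and cost_usd could be assembled from a response's usage metadata against a static pricing table. The prices, key names on the usage dict, and helper names are illustrative, not the real llm.py values.

# Sketch of per-call telemetry + USD cost; rates are placeholders only.
PRICES_PER_1M = {                      # USD per 1M tokens: (input, output)
    "deepseek-v4-pro": (0.27, 1.10),   # illustrative numbers, not real pricing
}

def cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES_PER_1M.get(model, (0.0, 0.0))
    return round((tokens_in * p_in + tokens_out * p_out) / 1_000_000, 6)

def build_telemetry(model: str, usage: dict, latency_ms: int, calls: int = 1) -> dict:
    tokens_in = usage.get("input_tokens", 0)
    tokens_out = usage.get("output_tokens", 0)
    return {
        "model": model,
        "calls": calls,
        "input_tokens": tokens_in,
        "output_tokens": tokens_out,
        "total_tokens": tokens_in + tokens_out,
        "cost_usd": cost_usd(model, tokens_in, tokens_out),
        "latency_ms": latency_ms,
        "retries": 0,   # today retries live in logs only (see Honest gaps)
    }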

Span hierarchy

One trace per pipeline run. The root span carries the run id; phase spans carry StageReport as attributes; fan-out phases nest one worker span per id; every LLM call gets its own leaf span with token/cost/latency.

pipeline_run (trace root)
├─ phase: discover  (run_discover)
│  └─ llm: company_discovery_graph  [model, tokens, latency, cost]
├─ phase: enrich  (enrich_queue → Send fanout → enrich_collect)
│  ├─ worker: enrich_one  [company_id=42]
│  │   └─ llm: company_enrichment_graph  [model, tokens, latency, cost]
│  ├─ worker: enrich_one  [company_id=43]
│  │   └─ llm: company_enrichment_graph  ...
│  └─ ... (one worker span per id)
├─ phase: contacts  (contacts_queue → Send fanout → contacts_collect)
│  ├─ worker: contact_one  [company_id=42]
│  │   ├─ llm: contact_discovery_graph
│  │   └─ llm: contact_enrich_graph
│  └─ ...
├─ phase: qa  (run_qa → run_qa_critic)
│  ├─ sql: row_counts
│  └─ llm: critic  [model, tokens, latency, cost]
└─ phase: outreach  (outreach_queue → interrupt → outreach_run)
   ├─ checkpoint: human_gate  [decision=approve]
   └─ worker: outreach_run  [contact_id=314]
       └─ llm: email_outreach_graph  [model, tokens, latency, cost]

Span hierarchy — agent loop

A real admin_chat run answering "how many companies are in the database?" — three loop turns, two tool calls, one root agent_run_span. LLM children are captured automatically by LangChain's tracer; tool children are captured by tool_call_span.

agent_run:admin_chat  (chain — agent_run_span)
│  metadata: { max_steps: 6, prompt_chars: 41 }
│
├─ ChatOpenAI deepseek-v4-pro                                       ← LangChain auto-trace
│   prompt:        "User question: how many companies are in the database?"
│   completion:    "{\"tool\": \"inspect_schema\", \"args\": {\"table\": \"companies\"}}"
│   model:         deepseek-v4-pro
│   input_tokens:  412   output_tokens: 38   total: 450
│   latency_ms:    1820
│
├─ tool:inspect_schema  (tool — tool_call_span)
│   inputs:    { args: { table: "companies" }, attempt: 1 }
│   outputs:   { result: "id: int\nname: text\nslug: text\n…" }
│   metadata:  { tool: "inspect_schema", attempt: 1, latency_ms: 83 }
│
├─ ChatOpenAI deepseek-v4-pro
│   completion:    "{\"tool\": \"count_rows\", \"args\": {\"table\": \"companies\"}}"
│   input_tokens:  680   output_tokens: 24   total: 704
│   latency_ms:    1340
│
├─ tool:count_rows
│   inputs:    { args: { table: "companies" }, attempt: 2 }
│   outputs:   { result: "companies: 2607 rows" }
│   metadata:  { tool: "count_rows", attempt: 2, latency_ms: 145 }
│
├─ ChatOpenAI deepseek-v4-pro
│   completion:    "{\"answer\": \"There are 2607 companies in the database.\"}"
│   input_tokens:  720   output_tokens: 22   total: 742
│   latency_ms:    980
│
└─ outputs: {
      answer:            "There are 2607 companies in the database.",
      steps:             3,
      total_tokens:      1896,
      total_cost_usd:    0.00041
   }

Real-shape payload

The shape below is what one enrich_one worker emits into graph_meta.telemetry, and what gets rolled up into product_intel_runs.total_cost_usd at run completion. Values are illustrative; the keys and types are the real ones from ainvoke_json_with_telemetry.

// graph_meta.telemetry["enrich.enrich_one[company_id=42]"]
{
  "model": "deepseek-v4-pro",
  "calls": 3,
  "input_tokens": 4180,
  "output_tokens": 612,
  "total_tokens": 4792,
  "cost_usd": 0.001847,
  "latency_ms": 2840,
  "retries": 0
}

// Rolled up into product_intel_runs at end of run:
{
  "id": "0193-...-runid",
  "lg_run_id": "01HX...",
  "lg_thread_id": "01HX...",
  "total_cost_usd": 0.1834,
  "progress": {
    "discover":  { "duration_ms": 1820, "processed": 1,  "created": 14 },
    "enrich":    { "duration_ms": 18430, "processed": 14, "errors": [] },
    "contacts":  { "duration_ms": 24110, "processed": 14, "created": 38 },
    "qa":        { "duration_ms": 3120, "critic_pass_rate": 0.86 },
    "outreach":  { "duration_ms": 6240, "drafts": 12, "human_decision": "approve" }
  }
}
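
A small sketch of the roll-up step — summing the per-node telemetry in graph_meta into the run-level totals. The key names mirror the payloads above; the function itself is illustrative, not the real collector.

# Sketch of rolling graph_meta.telemetry up into product_intel_runs totals.
def roll_up(graph_meta: dict) -> dict:
    telemetry = graph_meta.get("telemetry", {})
    return {
        "total_cost_usd": round(sum(t.get("cost_usd", 0.0) for t in telemetry.values()), 4),
        "total_tokens": sum(t.get("total_tokens", 0) for t in telemetry.values()),
    }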

Real-shape payload — agent loop

What the parent agent_run_span emits as outputs at loop exit, plus one child tool_call_span. Token + cost totals come from ainvoke_json_with_telemetry (the same path the pipeline graph uses) and are accumulated across loop turns before being written on the parent.

// parent agent_run_span outputs at loop exit:
{
  "answer":         "There are 2607 companies in the database.",
  "steps":          3,
  "total_tokens":   1896,
  "total_cost_usd": 0.00041
}

// one child tool_call_span (inspect_schema):
{
  "name":     "tool:inspect_schema",
  "run_type": "tool",
  "inputs":   { "args": { "table": "companies" }, "attempt": 1 },
  "outputs":  { "result": "id: int\nname: text\nslug: text\n…" },
  "metadata": { "tool": "inspect_schema", "attempt": 1, "latency_ms": 83 }
}

// error-path tool_call_span (forced bad SQL — query_db):
{
  "name":     "tool:query_db",
  "run_type": "tool",
  "inputs":   { "args": { "sql": "SELECT count(*) FROM nonexistent" }, "attempt": 4 },
  "outputs":  { "error": "Error: relation \"nonexistent\" does not exist" },
  "level":    "ERROR",
  "metadata": { "tool": "query_db", "attempt": 4, "latency_ms": 31 }
}

End-to-end correlation

x-request-id is minted by the CF dispatcher middleware (backend/leadgen_agent/observability.py) and forwarded through the core / ml / research worker chain — so a trace started by the Next.js app stays correlated even when it crosses worker boundaries.

lg_run_id and lg_thread_id are persisted on product_intel_runs (src/db/schema.ts:1196). Open a row in the admin UI, click through to LangSmith (project lead-gen / lead-gen-research) with those ids, and the full span tree — prompts, completions, token counts, cost — is there for that exact run.

The GraphQL surface (src/apollo/resolvers/products/intel-runs.ts) exposes totalCostUsd at the boundary; the per-node breakdown stays in progress (jsonb) for richer drill-downs without an extra LLM call.

The one-paragraph debugging answer

First I open the LangSmith trace via lg_run_id from product_intel_runs. Spans are colour-coded by level — an ERROR tool span pins the failing step immediately; otherwise a latency outlier or a non-empty StageReport.errors[] does. At that step I read three things on one screen: the tool inputs (args), the tool outputs (result or error), and the LLM completion that asked for them. If the tool returned correct data but the model produced a wrong answer, that's hallucination — the prompt or model tier is the fix. If the tool errored or returned the wrong shape, that's a tool failure — the JSON-router parsing or the helper is the fix. To cross-check production logs (retry warnings, upstream errors, raw stack traces), I run wrangler tail and grep by x-request-id, which the FastAPI middleware mints once and echoes on every response — so one id stitches Vercel logs → CF dispatcher → core / research workers → LangSmith trace.

Debugging playbook — wrong-answer report

A user says "your agent gave a wrong answer". Five steps, in order, to localize the failure and decide what to fix. Each maps to a real artifact in this codebase — the trace, the DB pointer, the worker logs, the boot probe. No prompt-tweaking on a hunch.

1

Find the run

Pull lg_run_id + lg_thread_id from product_intel_runs (or the GraphQL productIntelRun(id) query). Open the LangSmith trace with that id. For the JSON-router loops (admin_chat, agentic_search, research_agent), the trace url is the only entry point — grab it from the response's x-request-id header.

Where to look: DB row.lg_run_id → smith.langchain.com/o/.../runs/{id}; GraphQL field totalCostUsd to confirm you have the right run before clicking through
src/db/schema.ts:1200, src/apollo/resolvers/products/intel-runs.ts:268
2

Identify the failing step

Three fingerprints, in order: (a) an ERROR-level tool span (tool_call_span set level=ERROR on exception); (b) a latency outlier on a phase span (one phase 10× the others); (c) StageReport.errors[] non-empty for the pipeline graph. LangSmith colour-codes ERROR spans red — the failing step is usually one click.

Where to look: Run tree → red-bordered children; phase span duration_ms outliers; outputs.error on any tool span
backend/leadgen_agent/langsmith_setup.py:133, backend/leadgen_agent/pipeline_graph.py:90-96
3

Inspect prompts + tool inputs/outputs

On one screen read three things: the LLM completion (what the model decided), the tool inputs (what args it sent), and the tool outputs (what came back). LangChain auto-trace captures the prompt + completion verbatim on every LLM run; tool_call_span captures args + result on every tool dispatch. If the tool returned correct data and the model produced a wrong answer → hallucination. If the tool errored or returned the wrong shape → tool failure.

Where to look: LLM run.inputs.messages + run.outputs.generations; tool_call_span.inputs.args + outputs.result
backend/leadgen_agent/langsmith_setup.py:87, 133
4

Correlate traces with logs

Grab x-request-id from the response header (or LangSmith trace metadata). The FastAPI middleware mints one id per request and echoes it on every response — including 401s that never reach a handler. `wrangler tail lead-gen-core --format pretty | grep <id>` joins Vercel logs → CF dispatcher → core/research worker logs to the LangSmith trace. Look for retry-loop warnings (logs-only — known gap), psycopg pool errors, or raw stack traces that didn't make it into the trace.

Where to look: Response header x-request-id; observability.py middleware mints uuid4 if absent; wrangler tail lead-gen-core / lead-gen-research
backend/leadgen_agent/observability.py:57-90
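
A sketch of the mint-once / echo-always middleware shape described in this step; the real observability.py may differ in header handling, logging, and error paths.

# Sketch of the x-request-id middleware; details of the real middleware may differ.
import uuid
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def request_id_middleware(request: Request, call_next):
    # Reuse the inbound id (CF dispatcher / Next.js) or mint one.
    rid = request.headers.get("x-request-id") or uuid.uuid4().hex
    request.state.request_id = rid            # handlers and loggers can read it here
    response = await call_next(request)
    response.headers["x-request-id"] = rid    # echoed on every response
    return response
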
5

Diagnose root cause

Walk the failure-modes table below. Each symptom has a trace fingerprint (what to grep for) and a fix recipe. Most production wrong-answer reports fall into one of seven buckets: hallucination, schema drift, send-worker exception, transient upstream retries, prompt-cache hit, bad tool-selection prompt, or telemetry-not-flowing.

Where to look: Failure modes table (next section); also /__debug/state for boot-time problems (langsmith_disabled, graph_compile_failures, lifespan_error)
backend/core/app.py:581-612

Common failure modes

Seven recurring shapes from real wrong-answer incidents. The fingerprint column tells you what to look for in the trace; the fix column is the actual remediation, not generic advice. Hallucination and tool failure are the two ends of the spectrum — most rows diagnose which side of that line you're on.

Symptom | Fingerprint in trace | Likely cause | Fix
Wrong factual answer (e.g. wrong company count) | tool:count_rows span output is correct; LLM completion contradicts it | Hallucination — model didn't ground on the tool result | Tighten system prompt with an explicit grounding rule; escalate to tier="deep" (reasoning_effort=high); add a verification step
Tool returns "Error: relation X does not exist" | tool:query_db has level=ERROR + outputs.error populated | Schema drift between Drizzle migrations and the live DB | pnpm db:migrate; verify with pnpm db:studio; in admin_chat, regenerate TOOLS_DOC if the table was renamed
Empty or truncated answer | StageReport.errors[] populated; phase status=PARTIAL or FAIL | Send-worker exception — one company id failed, phase emitted but errors[] non-empty | Read the first error string in errors[]. Common: rate limit (429), Brave 500, missing API key. Retry the single id after fixing.
Latency spike on one phase | Phase duration_ms is 10× the others; LLM child latency_ms flat | Transient upstream retries — the LLM wrapper retried with exponential backoff + jitter | Grep wrangler tail for "LLM call transient failure" — only place retry attempts surface today (known gap, see below)
Same exact wrong answer twice | Trace shows identical tool args + identical completion both times | Prompt-cache hit (DeepSeek) or same seed across runs | Force cache miss: prepend a nonce to the prompt; or briefly switch tier="standard" ↔ "deep" to invalidate the cache key
Tool called with bizarre args (e.g. table="users" instead of "companies") | tool:inspect_schema inputs.args.table is wrong | Bad tool-selection prompt — model confused two similar tools or hallucinated a table name | Tighten TOOLS_DOC strings in the graph; add a worked example in STEP_INSTRUCTION; consider a whitelist check before dispatch
Container booted but no traces appear in LangSmith | /__debug/state.boot_log shows langsmith_disabled | LANGSMITH_TRACING != "true" OR LANGSMITH_API_KEY empty (Secrets Store didn't resolve) | wrangler tail for "langsmith: disabled (tracing=… key=…)"; if key=missing, re-check the Secrets Store binding in wrangler.jsonc and the await chain in CoreContainer#start()

Honest gaps

What's deliberately not built yet. Calling these out is part of a good trace-design answer — observability is never finished, and an interviewer learns more from the gap list than from the feature list.

OpenTelemetry / OTLP

Not wired. We rely on LangSmith's own run-tree model. Adding an OTel exporter would let traces flow into Grafana Tempo / Honeycomb / Datadog alongside LangSmith — useful once non-LLM services need to share the same trace context.

In-app flame graph UI

There is no flame-graph viewer inside lead-gen itself. Operators jump to LangSmith for the visual; the app surfaces only the rolled-up cost + status via the productIntelRun GraphQL query.

Dedicated traces / spans tables in Neon

LangSmith owns trace storage. Only the aggregate (total_cost_usd, progress jsonb, lg_run_id, lg_thread_id) lives in product_intel_runs. If LangSmith ever needs to go, we'd need to mirror spans into Neon — a known migration cost.

Per-tool argument redaction

Prompts / completions / tool args are captured verbatim. For PII-heavy contact enrichment, a redaction pass before the LangSmith client fires would be the next hardening step.
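
A sketch of what that redaction pass could look like — a recursive regex scrub over strings before inputs/outputs reach the LangSmith client. The patterns are illustrative and far from exhaustive; real PII coverage would need review.

# Sketch of a pre-trace redaction pass; patterns are illustrative only.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(value):
    if isinstance(value, str):
        value = EMAIL.sub("<email>", value)
        return PHONE.sub("<phone>", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

# e.g. tool_call_span would call redact(args) / redact(result) before writing inputs/outputs.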

Secrets Store binding scope

`wrangler deploy` with a CLOUDFLARE_API_TOKEN that lacks secrets_store:read errors with 10021 (Secrets store binding authorization failed). Workaround is to drop the token and use stored OAuth (env -u CLOUDFLARE_API_TOKEN wrangler deploy). The deploy runbook documents the create-secret variant only; the deploy-time variant is now noted on this page.

Retry count is logs-only, not on telemetry

ainvoke_json_with_telemetry emits {model, input_tokens, output_tokens, total_tokens, cost_usd, latency_ms, calls}. The retry-attempt counter exists only as log.warning('LLM call transient failure (attempt N/M)...') in llm.py:670 — debugging a flaky-upstream incident requires `wrangler tail` + grep, not LangSmith filters. Promoting attempt count into the telemetry dict (and onto tool_call_span metadata for retried tools) is the obvious follow-up.
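
A sketch of that follow-up — counting attempts in the retry loop and carrying the number into the telemetry dict so LangSmith filters can find flaky runs. The exception class, backoff values, and names are illustrative, not the real llm.py wrapper.

# Sketch of promoting the retry counter into telemetry; numbers are illustrative.
import asyncio, random

class TransientUpstreamError(Exception):
    """Stand-in for the wrapper's retryable-error classification."""

async def call_with_retries(do_call, max_attempts: int = 4):
    for attempt in range(1, max_attempts + 1):
        try:
            return await do_call(), attempt          # caller sets telemetry["retries"] = attempt - 1
        except TransientUpstreamError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep((2 ** attempt) + random.random())   # backoff + jitter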

DeepSeek <think> blocks are stripped before capture

_parse_json() in llm.py strips markdown fences and extracts balanced JSON; the model's chain-of-thought is dropped on the way in, so it's not visible in trace prompts/completions. Useful for token economy and clean prompts, painful for debugging when the answer is wrong but the reasoning would have been the tell.

No admin UI list page for product_intel_runs

Operators query Neon directly or hit the GraphQL productIntelRun(id) query. Listing all runs (sortable by cost, status, started_at) requires SQL or pulling the run-id from a Slack notification. A simple admin table view would close the loop for non-engineers triaging incidents.