Platform

Observability
per run, per agent, per dollar.

Every agent run is a span with tokens, cost, latency, tool calls, and outcome. Every workflow is a graph of those spans. Every artifact carries the trace back to the requirement that asked for it.
Per-span tokens + cost · Model tier visible · Error lineage · OTLP + SIEM export
acme / customer-hub / runs / 84c2
ship-a-page · run 84c2 · 00:04:18 total
  1. 09:03:14 · spec.approved · alice@acme
  2. 09:03:16 · schema.started · sonnet · 2.1k tok
  3. 09:03:48 · schema.applied · $0.06 · 32s
  4. 09:03:49 · design.started · sonnet · 1.4k tok
  5. 09:04:21 · design.applied · $0.04 · 32s
  6. 09:04:22 · pages/[id].started · sonnet · 2.2k tok
  7. 09:04:55 · pages/[id].parse_error · retry queued
  8. 09:05:08 · pages/[id].escalated · opus · 3.0k tok
  9. 09:07:03 · pages/[id].applied · $0.21 · opus
  10. 09:07:21 · tests.applied · $0.12

The problem

AI pipelines that look like black boxes end up as black holes — for cost, for debugging, for compliance. AlgorithmShift treats every agent run like a service span: structured, logged, exportable. You can answer "why did this cost $4?" or "which prompt produced this file?" in one query.
[01]

Every run is a span with a full lifecycle.

Start, end, inputs, outputs, tool calls, errors, retries — all structured, all queryable. The same shape for a 2-second router hop or a 4-minute page generation.
  • Stable schema across agents + model tiers
  • Parent / child spans mirror the task graph
  • OTLP-compatible, drop-in for your existing tracer
span.json — sample run record
{
  "run_id": "run_84c2",
  "workflow": "ship-a-page",
  "step_id": "pages/customers/[id]",
  "agent": "pages",
  "agent_version": "3.4.1",
  "model": "claude-sonnet-4-6",
  "tier": "sonnet",
  "tokens": { "in": 2180, "out": 1974 },
  "cost_usd": 0.084,
  "latency_ms": 3421,
  "tools_called": [
    { "name": "read_schema", "ms": 42 },
    { "name": "read_design_tokens", "ms": 18 }
  ],
  "outcome": "applied",
  "artifact_id": "iter_4f19/pages/customers/[id].tsx"
}
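The same record maps naturally onto an OTLP span. A minimal sketch of the flattening, assuming dotted attribute keys of our own choosing — the `algoshift.*` names below are illustrative, not the platform's actual attribute schema:

```python
# Flatten a run record into OTLP-style span attributes.
# Key names (algoshift.*) are illustrative, not the platform's real schema.
def to_otlp_attributes(record: dict) -> dict:
    attrs = {
        "algoshift.run_id": record["run_id"],
        "algoshift.workflow": record["workflow"],
        "algoshift.agent": record["agent"],
        "algoshift.model": record["model"],
        "algoshift.tier": record["tier"],
        "algoshift.tokens.in": record["tokens"]["in"],
        "algoshift.tokens.out": record["tokens"]["out"],
        "algoshift.cost_usd": record["cost_usd"],
        "algoshift.outcome": record["outcome"],
    }
    # OTLP attribute values must be primitives or arrays of primitives,
    # so tool calls collapse to a name list here.
    attrs["algoshift.tools"] = [t["name"] for t in record["tools_called"]]
    return attrs
```

Because the output is flat primitives, any OTLP-compatible collector can index every field without custom parsing.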
[02]

Cost + tokens + model, side-by-side.

The dashboard shows which tier ran where, how many tokens it took, and what it cost. Budget alerts trip when a workflow crosses a threshold you set; escalation events are first-class so you can see exactly when Sonnet escalated to Opus and why.
  • Per-agent, per-run, per-workspace cost breakdown
  • Monthly + daily budget with hard-stop option
  • Escalation events tagged with the parse / validation failure that caused them
Daily spend — Apr 18 · $3.09 · budget $10.00
Agent          Runs   Tokens   Cost    Tier
requirements   8      18.4k    $0.42   sonnet
design         4      6.2k     $0.18   sonnet
schema         6      14.1k    $0.39   sonnet
pages          12     42.8k    $1.84   sonnet → opus ×2
integration    3      5.1k     $0.14   haiku
release-notes  2      5.2k     $0.12   sonnet
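The per-agent table is a straightforward group-by over raw spans. A sketch in Python, assuming each span carries the `agent`, `tokens`, `cost_usd`, and `tier` fields shown in the sample record above:

```python
from collections import defaultdict

def cost_breakdown(spans):
    """Roll raw spans up into per-agent runs / tokens / cost totals."""
    rows = defaultdict(lambda: {"runs": 0, "tokens": 0,
                                "cost_usd": 0.0, "tiers": set()})
    for s in spans:
        row = rows[s["agent"]]
        row["runs"] += 1
        row["tokens"] += s["tokens"]["in"] + s["tokens"]["out"]
        row["cost_usd"] += s["cost_usd"]
        row["tiers"].add(s["tier"])  # surfaces sonnet → opus escalations
    return dict(rows)
```

Grouping by `(agent, tier)` instead of `agent` would give the escalation rows their own line, which some teams prefer for budget review.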
[03]

Errors surface as structured events, not log dumps.

When an agent fails, the platform records the error type, the failing artifact, and the retry decision. Same spot, same shape, every time — so you can alert on patterns instead of grepping stack traces.
  • Error taxonomy: parse_error, validation_fail, tool_timeout, budget_exceeded, approval_rejected
  • Links to the failing artifact + the retry that eventually succeeded
  • Pattern-match alert rules via webhook or PagerDuty
error feed — last 24h
  • 09:04:55 · parse_error · pages · Invalid JSON in model response — retry 1/3 queued
  • 09:22:08 · tool_timeout · integration · http tool jira_search exceeded 10s — marked failed
  • 10:01:33 · validation_fail · schema · NOT NULL without default on existing table — blocked before apply
  • 10:44:11 · budget_exceeded · pages · workflow hit $5.00 soft cap — owner notified, run paused
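Because errors are structured events, a pattern-match rule reduces to counting matches in a sliding window. A sketch of such a rule, with illustrative field names (`type`, `agent`, `ts`) rather than the platform's documented event shape:

```python
from collections import deque

class ErrorRule:
    """Fire when the same error type hits the same agent N times
    inside a sliding time window. Field names are assumptions."""

    def __init__(self, error_type, agent, threshold, window_s):
        self.error_type, self.agent = error_type, agent
        self.threshold, self.window_s = threshold, window_s
        self.hits = deque()  # timestamps of matching events

    def observe(self, event):
        """Return True when this event should trigger the alert."""
        if event["type"] != self.error_type or event["agent"] != self.agent:
            return False
        self.hits.append(event["ts"])
        # Drop hits that have aged out of the window.
        while self.hits and event["ts"] - self.hits[0] > self.window_s:
            self.hits.popleft()
        return len(self.hits) >= self.threshold
```

The same shape works for webhook or PagerDuty delivery: when `observe` returns True, post the triggering event plus the window of hits.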
[04]

Retry lineage — not just the last outcome.

The platform keeps every attempt in the ledger. If a page succeeded on retry 3 after 2 parse failures, you can walk all three attempts: prompt, tool calls, error, escalation decision.
  • All attempts persisted, not just the final one
  • Prompt + input diff between retries
  • Escalation tier switch recorded with rationale
retry lineage — pages / customers / [id]
attempt_1   sonnet  2.2k tok   $0.09   parse_error
attempt_2   sonnet  2.4k tok   $0.10   parse_error (same shape)
            → escalated: "two same-shape parse failures"
attempt_3   opus    3.0k tok   $0.21   applied   ✔
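Because every attempt is persisted, the escalation decision can be replayed straight from the ledger. A sketch of the "two same-shape parse failures" check, with assumed field names (`outcome`, `error_shape`); the platform's real escalation policy may weigh more signals:

```python
def should_escalate(attempts, max_same_shape=2):
    """Escalate tier when the last N attempts failed with the same
    error shape. `attempts` is the ordered ledger for one step."""
    tail = attempts[-max_same_shape:]
    if len(tail) < max_same_shape:
        return False
    shapes = {(a["outcome"], a.get("error_shape")) for a in tail}
    # One distinct (outcome, shape) pair across the tail, and it isn't
    # a success: same failure twice in a row, so switch tiers.
    return len(shapes) == 1 and tail[-1]["outcome"] != "applied"
```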
[05]

Audit export — your SIEM, your retention.

Everything you see in the dashboard — every run, every approval, every retry — can be streamed out via webhook, OTLP, or pre-built SIEM integrations. Keep your own retention contract, even if you leave the platform.
  • Outbound webhook with HMAC + replay protection
  • OTLP-compatible spans for Honeycomb / Tempo / Grafana
  • Pre-built connectors for Datadog + Splunk
settings.yaml — configure exports
# settings.yaml — export to your observability stack
observability:
  webhook:
    url: https://o11y.acme.net/ingest/algoshift
    secret: env(ALGO_O11Y_SECRET)
    include:
      - run_completed
      - run_failed
      - approval_decided
      - cost_threshold_crossed

  otlp:
    endpoint: otel-collector.acme.svc:4317
    headers:
      authorization: env(OTLP_AUTH)

  siem:
    vendor: datadog
    api_key: env(DD_API_KEY)
    tags: [algorithmshift, tenant=acme]
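On the receiving side, the HMAC and replay protection can be checked before trusting a delivery. A minimal sketch in Python; the timestamp-prefixed signing scheme shown here is a common pattern and an assumption, not the platform's documented contract:

```python
import hashlib
import hmac
import time

def verify_webhook(secret: str, body: bytes, signature: str,
                   timestamp: str, max_age_s: int = 300) -> bool:
    """Reject stale deliveries (replay protection), then compare HMACs
    in constant time. Signing scheme is illustrative."""
    if abs(time.time() - float(timestamp)) > max_age_s:
        return False
    expected = hmac.new(secret.encode(),
                        timestamp.encode() + b"." + body,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

`hmac.compare_digest` matters here: a naive `==` comparison leaks timing information an attacker can use to forge signatures byte by byte.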
[06]

Approval + review events are first-class.

Every reviewer decision is logged as an event — who approved, what they saw, what they said. Regulated environments can trace any commit back to the human who signed off, across the entire lifecycle of the app.
  • Reviewer role + decision + free-text rationale
  • Artifact hash pinned at decision time
  • Exportable to compliance via audit bundle
approval events — customer-hub
  1. Apr 18 · 09:11 · spec.approved · alice@acme · 'LGTM, ship'
  2. Apr 18 · 10:32 · migration.approved · bob@acme · DBA
  3. Apr 18 · 10:34 · release.approved · bob@acme · 'rollback: pr_notes.md'
  4. Apr 18 · 11:02 · release.applied · prod · v2.8.0
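Pinning the artifact hash at decision time is what makes these events auditable: the approval binds to exact bytes, not a filename that may change later. A sketch of what such an event record might contain — field names are illustrative, not the exported event schema:

```python
import hashlib
import time

def approval_event(reviewer: str, role: str, decision: str,
                   rationale: str, artifact_bytes: bytes) -> dict:
    """Illustrative event shape: the sha256 of the artifact as the
    reviewer saw it is captured at the moment of the decision."""
    return {
        "ts": time.time(),
        "reviewer": reviewer,
        "role": role,
        "decision": decision,
        "rationale": rationale,
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
    }
```

An auditor can later re-hash the committed artifact and confirm it matches what was approved, with no trust in intermediate systems required.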

Per-span — token + cost + latency on every agent call
OTLP — native trace export · Honeycomb · Grafana · Datadog
Retry lineage — all attempts kept, not just the last
Exportable — audit bundle per release · SOC-ready

FAQ

Common questions

Can I set a hard spending cap?
Yes — per workspace, per app, or per workflow. When the cap is hit, new runs are blocked until the next budget window or a lead overrides it. Mid-run breaches fail the task cleanly with a budget_exceeded error.
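The hard-cap semantics above can be pictured as a running ledger that refuses the charge that would cross the cap. A sketch under that assumption; real enforcement happens server-side, and names here are illustrative:

```python
class BudgetExceeded(Exception):
    """Raised mid-run when a hard cap is breached, failing the task
    cleanly rather than letting the charge through."""

def charge(ledger: dict, workflow: str, cost_usd: float,
           cap_usd: float) -> float:
    """Record a cost against a workflow, refusing charges past the cap.
    The ledger entry is only updated when the charge is accepted."""
    spent = ledger.get(workflow, 0.0) + cost_usd
    if spent > cap_usd:
        raise BudgetExceeded(f"{workflow} would exceed ${cap_usd:.2f} cap")
    ledger[workflow] = spent
    return spent
```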
Do you log prompts + completions?
By default we keep enough to debug (inputs, summarised outputs, tool calls). Full prompt + completion storage is opt-in per workspace — useful for audit but costs more to retain. Retention is configurable from 7 days to 18 months.
How fast does a run show up in the dashboard?
Spans stream live — you see steps transition from `started` to `applied` in real time. End-to-end latency from agent completion to dashboard visibility is sub-second within the same region.
Can I query runs programmatically?
Yes — both a GraphQL API for interactive queries and a bulk export endpoint for periodic pulls into your warehouse. SDKs for Python + TypeScript. Schema documented under /docs/observability.
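A sketch of what an interactive pull might look like; the endpoint path, auth header, and every field name below are assumptions for illustration — the real schema lives under /docs/observability:

```python
import json
import urllib.request

# Illustrative query: field names are NOT the documented schema.
QUERY = """
query RunsByWorkflow($wf: String!, $since: String!) {
  runs(workflow: $wf, since: $since) {
    run_id agent tier cost_usd outcome
  }
}
"""

def fetch_runs(base_url: str, token: str, workflow: str, since: str):
    """POST a standard GraphQL payload (query + variables) and return
    the runs list. Endpoint and auth scheme are assumptions."""
    payload = json.dumps({"query": QUERY,
                          "variables": {"wf": workflow, "since": since}})
    req = urllib.request.Request(
        base_url + "/graphql",
        data=payload.encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]["runs"]
```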

See what your agents actually did — down to the token.

Debugging AI pipelines without spans is like debugging distributed systems without logs. We refuse to ship either way.