Platform

Observability
per run, per agent, per dollar.

Every agent run is a span with tokens, cost, latency, tool calls, and outcome. Every workflow is a graph of those spans. Every artifact carries the trace back to the requirement that asked for it.
Per-span tokens + cost · Model tier visible · Error lineage · OTLP + SIEM export
acme / customer-hub / runs / 84c2
ship-a-page · run 84c2 · 00:04:18 total
  1. 09:03:14 · spec.approved · alice@acme
  2. 09:03:16 · schema.started · sonnet · 2.1k tok
  3. 09:03:48 · schema.applied · $0.06 · 32s
  4. 09:03:49 · design.started · sonnet · 1.4k tok
  5. 09:04:21 · design.applied · $0.04 · 32s
  6. 09:04:22 · pages/[id].started · sonnet · 2.2k tok
  7. 09:04:55 · pages/[id].parse_error · retry queued
  8. 09:05:08 · pages/[id].escalated · opus · 3.0k tok
  9. 09:07:03 · pages/[id].applied · $0.21 · opus
  10. 09:07:21 · tests.applied · $0.12

The problem

AI pipelines that look like black boxes end up as black holes — for cost, for debugging, for compliance. AlgorithmShift treats every agent run like a service span: structured, logged, exportable. You can answer "why did this cost $4?" or "which prompt produced this file?" in one query.
[01]

Every run is a span with a full lifecycle.

Start, end, inputs, outputs, tool calls, errors, retries — all structured, all queryable. The same shape for a 2-second router hop or a 4-minute page generation.
  • Stable schema across agents + model tiers
  • Parent / child spans mirror the task graph
  • OTLP-compatible, drop-in for your existing tracer
span.json — sample run record
{
  "run_id": "run_84c2",
  "workflow": "ship-a-page",
  "step_id": "pages/customers/[id]",
  "agent": "pages",
  "agent_version": "3.4.1",
  "model": "claude-sonnet-4-6",
  "tier": "sonnet",
  "tokens": { "in": 2180, "out": 1974 },
  "cost_usd": 0.084,
  "latency_ms": 3421,
  "tools_called": [
    { "name": "read_schema", "ms": 42 },
    { "name": "read_design_tokens", "ms": 18 }
  ],
  "outcome": "applied",
  "artifact_id": "iter_4f19/pages/customers/[id].tsx"
}
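The same record maps naturally onto an OTLP span. A minimal sketch of the flattening, assuming dotted attribute keys of our own choosing — the `algoshift.*` names below are illustrative, not the platform's actual attribute schema:

```python
# Flatten a run record into OTLP-style span attributes.
# Key names (algoshift.*) are illustrative, not the platform's real schema.
def to_otlp_attributes(record: dict) -> dict:
    attrs = {
        "algoshift.run_id": record["run_id"],
        "algoshift.workflow": record["workflow"],
        "algoshift.agent": record["agent"],
        "algoshift.model": record["model"],
        "algoshift.tier": record["tier"],
        "algoshift.tokens.in": record["tokens"]["in"],
        "algoshift.tokens.out": record["tokens"]["out"],
        "algoshift.cost_usd": record["cost_usd"],
        "algoshift.outcome": record["outcome"],
    }
    # OTLP attribute values must be primitives or arrays of primitives,
    # so tool calls collapse to a name list here.
    attrs["algoshift.tools"] = [t["name"] for t in record["tools_called"]]
    return attrs
```

Because the output is flat primitives, any OTLP-compatible collector can index every field without custom parsing.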
[02]

Cost + tokens + model, side-by-side.

The dashboard shows which tier ran where, how many tokens it took, and what it cost. Budget alerts trip when a workflow crosses a threshold you set; escalation events are first-class so you can see exactly when Sonnet escalated to Opus and why.
  • Per-agent, per-run, per-workspace cost breakdown
  • Monthly + daily budget with hard-stop option
  • Escalation events tagged with the parse / validation failure that caused them
Daily spend — Apr 18 · $3.09 · budget $10.00
Agent          Runs   Tokens   Cost    Tier
requirements   8      18.4k    $0.42   sonnet
design         4      6.2k     $0.18   sonnet
schema         6      14.1k    $0.39   sonnet
pages          12     42.8k    $1.84   sonnet → opus ×2
integration    3      5.1k     $0.14   haiku
release-notes  2      5.2k     $0.12   sonnet
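The per-agent table is a straightforward group-by over raw spans. A sketch in Python, assuming each span carries the `agent`, `tokens`, `cost_usd`, and `tier` fields shown in the sample record above:

```python
from collections import defaultdict

def cost_breakdown(spans):
    """Roll raw spans up into per-agent runs / tokens / cost totals."""
    rows = defaultdict(lambda: {"runs": 0, "tokens": 0,
                                "cost_usd": 0.0, "tiers": set()})
    for s in spans:
        row = rows[s["agent"]]
        row["runs"] += 1
        row["tokens"] += s["tokens"]["in"] + s["tokens"]["out"]
        row["cost_usd"] += s["cost_usd"]
        row["tiers"].add(s["tier"])  # surfaces sonnet → opus escalations
    return dict(rows)
```

Grouping by `(agent, tier)` instead of `agent` would give the escalation rows their own line, which some teams prefer for budget review.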
[03]

Errors surface as structured events, not log dumps.

When an agent fails, the platform records the error type, the failing artifact, and the retry decision. Same spot, same shape, every time — so you can alert on patterns instead of grepping stack traces.
  • Error taxonomy: parse_error, validation_fail, tool_timeout, budget_exceeded, approval_rejected
  • Links to the failing artifact + the retry that eventually succeeded
  • Pattern-match alert rules via webhook or PagerDuty
error feed — last 24h
  • 09:04:55 · parse_error · pages · Invalid JSON in model response — retry 1/3 queued
  • 09:22:08 · tool_timeout · integration · http tool jira_search exceeded 10s — marked failed
  • 10:01:33 · validation_fail · schema · NOT NULL without default on existing table — blocked before apply
  • 10:44:11 · budget_exceeded · pages · workflow hit $5.00 soft cap — owner notified, run paused
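Because errors are structured events, a pattern-match rule reduces to counting matches in a sliding window. A sketch of such a rule, with illustrative field names (`type`, `agent`, `ts`) rather than the platform's documented event shape:

```python
from collections import deque

class ErrorRule:
    """Fire when the same error type hits the same agent N times
    inside a sliding time window. Field names are assumptions."""

    def __init__(self, error_type, agent, threshold, window_s):
        self.error_type, self.agent = error_type, agent
        self.threshold, self.window_s = threshold, window_s
        self.hits = deque()  # timestamps of matching events

    def observe(self, event):
        """Return True when this event should trigger the alert."""
        if event["type"] != self.error_type or event["agent"] != self.agent:
            return False
        self.hits.append(event["ts"])
        # Drop hits that have aged out of the window.
        while self.hits and event["ts"] - self.hits[0] > self.window_s:
            self.hits.popleft()
        return len(self.hits) >= self.threshold
```

The same shape works for webhook or PagerDuty delivery: when `observe` returns True, post the triggering event plus the window of hits.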
[04]

Retry lineage — not just the last outcome.

The platform keeps every attempt in the ledger. If a page succeeded on retry 3 after 2 parse failures, you can walk all three attempts: prompt, tool calls, error, escalation decision.
  • All attempts persisted, not just the final one
  • Prompt + input diff between retries
  • Escalation tier switch recorded with rationale
retry lineage — pages / customers / [id]
attempt_1   sonnet  2.2k tok   $0.09   parse_error
attempt_2   sonnet  2.4k tok   $0.10   parse_error (same shape)
            → escalated: "two same-shape parse failures"
attempt_3   opus    3.0k tok   $0.21   applied   ✔
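Because every attempt is persisted, the escalation decision can be replayed straight from the ledger. A sketch of the "two same-shape parse failures" check, with assumed field names (`outcome`, `error_shape`); the platform's real escalation policy may weigh more signals:

```python
def should_escalate(attempts, max_same_shape=2):
    """Escalate tier when the last N attempts failed with the same
    error shape. `attempts` is the ordered ledger for one step."""
    tail = attempts[-max_same_shape:]
    if len(tail) < max_same_shape:
        return False
    shapes = {(a["outcome"], a.get("error_shape")) for a in tail}
    # One distinct (outcome, shape) pair across the tail, and it isn't
    # a success: same failure twice in a row, so switch tiers.
    return len(shapes) == 1 and tail[-1]["outcome"] != "applied"
```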
[05]

Audit export — your SIEM, your retention.

Everything you see in the dashboard — every run, every approval, every retry — can be streamed out via webhook, OTLP, or pre-built SIEM integrations. Keep your own retention contract, even if you leave the platform.
  • Outbound webhook with HMAC + replay protection
  • OTLP-compatible spans for Honeycomb / Tempo / Grafana
  • Pre-built connectors for Datadog + Splunk
settings.yaml — configure exports
# settings.yaml — export to your observability stack
observability:
  webhook:
    url: https://o11y.acme.net/ingest/algoshift
    secret: env(ALGO_O11Y_SECRET)
    include:
      - run_completed
      - run_failed
      - approval_decided
      - cost_threshold_crossed

  otlp:
    endpoint: otel-collector.acme.svc:4317
    headers:
      authorization: env(OTLP_AUTH)

  siem:
    vendor: datadog
    api_key: env(DD_API_KEY)
    tags: [algorithmshift, tenant=acme]
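On the receiving side, the HMAC and replay protection can be checked before trusting a delivery. A minimal sketch in Python; the timestamp-prefixed signing scheme shown here is a common pattern and an assumption, not the platform's documented contract:

```python
import hashlib
import hmac
import time

def verify_webhook(secret: str, body: bytes, signature: str,
                   timestamp: str, max_age_s: int = 300) -> bool:
    """Reject stale deliveries (replay protection), then compare HMACs
    in constant time. Signing scheme is illustrative."""
    if abs(time.time() - float(timestamp)) > max_age_s:
        return False
    expected = hmac.new(secret.encode(),
                        timestamp.encode() + b"." + body,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

`hmac.compare_digest` matters here: a naive `==` comparison leaks timing information an attacker can use to forge signatures byte by byte.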
[06]

Approval + review events are first-class.

Every reviewer decision is logged as an event — who approved, what they saw, what they said. Regulated environments can trace any commit back to the human who signed off, across the entire lifecycle of the app.
  • Reviewer role + decision + free-text rationale
  • Artifact hash pinned at decision time
  • Exportable to compliance via audit bundle
approval events — customer-hub
  1. Apr 18 · 09:11 · spec.approved · alice@acme · 'LGTM, ship'
  2. Apr 18 · 10:32 · migration.approved · bob@acme · DBA
  3. Apr 18 · 10:34 · release.approved · bob@acme · 'rollback: pr_notes.md'
  4. Apr 18 · 11:02 · release.applied · prod · v2.8.0
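Pinning the artifact hash at decision time is what makes these events auditable: the approval binds to exact bytes, not a filename that may change later. A sketch of what such an event record might contain — field names are illustrative, not the exported event schema:

```python
import hashlib
import time

def approval_event(reviewer: str, role: str, decision: str,
                   rationale: str, artifact_bytes: bytes) -> dict:
    """Illustrative event shape: the sha256 of the artifact as the
    reviewer saw it is captured at the moment of the decision."""
    return {
        "ts": time.time(),
        "reviewer": reviewer,
        "role": role,
        "decision": decision,
        "rationale": rationale,
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
    }
```

An auditor can later re-hash the committed artifact and confirm it matches what was approved, with no trust in intermediate systems required.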

Per-span — token + cost + latency on every agent call
OTLP — native trace export · Honeycomb · Grafana · Datadog
Retry lineage — all attempts kept, not just the last
Exportable — audit bundle per release · SOC-ready

FAQ

Common questions

Can I set a hard spending cap?
Yes — per workspace, per app, or per workflow. When the cap is hit, new runs are blocked until the next budget window or a lead overrides it. Mid-run breaches fail the task cleanly with a budget_exceeded error.
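The hard-cap semantics above can be pictured as a running ledger that refuses the charge that would cross the cap. A sketch under that assumption; real enforcement happens server-side, and names here are illustrative:

```python
class BudgetExceeded(Exception):
    """Raised mid-run when a hard cap is breached, failing the task
    cleanly rather than letting the charge through."""

def charge(ledger: dict, workflow: str, cost_usd: float,
           cap_usd: float) -> float:
    """Record a cost against a workflow, refusing charges past the cap.
    The ledger entry is only updated when the charge is accepted."""
    spent = ledger.get(workflow, 0.0) + cost_usd
    if spent > cap_usd:
        raise BudgetExceeded(f"{workflow} would exceed ${cap_usd:.2f} cap")
    ledger[workflow] = spent
    return spent
```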
Do you log prompts + completions?
By default we keep enough to debug (inputs, summarised outputs, tool calls). Full prompt + completion storage is opt-in per workspace — useful for audit but costs more to retain. Retention is configurable from 7 days to 18 months.
How fast does a run show up in the dashboard?
Spans stream live — you see steps transition from `started` to `applied` in real time. End-to-end latency from agent completion to dashboard visibility is sub-second within the same region.
Can I query runs programmatically?
Yes — both a GraphQL API for interactive queries and a bulk export endpoint for periodic pulls into your warehouse. SDKs for Python + TypeScript. Schema documented under /docs/observability.
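A sketch of what an interactive pull might look like; the endpoint path, auth header, and every field name below are assumptions for illustration — the real schema lives under /docs/observability:

```python
import json
import urllib.request

# Illustrative query: field names are NOT the documented schema.
QUERY = """
query RunsByWorkflow($wf: String!, $since: String!) {
  runs(workflow: $wf, since: $since) {
    run_id agent tier cost_usd outcome
  }
}
"""

def fetch_runs(base_url: str, token: str, workflow: str, since: str):
    """POST a standard GraphQL payload (query + variables) and return
    the runs list. Endpoint and auth scheme are assumptions."""
    payload = json.dumps({"query": QUERY,
                          "variables": {"wf": workflow, "since": since}})
    req = urllib.request.Request(
        base_url + "/graphql",
        data=payload.encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]["runs"]
```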

See what your agents actually did — down to the token.

Debugging AI pipelines without spans is like debugging distributed systems without logs. We refuse to ship either way.