SERVICE · 01

Agents & Skills

Multi-step tools with evals, traces, and guardrails — operating in your stack, not a sandbox.

WHAT YOU GET

Typed contracts

Every tool, every input, every output has a Zod/JSON schema. The runtime enforces it; the LLM cannot drift past it.
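
As an illustration of what runtime enforcement of a contract can look like, here is a minimal dependency-free sketch. It is not the Zod or JSON Schema tooling named above; the `Field` type, the `validate` helper, and the `POSITION_CHECK_INPUT` contract are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any


class ContractError(ValueError):
    """Raised when a tool payload does not match its declared contract."""


@dataclass(frozen=True)
class Field:
    name: str
    type_: type
    required: bool = True


def validate(payload: dict[str, Any], fields: list[Field]) -> dict[str, Any]:
    """Return the payload unchanged if it satisfies the contract, else raise."""
    allowed = {f.name for f in fields}
    unknown = set(payload) - allowed
    if unknown:
        # Reject extra keys so model output cannot smuggle in undeclared fields.
        raise ContractError(f"unexpected fields: {sorted(unknown)}")
    for f in fields:
        if f.name not in payload:
            if f.required:
                raise ContractError(f"missing field: {f.name}")
            continue
        if not isinstance(payload[f.name], f.type_):
            raise ContractError(f"{f.name}: expected {f.type_.__name__}")
    return payload


# Hypothetical input contract for a position-check tool (assumption, not from the page).
POSITION_CHECK_INPUT = [
    Field("book_id", str),
    Field("notional_eur", float),
    Field("desk", str, required=False),
]
```

In this sketch the runtime would call `validate(args, POSITION_CHECK_INPUT)` before invoking the tool, so a malformed model output fails loudly instead of reaching the tool.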

Eval gates

Every release runs through deterministic + LLM-judge evals before promotion. Regressions block the deploy.

Traces + rollback plan

OpenTelemetry spans on every step; rollback is a one-line route swap behind a feature flag.

02 · Reference architecture

What we actually ship.

Every system we build follows this shape. Client at the edge, tools in a sandbox, traces everywhere, evaluators gating output. No black boxes, no "it works on my machine."

EDGE
Client: web · mobile · API
Gateway: auth · rate-limit · PII redact

ORCHESTRATION
Orchestrator: planner · router · memory (durable · exactly-once)

AGENTS & TOOLS
Retrieval: hybrid · rerank · cite
Tool calls: sandboxed · timeboxed
Evaluator: gates · rubrics · LLM-judge

DATA & TRUST
Vector + BM25: tenant-isolated
Traces / logs: OTel · replayable
Signed output: auditable · rollback

04 · Run

Watch an agent do the job.

Three real production scenarios, replayed at observed latency. Every box is a span; every span has tokens, cost, and an eval gate. This is what shows up in your traces, not a marketing animation.

POST /api/v1/agent/run
Trade desk submits a €4.2M block trade. Agent must reject, approve, or flag in <3s.

trace · compliance · span_id 7c1f…
orchestrator.run · 0ms
PLAN · orchestrator.plan · orchestrator · 80ms · 184 tok · $0.0014
Plan rationale

Pre-trade review requires applicable-rules retrieval, a market-data lookup, a position check, and a deterministic evaluator gate.

Subtasks
retrieval · tool.market_data · tool.position_check · evaluator
LATENCY 0ms · budget 3.00s
TOKENS 0 · in + out
COST $0.0000 · budget $0.025
EVAL GATE · deterministic + LLM-judge
APPROVED · 1 flag
Trade clears 11 of 12 rules. Rule 23.2 (book concentration) flagged for desk-head sign-off.
latency 2.41s · cost $0.014 · tokens 1,827 · evals 12/12
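
The verdict in the trace above could come from a deterministic gate that folds per-rule outcomes into one decision. The sketch below is one way to do that, under assumptions: the `Outcome` enum, the `decide` function, and every rule ID except 23.2 are hypothetical, and the reject/flag precedence is a guess at the policy rather than the production logic.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FLAG = "flag"   # needs human sign-off but does not block
    FAIL = "fail"   # blocks the trade


def decide(results: dict[str, Outcome]) -> str:
    """Fold per-rule outcomes into one verdict; any failure rejects outright."""
    fails = [rule for rule, o in results.items() if o is Outcome.FAIL]
    flags = [rule for rule, o in results.items() if o is Outcome.FLAG]
    if fails:
        return f"REJECTED · {len(fails)} failed"
    if flags:
        plural = "" if len(flags) == 1 else "s"
        return f"APPROVED · {len(flags)} flag{plural}"
    return "APPROVED"


# Eleven placeholder rules that clear, plus rule 23.2 flagged as in the trace.
results = {f"rule-{i}": Outcome.PASS for i in range(1, 12)}
results["23.2"] = Outcome.FLAG
verdict = decide(results)
```

Because the fold is a pure function of the rule outcomes, the same inputs replayed from a trace should always give the same verdict.
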
09 · STACK

Modern tools, composed cleanly.

Models: Claude, GPT-4, Llama 3
Runtime: Temporal, Inngest
Eval: Promptfoo, Braintrust
Observability: OpenTelemetry, Langfuse
Auth: Clerk, Auth0
Deploy: Vercel, AWS, GCP

10 · FAQ

FAQ · AI Agents & Skills

Every release runs through eval gates: deterministic checks (schema, latency, cost), tool-call coverage, and LLM-judge spot checks. The deploy stops if any of them regresses.
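
A regression gate of this kind can be sketched as a comparison between a baseline report and a candidate report. Everything here is an assumption made for illustration: the `EvalReport` fields, the `regressions` and `can_promote` helpers, and the 2% tolerance are not taken from Promptfoo, Braintrust, or any real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    schema_pass_rate: float     # fraction of outputs that passed schema validation
    p95_latency_s: float        # 95th-percentile latency in seconds
    cost_per_run_usd: float     # mean cost per run
    tool_call_coverage: float   # fraction of expected tool calls observed
    judge_score: float          # mean LLM-judge score on spot checks, 0..1


def regressions(baseline: EvalReport, candidate: EvalReport,
                tolerance: float = 0.02) -> list[str]:
    """Return the names of metrics where the candidate is worse than baseline."""
    failed = []
    # Higher is better: allow an absolute drop of at most `tolerance`.
    for name in ("schema_pass_rate", "tool_call_coverage", "judge_score"):
        if getattr(candidate, name) < getattr(baseline, name) - tolerance:
            failed.append(name)
    # Lower is better: allow a relative increase of at most `tolerance`.
    for name in ("p95_latency_s", "cost_per_run_usd"):
        if getattr(candidate, name) > getattr(baseline, name) * (1 + tolerance):
            failed.append(name)
    return failed


def can_promote(baseline: EvalReport, candidate: EvalReport) -> bool:
    """The deploy proceeds only when no metric regressed."""
    return not regressions(baseline, candidate)
```

A CI step would then exit non-zero whenever `can_promote` returns `False`, which is what stops the deploy.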

Budgets are per-call and per-day. We cap context with retrieval-truncation, route to cheaper models where the eval allows, and fall back to deterministic paths under cost pressure.
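
The budget logic described above can be sketched in a few lines. The model names, cost estimates, and helper names below are hypothetical, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    per_call_usd: float
    per_day_usd: float
    spent_today_usd: float = 0.0

    def remaining_today(self) -> float:
        return max(self.per_day_usd - self.spent_today_usd, 0.0)


# Hypothetical routes, most capable first, with an estimated cost per call.
ROUTES = [("large-model", 0.020), ("small-model", 0.004)]


def choose_route(budget: Budget,
                 eval_approved: frozenset[str] = frozenset({"large-model", "small-model"})) -> str:
    """Pick the most capable eval-approved model that fits both budgets."""
    ceiling = min(budget.per_call_usd, budget.remaining_today())
    for name, est_cost in ROUTES:
        if name in eval_approved and est_cost <= ceiling:
            return name
    # Under cost pressure, fall back to the deterministic path.
    return "deterministic"


def truncate_context(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep leading chunks that fit; chunks are assumed ranked best-first."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude token estimate (assumption)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

With a $0.025 per-call cap, the large model fits; once the day's spend leaves less than $0.020, the sketch routes to the small model, and below $0.004 it falls back to the deterministic path.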

Every path has a fallback. Deterministic logic catches the common cases; the agent path layers on top. Rollback is a one-line route swap behind a feature flag.
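
One way to picture the fallback and the flag-controlled route swap is the sketch below. The flag name `agent_path_enabled`, both handlers, and the 1M threshold are hypothetical; a real system would read the flag from its feature-flag service rather than a dict.

```python
from typing import Callable

Handler = Callable[[dict], str]


def deterministic_path(request: dict) -> str:
    """Hypothetical rule-based handler for the common cases."""
    return "approve" if request.get("notional_eur", 0) < 1_000_000 else "flag"


def failing_agent(request: dict) -> str:
    """Stand-in for an agent call that blows its latency budget."""
    raise TimeoutError("agent exceeded its latency budget")


def route(request: dict, flags: dict[str, bool], agent: Handler) -> str:
    # Rollback is flipping this single flag off: traffic returns to the
    # deterministic path without a redeploy.
    if not flags.get("agent_path_enabled", False):
        return deterministic_path(request)
    try:
        return agent(request)
    except Exception:
        # Any agent failure degrades to the deterministic answer.
        return deterministic_path(request)
```

The agent is passed in as a plain callable so the same `route` function can be exercised in tests with a stub that fails or succeeds on demand.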

Two weeks of shadowing during ramp-up, a runbook and eval suite delivered in your repo, and on-call overlap for the first incident. After that you own it.

START

Ship the first system.

Fixed-price discovery in 2 weeks. You leave with an architecture, a working spike, and a build plan.