Every system we build follows this shape. Client at the edge, tools in a sandbox, traces everywhere, evaluators gating output. No black boxes, no "it works on my machine."

EDGE

Client

web · mobile · API

Gateway

auth · rate-limit · PII redact

ORCHESTRATION

Orchestrator

planner · router · memory

Durable · exactly-once

AGENTS & TOOLS

Retrieval

hybrid · rerank · cite

Tool calls

sandboxed · timeboxed

Evaluator

gates · rubrics · LLM-judge

DATA & TRUST

Vector + BM25

tenant-isolated

Traces / logs

OTel · replayable

Signed output

auditable · rollback

04 · Run

Watch an agent do the job.

Three real production scenarios, replayed at observed latency. Every box is a span; every span has tokens, cost, and an eval gate. This is what shows up in your traces, not a marketing animation.

FINTECH · Pre-trade MiFID II review

HEALTHTECH · Clinical scribe · SOAP note

INSURANCE · Auto claim · FNOL triage

POST /api/v1/agent/run

Trade desk submits a €4.2M block trade. Agent must reject, approve, or flag in <3s.

trace · compliance · span_id 7c1f…

orchestrator.run · 0ms

PLAN · orchestrator.plan · orchestrator · 80ms · 184 tok · $0.0014

Plan rationale

Pre-trade review requires applicable-rules retrieval, a market-data lookup, a position check, and a deterministic evaluator gate.

Subtasks

retrieval · tool.market_data · tool.position_check · evaluator

LATENCY · budget 3.00s

TOKENS · in + out

COST · budget $0.025

EVAL GATE · deterministic + LLM-judge

APPROVED · 1 flag

Trade clears 11 of 12 rules. Rule 23.2 (book concentration) flagged for desk-head sign-off.

latency 2.41s · cost $0.014 · tokens 1,827 · evals 12/12

07 · ENGAGEMENT

Three ways to work with us.

Discovery sprint

Fixed · 2 weeks

Architecture, working spike, build plan. You leave with a repo.

You get

Architecture diagram
Risk register
Working prototype
Build plan + estimate

For teams sizing up an agentic system.

Most common

Build pod

Fixed · 8–14 weeks

A pod of 2–4 senior engineers ships in your repo, your stack, against your evals.

You get

Production system
Eval gates
Trace dashboards
Rollback plan
Runbook

For teams ready to put agents in production.

Stewardship

Monthly · rolling

We stay close as the system runs: tuning, eval expansions, scope additions.

You get

Oncall rotation overlap
Eval suite maintenance
Cost + perf reviews
Quarterly roadmap

For teams owning the system long-term.

05 · PROCESS

How we ship agents.

Three phases, fixed-price discovery, your repo as the artifact.

01 discover · ∼1 week

Discover

We listen, audit, and spike. You leave with an architecture and a working proof.

Architecture diagram
Risk register
Working prototype

Deliverable

Architecture · spike · estimate

Who

Principal eng + designer

02 build · ∼6–10 weeks

Build

We ship in your repo, with your stack, against your evals.

Eval gates
Trace dashboards
Rollback plan

Deliverable

Production system · runbook

Who

Pod of 2–4 engineers

03 deploy · ∼2 weeks

Deploy

Ramp, monitor, hand off. You own the code and the playbooks.

Canary rollout
Oncall handoff
Knowledge transfer

Deliverable

Live system · trained team

Who

Same pod + ops engineer

09 · STACK

Modern tools, composed cleanly.

Models

Claude, GPT-4, Llama 3

Runtime

Temporal, Inngest

Retrieval

pgvector, Qdrant, BM25

Eval

Promptfoo, Braintrust

Observability

OpenTelemetry, Langfuse

Auth

Clerk, Auth0

Vector index

Embeddings · hybrid

Codegen

OpenAPI, Smithy, Zod

Deploy

Vercel, AWS, GCP

Storage

Postgres, S3, R2

Orchestration

Workflows, sagas

Messaging

NATS, SQS, typed envelopes

08 · PROOF

Outcomes, not demos.

“They shipped a system that survived our peak weeks. The evals caught two regressions that would have made it to production otherwise.”

VP of Engineering

Fintech, Europe

review-time cut · p95 latency · rules covered

08 · TRUST

Built for production.

Eval gates, not vibe checks

Deterministic + LLM-judge evals run pre-deploy
Tool-call coverage + schema validation enforced
Latency + cost budgets gated per release

Senior engineers, no handoffs

Same pod from spike to production
12+ years of experience on average, across ML, distributed systems, and oncall
You meet the engineers in week one, not week ten

You own the code

Your repo, your stack, your monitoring
No vendor lock-in, no proprietary runtime
Knowledge transfer in the week of go-live

10 · FAQ

Frequently asked.

01 · What does a discovery sprint cost?

Fixed-price, scoped to two weeks. We can share a rate sheet on request. The goal is that you leave with something concrete (architecture + spike) whether or not you continue with us.

02 · How long does a typical build pod run?

Eight to fourteen weeks from kick-off to production rollout. Discovery is included if you do it with us; otherwise we work from your existing spec.

03 · Who owns the code?

You do. The repo lives in your org from day one. We commit there. We do not run anything proprietary that you cannot replicate.

04 · How do you decide when an agent is ready for production?

Every release runs through eval gates: deterministic checks (schema, latency, cost), tool-call coverage, and LLM-judge spot checks. The deploy stops if any of them regress.
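A minimal sketch of such a gate, assuming each eval reports a score between 0 and 1 alongside the score of the last released version. The names `EvalResult` and `gate`, and the sample scores, are hypothetical and only illustrate the "stop on any regression" rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str
    score: float     # higher is better, 0.0 to 1.0
    baseline: float  # score recorded for the last released version


def gate(results: list[EvalResult], tolerance: float = 0.0) -> tuple[bool, list[str]]:
    """Return (ok, regressed_names). Any regression blocks the deploy."""
    regressed = [r.name for r in results if r.score < r.baseline - tolerance]
    return (not regressed, regressed)


# Illustrative numbers only, not measured results.
results = [
    EvalResult("schema_valid", 1.00, 1.00),
    EvalResult("tool_call_coverage", 0.97, 0.95),
    EvalResult("llm_judge_spot_check", 0.88, 0.91),
]
ok, regressed = gate(results)
print("deploy" if ok else f"blocked by: {regressed}")
```

In a CI job the boolean would typically become the process exit code, so a regression fails the pipeline rather than merely logging a warning.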

05 · What about runaway LLM costs?

Budgets are per-call and per-day. We cap context with retrieval-truncation, route to cheaper models where the eval allows, and fall back to deterministic paths under cost pressure.
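One way to picture the budget logic, as a sketch only. The `Budget` class, the model names, and the cost estimates are hypothetical, and the preference list is assumed to contain only models the evals have already approved for the task.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    per_call_usd: float
    per_day_usd: float
    spent_today_usd: float = 0.0

    def route(self, estimated_cost_usd: dict[str, float], preferred: list[str]) -> str:
        """Pick the first model, in preference order, that fits both budgets.

        Returns "deterministic" when no model fits, i.e. the non-LLM path.
        """
        remaining = self.per_day_usd - self.spent_today_usd
        for model in preferred:
            cost = estimated_cost_usd[model]
            if cost <= self.per_call_usd and cost <= remaining:
                return model
        return "deterministic"

    def record(self, cost_usd: float) -> None:
        """Add the actual cost of a finished call to today's spend."""
        self.spent_today_usd += cost_usd


# Hypothetical per-call cost estimates for two model tiers.
estimates = {"large": 0.014, "small": 0.002}
order = ["large", "small"]

fresh = Budget(per_call_usd=0.025, per_day_usd=0.05)
nearly_spent = Budget(per_call_usd=0.025, per_day_usd=0.05, spent_today_usd=0.04)
exhausted = Budget(per_call_usd=0.025, per_day_usd=0.05, spent_today_usd=0.05)

print(fresh.route(estimates, order))
print(nearly_spent.route(estimates, order))
print(exhausted.route(estimates, order))
```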

06 · What happens when the LLM is down or wrong?

Every path has a fallback. Deterministic logic catches the common cases; the agent path layers on top. Rollback is a one-line route swap behind a feature flag.
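The layering can be sketched as below. Everything here is an assumption for illustration: the flag name `agent_path`, the notional threshold in `deterministic_review`, and the shape of the trade dictionary are not taken from a real system.

```python
from typing import Callable

Trade = dict
Decision = str


def deterministic_review(trade: Trade) -> Decision:
    # Hypothetical rule: anything above a notional limit goes to a human.
    return "flag" if trade["notional_eur"] > 1_000_000 else "approve"


def make_handler(agent: Callable[[Trade], Decision],
                 flags: dict[str, bool]) -> Callable[[Trade], Decision]:
    """Wrap the agent path so the deterministic path is always available."""
    def handle(trade: Trade) -> Decision:
        # Rollback is a flag flip: with the flag off, the agent is never called.
        if not flags.get("agent_path", False):
            return deterministic_review(trade)
        try:
            return agent(trade)
        except Exception:
            # Model outage, timeout, or malformed output: fall back.
            return deterministic_review(trade)
    return handle


def failing_agent(trade: Trade) -> Decision:
    raise TimeoutError("model unavailable")


flags = {"agent_path": True}
handler = make_handler(failing_agent, flags)
print(handler({"notional_eur": 4_200_000}))
```

Catching a bare `Exception` is deliberate in this sketch; a production handler would likely narrow it and emit a trace event on every fallback.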

07 · How do you hand the system off?

Two-week shadowing during ramp, runbook + eval suite delivered in your repo, oncall overlap for the first incident. After that you own it.

08 · How do you measure RAG quality?

Faithfulness (does the answer follow from cited docs?), grounding (are citations real?), and recall (did we find the right docs?). Every release runs the suite.
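Two of these metrics are simple enough to sketch directly. The version below assumes documents are identified by string IDs and treats a citation as "real" when it points at a document that was actually retrieved. Faithfulness is left out because it usually needs a judge model or human labels. Function names and sample IDs are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


def grounding(cited: list[str], retrieved: list[str]) -> float:
    """Fraction of citations that point at a document that was retrieved."""
    if not cited:
        return 0.0  # an answer with no citations is treated as ungrounded
    pool = set(retrieved)
    return sum(1 for doc_id in cited if doc_id in pool) / len(cited)


retrieved = ["d1", "d4", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
cited = ["d1", "d7"]

print(recall_at_k(retrieved, relevant, k=3))
print(grounding(cited, retrieved))
```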

09 · Can workflows be replayed for debugging?

Yes. Durable execution captures inputs at every step. Replay against a known-bad input reproduces the failure exactly, including non-deterministic LLM calls when seed-able.
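A toy illustration of replay. Note the swapped-in technique: instead of re-seeding the model, this sketch journals each step's output and returns the recorded value on replay, which is one common way durable-execution runtimes get deterministic replays. The `Journal` class and the `review` workflow are hypothetical, and `random.random()` stands in for a non-deterministic LLM call.

```python
import random
from typing import Any, Callable, Optional


class Journal:
    """Records step inputs and outputs live; serves recorded outputs on replay."""

    def __init__(self, recorded: Optional[list[dict]] = None) -> None:
        self.replaying = recorded is not None
        self.entries: list[dict] = list(recorded) if recorded else []
        self._cursor = 0

    def step(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self.replaying:
            entry = self.entries[self._cursor]
            self._cursor += 1
            if entry["name"] != name:
                raise RuntimeError(f"replay diverged at step {name!r}")
            return entry["output"]  # fn is not called during replay
        output = fn(*args)
        self.entries.append({"name": name, "input": list(args), "output": output})
        return output


def review(journal: Journal, trade_id: str) -> dict:
    rules = journal.step("retrieve_rules", lambda t: ["23.2"], trade_id)
    # Stand-in for a non-deterministic model call.
    score = journal.step("llm_judge", lambda r: random.random(), rules)
    return {"trade_id": trade_id, "rules": rules, "score": score}


live = Journal()
first = review(live, "T-1")
replayed = review(Journal(live.entries), "T-1")
print(first == replayed)
```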

10 · How do agents authenticate to each other?

Each agent has a typed identity + mTLS by default. We use scoped tokens with short lifetimes; calls between agents carry trace IDs so the auditor can reconstruct the flow.
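The token part can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production scheme: it uses a shared HMAC secret where a real deployment would more likely use asymmetric signatures, it does not show mTLS at all, and the claim names, the scope string, and `SECRET` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-only-secret"  # hypothetical; never hard-code a real key


def issue(agent: str, scope: str, trace_id: str,
          ttl_s: int = 60, now: Optional[float] = None) -> str:
    """Mint a short-lived token bound to one scope and one trace."""
    now = time.time() if now is None else now
    claims = {"agent": agent, "scope": scope, "trace_id": trace_id, "exp": now + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + sig).decode()


def verify(token: str, required_scope: str, now: Optional[float] = None) -> dict:
    """Check signature, expiry, and scope; return the claims if all pass."""
    now = time.time() if now is None else now
    body, _, sig = token.encode().partition(b".")
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < now:
        raise PermissionError("token expired")
    if claims["scope"] != required_scope:
        raise PermissionError("scope not granted")
    return claims


token = issue("planner", "tools:market_data", "trace-123", ttl_s=60, now=1000.0)
claims = verify(token, "tools:market_data", now=1030.0)
print(claims["agent"], claims["trace_id"])
```

Carrying the trace ID inside the signed claims means an auditor can tie each inter-agent call back to the trace that caused it.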

11 · How do you handle schema changes without breaking clients?

Additive-only by default, with a `version` field on every envelope and a deprecation policy that's checked in CI. Breaking changes go through a parallel-version window.
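A CI check for the additive-only rule could look roughly like this. It assumes a schema is represented as a flat mapping from field name to type name and that newly added fields are optional. The representation and the function name `breaking_changes` are hypothetical.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes that violate additive-only evolution.

    Removing a field or changing its type is breaking; adding a field is not.
    """
    problems = []
    for field, type_name in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != type_name:
            problems.append(f"type changed: {field} ({type_name} -> {new[field]})")
    return problems


v1 = {"version": "int", "trade_id": "str", "notional_eur": "float"}
additive = {**v1, "desk": "str"}                           # allowed
breaking = {"version": "int", "trade_id": "int"}          # not allowed

print(breaking_changes(v1, additive))
print(breaking_changes(v1, breaking))
```

A non-empty list would fail the CI job unless the change ships under a new `version` value during a parallel-version window.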

START

Ship the first system.

Fixed-price discovery in 2 weeks. You leave with an architecture, a working spike, and a build plan.

Start a project · See engagement models