AI Agents & Skills — Tzar.Tech

Typed contracts

Every tool, every input, every output has a Zod/JSON schema. The runtime enforces it; the LLM cannot drift past it.
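A minimal sketch of what a typed tool contract can look like at the runtime boundary. It uses a hand-written validator as a dependency-free stand-in for a Zod or JSON schema; the tool name `callPositionCheck` and the fields `bookId` and `notionalEur` are hypothetical, chosen only for illustration.

```typescript
// Hypothetical tool contract: field names are illustrative, not from a real system.
interface PositionCheckInput {
  bookId: string;
  notionalEur: number;
}

type ParseResult<T> =
  | { status: "ok"; value: T }
  | { status: "invalid"; error: string };

// Runtime check applied to whatever the model emits as tool arguments.
function parsePositionCheckInput(raw: unknown): ParseResult<PositionCheckInput> {
  if (typeof raw !== "object" || raw === null) {
    return { status: "invalid", error: "input must be an object" };
  }
  const candidate = raw as Record<string, unknown>;
  const bookId = candidate["bookId"];
  const notionalEur = candidate["notionalEur"];
  if (typeof bookId !== "string" || bookId.length === 0) {
    return { status: "invalid", error: "bookId must be a non-empty string" };
  }
  if (typeof notionalEur !== "number" || !isFinite(notionalEur) || notionalEur <= 0) {
    return { status: "invalid", error: "notionalEur must be a positive finite number" };
  }
  return { status: "ok", value: { bookId, notionalEur } };
}

// The tool body only ever sees validated input; malformed arguments are rejected first.
function callPositionCheck(rawArgs: unknown): string {
  const parsed = parsePositionCheckInput(rawArgs);
  if (parsed.status === "invalid") {
    return `rejected: ${parsed.error}`;
  }
  return `checked ${parsed.value.bookId} for ${parsed.value.notionalEur} EUR`;
}
```

The point of the shape: the model can emit anything, but only values that pass the parser reach tool code, and every rejection carries a reason that can be logged or fed back to the model.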

Eval gates

Every release runs through deterministic and LLM-judge evals before promotion. Regressions block the deploy.
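One way such a gate could be expressed, as a sketch under stated assumptions: the metric names (`schemaPassRate`, `p95LatencyMs`, `judgeScore`) and the judge tolerance are hypothetical, and a real suite would define its own metrics and thresholds.

```typescript
// Hypothetical metric names; a real eval suite would define its own.
interface EvalReport {
  schemaPassRate: number; // deterministic: fraction of outputs that validate
  p95LatencyMs: number;   // deterministic: observed p95 latency
  judgeScore: number;     // LLM-judge: mean rubric score in [0, 1]
}

interface GateDecision {
  promote: boolean;
  regressions: string[];
}

// Compare a candidate release against the current baseline.
// Any regression blocks promotion.
function evalGate(
  baseline: EvalReport,
  candidate: EvalReport,
  judgeTolerance: number = 0.02 // assumption: judge scores are noisy, so allow a small margin
): GateDecision {
  const regressions: string[] = [];
  if (candidate.schemaPassRate < baseline.schemaPassRate) {
    regressions.push("schemaPassRate");
  }
  if (candidate.p95LatencyMs > baseline.p95LatencyMs) {
    regressions.push("p95LatencyMs");
  }
  if (candidate.judgeScore < baseline.judgeScore - judgeTolerance) {
    regressions.push("judgeScore");
  }
  return { promote: regressions.length === 0, regressions };
}
```

In CI, the deploy step would run only when `promote` is true; the `regressions` list says which check failed.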

Traces + rollback plan

OpenTelemetry spans on every step; rollback is a one-line route swap behind a feature flag.
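The route swap can be sketched as a flag lookup in front of two handlers. Everything named here (`flags`, `tradeReviewRoute`, `reviewTrade`) is hypothetical; a real system would read the flag from a flag service rather than an in-memory object.

```typescript
type Route = "agent" | "deterministic";

// Hypothetical in-memory flag store standing in for a real flag service.
const flags: { tradeReviewRoute: Route } = { tradeReviewRoute: "agent" };

// Both paths expose the same signature, so swapping them needs no caller changes.
const handlers: Record<Route, (tradeId: string) => string> = {
  agent: (tradeId) => `agent reviewed ${tradeId}`,
  deterministic: (tradeId) => `rules engine reviewed ${tradeId}`,
};

// Every request resolves its handler through the flag at call time.
function reviewTrade(tradeId: string): string {
  return handlers[flags.tradeReviewRoute](tradeId);
}
```

Rolling back is then the single assignment `flags.tradeReviewRoute = "deterministic"`, with no redeploy of the calling code.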

What we actually ship.

Every system we build follows this shape. Client at the edge, tools in a sandbox, traces everywhere, evaluators gating output. No black boxes, no "it works on my machine."

EDGE

Client

web · mobile · API

Gateway

auth · rate-limit · PII redact

ORCHESTRATION

Orchestrator

planner · router · memory

Durable · exactly-once

AGENTS & TOOLS

Retrieval

hybrid · rerank · cite

Tool calls

sandboxed · timeboxed

Evaluator

gates · rubrics · LLM-judge

DATA & TRUST

Vector + BM25

tenant-isolated

Traces / logs

OTel · replayable

Signed output

auditable · rollback

Watch an agent do the job.

Three real production scenarios, replayed at observed latency. Every box is a span; every span has tokens, cost, and an eval gate. This is what shows up in your traces, not a marketing animation.

POST /api/v1/agent/run

Trade desk submits a €4.2M block trade. Agent must reject, approve, or flag in <3s.

trace · compliance · span_id 7c1f…

orchestrator.run · 0ms

PLAN · orchestrator.plan · orchestrator · 80ms · 184 tok · $0.0014

Plan rationale

Pretrade review requires: applicable-rules retrieval, market-data lookup, position-check, and a deterministic evaluator gate.

Subtasks

retrieval · tool.market_data · tool.position_check · evaluator

LATENCY budget 3.00s · COST budget $0.025 · EVAL GATE deterministic + LLM-judge

APPROVED · 1 flag

Trade clears 11 of 12 rules. Rule 23.2 (book concentration) flagged for desk-head sign-off.

latency 2.41s · cost $0.014 · tokens 1,827 · evals 12/12

FAQ · AI Agents & Skills

01 How do you decide when an agent is ready for production?

Every release runs through eval gates: deterministic checks (schema, latency, cost), tool-call coverage, and LLM-judge spot checks. The deploy stops if any of them regress.

02 What about runaway LLM costs?

Budgets are per-call and per-day. We cap context with retrieval-truncation, route to cheaper models where the eval allows, and fall back to deterministic paths under cost pressure.
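A sketch of how per-call and per-day budgets might drive routing. Model names, prices, and limits are hypothetical placeholders, and `passesEval` is an assumed flag recording whether a model cleared the eval gate for the task.

```typescript
// All model names, prices, and limits here are hypothetical placeholders.
interface ModelOption {
  name: string;
  costPerCallUsd: number;
  passesEval: boolean; // assumed: whether this model cleared the eval gate for the task
}

interface Budget {
  perCallUsd: number;
  perDayUsd: number;
}

type RouteChoice = { kind: "model"; name: string } | { kind: "deterministic" };

// Pick the cheapest eval-approved model that fits both budgets;
// fall back to the deterministic path when nothing fits.
function chooseRoute(models: ModelOption[], budget: Budget, spentTodayUsd: number): RouteChoice {
  const remainingToday = budget.perDayUsd - spentTodayUsd;
  const affordable = models
    .filter((m) => m.passesEval)
    .filter((m) => m.costPerCallUsd <= budget.perCallUsd && m.costPerCallUsd <= remainingToday)
    .sort((a, b) => a.costPerCallUsd - b.costPerCallUsd);
  const cheapest = affordable[0];
  if (cheapest === undefined) {
    return { kind: "deterministic" };
  }
  return { kind: "model", name: cheapest.name };
}
```

Filtering on `passesEval` before sorting by cost is the design choice that matters: the router never trades quality it has not measured for savings.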

03 What happens when the LLM is down or wrong?

Every path has a fallback. Deterministic logic catches the common cases; the agent path layers on top. Rollback is a one-line route swap behind a feature flag.
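The layering can be sketched as a wrapper that tries the agent path and drops to deterministic logic when the call fails or returns something outside the contract. The thresholds in `deterministicReview` and the `Trade` fields are hypothetical, and "wrong" is detected here only as an out-of-contract output, not as a bad judgment.

```typescript
type Verdict = "approve" | "reject" | "flag";

interface Trade {
  notionalEur: number;
  concentrationPct: number;
}

interface ReviewResult {
  verdict: Verdict;
  path: "agent" | "fallback";
}

// Deterministic baseline with hypothetical thresholds.
function deterministicReview(trade: Trade): Verdict {
  if (trade.concentrationPct > 25) return "flag";
  return trade.notionalEur > 10000000 ? "reject" : "approve";
}

// Contract check on the agent's output.
function isVerdict(value: unknown): value is Verdict {
  return value === "approve" || value === "reject" || value === "flag";
}

// Try the agent first; use the deterministic path if it throws
// or returns a value outside the contract.
function reviewWithFallback(trade: Trade, agentReview: (t: Trade) => unknown): ReviewResult {
  try {
    const raw = agentReview(trade);
    if (isVerdict(raw)) {
      return { verdict: raw, path: "agent" };
    }
  } catch {
    // Agent unavailable: fall through to the deterministic path.
  }
  return { verdict: deterministicReview(trade), path: "fallback" };
}
```

Recording `path` on every result keeps fallbacks visible in traces, so a silent rise in fallback rate shows up as a metric rather than a surprise.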

04 How do you hand the system off?

Two weeks of shadowing during ramp, runbook and eval suite delivered in your repo, on-call overlap for the first incident. After that, you own it.
