Benchmarking

Kortex ships a benchmark harness that measures the cost and routing-quality advantage of Kortex's heuristic router over static model assignment strategies. Run it before deploying to production to quantify the routing benefit for your specific workload.


Quick Start

import asyncio
from kortex.benchmark.harness import BenchmarkHarness
from kortex.core.router import ProviderModel

models = [
    ProviderModel(
        provider="openai", model="gpt-4o-mini",
        cost_per_1k_input_tokens=0.00015, cost_per_1k_output_tokens=0.0006,
        avg_latency_ms=250, capabilities=["reasoning", "analysis"], tier="fast",
    ),
    ProviderModel(
        provider="anthropic", model="claude-sonnet-4-20250514",
        cost_per_1k_input_tokens=0.003, cost_per_1k_output_tokens=0.015,
        avg_latency_ms=800,
        capabilities=["reasoning", "analysis", "code_generation", "content_generation"],
        tier="balanced",
    ),
    ProviderModel(
        provider="anthropic", model="claude-opus-4-20250514",
        cost_per_1k_input_tokens=0.015, cost_per_1k_output_tokens=0.075,
        avg_latency_ms=2000,
        capabilities=["reasoning", "analysis", "code_generation", "content_generation"],
        tier="powerful",
    ),
]

async def main():
    harness = BenchmarkHarness(models)
    report = await harness.full_benchmark()
    print(report.to_markdown())
    print(report.summary)

asyncio.run(main())

No API keys or network access are required; the harness works entirely from each model's cost and latency metadata.
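To make the "metadata only" point concrete, here is a minimal sketch of how a per-task cost can be estimated from per-1k-token prices alone. The `estimate_cost` helper and the token counts are hypothetical and are not part of the Kortex API; only the prices come from the Quick Start models above.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    cost_per_1k_input_tokens: float,
    cost_per_1k_output_tokens: float,
) -> float:
    """Estimate the dollar cost of one call from per-1k-token prices."""
    return (
        (input_tokens / 1000) * cost_per_1k_input_tokens
        + (output_tokens / 1000) * cost_per_1k_output_tokens
    )


# Hypothetical task size: 2,000 input tokens and 500 output tokens.
mini_cost = estimate_cost(2000, 500, 0.00015, 0.0006)  # gpt-4o-mini prices
opus_cost = estimate_cost(2000, 500, 0.015, 0.075)     # claude-opus-4 prices

print(f"gpt-4o-mini:   ${mini_cost:.6f}")
print(f"claude-opus-4: ${opus_cost:.6f}")
```

Because nothing here calls a provider, the same arithmetic can be repeated over a whole dataset offline.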


Task Datasets

Three pre-built datasets cover different real-world workload profiles:

| Dataset           | Tasks | Profile                                  |
|-------------------|-------|------------------------------------------|
| mixed             | 100   | 40% simple / 35% moderate / 25% complex  |
| cost_sensitive    | 100   | All have cost ceiling constraints        |
| latency_sensitive | 100   | All have tight latency SLA requirements  |

from kortex.benchmark.harness import TaskDataset

mixed     = TaskDataset.mixed_workload(n=100)
cost      = TaskDataset.cost_sensitive(n=100)
latency   = TaskDataset.latency_sensitive(n=100)

Baseline Strategies

The harness compares Kortex routing against three static baselines:

| Strategy  | Behaviour                                          |
|-----------|----------------------------------------------------|
| cheapest  | Always routes to the cheapest registered model     |
| strongest | Always routes to the most powerful registered model |
| random    | Random model selection (lower bound)               |
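The three baselines can be sketched in a few lines of standalone Python. This is an illustration only, not Kortex's implementation: the `Model` dataclass, the `TIER_RANK` ordering, and the choice to rank "cheapest" by the sum of input and output prices are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    # Hypothetical stand-in for ProviderModel, reduced to the fields used here.
    name: str
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float
    tier: str


# Assumed ordering of tiers from weakest to strongest.
TIER_RANK = {"fast": 0, "balanced": 1, "powerful": 2}


def cheapest(models: list[Model]) -> Model:
    """Pick the model with the lowest combined per-1k-token price."""
    return min(
        models,
        key=lambda m: m.cost_per_1k_input_tokens + m.cost_per_1k_output_tokens,
    )


def strongest(models: list[Model]) -> Model:
    """Pick the model in the highest tier."""
    return max(models, key=lambda m: TIER_RANK[m.tier])


def random_choice(models: list[Model], rng: random.Random) -> Model:
    """Pick a model uniformly at random."""
    return rng.choice(models)


MODELS = [
    Model("gpt-4o-mini", 0.00015, 0.0006, "fast"),
    Model("claude-sonnet-4-20250514", 0.003, 0.015, "balanced"),
    Model("claude-opus-4-20250514", 0.015, 0.075, "powerful"),
]

print(cheapest(MODELS).name)
print(strongest(MODELS).name)
print(random_choice(MODELS, random.Random(0)).name)
```

Each baseline ignores the task entirely, which is what makes it a static strategy.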

CLI Commands

Run the full benchmark suite

kortex benchmark run
kortex benchmark run --dataset cost_sensitive
kortex benchmark run --dataset latency_sensitive
kortex benchmark run --output results.json

Compare routing vs. a baseline under a specific policy

kortex benchmark compare --policy examples/policies/cost_optimized.toml --baseline cheapest
kortex benchmark compare --policy examples/policies/quality_first.toml --baseline strongest

Report Format

BenchmarkReport.to_markdown() produces a table like:

| Dataset           | Kortex Cost | Baseline Cost | Savings | Kortex P95 | Baseline P95 |
|-------------------|-------------|---------------|---------|------------|--------------|
| mixed             | $0.0412     | $0.1850       | 77.7%   | 800ms      | 2000ms       |
| cost_sensitive    | $0.0087     | $0.0450       | 80.7%   | 250ms      | 800ms        |
| latency_sensitive | $0.0031     | $0.1850       | 98.3%   | 250ms      | 2000ms       |

BenchmarkReport.summary is a single human-readable sentence suitable for logging.
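The Savings column in the sample table is consistent with the usual relative-reduction formula, (1 - Kortex cost / baseline cost) × 100. The sketch below checks that against the sample figures; `savings_pct` is a hypothetical helper written for this example, and the exact calculation behind the report is not documented here.

```python
def savings_pct(kortex_cost: float, baseline_cost: float) -> float:
    """Relative cost reduction of Kortex versus a baseline, in percent."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (1 - kortex_cost / baseline_cost) * 100


# (Kortex cost, baseline cost) pairs copied from the sample table.
SAMPLE_ROWS = {
    "mixed": (0.0412, 0.1850),
    "cost_sensitive": (0.0087, 0.0450),
    "latency_sensitive": (0.0031, 0.1850),
}

for dataset, (kortex_cost, baseline_cost) in SAMPLE_ROWS.items():
    print(f"{dataset}: {savings_pct(kortex_cost, baseline_cost):.1f}%")
# mixed: 77.7%
# cost_sensitive: 80.7%
# latency_sensitive: 98.3%
```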


Programmatic API

import asyncio
from kortex.benchmark.harness import BaselineStrategy, BenchmarkHarness, TaskDataset

async def main():
    # `models` is the list of ProviderModel instances defined in Quick Start
    harness = BenchmarkHarness(models)
    dataset = TaskDataset.mixed_workload(n=100)

    # Run individual phases
    kortex_run = await harness.run_kortex(dataset)
    baseline_run = await harness.run_baseline(dataset, BaselineStrategy.CHEAPEST)

    # Compare
    comparison = harness.compare(kortex_run, baseline_run)
    print(f"Cost savings: {comparison.cost_savings_pct:.1f}%")
    print(f"Routing failures: {comparison.kortex_routing_failures}")

asyncio.run(main())

Running the Example

python examples/benchmark_example.py

This registers 6 models across 3 tiers, runs the full suite, and prints a markdown report.