Agent Benchmarking¶
The benchmarking system measures agent correctness, response quality, and latency over time — enabling iterative improvement.
Quick Start¶
# Run all agents
python3 benchmark.py
# Run one agent
python3 benchmark.py --agent calculator
# Show historical results
python3 benchmark.py --report
# Use a different model
python3 benchmark.py --model llama3.2
How It Works¶
Each benchmark run:
- Loads test cases from benchmarks/*_benchmark.json
- Sends each input through the full chatbot path (keyword detection → agent dispatch → LLM response)
- Scores the response on correctness and latency
- Stores results in web_chatbot.db
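The loading step can be pictured with a short sketch. This is not the code in benchmark.py: the function name `load_cases` and its `bench_dir` parameter are assumptions made for illustration. Only the benchmarks/*_benchmark.json glob and the agent/cases file layout come from this page.

```python
import json
from pathlib import Path


def load_cases(bench_dir="benchmarks"):
    """Return (agent, case) pairs from every *_benchmark.json file.

    Hypothetical helper: benchmark.py may organize this step differently.
    """
    pairs = []
    # sorted() keeps the order of files stable from run to run
    for path in sorted(Path(bench_dir).glob("*_benchmark.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        for case in data.get("cases", []):
            pairs.append((data["agent"], case))
    return pairs
```

Calling `load_cases()` from the project root would yield one pair per test case, which a runner could then send through the chatbot one at a time.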
Scoring¶
| Metric | Description |
|---|---|
| Correctness | 0.0–1.0 ratio of expected keywords/patterns found in response |
| Latency | End-to-end response time in milliseconds |
| Passed | correctness == 1.0 AND latency_ms <= max_latency_ms |
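The scoring rules in the table can be sketched as two small functions. The function names are hypothetical, and two details are assumptions rather than documented behavior: keywords and patterns are pooled into a single ratio, and patterns are applied with a case-sensitive `re.search`.

```python
import re


def score_correctness(response, expected_keywords, expected_patterns):
    """Ratio (0.0 to 1.0) of expected keywords/patterns found in the response."""
    total = len(expected_keywords) + len(expected_patterns)
    if total == 0:
        # Nothing to check: this page states that such cases score 1.0
        return 1.0
    text = response.lower()
    # Keywords are matched case-insensitively, as documented
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    # Assumption: each pattern counts as found if re.search matches anywhere
    hits += sum(1 for pat in expected_patterns if re.search(pat, response))
    return hits / total


def is_passed(correctness, latency_ms, max_latency_ms):
    """Pass rule from the table: full correctness within the latency limit."""
    return correctness == 1.0 and latency_ms <= max_latency_ms
```

Under these assumptions, a response that contains "weather" and a value such as "18°" but never mentions "temperature" would match two of three checks, so it would fail even if it arrived well inside the latency limit.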
Test Case Format¶
Each agent has a JSON file in benchmarks/:
{
  "agent": "calculator",
  "cases": [
    {
      "id": "calc_addition",
      "input": "What is 12 + 45?",
      "expected_keywords": ["57"],
      "expected_patterns": [],
      "max_latency_ms": 30000
    }
  ]
}
| Field | Description |
|---|---|
| id | Unique case identifier |
| input | Natural language input sent to the chatbot |
| expected_keywords | Strings that must appear in the response (case-insensitive) |
| expected_patterns | Regex patterns that must match in the response |
| max_latency_ms | Maximum acceptable response time in milliseconds |
If expected_keywords and expected_patterns are both empty, correctness is set to 1.0 automatically, so the case passes whenever it meets max_latency_ms (useful for gmail when credentials may be absent).
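A file in this format can be checked before a run with a small validator. The sketch below is hypothetical and not part of benchmark.py. It assumes all five fields are required and typed as in the example above, which this page does not state outright.

```python
import re

# Field names come from the table above; the types are inferred from the example
REQUIRED_FIELDS = {
    "id": str,
    "input": str,
    "expected_keywords": list,
    "expected_patterns": list,
    "max_latency_ms": int,
}


def validate_benchmark(data):
    """Return a list of problems found in one parsed benchmark file."""
    problems = []
    if not isinstance(data.get("agent"), str):
        problems.append("'agent' is missing or not a string")
    seen_ids = set()
    for index, case in enumerate(data.get("cases", [])):
        case_id = case.get("id")
        label = case_id if isinstance(case_id, str) else f"case #{index}"
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(case.get(field), expected_type):
                problems.append(
                    f"{label}: '{field}' is missing or not {expected_type.__name__}"
                )
        if isinstance(case_id, str):
            if case_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case_id)
        patterns = case.get("expected_patterns")
        for pattern in patterns if isinstance(patterns, list) else []:
            try:
                re.compile(pattern)  # catches regex typos before a run
            except (re.error, TypeError) as exc:
                problems.append(f"{label}: bad regex {pattern!r}: {exc}")
    return problems
```

An empty list means the file looks well-formed. Anything else is a human-readable description of what to fix.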
Included Test Cases¶
Calculator¶
- calc_addition: "What is 12 + 45?" → expects "57"
- calc_multiplication: "Calculate 7 * 8" → expects "56"
- calc_division: "What is 100 divided by 4?" → expects "25"
Time¶
- time_current: "What time is it?" → expects an HH:MM pattern
- time_date: "What is today's date?" → expects a 4-digit year
Weather¶
- weather_city: "What is the weather in London?" → expects "weather", "temperature", and a ° temperature value
- weather_fahrenheit: "Tell me the weather in New York" → expects "weather"
Web Search¶
- search_factual: "Search for Python programming language" → expects "Python"
- search_technology: "Tell me about machine learning" → expects "machine"
Gmail¶
- gmail_inbox: "Check my Gmail inbox" → no keyword requirements (pass if credentials present)
Adding Test Cases¶
Edit the relevant JSON file in benchmarks/ or create a new one following the naming convention <agent>_benchmark.json:
{
  "agent": "weather",
  "cases": [
    {
      "id": "weather_paris",
      "input": "What's the weather in Paris?",
      "expected_keywords": ["weather", "temperature"],
      "expected_patterns": ["\\d+°"],
      "max_latency_ms": 25000
    }
  ]
}
No code changes needed — benchmark.py auto-discovers all benchmarks/*_benchmark.json files.
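Regex patterns are easy to get wrong because backslashes must be doubled inside JSON. The snippet below may help when trying a new pattern before committing it. The sample response is invented for illustration, and matching with `re.search` is an assumption about how benchmark.py applies patterns.

```python
import json
import re

# The weather_paris case as it would appear inside the JSON file
raw = r'''
{
  "id": "weather_paris",
  "input": "What's the weather in Paris?",
  "expected_keywords": ["weather", "temperature"],
  "expected_patterns": ["\\d+°"],
  "max_latency_ms": 25000
}
'''

case = json.loads(raw)
# JSON decoding turns the doubled backslash into a single one: \d+°
pattern = case["expected_patterns"][0]

# Invented sample response, used only to try the pattern out
sample = "The weather in Paris: temperature 18°C with light rain."
match = re.search(pattern, sample)
print(match.group(0) if match else "no match")
```

If the pattern were written with a single backslash in the file, `json.loads` would reject it as an invalid escape, so a check like this also catches escaping mistakes early.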
Reading the Report¶
python3 benchmark.py --report shows the pass rate and average latency per agent across the last 10 runs:
Per-Agent Summary (across last 10 runs):
+------------+----------+-------------+---------------+
| Agent | Passed | Pass Rate | Avg Latency |
+============+==========+=============+===============+
| calculator | 8/9 | 88% | 7200ms |
| weather | 4/4 | 100% | 9100ms |
| web_search | 3/4 | 75% | 21000ms |
+------------+----------+-------------+---------------+
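The same per-agent figures can be recomputed from the stored rows. The sketch below is one possible query against the benchmark_results table described under Database Schema, not the query benchmark.py actually runs. It assumes timestamps are stored in a format that sorts chronologically as text, such as ISO 8601, and the function name `agent_summary` is hypothetical.

```python
import sqlite3

SUMMARY_SQL = """
WITH recent AS (
    SELECT run_id
    FROM benchmark_results
    GROUP BY run_id
    ORDER BY MAX(timestamp) DESC
    LIMIT ?
)
SELECT agent_name,
       SUM(passed)     AS passed,
       COUNT(*)        AS total,
       AVG(latency_ms) AS avg_latency_ms
FROM benchmark_results
WHERE run_id IN (SELECT run_id FROM recent)
GROUP BY agent_name
ORDER BY agent_name
"""


def agent_summary(conn, last_n_runs=10):
    """Rows of (agent, passed, total, pass_rate_percent, avg_latency_ms)."""
    rows = conn.execute(SUMMARY_SQL, (last_n_runs,)).fetchall()
    return [
        (agent, passed, total, 100.0 * passed / total, avg_latency)
        for agent, passed, total, avg_latency in rows
    ]
```

A typical call would be `agent_summary(sqlite3.connect("web_chatbot.db"))`. Lowering `last_n_runs` narrows the window when older runs used a different model or prompt.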
Database Schema¶
Results are stored in web_chatbot.db:
CREATE TABLE benchmark_results (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id      TEXT,     -- UUID grouping all cases in one run
    timestamp   TEXT,
    agent_name  TEXT,
    case_id     TEXT,
    input       TEXT,
    response    TEXT,
    correctness REAL,     -- 0.0 to 1.0
    latency_ms  INTEGER,
    passed      INTEGER,  -- 0 or 1
    model       TEXT      -- Ollama model used
);
Query raw results directly:
sqlite3 web_chatbot.db "SELECT agent_name, case_id, passed, latency_ms FROM benchmark_results ORDER BY timestamp DESC LIMIT 20;"
Web UI Report¶
The web interface includes a Benchmark Reports page that displays stored results directly from web_chatbot.db.
Accessing the Report¶
Navigate to http://localhost:7000/benchmarks after starting the web server. A waveform icon in the chat header also links to this page.
Runs Table¶
Each row represents one benchmark run (a single invocation of python3 benchmark.py):
| Column | Description |
|---|---|
| Timestamp | When the run started |
| Model | Ollama model used |
| Agents | Number of distinct agents tested |
| Pass Rate | Percentage of cases that passed |
| Cases | Passed / total case count |
| Avg Latency | Mean response time across all cases |
Click any row to expand the individual case results for that run.
Case Detail View¶
Expanded rows show all cases grouped by agent:
| Column | Description |
|---|---|
| Case ID | Unique identifier from the benchmark JSON |
| Input | The prompt sent to the chatbot |
| Response | Truncated chatbot response |
| Correctness | Percentage of expected keywords/patterns matched |
| Latency / Status | Response time and PASS/FAIL result |
API Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /benchmarks | GET | Benchmark Reports UI |
| /api/benchmarks/runs | GET | All runs with summary stats (JSON) |
| /api/benchmarks/runs/<run_id> | GET | Case-level results for a specific run (JSON) |
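The JSON endpoints can be read from a script with the standard library alone. In the sketch below, the helper names `runs_url` and `fetch_json` are hypothetical, and the base URL reuses the localhost:7000 address given for the report page. The shape of the returned JSON is not documented here, so the code returns the decoded payload as-is and does not assume any field names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:7000"  # host and port as given on this page


def runs_url(run_id=None, base_url=BASE_URL):
    """Build the URL for the run list, or for one run's case-level results."""
    url = f"{base_url}/api/benchmarks/runs"
    if run_id is not None:
        # Percent-encode the ID so it stays a single path segment
        url += "/" + quote(str(run_id), safe="")
    return url


def fetch_json(url, timeout=10):
    """GET a URL and decode its JSON body; no payload shape is assumed."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the web server running, `fetch_json(runs_url())` would return the run summaries, and `fetch_json(runs_url(some_run_id))` the case-level results for one run.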
Iterative Improvement Workflow¶
- Run python3 benchmark.py to get a baseline
- Modify agent logic, prompts, or keyword detection
- Run python3 benchmark.py again
- Compare with python3 benchmark.py --report (pass rates and latency trend vs. previous runs)
- Add new test cases to cover regressions as you find them
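To spot regressions between the baseline and the follow-up run, the two most recent runs can be compared case by case. This is a hypothetical helper built on the benchmark_results table described under Database Schema. It assumes timestamps sort chronologically as text and is not a feature of benchmark.py.

```python
import sqlite3


def find_regressions(conn):
    """Case IDs that passed in the previous run but failed in the latest one."""
    runs = [
        row[0]
        for row in conn.execute(
            "SELECT run_id FROM benchmark_results "
            "GROUP BY run_id ORDER BY MAX(timestamp) DESC LIMIT 2"
        )
    ]
    if len(runs) < 2:
        return []  # a baseline run and a follow-up run are both needed
    latest, previous = runs

    def passed_by_case(run_id):
        rows = conn.execute(
            "SELECT case_id, passed FROM benchmark_results WHERE run_id = ?",
            (run_id,),
        )
        return {case_id: bool(passed) for case_id, passed in rows}

    before = passed_by_case(previous)
    after = passed_by_case(latest)
    # Cases missing from the latest run are skipped, not counted as regressions
    return sorted(c for c, ok in before.items() if ok and after.get(c) is False)
```

Running `find_regressions(sqlite3.connect("web_chatbot.db"))` after the second benchmark run gives a short list of case IDs to investigate first.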