Agent Benchmarking¶

The benchmarking system measures agent correctness, response quality, and latency over time — enabling iterative improvement.

Quick Start¶

# Run all agents
python3 benchmark.py

# Run one agent
python3 benchmark.py --agent calculator

# Show historical results
python3 benchmark.py --report

# Use a different model
python3 benchmark.py --model llama3.2

How It Works¶

Each benchmark run:

Loads test cases from benchmarks/*_benchmark.json
Sends each input through the full chatbot path (keyword detection → agent dispatch → LLM response)
Scores the response on correctness and latency
Stores results in web_chatbot.db

Scoring¶

Metric	Description
Correctness	0.0–1.0 ratio of expected keywords/patterns found in response
Latency	End-to-end response time in milliseconds
Passed	`correctness == 1.0` AND `latency_ms <= max_latency_ms`

Test Case Format¶

Each agent has a JSON file in benchmarks/:

{
  "agent": "calculator",
  "cases": [
    {
      "id": "calc_addition",
      "input": "What is 12 + 45?",
      "expected_keywords": ["57"],
      "expected_patterns": [],
      "max_latency_ms": 30000
    }
  ]
}

Field	Description
`id`	Unique case identifier
`input`	Natural language input sent to the chatbot
`expected_keywords`	Strings that must appear in the response (case-insensitive)
`expected_patterns`	Regex patterns that must match in the response
`max_latency_ms`	Maximum acceptable response time

If expected_keywords and expected_patterns are both empty, the case scores 1.0 automatically (useful for gmail when credentials may be absent).

Included Test Cases¶

Calculator¶

calc_addition — "What is 12 + 45?" → expects "57"
calc_multiplication — "Calculate 7 * 8" → expects "56"
calc_division — "What is 100 divided by 4?" → expects "25"

Time¶

time_current — "What time is it?" → expects HH:MM pattern
time_date — "What is today's date?" → expects 4-digit year

Weather¶

weather_city — "What is the weather in London?" → expects "weather", "temperature", and a ° temperature value
weather_fahrenheit — "Tell me the weather in New York" → expects "weather"

Web Search¶

search_factual — "Search for Python programming language" → expects "Python"
search_technology — "Tell me about machine learning" → expects "machine"

Gmail¶

gmail_inbox — "Check my Gmail inbox" → no keyword requirements (pass if credentials present)

Adding Test Cases¶

Edit the relevant JSON file in benchmarks/ or create a new one following the naming convention <agent>_benchmark.json:

{
  "agent": "weather",
  "cases": [
    {
      "id": "weather_paris",
      "input": "What's the weather in Paris?",
      "expected_keywords": ["weather", "temperature"],
      "expected_patterns": ["\\d+°"],
      "max_latency_ms": 25000
    }
  ]
}

No code changes needed — benchmark.py auto-discovers all benchmarks/*_benchmark.json files.

Reading the Report¶

python3 benchmark.py --report

Shows pass rate and average latency per agent across the last 10 runs:

Per-Agent Summary (across last 10 runs):
+------------+----------+-------------+---------------+
| Agent      | Passed   | Pass Rate   | Avg Latency   |
+============+==========+=============+===============+
| calculator | 8/9      | 88%         | 7200ms        |
| weather    | 4/4      | 100%        | 9100ms        |
| web_search | 3/4      | 75%         | 21000ms       |
+------------+----------+-------------+---------------+

Database Schema¶

Results are stored in web_chatbot.db:

CREATE TABLE benchmark_results (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id      TEXT,       -- UUID grouping all cases in one run
    timestamp   TEXT,
    agent_name  TEXT,
    case_id     TEXT,
    input       TEXT,
    response    TEXT,
    correctness REAL,       -- 0.0 to 1.0
    latency_ms  INTEGER,
    passed      INTEGER,    -- 0 or 1
    model       TEXT        -- Ollama model used
);

Query raw results directly:

sqlite3 web_chatbot.db "SELECT agent_name, case_id, passed, latency_ms FROM benchmark_results ORDER BY timestamp DESC LIMIT 20;"

Web UI Report¶

The web interface includes a Benchmark Reports page that displays stored results directly from web_chatbot.db.

Accessing the Report¶

Navigate to http://localhost:7000/benchmarks after starting the web server. A waveform icon in the chat header also links to this page.

Runs Table¶

Each row represents one benchmark run (a single invocation of python3 benchmark.py):

Column	Description
Timestamp	When the run started
Model	Ollama model used
Agents	Number of distinct agents tested
Pass Rate	Percentage of cases that passed
Cases	Passed / total case count
Avg Latency	Mean response time across all cases

Click any row to expand the individual case results for that run.

Case Detail View¶

Expanded rows show all cases grouped by agent:

Column	Description
Case ID	Unique identifier from the benchmark JSON
Input	The prompt sent to the chatbot
Response	Truncated chatbot response
Correctness	Percentage of expected keywords/patterns matched
Latency / Status	Response time and PASS/FAIL result

API Endpoints¶

Endpoint	Method	Description
`/benchmarks`	GET	Benchmark Reports UI
`/api/benchmarks/runs`	GET	All runs with summary stats (JSON)
`/api/benchmarks/runs/<run_id>`	GET	Case-level results for a specific run (JSON)

Iterative Improvement Workflow¶

Run python3 benchmark.py to get a baseline
Modify agent logic, prompts, or keyword detection
Run python3 benchmark.py again
Compare with python3 benchmark.py --report — pass rates and latency trend vs previous runs
Add new test cases to cover regressions as you find them