AI & ML · Active · 2026–Present

EvalForge

A production-grade, reproducible benchmarking and evaluation platform for LLMs — statistically honest results, every run stored.

  • 7 model providers, one interface
  • 100+ offline-reproducible tests
  • 1 file to add a provider, benchmark, or judge

EvalForge is what you reach for when “the model felt better” isn't good enough. It runs standardized and custom benchmarks across many providers, grades outputs with pluggable judges, persists every run for reproducibility, and reports results with real confidence intervals and significance tests.

01

Design rule: one new file

The whole system is built around one rule: adding a new provider, benchmark, judge, or dataset format takes exactly one new file. Each benchmark self-registers with sample data; each provider implements a single fetch-based interface; persistence sits behind a single RunStore port with in-memory, filesystem, and Postgres implementations.
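As a rough illustration of the one-file rule, the sketch below shows a self-registering benchmark and a storage port with an in-memory implementation. All names here (`Benchmark`, `RunRecord`, `registerBenchmark`, `InMemoryRunStore`) and the method signatures on `RunStore` are assumptions made for the example, not EvalForge's actual API; a port backed by Postgres would most likely be asynchronous, and it is kept synchronous here only for brevity.

```typescript
// Hypothetical shapes; the real EvalForge types may differ.
interface Benchmark {
  name: string;
  samples: { input: string; expected: string }[];
}

interface RunRecord {
  id: string;
  benchmark: string;
  model: string;
  score: number;
}

// Persistence port: callers depend on this interface, never on a backend.
interface RunStore {
  save(run: RunRecord): void;
  get(id: string): RunRecord | undefined;
  list(): RunRecord[];
}

// Registry that each benchmark file adds itself to when it is loaded.
const benchmarkRegistry = new Map<string, Benchmark>();

function registerBenchmark(b: Benchmark): void {
  if (benchmarkRegistry.has(b.name)) {
    throw new Error(`duplicate benchmark: ${b.name}`);
  }
  benchmarkRegistry.set(b.name, b);
}

// In a real layout this call would sit in the benchmark's own file,
// shipping its sample data alongside the registration.
registerBenchmark({
  name: "toy-arithmetic",
  samples: [{ input: "2 + 2", expected: "4" }],
});

// One of several possible RunStore implementations.
class InMemoryRunStore implements RunStore {
  private runs = new Map<string, RunRecord>();

  save(run: RunRecord): void {
    this.runs.set(run.id, { ...run }); // copy so later mutation cannot alter history
  }

  get(id: string): RunRecord | undefined {
    return this.runs.get(id);
  }

  list(): RunRecord[] {
    return Array.from(this.runs.values());
  }
}
```

Because the engine would only ever see `RunStore`, swapping the in-memory store for a filesystem or Postgres one would mean adding a single new class rather than touching callers.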

It runs with zero setup — no API keys, no database, no Docker required. A deterministic mock provider drives the entire pipeline offline, which is how the 100+ tests stay fast and reproducible.
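One way a deterministic mock could work is to derive every output from a hash of the request, so the same model name and prompt always produce the same completion with no network call. The sketch below is an assumption about the approach, not the project's code: `mockComplete`, the `Completion` shape, and the canned answers are hypothetical, and the hash is an FNV-1a variant applied to UTF-16 code units.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (illustrative choice).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

interface Completion {
  text: string;
  latencyMs: number;
}

// Hypothetical mock: identical (model, prompt) pairs always yield
// an identical completion, which keeps offline tests reproducible.
function mockComplete(model: string, prompt: string): Completion {
  const h = fnv1a(`${model}\u0000${prompt}`);
  const canned = ["A", "B", "C", "D"];
  return {
    text: canned[h % canned.length],
    latencyMs: 50 + (h % 200), // simulated value, never measured
  };
}
```

A provider wrapper around `mockComplete` could return the result from an async method so that it matches the same interface the real fetch-based providers implement.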

02

Statistically honest by default

Every headline number ships with a confidence interval. Model comparisons use paired significance tests rather than raw score deltas, and leaderboards are computed with Elo and Bradley–Terry models rather than simple averages. Every run can be reproduced byte-for-byte from its stored configuration. The approach follows the spirit of OpenAI Evals, SWE-bench, and LiveBench.
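To make the two ideas concrete, here is a minimal sketch of a Wilson score interval for an accuracy figure and an exact McNemar test for comparing two models on the same items. The function names and parameters are illustrative assumptions; only the formulas themselves are standard.

```typescript
// Wilson score interval for a binomial proportion (default z = 1.96, about 95%).
function wilsonInterval(
  successes: number,
  trials: number,
  z = 1.96
): [number, number] {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Exact two-sided McNemar test on paired outcomes.
// onlyA: items model A got right and model B got wrong; onlyB: the reverse.
// Items where both models agree carry no information and are ignored.
function mcnemarExact(onlyA: number, onlyB: number): number {
  const n = onlyA + onlyB;
  if (n === 0) return 1; // no discordant pairs, no evidence of a difference
  const k = Math.min(onlyA, onlyB);
  let term = Math.pow(0.5, n); // P(X = 0) under Binomial(n, 0.5); underflows for very large n
  let tail = 0;
  for (let i = 0; i <= k; i++) {
    tail += term;
    term *= (n - i) / (i + 1); // step from C(n, i) to C(n, i + 1)
  }
  return Math.min(1, 2 * tail);
}

// 50 correct out of 100 gives roughly [0.404, 0.596].
const [lo, hi] = wilsonInterval(50, 100);
// 8 vs 2 discordant pairs gives p = 0.109375: not significant at 0.05.
const pValue = mcnemarExact(8, 2);
```

The second result shows why a raw delta can mislead: one model winning 8 of 10 disagreements looks decisive, yet the paired test does not reject the hypothesis that the two models are equally accurate.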

The premise: “best model” means nothing without a workload attached. EvalForge exists to find the most efficient model per task, not a single winner.

03

Highlights

  • 7 model providers behind one fetch-based interface — OpenAI, Anthropic, Google Gemini, OpenRouter, Groq, Ollama, plus a deterministic mock for offline runs; no vendor SDKs
  • Benchmark suites across coding, math, reasoning, knowledge, tool use, and long context — HumanEval, GSM8K, MMLU, ARC, TruthfulQA, function calling, and needle-in-a-haystack retrieval (10k–500k tokens)
  • Pluggable judges: exact / regex / numeric / JSON-schema graders, embedding similarity, LLM-as-judge, and sandboxed code execution
  • Execution engine: parallel execution, retries with backoff, timeouts, content-addressed caching, and resume-after-interrupt
  • Statistical rigor: bootstrap and Wilson confidence intervals, paired permutation / McNemar / Welch tests, effect sizes, and Elo / Bradley–Terry leaderboards
  • REST API (Fastify) and a Next.js dashboard with radar, score-history, latency, cost, and heatmap views; reports in Markdown, HTML, JSON, and CSV
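For the Elo leaderboards mentioned in the highlights, the standard update rule can be sketched as follows. The `Ratings` type, the K-factor of 32, and the starting rating of 1000 are assumptions chosen for the example rather than EvalForge's settings.

```typescript
type Ratings = Map<string, number>;

// Probability that a player rated ra beats a player rated rb.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Apply one pairwise result. outcome is 1 if a wins, 0 if b wins, 0.5 for a tie.
function updateElo(
  ratings: Ratings,
  a: string,
  b: string,
  outcome: number,
  k = 32,
  initial = 1000
): void {
  const ra = ratings.get(a) ?? initial;
  const rb = ratings.get(b) ?? initial;
  const ea = expectedScore(ra, rb);
  ratings.set(a, ra + k * (outcome - ea));
  ratings.set(b, rb + k * ((1 - outcome) - (1 - ea))); // mirror update, total rating is conserved
}

const ratings: Ratings = new Map();
updateElo(ratings, "model-a", "model-b", 1); // model-a wins one comparison
// model-a is now 1016 and model-b is 984.
```

Elo ratings depend on the order in which comparisons are processed, which is one common reason to pair them with an order-independent model such as Bradley–Terry.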