Prompt-template hacking is the practice of tuning the exact template in which a benchmark is presented to a model in order to maximize the reported score. Because language models are sensitive to surface form — a ten-word change in instruction can shift scores by 10–20 percentage points — and because benchmark authors typically do not specify an exact template, labs have wide latitude in how they run the evaluation they report. Over time this has produced a pervasive gap between "the score we can get with our best template" and "the score the model would get under a fair, author-specified template." Where the two numbers are allowed to diverge, benchmark leaderboards become a measurement of prompt-engineering skill as much as of model capability.
The most studied instances are multiple-choice benchmarks. A model presented with a four-option question can be scored in at least three different ways: by the option whose letter (A, B, C, D) is most probable after the question, by the option whose text is most probable given the question, or by the option the model produces when asked to "first reason and then answer." These three scoring methods can differ by ten or more points on the same benchmark for the same model. None of them is wrong; each is a legitimate reading of "how well does the model do on this benchmark." But a lab that wants a high number chooses the reading that gives it.
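The three readings can be sketched concretely. This is an illustrative toy, not a real evaluation harness: the question, options, and log-probabilities below are invented stand-ins for what a model API would return, chosen so that letter-based and text-based scoring disagree on the same item.

```python
import re

QUESTION = "What is the capital of France?"
OPTIONS = {"A": "Lyon", "B": "Paris", "C": "Nice", "D": "Lille"}

# Invented log-probabilities standing in for real model API calls.
# Note the letter scores and the full-text scores favor different options.
FAKE_LOGPROB = {
    ("letter", "A"): -2.1, ("letter", "B"): -0.3,
    ("letter", "C"): -2.5, ("letter", "D"): -2.8,
    ("text", "A"): -4.0, ("text", "B"): -3.9,
    ("text", "C"): -3.7, ("text", "D"): -4.2,
}

def score_by_letter() -> str:
    """Reading 1: pick the option whose letter is most probable after the question."""
    return max(OPTIONS, key=lambda k: FAKE_LOGPROB[("letter", k)])

def score_by_text() -> str:
    """Reading 2: pick the option whose full text is most probable given the question."""
    return max(OPTIONS, key=lambda k: FAKE_LOGPROB[("text", k)])

def score_by_generation(model_output: str):
    """Reading 3: parse the last standalone A-D letter out of a free-form response."""
    m = re.search(r"\b([ABCD])\b(?!.*\b[ABCD]\b)", model_output, re.S)
    return m.group(1) if m else None
```

With these stub numbers, letter scoring picks B while text scoring picks C, and generation scoring depends entirely on what the model happens to write; a harness must commit to one reading before any score is meaningful.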
Chain-of-thought prompting is a particularly potent lever. A model asked "what is the answer" on a math benchmark scores much lower than the same model asked to "think step by step, then give the answer." The difference can exceed twenty points. On GSM8K, on MATH, on parts of MMLU, on BIG-bench Hard, chain-of-thought dramatically changes the reported score. A lab that runs the benchmark without chain-of-thought and reports the lower number is comparing its model unfairly to labs that use it; a lab that runs every benchmark with extensive chain-of-thought and reports the higher number makes the comparison unfair in the other direction.
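The two styles differ only in the template and in how the response is graded. A minimal sketch, with the template wording and the `Answer:` marker as illustrative choices rather than any benchmark's official format:

```python
def direct_prompt(question: str) -> str:
    """Style 1: ask for the answer outright."""
    return f"{question}\nAnswer:"

def cot_prompt(question: str) -> str:
    """Style 2: elicit reasoning before the answer."""
    return (f"{question}\nLet's think step by step, "
            "then give the final answer after the marker 'Answer:'.")

def extract_final_answer(response: str) -> str:
    """Grade on whatever follows the last 'Answer:' marker,
    ignoring the reasoning that precedes it."""
    parts = response.rsplit("Answer:", 1)
    return parts[1].strip() if len(parts) == 2 else response.strip()
```

The grading side matters as much as the prompting side: a chain-of-thought run needs an extraction step like `extract_final_answer`, and the choice of marker and parsing rule is itself part of the template.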
Instruction phrasing is another axis. "Answer only with a single letter" versus "explain your reasoning and then answer" produces different distributions of correctness, as does "You are a helpful assistant" versus "You are an expert solver of multiple-choice questions." Few-shot prompting — providing the model with several worked examples before the target question — can add or subtract several points. Whether the examples are drawn from the same benchmark or a different one matters; whether they are chosen randomly or curated for similarity to the test question matters more.
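The random-versus-curated distinction can be made concrete. The sketch below is a toy: the lexical-overlap heuristic is one deliberately crude stand-in for the similarity measures a lab might actually use, and the `Q:`/`A:` layout is an arbitrary template choice.

```python
import random

def random_shots(examples, k, seed=0):
    """Random selection: the baseline few-shot policy."""
    return random.Random(seed).sample(examples, k)

def curated_shots(examples, target_question, k):
    """Curated selection by crude word overlap with the target
    question -- a stronger, and more hackable, policy."""
    target_words = set(target_question.lower().split())
    def overlap(ex):
        return len(target_words & set(ex[0].lower().split()))
    return sorted(examples, key=overlap, reverse=True)[:k]

def build_prompt(shots, target_question):
    """Assemble the final few-shot prompt from (question, answer) pairs."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {target_question}\nA:")
    return "\n\n".join(blocks)
```

Both policies produce a syntactically identical prompt, which is why the selection rule rarely appears in a reported result even though it moves the score.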
The response from evaluation-methodology researchers has been to publish fixed prompt templates with benchmarks (MMLU-Pro specifies its template), to run evaluations under multiple templates and report the distribution rather than a single number, and to make public the exact grading script used. These are improvements; none fully solves the problem. A determined lab can still run a benchmark a hundred times under a hundred minor variations, pick the best, and report it; the overhead is small compared to the commercial value of a leaderboard position.
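The distribution-reporting remedy is simple to implement. A minimal sketch, where `score_fn` stands in for a full benchmark run under one template and the scores in the usage example are invented:

```python
import statistics

def evaluate_under_templates(score_fn, templates):
    """Run the same benchmark under every template and report the
    full distribution, not just the best number."""
    scores = {name: score_fn(tmpl) for name, tmpl in templates.items()}
    vals = list(scores.values())
    return {
        "per_template": scores,
        "mean": statistics.mean(vals),
        "min": min(vals),
        "max": max(vals),
        "spread": max(vals) - min(vals),
    }
```

A report built this way exposes the quantity that template hacking exploits: if the spread is ten points, a single reported number is meaningless without the template that produced it.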
The sensitivity of language models to prompt phrasing was documented during the GPT-3 era (Brown et al. 2020; Reynolds and McDonell 2021). The realization that this sensitivity meant benchmark scores could be inflated by prompt choice was widely discussed in 2022–2023 as chain-of-thought prompting spread. Sclar et al.'s Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design (2023) measured the effect across dozens of benchmarks and models, finding variances large enough to reorder leaderboards.
Surface form is a first-order variable. Small changes in phrasing move scores materially; benchmark results without specified templates are not reproducible.
Chain-of-thought shifts distributions. It is not free: it increases latency and token cost, but reported numbers almost always include it because the uplift is large.
Leaderboards depend on template discipline. Without fixed templates and public grading code, leaderboard ordering partly reflects prompt-engineering effort.
The honest report includes a template. A responsible benchmark result says what exact prompt was used, with what sampling parameters, on what date.
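One way to operationalize this is to attach a machine-readable record to every reported number. The field names below are illustrative, not an established schema; the template hash is one possible way to make "exact prompt" verifiable.

```python
import hashlib
import json

def eval_manifest(benchmark, model, prompt_template, sampling, run_date):
    """Bundle what a reproducible benchmark report needs: the exact
    template (plus a hash of it), the sampling parameters, and the date."""
    return json.dumps({
        "benchmark": benchmark,
        "model": model,
        "prompt_template": prompt_template,
        "template_sha256": hashlib.sha256(prompt_template.encode()).hexdigest(),
        "sampling": sampling,  # e.g. {"temperature": 0.0, "max_tokens": 512}
        "run_date": run_date,
    }, indent=2, sort_keys=True)
```

Publishing such a record alongside the score turns "trust our number" into "rerun our template," which is the discipline the rest of this section argues for.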