How MARS Works
We wish to measure LLM retrieval performance on mathematical statements with a simple and reproducible protocol:
- Sample & Extract: We randomly sample mathematics papers from arXiv across different time periods and extract multiple theorems from each one.
- Generate Queries: For each theorem, we generate synthetic queries across distinct categories to simulate a range of real-world use cases. These queries span from precise_assertion and imperfect_recall, which test the model's ability to find a result from a nearly exact or slightly flawed memory, to broader inquiries like exploratory_search and conceptual_search, which focus on the underlying topic or idea.
- Retrieve the Source: For a given synthetic query, we ask an LLM to identify the original paper it came from, without any other context.
We then check whether the model correctly identifies the source paper. This process provides a clear benchmark of LLMs' ability to navigate mathematical knowledge across different eras.
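The scoring step described above can be sketched as a simple top-k membership check. The title-normalization rule below is an illustrative assumption on our part, not necessarily the exact matching logic used in the experiments:

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial
    formatting differences do not count as mismatches (an assumed
    normalization, not the benchmark's exact rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def is_top_k_hit(gold_title: str, candidate_titles: list[str], k: int = 5) -> bool:
    """Return True if the source paper's title appears among the
    first k candidate titles proposed by the model."""
    gold = normalize_title(gold_title)
    return any(normalize_title(t) == gold for t in candidate_titles[:k])
```

With this helper, each trial reduces to calling `is_top_k_hit` on the gold title and the model's ranked candidate list, then averaging hits over all queries.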
Our initial findings highlight the difficulty of this task for plain LLMs without internet access. In over 100 trials using GPT-5 (without web search), the model did not once identify the correct source paper within its top five suggestions.
Preliminary Results
(Interactive example: an original theorem, its generated synthetic queries, and the candidate references returned by the model.)
Prompts Used in the Experiments
Synthetic Query Generation Prompt
System: You are a research mathematician simulator. Your task is to generate
plausible search queries that a real mathematician would use when exploring a concept or
looking for a specific type of result. Be realistic. Output must be a JSON object with a single
key "queries" containing a list of strings.
User: Imagine you are a research mathematician. The Artifact Text below represents a mathematical theorem you have in mind. You do not know that this has been published, and you have no knowledge of the paper it comes from. Your goal is to generate search queries to discover if a result like this exists in the literature.
Styles
- precise_assertion: Formulate the central claim of the theorem as a precise
statement or question so it would likely find this exact result.
- Frame as a search for a known result (e.g., Korovkin's theorem for sublinear operators).
- Include LaTeX where it adds precision, e.g., $\\mathbb{R}^N$, $\\mathcal{F}(X)$.
- imperfect_recall: Simulate realistic, slightly flawed memory. Keep it
technically precise but introduce minor inaccuracies.
- Omit a secondary condition or hypothesis.
- Change variable names (e.g., use K instead of X).
- Alter a technical term slightly (e.g., monotone positive operators vs. monotone and sublinear operators).
- conceptual_search: Ask about the relationship/implication in less formal terms; focus on the meaning (e.g., when does pointwise convergence imply uniform convergence?).
- exploratory_search: Use broader, higher-level queries about the general topic or field (e.g., nonlinear approximation theory, positive linear operators).
Common Rules
- Never use names, dates, or direct quotes from the paper's title or abstract.
- Never include reference labels like \\label{...}.
- Avoid instructional phrasing like “find a paper on”.
- Use LaTeX when it adds precision.
Output: JSON with a single key queries containing a list of
strings.
Generate exactly k queries.
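The prompt's output contract (a JSON object with a single `queries` key holding exactly k strings) can be enforced when parsing the model's response. This is a minimal sketch; the function name and error handling are ours, not part of the benchmark code:

```python
import json

def parse_queries(raw_response: str, k: int) -> list[str]:
    """Parse the query-generation response and enforce the contract:
    a JSON object with a single "queries" key holding exactly k strings."""
    data = json.loads(raw_response)
    queries = data.get("queries")
    if not (isinstance(queries, list)
            and len(queries) == k
            and all(isinstance(q, str) for q in queries)):
        raise ValueError("response violates the query-generation contract")
    return queries
```

Rejecting malformed responses at this stage keeps downstream retrieval trials comparable across models and query styles.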
Closed-Book Retrieval Prompt
System: You are an expert research assistant with deep knowledge of academic literature (arXiv and major mathematical journals). Your task is to recall and cite the most relevant research papers for a given query. Respond strictly in the JSON format provided.
User: Based on your knowledge, what are the top k research papers a researcher is most likely looking for, given the query?
Rules
- Provide a ranked list in JSON; the first entry is your top guess.
- Each title must be the exact, full title of the publication.
- Prioritize primary research articles; suggest surveys/books only for very broad queries.
Expected JSON
{
"candidates": [
{
"reference": { "title": "The full and exact title of the #1 paper" },
"confidence": 0.9,
"reasoning": "A brief justification for why this is the best match."
},
{
"reference": { "title": "The full and exact title of the #2 paper" },
"confidence": 0.8,
"reasoning": "A brief justification for why this is a plausible alternative."
}
]
}
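A response in this format can be turned into a ranked title list for scoring. This is a sketch under the schema above; the handling of a missing `candidates` key is our assumption:

```python
import json

def ranked_titles(raw_response: str) -> list[str]:
    """Extract candidate paper titles in rank order from the
    closed-book retrieval response (first entry = top guess)."""
    data = json.loads(raw_response)
    return [c["reference"]["title"] for c in data.get("candidates", [])]
```

The resulting list feeds directly into the top-5 check described in the protocol: a trial counts as a success if the gold title matches any of the first five entries.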