How MARS Works
We wish to measure LLM retrieval performance on mathematical statements with a simple and reproducible protocol:
- Sample & Extract: We randomly sample mathematics papers from arXiv across different time periods and extract multiple theorems from each one.
- Generate Queries: For each theorem, we generate synthetic queries across distinct categories to simulate a range of real-world use cases. These queries span from precise_assertion and imperfect_recall, which test the model's ability to find a result from a nearly exact or slightly flawed memory, to broader inquiries like exploratory_search and conceptual_search, which focus on the underlying topic or idea.
- Retrieve the Source: For a given synthetic query, we ask an LLM to identify the original paper it came from, without any other context.
We then check whether the model correctly identifies the source paper. This process provides a clear benchmark of LLMs' ability to navigate mathematical knowledge across different eras.
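The scoring step described above can be sketched as a simple top-k membership check. The title-normalization rule below is an illustrative assumption on our part, not necessarily the exact matching logic used in the experiments:

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial
    formatting differences do not count as mismatches (an assumed
    normalization, not the benchmark's exact rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def is_top_k_hit(gold_title: str, candidate_titles: list[str], k: int = 5) -> bool:
    """Return True if the source paper's title appears among the
    first k candidate titles proposed by the model."""
    gold = normalize_title(gold_title)
    return any(normalize_title(t) == gold for t in candidate_titles[:k])
```

With this helper, each trial reduces to calling `is_top_k_hit` on the gold title and the model's ranked candidate list, then averaging hits over all queries.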
Our initial findings highlight the difficulty of this task for plain LLMs without internet access. In over 100 trials using GPT-5 (without web search), the model did not once identify the correct source paper within its top five suggestions.
Preliminary Results
(Interactive example: an original theorem, its generated synthetic queries, and the candidate references returned by the model.)
Prompts Used in the Experiments
Synthetic Query Generation Prompt
System: You are a research mathematician simulator. Your task is to generate
plausible search queries that a real mathematician would use when exploring a concept or
looking for a specific type of result. Be realistic. Output must be a JSON object with a single
key "queries" containing a list of strings.
User: Imagine you are a research mathematician. The Artifact Text below represents a mathematical theorem you have in mind. You do not know that this has been published, and you have no knowledge of the paper it comes from. Your goal is to generate search queries to discover if a result like this exists in the literature.
Styles
- precise_assertion: Formulate the central claim of the theorem as a precise
statement or question so it would likely find this exact result.
- Frame as a search for a known result (e.g., Korovkin's theorem for sublinear operators).
- Include LaTeX where it adds precision, e.g., $\\mathbb{R}^N$, $\\mathcal{F}(X)$.
- imperfect_recall: Simulate realistic, slightly flawed memory. Keep it
technically precise but introduce minor inaccuracies.
- Omit a secondary condition or hypothesis.
- Change variable names (e.g., use K instead of X).
- Alter a technical term slightly (e.g., monotone positive operators vs. monotone and sublinear operators).
- conceptual_search: Ask about the relationship/implication in less formal terms; focus on the meaning (e.g., when does pointwise convergence imply uniform convergence?).
- exploratory_search: Use broader, higher-level queries about the general topic or field (e.g., nonlinear approximation theory, positive linear operators).
Common Rules
- Never use names, dates, or direct quotes from the paper's title or abstract.
- Never include reference labels like \\label{...}.
- Avoid instructional phrasing like “find a paper on”.
- Use LaTeX when it adds precision.
Output: JSON with a single key queries containing a list of
strings.
Generate exactly k queries.
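The prompt's output contract (a JSON object with a single `queries` key holding exactly k strings) can be enforced when parsing the model's response. This is a minimal sketch; the function name and error handling are ours, not part of the benchmark code:

```python
import json

def parse_queries(raw_response: str, k: int) -> list[str]:
    """Parse the query-generation response and enforce the contract:
    a JSON object with a single "queries" key holding exactly k strings."""
    data = json.loads(raw_response)
    queries = data.get("queries")
    if not (isinstance(queries, list)
            and len(queries) == k
            and all(isinstance(q, str) for q in queries)):
        raise ValueError("response violates the query-generation contract")
    return queries
```

Rejecting malformed responses at this stage keeps downstream retrieval trials comparable across models and query styles.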
Closed-Book Retrieval Prompt
System: You are an expert research assistant with deep knowledge of academic literature (arXiv and major mathematical journals). Your task is to recall and cite the most relevant research papers for a given query. Respond strictly in the JSON format provided.
User: Based on your knowledge, what are the top k research papers a researcher is most likely looking for, given the query?
Rules
- Provide a ranked list in JSON; the first entry is your top guess.
- Each title must be the exact, full title of the publication.
- Prioritize primary research articles; suggest surveys/books only for very broad queries.
Expected JSON
{
"candidates": [
{
"reference": { "title": "The full and exact title of the #1 paper" },
"confidence": 0.9,
"reasoning": "A brief justification for why this is the best match."
},
{
"reference": { "title": "The full and exact title of the #2 paper" },
"confidence": 0.8,
"reasoning": "A brief justification for why this is a plausible alternative."
}
]
}
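A response in this format can be turned into a ranked title list for scoring. This is a sketch under the schema above; the handling of a missing `candidates` key is our assumption:

```python
import json

def ranked_titles(raw_response: str) -> list[str]:
    """Extract candidate paper titles in rank order from the
    closed-book retrieval response (first entry = top guess)."""
    data = json.loads(raw_response)
    return [c["reference"]["title"] for c in data.get("candidates", [])]
```

The resulting list feeds directly into the top-5 check described in the protocol: a trial counts as a success if the gold title matches any of the first five entries.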