The goal of this repository is to work collaboratively on MathComp documentation using the LLM4Docq pipeline. More precisely, LLM4Docq relies on leveraging external LLMs, and to maximize performance, it requires careful prompting.
To avoid overflowing the LLM’s context and to improve results, we chunk the source code into contiguous pieces and ask the LLM to annotate them, given a context header that helps it understand each piece of code. Here, a context header is not part of the MathComp source code, it is an external description used in the prompting process to give the LLM the necessary background for understanding the file.
In short, the LLM takes as input some general instructions (see the prompt_template), a context header, and (in the next version thanks to this repository) some carefully handcrafted examples of well-chosen docstrings.
The main task is to audit these context headers, which were first generated synthetically using state-of-the-art reasoning models, and to add precise and context-aware examples to the prompt. More simply, we consider three tasks for each source file:
- Audit an header
- Select 5 to 10 elements for annotation
- Manually add docstrings to previously selected elements
We are working with this fork of MathComp.
To balance the work, please indicate on the following link what you are working on, be careful not to duplicate someone else's work
Headers to be amended are located in the headers folder of the repository.
Contributors should edit these files directly to correct, clarify, or enrich the generated context headers.
The goal is to make them more accurate, coherent, and informative for the LLM.
The guiding question is:
"Given this header, could I understand almost any chunk (~100 lines) of the associated source file without more information?"
For each source file, identify 5 to 10 relevant elements (functions, lemmas, definitions, etc.) that are representative of the challenges presented in the current file. They should be as diverse as possible to cover different kinds of constructs and difficulties.
Selections and annotations are stored in the examples/ folder (placeholder templates are already provided there), following the JSON format described below.
Each example should follow this structure:
[
{
"name": "element_name",
"line": 42,
"docstring": "A short, clear explanation of what this element does."
}
]Try to follow these guidelines:
- Use natural language only. If possible, do not use any mathematical symbols (like <, ≤, =, etc.).
- Write complete sentences. Each docstring must be at least one clear, grammatically correct sentence.
- Be explicit. Spell out operations and relationships in plain English (e.g., "less than" instead of "<").
- Be self-contained. The reader should understand the meaning of the declaration without seeing the code.
- Be embedding-friendly. Do not use pronouns or external references like "this lemma" or "the following." Explicitly mention inputs and outputs.
- Use high-level terms like natural number, matrix, or polynomial.
- Be consistent. If the element is a lemma, start the docstring with "This lemma states ..."; apply the same rule to all other kinds of elements.
- Avoid referring to variable names when possible. For example, instead of writing "Defines the minimal polynomial of a matrix A as the monic polynomial of least degree that evaluates to the zero matrix when A is substituted for the variable,..." prefer "Defines the minimal polynomial of a matrix as the monic polynomial of least degree that evaluates to the zero matrix when substituted for the variable."