Paper Notes: DR Tulu
last updated 2026-05-10
DR Tulu is a paper from AI2 in which they train an 8B-parameter LLM to do deep research roughly as effectively as GPT-5. The core idea is to generate rubrics on the fly with a strong model (GPT-4.1), grounding them in the content the tools return across rollouts.
They then combine those rubric rewards with auxiliary rewards for citation correctness, format, and use of search, and perform a large RL run.
They collect data in two ways:
1. For SFT, collect traces from GPT-5, with synthetic reasoning output.
2. For the RL run, collect a set of questions for a specific corpus.
Auxiliary Rewards
DR Tulu uses three auxiliary rewards on top of the evolving rubric reward: format, search, and citation.
Citation Rewards
Extract a set of claims from the answer: \(\mathcal{C} = \{c_1,...,c_{|\mathcal{C}|}\} = ExtractClaims(y)\). Map these claims to the citation store. Measure recall and precision for the extracted claims, and use per-claim \(F_1\) as the reward.
Recall
Recall measures how many of the extracted claims are backed by their citations. For each claim, use an LLM judge to score whether the claim is supported by the mapped citations, with one of three labels: {Fully, Partially, No}.
Precision
For each claim, use an LLM-judge to score if the mapped citations are relevant to the claim, either: {Relevant, Irrelevant}.
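To make the recall/precision/\(F_1\) combination concrete, here is a minimal sketch. It assumes the judge labels are already available for each claim. The `ClaimJudgment` type, the `citation_f1` function, and the 0.5 credit for "Partially" are all illustrative assumptions of these notes, not details taken from the paper.

```python
from dataclasses import dataclass

# ASSUMPTION: credit given to each support label. The 0.5 for "Partially"
# is an illustrative choice, not a value from the paper.
SUPPORT_CREDIT = {"Fully": 1.0, "Partially": 0.5, "No": 0.0}


@dataclass
class ClaimJudgment:
    """Hypothetical container for the two judge labels of one claim."""
    support: str    # recall label: "Fully", "Partially", or "No"
    relevance: str  # precision label: "Relevant" or "Irrelevant"


def citation_f1(judgments: list[ClaimJudgment]) -> float:
    """Combine judge labels into a single F1-style citation reward."""
    if not judgments:
        return 0.0
    n = len(judgments)
    # Recall: average support credit over all extracted claims.
    recall = sum(SUPPORT_CREDIT[j.support] for j in judgments) / n
    # Precision: fraction of claims whose mapped citations were judged relevant.
    precision = sum(j.relevance == "Relevant" for j in judgments) / n
    if precision + recall == 0.0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


judgments = [
    ClaimJudgment("Fully", "Relevant"),
    ClaimJudgment("Partially", "Relevant"),
    ClaimJudgment("No", "Irrelevant"),
    ClaimJudgment("Fully", "Relevant"),
]
reward = citation_f1(judgments)
```

With these four toy judgments, recall is 0.625 and precision is 0.75, so the reward is their harmonic mean (about 0.68). How the paper aggregates the labels may differ; the point is only that a weak score on either side drags the reward down.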
One result is really good citation precision (Cite-P) and recall, even with an 8B model.
Rubric Rewards
Rubric rewards are effectively unit tests that can be used to score a completion. Consider a sample rubric with two rubric items:
[
"Answer mentions cytokine signaling.",
"Citations returned by the model must be present in the input."
]
We want to use this rubric to score an answer (\(y\)) to a given question (\(x\)).
\(S(x, y) = \frac{\sum_{k=1}^{K}{w_{x,k} \cdot Judge(r_{x,k},y)}}{\sum_{k:w_{x,k}>0}w_{x,k}}\)
Breaking down the above equation:
| Variable | Meaning |
|---|---|
| \(x\) | The deep-research question, e.g. How can genetically engineered T cells be used as an anti-inflammatory therapy for IBD? |
| \(y\) | The model’s completion |
| \(S(x, y)\) | \([0,1]\)-normalized score given a question \(x\) and a completion \(y\) |
| \(K\) | The number of items in the rubric |
| \(r_{x, k}\) | The rubric item \(k\) for question \(x\) (note that you might have question-specific rubrics). An example would be "Answer mentions cytokine signaling." |
| \(w_{x,k}\) | The weight of the \(k\)-th rubric item. Can be positive or negative. Note that the denominator normalizes \(S\) using only the positive weights, though. |
| \(Judge(r, y)\) | A reward model that judges (generally binary) whether or not the completion satisfies the rubric item. |
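
The scoring equation above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `rubric_score` and `toy_judge` are hypothetical names, the weights are made up, the third rubric item is invented to show a negative weight, and the judge is a lookup table standing in for an LLM call.

```python
from typing import Callable


def rubric_score(
    items: list[str],
    weights: list[float],
    answer: str,
    judge: Callable[[str, str], float],
) -> float:
    """Weighted rubric score, normalized by the sum of positive weights."""
    if len(items) != len(weights):
        raise ValueError("each rubric item needs exactly one weight")
    # Denominator: only positive weights count toward normalization.
    positive_total = sum(w for w in weights if w > 0)
    if positive_total == 0:
        raise ValueError("rubric needs at least one positive weight")
    # Numerator: every item contributes, including negatively weighted ones.
    weighted = sum(w * judge(item, answer) for item, w in zip(items, weights))
    return weighted / positive_total


items = [
    "Answer mentions cytokine signaling.",
    "Citations returned by the model must be present in the input.",
    "Answer contains unsupported medical advice.",  # invented penalty item
]
weights = [2.0, 1.0, -1.0]  # illustrative weights, not from the paper

# Stand-in for an LLM judge: fixed binary verdicts per rubric item.
verdicts = {items[0]: 1.0, items[1]: 0.0, items[2]: 1.0}


def toy_judge(item: str, answer: str) -> float:
    return verdicts[item]


score = rubric_score(items, weights, "some completion", toy_judge)
```

Here the numerator is \(2 \cdot 1 + 1 \cdot 0 - 1 \cdot 1 = 1\) and the denominator is \(2 + 1 = 3\), so the score is \(1/3\). Because negative weights appear only in the numerator, a completion that triggers enough penalty items can score below zero.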