Paper Notes: DR Tulu

last updated 2026-05-10

DR Tulu is a paper from AI2 where they train an 8B-parameter LLM to do deep research ~as effectively as GPT-5. The core idea is to generate rubrics on the fly with a strong model (GPT-4.1). These rubrics are based on the returned content of the tools across rollouts.

Combine those rubric rewards with auxiliary rewards for citation correctness, format, and use of search, and perform a big RL run.

They collect data in two ways:
1. SFT by getting traces from GPT-5, with synthetic reasoning output
2. Collect a bunch of questions for a specific corpus for the RL run

Auxiliary Rewards

DR Tulu uses 3 auxiliary rewards on top of the evolving rubric: 1. format 2. search 3. citation

Citation Rewards

Extract a set of claims from the answer: \(\mathcal{C} = \{c_1,...,c_{|\mathcal{C}|}\} = ExtractClaims(y)\). Map these claims to the citation store. Measure recall and precision for the extracted claims, and use per-claim \(F_1\) as the reward.

Recall: How many of the extracted claims are backed by their citations. For each claim, use an LLM-judge to score whether the claim is supported by the mapped citations, one of: {Fully, Partially, No}.

Precision: For each claim, use an LLM-judge to score whether the mapped citations are relevant to the claim, one of: {Relevant, Irrelevant}.
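To make the F1 step concrete, here is a minimal sketch of how the judge labels could be turned into a citation reward. Several pieces are my assumptions and not from the paper: the numeric values for {Fully, Partially, No}, treating per-claim precision as the fraction of relevant citations, scoring a claim with no citations as 0, and averaging the per-claim \(F_1\) over claims. `JudgedClaim`, `claim_f1`, and `citation_reward` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# Assumed numeric values for the support labels; the notes only name the labels.
SUPPORT_SCORE = {"Fully": 1.0, "Partially": 0.5, "No": 0.0}


@dataclass
class JudgedClaim:
    support: str          # judge label: "Fully", "Partially", or "No"
    relevance: List[str]  # one "Relevant"/"Irrelevant" label per mapped citation


def claim_f1(claim: JudgedClaim) -> float:
    """Harmonic mean of per-claim precision and recall."""
    recall = SUPPORT_SCORE[claim.support]
    if claim.relevance:
        relevant = sum(label == "Relevant" for label in claim.relevance)
        precision = relevant / len(claim.relevance)
    else:
        precision = 0.0  # assumption: an uncited claim gets zero precision
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def citation_reward(claims: List[JudgedClaim]) -> float:
    """Assumed aggregation: mean per-claim F1 (0.0 if there are no claims)."""
    if not claims:
        return 0.0
    return sum(claim_f1(c) for c in claims) / len(claims)


claims = [
    JudgedClaim("Fully", ["Relevant", "Relevant"]),        # F1 = 1.0
    JudgedClaim("Partially", ["Relevant", "Irrelevant"]),  # F1 = 0.5
    JudgedClaim("No", []),                                 # F1 = 0.0
]
reward = citation_reward(claims)
```

In a real setup the labels would come from the LLM-judge calls described above; here they are hard-coded so the arithmetic is easy to follow.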

One result is really good citation precision and recall even with an 8B model: Cite-P:

Rubric Rewards

Rubric rewards are effectively unit tests that can be used to score a completion. Consider a sample rubric with two rubric items:

[
    "Answer mentions cytokine signaling.",
    "Citations returned by the model must be present in the input."
]

We want to use this rubric to score an answer (\(y\)) to a given question (\(x\)).

\(S(x, y) = \frac{\sum_{k=1}^{K}{w_{x,k} \cdot Judge(r_{x,k},y)}}{\sum_{k:w_{x,k}>0}w_{x,k}}\)

Breaking down the above equation:

Variable	Meaning
\(x\)	The deep-research question, e.g. `How can genetically engineered T cells be used as an anti-inflammatory therapy for IBD?`
\(y\)	The model’s completion
\(S(x, y)\)	\([0,1]\)-normalized score given a question \(x\) and a completion \(y\)
\(K\)	The number of items in the rubric
\(r_{x, k}\)	The rubric item \(k\) for question \(x\) (note that you might have question-specific rubrics). An example would be `Answer mentions cytokine signaling`
\(w_{x,k}\)	The weight of the k-th rubric item. Can be positive or negative. Note that \(S\) is normalized by the sum of the positive weights only, though.
\(Judge(r, y)\)	A reward model that judges (generally binary) whether or not the completion satisfies the rubric item.
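As a sanity check on the formula, here is a small sketch of \(S(x, y)\) in Python. The `rubric_score` function and the lookup-table judge are hypothetical stand-ins (a real judge would be an LLM call), and the weights and the second and third rubric items are made up for illustration.

```python
from typing import Callable, List, Tuple

Rubric = List[Tuple[str, float]]  # (rubric item text, weight)


def rubric_score(answer: str, rubric: Rubric,
                 judge: Callable[[str, str], float]) -> float:
    """Weighted judge scores, normalized by the sum of positive weights."""
    positive_total = sum(w for _, w in rubric if w > 0)
    if positive_total == 0:
        raise ValueError("rubric needs at least one positively weighted item")
    weighted = sum(w * judge(item, answer) for item, w in rubric)
    return weighted / positive_total


# Made-up rubric: two positive items and one negative (penalty) item.
rubric = [
    ("Answer mentions cytokine signaling.", 2.0),
    ("Answer cites at least one source.", 1.0),
    ("Answer contains unsupported speculation.", -1.0),
]

# Toy judge: fixed binary verdicts per rubric item, standing in for an LLM judge.
verdicts = {
    "Answer mentions cytokine signaling.": 1.0,
    "Answer cites at least one source.": 0.0,
    "Answer contains unsupported speculation.": 1.0,
}


def toy_judge(item: str, answer: str) -> float:
    return verdicts[item]


# (2.0 * 1 + 1.0 * 0 - 1.0 * 1) / (2.0 + 1.0) = 1/3
score = rubric_score("some completion", rubric, toy_judge)
```

Because negative weights only appear in the numerator, a completion that triggers penalties without satisfying positive items would score below 0 in this sketch; whether the score is clipped is not something these notes cover.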