A data-generation plan for the long-context extension stage — what to source, how to synthesize it, and two candidate generation paths to bake off.
We are extending a model pretrained at a 4k context window out to ~120k. This is a continued-pretraining / mid-training problem, not an instruction-tuning one. The good news from the literature is that context extension is cheap on data — on the order of 10–50B tokens is enough — provided the data actually contains long-range dependencies and the original short-context mixture is replayed to avoid regression Fu 2024 · OLMo 3.
Context extension means training on longer sequences while rescaling the positional embeddings so the model generalizes to positions it never saw in pretraining. Two principles decide whether it works:
A 7B model at a 120k sequence length cannot fit activations on one GPU. You need context / sequence parallelism (OLMo used 8-way CP, 8k tokens per device, with all-gather attention to support irregular masks) OLMo 3 §3.6.4. Confirm your training stack supports this before building data — it dictates the shard format. Note also: your 4k→120k jump (~30×) is more aggressive than OLMo's 8k→65k (8×), which is why we stage it.
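A rough sizing sketch for the parallelism point above. It assumes you hold tokens per device at the 8,192 implied by OLMo's reported setup (65,536 tokens over 8-way CP) and reads "~120k" as 120 × 1024 tokens; both are assumptions, and real memory limits depend on your model, GPUs, and activation checkpointing.

```python
import math

def context_parallel_degree(seq_len: int, tokens_per_device: int) -> int:
    """Smallest number of devices such that each holds at most tokens_per_device tokens."""
    return math.ceil(seq_len / tokens_per_device)

# OLMo 3's reported setup: 65,536 tokens at 8-way CP, i.e. 8,192 tokens per device.
TOKENS_PER_DEVICE = 65_536 // 8

# Assumption: "~120k" is taken as 120 * 1024 = 122,880 tokens.
for target in (16_384, 32_768, 65_536, 122_880):
    degree = context_parallel_degree(target, TOKENS_PER_DEVICE)
    print(f"{target:>7} tokens -> {degree}-way context parallelism")
```

If your stack only supports power-of-two CP degrees, the 120k rung would round up to the next supported degree, which is worth knowing before fixing the shard format.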
OLMo 3 extended its base model from 8,192 → 65,536 tokens using Dolma 3 Longmino Mix — a 600B-token pool from which they trained on 50B tokens (7B model) / 100B (32B) OLMo 3 §3.6. The pieces we adopt:
| Ingredient | What OLMo did |
|---|---|
| Mix ratio | 34% long-context + 66% short-context replay (from their mid-training mix). They tested it: 66%-long drops short-task perf by 2.5 pts; 34%-long drops only 0.8 pts. |
| Long-doc backbone | OCR'd scientific PDFs, filtered by gzip compressibility — drop the most-compressible 20% and least-compressible 20%. |
| Synthetic augmentation | Inject aggregation tasks (CWE / REX) into real long docs. Beats natural documents alone on RULER. (Detailed in §5 below.) |
| RoPE extension | YaRN applied to full-attention layers only (sliding-window layers left untouched). Beat base-frequency scaling and position interpolation. |
| Packing | Best-fit document packing + intra-document masking (each token attends only within its own source document). |
| Token budget | More is better, especially at long lengths. 50B for the 7B. |
| Eval | RULER (development metric) + HELMET (held-out). |
The 120k length should be filled with real text, never by asking a model to write 120k tokens. Long generations degrade (repetition, drift), cost a fortune, and are worst in low-resource languages. The length budget is a sourcing + packing problem; generation only ever produces the short task layer on top.
How do you build long-context training data without already owning a long-context model? OLMo's answer (OLMo 3 §3.6.2 · CLIPPER) is to use document statistics (not an LLM) to find salient terms, extract a few short snippets, and feed only the snippets to a plain short-context generator, which writes the task. The answer is computed by code, not generated.
Consequence: the method is length-agnostic on the generation side. A short-context model can produce a 120k-token training sample, because it never sees 120k — only the snippets. The long context is the real document you attach.
So "bootstrapping a 120k sample" is entirely possible (and cheap) — what's not possible at quality is generating 120k tokens of context text. Source the context; generate only the task.
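A minimal sketch of the bootstrap idea for a CWE-style (common-words-extraction) task. It is not OLMo's pipeline: the function name, the word-length threshold, and the fixed question template are hypothetical, and the template stands in for the short-context generator that would normally write the task from the snippets.

```python
import re
from collections import Counter

def build_cwe_sample(document: str, k: int = 3, min_len: int = 4, max_snippets: int = 3) -> dict:
    """Build one common-words-extraction sample on top of a real long document."""
    # Document statistics, not an LLM, pick the salient terms.
    words = re.findall(r"\w+", document.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    answer = [word for word, _ in ranked[:k]]  # computed by code, never generated

    # A few short snippets are all a short-context generator would ever see.
    sentences = re.split(r"(?<=[.!?;])\s+", document)
    snippets = [s for s in sentences if any(w in s.lower() for w in answer)][:max_snippets]

    # Hypothetical fixed template standing in for the generator-written task.
    question = f"Which {k} words of at least {min_len} letters occur most often in the text above?"
    return {"context": document, "snippets": snippets, "question": question, "answer": answer}

sample = build_cwe_sample("Fish quotas rise. Fish stocks fall. Quotas bind fish fleets.", k=2)
print(sample["question"], sample["answer"])
```

The training sample is the full `context` plus the question and computed answer; its length is whatever the attached document's length is, which is why the generator's own window never matters.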
Packing and related-document concatenation get conflated. They are different, and the difference decides whether the long rung teaches anything:
| Technique | What it is | Cross-doc attention? | Use when |
|---|---|---|---|
| Best-fit packing Ding 2024 | Fill the fixed sequence buffer with multiple docs to avoid padding | No — intra-doc masked | You already have genuinely long single docs (efficiency only) |
| Related-doc concatenation ICLM / Shi 2023 | Join topically related docs into one long pseudo-document | Yes — no mask within group | Single docs are too short (your low-resource case at 120k) |
OLMo relied on single long docs + masking. You won't have many single 120k-token docs in el / uk / ro, so at the 120k rung you'll concatenate semantically related docs (same EUR-Lex domain, topic cluster, or retrieved neighbors), run the bootstrap over the whole concatenation, and not mask within a related group — so the 120k-range dependency is real.
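The difference between the two techniques shows up in the attention mask. The sketch below uses a best-fit-decreasing heuristic for packing and a segment-id mask; it is an illustration under those assumptions, not necessarily the exact algorithm in Ding 2024 or in OLMo's trainer, and the function names are hypothetical.

```python
def best_fit_pack(doc_lengths: list[int], seq_len: int) -> list[list[int]]:
    """Pack documents (given by token length) into sequences of at most seq_len tokens.

    Best-fit decreasing: take documents longest first and place each into the
    open sequence with the least remaining space that still fits it.
    """
    bins: list[dict] = []
    for idx in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        n = doc_lengths[idx]
        if n > seq_len:
            raise ValueError("chunk documents longer than seq_len before packing")
        fitting = [b for b in bins if b["free"] >= n]
        if fitting:
            target = min(fitting, key=lambda b: b["free"])
        else:
            target = {"free": seq_len, "docs": []}
            bins.append(target)
        target["docs"].append(idx)
        target["free"] -= n
    return [b["docs"] for b in bins]

def causal_segment_mask(segment_ids: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j (causal, same segment)."""
    n = len(segment_ids)
    return [[j <= i and segment_ids[i] == segment_ids[j] for j in range(n)] for i in range(n)]

# Packing: unrelated documents get distinct segment ids, so no cross-document attention.
packed_mask = causal_segment_mask([0, 0, 1])
# Related-doc concatenation: the whole group shares one segment id, so attention crosses documents.
group_mask = causal_segment_mask([0, 0, 0])
print(best_fit_pack([5, 3, 4, 2], seq_len=8), packed_mask[2][0], group_mask[2][0])
```

Under this framing the only change at the 120k rung is which tokens share a segment id: one id per related group instead of one per document.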
Don't jump 4k→120k in one step. Ladder it — 4k → 16k → 32k → 64k → 120k — with a YaRN rescale and an eval gate at each rung (mirroring OLMo's RULER curriculum). At each stage, train on a band of lengths rather than one exact length: roughly 70–75% of sequences near the current target, 25–30% shorter (the Qwen3 pattern, e.g. "75% at 16K–32K, 25% at 4K–16K") Qwen3. That avoids overfitting to a single length and yields a cheap checkpoint to evaluate at every rung. Two regimes drive the data:
| Regime | RoPE | Where the length comes from | Generation used for |
|---|---|---|---|
| Early rungs (→ 16k / 32k) | YaRN (full-attn layers) | Sourced single docs (EUR-Lex / Europarl) | CWE/REX task layer; short reasoning traces ride in replay |
| Long rungs (→ 64k / 120k) | YaRN, rescaled further | Sourced + related-doc concatenation to reach length | CWE/REX over the concatenation; long reasoning traces in the long slice |
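One way to sample per-sequence target lengths for a rung, following the Qwen3-style split quoted above. The band edges are assumptions: the "near" band runs from the previous rung up to the current one, the "shorter" band is the band below that, and `MIN_LEN` is a hypothetical floor for the first rung.

```python
import random

RUNGS = [4_096, 16_384, 32_768, 65_536, 122_880]  # 120k taken as 120 * 1024 (assumption)
MIN_LEN = 1_024  # hypothetical lower bound for the shortest band

def bands_for_rung(rung: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return ((near_lo, near_hi), (short_lo, short_hi)) for RUNGS[rung], with rung >= 1."""
    if not 1 <= rung < len(RUNGS):
        raise ValueError("rung must index a target above the base context length")
    near = (RUNGS[rung - 1], RUNGS[rung])
    short_lo = RUNGS[rung - 2] if rung >= 2 else MIN_LEN
    return near, (short_lo, RUNGS[rung - 1])

def sample_length(rung: int, rng: random.Random, near_fraction: float = 0.75) -> int:
    """Draw a target sequence length: near_fraction from the near band, the rest shorter."""
    near, short = bands_for_rung(rung)
    lo, hi = near if rng.random() < near_fraction else short
    return rng.randint(lo + 1, hi)  # band is (lo, hi], randint is inclusive

rng = random.Random(0)
print(bands_for_rung(2), [sample_length(2, rng) for _ in range(5)])
```

For the 32k rung this reproduces the quoted example: 75% of draws fall in 16K–32K and 25% in 4K–16K.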
Training on longer documents extends the context window (input comprehension). It does not, by itself, create reasoning ability; that comes from reasoning-trace content plus later SFT/RL. The two co-occur (traces are long), but they are different levers. The early 4k→32k stretch of the ladder is itself already a context extension (8×), so it needs YaRN too; it isn't "just training on longer docs."
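For reference, a sketch of the YaRN frequency rescale as described in the YaRN paper, in plain Python. The defaults `alpha=1`, `beta=32` and the `0.1 * ln(s) + 1` attention factor are the values the paper reports for LLaMA-family models; whether they suit your model is an assumption, as are `head_dim=128`, `base=10000`, and measuring the scale factor against the original 4k window at every rung.

```python
import math

def yarn_inv_freq(head_dim: int, scale: float, orig_ctx: int,
                  base: float = 10_000.0, alpha: float = 1.0, beta: float = 32.0) -> list[float]:
    """Per-dimension RoPE frequencies after YaRN's "NTK-by-parts" interpolation."""
    out = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        # Number of full rotations this dimension completes within the original context.
        rotations = orig_ctx * theta / (2 * math.pi)
        # Ramp: 0 means fully interpolate (divide by scale), 1 means leave untouched.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1 - gamma) * theta / scale + gamma * theta)
    return out

def yarn_attention_factor(scale: float) -> float:
    """Factor applied to queries and keys; logits are scaled by its square."""
    return 0.1 * math.log(scale) + 1.0

ORIG_CTX = 4_096
for target in (16_384, 32_768, 65_536, 122_880):
    s = target / ORIG_CTX
    freqs = yarn_inv_freq(head_dim=128, scale=s, orig_ctx=ORIG_CTX)
    print(target, s, round(yarn_attention_factor(s), 3), freqs[-1])
```

High-frequency dimensions (many rotations inside 4k) are left alone and low-frequency ones are divided by the scale, which is why the rescale has to be recomputed at each rung rather than set once.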
Both paths share the same sourced long documents, the same curriculum, packing, and the 34/66 mix — so sourcing is never duplicated. They differ only in how the task layer is produced.
The pilot's synthgen library is internal and not yet public. Either path can be driven by any batch inference engine that fans requests across vLLM endpoints on Leonardo — e.g. Inference Hive for SLURM-scheduled batch jobs. The meta-prompt / topic-persona logic is a thin layer on top; the engine just needs checkpoint/resume and an OpenAI-compatible client.
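The checkpoint/resume requirement can be met with very little code. The sketch below is engine-agnostic: `generate` is a hypothetical callable that would wrap whatever OpenAI-compatible client you use against the vLLM endpoints, and the JSONL layout with an `id` field is an assumption, not the synthgen or Inference Hive format.

```python
import json
from pathlib import Path
from typing import Callable, Iterable

def run_with_resume(requests: Iterable[dict], generate: Callable[[dict], str], out_path: str) -> int:
    """Generate outputs for requests, skipping ids already present in out_path (JSONL).

    Returns the number of requests actually generated in this run.
    """
    path = Path(out_path)
    done = set()
    needs_newline = False
    if path.exists():
        text = path.read_text(encoding="utf-8")
        # A crash can leave a torn final line; remember to terminate it before appending.
        needs_newline = bool(text) and not text.endswith("\n")
        for line in text.splitlines():
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # unreadable line: that request is simply regenerated
    written = 0
    with path.open("a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        for req in requests:
            if req["id"] in done:
                continue
            record = {"id": req["id"], "output": generate(req)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # keep completed work on disk if the SLURM job is preempted
            written += 1
    return written
```

A SLURM job that is requeued simply calls the same function again with the same request list and output path.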
Reuse the synthgen meta-prompt machinery (topic distribution + personas + constraints), but feed a real sourced long document into a multilingual model with a genuine 120k input window and have it author diverse instruction-following (IF) tasks and answers grounded in the whole context.
Caution: the "fully synthetic" variant (asking the model to write 120k tokens of text) hits the output-length wall (only ~8–32k tokens of output are reliable). Keep that to a small long-form-writing slice only.
Port OLMo's longmino_synthetic_cwe_rex pipeline to a multilingual setting. A short-context model sees only snippets; the answer is computed.
| Dimension | Path 1 (prompt-driven) | Path 2 (bootstrap) |
|---|---|---|
| 120k length from | real sourced docs | real sourced / concatenated docs |
| Generator requirement | multilingual model strong at 120k input | any short-context multilingual model |
| What's generated | task + answer | task only (answer computed) |
| Cost | high (long prefill) | low |
| Task diversity | high / natural | CWE + 12 vignettes |
| Verifiable signal | no | yes (CWE) |
| Scales to 120k | yes, expensively | yes, trivially |
| Source | Why, for the EU target langs |
|---|---|
| EUR-Lex / MultiEURLEX | Top pick. Legislative texts — individually very long, parallel across all 24 EU official languages (covers every target). Your "science-PDF equivalent." |
| HPLT v2 · CulturaX · FineWeb-2 | Document-level web text; filter for the longest docs per language. Bulk volume. |
| Europarl | Parliamentary proceedings — long, parallel. |
| Wikipedia (per-language) | Best raw material for related-doc concatenation at the 120k rung. |
| The Stack v2 (repo-level) | Language-agnostic long-range structure; improves long-context generally. A few % of the mix. |
Apply OLMo's gzip filter (drop most/least-compressible 20%) to whatever you pick — it's language-agnostic and ports directly.
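A minimal version of that filter, assuming it is applied to one language's documents at a time (per-language application is my assumption: UTF-8 byte counts differ between Latin, Greek, and Cyrillic scripts, so pooled ratios may not be comparable). The function names are illustrative, not OLMo's.

```python
import gzip

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more compressible."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / max(len(raw), 1)

def gzip_band_filter(docs: list[str], drop_fraction: float = 0.2) -> list[str]:
    """Drop the most- and least-compressible drop_fraction of docs, keeping input order."""
    ratios = [compression_ratio(d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: ratios[i])
    cut = int(len(docs) * drop_fraction)
    keep = set(order[cut:len(docs) - cut])
    return [d for i, d in enumerate(docs) if i in keep]
```

The most-compressible tail tends to be repetitive boilerplate and the least-compressible tail tends to be noise such as OCR debris or tables of numbers, so the middle band is what remains.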
OLMo deliberately put reasoning/thinking traces into base mid-training as plain CoT content — not delimited <think> post-training behavior. The "full mix" with thinking + instruction data beat the mix without it on their base eval (avg 50.7 vs 48.8; Math 48.7 vs 43.1) OLMo 3 Table 10. Those traces enter the long-context stage automatically through the 66% replay.
Delimited <think>-token behavior is reserved for the later post-training SFT/RL stage.
There's no standard multilingual RULER, so we build one; this is the natural job for synthgen:
Path 2 is the scalable backbone — cheap, verifiable, scales to 120k without quality risk. Path 1 is a diversity top-up — only if you can find a multilingual model you trust at 120k input. Bake them off first: generate a small batch from each, train a 32k-rung probe, and let multilingual-RULER set the ratio — exactly the Phase-1 generator A/B methodology you already used.
The 10–50B-token extension budget is a total — and it has to cover every target language plus replay. So the per-language long-context allocation is modest, and the lowest-resource languages (el, uk, ro) need explicit token floors so they aren't drowned by es/fr/de. Path 2 (verifiable, cheap) is what lets you hit those floors economically; Path 1 is rationed where it adds the most diversity.
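A sketch of how explicit token floors could be enforced. All numbers are hypothetical placeholders: the 17B long-context budget is just 34% of a 50B run (OLMo's 7B figures, used for illustration), and the language weights and 1B floors are invented to show the mechanics, not a recommendation.

```python
def allocate_with_floors(budget: float, weights: dict[str, float],
                         floors: dict[str, float]) -> dict[str, float]:
    """Split budget proportionally to weights, but never below a language's floor.

    Languages whose proportional share falls under their floor are pinned to the
    floor; the remaining budget is re-split among the others until stable.
    """
    if sum(floors.values()) > budget:
        raise ValueError("floors alone exceed the budget")
    pinned: dict[str, float] = {}
    free = set(weights)
    while True:
        remaining = budget - sum(pinned.values())
        total_w = sum(weights[lang] for lang in free)
        shares = {lang: remaining * weights[lang] / total_w for lang in free}
        below = [lang for lang in free if shares[lang] < floors.get(lang, 0.0)]
        if not below:
            return {**pinned, **shares}
        for lang in below:
            pinned[lang] = floors[lang]
            free.remove(lang)

# Hypothetical inputs, in billions of tokens.
long_budget = 0.34 * 50.0
weights = {"es": 0.30, "fr": 0.30, "de": 0.30, "el": 0.04, "uk": 0.03, "ro": 0.03}
floors = {"el": 1.0, "uk": 1.0, "ro": 1.0}
allocation = allocate_with_floors(long_budget, weights, floors)
print({lang: round(tokens, 2) for lang, tokens in sorted(allocation.items())})
```

With these placeholder inputs the three low-resource languages are lifted to their floors and the high-resource languages split what is left, which makes the cost of each floor visible before any data is generated.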
Almost the entire long-context literature above is English-centric — the "code + books + scientific papers" long-document recipe is tuned for English. LongAlign is the one paper that explicitly ablates multilingual long-context SFT, and finds it helps the target languages without hurting English Bai 2024.
If your EU-language experiments push to 120K, you may be among the first to surface whether that recipe transfers to lower-resource EU languages — where genuinely long natural documents in-language are far scarcer than in English. That scarcity makes synthetic concatenation (Quest / ICLM-style: group same-language documents by topic into coherent long sequences) not just a convenience but more necessary than in the English-only papers. It's a real gap, and a publishable contribution if the transfer story is measured cleanly per language.
Items marked "in OLMo 3" are cited within the OLMo 3 technical report, §3.6; links go to the canonical paper or a search for it.
olmo_3.pdf). allenai.org/olmodolma3/datasets/dolma3_longmino_mix/synthetic_cwe_rex/. github.com/allenai/dolma3