OELLM · Multilingual Long-Context Mid-Training

Extending a 4k model to 120k context for low-resource European languages

A data-generation plan for the long-context extension stage — what to source, how to synthesize it, and two candidate generation paths to bake off.

Target: in-house multilingual base model pretrained at 4k · Goal window: ~120k tokens · Languages: es, fr, de, it, pt, pl, nl, cs, ro, el, uk (target set may grow) · Infra: local vLLM on Leonardo · Tooling: synthgen (internal, not yet public) — or any batch inference engine, e.g. Inference Hive

We are extending a model pretrained at a 4k context window out to ~120k. This is a continued-pretraining / mid-training problem, not an instruction-tuning one. The good news from the literature is that context extension is cheap on data: published recipes range from ~1–5B tokens (Fu 2024) to 50B (OLMo 3), and we budget 10–50B. That only holds if the data actually contains long-range dependencies and the original short-context mixture is replayed to avoid regression.

1 · The goal & the one hard constraint

Context extension means training on longer sequences while rescaling the positional embeddings so the model generalizes to positions it never saw in pretraining. Two principles decide whether it works:

The data must contain real long-range dependencies. Packing random short documents up to 120k teaches nothing — attention never needs to reach past 4k. This is the entire challenge for low-resource languages, where genuinely long single documents are scarce.
Replay the original mixture. Mix long-context data with high-quality short-context data from the previous stage so short-context skills don't regress.

⚠ The real blocker is training-side, not data-side

A 7B model at a 120k sequence length cannot fit activations on one GPU. You need context / sequence parallelism: OLMo used 8-way CP, 8k tokens per device, with all-gather attention to support irregular masks (OLMo 3 §3.6.4). Confirm your training stack supports this before building data, because it dictates the shard format. Note also that your 4k→120k jump (~30×) is more aggressive than OLMo's 8k→65k (8×), which is why we stage it.

2 · How OLMo 3 did it (§3.6, our reference recipe)

OLMo 3 extended its base model from 8,192 → 65,536 tokens using Dolma 3 Longmino Mix, a 600B-token pool from which they trained on 50B tokens (7B model) / 100B (32B model) (OLMo 3 §3.6). The pieces we adopt:

Ingredient	What OLMo did
Mix ratio	34% long-context + 66% short-context replay (from their mid-training mix). They tested it: 66%-long drops short-task perf by 2.5 pts; 34%-long drops only 0.8 pts.
Long-doc backbone	OCR'd scientific PDFs, filtered by gzip compressibility — drop the most-compressible 20% and least-compressible 20%.
Synthetic augmentation	Inject aggregation tasks (CWE / REX) into real long docs. Beats natural documents alone on RULER. (Detailed in §5 below.)
RoPE extension	YaRN applied to full-attention layers only (sliding-window layers left untouched). Beat base-frequency scaling and position interpolation.
Packing	Best-fit document packing + intra-document masking (each token attends only within its own source document).
Token budget	More is better, especially at long lengths. 50B for the 7B.
Eval	RULER (development metric) + HELMET (held-out).
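
As a rough illustration of the RoPE extension row, the per-dimension YaRN rescale described by Peng et al. (2023) can be sketched in plain Python. The defaults used here (base 10000, alpha=1, beta=32, a 4k original window) are assumptions for illustration, not values taken from this plan or from OLMo 3; check them against the target model's config.

```python
import math

def yarn_inv_freq(dim, base=10000.0, orig_ctx=4096, scale=4.0, alpha=1.0, beta=32.0):
    """Return (rescaled inverse frequencies, attention scaling factor).

    All default values are illustrative assumptions.
    """
    inv_freq = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)                      # original RoPE frequency
        rotations = orig_ctx * theta / (2 * math.pi)    # full turns inside the original window
        # Ramp: 0 means interpolate fully (low frequency), 1 means leave unchanged (high frequency).
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        inv_freq.append((1.0 - gamma) * theta / scale + gamma * theta)
    attn_factor = 0.1 * math.log(scale) + 1.0           # YaRN's suggested logit scaling
    return inv_freq, attn_factor
```

High-frequency dimensions keep their original frequency while low-frequency ones are divided by the scale factor, which is the property that distinguishes YaRN from plain position interpolation.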

3 · Three concepts that drive every choice

3.1 — Length comes from sourcing, not generation

The 120k length should be filled with real text, never by asking a model to write 120k tokens. Long generations degrade (repetition, drift), cost a fortune, and are worst in low-resource languages. The length budget is a sourcing + packing problem; generation only ever produces the short task layer on top.

★ The bootstrap insight — this is the whole trick

How do you build long-context training data without already owning a long-context model? OLMo's answer (OLMo 3 §3.6.2, inspired by CLIPPER): use document statistics (not an LLM) to find salient terms, extract a few short snippets, and feed only the snippets to a plain short-context generator to write a task. The answer is computed by code, not generated.

Consequence: the method is length-agnostic on the generation side. A short-context model can produce a 120k-token training sample, because it never sees 120k — only the snippets. The long context is the real document you attach.

3.2 — A "120k sample" is mostly real document + a tiny generated task

[ real 120k-token context (sourced / packed) ] + [ short generated question ] + [ computed answer ]

So "bootstrapping a 120k sample" is entirely possible (and cheap) — what's not possible at quality is generating 120k tokens of context text. Source the context; generate only the task.
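
A minimal sketch of that three-part sample, assuming a counting-style task. The `LongSample` class, `count_term`, and `build_count_sample` are hypothetical names for illustration, and the question template stands in for whatever the short-context generator writes.

```python
import re
from dataclasses import dataclass

@dataclass
class LongSample:
    context: str    # real sourced text, never generated
    question: str   # short task text (here a template stands in for the generator)
    answer: str     # computed by code, not by a model

def count_term(context: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of term."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, context, flags=re.IGNORECASE))

def build_count_sample(context: str, term: str, question_template: str) -> LongSample:
    answer = str(count_term(context, term))
    return LongSample(context, question_template.format(term=term), answer)
```

Because the answer comes from `count_term`, the generator never needs to read the full context; only the question wording depends on a model.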

3.3 — "Packing" ≠ "concatenating related docs"

These get conflated. They are different, and the difference decides whether the long rung teaches anything:

Technique	What it is	Cross-doc attention?	Use when
Best-fit packing (Ding 2024)	Fill the fixed sequence buffer with multiple docs to avoid padding	No (intra-doc masked)	You already have genuinely long single docs (efficiency only)
Related-doc concatenation (ICLM / Shi 2023)	Join topically related docs into one long pseudo-document	Yes (no mask within group)	Single docs are too short (your low-resource case at 120k)

OLMo relied on single long docs + masking. You won't have many single 120k-token docs in el / uk / ro, so at the 120k rung you'll concatenate semantically related docs (same EUR-Lex domain, topic cluster, or retrieved neighbors), run the bootstrap over the whole concatenation, and not mask within a related group — so the 120k-range dependency is real.
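
The distinction can be sketched as follows, assuming each packing unit is either a single document or an already-concatenated related group. This is a simplified best-fit-decreasing heuristic, not the exact algorithm from Ding et al.: over-length units raise an error instead of being chunked, and all function names are hypothetical.

```python
def best_fit_pack(unit_lengths, capacity):
    """Place each unit in the bin with the least free space that still fits it."""
    bins = []  # each bin: {"free": int, "units": [unit index, ...]}
    order = sorted(range(len(unit_lengths)), key=lambda i: -unit_lengths[i])
    for i in order:
        n = unit_lengths[i]
        if n > capacity:
            raise ValueError("unit longer than the sequence buffer; chunk it first")
        fits = [b for b in bins if b["free"] >= n]
        if fits:
            target = min(fits, key=lambda b: b["free"])
        else:
            target = {"free": capacity, "units": []}
            bins.append(target)
        target["units"].append(i)
        target["free"] -= n
    return [b["units"] for b in bins]

def segment_ids(unit_indices, unit_lengths):
    """One id per token; a related-doc group is one unit, so it shares one id."""
    ids = []
    for u in unit_indices:
        ids.extend([u] * unit_lengths[u])
    return ids

def may_attend(seg, q, k):
    """Causal attention restricted to the same packing unit."""
    return k <= q and seg[q] == seg[k]
```

Treating a related group as one unit is what removes the mask inside the group while keeping it between unrelated units that merely share a buffer.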

4 · The staged curriculum

Don't jump 4k→120k in one step. Ladder it (4k → 16k → 32k → 64k → 120k) with a YaRN rescale and an eval gate at each rung (mirroring OLMo's RULER curriculum). At each stage, train on a band of lengths rather than one exact length: roughly 70–75% of sequences near the current target and 25–30% shorter, following the Qwen3 pattern of "75% at 16K–32K, 25% at 4K–16K" (Qwen3). That avoids overfitting to a single length and yields a cheap checkpoint to evaluate at every rung. Two regimes drive the data:
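
One way to write the ladder down as a config, as a sketch only. The band boundaries (half the target as the split point) generalize the quoted Qwen3 example and are an assumption, as is expressing the YaRN scale relative to the original 4k window; `rung_plan` is a hypothetical helper.

```python
BASE_CTX = 4_096
RUNGS = [16_384, 32_768, 65_536, 131_072]   # 131,072 is the buffer size that covers 120k

def rung_plan(target, near_frac=0.75):
    """Length bands and YaRN scale for one rung (band split at target // 2 is an assumption)."""
    half = target // 2
    return {
        "target": target,
        "yarn_scale": target / BASE_CTX,          # relative to the 4k pretraining window
        "bands": [((half, target), near_frac),    # sequences near the current target
                  ((BASE_CTX, half), 1.0 - near_frac)],  # shorter sequences
    }

plan = [rung_plan(t) for t in RUNGS]
```

For the 32k rung this reproduces the quoted split: 75% of sequences in 16K–32K and 25% in 4K–16K.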

Regime	RoPE	Where the length comes from	Generation used for
Early rungs (→ 16k / 32k)	YaRN (full-attn layers)	Sourced single docs (EUR-Lex / Europarl)	CWE/REX task layer; short reasoning traces ride in replay
Long rungs (→ 64k / 120k)	YaRN, rescaled further	Sourced + related-doc concatenation to reach length	CWE/REX over the concatenation; long reasoning traces in the long slice

A clarification worth keeping

Training on longer documents extends the context window (input comprehension). It does not, by itself, create reasoning ability; that comes from reasoning-trace content plus later SFT/RL. The two co-occur (traces are long) but they're different levers. Even the early rungs (4k→16k→32k, 8× in total) are already a context extension, so they need YaRN too; they are not "just training on longer docs."

5 · Two generation paths

Both paths share the same sourced long documents, the same curriculum, packing, and the 34/66 mix — so sourcing is never duplicated. They differ only in how the task layer is produced.

Tooling note

The pilot's synthgen library is internal and not yet public. Either path can be driven by any batch inference engine that fans requests across vLLM endpoints on Leonardo — e.g. Inference Hive for SLURM-scheduled batch jobs. The meta-prompt / topic-persona logic is a thin layer on top; the engine just needs checkpoint/resume and an OpenAI-compatible client.

Path 1 · Prompt-driven

Strong multilingual model authors the task

Reuse the synthgen meta-prompt machinery (topic distribution + personas + constraints), but feed a real sourced long document into a multilingual model with a genuine 120k input window and have it author diverse instruction-following (IF) tasks and answers grounded in the whole context.

Generates: task and answer (model-judged)
Strengths: high task diversity & naturalness; multi-hop / synthesis; leverages your persona-topic strength
Risks: answer quality bounded by the model's real long-context ability (partial bootstrap problem); hallucination over 120k; expensive long prefill, tiny vLLM concurrency; few multilingual models are truly strong at 120k for low-resource langs
Verifiable: no — needs a QC / back-translation pass
Also the natural home for reasoning traces: prompt the model to produce long, multi-step CoT (math / code / analysis) in-language. These are long by construction (up to ~32k tokens), so they feed the 34% long slice directly and give you per-language reasoning content that the bootstrap path can't produce.

Caution: the "fully synthetic" variant (ask the model to write 120k of text) hits the output-length wall (~8–32k reliable). Keep that to a small long-form-writing slice only.

Path 2 · OLMo bootstrap

Stats pick terms; code computes the answer

Port OLMo's longmino_synthetic_cwe_rex pipeline to multilingual. A short-context model sees only snippets; the answer is computed.

Partition a long doc into 8k–32k sections at natural breaks
Tokenize, extract 1–2 word noun phrases, rank by tf-idf (per-language)
Per phrase, pull k=8 top snippets
Short model writes the task; CWE answer = computed count, REX = one of 12 vignettes

Strengths: cheap, fast, verifiable (CWE counts are ground truth), scales to 120k trivially, OLMo-validated
Risks: task variety limited to CWE + 12 REX vignettes; tf-idf needs per-language tokenization (el/uk non-Latin)
Verifiable: CWE yes, REX phrasing only
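
The term-selection and snippet steps above can be sketched as below. This is not OLMo's implementation: unigrams and bigrams stand in for real noun-phrase extraction (which would need a per-language tagger), "top" snippets are simply the first k occurrences, and the regex tokenizer is an assumption that should be checked for el/uk.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\w+")   # Unicode-aware for str, so Greek and Cyrillic tokenize too

def tokens(text):
    return [t.lower() for t in TOKEN.findall(text)]

def candidates(toks):
    """Unigrams plus bigrams, standing in for 1-2 word noun phrases."""
    return toks + [" ".join(pair) for pair in zip(toks, toks[1:])]

def rank_terms(sections, top_n=5):
    """Rank candidate terms by tf-idf across sections; return (term, total count) pairs."""
    per_section = [Counter(candidates(tokens(s))) for s in sections]
    df = Counter(term for counts in per_section for term in counts)
    tf = sum(per_section, Counter())
    n = len(sections)
    score = {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
    best = sorted(score, key=lambda t: (-score[t], t))[:top_n]
    return [(t, tf[t]) for t in best]   # the count is the computed CWE answer

def top_snippets(text, term, k=8, window=200):
    """First k character windows around occurrences of term."""
    hits = [m.start() for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)]
    return [text[max(0, h - window): h + len(term) + window] for h in hits[:k]]
```

Only the snippets returned by `top_snippets` would be shown to the short-context generator; the count from `rank_terms` is stored as the ground-truth answer.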

Side-by-side

Dimension	Path 1 (prompt-driven)	Path 2 (bootstrap)
120k length from	real sourced docs	real sourced / concatenated docs
Generator requirement	multilingual model strong at 120k input	any short-context multilingual model
What's generated	task + answer	task only (answer computed)
Cost	high (long prefill)	low
Task diversity	high / natural	CWE + 12 vignettes
Verifiable signal	no	yes (CWE)
Scales to 120k	yes, expensively	yes, trivially

Where to source the documents (open item for Path 2)

Source	Why, for the EU target langs
EUR-Lex / MultiEURLEX	Top pick. Legislative texts: individually very long and parallel across all 24 EU official languages, which covers every target except uk (Ukrainian is not an EU official language). Your "science-PDF equivalent."
HPLT v2 · CulturaX · FineWeb-2	Document-level web text; filter for the longest docs per language. Bulk volume.
Europarl	Parliamentary proceedings: long and parallel across EU languages (no uk).
Wikipedia (per-language)	Best raw material for related-doc concatenation at the 120k rung.
The Stack v2 (repo-level)	Language-agnostic long-range structure; improves long-context generally. A few % of the mix.

Apply OLMo's gzip filter (drop most/least-compressible 20%) to whatever you pick — it's language-agnostic and ports directly.
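
A minimal sketch of such a filter using Python's standard `gzip` module. The ratio definition (compressed bytes over raw UTF-8 bytes) and the function names are assumptions; the exact statistic OLMo used may differ.

```python
import gzip

def gzip_ratio(text: str) -> float:
    """Compressed size over raw size; lower means more compressible (more repetitive)."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / max(1, len(raw))

def gzip_filter(docs, drop_frac=0.2):
    """Drop the most- and least-compressible drop_frac of docs, keeping input order."""
    order = sorted(range(len(docs)), key=lambda i: gzip_ratio(docs[i]))
    cut = int(len(docs) * drop_frac)
    keep = set(order[cut: len(docs) - cut])
    return [d for i, d in enumerate(docs) if i in keep]
```

Since Greek and Cyrillic text uses two bytes per character in UTF-8, it is probably safer to compute the cut-offs within each language rather than over the pooled corpus.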

6 · Reasoning traces — include them, as base content

OLMo deliberately put reasoning/thinking traces into base mid-training as plain CoT content, not as delimited <think> post-training behavior. The "full mix" with thinking + instruction data beat the mix without it on their base eval (avg 50.7 vs 48.8; Math 48.7 vs 43.1; OLMo 3 Table 10). Those traces enter the long-context stage automatically through the 66% replay.

Make your 66% replay carry reasoning traces (math/code/general CoT datasets), as plain content, no think tokens at this stage.
Use long reasoning traces for double duty: a multi-thousand-token trace (OLMo notes up to ~32k) is simultaneously reasoning content and a naturally long sequence — so it can sit in the 34% long slice, directly serving long-output capability.
Generate them via Path 1. The prompt-driven path (§5) is exactly the mechanism: prompt a strong multilingual model to produce long, multi-step CoT (math / code / analysis) in-language. Long ones go in the 34% slice; shorter ones ride in replay. This is reasoning content Path 2's bootstrap cannot produce.
Math/code traces are largely language-portable — a cheap way to spread reasoning signal across the target languages.
Save explicit <think>-token behavior for the later post-training SFT/RL stage.

7 · Mix, RoPE & packing recipe

Mix: ~34% long-context / ~66% short-context replay (replay = your prior mid-training mixture, which itself includes reasoning traces).
Long slice composition: natural sourced docs + Path 2 (CWE/REX) as the volume backbone + Path 1 as a diversity garnish + long reasoning traces.
RoPE: YaRN on full-attention layers; rescale per rung. (If your architecture is full-attention everywhere, apply throughout — confirm the arch.)
Packing: best-fit packing; intra-doc masking except within related-doc concatenations (where the cross-doc signal is the point).
Token budget: ~10–20B (token-efficient, à la ProLong's 20B) up to ~50B (OLMo-style) depending on Leonardo compute. More tokens especially help the longest lengths.
Target sequence length: 131,072 to comfortably cover 120k.
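
For a sense of scale, the recipe above translates into token counts as follows. This is plain arithmetic under the stated 34/66 split and a 131,072-token buffer; `mix_budget` is a hypothetical helper.

```python
def mix_budget(total_tokens, long_pct=34, seq_len=131_072):
    """Split a total token budget into long-context and replay tokens (integer math)."""
    long_tokens = total_tokens * long_pct // 100
    return {
        "long_tokens": long_tokens,
        "replay_tokens": total_tokens - long_tokens,
        "packed_sequences": total_tokens // seq_len,  # full buffers across the whole mix
    }

budget = mix_budget(50_000_000_000)
```

At the OLMo-style 50B budget this leaves 17B tokens for the entire long slice, shared across all target languages.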

8 · Evaluation

There's no standard multilingual RULER, so we build one — this is the natural job for synthgen:

Multilingual needle-in-a-haystack + aggregation at 4k / 16k / 32k / 64k / 120k × each target language (CWE-style counting is itself a RULER aggregation task, and it's verifiable).
Held-out long-document perplexity per language.
Short-task regression gate: rerun the existing 4k evals after each rung — they must not drop (that's what the 66% replay protects).
RULER / HELMET as the conceptual templates (Hsieh 2024; Yen 2025).
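
A needle-in-a-haystack item for one language and one length could be built roughly like this. Lengths are measured in characters as a stand-in for tokens (swap in the model tokenizer for real use), and `make_niah` and `exact_match` are hypothetical helpers, not part of RULER or synthgen.

```python
import random

def make_niah(filler_sentences, needle, question, answer, target_chars, depth, seed=0):
    """Build one item: in-language filler with the needle inserted at a relative depth (0..1)."""
    rng = random.Random(seed)
    hay, size = [], 0
    while size < target_chars:
        sentence = rng.choice(filler_sentences)
        hay.append(sentence)
        size += len(sentence) + 1
    hay.insert(int(len(hay) * depth), needle)
    return {"prompt": " ".join(hay) + "\n\n" + question, "answer": answer, "depth": depth}

def exact_match(prediction, answer):
    """Lenient scoring: the gold answer must appear in the prediction."""
    return answer.strip().lower() in prediction.strip().lower()
```

Sweeping `depth` and `target_chars` per language gives the length × language grid described above, with scoring that needs no judge model.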

9 · Recommendation & next steps

★ Recommended shape

Path 2 is the scalable backbone — cheap, verifiable, scales to 120k without quality risk. Path 1 is a diversity top-up — only if you can find a multilingual model you trust at 120k input. Bake them off first: generate a small batch from each, train a 32k-rung probe, and let multilingual-RULER set the ratio — exactly the Phase-1 generator A/B methodology you already used.

Phase 0 — unblock training: confirm context/sequence parallelism in the training stack; confirm the target model's positional encoding and tokenizer.
Phase 1 — profile sources: per-language length histograms across EUR-Lex / HPLT / CulturaX. Identify each language's long-doc deficit (drives how much related-doc concatenation the 120k rung needs).
Phase 2 — build & bake off: port CWE/REX to multilingual (Path 2); stand up a 120k-input generator for Path 1; generate small batches of each.
Phase 3 — eval probe: train the 32k rung on each, measure multilingual-RULER, set the Path-1/Path-2 ratio.
Phase 4 — scale & ladder: 4k→16k→32k→64k→120k with per-rung YaRN + eval gates and the 34/66 mix.
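
The Phase 1 profiling step can be sketched as below, assuming document lengths have already been measured in tokens for one language. The bucket edges follow the evaluation lengths in this plan; `length_profile` and `long_doc_deficit` are hypothetical helpers.

```python
from collections import Counter

RUNGS = [4_096, 16_384, 32_768, 65_536, 131_072]

def length_profile(doc_token_lengths):
    """Count docs per bucket; bucket r holds docs with at least r tokens (0 = below 4k)."""
    hist = Counter()
    for n in doc_token_lengths:
        reached = [r for r in RUNGS if n >= r]
        hist[reached[-1] if reached else 0] += 1
    return dict(hist)

def long_doc_deficit(doc_token_lengths, rung, needed_tokens):
    """Tokens still missing at a rung after using every doc that is long enough."""
    have = sum(n for n in doc_token_lengths if n >= rung)
    return max(0, needed_tokens - have)
```

A non-zero deficit at the 64k or 120k rung is the amount that related-doc concatenation would have to supply for that language.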

Open decisions

Path 1 model: is there a multilingual instruct model you trust at 120k input for el/uk/ro? If not, Path 1 collapses to the small long-form-writing slice.
Path 2 source: EUR-Lex as primary — confirm license & per-language volume.
Architecture: does the target model have sliding-window layers (→ YaRN on full-attn only) or full attention everywhere (→ YaRN throughout)?

Final words — budget & the multilingual bet

Budget, spread across the languages

The 10–50B-token extension budget is a total — and it has to cover every target language plus replay. So the per-language long-context allocation is modest, and the lowest-resource languages (el, uk, ro) need explicit token floors so they aren't drowned by es/fr/de. Path 2 (verifiable, cheap) is what lets you hit those floors economically; Path 1 is rationed where it adds the most diversity.
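
One possible way to turn token floors into an allocation, as a sketch. The policy (floor first, then share the remainder in proportion to leftover supply) is an assumption, and it deliberately does not model repetition or upsampling of scarce languages.

```python
def allocate(long_budget, available, floor):
    """Per-language long-context tokens: floor first, remainder shared by leftover supply."""
    alloc = {lang: min(floor, supply) for lang, supply in available.items()}
    rest = long_budget - sum(alloc.values())
    if rest < 0:
        raise ValueError("floors alone exceed the long-context budget")
    spare = {lang: available[lang] - alloc[lang] for lang in available}
    total_spare = sum(spare.values())
    if total_spare > 0:
        for lang in available:
            # Integer share of the remainder, capped by what the language actually has.
            alloc[lang] += min(spare[lang], rest * spare[lang] // total_spare)
    return alloc
```

A language whose supply is below the floor (uk in the test values) gets everything it has, which makes the shortfall visible instead of silently filling it from es/fr/de.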

★ The multilingual bet — where this work is novel

Almost the entire long-context literature above is English-centric: the "code + books + scientific papers" long-document recipe is tuned for English. LongAlign is the one paper that explicitly ablates multilingual long-context SFT, and it finds that this helps the target languages without hurting English (Bai 2024).

If your EU-language experiments push to 120K, you may be among the first to surface whether that recipe transfers to lower-resource EU languages — where genuinely long natural documents in-language are far scarcer than in English. That scarcity makes synthetic concatenation (Quest / ICLM-style: group same-language documents by topic into coherent long sequences) not just a convenience but more necessary than in the English-only papers. It's a real gap, and a publishable contribution if the transfer story is measured cleanly per language.

10 · References & data sources

Items marked "in OLMo 3 paper" are cited within the OLMo 3 technical report §3.6; entries marked "paper" originally linked to the canonical paper or a search for it.

Methods & recipes

OLMo 3 Technical Report, Allen Institute for AI — §3.6 "Stage 3: Long-context Extension." Primary reference for this plan (local: olmo_3.pdf). allenai.org/olmo
Data Engineering for Scaling Language Models to 128K Context — Fu et al., 2024. Per-source length upsampling + replay; ~1–5B tokens suffice. arXiv:2402.10171
How to Train Long-Context Language Models (Effectively) (ProLong) — Gao et al., 2025. 20B-token token-efficient extension recipe. paper
CLIPPER: compression enables long-context synthetic data generation — Pham et al., 2025. Direct inspiration for OLMo's CWE/REX bootstrap. in OLMo 3 paper
In-Context Pretraining: Language Modeling Beyond Document Boundaries (ICLM) — Shi et al., 2023. Concatenating related docs to create long-range dependencies. arXiv:2310.10638
Quest: Query-centric Data Synthesis for Long-context Scaling — 2024. Group documents by topic/query to synthesize coherent long sequences — the same-language version is key for low-resource langs. paper
LongAlign: A Recipe for Long Context Alignment — Bai et al., 2024. The one paper ablating multilingual long-context SFT — helps target langs without hurting English. arXiv:2401.18058
Qwen3 Technical Report — Qwen Team, 2025. Source of the within-stage length-mixing curriculum (e.g. 75% at 16K–32K, 25% at 4K–16K). paper
YaRN: Efficient Context Window Extension of Large Language Models — Peng et al., 2023. arXiv:2309.00071
Extending Context Window via Position Interpolation — Chen et al., 2023. arXiv:2306.15595
Fewer Truncations Improve Language Modeling (best-fit packing) — Ding et al., 2024. paper
Qwen2.5-1M — Yang et al., 2025. Long-context synthetic data at scale; OLMo notes similarity. in OLMo 3 paper

Evaluation

RULER: What's the Real Context Size of Your Long-Context Language Models? — Hsieh et al., 2024. arXiv:2404.06654
HELMET: How to Evaluate Long-Context Models Effectively and Thoroughly — Yen et al., 2025. paper

Data sources (EUR-Lex and Europarl cover EU official languages only, so not uk)

EUR-Lex — EU legislation, long & parallel across all 24 official languages. eur-lex.europa.eu · MultiEURLEX (Chalkidis et al., 2021) paper
HPLT v2 — document-level multilingual web corpus. hplt-project.org
CulturaX — Nguyen et al., 2023. huggingface.co/datasets/uonlp/CulturaX
FineWeb-2 — multilingual web. huggingface.co/datasets/HuggingFaceFW/fineweb-2
Europarl — Koehn, 2005. statmt.org/europarl
The Stack v2 — Lozhkov et al., 2024. huggingface.co/datasets/bigcode/the-stack-v2
OLMo CWE/REX reference implementation — dolma3/datasets/dolma3_longmino_mix/synthetic_cwe_rex/. github.com/allenai/dolma3

Prepared for the OELLM / ELLIS multilingual post-training effort. Scope: long-context mid-training data for the 4k→120k extension across the low-resource EU target languages. Pairs with the existing synthgen pipeline and the synthetic-IF corpus plan.