Abhash Kumar Jha
OELLM · Multilingual Long-Context Mid-Training

Extending a 4k model to 120k context for low-resource European languages

A data-generation plan for the long-context extension stage — what to source, how to synthesize it, and two candidate generation paths to bake off.

Target: in-house multilingual base model pretrained at 4k · Goal window: ~120k tokens · Languages: es, fr, de, it, pt, pl, nl, cs, ro, el, uk (target set may grow) · Infra: local vLLM on Leonardo · Tooling: synthgen (internal, not yet public) — or any batch inference engine, e.g. Inference Hive

We are extending a model pretrained at a 4k context window out to ~120k. This is a continued-pretraining / mid-training problem, not an instruction-tuning one. The good news from the literature is that context extension is cheap on data: on the order of 10–50B tokens is enough, provided the data actually contains long-range dependencies and the original short-context mixture is replayed to avoid regression (Fu 2024; OLMo 3).

1 · The goal & the one hard constraint

Context extension means training on longer sequences while rescaling the positional embeddings so the model generalizes to positions it never saw in pretraining. Two principles decide whether it works:

⚠ The real blocker is training-side, not data-side

A 7B model at a 120k sequence length cannot fit activations on one GPU. You need context / sequence parallelism: OLMo used 8-way CP, 8k tokens per device, with all-gather attention to support irregular masks (OLMo 3 §3.6.4). Confirm your training stack supports this before building data, because it dictates the shard format. Note also: your 4k→120k jump (~30×) is more aggressive than OLMo's 8k→65k (8×), which is why we stage it.
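The per-device arithmetic behind that constraint is easy to check directly. The sketch below assumes "~120k" means 120 × 1024 = 122,880 tokens and that the sequence is split evenly across CP ranks; `tokens_per_device` and `extension_factor` are hypothetical helpers, not part of any training stack.

```python
def tokens_per_device(seq_len: int, cp_degree: int) -> int:
    """Tokens each rank holds when one sequence is sharded across CP ranks."""
    if seq_len % cp_degree != 0:
        raise ValueError("sequence length must divide evenly across CP ranks")
    return seq_len // cp_degree


def extension_factor(orig_ctx: int, target_ctx: int) -> float:
    """How many times longer the target window is than the pretraining window."""
    return target_ctx / orig_ctx


# OLMo 3 reference point: 65,536 tokens over 8-way CP.
olmo_shard = tokens_per_device(65_536, 8)

# Assumed target: 120 * 1024 tokens. The same 8-way CP nearly doubles the shard,
# and 16-way CP (an assumption, not a tested config) brings it back under 8k.
ours_cp8 = tokens_per_device(122_880, 8)
ours_cp16 = tokens_per_device(122_880, 16)

olmo_jump = extension_factor(8_192, 65_536)
our_jump = extension_factor(4_096, 122_880)
```

Whether 8-way CP still fits at 15,360 tokens per device depends on the model and hardware, which is exactly what Phase 0 has to confirm.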

2 · How OLMo 3 did it (§3.6, our reference recipe)

OLMo 3 extended its base model from 8,192 → 65,536 tokens using Dolma 3 Longmino Mix, a 600B-token pool from which they trained on 50B tokens (7B model) / 100B (32B) (OLMo 3 §3.6). The pieces we adopt:

| Ingredient | What OLMo did |
| --- | --- |
| Mix ratio | 34% long-context + 66% short-context replay (from their mid-training mix). They tested it: 66%-long drops short-task perf by 2.5 pts; 34%-long drops only 0.8 pts. |
| Long-doc backbone | OCR'd scientific PDFs, filtered by gzip compressibility: drop the most-compressible 20% and least-compressible 20%. |
| Synthetic augmentation | Inject aggregation tasks (CWE / REX) into real long docs. Beats natural documents alone on RULER. (Detailed in §5 below.) |
| RoPE extension | YaRN applied to full-attention layers only (sliding-window layers left untouched). Beat base-frequency scaling and position interpolation. |
| Packing | Best-fit document packing + intra-document masking (each sequence attends only within its own source document). |
| Token budget | More is better, especially at long lengths. 50B for the 7B. |
| Eval | RULER (development metric) + HELMET (held-out). |
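For orientation, the RoPE-extension row can be sketched as a per-dimension frequency computation. This is a minimal sketch of the "NTK-by-parts" interpolation and attention-temperature rule from the YaRN paper, using its published defaults (alpha = 1, beta = 32). The head dimension (128), RoPE base (10,000) and 4,096-token original window are assumptions about the target model, not confirmed values, and the function names are illustrative.

```python
import math


def yarn_inv_freq(head_dim, base, orig_ctx, scale, alpha=1.0, beta=32.0):
    """Per-dimension RoPE inverse frequencies after YaRN rescaling.

    Dimensions that complete many rotations inside the original window are
    left untouched; dimensions that complete less than one are fully
    interpolated (divided by `scale`); the ones in between are blended.
    """
    inv_freq = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        rotations = orig_ctx * theta / (2 * math.pi)  # r = L / wavelength
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        inv_freq.append((1.0 - gamma) * theta / scale + gamma * theta)
    return inv_freq


def yarn_attention_factor(scale):
    """YaRN temperature term: sqrt(1/t) = 0.1 * ln(scale) + 1."""
    return 0.1 * math.log(scale) + 1.0


# Assumed example: extend a 4,096-token model by 8x (the 32k rung).
freqs = yarn_inv_freq(head_dim=128, base=10_000.0, orig_ctx=4_096, scale=8.0)
factor = yarn_attention_factor(8.0)
```

In a hybrid model this rescaling would be applied only to the full-attention layers, per the table; sliding-window layers keep their original frequencies.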

3 · Three concepts that drive every choice

3.1 — Length comes from sourcing, not generation

The 120k length should be filled with real text, never by asking a model to write 120k tokens. Long generations degrade (repetition, drift), cost a fortune, and are worst in low-resource languages. The length budget is a sourcing + packing problem; generation only ever produces the short task layer on top.

★ The bootstrap insight — this is the whole trick

How do you build long-context training data without already owning a long-context model? OLMo's answer (OLMo 3 §3.6.2; CLIPPER): use document statistics (not an LLM) to find salient terms, extract a few short snippets, and feed only the snippets to a plain short-context generator to write a task. The answer is computed by code, not generated.

Consequence: the method is length-agnostic on the generation side. A short-context model can produce a 120k-token training sample, because it never sees 120k — only the snippets. The long context is the real document you attach.

3.2 — A "120k sample" is mostly real document + a tiny generated task

[ real 120k-token context (sourced / packed) ] + [ short generated question ] + [ computed answer ]

So "bootstrapping a 120k sample" is entirely possible (and cheap) — what's not possible at quality is generating 120k tokens of context text. Source the context; generate only the task.
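The three-part layout above can be sketched as a small record whose answer field is filled by code. This is an illustration under assumptions: `LongSample`, `count_term` and `build_sample` are hypothetical names, the answer is a simple whole-word occurrence count, and the question template stands in for text that a short-context generator would write in practice.

```python
import re
from dataclasses import dataclass


@dataclass
class LongSample:
    context: str   # real sourced / packed document text
    question: str  # short generated task
    answer: str    # computed by code, never generated


def count_term(context: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of `term`.

    `\\w` is Unicode-aware for str patterns, so this also works for
    non-Latin scripts such as Greek or Ukrainian.
    """
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
    return len(pattern.findall(context))


def build_sample(context: str, term: str, question_template: str) -> LongSample:
    """Attach a question and its computed answer to an existing long context."""
    answer = count_term(context, term)
    return LongSample(context, question_template.format(term=term), str(answer))


doc = "The directive applies. A directive may be amended. Directives differ."
sample = build_sample(doc, "directive", "How many times does '{term}' appear?")
```

Because the count is recomputed from the attached context, the same function doubles as a verifier for any sample produced this way.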

3.3 — "Packing" ≠ "concatenating related docs"

These get conflated. They are different, and the difference decides whether the long rung teaches anything:

| Technique | What it is | Cross-doc attention? | Use when |
| --- | --- | --- | --- |
| Best-fit packing (Ding 2024) | Fill the fixed sequence buffer with multiple docs to avoid padding | No (intra-doc masked) | You already have genuinely long single docs (efficiency only) |
| Related-doc concatenation (ICLM / Shi 2023) | Join topically related docs into one long pseudo-documentation | Yes (no mask within group) | Single docs are too short (your low-resource case at 120k) |

OLMo relied on single long docs + masking. You won't have many single 120k-token docs in el / uk / ro, so at the 120k rung you'll concatenate semantically related docs (same EUR-Lex domain, topic cluster, or retrieved neighbors), run the bootstrap over the whole concatenation, and not mask within a related group — so the 120k-range dependency is real.
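The masking difference between the two techniques comes down to which tokens share a segment id. The sketch below builds a boolean causal mask from per-document group labels; the helper names are hypothetical, and a real training stack would express this through its attention kernel's segment or document-id interface rather than a dense matrix.

```python
def segment_ids(doc_lengths, group_of_doc):
    """Expand per-document group labels into one label per token."""
    ids = []
    for length, group in zip(doc_lengths, group_of_doc):
        ids.extend([group] * length)
    return ids


def causal_segment_mask(ids):
    """mask[i][j] is True when token i may attend to token j.

    A token attends only to earlier-or-equal positions that carry the
    same segment id.
    """
    n = len(ids)
    return [[j <= i and ids[i] == ids[j] for j in range(n)] for i in range(n)]


doc_lengths = [2, 2, 1]  # three documents in one sequence buffer

# Best-fit packing: every document is its own segment (intra-doc masking).
packed = causal_segment_mask(segment_ids(doc_lengths, [0, 1, 2]))

# Related-doc concatenation: documents 0 and 1 form one related group, so
# tokens of document 1 can attend back into document 0.
related = causal_segment_mask(segment_ids(doc_lengths, [0, 0, 1]))
```

With the packing labels, token 2 (the first token of the second document) cannot see token 1; with the related-group labels it can, which is what makes the long-range dependency real.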

4 · The staged curriculum

Don't jump 4k→120k in one step. Ladder it (4k → 16k → 32k → 64k → 120k) with a YaRN rescale and an eval gate at each rung, mirroring OLMo's RULER curriculum. At each stage, train on a band of lengths rather than one exact length: roughly 70–75% of sequences near the current target, 25–30% shorter (the Qwen3 pattern, e.g. "75% at 16K–32K, 25% at 4K–16K"; Qwen3). That avoids overfitting to a single length and yields a cheap checkpoint to evaluate at every rung. Two regimes drive the data:

| Regime | RoPE | Where the length comes from | Generation used for |
| --- | --- | --- | --- |
| Early rungs (→ 16k / 32k) | YaRN (full-attn layers) | Sourced single docs (EUR-Lex / Europarl) | CWE/REX task layer; short reasoning traces ride in replay |
| Long rungs (→ 64k / 120k) | YaRN, rescaled further | Sourced + related-doc concatenation to reach length | CWE/REX over the concatenation; long reasoning traces in the long slice |

A clarification worth keeping

Training on longer documents extends the context window (input comprehension). It does not, by itself, create reasoning ability — that comes from reasoning-trace content plus later SFT/RL. The two co-occur (traces are long) but they're different levers. The 4k→32k rung is itself already a context extension (8×), so it needs YaRN too — it isn't "just training on longer docs."
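The per-rung length band described in this section can be sketched as a two-bucket sampler. The band edges are an assumption chosen to reproduce the quoted Qwen3 example at the 32k rung (near band = upper half of the target, short band = from a 4,096-token floor up to the midpoint); `sample_length` is a hypothetical helper.

```python
import random


def sample_length(target, rng, near_frac=0.75, min_len=4_096):
    """Draw a training sequence length for one rung of the ladder.

    With probability `near_frac` the length falls in [target/2, target];
    otherwise it falls in [min_len, target/2 - 1].
    """
    half = target // 2
    if half - 1 < min_len:
        raise ValueError("target too small for a separate short band")
    if rng.random() < near_frac:
        return rng.randint(half, target)
    return rng.randint(min_len, half - 1)


rng = random.Random(0)
lengths_32k = [sample_length(32_768, rng) for _ in range(10_000)]
near_share = sum(n >= 16_384 for n in lengths_32k) / len(lengths_32k)
```

At the 32k rung this puts about three quarters of sequences at 16k–32k and the rest at 4k–16k; the same call with a different `target` covers the other rungs.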

5 · Two generation paths

Both paths share the same sourced long documents, the same curriculum, packing, and the 34/66 mix — so sourcing is never duplicated. They differ only in how the task layer is produced.

Tooling note

The pilot's synthgen library is internal and not yet public. Either path can be driven by any batch inference engine that fans requests across vLLM endpoints on Leonardo — e.g. Inference Hive for SLURM-scheduled batch jobs. The meta-prompt / topic-persona logic is a thin layer on top; the engine just needs checkpoint/resume and an OpenAI-compatible client.
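The checkpoint/resume requirement is small enough to sketch without committing to a specific engine. In the sketch below, `run_batch` is a hypothetical helper and `generate` is an injected callable standing in for whatever client talks to the OpenAI-compatible endpoint; neither reflects the synthgen or Inference Hive APIs.

```python
import json
from pathlib import Path


def run_batch(requests, generate, out_path):
    """Process requests, appending one JSON line per finished request.

    On restart, ids already present in `out_path` are skipped, so an
    interrupted job resumes where it stopped. Returns the number of
    requests processed in this call.
    """
    out = Path(out_path)
    done = set()
    if out.exists():
        with out.open(encoding="utf-8") as f:
            for line in f:
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError):
                    continue  # ignore a truncated line from a crash mid-write

    processed = 0
    with out.open("a", encoding="utf-8") as f:
        for req in requests:
            if req["id"] in done:
                continue
            record = {"id": req["id"], "output": generate(req["prompt"])}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # make each result durable before moving on
            processed += 1
    return processed
```

A SLURM job that hits its time limit can simply be resubmitted with the same output path; only the unfinished ids are sent again.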

Path 1 · Prompt-driven

Strong multilingual model authors the task

Reuse the synthgen meta-prompt machinery (topic distribution + personas + constraints), but feed a real sourced long document into a multilingual model with a genuine 120k input window and have it author diverse IF tasks and answers grounded in the whole context.

  • Generates: task and answer (model-judged)
  • Strengths: high task diversity & naturalness; multi-hop / synthesis; leverages your persona-topic strength
  • Risks: answer quality bounded by the model's real long-context ability (partial bootstrap problem); hallucination over 120k; expensive long prefill, tiny vLLM concurrency; few multilingual models are truly strong at 120k for low-resource langs
  • Verifiable: no — needs a QC / back-translation pass
  • Also the natural home for reasoning traces: prompt the model to produce long, multi-step CoT (math / code / analysis) in-language. These are long by construction (≤~32k), so they feed the 34% long slice directly — and give you per-language reasoning content that the bootstrap path can't produce.

Caution: the "fully synthetic" variant (ask the model to write 120k of text) hits the output-length wall (~8–32k reliable). Keep that to a small long-form-writing slice only.

Path 2 · OLMo bootstrap

Stats pick terms; code computes the answer

Port OLMo's longmino_synthetic_cwe_rex pipeline to multilingual. A short-context model sees only snippets; the answer is computed.

  1. Partition a long doc into 8k–32k sections at natural breaks
  2. Tokenize, extract 1–2 word noun phrases, rank by tf-idf (per-language)
  3. Per phrase, pull k=8 top snippets
  4. Short model writes the task; CWE answer = computed count, REX = one of 12 vignettes
  • Strengths: cheap, fast, verifiable (CWE counts are ground truth), scales to 120k trivially, OLMo-validated
  • Risks: task variety limited to CWE + 12 REX vignettes; tf-idf needs per-language tokenization (el/uk non-Latin)
  • Verifiable: CWE yes, REX phrasing only
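Steps 2 and 3 of the bootstrap can be sketched with the standard library alone. This is a simplified stand-in, not OLMo's pipeline: it substitutes plain unigrams and bigrams for true noun phrases (which would need a per-language tagger), scores terms with a smoothed tf-idf computed across the sections of one document, and takes the first k token windows around a term as its snippets. All function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\w+")  # Unicode-aware, so Greek and Cyrillic tokenize too


def tokenize(text):
    return [t.lower() for t in TOKEN.findall(text)]


def ngrams(tokens):
    """Unigrams plus bigrams, a crude proxy for 1-2 word noun phrases."""
    return list(tokens) + [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def rank_terms(sections, top_n=5):
    """Rank candidate terms by their best per-section tf-idf score."""
    per_section = [Counter(ngrams(tokenize(s))) for s in sections]
    doc_freq = Counter()
    for counts in per_section:
        doc_freq.update(counts.keys())
    n = len(per_section)
    scores = Counter()
    for counts in per_section:
        for term, tf in counts.items():
            idf = math.log((1 + n) / (1 + doc_freq[term])) + 1.0
            scores[term] = max(scores[term], tf * idf)
    return [term for term, _ in scores.most_common(top_n)]


def snippets(text, term, k=8, width=12):
    """Return up to k token windows (lowercased) centred on `term`."""
    tokens = tokenize(text)
    target = term.split()
    m = len(target)
    hits = [i for i in range(len(tokens) - m + 1) if tokens[i:i + m] == target]
    return [" ".join(tokens[max(0, i - width): i + m + width]) for i in hits[:k]]


sections = [
    "The fishing quota applies to the fleet. The fishing quota is annual. Fishing quota.",
    "The tariff applies to the goods.",
    "The council adopts the text.",
]
top_terms = rank_terms(sections, top_n=3)
windows = snippets(sections[0], "fishing quota", k=2)
```

Only `windows` would be shown to the short-context generator in step 4; the full sections are attached afterwards as the long context.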

Side-by-side

| Dimension | Path 1 (prompt-driven) | Path 2 (bootstrap) |
| --- | --- | --- |
| 120k length from | real sourced docs | real sourced / concatenated docs |
| Generator requirement | multilingual model strong at 120k input | any short-context multilingual model |
| What's generated | task + answer | task only (answer computed) |
| Cost | high (long prefill) | low |
| Task diversity | high / natural | CWE + 12 vignettes |
| Verifiable signal | no | yes (CWE) |
| Scales to 120k | yes, expensively | yes, trivially |

Where to source the documents (open item for Path 2)

| Source | Why, for the EU target langs |
| --- | --- |
| EUR-Lex / MultiEURLEX | Top pick. Legislative texts: individually very long, parallel across all 24 EU official languages (covers every target). Your "science-PDF equivalent." |
| HPLT v2 · CulturaX · FineWeb-2 | Document-level web text; filter for the longest docs per language. Bulk volume. |
| Europarl | Parliamentary proceedings: long, parallel. |
| Wikipedia (per-language) | Best raw material for related-doc concatenation at the 120k rung. |
| The Stack v2 (repo-level) | Language-agnostic long-range structure; improves long-context generally. A few % of the mix. |

Apply OLMo's gzip filter (drop most/least-compressible 20%) to whatever you pick — it's language-agnostic and ports directly.
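A minimal sketch of that filter, assuming compressibility is measured as compressed bytes over raw UTF-8 bytes and that the 20% cut-offs are taken by rank within whatever pool is passed in (for example, one language at a time). The function names are hypothetical and this is not OLMo's implementation.

```python
import gzip


def compression_ratio(text: str) -> float:
    """Compressed size / raw size; lower means more compressible."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / max(1, len(raw))


def gzip_filter(docs, drop_frac=0.2):
    """Drop the most- and least-compressible `drop_frac` of `docs`.

    The surviving documents are returned in their original order.
    """
    order = sorted(range(len(docs)), key=lambda i: compression_ratio(docs[i]))
    k = int(len(docs) * drop_frac)
    keep = set(order[k: len(docs) - k])
    return [doc for i, doc in enumerate(docs) if i in keep]
```

Highly compressible documents tend to be repetitive boilerplate and barely compressible ones tend to be noise such as tables of numbers or OCR debris, which is why both tails go.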

6 · Reasoning traces — include them, as base content

OLMo deliberately put reasoning/thinking traces into base mid-training as plain CoT content, not as delimited <think> post-training behavior. The "full mix" with thinking + instruction data beat the mix without it on their base eval (avg 50.7 vs 48.8; Math 48.7 vs 43.1; OLMo 3 Table 10). Those traces enter the long-context stage automatically through the 66% replay.

7 · Mix, RoPE & packing recipe

8 · Evaluation

There's no standard multilingual RULER, so we build one — this is the natural job for synthgen:

9 · Recommendation & next steps

★ Recommended shape

Path 2 is the scalable backbone — cheap, verifiable, scales to 120k without quality risk. Path 1 is a diversity top-up — only if you can find a multilingual model you trust at 120k input. Bake them off first: generate a small batch from each, train a 32k-rung probe, and let multilingual-RULER set the ratio — exactly the Phase-1 generator A/B methodology you already used.

  1. Phase 0 — unblock training: confirm context/sequence parallelism in the training stack; confirm the target model's positional encoding and tokenizer.
  2. Phase 1 — profile sources: per-language length histograms across EUR-Lex / HPLT / CulturaX. Identify each language's long-doc deficit (drives how much related-doc concatenation the 120k rung needs).
  3. Phase 2 — build & bake off: port CWE/REX to multilingual (Path 2); stand up a 120k-input generator for Path 1; generate small batches of each.
  4. Phase 3 — eval probe: train the 32k rung on each, measure multilingual-RULER, set the Path-1/Path-2 ratio.
  5. Phase 4 — scale & ladder: 4k→32k→(64k)→120k with per-rung YaRN + eval gates and the 34/66 mix.
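The Phase 1 profile can be sketched as a per-language count of documents in each rung bucket. The sketch assumes token counts have already been computed with the target model's tokenizer and uses 122,880 as the stand-in for "~120k"; `bucket` and `length_profile` are hypothetical names.

```python
from collections import Counter, defaultdict

RUNGS = [4_096, 16_384, 32_768, 65_536, 122_880]  # assumed rung boundaries


def bucket(n_tokens: int) -> str:
    """Label a document by the smallest rung that can hold it."""
    for rung in RUNGS:
        if n_tokens <= rung:
            return f"<={rung}"
    return f">{RUNGS[-1]}"


def length_profile(docs):
    """docs: iterable of (lang, n_tokens) pairs -> {lang: Counter(bucket -> count)}."""
    profile = defaultdict(Counter)
    for lang, n_tokens in docs:
        profile[lang][bucket(n_tokens)] += 1
    return profile


# Toy input; real counts would come from EUR-Lex / HPLT / CulturaX shards.
example = [("el", 3_000), ("el", 20_000), ("de", 70_000), ("de", 130_000)]
profile = length_profile(example)
```

A language whose top buckets are empty is one where the 120k rung will have to rely on related-doc concatenation.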

Open decisions


Final words — budget & the multilingual bet

Budget, spread across the languages

The 10–50B-token extension budget is a total — and it has to cover every target language plus replay. So the per-language long-context allocation is modest, and the lowest-resource languages (el, uk, ro) need explicit token floors so they aren't drowned by es/fr/de. Path 2 (verifiable, cheap) is what lets you hit those floors economically; Path 1 is rationed where it adds the most diversity.
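One way to express token floors is to give every language the floor first and then split the remainder by a resource weight. The sketch below is illustrative only: `allocate_with_floor` is a hypothetical helper, and the 17B long-slice figure (34% of a 50B budget), the 1B floor and the weights are assumed example values, not measured ones.

```python
def allocate_with_floor(total_tokens, weights, floor_tokens):
    """Split `total_tokens` so each language gets at least `floor_tokens`.

    Every language receives the floor; whatever is left is divided in
    proportion to `weights` (e.g. available long-document volume).
    """
    n = len(weights)
    if floor_tokens * n > total_tokens:
        raise ValueError("floors alone exceed the total budget")
    remainder = total_tokens - floor_tokens * n
    weight_sum = sum(weights.values())
    return {
        lang: floor_tokens + remainder * w / weight_sum
        for lang, w in weights.items()
    }


# Assumed example: a 17B-token long slice, three languages, 1B-token floor.
weights = {"de": 10.0, "ro": 2.0, "el": 1.0}  # hypothetical relative volumes
plan = allocate_with_floor(17e9, weights, floor_tokens=1e9)
```

Without the floor, el would receive 1/13 of the slice in this example; with it, el is guaranteed its 1B before the proportional split begins.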

★ The multilingual bet — where this work is novel

Almost the entire long-context literature above is English-centric: the "code + books + scientific papers" long-document recipe is tuned for English. LongAlign is the one paper that explicitly ablates multilingual long-context SFT, and finds it helps the target languages without hurting English (Bai 2024).

If your EU-language experiments push to 120k, you may be among the first to surface whether that recipe transfers to lower-resource EU languages, where genuinely long natural documents in-language are far scarcer than in English. That scarcity makes synthetic concatenation (Quest / ICLM-style: group same-language documents by topic into coherent long sequences) not just a convenience but more necessary than in the English-only papers. It's a real gap, and a publishable contribution if the transfer story is measured cleanly per language.


10 · References & data sources

Items marked in OLMo 3 are cited within the OLMo 3 technical report §3.6; links go to the canonical paper or a search for it.

Methods & recipes

Evaluation

Data sources (all cover the target languages)