Philipp Borchert | NLP Researcher
Multilingual language models like Llama 3 and Gemma 2 are still trained almost entirely on English data: only about 5% of Llama 3's pretraining data is multilingual.
This imbalance gives the model a much stronger English representation space. Ideally, a cross-lingual transfer method would tap into these strong English representations when adapting to new (target) languages.
FLARE is an effective method that enables multilingual LLMs to use both source-language (e.g., English) and target-language knowledge to improve natural language understanding (NLU) performance.
It’s also lightweight: FLARE adds no parameters beyond the LoRA adapters already used for fine-tuning.
We start with a multilingual LLM and assume we have labeled data in the source language (let’s use English as our example).
First, we fine-tune the model on English task data using supervised learning.
To adapt to the target language, we use LoRA adapters inserted into the transformer’s attention modules, leaving the original model weights frozen.
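As a concrete illustration, here is roughly what that adapter setup could look like with Hugging Face PEFT. The base model name, target module names, and LoRA hyperparameters below are placeholders for illustration, not the exact configuration from the paper:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Load a multilingual LLM for a classification task
# (model name and label count are illustrative placeholders).
base_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", num_labels=3
)

# Insert LoRA adapters into the attention projections; the pretrained weights
# stay frozen and only the low-rank matrices are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The wrapped model can then be fine-tuned with the standard Hugging Face Trainer, with gradients flowing only into the adapter weights.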
Because labeled data in the target language is scarce, we machine-translate the English training set using NLLB 3.3B.
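For reference, a minimal translation pass with NLLB-200 3.3B in Transformers might look like the sketch below; the target language code "swh_Latn" (Swahili) is just an example:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NLLB uses FLORES-200 language codes, e.g. "eng_Latn", "swh_Latn".
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

def translate(texts, tgt_lang="swh_Latn", max_length=256):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # Force the decoder to start in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["The movie was surprisingly good."]))
```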
FLARE takes both the source-language text and its target-language translation as input. For each forward pass, it does the following:

1. Runs a forward pass on the source (e.g., English) text to obtain its hidden representations.
2. Runs a forward pass on the target-language text.
3. Fuses the source representations into the target-language representations inside the LoRA adapters, layer by layer.
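To make the fusion step more tangible, here is a hypothetical sketch of a LoRA layer whose low-rank bottleneck fuses the two languages. The module name, the element-wise addition, and the assumption that source and target hidden states are already aligned to the same shape are illustrative choices, not the paper's exact implementation:

```python
import torch.nn as nn

class FusedLoRALinear(nn.Module):
    """Illustrative sketch: a LoRA adapter that fuses source- and
    target-language hidden states inside its low-rank bottleneck."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear                     # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.Linear(d_in, rank, bias=False)   # down-projection
        self.lora_B = nn.Linear(rank, d_out, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)                # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x_target, x_source):
        # Assumes x_target and x_source have already been aligned/pooled
        # to the same shape (batch, seq, d_in).
        base_out = self.base(x_target)  # frozen path sees only the target input
        # Fuse the two languages in the bottleneck (element-wise addition here;
        # other fusion functions are possible).
        z = self.lora_A(x_target) + self.lora_A(x_source)
        return base_out + self.scaling * self.lora_B(z)
```

In this sketch only `lora_A` and `lora_B` are trainable, so the fusion itself introduces no parameters beyond a plain LoRA adapter.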
The FLARE-MT variant skips the full forward pass on the source language input.
Instead, it generates a latent translation using just the NLLB encoder and fuses this into every transformer layer in the main model.
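A rough sketch of that idea: run only the NLLB encoder over the source sentence and keep its hidden states as a "latent translation". How those states are projected and fused into the LLM's layers is described in the paper and only hinted at in the comments here:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nllb_tok = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

inputs = nllb_tok("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    # Encoder-only forward pass: no decoding, so no discrete translation is produced.
    latent = nllb.get_encoder()(**inputs).last_hidden_state  # (1, seq_len, d_enc)

# `latent` would then be projected to the LLM's hidden size and fused into each
# transformer layer of the main model (see the paper for the fusion details).
```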
FLARE is the best-performing translate-train method in our experiments, and it also outperforms translate-test and zero-shot baselines.
For more details, ablation studies, and deeper analysis, check out the paper on arXiv!