Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

Remy, François; Delobelle, Pieter; Berendt, Bettina; Demuynck, Kris; Demeester, Thomas

Files

Remy-et-al_2023_Tik-to-Tok.pdf (411.11 KB)

Date

2023

Authors

Remy, François

Delobelle, Pieter

Berendt, Bettina

Demuynck, Kris

Demeester, Thomas

Abstract

Training monolingual language models for low- and mid-resource languages is challenging because pretraining data is limited and often inadequate. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping greatly improves the initialization of the embedding table for the target language. We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian. These converted models achieve new state-of-the-art performance on these languages across a wide range of downstream tasks. By significantly reducing the amount of data and time required for training state-of-the-art models, our novel model conversion strategy has the potential to benefit many languages worldwide.
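
The one-to-many token mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `init_target_embeddings`, the toy vocabularies, the mapping weights, the weighted-mean combination rule, and the zero-vector fallback for unmapped tokens are all assumptions made for illustration. The abstract only states that each target token is mapped to several semantically similar source tokens and that this mapping is used to initialize the target embedding table.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def init_target_embeddings(
    source_emb: Dict[str, Vector],
    token_map: Dict[str, List[Tuple[str, float]]],
    dim: int,
) -> Dict[str, Vector]:
    """Initialize each target token as a weighted mean of the source
    embeddings it is mapped to (hypothetical combination rule)."""
    target_emb: Dict[str, Vector] = {}
    for tgt_token, candidates in token_map.items():
        # Keep only candidates present in the source vocabulary
        # that carry a positive weight.
        known = [(s, w) for s, w in candidates if s in source_emb and w > 0]
        total = sum(w for _, w in known)
        if total == 0:
            # Assumption: unmapped tokens fall back to a zero vector.
            target_emb[tgt_token] = [0.0] * dim
            continue
        vec = [0.0] * dim
        for src_token, weight in known:
            for i, value in enumerate(source_emb[src_token]):
                vec[i] += (weight / total) * value
        target_emb[tgt_token] = vec
    return target_emb


# Toy data (hypothetical): English source tokens, Dutch target tokens.
source_emb = {"house": [1.0, 0.0], "home": [0.0, 1.0], "cat": [1.0, 1.0]}
token_map = {
    "huis": [("house", 3.0), ("home", 1.0)],  # one-to-many mapping
    "kat": [("cat", 1.0)],
    "xyz": [],  # no mapping found
}
target_emb = init_target_embeddings(source_emb, token_map, dim=2)
```

In this toy setup, `huis` lands between `house` and `home`, weighted toward `house`. In a real conversion the resulting table would replace the embedding layer of the source model before continued training on target-language data.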

URI

https://www.weizenbaum-library.de/handle/id/630

Citation

Remy, F., Delobelle, P., Berendt, B., Demuynck, K., & Demeester, T. (2023). Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation. https://doi.org/10.48550/ARXIV.2310.03477