Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation
dc.contributor.author | Remy, François | |
dc.contributor.author | Delobelle, Pieter | |
dc.contributor.author | Berendt, Bettina | |
dc.contributor.author | Demuynck, Kris | |
dc.contributor.author | Demeester, Thomas | |
dc.date.accessioned | 2024-05-02T15:03:16Z | |
dc.date.available | 2024-05-02T15:03:16Z | |
dc.date.issued | 2023 | |
dc.description.abstract | Training monolingual language models for low- and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer onto semantically similar tokens from the source-language tokenizer. This one-to-many token mapping greatly improves the initialization of the embedding table for the target language. We conduct experiments converting high-resource models to mid- and low-resource languages, namely Dutch and Frisian. The converted models achieve new state-of-the-art performance on these languages across a wide variety of downstream tasks. By significantly reducing the amount of data and time required to train state-of-the-art models, our novel model conversion strategy has the potential to benefit many languages worldwide. | |
dc.identifier.citation | Remy, F., Delobelle, P., Berendt, B., Demuynck, K., & Demeester, T. (2023). Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation. https://doi.org/10.48550/arxiv.2310.03477 | |
dc.identifier.doi | https://doi.org/10.48550/arxiv.2310.03477 | |
dc.identifier.uri | https://www.weizenbaum-library.de/handle/id/630 | |
dc.language.iso | eng | |
dc.rights | open access | |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
dc.title | Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation | |
dc.type | Article | |
dc.type.status | publishedVersion | |
dcmi.type | Text | |
dcterms.bibliographicCitation.url | https://doi.org/10.48550/arxiv.2310.03477 | |
local.researchgroup | Daten, algorithmische Systeme und Ethik | |
local.researchtopic | Digitale Technologien in der Gesellschaft |
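The abstract describes a one-to-many token mapping used to initialize the target-language embedding table from a source model's embeddings. The Python sketch below illustrates that idea on toy data; the vocabularies, the greedy tokenizer, and the random-initialization fallback are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins; every name and value below is illustrative and not
# taken from the paper's actual code, dictionaries, or vocabularies.
src_vocab = {"house": 0, "hold": 1, "home": 2, "stead": 3}
embed_dim = 8
src_embeddings = rng.normal(0.0, 0.02, size=(len(src_vocab), embed_dim))

# Word translation dictionary: target-language word -> source-language words.
translations = {"huis": ["house", "home"], "hoeve": ["homestead"]}

def src_tokenize(word):
    """Greedy longest-match segmentation into the toy source vocabulary."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in src_vocab:
                ids.append(src_vocab[word[i:j]])
                i = j
                break
        else:
            i += 1  # no source token starts here; skip one character
    return ids

def init_target_embedding(target_word):
    """One-to-many mapping: average the source embeddings of every
    source token that the word's translations decompose into."""
    ids = [t for w in translations.get(target_word, [])
           for t in src_tokenize(w)]
    if not ids:  # no dictionary entry: fall back to a random vector
        return rng.normal(0.0, 0.02, embed_dim)
    return src_embeddings[ids].mean(axis=0)

# "hoeve" maps onto the source tokens "home" + "stead" and gets their mean.
print(init_target_embedding("hoeve").shape)  # -> (8,)
```

In a real conversion, src_embeddings would be the pretrained source model's embedding table and src_tokenize its subword tokenizer; the resulting vectors form the initial embedding table for the target-language model before continued pretraining.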
Files
Original bundle
- Name: Remy-et-al_2023_Tik-to-Tok.pdf
- Size: 411.11 KB
- Format: Adobe Portable Document Format