Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

dc.contributor.author: Remy, François
dc.contributor.author: Delobelle, Pieter
dc.contributor.author: Berendt, Bettina
dc.contributor.author: Demuynck, Kris
dc.contributor.author: Demeester, Thomas
dc.date.accessioned: 2024-05-02T15:03:16Z
dc.date.available: 2024-05-02T15:03:16Z
dc.date.issued: 2023
dc.description.abstract: Training monolingual language models for low- and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping dramatically improves the initialization of the embedding table for the target language. We conduct experiments converting high-resource models to mid- and low-resource languages, namely Dutch and Frisian. The converted models achieve new state-of-the-art performance on these languages across a wide range of downstream tasks. By significantly reducing the amount of data and time required to train state-of-the-art models, our model conversion strategy has the potential to benefit many languages worldwide.
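The one-to-many token mapping described in the abstract could be realized roughly as follows. This is a minimal sketch assuming illustrative names (translate, source_vocab, target_vocab are hypothetical, not the authors' released implementation): each target-tokenizer token that also exists in the source vocabulary is copied directly, and every other token is initialized as the mean of the source embeddings of its dictionary translations.

import numpy as np

def init_target_embeddings(
    source_embeddings: np.ndarray,  # (|V_src|, d) source embedding table
    source_vocab: dict[str, int],   # source token -> row index
    target_vocab: dict[str, int],   # target token -> row index
    translate,                      # hypothetical: target token -> list of source-language words
) -> np.ndarray:
    d = source_embeddings.shape[1]
    rng = np.random.default_rng(0)
    # Fallback: small random init for tokens with no usable mapping.
    target_embeddings = rng.normal(0.0, 0.02, (len(target_vocab), d))

    for token, idx in target_vocab.items():
        if token in source_vocab:
            # Tokens shared across tokenizers (punctuation, numbers,
            # loanwords) keep their source embedding unchanged.
            target_embeddings[idx] = source_embeddings[source_vocab[token]]
            continue
        # One-to-many mapping: every source token obtained by translating
        # the target token contributes to its initial embedding.
        rows = [source_vocab[w] for w in translate(token) if w in source_vocab]
        if rows:
            target_embeddings[idx] = source_embeddings[rows].mean(axis=0)

    return target_embeddings

The resulting table replaces the source model's embedding layer before continued pretraining on target-language data, which is what lets the converted model start from a much better point than a random initialization.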
dc.identifier.citation: Remy, F., Delobelle, P., Berendt, B., Demuynck, K., & Demeester, T. (2023). Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation. https://doi.org/10.48550/ARXIV.2310.03477
dc.identifier.doi: https://doi.org/10.48550/arxiv.2310.03477
dc.identifier.uri: https://www.weizenbaum-library.de/handle/id/630
dc.language.iso: eng
dc.rights: open access
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation
dc.type: Article
dc.type.status: publishedVersion
dcmi.type: Text
dcterms.bibliographicCitation.url: https://doi.org/10.48550/arxiv.2310.03477
local.researchgroup: Daten, algorithmische Systeme und Ethik
local.researchtopic: Digitale Technologien in der Gesellschaft
Files
Original bundle
Name: Remy-et-al_2023_Tik-to-Tok.pdf
Size: 411.11 KB
Format: Adobe Portable Document Format