Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation
dc.contributor.author | Remy, François | |
dc.contributor.author | Delobelle, Pieter | |
dc.contributor.author | Berendt, Bettina | |
dc.contributor.author | Demuynck, Kris | |
dc.contributor.author | Demeester, Thomas | |
dc.date.accessioned | 2024-05-02T15:03:16Z | |
dc.date.available | 2024-05-02T15:03:16Z | |
dc.date.issued | 2023 | |
dc.description.abstract | Training monolingual language models for low- and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer onto semantically similar tokens from the source-language tokenizer. This one-to-many token mapping greatly improves the initialization of the embedding table for the target language. We conduct experiments converting high-resource models to mid- and low-resource languages, namely Dutch and Frisian. The converted models achieve new state-of-the-art performance on these languages across a wide variety of downstream tasks. By significantly reducing the amount of data and time required to train state-of-the-art models, our novel model conversion strategy has the potential to benefit many languages worldwide. | |
dc.identifier.citation | Remy, F., Delobelle, P., Berendt, B., Demuynck, K., & Demeester, T. (2023). Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation. https://doi.org/10.48550/arxiv.2310.03477 | |
dc.identifier.doi | https://doi.org/10.48550/arxiv.2310.03477 | |
dc.identifier.uri | https://www.weizenbaum-library.de/handle/id/630 | |
dc.language.iso | eng | |
dc.rights | open access | |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
dc.title | Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation | |
dc.type | Article | |
dc.type.status | publishedVersion | |
dcmi.type | Text | |
dcterms.bibliographicCitation.url | https://doi.org/10.48550/arxiv.2310.03477 | |
local.researchgroup | Daten, algorithmische Systeme und Ethik | |
local.researchtopic | Digitale Technologien in der Gesellschaft |
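The abstract describes a one-to-many token mapping used to initialize the target-language embedding table from a source model's embeddings. The Python sketch below illustrates that idea on toy data; the vocabularies, the greedy tokenizer, and the random-initialization fallback are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins; every name and value below is illustrative and not
# taken from the paper's actual code, dictionaries, or vocabularies.
src_vocab = {"house": 0, "hold": 1, "home": 2, "stead": 3}
embed_dim = 8
src_embeddings = rng.normal(0.0, 0.02, size=(len(src_vocab), embed_dim))

# Word translation dictionary: target-language word -> source-language words.
translations = {"huis": ["house", "home"], "hoeve": ["homestead"]}

def src_tokenize(word):
    """Greedy longest-match segmentation into the toy source vocabulary."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in src_vocab:
                ids.append(src_vocab[word[i:j]])
                i = j
                break
        else:
            i += 1  # no source token starts here; skip one character
    return ids

def init_target_embedding(target_word):
    """One-to-many mapping: average the source embeddings of every
    source token that the word's translations decompose into."""
    ids = [t for w in translations.get(target_word, [])
           for t in src_tokenize(w)]
    if not ids:  # no dictionary entry: fall back to a random vector
        return rng.normal(0.0, 0.02, embed_dim)
    return src_embeddings[ids].mean(axis=0)

# "hoeve" maps onto the source tokens "home" + "stead" and gets their mean.
print(init_target_embedding("hoeve").shape)  # -> (8,)
```

In a real conversion, src_embeddings would be the pretrained source model's embedding table and src_tokenize its subword tokenizer; the resulting vectors form the initial embedding table for the target-language model before continued pretraining.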
Files
Original bundle
- Name: Remy-et-al_2023_Tik-to-Tok.pdf
- Size: 411.11 KB
- Format: Adobe Portable Document Format