Sailor owes its existence to the open-source community. It is crafted by continually pre-training from base language models such as the remarkable Qwen 1.5 models, which already perform well on SEA languages. The pre-training corpus heavily leverages publicly available corpora, including SlimPajama, SkyPile, CC100, and MADLAD-400.
By employing aggressive data deduplication and careful data cleaning on the collected corpus, we obtained a high-quality dataset spanning multiple languages. Through systematic experiments to determine the mixture weights of different languages, Sailor models are trained on 200B to 400B tokens, tailored to different model sizes. This approach boosts their performance on SEA languages while preserving proficiency in English and Chinese without significant compromise. Finally, we continually pre-train the Qwen1.5-0.5B model with 400 billion tokens, and the other models with 200 billion tokens, to obtain the Sailor models.
For most of the models, we use 200 billion tokens; the effective token count for each language is shown below. For models trained on 400 billion tokens, these counts are doubled accordingly.
| Language | Tokens (Billion) |
| --- | --- |
| Indonesian (id) | 51.56 |
| Malay (ms) | 7.91 |
| Thai (th) | 38.24 |
| Vietnamese (vi) | 41.50 |
| Lao (lo) | 0.34 |
| English (en) | 37.2 |
| Chinese (zh) | 22.64 |
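
For quick reference, the short Python sketch below restates the table and scales the per-language budgets for the 400-billion-token configuration. It is not part of the Sailor training code; the helper name `token_budget` and the mixture-share printout are our own illustration of the doubling described above.

```python
# Effective pre-training tokens per language (in billions), from the table above.
TOKENS_200B = {
    "id": 51.56,  # Indonesian
    "ms": 7.91,   # Malay
    "th": 38.24,  # Thai
    "vi": 41.50,  # Vietnamese
    "lo": 0.34,   # Lao
    "en": 37.2,   # English
    "zh": 22.64,  # Chinese
}


def token_budget(total_billion: float = 200.0) -> dict[str, float]:
    """Scale the 200B-token budget to another total (e.g. 400B for Sailor-0.5B)."""
    scale = total_billion / 200.0
    return {lang: round(tokens * scale, 2) for lang, tokens in TOKENS_200B.items()}


if __name__ == "__main__":
    total = sum(TOKENS_200B.values())
    for lang, tokens in token_budget(400.0).items():
        share = TOKENS_200B[lang] / total
        print(f"{lang}: {tokens:.2f}B tokens ({share:.1%} of the mixture)")
```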