Introducing Sailor:
Open Language Models for South-East Asia

Abstract

Sailor is a suite of open language models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao. Developed with careful data curation, Sailor models are designed to understand and generate text across the diverse linguistic landscape of the SEA region. Built from Qwen 1.5, Sailor encompasses models of varying sizes, from 0.5B to 7B parameters, to suit different requirements. Benchmarking results demonstrate Sailor's proficiency in SEA languages on tasks such as question answering, commonsense reasoning, reading comprehension, and examination.

  • Continually pre-trained on 200 billion to 400 billion tokens over 7 languages: Indonesian, Thai, Vietnamese, Malay, Lao, English, and Chinese.
  • Various model sizes (0.5B, 1.8B, 4B, and 7B) to support different requirements.
  • Strong performance on SEA benchmarks such as XQuAD, TydiQA, XCOPA, Belebele, and M3Exam.
  • No restrictions on research or commercial use, provided usage complies with the Qwen 1.5 license.

Sailor is Built from the Open-Source Community

Sailor owes its existence to the open-source community. It is crafted by continually pre-training open language models, notably the remarkable Qwen 1.5 models, which already perform well on SEA languages. The pre-training corpus heavily leverages publicly available corpora, including SlimPajama, SkyPile, CC100, and MADLAD-400.

By employing aggressive data deduplication and careful data cleaning on the collected corpus, we obtained a high-quality dataset spanning various languages. Through systematic experiments to determine the sampling weights of different languages, Sailor models are trained on 200B to 400B tokens, tailored to each model size. This approach boosts performance on SEA languages while maintaining proficiency in English and Chinese without significant compromise. Specifically, we continually pre-train the Qwen1.5-0.5B model on 400 billion tokens, and the other models on 200 billion tokens, to obtain the Sailor models.
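The exact cleaning and deduplication recipe has not been released yet (see the open-source plans below). As a rough illustration of what aggressive near-deduplication can look like, here is a minimal sketch using MinHash LSH via the `datasketch` library; the shingle size, permutation count, and similarity threshold are our own assumptions, not Sailor's actual settings.

```python
# Minimal sketch of MinHash-based near-deduplication.
# NOTE: illustrative only -- the actual Sailor cleaning/dedup pipeline
# is not yet released; shingle size and threshold are assumptions.
from datasketch import MinHash, MinHashLSH

def shingles(text, n=5):
    """Character n-grams used as the hashing units."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.8, num_perm=128):
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc, num_perm)
        if not lsh.query(m):  # no near-duplicate seen so far
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept

corpus = ["Selamat pagi, apa kabar?", "Selamat pagi,  apa kabar?", "Xin chào!"]
print(deduplicate(corpus))  # the two near-identical Indonesian lines collapse to one
```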

For most of the models, we use 200 billion tokens, with the effective token counts for each language shown in the table below. For the models trained on 400 billion tokens, these counts are doubled accordingly.


| Language        | Tokens (Billion) |
|-----------------|------------------|
| Indonesian (id) | 51.56            |
| Malay (ms)      | 7.91             |
| Thai (th)       | 38.24            |
| Vietnamese (vi) | 41.50            |
| Lao (lo)        | 0.34             |
| English (en)    | 37.20            |
| Chinese (zh)    | 22.64            |
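Since the table gives absolute counts, the implied language mixture is easy to derive by normalizing; the following snippet is pure arithmetic over the published numbers.

```python
# Derive per-language sampling proportions from the published token counts
# (billions of tokens, 200B-token runs; counts double for the 400B runs).
tokens_b = {
    "id": 51.56, "ms": 7.91, "th": 38.24, "vi": 41.50,
    "lo": 0.34, "en": 37.20, "zh": 22.64,
}
total = sum(tokens_b.values())  # ~199.39B effective tokens
for lang, t in sorted(tokens_b.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {t:6.2f}B  ({t / total:6.2%})")
```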

Sailor is Committed to the Open-Source Community

The release of the Sailor models marks the beginning of our commitment to open source. Over the coming weeks, we plan to release several training recipes, including the code for pre-training and the pipeline for data cleaning and deduplication. Additionally, we aim to share our pre-training corpus as soon as possible. We encourage you to stay tuned for updates.


Benchmarking Performance

Sailor models are evaluated on several high-quality benchmarks covering four kinds of tasks: question answering, commonsense reasoning, reading comprehension, and examination. We gratefully acknowledge the contributions of all dataset authors. Following established evaluation protocols, we employed the awesome evaluation platform OpenCompass for comprehensive evaluation. The performance of all models is assessed based on 3-shot Exact Match, with prompts provided in local languages (e.g., an Indonesian task description for Indonesian tasks).
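The concrete prompt templates are defined in the OpenCompass configurations; to make the protocol concrete, here is a minimal sketch of 3-shot prompting with a local-language template and a strict Exact Match check. The Indonesian template and normalization below are illustrative assumptions, not the actual evaluation code.

```python
# Minimal sketch of 3-shot evaluation with local-language prompts.
# NOTE: the template is an assumed Indonesian-style template for
# illustration; the actual prompts live in the OpenCompass configs.
def build_prompt(demos, question, template="Pertanyaan: {q}\nJawaban: {a}"):
    """Concatenate 3 solved examples, then the test question."""
    shots = "\n\n".join(template.format(q=q, a=a) for q, a in demos)
    return shots + "\n\n" + template.format(q=question, a="").rstrip()

def exact_match(prediction, reference):
    """Strict EM after light whitespace/case normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(prediction) == norm(reference))

demos = [("Ibu kota Indonesia?", "Jakarta"),
         ("Ibu kota Vietnam?", "Hanoi"),
         ("Ibu kota Thailand?", "Bangkok")]
print(build_prompt(demos, "Ibu kota Malaysia?"))
print(exact_match("Kuala Lumpur", " kuala  lumpur "))  # 1.0
```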

We acknowledge and respect the earlier releases of several SEA language models, including SEA-LION, SeaLLMs, Typhoon, and VinaLLaMA. Here we mainly selected SeaLLM-7B-Hybrid and its base model Llama-2-7B, as well as SeaLLM-7B-v2 and its base model Mistral-7B-v0.1, for performance comparison; evaluation results for more models will be presented in our paper. Our reporting strictly adheres to the same evaluation methodology to ensure a fair comparison, and we made every effort to closely match the reported results of the baselines.

  • Question Answering: XQuAD (Thai, Vietnamese) and TydiQA (Indonesian).
  • Commonsense Reasoning: XCOPA (Indonesian, Thai, Vietnamese).
  • Reading Comprehension: Belebele (Indonesian, Thai, Vietnamese).
  • Examination: M3Exam (Javanese, Thai, Vietnamese).

Question Answering

All models are evaluated on the XQuAD and TydiQA benchmarks, with 3-shot Exact Match (EM) and F1 scores reported. Baselines that outperform the Sailor models are highlighted in green.


| 3-shot (EM / F1)  | XQuAD (th)    | TydiQA (id)   | XQuAD (vi)    |
|-------------------|---------------|---------------|---------------|
| Qwen1.5-0.5B      | 14.19 / 23.35 | 20.71 / 32.64 | 19.85 / 35.38 |
| Sailor-0.5B       | 15.84 / 27.58 | 30.44 / 54.74 | 21.13 / 40.57 |
| Qwen1.5-1.8B      | 27.24 / 43.56 | 29.73 / 53.76 | 29.17 / 48.15 |
| Sailor-1.8B       | 32.72 / 48.66 | 40.88 / 65.37 | 34.22 / 53.35 |
| Qwen1.5-4B        | 34.03 / 53.40 | 48.32 / 72.68 | 43.71 / 63.86 |
| Sailor-4B         | 46.82 / 63.34 | 53.98 / 73.48 | 47.65 / 67.09 |
| Llama-2-7B        | 30.64 / 43.80 | 56.64 / 72.14 | 46.96 / 66.16 |
| Mistral-7B-v0.1   | 48.48 / 63.27 | 63.54 / 78.73 | 53.72 / 72.75 |
| SeaLLM-7B-Hybrid  | 49.70 / 67.62 | 50.62 / 75.21 | 49.62 / 70.74 |
| SeaLLM-7B-v2      | 34.55 / 55.13 | 52.21 / 77.00 | 46.19 / 72.11 |
| Qwen1.5-7B        | 53.79 / 69.30 | 57.17 / 77.28 | 56.63 / 76.99 |
| Sailor-7B         | 57.88 / 71.06 | 60.53 / 75.42 | 53.81 / 74.62 |
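For reference, the F1 score reported alongside EM for extractive QA is conventionally the SQuAD-style token-overlap F1. A minimal sketch of that standard metric follows; the exact answer normalization used by OpenCompass may differ.

```python
# SQuAD-style token-overlap F1 between a predicted and a gold answer span.
# Standard metric sketch; not the exact normalization used in OpenCompass.
from collections import Counter

def token_f1(prediction, reference):
    pred, gold = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Mekong river", "Mekong River"))  # partial credit: 0.8
```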

Commonsense Reasoning

All models are evaluated on the XCOPA benchmark, with 3-shot accuracy reported.


| 3-shot (EM)       | XCOPA (th) | XCOPA (id) | XCOPA (vi) |
|-------------------|------------|------------|------------|
| Random Guess      | 50.00      | 50.00      | 50.00      |
| Qwen1.5-0.5B      | 51.00      | 52.20      | 53.80      |
| Sailor-0.5B       | 51.00      | 58.20      | 58.00      |
| Qwen1.5-1.8B      | 52.60      | 51.60      | 53.40      |
| Sailor-1.8B       | 53.80      | 64.20      | 63.20      |
| Qwen1.5-4B        | 53.40      | 55.00      | 57.80      |
| Sailor-4B         | 53.40      | 69.20      | 68.20      |
| Llama-2-7B        | 52.80      | 64.00      | 62.00      |
| Mistral-7B-v0.1   | 57.20      | 62.40      | 61.60      |
| SeaLLM-7B-Hybrid  | 58.20      | 71.60      | 67.60      |
| SeaLLM-7B-v2      | 56.80      | 64.00      | 64.60      |
| Qwen1.5-7B        | 54.20      | 62.20      | 66.20      |
| Sailor-7B         | 59.00      | 72.20      | 72.20      |

Reading Comprehension

All models are evaluated on the Belebele benchmark, with the 3-shot Exact Match (EM) reported. Baselines that outperform the Sailor models are highlighted in green.


| 3-shot (EM)       | Belebele (th) | Belebele (id) | Belebele (vi) |
|-------------------|---------------|---------------|---------------|
| Random Guess      | 25.00         | 25.00         | 25.00         |
| Qwen1.5-0.5B      | 29.89         | 26.89         | 30.22         |
| Sailor-0.5B       | 32.22         | 30.89         | 32.33         |
| Qwen1.5-1.8B      | 30.11         | 32.00         | 31.33         |
| Sailor-1.8B       | 34.22         | 34.89         | 35.33         |
| Qwen1.5-4B        | 32.78         | 36.22         | 35.22         |
| Sailor-4B         | 36.11         | 41.33         | 38.89         |
| Llama-2-7B        | 31.78         | 39.78         | 38.00         |
| Mistral-7B-v0.1   | 34.33         | 41.33         | 41.33         |
| SeaLLM-7B-Hybrid  | 37.78         | 43.11         | 43.00         |
| SeaLLM-7B-v2      | 36.33         | 43.11         | 47.00         |
| Qwen1.5-7B        | 38.33         | 42.00         | 42.89         |
| Sailor-7B         | 41.56         | 44.33         | 45.33         |

Examination

All models are evaluated on the M3Exam benchmark, with the 3-shot Exact Match (EM) reported. The code jv stands for Javanese, a language spoken in Indonesia. Please note that certain results may differ from those reported in the SeaLLM-7B-v2 repository due to different evaluation criteria.


| 3-shot (EM)       | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|-------------------|-------------|-------------|-------------|
| Random Guess      | 22.90       | 25.00       | 25.21       |
| Qwen1.5-0.5B      | 22.93       | 25.07       | 26.66       |
| Sailor-0.5B       | 24.41       | 26.15       | 30.91       |
| Qwen1.5-1.8B      | 24.04       | 24.26       | 28.68       |
| Sailor-1.8B       | 25.38       | 28.30       | 34.71       |
| Qwen1.5-4B        | 24.50       | 24.26       | 30.02       |
| Sailor-4B         | 27.88       | 31.27       | 40.69       |
| Llama-2-7B        | 23.67       | 25.07       | 33.15       |
| Mistral-7B-v0.1   | 26.03       | 26.68       | 36.11       |
| SeaLLM-7B-Hybrid  | 27.18       | 26.95       | 36.50       |
| SeaLLM-7B-v2      | 28.48       | 29.92       | 39.18       |
| Qwen1.5-7B        | 25.75       | 26.15       | 36.28       |
| Sailor-7B         | 30.00       | 32.88       | 44.10       |

Contact Us

Sailor models are free for research and commercial use, provided usage complies with the Qwen 1.5 license. We encourage you to use Sailor models in your research and applications, and we look forward to seeing the amazing things you will build with them. If you have any questions or want to reach out, please raise an issue on our GitHub or contact us at doulx@sea.com and liuqian@sea.com.
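To get started quickly, a minimal generation sketch with Hugging Face Transformers follows. The hub id `sail/Sailor-7B` is our assumption about where the checkpoints are hosted; check the official release page for the exact repository names.

```python
# Minimal sketch: load a Sailor base model and complete an Indonesian prompt.
# The hub id "sail/Sailor-7B" is assumed here; see the official release page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Ibu kota Indonesia adalah"  # "The capital of Indonesia is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that these are base models, so plain text completion (rather than chat-style prompting) is the intended usage.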