🌐 The Sailor2 project aims to build an LLM with ~30B parameters, optimized for multiple South-East Asian (SEA) languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.
🎯 The model will undergo continual pre-training on nearly 800B SEA tokens, starting from a base model proficient in both Chinese and English. Its performance on the above SEA languages is expected to be comparable to that of the most advanced commercial models.
🤝 Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.
🌍 Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by clicking the Join button! 🔍