Training Azerbaijani language models on Amazon SageMaker AI
Publish Date: 2026-05-28 17:54:00
Source Domain: aws.amazon.com
-
Project Background and Collaboration: The project combines open-source tools such as PyTorch, Hugging Face Transformers, and Liger Kernels, and was built with contributions from Azercell Telecom and the AWS Generative AI Innovation Center.
-
Framework Development: The framework comprises three main stages: efficient custom tokenizer development, continued pre-training for foundation model adaptation, and supervised fine-tuning with LoRA.
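To make the appeal of LoRA in the third stage concrete, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original weight and trains two low-rank factors instead, so the trainable count is rank × (d_in + d_out) rather than d_in × d_out. The layer size (4096 × 4096) and the rank (16) are hypothetical values chosen for illustration; the source does not state the model dimensions or the LoRA rank used.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, NOT taken from the source.
D_IN, D_OUT, RANK = 4096, 4096, 16

full = full_finetune_params(D_IN, D_OUT)
lora = lora_trainable_params(D_IN, D_OUT, RANK)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {RANK}): {lora:,} trainable parameters")
print(f"LoRA trains {fraction:.2%} of the layer's parameters")
```

Under these assumed dimensions, LoRA trains under 1% of the layer's parameters, which is why it pairs well with a memory-constrained fine-tuning stage.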
-
Tokenizer Efficiency: The custom monolingual tokenizer achieved a 2× improvement in encoding efficiency, effectively doubling the amount of Azerbaijani text the model can process within its context window.
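The link between encoding efficiency and context capacity is simple arithmetic, sketched below. If a tokenizer needs half as many tokens per word, a fixed context window holds twice as many words. Only the 2× ratio comes from the source; the context length (8192) and the tokens-per-word figures (4.0 and 2.0) are hypothetical placeholders.

```python
def words_in_context(context_window: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit in a fixed token budget."""
    return int(context_window / tokens_per_word)


def efficiency_gain(baseline_tokens_per_word: float, custom_tokens_per_word: float) -> float:
    """How many times fewer tokens the custom tokenizer needs for the same text."""
    return baseline_tokens_per_word / custom_tokens_per_word


# Hypothetical values, NOT taken from the source.
CONTEXT_WINDOW = 8192
BASELINE_TOKENS_PER_WORD = 4.0   # assumed rate for a general-purpose tokenizer
CUSTOM_TOKENS_PER_WORD = 2.0     # assumed rate for the Azerbaijani tokenizer

gain = efficiency_gain(BASELINE_TOKENS_PER_WORD, CUSTOM_TOKENS_PER_WORD)
baseline_words = words_in_context(CONTEXT_WINDOW, BASELINE_TOKENS_PER_WORD)
custom_words = words_in_context(CONTEXT_WINDOW, CUSTOM_TOKENS_PER_WORD)

print(f"efficiency gain: {gain:.1f}x")
print(f"words per context: {baseline_words} -> {custom_words}")
```

The same reasoning applies to training cost: fewer tokens per document means fewer training steps to cover the same corpus.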
-
Memory and Throughput Optimization: The use of Fully Sharded Data Parallel (FSDP) and Liger Kernels allowed for larger batch sizes, 23% higher training throughput, and 58% lower peak GPU memory usage.
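The two percentages can be restated as ratios, which is often easier to reason about when planning a run. The sketch below assumes both figures are measured against the same unoptimized baseline and that the total number of training tokens is fixed; neither assumption is confirmed by the source.

```python
def baseline_to_optimized_memory_ratio(memory_reduction: float) -> float:
    """Baseline peak memory divided by optimized peak memory."""
    return 1.0 / (1.0 - memory_reduction)


def wall_clock_saving(throughput_gain: float) -> float:
    """Fraction of training time saved for a fixed token budget."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)


MEMORY_REDUCTION = 0.58  # 58% lower peak GPU memory (from the source)
THROUGHPUT_GAIN = 0.23   # 23% higher training throughput (from the source)

memory_ratio = baseline_to_optimized_memory_ratio(MEMORY_REDUCTION)
time_saving = wall_clock_saving(THROUGHPUT_GAIN)

print(f"baseline uses {memory_ratio:.2f}x the optimized peak memory")
print(f"fixed-size run finishes {time_saving:.1%} sooner")
```

Note that the memory ratio does not translate directly into a batch-size multiplier, because weights, optimizer state, and activations scale differently with batch size.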
-
Scalable Infrastructure: The solution provides a production-ready training framework designed to accommodate growing training requirements with minimal changes.
-
Language Understanding and Generation: The model fine-tuned on Amazon SageMaker AI generated coherent Azerbaijani text, in contrast with the incoherent output of the foundation model before fine-tuning.
-
Training Pipeline: The framework trains in three distinct stages, each optimizing a different aspect: custom tokenizer development for encoding efficiency, continued pre-training with memory optimizations for high throughput, and parameter-efficient supervised fine-tuning with LoRA.
-
Conclusion and Implementation: The success of this model-building framework on Amazon SageMaker AI demonstrates a scalable methodology that can be adapted to other low-resource languages or to scenarios where GPU utilization must be optimized, offering a pathway for similar implementations.