Training Azerbaijani language models on Amazon SageMaker AI

https://aws.amazon.com/blogs/machine-learning/training-azerbaijani-language-models-on-amazon-sagemaker-ai/

Publish Date: 2026-05-28 17:54:00

Source Domain: aws.amazon.com

  • Project Background and Collaboration: The project integrates open-source tools like PyTorch, Hugging Face Transformers, and Liger Kernels, with contributions from Azercell Telecom and the AWS Generative AI Innovation Center.

  • Framework Development: The framework comprises three main stages: efficient custom tokenizer development, continued pre-training for foundation model adaptation, and supervised fine-tuning with LoRA.

  • Tokenizer Efficiency: The custom monolingual tokenizer achieved a 2× improvement in encoding efficiency, effectively doubling the amount of Azerbaijani text the model can process within its context window.

  • Memory and Throughput Optimization: The use of Fully Sharded Data Parallel (FSDP) and Liger Kernels allowed for larger batch sizes, 23% higher training throughput, and 58% lower peak GPU memory usage.

  • Scalable Infrastructure: The solution provides a production-ready training framework that is designed to scale up with growing training requirements while needing only minimal changes.

  • Language Understanding and Generation: The model fine-tuned on Amazon SageMaker AI generated coherent Azerbaijani text, in contrast to the incoherent output of the foundation model without fine-tuning.

  • Training Pipeline: The framework trains in three distinct stages, each optimizing for a different aspect: custom tokenizer development, high-throughput training with memory optimizations, and efficient fine-tuning methods.

  • Conclusion and Implementation: The success of this model-building framework on Amazon SageMaker AI demonstrates a scalable methodology that can be adapted to other low-resource languages or to scenarios that need to optimize GPU utilization, offering a pathway for similar implementations.
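
The tokenizer efficiency point above (fewer tokens per word means more text fits in the same context window) can be illustrated with a small sketch. The two tokenizers here are toy stand-ins that cut words into fixed-size pieces; they are assumptions for illustration only and are not the tokenizers used in the project. The sample sentence and the 4096-token window are likewise hypothetical.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def chunk_tokenizer(chunk_size: int) -> Tokenizer:
    """Build a toy tokenizer that cuts each word into fixed-size pieces.

    Hypothetical stand-in: a real subword tokenizer learns its pieces
    from data, but the efficiency metric below is computed the same way.
    """
    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            tokens.extend(
                word[i:i + chunk_size] for i in range(0, len(word), chunk_size)
            )
        return tokens
    return tokenize


def tokens_per_word(tokenize: Tokenizer, text: str) -> float:
    """Average number of tokens needed to encode one word (lower is better)."""
    return len(tokenize(text)) / len(text.split())


def words_in_context(tokenize: Tokenizer, text: str, context_window: int) -> float:
    """Estimate how many words fit in a context window of the given size."""
    return context_window / tokens_per_word(tokenize, text)


# Hypothetical sample text and context size, chosen only for illustration.
sample = "Azərbaycan dilində yazılmış qısa nümunə mətn"
baseline = chunk_tokenizer(2)   # stands in for a generic multilingual tokenizer
custom = chunk_tokenizer(4)     # stands in for a language-specific tokenizer

gain = tokens_per_word(baseline, sample) / tokens_per_word(custom, sample)
print(f"encoding efficiency gain: {gain:.2f}x")
print(f"words per 4096-token window (baseline): {words_in_context(baseline, sample, 4096):.0f}")
print(f"words per 4096-token window (custom):   {words_in_context(custom, sample, 4096):.0f}")
```

The gain in words per context window equals the ratio of tokens per word, which is why a 2× encoding improvement roughly doubles the text the model can process at once.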
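
The supervised fine-tuning stage above uses LoRA, which freezes each adapted weight matrix and trains two small low-rank factors instead. The sketch below counts trainable parameters for a single layer to show why this is efficient. The layer size (4096 × 4096) and rank (16) are assumed values for illustration; the source does not state the configuration actually used.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when W is frozen and the update is B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the layer's parameters that LoRA actually trains."""
    return lora_params(d_out, d_in, rank) / full_params(d_out, d_in)


# Assumed, illustrative dimensions; not taken from the source.
D_OUT, D_IN, RANK = 4096, 4096, 16

print("full fine-tuning params:", full_params(D_OUT, D_IN))
print("LoRA trainable params:  ", lora_params(D_OUT, D_IN, RANK))
print("trainable fraction:     ", trainable_fraction(D_OUT, D_IN, RANK))
```

Under these assumptions the adapter trains under 1% of the layer's parameters, which reduces optimizer state and gradient memory accordingly.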