The recent release of the LLAMA-1B and LLAMA-3B models, trained on 9 trillion tokens, marks a significant advancement in AI. These models are setting new benchmarks and redefining expectations for small-model performance.
Training Innovations
The LLAMA-1B (1.23B parameters) and LLAMA-3B (3.21B parameters) models were initialized by pruning their 8B counterpart and then trained with token-level distillation, using logits from larger models as soft targets. This combination helps maintain performance despite the reduction in size.
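Token-level distillation can be pictured as follows: at every sequence position, the student is trained to match the teacher's full output distribution over the vocabulary rather than only the single ground-truth token. A minimal NumPy sketch of such a per-token distillation loss (the function name, temperature, and shapes are illustrative assumptions, not the actual training code):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-softened softmax over the vocabulary axis.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_level_kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) at every token position, averaged.

    student_logits, teacher_logits: arrays of shape (batch, seq_len, vocab).
    The T**2 factor is the standard distillation scaling.
    """
    p = softmax(teacher_logits, T)                 # teacher soft targets
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits, T) + 1e-12)
    kl = (p * (log_p - log_q)).sum(axis=-1)        # per-token KL
    return float(kl.mean() * T * T)
```

In practice this term is typically blended with the ordinary next-token cross-entropy loss, so the student learns both from the hard labels and from the teacher's soft distribution.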
Context Length Challenges
One major challenge was extending the context length from 8K to an ambitious 128K tokens. Longer contexts improve performance on long-context tasks such as SCROLLS and InfiniteBench, but can degrade short-context effectiveness. To tackle this, a dual approach was implemented: one model trained with a short warm-up and a high learning rate to boost long-context performance, followed by one trained with a longer warm-up and a low learning rate to preserve short-context gains.
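The contrast between the two recipes can be sketched with a standard linear-warmup-plus-cosine-decay schedule. All concrete numbers here (step counts, peak learning rates) are placeholders for illustration, not the actual training hyperparameters:

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative placeholder recipes:
# short warm-up, high peak LR (long-context stage)
long_ctx_lr = lambda s: lr_schedule(s, 10_000, warmup_steps=200, peak_lr=3e-4)
# long warm-up, low peak LR (short-context stage)
short_ctx_lr = lambda s: lr_schedule(s, 10_000, warmup_steps=2_000, peak_lr=5e-5)
```

The first schedule reaches a high learning rate quickly, making large updates that adapt the model to long contexts; the second ramps up slowly to a much lower peak, nudging the weights gently so short-context quality is not disturbed.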
Post-training Dynamics
Training smaller models at the 1B scale presents unique post-training challenges, particularly around stability and trade-offs: gains in instruction following can come at the expense of coding performance. Understanding these dynamics is crucial for future model development.
Tooling Performance
The LLAMA-1B and LLAMA-3B models also excel at tool use, with function-calling benchmark scores comparable to those of larger models.
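Function calling generally means the model emits a structured call (name plus arguments) that the application parses and dispatches to real code. A minimal hypothetical sketch of that loop; the JSON schema, tool name, and dispatcher here are illustrative assumptions, not the models' actual function-calling format:

```python
import json

# Hypothetical tool registry: maps tool names to callables.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

# Example of what a structured tool call from the model might look like.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

def dispatch(tool_call_json, tools):
    """Parse the model's tool call and invoke the matching function."""
    call = json.loads(tool_call_json)
    fn = tools[call["name"]]
    return fn(**call["arguments"])
```

A function-calling benchmark then scores how reliably the model produces calls that parse, name a real tool, and pass valid arguments.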
Overall, these new LLAMA models demonstrate that smaller architectures can achieve impressive results, opening new avenues for research and application in AI.