LLaMA: Open and Efficient Foundation Language Models (2023)
March 14, 2024

Scaling Laws for Neural Language Models refers to the empirical observation that a model's test performance has a power-law relationship with each of three scale factors: the number of model parameters, the dataset size in tokens, and the amount of compute used for training, when not bottlenecked by the other two. Hoffmann et al. (2022) investigated this bottleneck and found that, given a fixed FLOPs budget, the number of model parameters and the number of training tokens should be scaled in equal proportion to minimize pre-training loss. They trained a predicted compute-optimal model, Chinchilla, with 70B parameters at the same compute budget as Gopher (280B) but on about 4 times more data. Chinchilla outperformed Gopher and GPT-3 (175B) on a large range of downstream evaluation tasks.
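A compact way to state this is the parametric loss fit used by Hoffmann et al. (2022); the sketch below shows only the functional form and the resulting scaling rule, with E, A, B, α, β standing for fitted constants from that paper that are not reproduced here:

```latex
% Parametric loss fit from Hoffmann et al. (2022):
%   N = number of model parameters, D = number of training tokens.
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Training compute is approximately C \approx 6ND. Minimizing L(N, D)
% subject to a fixed budget C gives
N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b},
\qquad a \approx b \approx 0.5,
% i.e. parameters and tokens should be scaled in roughly equal proportion.
```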
Hoffmann et al. (2022) focused on how best to scale the dataset and model size for a given training compute budget. In that setting, a larger model can be preferable for reaching a target level of performance, even though it is less compute-efficient at inference time.
LLaMA instead aims for the best possible performance at various inference budgets by training a range of model sizes on more tokens than is typically used. Although Hoffmann et al. (2022) recommended training a 10B model on 200B tokens, the LLaMA authors found that the performance of a 7B model continues to improve even after training on 1T tokens.
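As a rough back-of-the-envelope comparison (the tokens-per-parameter ratios below are my own arithmetic, not figures reported in the paper):

```python
# Illustrative arithmetic only: tokens per parameter implied by the
# Chinchilla-style recommendation vs. the LLaMA-7B training run.
chinchilla_ratio = 200e9 / 10e9   # ~20 tokens/param (10B params, 200B tokens)
llama_7b_ratio = 1e12 / 7e9       # ~143 tokens/param (7B params, 1T tokens)
print(f"Chinchilla-style: {chinchilla_ratio:.0f} tokens/param, "
      f"LLaMA-7B: {llama_7b_ratio:.0f} tokens/param")
```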
Unlike Chinchilla, PaLM, or GPT-3, the LLaMA models are trained only on publicly available data.
The main differences from the original Transformer architecture are pre-normalization, the activation function, and the positional embeddings: the input of each transformer sub-layer is normalized with RMSNorm (rather than normalizing the output), SwiGLU is employed instead of ReLU in the feed-forward network, and the absolute positional embeddings are replaced with rotary positional embeddings (RoPE) applied at each layer of the network.
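The sketch below shows, in plain PyTorch, what these three changes look like. It is an illustrative re-implementation under my own naming (RMSNorm, SwiGLUFeedForward, apply_rope), not the reference LLaMA code, and the RoPE sketch uses the split-half pairing convention rather than the interleaved one used in some implementations.

```python
# Minimal sketches of the three architectural changes: RMSNorm pre-normalization,
# a SwiGLU feed-forward block, and rotary positional embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Pre-normalization: rescale activations by their root mean square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit instead of ReLU."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embeddings: rotate each channel pair by a
    position-dependent angle. x has shape (batch, seq_len, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, seq, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In a full transformer block, RMSNorm would be applied to the input of both the attention and the feed-forward sub-layers, and apply_rope would be applied to the query and key tensors before the attention scores are computed.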