Scaling Instruction-Finetuned Language Models (2022)
May 7, 2025

Instruction finetuning is a technique for enhancing the zero-shot performance of large language models (LLMs). The seminal work, Finetuned Language Models Are Zero-Shot Learners, refers to this method as instruction tuning. Building on it, Scaling Instruction-Finetuned Language Models explored scaling the number of tasks, scaling model size, and incorporating chain-of-thought (CoT) data. Its findings demonstrate that instruction finetuning can significantly improve LLM performance across a wide range of settings.
Instruction finetuning is grounded in the insight that many NLP tasks can be framed as natural language instructions—examples include translation, sentiment analysis, and question answering. Notably, the original paper introduced instruction tuning through examples rather than formal definitions. Two representative examples in the paper are:
INPUT
Jane knocked on Susan’s door, but there was no answer.
OPTIONS:
- Jane was out.
- Susan was out.
TARGET
Susan was out.
INPUT
Question: who is the girl in more than you know??
Answer:
TARGET
Romi Van Renterghe
In their work, the authors reformatted existing datasets into instruction-style prompts and finetuned LLMs on them to evaluate the impact of this strategy on zero-shot performance. Models trained in this manner were referred to as FLAN (Finetuned Language Net).
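To make the reformatting concrete, here is a minimal sketch of turning a raw labeled example into an instruction-style prompt. The function name, template wording, and option formatting are illustrative assumptions, not the paper's actual templates:

```python
# A hypothetical reformatting helper in the spirit of FLAN's templates.
# The template wording and field names are illustrative, not the paper's own.

def to_instruction_prompt(context: str, options: list[str]) -> str:
    """Turn a raw labeled example into a natural-language instruction."""
    option_block = "\n".join(f"- {o}" for o in options)
    return (
        f"{context}\n"
        "Based on the paragraph above, which continuation is most plausible?\n"
        f"OPTIONS:\n{option_block}"
    )

print(to_instruction_prompt(
    "Jane knocked on Susan's door, but there was no answer.",
    ["Jane was out.", "Susan was out."],
))
```

In practice, FLAN used many such templates per dataset so the model does not overfit to a single phrasing.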
The Scaling Instruction-Finetuned Language Models paper extended this line of research by focusing on three main areas: scaling the number of instruction tasks, scaling model size, and incorporating chain-of-thought annotations into the training data.
Their experiments used PaLM models with 8B, 62B, and 540B parameters, trained on progressively larger sets of tasks: 0, 9, 89, 282, and 1,836 tasks. Here, a task is defined as a specific pairing of a dataset and a task category, such as extractive question answering or query generation, so a single dataset can be associated with multiple task categories. For example, the SQuAD dataset supports extractive question answering, query generation, and context generation, as the sketch below illustrates.
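The following sketch shows this counting scheme, where each (dataset, task category) pair counts as one task. The category names for SQuAD follow the example above; the second dataset and its category are illustrative placeholders:

```python
# Illustrative only: a "task" is a (dataset, task category) pair, so one
# dataset can contribute several tasks to the training mixture.

dataset_categories = {
    "SQuAD": ["extractive QA", "query generation", "context generation"],
    "SST-2": ["sentiment analysis"],  # placeholder second dataset
}

tasks = [
    (dataset, category)
    for dataset, categories in dataset_categories.items()
    for category in categories
]
print(f"{len(tasks)} tasks: {tasks}")
# SQuAD alone contributes three of the four tasks here, one per category.
```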
Across all model sizes, increasing the number of tasks up to 282 led to substantial improvements. However, adding more tasks beyond this threshold yielded only marginal gains. The authors propose two explanations for this observation. First, the additional tasks likely offer limited new information for models already trained on 282 tasks. Second, instruction finetuning predominantly enhances the model’s capacity to express knowledge acquired during pre-training, rather than providing entirely new knowledge not seen in the pre-training corpus.
In contrast, incorporating chain-of-thought annotations into the training datasets resulted in significant performance improvements on previously unseen reasoning tasks, highlighting the value of CoT data for enhancing models’ reasoning abilities.
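As a rough illustration of what a CoT annotation adds, the sketch below contrasts a direct-answer target with one that includes a rationale before the final answer. The trigger phrase, formatting, and example content are assumptions for illustration, not the paper's actual templates:

```python
# A hypothetical CoT finetuning example: the target contains a rationale
# followed by the final answer, rather than the answer alone.

def format_cot_example(question: str, rationale: str, answer: str) -> tuple[str, str]:
    """Build a (prompt, target) pair with a chain-of-thought target."""
    prompt = f"{question}\nLet's think step by step."
    target = f"{rationale} So the answer is {answer}."
    return prompt, target

prompt, target = format_cot_example(
    "A farmer has 3 pens with 4 sheep each. How many sheep in total?",
    "Each pen holds 4 sheep and there are 3 pens, so 3 * 4 = 12.",
    "12",
)
print(prompt)
print(target)  # a direct-answer target would contain only "12"
```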