Orca 2 on Colab

Feb 14, 2024

The following contents are loosely based on the article “Orca 2: Teaching Small Language Models How to Reason” by A. Mitra et al. (link).

Orca 2 is a language model from Microsoft Research. It builds on the previous Orca model and aims to explore the capabilities of smaller language models (LMs) through improved training signals and methods. There are two versions, one with 7 billion parameters and the other with 13 billion parameters. The goal is to demonstrate that smaller LMs, typically with around 10 billion parameters or less, can achieve enhanced reasoning abilities comparable to much larger models. Orca 2 outperforms models of similar size, including the original Orca, and exhibits performance levels similar to or better than models 5-10 times larger. The evaluation is based on complex tasks testing advanced reasoning abilities in zero-shot settings. The models are fine-tuned on high-quality synthetic data, and the weights of Orca 2 are made publicly available to encourage further research on the development, evaluation, and alignment of smaller LMs.

Orca 2 is designed with the understanding that diverse tasks require different solution strategies. Unlike a one-size-fits-all approach, it recognizes that while larger models like GPT-4 may excel at direct answers for complex tasks, smaller models can benefit from breaking tasks into steps.

In Orca 2, as in its predecessor Orca 1, more advanced large language models (LLMs) are employed to demonstrate various reasoning strategies across different tasks. However, in Orca 2, these strategies are specifically tailored to the task at hand. To generate nuanced data, the more capable LLM is given intricate prompts designed to elicit specific strategic behaviors, leading to more accurate results. During the training phase, the smaller model is exposed solely to the task and the resulting behavior, without access to the original prompts that triggered such behavior. This is intended to improve the model’s ability to generalize and adapt its reasoning.

From Instruction Tuning to Explanation Tuning

Instruction tuning involves learning from input-output pairs where the input is a natural language task description, and the output is a demonstration of the desired behavior. A weakness of instruction tuning is that a student model may generate outputs that are stylistically correct but ultimately incorrect. To address this issue, the authors introduced Explanation Tuning in Orca 1. This approach involves training student models on richer and more expressive reasoning signals obtained through system instructions. These instructions are crafted to elicit detailed explanations from a teacher model as it reasons through a task.

Explanation tuning begins with a collection of N manually crafted system instructions aimed at prompting careful reasoning. These instructions, such as “think step-by-step” and “generate detailed answers,” are designed to elicit thoughtful responses from advanced language models like GPT-4, emphasizing “Slow Thinking.” The system instructions are combined with user prompts spanning a wide range of tasks, forming a dataset of triplets composed of system instruction, user prompt, and the corresponding LLM answer. The student model is then trained to predict the LLM answer given the system instruction and user prompt.
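As a minimal sketch of the triplet layout described above, the following builds (system instruction, user prompt, answer) records. The teacher is a stand-in: fake_teacher is a hypothetical placeholder for a call to a model such as GPT-4, the instruction and prompt strings are illustrative, and pairing every instruction with every prompt is an assumption made only to keep the example short.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass
class Triplet:
    system_instruction: str
    user_prompt: str
    answer: str  # produced by the teacher model


def build_triplets(
    system_instructions: List[str],
    user_prompts: List[str],
    teacher: Callable[[str, str], str],
) -> List[Triplet]:
    """Query the teacher once per (system instruction, user prompt) pair."""
    return [
        Triplet(s, u, teacher(s, u))
        for s, u in product(system_instructions, user_prompts)
    ]


def fake_teacher(system_instruction: str, user_prompt: str) -> str:
    # Hypothetical stand-in for a real teacher model call.
    return f"[answer to {user_prompt!r} under {system_instruction!r}]"


system_instructions = ["Think step-by-step.", "Generate detailed answers."]
user_prompts = ["What is 17 * 3?", "Summarize the plot of Hamlet."]

dataset = build_triplets(system_instructions, user_prompts, fake_teacher)
for t in dataset:
    # The student sees (system_instruction, user_prompt) and learns to predict answer.
    print(t.system_instruction, "|", t.user_prompt, "->", t.answer)
```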

Teaching Orca 2 to be a “Cautious Reasoner”

An example of system instructions is the following:

You will be given a task. Use the following steps to solve it.

1. Identify the main theme or topic of the story.
2. Look for any cause and effect relationships between the sentences.
3. Find the sentence that could be the start of the story. Go through each of the answer choices and analyze to figure it out.
4. Rearrange the sentences in the correct order based on the information gathered in the previous steps.
5. Final answer: Write down the correct order of the sentences using their numbers, such as ‘23415’.

These instructions are small aids that walk the model through a step-by-step procedure, in the hope of steering it towards a correct answer. The idea is to break a big problem down into several subproblems.

Even when every answer obtained under different system instructions is correct, the question remains: which answer is best for training the smaller model? This is a key issue, and the authors argue that smaller models should be taught to select the most effective solution strategy for the problem at hand. For example, GPT-4, being a larger model, can effortlessly produce a direct response, whereas a smaller model might not have this capability and could need an alternative approach, such as a step-by-step thought process. Consequently, instructing a smaller model to simply “mimic” the reasoning behavior of a more powerful counterpart may not be the optimal approach.

The term Cautious Reasoning describes the process of deciding which solution strategy is appropriate for a given task. The choice is made among options such as direct answer generation or one of several “Slow Thinking” strategies (step-by-step, guess and check, explain-then-answer, etc.).

The following illustrates the process of training a Cautious Reasoning LLM:

1. Start with a collection of diverse tasks.
2. Guided by the performance of Orca, decide which tasks require which solution strategy (e.g. direct-answer, step-by-step, explain-then-answer, etc.).
3. Write task-specific system instruction(s) corresponding to the chosen strategy in order to obtain teacher responses for each task.
4. Prompt Erasing: at training time, replace the student’s system instruction with a generic one vacated of details of how to approach the task.
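The Prompt Erasing step can be sketched in a few lines. This is only an illustration of the idea, not the authors’ pipeline: the Example record, the erase_prompt helper and the wording of GENERIC_SYSTEM_INSTRUCTION are all hypothetical names and values chosen for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Example:
    system_instruction: str
    user_prompt: str
    teacher_answer: str


# Illustrative generic instruction: it says nothing about how to solve the task.
GENERIC_SYSTEM_INSTRUCTION = (
    "You are a cautious assistant. You carefully follow instructions."
)


def erase_prompt(example: Example, generic: str = GENERIC_SYSTEM_INSTRUCTION) -> Example:
    """Return a training example whose system instruction hides the strategy."""
    return replace(example, system_instruction=generic)


# What the teacher saw: a detailed, task-specific instruction.
teacher_example = Example(
    system_instruction="Solve the task step by step, then state the final answer.",
    user_prompt="Put the five sentences of the story in the correct order.",
    teacher_answer="Step 1: the theme is a lost dog. Final answer: 23415",
)

# What the student is trained on: same task and answer, generic instruction.
student_example = erase_prompt(teacher_example)
print(student_example.system_instruction)
```

The teacher’s answer still shows the step-by-step behavior, but the student no longer sees the instruction that triggered it, so it has to learn when to apply that strategy from the task itself.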

For brevity we will not go into technical details. However, we will stop to examine an Orca 2 coding example using the appropriate model from Hugging Face.

Example code

The following code can be easily executed by harnessing the power of a T4 GPU on Colab (Jupyter Notebook here).

First, we install the widely used transformers library from Hugging Face.

!pip install git+https://github.com/huggingface/transformers

The accelerate library takes care of the heavy lifting: it enables the same PyTorch code to be run across any distributed configuration. The -q flag in pip enables quiet mode during package installation. It can be specified up to 3 times, each time hiding messages of a higher level of importance (warning, error, critical).

!pip install accelerate -qq

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training (cit.).

!pip install SentencePiece -qq

Protocol buffers (protobuf) are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.

!pip install protobuf -qq

bitsandbytes brings quantization to your model. With it, a PyTorch model can be loaded in 8-bit or 4-bit precision with a few lines of code.

!pip install bitsandbytes -qq

Import PyTorch and the Hugging Face Transformers library. Set the default PyTorch device to the GPU if one is available.

import torch
import transformers

if torch.cuda.is_available():
    torch.set_default_device("cuda")

transformers.AutoModelForCausalLM.from_pretrained: this is a method from the Transformers library that loads a pre-trained model. In this case, it’s loading a model with causal language modeling capabilities. The term “causal language model” typically refers to a model that generates sequences of tokens in a way that respects causality, where each token is generated based on the preceding tokens.

microsoft/Orca-2-7b: this is the Hugging Face Hub identifier of the pre-trained model to load, namely the 7-billion-parameter version of Orca 2 published by Microsoft.

device_map='auto': this parameter delegates device placement to the accelerate library, which distributes the model weights across the available hardware, using the GPU first when one is present.

load_in_8bit=True: this parameter indicates that the model should be loaded using 8-bit quantization, which is a technique to reduce the memory and computational requirements of the model by using lower precision for certain calculations.

model = transformers.AutoModelForCausalLM.from_pretrained(
    "microsoft/Orca-2-7b",
    device_map='auto',
    load_in_8bit=True)
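A back-of-the-envelope calculation shows why 8-bit loading matters on a T4, which has roughly 16 GB of memory. The sketch below counts the weights only, ignoring activations, the generation cache and any quantization overhead. The figure of 7 billion parameters is the nominal model size, used here as an assumption, and weight_memory_gib is a helper written for this example.

```python
def weight_memory_gib(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


NUM_PARAMETERS = 7e9  # nominal size of the 7B model (assumption)

for bits in (32, 16, 8, 4):
    gib = weight_memory_gib(NUM_PARAMETERS, bits)
    print(f"{bits:>2}-bit weights: about {gib:.1f} GiB")
```

Under these assumptions, 16-bit weights take about 13 GiB and leave little headroom on a T4, while 8-bit weights take about 6.5 GiB.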

Setting the tokenizer: the tokenizer is loaded from the same identifier as the model, microsoft/Orca-2-7b. The use_fast=False parameter selects the slow tokenizer implementation, because for this model the fast and slow tokenizers produce different tokens.

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "microsoft/Orca-2-7b",
    use_fast=False)

The term system message is used interchangeably with the system instruction introduced earlier.

Calling the tokenizer on the combined prompt converts it into token IDs. The return_tensors='pt' parameter specifies that the output should be PyTorch tensors. The resulting inputs variable holds the tokenized representation of the prompt.

system_message = "You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."

user_message = "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?"

prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"

inputs = tokenizer(prompt, return_tensors='pt')
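The generated text printed further below wraps every turn in <|im_start|> and <|im_end|> markers. As a minimal sketch, a small helper can assemble a prompt in that layout from a list of turns. build_chatml_prompt is a name made up for this example, not part of the transformers library, and the messages are illustrative.

```python
from typing import List, Tuple

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages: List[Tuple[str, str]]) -> str:
    """Join (role, content) turns and leave an open assistant turn at the end."""
    parts = [f"{IM_START}{role}\n{content}{IM_END}" for role, content in messages]
    parts.append(f"{IM_START}assistant")  # the model continues from here
    return "\n".join(parts)


example_prompt = build_chatml_prompt([
    ("system", "You are a cautious assistant."),
    ("user", "Name one prime number."),
])
print(example_prompt)
```

Keeping the layout in one function avoids hand-editing long f-strings when more turns are added to the conversation.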

After tokenization, the generate method of the model produces a sequence of token IDs from the input input_ids. The result is stored in the output_ids variable and must be converted back to text before printing.

For a decoder-only model such as this one, the returned sequence contains the prompt tokens followed by the newly generated ones, which is why the printed answer starts with the system and user messages.

The batch_decode method of the tokenizer is used to convert the generated token IDs (output_ids) back into human-readable text. The [0] indexing is used to get the decoded text from the first sequence in the batch. Finally, the decoded answer is printed.

output_ids = model.generate(inputs["input_ids"],)
answer = tokenizer.batch_decode(output_ids)[0]

print(answer)

[ANSWER]

<s>  <|im_start|> system
You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. <|im_end|> 
 <|im_start|> user
How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful? <|im_end|> 
 <|im_start|> assistant
There are different ways to find out if a restaurant is popular among locals or mainly attracts tourists, and some possible reasons why this information might be useful are:

- If a restaurant is popular among locals, it might offer better quality, variety, or value for money than a tourist-oriented one, as it can rely on repeat customers and word-of-mouth recommendations.
- If a restaurant is mainly attracting tourists, it might have a more limited or expensive menu, or charge higher prices, as it has to cater to the expectations and budgets of visitors who might not know much about the local cuisine or culture.
- If you are looking for a place to eat that reflects the local lifestyle and preferences, you might want to avoid tourist traps and seek out restaurants that are frequented by locals, as they can give you a more authentic and enjoyable dining experience.
- If you are a local or a traveler who wants to support the local economy and culture, you might want to patronize restaurants that are popular among locals, as they can help sustain the local food industry and traditions.</s>
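The decoded text above contains the whole conversation, markers included. As a rough sketch, the assistant’s reply alone can be recovered with a regular expression. last_assistant_reply is a helper written for this example, and the sample string is a shortened, made-up conversation in the same layout as the output above.

```python
import re

ASSISTANT_MARKER = re.compile(r"<\|im_start\|>\s*assistant\s*")
TRAILING_TOKENS = re.compile(r"(?:\s*(?:<\|im_end\|>|</s>))+\s*$")


def last_assistant_reply(decoded: str) -> str:
    """Return the text after the last assistant marker, without end tokens."""
    matches = list(ASSISTANT_MARKER.finditer(decoded))
    if not matches:
        return decoded.strip()
    reply = decoded[matches[-1].end():]
    return TRAILING_TOKENS.sub("", reply).strip()


sample = (
    "<s>  <|im_start|> system\nBe brief. <|im_end|> \n"
    " <|im_start|> user\nSay hello. <|im_end|> \n"
    " <|im_start|> assistant\nHello there.</s>"
)
print(last_assistant_reply(sample))
```

When the token IDs are still at hand, an alternative is to decode only the tokens that come after the prompt, for example by slicing output_ids past the prompt length before calling batch_decode.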

Below, a second-turn message is constructed and tokenized. The add_special_tokens argument, which defaults to True, makes the tokenizer add its special tokens automatically, such as the BOS (beginning of sequence) token at the start and, for tokenizers configured to do so, the EOS (end of sequence) token at the end. Here it is set to False, because the second-turn text is appended to an existing conversation and should not receive a BOS token of its own.

Finally, the new input is formed (second_turn_input), followed by a new generation and decoding cycle.

# This example continues showing how to add a second
# turn message by the user to the conversation

second_turn_user_message = "Give me a list of the key points of your first answer."

# we set add_special_tokens=False because we don't want
# to automatically add a bos_token between messages

second_turn_message_in_markup = f"\n<|im_start|>user\n{second_turn_user_message}<|im_end|>\n<|im_start|>assistant"

second_turn_tokens = tokenizer(
    second_turn_message_in_markup,
    return_tensors='pt',
    add_special_tokens=False)

second_turn_input = torch.cat([output_ids,
    second_turn_tokens['input_ids']], dim=1)

output_ids_2 = model.generate(second_turn_input,)
second_turn_answer = tokenizer.batch_decode(output_ids_2)[0]

print(second_turn_answer)
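To see what add_special_tokens=False prevents, here is a toy illustration that uses plain Python lists in place of tensors. toy_tokenize and BOS_ID are invented for this sketch and do not reproduce the real tokenizer. They only mimic the behavior of prepending a BOS token when special tokens are enabled.

```python
from typing import List

BOS_ID = 1  # hypothetical id of the beginning-of-sequence token


def toy_tokenize(text: str, add_special_tokens: bool = True) -> List[int]:
    """Map each word to a fake id (always >= 2); optionally prepend BOS."""
    ids = [2 + sum(map(ord, word)) % 1000 for word in text.split()]
    return [BOS_ID] + ids if add_special_tokens else ids


first_turn = toy_tokenize("user Hello assistant Hi")

# Appending a second turn tokenized with the default setting
# leaves a stray BOS token in the middle of the conversation.
with_bos = first_turn + toy_tokenize("user Thanks assistant")

# With add_special_tokens=False the conversation keeps a single BOS at the start.
without_bos = first_turn + toy_tokenize("user Thanks assistant", add_special_tokens=False)

print(with_bos.count(BOS_ID), without_bos.count(BOS_ID))
```

The list concatenation plays the role of torch.cat along dim=1 in the code above: the second turn is simply appended to the token IDs of the first exchange.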

Conclusions

Orca 2 models, employing diverse reasoning techniques and selecting effective solution strategies for each task, achieve performance levels comparable to, and sometimes surpassing, much larger models, particularly in zero-shot reasoning tasks. Despite inherent limitations, these models exhibit promising potential for future improvement, particularly in terms of enhanced reasoning capabilities, control, and safety through post-training with synthetic data.

The work also underscores the potential of tailored, high-quality synthetic data generated by more powerful models for training smaller models, even when producing that data involves complex prompts and potentially multiple model calls.

The authors believe that while frontier models will continue to showcase superior capabilities, research aimed at developing more capable smaller models will open new avenues for applications.

Useful links

Orca 2: Teaching Small Language Models How to Reason
A. Mitra et al. – Microsoft Research
arXiv:2311.11045 [cs.AI] (2023).

Orca 2: Teaching Small Language Models How to Reason
Microsoft Research Blog (link)

Microsoft Orca-2-7b Model – Hugging Face page.

Jupyter Notebook here.

m0nads