Microsoft Phi-3 is a family of small language and multimodal models whose overall performance, measured on both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5. The language models are available in short- and long-context variants (4K and 128K tokens).
Introduction
Microsoft researchers have consistently paid close attention to training techniques built around the careful selection of high-quality, information-dense data, combined with synthetic data generated specifically to "instruct" small models and boost their performance.
Phi-3 models are Transformer-based models trained on a selection of "textbook quality" data from the web together with synthetic textbooks and exercises generated with GPT-3.5.
Training strategy
The training data consists of heavily filtered, publicly available web data of high educational value from various open internet sources, as well as synthetic LLM-generated data. To see what "high educational value" means for code, for instance: a high-quality coding file is well formatted, easy to read, and contains descriptions of what the code is for.
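As a purely illustrative sketch of this kind of filtering (hypothetical, and certainly not the actual pipeline used by Microsoft), one could score the "educational quality" of a code file with simple readability signals:
def educational_score(code: str) -> float:
    """Toy heuristic: reward comments, docstrings and readable line lengths."""
    lines = [line for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_ratio = sum(line.lstrip().startswith("#") for line in lines) / len(lines)
    has_docstring = ('"""' in code) or ("'''" in code)
    readable_lines = sum(len(line) <= 100 for line in lines) / len(lines)
    return 0.5 * comment_ratio + 0.2 * has_docstring + 0.3 * readable_lines

print(educational_score("x=1\ny=2"))                                    # low: no comments or docs
print(educational_score('"""Add numbers."""\n# sum them\nz = 1 + 2'))   # higher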
Pre-training is performed in two disjoint, sequential phases. The first phase consists mostly of web sources aimed at teaching the model general knowledge and language understanding. The second phase merges even more heavily filtered web data (a subset of that used in the first phase) with synthetic data whose purpose is to teach the model logical reasoning and various niche skills.
Researchers have attempted to calibrate the training data to bring it closer to the "data optimal regime" for small models, retaining the web pages that could potentially improve the model's reasoning ability. For example, the result of a Premier League game on a particular day might be good training data for frontier models, but it is rather superfluous when the focus is the reasoning capability of a small model.
Phi-3 models range from 3.8 billion parameters (phi-3-mini, small enough to be deployed on a phone!) to 7 and 14 billion parameters (phi-3-small and phi-3-medium respectively). Some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, which likely indicates that the data mixture needs further work to reach the "data optimal regime" for a 14B-parameter model. Later we will walk through a hands-on coding example using phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities over image and text prompts.
The post-training process of phi-3-mini included two main stages: supervised finetuning (SFT) and direct preference optimization (DPO).
SFT: Utilized highly curated, high-quality data from various domains such as math, coding, reasoning, conversation, model identity, and safety. It began with English-only examples.
DPO: Focused on chat format data, reasoning, and responsible AI (RAI) efforts to guide the model away from undesirable behavior by marking those outputs as “rejected” responses.
Overall, this post-training process enhanced the model’s capabilities in math, coding, reasoning, robustness, and safety, transforming it into an efficient and safe AI assistant for user interactions.
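To make the DPO stage more concrete, here is a minimal sketch of the DPO loss on a single (chosen, rejected) preference pair, assuming the per-sequence log-probabilities have already been computed under the policy model and a frozen reference model (the function name and the beta value are illustrative, not taken from the Phi-3 report):
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy model or the frozen reference model.
    """
    # Implicit reward margins relative to the reference model
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Push the chosen response above the rejected one
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Example with dummy log-probabilities
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.0))
print(loss)  # smaller when the policy prefers the "chosen" response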
Technical specifications
The phi-3-mini model is a Transformer decoder with a default context length of 4K – there is also a long-context version, phi-3-mini-128K, that extends the context length to 128K – and it is built upon a block structure similar to Llama-2. It uses the same tokenizer, with a vocabulary size of 32064, meaning that all packages developed for the Llama-2 family of models can be directly adapted to phi-3-mini. The model uses a hidden dimension of 3072, 32 heads and 32 layers, and was trained in bfloat16 on a total of 3.3T tokens.
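As a sanity check on these numbers, here is a rough back-of-the-envelope parameter count for a Llama-style decoder with these dimensions (the intermediate MLP size of 8192 and the untied output head are assumptions about the released configuration, not stated above):
def approx_params(vocab=32064, hidden=3072, layers=32, intermediate=8192):
    embeddings = vocab * hidden          # input embeddings
    lm_head = vocab * hidden             # output projection (assumed untied)
    attn = 4 * hidden * hidden           # Q, K, V, O projections per layer
    mlp = 3 * hidden * intermediate      # gated MLP (gate, up, down) per layer
    return embeddings + lm_head + layers * (attn + mlp)

print(f"{approx_params() / 1e9:.2f}B parameters")  # ≈ 3.8B, matching phi-3-mini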
To leverage better multilingual tokenization, the phi-3-small model (7B parameters) uses the tiktoken tokenizer, with a vocabulary size of 100352 and a default context length of 8192. The decoder architecture meets the standards of the 7B model class, having 32 heads, 32 layers and a hidden size of 4096. The model:
uses GEGLU activations and Maximal Update Parametrization – in practice, hyperparameters are tuned on a small proxy model and then transferred/extended to the larger target 7B model;
leverages grouped-query attention, with 4 query heads sharing 1 key/value head (see the sketch after this list);
uses a novel block sparse attention which, for each attention head, enforces a different sparsity pattern over the key-value cache (see the block sparse attention resource in the Useful links section);
alternates dense attention layers and block sparse attention layers to optimize KV cache savings while maintaining long context retrieval performance.
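Here is a minimal sketch of the grouped-query attention mentioned above, using plain dense attention without the block sparse masking, just to show how several query heads share one key/value head (the 32 query / 8 key-value head split is inferred from "4 queries sharing 1 key" and is an assumption about the actual configuration):
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim)."""
    group_size = q.shape[1] // k.shape[1]        # queries per key/value head
    k = k.repeat_interleave(group_size, dim=1)   # broadcast KV heads to match Q
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# phi-3-small-like shapes: 32 query heads, 8 key/value heads, head dim 4096 / 32 = 128
q = torch.randn(1, 32, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 16, 128])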
Benchmarks
Benchmark results comparing the different Phi-3 models can be found in the Phi-3 article linked in the Useful links section below.
Phi-3-Vision – quick testing
Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model (4.2B parameters) designed to process an image and a textual prompt as inputs, and subsequently generate textual outputs. The model is composed of an image encoder (CLIP ViT-L/14) and a transformer decoder (phi-3-mini-128K-instruct). The visual tokens, once extracted by the image encoder, are then combined with text tokens in an interleaved way (no particular order for image and text tokens).
In this section we will explore the Phi-3-Vision model using a Jupyter notebook and the Hugging Face Transformers library. The notebook requires a CUDA-capable GPU because the model uses Flash Attention – if you don't have such a GPU, you could try the phi-3-mini-128K model instead.
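Before starting, you can quickly check whether your environment supports Flash Attention 2, which requires an NVIDIA GPU of the Ampere generation or newer (compute capability 8.0+). This check assumes PyTorch is already installed:
import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor} -> Flash Attention 2 supported: {major >= 8}")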
Let’s begin by installing and importing the libraries.
%pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
%pip install accelerate
%pip install flash-attn
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
Set the model.
model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
trust_remote_code=True,
torch_dtype="auto")
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True)
Provide the messages; the first user turn references the input image through the <|image_1|> placeholder, and the final turn asks the model to “Provide insightful questions to spark discussion”.
messages = [
{"role": "user", "content": "\nWhat is shown in this image?"},
{"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."},
{"role": "user", "content": "Provide insightful questions to spark discussion."}
]
Select the input image.
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)
Prepare the prompt by applying the chat template, then preprocess the inputs (text and image) into tensors that can be fed to the model for inference.
prompt = processor.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True)
inputs = processor(
prompt,
[image],
return_tensors="pt").to("cuda:0")
The following cell sets up the configuration parameters for text generation: the maximum number of new tokens to generate, the temperature (randomness) of the output, and whether to use sampling. With do_sample set to False, the model does not sample from the probability distribution but instead picks the highest-probability token at each step (greedy decoding).
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
}
The generate method produces the new tokens based on the input and the patterns learned from the training data. Its output, generate_ids, is a tensor containing the token IDs of the generated text (which still includes the prompt tokens at the beginning).
generate_ids = model.generate(
**inputs,
eos_token_id=processor.tokenizer.eos_token_id,
**generation_args)
In the following cell, the first line removes the input token IDs from the generated output, and the second line decodes the remaining generated token IDs into human-readable text.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
generate_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False)[0]
print(response)
Here is the response.
1. What are the most significant barriers to meeting preparedness according to the respondents?
2. How does the level of agreement with each statement correlate with the respondents' overall satisfaction with their meetings?
3. Are there any notable differences in agreement levels between different demographics or job roles?
4. What strategies have been most effective in improving meeting preparedness based on the respondents' feedback?
5. How does the perceived importance of each statement vary across different industries or company sizes?
Useful links
Phi-3 article (link)
Phi-3 available on MS Azure (link)
Understanding BigBird’s Block Sparse Attention (link)
Jupyter notebook on Phi-3-Vision (link)