Moondream is a lightweight, fast, transformer-based model that can be used for a variety of computer vision tasks. "moondream2" is a 1.86B-parameter model. In essence, the model pairs a Vision Transformer (ViT) image encoder with a language model that generates human-like text from the input visual information. This allows the model to perform tasks such as image captioning, visual question answering, and multimodal reasoning. Our aim is to explore the Moondream model in the simplest way possible.
Although possible, there is no real need to clone the Moondream GitHub repository. Instead, there are two quick options: run the notebook on Google Colab (avoiding a local installation), or open a Jupyter notebook in an environment with pip and PyTorch and run the following cells. A third option is to test the model directly from its demo page, without following the content of this notebook at all.
We can leverage the Hugging Face transformers library, which provides a wide range of pre-trained models for various NLP tasks, such as text classification, named entity recognition, and language generation. The einops library is used to reshape tensors into the form the model expects.
! pip install transformers einops
The model is updated regularly, so it is recommended to pin the model version to a specific release.
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
model_id = "vikhyatk/moondream2"
revision = "2024-05-08"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
Flash Attention can be enabled on the text model by setting the appropriate parameters, as shown in the cell below. Note that this requires a CUDA GPU and an extra torch import.
# Uncomment the following to enable Flash Attention on the text model
# (requires a CUDA GPU)
# import torch
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, trust_remote_code=True, revision=revision,
#     torch_dtype=torch.float16, attn_implementation="flash_attention_2"
# ).to("cuda")
Loading the tokenizer for the specified model and revision.
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
Loading the tokenizer may print the warning "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." This is a standard transformers notice; the pre-trained model's embeddings already account for these tokens, so no action is needed here.
The following cell retrieves an image from a URL. Alternatively, we can use a path to a local file, as described later. Click on the link to see the image fed to the model.
# This is for opening images from URLs
import requests
from io import BytesIO
response = requests.get("https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-2.jpg")
The following cell creates a PIL Image object from either an image file or an image URL. Uncomment the line matching your choice.
# image = Image.open('<IMAGE_PATH>') # image from file
image = Image.open(BytesIO(response.content)) # image from URL
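To keep the two input paths interchangeable, a small helper can branch on the source string. This is a convenience sketch of our own (the load_image name and its URL check are not part of Moondream), assuming requests and Pillow are installed as above:

```python
from io import BytesIO

import requests
from PIL import Image


def load_image(source: str) -> Image.Image:
    """Open an image from a local file path or an http(s) URL."""
    if source.startswith(("http://", "https://")):
        # Fetch the bytes over HTTP and wrap them in a file-like object.
        response = requests.get(source, timeout=30)
        response.raise_for_status()
        return Image.open(BytesIO(response.content))
    # Otherwise treat the string as a local path.
    return Image.open(source)
```

With this, both image = load_image('<IMAGE_PATH>') and image = load_image("https://...") work through the same call.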
Now we encode the input image using the model's image encoding function, then print the answer to the question.
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
The image shows a black computer server rack with three levels, each containing multiple computer components. The topmost level has two computer monitors, one with a blue screen and the other with a green screen. The middle level has two computer fans, one with a blue fan and the other with a green fan. The bottom level has two computer mice, one with a blue mouse and the other with a green mouse. The rack is placed on a carpeted floor, and a brick wall is visible in the background.
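Because the image is encoded once, the same enc_image can be reused for several questions without re-encoding. A small wrapper sketch (the ask_all name is our own; it simply calls the answer_question API shown above in a loop):

```python
def ask_all(model, tokenizer, enc_image, questions):
    """Answer a list of questions about one pre-encoded image."""
    # Reuse the encoded image for every question to avoid repeated encoding.
    return [model.answer_question(enc_image, q, tokenizer) for q in questions]
```

For example, ask_all(model, tokenizer, enc_image, ["How many monitors are there?", "What is in the background?"]) returns one answer per question.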
Batch inference is also supported.
# Uncomment below to use
# answers = model.batch_answer(
#     images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
#     prompts=["Describe this image.", "Are there people in this image?"],
#     tokenizer=tokenizer,
# )
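batch_answer returns one answer per prompt, in order, so the results can be paired back with their prompts for readable output. A sketch (describe_batch is our own helper name, assuming model and tokenizer are loaded as above):

```python
def describe_batch(model, tokenizer, images, prompts):
    """Run batched VQA and pair each prompt with its answer."""
    answers = model.batch_answer(images=images, prompts=prompts, tokenizer=tokenizer)
    # zip preserves order, so each prompt meets its own answer.
    return list(zip(prompts, answers))
```

Iterating over the returned pairs then prints each question next to its answer.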
Limitations
The following are taken verbatim from the repo page.
The model may generate inaccurate statements, and struggle to understand intricate or nuanced instructions.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.
Useful links
Jupyter Notebook here.
Moondream GitHub repository
Moondream Live Demo
Hugging Face page
Vision Transformer (ViT)