Running Gemma 4 with a 5090 on llama.cpp

First, grab a trustworthy Gemma 4 GGUF model (the Unsloth builds are great). I have been fooling around with Q4, Q6, and Q8 quantizations of gemma-4-26B and gemma-4-31B.

| Property | E2B | E4B | 31B Dense |
|---|---|---|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 30.7B |
| Layers | 35 | 42 | 60 |
| Sliding Window | 512 tokens | 512 tokens | 1024 tokens |
| Context Length | 128K tokens | 128K tokens | 256K tokens |
| Vocabulary Size | 262K | 262K | 262K |
| Supported Modalities | Text, Image, Audio | Text, Image, Audio | Text, Image |
| Vision Encoder Parameters | ~150M | ~150M | ~550M |
| Audio Encoder Parameters | ~300M | ~300M | No Audio |

Grab the latest version of llama.cpp and compile it with CUDA support for GPU inference (or as a CPU-only build if you don’t have a CUDA-capable GPU).
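As a rough sketch of that build step, the small Python helper below prints the clone, configure, and compile commands so you can review them before pasting them into a shell. It assumes the upstream GitHub repository URL and the CMake option `GGML_CUDA=ON` used by recent llama.cpp versions; the helper name `build_commands` and the directory layout are just illustrative choices, so check the project's build docs if anything has changed.

```python
import shlex

# Assumption: upstream repository location; swap in a fork if you use one.
REPO_URL = "https://github.com/ggml-org/llama.cpp"


def build_commands(use_cuda: bool = True) -> list[list[str]]:
    """Return the clone/configure/compile commands as argument lists."""
    configure = ["cmake", "-S", "llama.cpp", "-B", "llama.cpp/build"]
    if use_cuda:
        # Enables the CUDA backend; omit it for a CPU-only build.
        configure.append("-DGGML_CUDA=ON")
    return [
        ["git", "clone", REPO_URL],
        configure,
        ["cmake", "--build", "llama.cpp/build", "--config", "Release"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is
    # downloaded or compiled until you decide to.
    for cmd in build_commands(use_cuda=True):
        print(shlex.join(cmd))
```

Pass `use_cuda=False` to get the CPU-only variant. With a CMake build like this, the binaries (including `llama-server`) normally end up under `llama.cpp/build/bin`.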

Then set up your server command line.
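The post does not give the exact command line, so the following is only a minimal sketch of what a `llama-server` invocation might look like, assembled in Python so the pieces are easy to tweak. The model path, the binary location, and the default values (context size, port) are placeholders I am assuming for illustration; the flags themselves (`-m`, `-ngl`, `-c`, `--host`, `--port`) are standard `llama-server` options.

```python
import shlex

# Assumption: binary location for a CMake build; adjust to your checkout.
SERVER_BIN = "llama.cpp/build/bin/llama-server"


def server_command(model_path, ctx_size=8192, gpu_layers=99,
                   host="127.0.0.1", port=8080):
    """Assemble a llama-server argument list from a few common settings."""
    return [
        SERVER_BIN,
        "-m", model_path,            # path to the GGUF file
        "-ngl", str(gpu_layers),     # layers to offload to the GPU
        "-c", str(ctx_size),         # context window in tokens
        "--host", host,
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Hypothetical file name; use whichever quantization you downloaded.
    cmd = server_command("models/your-gemma-4-model.gguf", ctx_size=16384)
    print(shlex.join(cmd))
```

A large `-ngl` value such as 99 simply asks for every layer to be offloaded. If a bigger quantization does not fit in VRAM alongside the context you want, lowering `ctx_size` or `gpu_layers` is the usual first thing to try.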
