Running Gemma 4 with a 5090 on llama.cpp
First, grab a trustworthy Gemma 4 GGUF model (unsloth is great). I have been fooling around with Q4, Q6, and Q8 quants of gemma-4-26B and gemma-4-31B.

| Property | E2B | E4B | 31B Dense |
|---|---|---|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 30.7B |
| Layers | 35 | 42 | 60 |
| Sliding Window | 512 tokens | 512 tokens | 1024 tokens |
| Context Length | 128K tokens | 128K tokens | 256K tokens |
| Vocabulary Size | 262K | 262K | 262K |
| Supported Modalities | Text, Image, Audio | Text, Image, Audio | Text, Image |
| Vision Encoder Parameters | ~150M | ~150M | ~550M |
| Audio Encoder Parameters | ~300M | ~300M | No Audio |

Grab the latest version of llama.cpp and compile it with CUDA support for GPU usage (or build it CPU-only if you don’t have a CUDA-enabled GPU).
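As a rough sketch of that build step, the script below clones llama.cpp, configures it with the CUDA backend, and starts the server with all layers offloaded to the GPU. The repository URL, the model file name, and the context size are assumptions for illustration, not values from this post; swap in whatever GGUF you downloaded. The script prints the commands by default (dry run); set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
set -eu

# Dry run by default: each command is printed but not executed.
# Set DRY_RUN=0 in the environment to really run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Assumptions: upstream repository URL and a hypothetical model file name.
REPO_URL="https://github.com/ggml-org/llama.cpp"
MODEL="gemma-4-31B-Q4_K_M.gguf"   # hypothetical; use your downloaded GGUF
CTX=8192                           # assumed context size, tune to your VRAM

# Fetch the sources.
run git clone "$REPO_URL"

# Configure with the CUDA backend; drop -DGGML_CUDA=ON for a CPU-only build.
run cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON

# Compile in Release mode using parallel jobs.
run cmake --build llama.cpp/build --config Release -j

# Serve the model, offloading up to 99 layers (effectively all) to the GPU.
run ./llama.cpp/build/bin/llama-server -m "$MODEL" -ngl 99 -c "$CTX"
```

If the model does not fit in VRAM at the chosen quant, lowering `-ngl` keeps some layers on the CPU, and lowering `-c` shrinks the KV cache.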




