Running Gemma 4 with a 5090 on llama.cpp

June 20, 2026 matt

First, grab a trustworthy Gemma 4 GGUF model (the Unsloth builds are great). I have been fooling around with Q4, Q6, and Q8 quantizations of gemma-4-26B and gemma-4-31B.

Property	E2B	E4B	31B Dense
Total Parameters	2.3B effective (5.1B with embeddings)	4.5B effective (8B with embeddings)	30.7B
Layers	35	42	60
Sliding Window	512 tokens	512 tokens	1024 tokens
Context Length	128K tokens	128K tokens	256K tokens
Vocabulary Size	262K	262K	262K
Supported Modalities	Text, Image, Audio	Text, Image, Audio	Text, Image
Vision Encoder Parameters	~150M	~150M	~550M
Audio Encoder Parameters	~300M	~300M	No Audio
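
To get a feel for which quantization fits on a 32 GB card, here is a rough back-of-the-envelope sketch for the 31B dense model. The quant names (Q4_K_M, Q6_K, Q8_0) and their bits-per-weight figures are assumptions chosen for illustration, not numbers from this post; real GGUF files vary in size, and the KV cache and compute buffers need memory on top of the weights.

```python
# Rough GGUF weight-size estimate for the 31B dense model on a 32 GB GPU.
# Bits-per-weight values are approximate averages (assumed for illustration),
# so treat the results as ballpark figures rather than exact file sizes.
PARAMS = 30.7e9   # total parameters, from the table above
VRAM_GB = 32.0    # RTX 5090 memory

APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def weight_size_gb(params, bits_per_weight):
    """Approximate weight size in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
    size = weight_size_gb(PARAMS, bpw)
    headroom = VRAM_GB - size
    print(f"{quant}: ~{size:.1f} GB weights, {headroom:+.1f} GB for KV cache and buffers")
```

Under these assumptions, Q4 and Q6 leave room for context, while Q8 lands right around the card's capacity, so it would likely need a short context or some layers kept on the CPU.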

Grab the latest version of llama.cpp and compile it with CUDA support for GPU usage (or build it CPU-only if you don’t have a CUDA-enabled GPU).
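
As a sketch of the build step, the snippet below prints the usual clone-and-CMake sequence from the llama.cpp build docs. The `build_commands` helper is hypothetical and only prints the commands as a dry run; it assumes `git`, `cmake`, and the CUDA toolkit are already installed, and that the two `cmake` steps are run from inside the cloned `llama.cpp` directory.

```python
import shlex

REPO_URL = "https://github.com/ggml-org/llama.cpp"


def build_commands(cuda=True):
    """Return the build steps as argv lists (hypothetical helper)."""
    configure = ["cmake", "-B", "build"]
    if cuda:
        # Enables the CUDA backend; leave it out for a CPU-only build.
        configure.append("-DGGML_CUDA=ON")
    return [
        ["git", "clone", REPO_URL],
        configure,  # run from inside the llama.cpp checkout
        ["cmake", "--build", "build", "--config", "Release"],
    ]


# Dry run: print each step as a shell command instead of executing it.
for step in build_commands(cuda=True):
    print(shlex.join(step))
```

Passing `cuda=False` drops the CUDA flag, which matches the CPU-only fallback mentioned above.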

Then set up your server command line.
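
The exact command isn't shown here, so the following is only a minimal sketch of what a `llama-server` invocation might look like. The model filename, context size, and port are placeholder assumptions; the flags used (`-m`, `-ngl`, `-c`, `--host`, `--port`) are standard `llama-server` options, but tune the values to your model and VRAM.

```python
import shlex


def server_command(model_path, ctx_size=32768, gpu_layers=99,
                   host="127.0.0.1", port=8080):
    """Assemble a llama-server argv list (hypothetical helper, assumed defaults)."""
    return [
        "./build/bin/llama-server",
        "-m", model_path,          # path to the GGUF file
        "-ngl", str(gpu_layers),   # layers to offload to the GPU
        "-c", str(ctx_size),       # context window in tokens
        "--host", host,
        "--port", str(port),
    ]


# Placeholder filename; substitute the GGUF you actually downloaded.
print(shlex.join(server_command("models/gemma-4-31B-Q4_K_M.gguf")))
```

A large `-ngl` value asks llama.cpp to offload as many layers as the model has; lower it if the model plus KV cache doesn't fit in VRAM.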
