What Google announced

Gemini keeps getting stronger, but you only run it on Google’s terms. Gemma is Google’s open-weight line (weights you can download and run yourself) and has been the practical option for teams that need local or self-hosted AI. Gemma 3 is over a year old now. Google is replacing that generation with Gemma 4: four model sizes tuned for local use, plus a licensing change that addresses what developers have been complaining about for years.

Starting now, builders can work with Gemma 4 in Google’s tooling and pull weights from the usual hubs. The headline legal move: Google is dropping its custom Gemma license and releasing Gemma 4 under Apache 2.0.

Short version: Gemma 4 is a family of four open-weight models (two large for strong GPUs, two “effective” small ones for phones and edge). Same technical lineage as closed Gemini 3-class models, but runnable offline if you have the hardware. License is now standard Apache 2.0 instead of Google’s older, heavier terms.

The four Gemma 4 sizes

  • 26B MoE (fast local inference on serious GPUs): Mixture of Experts with only ~3.8B parameters active per forward pass, which yields higher tokens/sec than many similarly sized dense models.
  • 31B Dense (quality-first local work): slower than the MoE; Google expects developers to fine-tune it for specific domains.
  • E2B, “Effective 2B” (mobile / edge): low memory and battery use; optimized with Qualcomm and MediaTek for phones, Raspberry Pi, and Jetson Nano.
  • E4B, “Effective 4B” (stronger edge / on-device): same positioning as E2B with more capacity; Google cites near-zero latency versus Gemma 3 for these tiers.

26B MoE and 31B dense: what “local” actually costs

Google targets unquantized bfloat16 on a single 80GB Nvidia H100 for both large variants — that is still “local” in the data-center sense, not a laptop. An H100 is serious money. The point is: quantize and the same architectures shrink onto consumer GPUs that normal builders actually own.
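The bf16-versus-quantized trade-off is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming 16 bits per parameter for bfloat16, 4 bits for an aggressive quantization, and a rough (assumed) 20% overhead factor for activations, KV cache, and runtime buffers:

```python
def model_memory_gb(params_b: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameter bytes times an assumed ~20% overhead
    factor for activations, KV cache, and runtime buffers."""
    param_bytes = params_b * 1e9 * bits_per_param / 8
    return param_bytes * overhead / 1e9

# 31B dense in bfloat16 (16 bits/param): ~74 GB, which fits a single 80GB H100
print(round(model_memory_gb(31, 16), 1))

# The same weights at 4-bit quantization: ~19 GB, within reach of a 24GB consumer GPU
print(round(model_memory_gb(31, 4), 1))
```

The numbers are estimates, not vendor specs, but they show why “single H100” and “quantize onto a consumer GPU” are both plausible claims for the same 31B architecture.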

Google also stresses lower latency so on-prem inference feels usable. The MoE design is the speed play; the 31B dense is the “make it good, then adapt it” play.

E2B and E4B: Pixel-class and maker boards

These are the models meant to run where power and RAM hurt. Google says the Pixel team collaborated with chip vendors so E2B/E4B behave well on real phone silicon, not just slides. Less memory, less battery drain than Gemma 3 on comparable tasks, with latency Google describes as nearly instant for typical on-device loops.

[Figure: Elo score versus total parameters (log scale), with Gemma 4 thinking variants highlighted in blue.]

Capabilities: reasoning, agents, code, vision, speech

Google claims Gemma 4 clears Gemma 3 across the board and competes near the top of open-model leaderboards — citing, for example, the Gemma 4 31B placing around third on an Arena-style open ranking, behind larger models like GLM-5 and Kimi 2.5, while staying much smaller and therefore cheaper to host yourself.

Under the hood, Google aligns Gemma 4 with the same family as Gemini 3 (closed): stronger reasoning, math, and instruction-following. The product story in 2026 is agentic workflows, so Gemma 4 adds native function calling, structured JSON output, and instructions aimed at common tools and APIs.
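The announcement does not spell out Gemma 4’s exact wire format for tool calls, but the general pattern behind “native function calling plus structured JSON output” looks the same across models: declare a tool schema, have the model emit a JSON call, then parse and dispatch it. A minimal sketch with a hypothetical `get_weather` tool (the schema style and all names here are assumptions, not Gemma 4’s documented format):

```python
import json

# Hypothetical tool schema in the common "function declaration" style.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch_tool_call(raw_model_output: str, tools: dict) -> str:
    """Parse the model's structured JSON output and invoke the matching tool."""
    call = json.loads(raw_model_output)
    fn = tools[call["name"]]
    return fn(**call["arguments"])

# Stand-in for a real model response produced under JSON-constrained output:
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

result = dispatch_tool_call(model_output, {"get_weather": lambda city: f"Sunny in {city}"})
print(result)  # Sunny in Berlin
```

The value of native structured output is that the `json.loads` step stops being the fragile part — the model is constrained to emit valid JSON rather than prose you have to scrape.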

Code: Google positions Gemma 4 as a way to get high-quality generation offline if you can run the big checkpoints — a different trade than always calling Gemini Pro or cloud coding agents.

Vision: Improved multimodal handling for things like OCR and chart reading on local hardware.

Speech: E2B and E4B include native speech recognition; Gemma 3 had speech features, but Google is signaling a quality step for Gemma 4.

Context windows and languages

Gemma 4 supports 140+ languages. Context lengths: 128k tokens on the edge models (E2B / E4B), 256k on the 26B and 31B variants. That is strong for self-hosted models; it is still well short of cloud Gemini’s 1M token class of context.

Why Apache 2.0 is the real story for some teams

Older Gemma releases shipped under a custom Google license. Developers flagged problems: a prohibited-use list Google could change unilaterally, expectations that you police downstream projects, and language some read as affecting other models trained on synthetic data from Gemma. Whether every reading held up in court mattered less than the risk perception: legal ambiguity kills enterprise adoption.

Apache 2.0 is boring in the best way: widely understood, permissive for commercial use, no surprise one-sided updates. Google’s bet is that predictable terms grow the ecosystem they keep calling the Gemmaverse — more ports, more fine-tunes, more products that never phone home to Mountain View.

Gemini Nano 4: confirmed for Pixel

On-device Android AI under the Gemini Nano name (scam warnings, summaries, call recap without sending audio to the cloud) has always been related to Gemma. Google now states explicitly that the next-generation Nano, Gemini Nano 4, will use 2B and 4B variants derived from Gemma 4 E2B and E4B. Today’s Pixel stack runs Nano 3, tied to Gemma 3n.

Developers can prototype with E2B/E4B in the latest AI Core Developer Preview; Google says those designs should be forward-compatible when Nano 4 ships. More detail will likely land at Google I/O in the coming weeks.

Where to try and download

  • AI Studio: larger checkpoints (31B and 26B MoE).
  • AI Edge Gallery: E4B and E2B.
  • Weights: full downloads via Hugging Face, Kaggle, and Ollama.
  • Cloud: Google will also host runs on Google Cloud for a fee if you want managed inference instead of your own GPUs.

Google’s announcement (video)

Official walkthrough and positioning from Google:

Watch on YouTube →

Bottom line

Gemma 4 is Google’s attempt to own the open local tier the same way Gemini owns the cloud: four clear SKUs from H100-class down to phone-class, agent-ready APIs, and a license developers do not have to parse with a lawyer on speed dial. Whether leaderboard claims and “near-zero latency” hold up is for independent benchmarks — but Apache 2.0 alone will pull teams back who walked away from Gemma 3 over terms.