Gemma 4 is Google’s open-weights model line that you can run on your own hardware. This guide uses Ollama to install Gemma 4 quickly, then shows how a web frontend or backend talks to the model through Ollama’s local API at localhost:11434. The weights never live inside a static HTML file; instead, a browser tab or a server you control sends prompts to a machine that has the model loaded.

What this guide covers

You will install Ollama, pull a Gemma 4 tag, verify inference from the terminal, and then call the same model from application code (with notes on CORS and backend proxies). Typical setups: your laptop, a team server, or a cloud GPU you operate; not a public page that expects the model to be bundled into client-side JavaScript.

End state: Ollama serves Gemma 4 at http://localhost:11434. Your web stack (or a small proxy you add) sends JSON to that API and streams or displays the model’s text response.

Prerequisites and hardware

  • Operating system: macOS, Windows, or Linux (Ollama supports all three).
  • Disk space: several gigabytes per variant; larger tags need more (see below).
  • RAM / GPU: Ollama ships quantized builds so you can run smaller variants on consumer hardware. Bigger tags need more VRAM or unified memory; if inference is slow, pick gemma4:e2b or gemma4:e4b.
  • Network: required only to download the model the first time. After that, generation can run fully offline.

Google publishes approximate VRAM for weights only (BF16 / quantized) in its Gemma core documentation; real usage grows with context length (KV cache). Treat tables as planning hints, not guarantees.
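To see why context length matters for memory planning, the KV cache can be estimated from a model's shape. The sketch below uses the standard formula (2 tensors, K and V, per layer); the layer, head, and dimension numbers in the example are placeholders, not real Gemma 4 specs — substitute the values for your variant.

```javascript
// Rough KV-cache size: 2 (K and V) * layers * KV heads * head dim
// * context length * bytes per value (2 for FP16/BF16 caches).
function kvCacheBytes({ layers, kvHeads, headDim, contextLen, bytesPerValue = 2 }) {
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerValue;
}

// Example with made-up architecture numbers (NOT real Gemma 4 specs):
const bytes = kvCacheBytes({ layers: 32, kvHeads: 8, headDim: 128, contextLen: 131072 });
console.log((bytes / 1024 ** 3).toFixed(1) + " GiB"); // prints 16.0 GiB
```

The takeaway: a full 128K-token context can add double-digit gigabytes on top of the weights, which is why the published weight-only numbers are a floor, not a budget.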

Pick a Gemma 4 variant

Ollama exposes Gemma 4 under the gemma4 library name. Tags follow Google’s four sizes. Always confirm current tags on ollama.com/library/gemma4/tags.

  Ollama tag (typical)   Role              Notes
  gemma4:e2b             Lightest          Best for laptops, long battery, or tight RAM; 128K context class.
  gemma4:e4b             Default balance   Often the practical "start here" tag for development; 128K context class.
  gemma4:26b             MoE throughput    Stronger reasoning on capable GPUs; 256K context class. Heavier download and RAM.
  gemma4:31b             Dense quality     Highest quality in the family for local use; 256K context class. Needs serious hardware.

Running ollama pull gemma4 without a tag usually pulls the library’s default tag (often aligned with E4B); specify a tag if you want a different size.

Step 1: Install Ollama

  1. Open ollama.com/download and install the package for your OS.

  2. In a terminal, verify the CLI:

    ollama --version

Use a recent Ollama release. Gemma 4 landed after older builds; if pull fails with manifest or compatibility errors, update Ollama first.

Step 2: Download Gemma 4

Download the weights once (requires internet during this step):

ollama pull gemma4

Or pin a size explicitly, for example:

ollama pull gemma4:e2b
ollama pull gemma4:26b

Confirm the model is registered:

ollama list
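If you want to check the same thing from application code, Ollama exposes the installed models over HTTP as GET /api/tags, which returns a JSON payload of the form { "models": [{ "name": "gemma4:e4b", ... }] }. A small helper, sketched under that payload shape:

```javascript
// Check whether a model tag (exact, or any tag of a base name) is
// installed, given the JSON payload returned by GET /api/tags.
function hasModel(tagsPayload, name) {
  return (tagsPayload.models || []).some(
    (m) => m.name === name || m.name.startsWith(name + ":")
  );
}

// Usage against a running Ollama:
// const payload = await (await fetch("http://127.0.0.1:11434/api/tags")).json();
// console.log(hasModel(payload, "gemma4"));
```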

Step 3: Test from the terminal

Quick sanity check:

ollama run gemma4 "Summarize local LLM setup in one sentence."

Vision (image input): for multimodal models, Ollama supports image paths in the prompt. Google’s Ollama + Gemma guide shows the pattern: include the file path in the prompt (adjust the path format for your OS).

Raw HTTP check (optional):

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Say hello in five words.",
  "stream": false
}'

Use Gemma 4 from a web app

Ollama listens on port 11434. Your application can:

  • Server-side: Node, Python, PHP, or edge functions call http://127.0.0.1:11434 with fetch or HTTP clients. This avoids browser CORS limits and is the pattern for production-style apps.
  • Client-side: JavaScript in the browser can call the same URL only if the page origin is allowed by Ollama’s CORS rules (see next section).

Example: non-streaming generate from JavaScript (suitable for a local dev page on an allowed origin):

const res = await fetch("http://127.0.0.1:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma4",
    prompt: "Explain fetch() in one paragraph.",
    stream: false,
  }),
});
const data = await res.json();
console.log(data.response);

For chat-style APIs and streaming tokens, use Ollama’s /api/chat endpoint with stream: true and read the response body as newline-delimited JSON (each line is one JSON object carrying a token delta); see Ollama’s API documentation.
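Because network chunks do not arrive aligned to line boundaries, streaming code needs a small buffer that only parses complete lines. A minimal sketch, assuming the newline-delimited JSON framing Ollama uses for streamed responses:

```javascript
// Accumulate network chunks, split off complete lines, and parse each
// line as one JSON object; a trailing partial line stays buffered.
function createNdjsonParser() {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep any incomplete trailing line
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
  };
}

// Reading /api/chat with stream: true (sketch; assumes a running Ollama):
// const res = await fetch("http://127.0.0.1:11434/api/chat", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({
//     model: "gemma4",
//     messages: [{ role: "user", content: "Hello" }],
//     stream: true,
//   }),
// });
// const feed = createNdjsonParser();
// for await (const chunk of res.body) {
//   for (const msg of feed(chunk.toString())) {
//     process.stdout.write(msg.message?.content ?? "");
//   }
// }
```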

CORS, localhost, and when to use a backend proxy

Browsers block cross-origin requests unless the server sends the right CORS headers. Ollama can be configured with OLLAMA_ORIGINS so specific web origins (your dev server URL, an internal dashboard, or a browser extension scheme) may call the API directly.

Patterns that usually work well:

  • Local full-stack app: Add a route such as POST /api/chat on your Next.js, Express, or FastAPI app. The browser talks to your origin; the server forwards the body to 127.0.0.1:11434. No CORS issue for the Ollama hop.
  • Browser-only experiments: Set OLLAMA_ORIGINS to include your dev origin (or follow Ollama docs for your OS) so preflight requests succeed. Avoid wildcard origins on shared machines.

If you deploy a public website, do not expose an open Ollama port to the internet without authentication and network controls. Run inference behind your API with rate limits and auth, or use managed cloud inference instead.

Alternatives: Hugging Face and Python

If you need full checkpoints, fine-tuning, or integration with PyTorch pipelines, download Gemma 4 from Hugging Face or Kaggle and follow Google’s notebooks for those platforms.

Those paths are heavier to operate but give maximum control compared to Ollama’s pre-quantized GGUF workflow.

Troubleshooting

  • Pull fails or unknown model: Update Ollama, then retry ollama pull gemma4:<tag>. Confirm the tag exists on the official library page.
  • Out of memory or extreme slowness: Switch to e2b or e4b, close other GPU-heavy apps, or run on a machine with more unified memory / VRAM.
  • Browser errors mentioning CORS: Use a server-side proxy or configure OLLAMA_ORIGINS for your dev URL.
  • Need managed hosting: Google documents cloud deployment patterns (for example Cloud Run + Ollama + Gemma) for teams that do not want to run hardware on a desk.