Key takeaways (May 17, 2026)
- Comparison refreshed for May 2026 — Gemma 4 (open weights) and Claude Opus 4.7 are now the most relevant updates.
- Gemma 4 leads among permissively licensed open models; Opus 4.7 leads on long-context reasoning with 1M tokens.
- GPT-5.4 / GPT-5.5 lead on agentic tool-use polish; Gemini 3 Pro leads inside Google Workspace.
- Use case is the right axis of comparison, not aggregate leaderboard wins.
Google released Gemma 4 on April 2, 2026, and it is the first open-weights family that genuinely makes me reconsider what I run locally versus what I pay an API for. I’ve spent the last few days reading the launch materials, the model card, the Arena leaderboard, and pushing the 31B against the same kinds of workloads I throw at Claude Opus 4.6, GPT-5.4 Pro, and Gemini 3.1 Pro. The short version: Gemma 4 doesn’t beat the closed frontier on the hardest tasks, but the gap is smaller than any open release I’ve seen, and it ships under a real Apache 2.0 license — not a half-open source-available one.
This article walks through what Google actually shipped, how it compares to the three closed flagships I use most, and what the release means for the broader AI race in April 2026.
Key findings
- Gemma 4 launched April 2, 2026 in four sizes: E2B, E4B, 26B Mixture-of-Experts (with 4B active parameters), and 31B Dense — all under Apache 2.0, per Google’s official announcement.
- The 31B Dense currently ranks #3 on the Arena AI open-source text leaderboard, and the 26B MoE ranks #6 — Google’s blog says they outcompete models “20x their size.”
- Claude Opus 4.6 holds a slight edge on SWE-bench Verified (~80.8% vs GPT-5.4’s ~80%) and led Terminal-Bench 2.0 and Humanity’s Last Exam at its Feb 5, 2026 launch, with a 1M-token context window in beta at $5/$25 per million tokens.
- GPT-5.4 launched March 5, 2026 and is the first model to cross the OSWorld human-expert baseline (75% vs 72.4%); the GPT-5.4 Pro variant runs at $30/$180 per million tokens — the most expensive frontier model on the market.
- Gemini 3.1 Pro (Feb 19, 2026) is the only one of the four with native audio + video input and a true 1M-token multimodal context window, making it the right pick for dense PDFs, long video, and full-codebase Q&A.
What is Gemma 4?
Gemma 4 is Google DeepMind’s April 2026 open-weights model family, released under a permissive Apache 2.0 license. It ships in four sizes: E2B and E4B edge models with native vision and audio support that run fully offline on phones and embedded hardware, a 26B Mixture-of-Experts model with ~4B active parameters per token, and a 31B Dense flagship that currently ranks #3 on the Arena open-source leaderboard. All four variants support a 256K-token context window, cover more than 140 languages, and were built on Gemini 3 research. The 31B Dense is the first open-weights model that competes seriously with closed flagships like Claude Opus 4.6 and GPT-5.4 Pro on agentic workflows, while the edge variants are the only option in this tier that can run entirely on-device.
What Google actually shipped
The Gemma 4 announcement is signed by Clement Farabet, VP of Research at Google DeepMind, and Olivier Lacombe, Group Product Manager. The release covers four sizes, and the naming is a bit unusual on purpose:
- Gemma 4 E2B and E4B — “Effective 2B” and “Effective 4B.” These are the edge models with native vision and audio support, designed to run completely offline on phones, Raspberry Pi, and Jetson Nano hardware. Google built them in partnership with the Pixel team, Qualcomm, and MediaTek.
- Gemma 4 26B (A4B) — A Mixture-of-Experts model with 26B total parameters but only ~4B active per token. This is the one that hits #6 on the open-source Arena leaderboard.
- Gemma 4 31B Dense — The full dense workhorse and the headline of the release. #3 on the open-source Arena leaderboard. This is the model you run when you want frontier-class quality without paying API rates.
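To get a feel for what those parameter counts mean for hardware, here is a back-of-the-envelope sketch of weight memory at a few precisions. The parameter counts come from the list above; the precisions and the helper names (`weight_memory_gb`, `active_fraction`) are my own illustrative assumptions, not anything from Google's launch materials, and the estimate ignores KV cache and runtime overhead.

```python
# Back-of-the-envelope weight memory: parameter count x bytes per parameter.
# The sizes are the nominal ones from the list above; the precisions are
# illustrative assumptions, not configurations Google has published.
# KV cache, activations, and runtime overhead are deliberately ignored.

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def active_fraction(total_billions: float, active_billions: float) -> float:
    """Share of an MoE model's parameters used for each token."""
    return active_billions / total_billions


if __name__ == "__main__":
    for name, size in [("31B Dense", 31.0), ("26B MoE", 26.0)]:
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(size, precision)
            print(f"{name} @ {precision}: ~{gb:.1f} GB of weights")
    share = active_fraction(26.0, 4.0)
    print(f"26B MoE active share per token: {share:.0%}")
```

One caveat on the MoE numbers: a Mixture-of-Experts model typically still keeps every expert in memory, so the ~4B active parameters mainly cut compute per token, not the weight footprint.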
All four were built from Gemini 3 research, support up to a 256K context window, cover more than 140 languages, and ship with weights on Hugging Face under Apache 2.0. The license matters more than people give it credit for. Gemma 3 had restrictive clauses that bit a couple of my clients on commercial use. Gemma 4 is genuinely permissive: you can deploy it anywhere, fine-tune it, and ship products on top of it without a custom license review.
The head-to-head comparison
Here’s the table I wish I’d had a few days ago. Pricing and benchmark numbers are from each provider’s official launch posts and the independent reporting I cite below.
| Model | Released | Context | Pricing (in / out per 1M tok) | Standout |
|---|---|---|---|---|
| Gemma 4 31B Dense | Apr 2, 2026 | 256K | Self-hosted (Apache 2.0) | #3 open model on Arena, frontier quality on a single GPU |
| Claude Opus 4.6 | Feb 5, 2026 | 1M (beta) | $5 / $25 | Highest at-launch scores on Terminal-Bench 2.0 and Humanity’s Last Exam |
| GPT-5.4 Pro | Mar 5, 2026 | 1M (922K in / 128K out) | $30 / $180 | GPT-5.4 family is first to beat the OSWorld human-expert baseline (75% vs 72.4%) |
| Gemini 3.1 Pro | Feb 19, 2026 | 1M | Per Vertex AI pricing | Native audio + video input, deepest multimodal |
A few things in that table are worth slowing down on.
How good is Gemma 4 at coding?
This is the closest race on the chart. Claude Opus 4.6 scores roughly 80.8% on SWE-bench Verified per Anthropic’s launch post and the independent GPT-5.4 comparison published on NxCode. GPT-5.4 is just behind at around 80% on Verified and pulls ahead on the harder SWE-bench Pro variant at 57.7%. Opus still wins on Terminal-Bench 2.0, which I find correlates better with actual multi-file agent loops in real repos. I cover the full coding-agent picture in AI coding agents in 2026.
Where does Gemma 4 land? Google hasn’t published an apples-to-apples SWE-bench number, but the 31B’s #3 Arena ranking and the agentic-skills positioning in the Google Developers blog put it within striking distance of mid-tier closed models. It is not going to beat Opus 4.6 on a hard bug fix in a 50-file repo. It will absolutely handle most of the routine coding work that eats your API budget.
Which model wins on reasoning and knowledge work?
At its Feb 5 launch, Claude Opus 4.6 led Humanity’s Last Exam and the GDPval-AA evaluation run by Artificial Analysis, beating GPT-5.2 by ~144 Elo points per Anthropic’s footnotes. GPT-5.4 then took the GDPval crown back a month later at 83% on the standard variant per OpenAI’s launch numbers. These two trade leads benchmark by benchmark. Honest take: pick whichever you already use unless you specifically need OSWorld-style desktop automation, in which case GPT-5.4 Pro is the only credible choice right now.
How does Gemma 4 handle long context and multimodal input?
Gemini 3.1 Pro is still the model I trust on the longest workloads. The Gemini 3.1 Pro model card confirms 1M-token input across text, images, audio, and video, with 64K-token output. Opus 4.6’s 1M beta is solid but text-only at that length. GPT-5.4 Pro is 922K in / 128K out per OpenAI’s pricing page, with input pricing doubling above 272K. Gemma 4 holds 256K, which is generous for an open model but well behind the closed flagships.
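The long-context surcharge on GPT-5.4 Pro is easy to underestimate, so here is a minimal sketch of the arithmetic. The $30 / $180 rates and the 272K threshold are the figures quoted above. How the doubling is applied is my assumption: I treat every input token in a prompt over the threshold as billed at the doubled rate, with the output rate unchanged. The function name `request_cost_usd` is hypothetical.

```python
# Sketch of per-request cost with a long-context input surcharge.
# Rates and the 272K threshold are the GPT-5.4 Pro figures quoted above.
# ASSUMPTION: once a prompt exceeds the threshold, all of its input tokens
# are billed at the doubled rate and the output rate is unchanged. Check the
# provider's pricing page for the actual rule.


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    in_rate: float = 30.0,    # USD per 1M input tokens
    out_rate: float = 180.0,  # USD per 1M output tokens
    long_context_threshold: int = 272_000,
    long_context_multiplier: float = 2.0,
) -> float:
    """Return the estimated USD cost of one request."""
    if input_tokens > long_context_threshold:
        in_rate *= long_context_multiplier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    for prompt_tokens in (200_000, 500_000):
        cost = request_cost_usd(prompt_tokens, output_tokens=4_000)
        print(f"{prompt_tokens:>7} input tokens: ${cost:.2f}")
```

Under that assumption, a 500K-token prompt costs well over double what a 200K-token prompt does, which is worth knowing before you point a whole codebase at it.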
For dense PDFs, video, or whole-codebase Q&A, Gemini 3.1 Pro is the right answer. For everything else, the others are competitive.
Can Gemma 4 run on edge and on-device?
This is the only category where Gemma 4 wins outright. The E2B and E4B variants run entirely offline on a phone, with vision and audio. None of the closed flagships even compete here — they aren’t designed to. If you’re building anything mobile-first, regulated-data, or air-gapped, this is the release you’ve been waiting for.
How powerful is Gemma 4, really?
Powerful enough to change the build-vs-buy math for serious teams. Not powerful enough to retire your Anthropic, OpenAI, or Google API subscription if you’re doing frontier work.
My honest read: if your workload is well-scoped (customer support triage, document extraction, structured RAG, internal coding assistants, on-device assistants), Gemma 4 31B will do the job for a fraction of API cost, you can fine-tune it on your own data, and you don’t send anything to a third party. That matters for regulated industries and for any team where token spend is stacking up faster than the value it produces. The MCP ecosystem is pushing more workloads in this direction, and Gemma 4 is finally a credible default for the open side of that stack.
If your workload is messy, multi-step, and benefits from the longest possible reasoning loop, you still want Opus 4.6 or GPT-5.4 Pro. The closed flagships pull ahead on the hardest 10% of tasks, and that 10% is often where the value lives.
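The build-vs-buy math comes down to a break-even volume: the monthly token count at which a fixed self-hosting bill equals what the API would have charged. Here is a rough sketch. The function name and the dollar figures in the example are hypothetical placeholders, not quotes from any provider, so plug in your own numbers.

```python
# Break-even sketch for self-hosting versus paying per token.
# All dollar figures in the example are hypothetical placeholders.


def breakeven_tokens_per_month(
    fixed_monthly_usd: float,
    api_usd_per_million: float,
    selfhost_usd_per_million: float = 0.0,
) -> float:
    """Monthly token volume above which self-hosting is the cheaper option."""
    saving_per_million = api_usd_per_million - selfhost_usd_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    return fixed_monthly_usd / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical: $3,000/month for GPU capacity versus a blended
    # API rate of $10 per million tokens.
    tokens = breakeven_tokens_per_month(3000.0, 10.0)
    print(f"Break-even at about {tokens / 1e6:.0f}M tokens per month")
```

Below the break-even volume the API is the cheaper choice; above it, the fixed cost of your own hardware starts to pay for itself.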
What this does to the AI race
Three things shift, in my view.
1. Google reclaims the open-source narrative. Meta spent two years owning the “we ship weights” story with Llama. Gemma 4 takes that back. The 31B Dense is the first open model that genuinely competes on agentic workflows, and ranking #3 on the Arena open-source leaderboard is not a marketing claim — it’s an external benchmark. Three weeks later DeepSeek answered with DeepSeek V4 Pro and V4 Flash, pushing the frontier of open weights even further with 1.6T parameters and a 1M token context. Expect Meta to respond with a Llama 5 push before mid-year.
2. The middle of the market gets squeezed. If you’re a model provider whose pitch was “almost as good as GPT for half the price,” Gemma 4 just took that pitch and made it free. Self-host it, no per-token charges, comparable quality for routine work. Several mid-tier API companies are going to have a rough quarter.
3. Inference economics start to dominate model selection. When the gap between open and closed shrinks to single-digit points on most benchmarks, the question stops being “which model is smartest” and starts being “which model is cheapest at the quality I need.” That favors whoever owns the GPUs. Right now that’s the hyperscalers, which is why Google releasing strong open weights also drives Google Cloud consumption. I dug into this dynamic more in my AI governance framework piece.
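The "cheapest at the quality I need" question from the third point can be written down as a simple selection rule. This is a minimal sketch: the `Candidate` type, the function name, and every score and price in the example are hypothetical placeholders. In practice the quality number should come from your own evaluation set, not a public leaderboard.

```python
# Pick the cheapest model that clears a quality bar.
# Candidate names, scores, and prices below are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float          # pass rate on your own eval set, 0.0 to 1.0
    usd_per_million: float  # blended cost per million tokens


def cheapest_meeting_bar(
    candidates: Sequence[Candidate], min_quality: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate at or above min_quality, if any."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.usd_per_million)


if __name__ == "__main__":
    pool = [
        Candidate("open-self-hosted", quality=0.86, usd_per_million=1.0),
        Candidate("closed-flagship", quality=0.93, usd_per_million=15.0),
    ]
    for bar in (0.80, 0.90, 0.99):
        pick = cheapest_meeting_bar(pool, bar)
        print(f"bar {bar:.2f}: {pick.name if pick else 'nothing qualifies'}")
```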
The closed labs aren’t in trouble. Opus 4.6 and GPT-5.4 Pro still set the ceiling, and the frontier keeps moving. But the floor just got raised by a lot. For a broader look at where the agent stack is heading this year, see my agentic AI 2026 deployments writeup.
My recommendation
If you’re starting a new agent project today, here’s what I’d actually do. Prototype on Opus 4.6 or GPT-5.4 Pro because you want the highest-quality baseline first. Once the workflow is locked in, swap in Gemma 4 31B (or the 26B MoE for cheaper inference) for the 80% of calls that don’t need frontier reasoning, and keep the closed model for the hard tail. That’s roughly the setup I’m pitching to two of my own clients this week. The hybrid pattern is going to be the default by the end of 2026, and Gemma 4 is the model that finally makes the open half of that hybrid serious.
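One way to wire up that hybrid is an escalation wrapper: try the open model first and fall back to the closed one only when the draft fails a check. The sketch below uses plain callables as stand-ins for real inference clients, and `hybrid_call` and `is_good_enough` are hypothetical names. The check itself (a schema validator, a test run, a grader) is the part you would have to design for your own workload.

```python
# Escalation sketch for a hybrid open/closed setup.
# The model callables and the acceptance check are stand-ins (assumptions);
# swap in your real inference clients and validation logic.
from typing import Callable, Tuple

Model = Callable[[str], str]


def hybrid_call(
    prompt: str,
    open_model: Model,
    closed_model: Model,
    is_good_enough: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the open model first; escalate to the closed model on failure.

    Returns (tier, answer), where tier is "open" or "closed".
    """
    draft = open_model(prompt)
    if is_good_enough(prompt, draft):
        return "open", draft
    return "closed", closed_model(prompt)


if __name__ == "__main__":
    # Stubs: the open model "fails" on prompts containing the word "hard".
    def open_stub(prompt: str) -> str:
        return "" if "hard" in prompt else f"open answer to: {prompt}"

    def closed_stub(prompt: str) -> str:
        return f"closed answer to: {prompt}"

    def non_empty(prompt: str, answer: str) -> bool:
        return bool(answer.strip())

    for p in ("summarize this ticket", "hard multi-file refactor"):
        tier, answer = hybrid_call(p, open_stub, closed_stub, non_empty)
        print(f"[{tier}] {answer}")
```

Logging the tier for each call also tells you how big your hard tail really is, which is the number that decides whether the 80/20 split holds for your workload.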
Frequently asked questions
What is Gemma 4? Gemma 4 is Google DeepMind’s April 2026 open-weights model family, released under a permissive Apache 2.0 license. It ships in four sizes — E2B, E4B, 26B MoE, and 31B Dense — all built on Gemini 3 research with a 256K context window and support for 140+ languages.
How does Gemma 4 compare to Claude Opus 4.6 and GPT-5.4 Pro? Gemma 4 31B Dense reaches mid-tier closed-model quality on most benchmarks but still trails Opus 4.6 and GPT-5.4 Pro on the hardest coding and reasoning tasks. Opus led Terminal-Bench 2.0 and Humanity’s Last Exam at launch; the GPT-5.4 family is the first to cross the OSWorld human-expert baseline (75% vs 72.4%). Gemma 4’s advantage is that it’s self-hosted with no per-token API charges.
Can Gemma 4 run on-device? Yes. The E2B and E4B variants run fully offline on phones, Raspberry Pi, and Jetson Nano with native vision and audio input. On-device deployment is the only category where Gemma 4 beats Opus 4.6, GPT-5.4 Pro, and Gemini 3.1 Pro outright.
Is Gemma 4 free for commercial use? Yes. Gemma 4 is released under standard Apache 2.0, which permits commercial use, modification, and fine-tuning without a custom license review — a meaningful upgrade over Gemma 3’s restrictive clauses.
Related AI Insights
- AI Coding Agents in 2026: GPT-5, Claude Code, and Developers
- Best AI Assistants 2026: Work Faster and Create More
- The Agentic AI Revolution 2026
- MCP and Agentic AI Explained
- Powerful AI Governance Framework: 7 Strategies for 2026
- GPT-Rosalind: OpenAI’s Life Sciences AI for Drug Discovery
- DeepSeek V4 Pro: Open Frontier AI at 1/10 the Cost