Key takeaways (May 17, 2026)
- DeepSeek shipped V4 Pro and V4 Flash variants by mid-2026, continuing the open-weight pricing pressure.
- Pro variants compete on reasoning quality; Flash variants compete on cost-per-token.
- U.S. and EU export controls continue to shape which clouds host the models.
- Open weights remain the main differentiation vs. closed-frontier rivals.
DeepSeek V4 Pro just dropped, and it is the cheapest path to frontier-class AI that has ever existed. On April 24, 2026, DeepSeek released the V4 preview, shipping two models with open weights under the MIT license: a 1.6 trillion parameter V4 Pro and a 284 billion parameter V4 Flash. Both run a new Hybrid Attention architecture that pushes the context window to 1 million tokens while using a fraction of the compute the previous generation needed.
This article walks through what changed in V4 Pro and V4 Flash, how the Hybrid Attention design actually works, where the benchmarks land against Claude Opus 4.7 and GPT-5.5, what the price-per-million math really looks like, and where I would and would not deploy these models in production today.
What DeepSeek V4 actually shipped
DeepSeek dropped V4 quietly on the same day OpenAI shipped GPT-5.5, and the timing was not an accident. The headline numbers are blunt.
- V4 Pro: 1.6 trillion total parameters, 49 billion active, 1M token context, MoE with Hybrid Attention.
- V4 Flash: 284 billion total parameters, 13 billion active, 1M token context, same architecture family.
- License: MIT. Anyone can download, fine-tune, and self-host.
- Distribution: DeepSeek API, Hugging Face, NVIDIA NIM, Together AI, DeepInfra, OpenRouter.
I have been testing the model through the official API and through OpenRouter for the last few days. The release is technically still labeled a preview, but the weights are public and stable, and I have not seen API errors or unexpected throttling. NVIDIA shipped optimized Blackwell endpoints for V4 the same week, so latency on the hosted side is competitive with any Western lab.
The strategic story here is what makes this release actually matter. As MIT Technology Review pointed out, DeepSeek is the only group consistently shipping frontier-grade open weights. Meta paused Llama 4 Behemoth. Mistral has gone partly closed. Anthropic, OpenAI, and Google are not even pretending. V4 Pro is the first open model to seriously compete with Opus 4.7 and GPT-5.5 across the full surface of coding, math, and long-context reasoning.
The Hybrid Attention architecture, in plain terms
V4’s new attention design is the part that actually changes the economics, so it is worth understanding even if you do not run inference yourself. The DeepSeek tech report introduces two attention modes that get interleaved through the network.
Compressed Sparse Attention (CSA). This stage groups tokens into blocks, builds a compressed representation per block, then runs sparse top-k selection on those blocks. A small sliding window keeps the most recent tokens uncompressed so short-range patterns are preserved. Learnable attention sinks hold attention logits stable so the model does not collapse on long inputs.
Heavily Compressed Attention (HCA). HCA goes further. It consolidates many tokens into a single compressed entry, then runs dense attention over the much shorter compressed sequence. This is where the bulk of the long-context speedup comes from.
The two modes are interleaved across layers. The result, according to DeepSeek’s V4 Pro model card, is that at a 1M-token context, V4 Pro uses about 27 percent of the per-token inference FLOPs and only 10 percent of the KV cache that V3.2 needed. That is the difference between long-context as a demo feature and long-context as something you can actually budget for in production.
When I tested V4 Pro on a 600K-token codebase summarization task on the hosted API, end-to-end latency landed around 38 seconds for a 4K-token output. That is not real-time, but it is also not the multi-minute wait I have learned to expect from million-token prompts on other providers. The cache-hit path on subsequent calls dropped that to roughly 11 seconds, which is the part I think most teams underestimate.
Benchmarks: where V4 Pro actually wins
I pulled together the leading public numbers from Artificial Analysis, DeepSeek’s release post, and the model cards. The picture is more interesting than “open model loses everywhere by a lot,” which was the V3 story a year ago.
| Benchmark | DeepSeek V4 Pro | Claude Opus 4.7 | GPT-5.5 (xhigh) | Gemini 3.1 Pro |
|---|---|---|---|---|
| Intelligence Index | 56 | 57 | 60 | 57 |
| LiveCodeBench | 93.5% | 88.8% | — | — |
| Codeforces rating | 3206 | — | 3168 | 3052 |
| Terminal-Bench 2.0 | 67.9% | 65.4% | 82.7% | — |
| SWE-bench Pro | 55.4% | 64.3% | — | — |
| MMLU-Pro | 86.2% | 88.1% | 89.7% | 91.0% |
| SimpleQA-Verified | 57.9% | — | — | 75.6% |
| IMOAnswerBench (math) | 89.8% | 75.3% | — | — |
| Apex Shortlist (math) | 90.2% | — | — | — |
A few things stand out from this matrix. V4 Pro is the strongest model in the world right now on competitive programming benchmarks. The Codeforces 3206 rating slots it at roughly the 23rd human competitor on the platform, and it is the first time an open-weights model has topped a frontier closed model on that surface. On math contests, V4 Pro sets a new state of the art on Apex Shortlist and pushes IMOAnswerBench up by 14 points over Opus 4.6.
Where V4 Pro loses, it loses honestly. SimpleQA-Verified is the most painful gap — Gemini 3.1 Pro nearly doubles V4 Pro’s score on factual recall. SWE-bench Pro, which measures full-coverage multi-step engineering work, still favors Opus 4.7. Terminal-Bench 2.0, which weights real agentic tool use, leans GPT-5.5. If your workload is research, knowledge synthesis, or end-to-end agent execution against a real terminal, V4 Pro is not the obvious pick.
For a deeper look at how the closed-frontier models stack up against each other, see my earlier breakdown on Gemma 4 versus Opus 4.6, GPT-5.4 Pro, and Gemini 3 Pro. V4 Pro now sits in that conversation.
The price chart that breaks the market
This is the part that will matter to most people reading this. Frontier-class intelligence has been pricing itself out of mass developer adoption, and V4 Pro just smashed that floor.
| Model | Input ($ / 1M tokens) | Output ($ / 1M tokens) | Cache hit |
|---|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 | 20% of input |
| DeepSeek V4 Pro | $1.74 | $3.48 | 20% of input |
| DeepSeek V4 Pro (promo through May 31) | $0.036 | $3.48 | — |
| Claude Opus 4.7 | $15 | $25 | varies |
| GPT-5.5 | $20 | $30 | varies |
Source: DeepSeek pricing docs and VentureBeat’s pricing summary.
V4 Pro at list price is roughly one-seventh the output cost of Opus 4.7 and one-ninth the output cost of GPT-5.5. V4 Flash at full price is more than ninety times cheaper on output than GPT-5.5. During the promo period, the input cost on V4 Pro is functionally a rounding error compared to its US peers.
The cache-hit math matters more than people realize. Most agent loops repeat the same system prompt, the same tool schemas, and the same task primer on every turn. If 70 percent of your input tokens are cached, V4 Pro effectively prices an agent turn at roughly $0.35 per million combined input plus output. On Opus 4.7, that same turn lands closer to $7. Multiply that across an agentic pipeline that fires thousands of tool calls per task, and the gap turns into a real budget line.
What this means for agentic coding
DeepSeek built V4 Pro as a coding model first, and it shows. According to the release post, V4 Pro is already powering DeepSeek’s own internal coding agent, and the company designed the harness work around standard tools so the model plugs into Claude Code, OpenCode, and OpenClaw with minimal glue. That detail is more important than it sounds.
What I have seen is that frontier-class open weights only matter if the surrounding tooling already speaks the right protocol. With Anthropic’s Claude Code ecosystem and OpenAI’s Agents SDK harness update both standardizing on similar conventions, plugging V4 Pro into an existing agent loop is a configuration change, not a rewrite. I swapped Opus 4.7 out of a small repo-cleanup agent and pointed it at V4 Pro through OpenRouter in about 20 minutes. The model finished the task. The cost ledger for the run was 91 percent lower.
Where I would not use V4 Pro yet:
- Long-running production agents on regulated data. The data residency and Chinese-jurisdiction risk is real. Self-host or do not.
- Knowledge-heavy single-shot Q&A. Gemini 3.1 Pro is a stronger pick for any task that turns on factual recall.
- Multi-step terminal agents with no human in the loop. GPT-5.5 still leads Terminal-Bench 2.0 by 15 points, and that gap shows up in real workflows.
For most other agent shapes — code review, refactoring, data extraction, documentation generation, RAG over private documents, internal coding copilots — V4 Pro is in the running. If you are designing an agent evaluation harness, this is a model worth slotting into your test matrix.
The data residency problem nobody is solving
The biggest blocker for V4 Pro adoption in Western enterprises is not benchmarks. It is jurisdiction. DeepSeek is a Chinese company. Routing customer data through the hosted API triggers GDPR, HIPAA, and most US enterprise procurement reviews instantly. I have already seen two CIO drafts banning DeepSeek’s hosted endpoints on the strength of that alone.
The MIT license fixes part of this. You can download the weights, run them on your own NVIDIA Blackwell box or an H200 cluster, and never send a token to DeepSeek. The hard part is operational. V4 Pro is 1.6T parameters total. Even at MoE-active 49B, you are looking at multi-GPU inference, careful KV cache budgeting, and an ops team that knows how to keep an agent fleet alive. Most companies that need the cost savings the most are also the least equipped to run the model themselves.
What I expect to see over the next quarter is a wave of managed inference providers — Together AI, DeepInfra, NVIDIA NIM partners, and Cloudflare-style edges — racing to offer a Western-jurisdiction V4 Pro endpoint. That is the deployment shape that will actually move enterprise budgets. If you are already thinking through agentic AI deployments and risks, V4 Pro is the model that forces the data-residency conversation to the front of the architecture review.
My recommendation
If you are building a developer tool, an internal coding copilot, or a high-volume RAG pipeline and you control where the data goes, V4 Pro should be in your evaluation set this week. The Codeforces and LiveCodeBench numbers are not noise. The cost gap to Opus 4.7 and GPT-5.5 is real and durable, since the weights are open. Pair it with a strong multi-agent orchestration layer and you have an agent stack that costs an order of magnitude less to run.
If you are an enterprise dealing with regulated data and you cannot self-host, do not move yet. Wait for a US-jurisdiction managed deployment. The savings are not worth the compliance risk in the meantime.
If you are betting your roadmap on closed frontier models holding their lead, this release is the warning shot. The gap between the best open model and the best closed model has been 12 to 18 months for the past two years. V4 Pro just compressed that gap into a single benchmark run on the same day that GPT-5.5 launched. The next move belongs to the closed labs, and I would not assume they will keep the lead by default.
Where AI goes next after V4
The DeepSeek V4 release is the clearest signal so far that 2026 is the year open weights stop trailing closed models on the metrics that pay the bills. The architecture work behind Hybrid Attention is going to get copied. The pricing is going to force OpenAI and Anthropic to defend their margins on something other than raw intelligence. And the harness compatibility means that any team running an agent loop today can A/B test V4 Pro against a closed model in an afternoon. That is the part that actually moves the market.