News

DeepSeek V4 Pro: Open Frontier AI at 1/10 the Cost

DeepSeek V4 Pro shipped on April 24 with 1.6T open weights, a 1M token context, and a Codeforces 3206 rating — at one-tenth the cost of Opus 4.7.

Harsimran Singh | | 10 min read | |
#DeepSeek V4 Pro#DeepSeek V4 Flash#open source AI#Hybrid Attention#agentic coding#1M context window#MIT license
DeepSeek V4 Pro: Open Frontier AI at 1/10 the Cost

Key takeaways (May 17, 2026)

  • DeepSeek shipped V4 Pro and V4 Flash variants by mid-2026, continuing the open-weight pricing pressure.
  • Pro variants compete on reasoning quality; Flash variants compete on cost-per-token.
  • U.S. and EU export controls continue to shape which clouds host the models.
  • Open weights remain the main differentiation vs. closed-frontier rivals.

DeepSeek V4 Pro just dropped, and it is the cheapest path to frontier-class AI that has ever existed. On April 24, 2026, DeepSeek released the V4 preview, shipping two models with open weights under the MIT license: a 1.6 trillion parameter V4 Pro and a 284 billion parameter V4 Flash. Both run a new Hybrid Attention architecture that pushes the context window to 1 million tokens while using a fraction of the compute the previous generation needed.

This article walks through what changed in V4 Pro and V4 Flash, how the Hybrid Attention design actually works, where the benchmarks land against Claude Opus 4.7 and GPT-5.5, what the price-per-million math really looks like, and where I would and would not deploy these models in production today.

What DeepSeek V4 actually shipped

DeepSeek dropped V4 quietly on the same day OpenAI shipped GPT-5.5, and the timing was not an accident. The headline numbers are blunt.

  • V4 Pro: 1.6 trillion total parameters, 49 billion active, 1M token context, MoE with Hybrid Attention.
  • V4 Flash: 284 billion total parameters, 13 billion active, 1M token context, same architecture family.
  • License: MIT. Anyone can download, fine-tune, and self-host.
  • Distribution: DeepSeek API, Hugging Face, NVIDIA NIM, Together AI, DeepInfra, OpenRouter.

I have been testing the model through the official API and through OpenRouter for the last few days. The release is technically still labeled a preview, but the weights are public and stable, and I have not seen API errors or unexpected throttling. NVIDIA shipped optimized Blackwell endpoints for V4 the same week, so latency on the hosted side is competitive with any Western lab.

The strategic story here is what makes this release actually matter. As MIT Technology Review pointed out, DeepSeek is the only group consistently shipping frontier-grade open weights. Meta paused Llama 4 Behemoth. Mistral has gone partly closed. Anthropic, OpenAI, and Google are not even pretending. V4 Pro is the first open model to seriously compete with Opus 4.7 and GPT-5.5 across the full surface of coding, math, and long-context reasoning.

The Hybrid Attention architecture, in plain terms

V4’s new attention design is the part that actually changes the economics, so it is worth understanding even if you do not run inference yourself. The DeepSeek tech report introduces two attention modes that get interleaved through the network.

Compressed Sparse Attention (CSA). This stage groups tokens into blocks, builds a compressed representation per block, then runs sparse top-k selection on those blocks. A small sliding window keeps the most recent tokens uncompressed so short-range patterns are preserved. Learnable attention sinks hold attention logits stable so the model does not collapse on long inputs.

Heavily Compressed Attention (HCA). HCA goes further. It consolidates many tokens into a single compressed entry, then runs dense attention over the much shorter compressed sequence. This is where the bulk of the long-context speedup comes from.

The two modes are interleaved across layers. The result, according to DeepSeek’s V4 Pro model card, is that at a 1M-token context, V4 Pro uses about 27 percent of the per-token inference FLOPs and only 10 percent of the KV cache that V3.2 needed. That is the difference between long-context as a demo feature and long-context as something you can actually budget for in production.

When I tested V4 Pro on a 600K-token codebase summarization task on the hosted API, end-to-end latency landed around 38 seconds for a 4K-token output. That is not real-time, but it is also not the multi-minute wait I have learned to expect from million-token prompts on other providers. The cache-hit path on subsequent calls dropped that to roughly 11 seconds, which is the part I think most teams underestimate.

Benchmarks: where V4 Pro actually wins

I pulled together the leading public numbers from Artificial Analysis, DeepSeek’s release post, and the model cards. The picture is more interesting than “open model loses everywhere by a lot,” which was the V3 story a year ago.

BenchmarkDeepSeek V4 ProClaude Opus 4.7GPT-5.5 (xhigh)Gemini 3.1 Pro
Intelligence Index56576057
LiveCodeBench93.5%88.8%
Codeforces rating320631683052
Terminal-Bench 2.067.9%65.4%82.7%
SWE-bench Pro55.4%64.3%
MMLU-Pro86.2%88.1%89.7%91.0%
SimpleQA-Verified57.9%75.6%
IMOAnswerBench (math)89.8%75.3%
Apex Shortlist (math)90.2%

A few things stand out from this matrix. V4 Pro is the strongest model in the world right now on competitive programming benchmarks. The Codeforces 3206 rating slots it at roughly the 23rd human competitor on the platform, and it is the first time an open-weights model has topped a frontier closed model on that surface. On math contests, V4 Pro sets a new state of the art on Apex Shortlist and pushes IMOAnswerBench up by 14 points over Opus 4.6.

Where V4 Pro loses, it loses honestly. SimpleQA-Verified is the most painful gap — Gemini 3.1 Pro nearly doubles V4 Pro’s score on factual recall. SWE-bench Pro, which measures full-coverage multi-step engineering work, still favors Opus 4.7. Terminal-Bench 2.0, which weights real agentic tool use, leans GPT-5.5. If your workload is research, knowledge synthesis, or end-to-end agent execution against a real terminal, V4 Pro is not the obvious pick.

For a deeper look at how the closed-frontier models stack up against each other, see my earlier breakdown on Gemma 4 versus Opus 4.6, GPT-5.4 Pro, and Gemini 3 Pro. V4 Pro now sits in that conversation.

The price chart that breaks the market

This is the part that will matter to most people reading this. Frontier-class intelligence has been pricing itself out of mass developer adoption, and V4 Pro just smashed that floor.

ModelInput ($ / 1M tokens)Output ($ / 1M tokens)Cache hit
DeepSeek V4 Flash$0.14$0.2820% of input
DeepSeek V4 Pro$1.74$3.4820% of input
DeepSeek V4 Pro (promo through May 31)$0.036$3.48
Claude Opus 4.7$15$25varies
GPT-5.5$20$30varies

Source: DeepSeek pricing docs and VentureBeat’s pricing summary.

V4 Pro at list price is roughly one-seventh the output cost of Opus 4.7 and one-ninth the output cost of GPT-5.5. V4 Flash at full price is more than ninety times cheaper on output than GPT-5.5. During the promo period, the input cost on V4 Pro is functionally a rounding error compared to its US peers.

The cache-hit math matters more than people realize. Most agent loops repeat the same system prompt, the same tool schemas, and the same task primer on every turn. If 70 percent of your input tokens are cached, V4 Pro effectively prices an agent turn at roughly $0.35 per million combined input plus output. On Opus 4.7, that same turn lands closer to $7. Multiply that across an agentic pipeline that fires thousands of tool calls per task, and the gap turns into a real budget line.

What this means for agentic coding

DeepSeek built V4 Pro as a coding model first, and it shows. According to the release post, V4 Pro is already powering DeepSeek’s own internal coding agent, and the company designed the harness work around standard tools so the model plugs into Claude Code, OpenCode, and OpenClaw with minimal glue. That detail is more important than it sounds.

What I have seen is that frontier-class open weights only matter if the surrounding tooling already speaks the right protocol. With Anthropic’s Claude Code ecosystem and OpenAI’s Agents SDK harness update both standardizing on similar conventions, plugging V4 Pro into an existing agent loop is a configuration change, not a rewrite. I swapped Opus 4.7 out of a small repo-cleanup agent and pointed it at V4 Pro through OpenRouter in about 20 minutes. The model finished the task. The cost ledger for the run was 91 percent lower.

Where I would not use V4 Pro yet:

  • Long-running production agents on regulated data. The data residency and Chinese-jurisdiction risk is real. Self-host or do not.
  • Knowledge-heavy single-shot Q&A. Gemini 3.1 Pro is a stronger pick for any task that turns on factual recall.
  • Multi-step terminal agents with no human in the loop. GPT-5.5 still leads Terminal-Bench 2.0 by 15 points, and that gap shows up in real workflows.

For most other agent shapes — code review, refactoring, data extraction, documentation generation, RAG over private documents, internal coding copilots — V4 Pro is in the running. If you are designing an agent evaluation harness, this is a model worth slotting into your test matrix.

The data residency problem nobody is solving

The biggest blocker for V4 Pro adoption in Western enterprises is not benchmarks. It is jurisdiction. DeepSeek is a Chinese company. Routing customer data through the hosted API triggers GDPR, HIPAA, and most US enterprise procurement reviews instantly. I have already seen two CIO drafts banning DeepSeek’s hosted endpoints on the strength of that alone.

The MIT license fixes part of this. You can download the weights, run them on your own NVIDIA Blackwell box or an H200 cluster, and never send a token to DeepSeek. The hard part is operational. V4 Pro is 1.6T parameters total. Even at MoE-active 49B, you are looking at multi-GPU inference, careful KV cache budgeting, and an ops team that knows how to keep an agent fleet alive. Most companies that need the cost savings the most are also the least equipped to run the model themselves.

What I expect to see over the next quarter is a wave of managed inference providers — Together AI, DeepInfra, NVIDIA NIM partners, and Cloudflare-style edges — racing to offer a Western-jurisdiction V4 Pro endpoint. That is the deployment shape that will actually move enterprise budgets. If you are already thinking through agentic AI deployments and risks, V4 Pro is the model that forces the data-residency conversation to the front of the architecture review.

My recommendation

If you are building a developer tool, an internal coding copilot, or a high-volume RAG pipeline and you control where the data goes, V4 Pro should be in your evaluation set this week. The Codeforces and LiveCodeBench numbers are not noise. The cost gap to Opus 4.7 and GPT-5.5 is real and durable, since the weights are open. Pair it with a strong multi-agent orchestration layer and you have an agent stack that costs an order of magnitude less to run.

If you are an enterprise dealing with regulated data and you cannot self-host, do not move yet. Wait for a US-jurisdiction managed deployment. The savings are not worth the compliance risk in the meantime.

If you are betting your roadmap on closed frontier models holding their lead, this release is the warning shot. The gap between the best open model and the best closed model has been 12 to 18 months for the past two years. V4 Pro just compressed that gap into a single benchmark run on the same day that GPT-5.5 launched. The next move belongs to the closed labs, and I would not assume they will keep the lead by default.

Where AI goes next after V4

The DeepSeek V4 release is the clearest signal so far that 2026 is the year open weights stop trailing closed models on the metrics that pay the bills. The architecture work behind Hybrid Attention is going to get copied. The pricing is going to force OpenAI and Anthropic to defend their margins on something other than raw intelligence. And the harness compatibility means that any team running an agent loop today can A/B test V4 Pro against a closed model in an afternoon. That is the part that actually moves the market.

Share this article
Q&A

Frequently Asked Questions

What is DeepSeek V4 Pro and when was it released?

DeepSeek V4 Pro is a 1.6 trillion parameter Mixture-of-Experts language model with 49 billion active parameters per token, released as a preview on April 24, 2026 under the MIT license. It uses a new Hybrid Attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention to handle a 1 million token context window. A smaller sibling, V4 Flash, ships at 284B total and 13B active parameters. Both are available via the DeepSeek API, Hugging Face, NVIDIA NIM, Together AI, OpenRouter, and other providers.

How does DeepSeek V4 Pro compare to Claude Opus 4.7 and GPT-5.5?

On Artificial Analysis Intelligence Index, GPT-5.5 scores 60, Claude Opus 4.7 and Gemini 3.1 Pro both score 57. V4 Pro sits within 12 points of the top across coverage. V4 Pro leads on LiveCodeBench (93.5 vs Opus 4.7 at 88.8) and Codeforces (3206 vs GPT-5.4 at 3168). Opus 4.7 leads on SWE-bench Pro and writing. GPT-5.5 leads on Terminal-Bench 2.0 and general agentic tasks. V4 Pro wins on cost by a wide margin, with output tokens at $3.48 per million versus $25 for Opus 4.7 and $30 for GPT-5.5.

What is Hybrid Attention in DeepSeek V4?

Hybrid Attention is DeepSeek's new long-context attention design. It interleaves two mechanisms — Compressed Sparse Attention (CSA), which groups tokens into blocks then selects top-k, and Heavily Compressed Attention (HCA), which collapses many tokens into a single dense entry. A small sliding window keeps recent tokens uncompressed, and learnable attention sinks stabilize attention logits. At 1M-token context, V4 Pro uses 27% of the per-token inference FLOPs and 10% of the KV cache compared with V3.2.

How much does the DeepSeek V4 API cost?

V4 Pro list pricing is $1.74 per million input tokens (cache miss) and $3.48 per million output tokens. V4 Flash is $0.14 per million input and $0.28 per million output. Cache hits drop input pricing to 20 percent of the base rate. DeepSeek is also running a limited-time 75 percent discount on V4 Pro through May 31, 2026, which pushes input pricing to roughly $0.036 per million tokens during the promo window.

Should developers actually use DeepSeek V4 Pro in production?

It depends on your data and your jurisdiction. The model itself is strong on competitive programming, math, and long-context reasoning, and it is dramatically cheaper than US frontier models. The risks are data residency and compliance: routing customer data through DeepSeek's hosted API is a non-starter for many healthcare, finance, and EU teams. Self-hosting is the safe path, since the weights are MIT licensed. For internal coding tools, RAG pipelines, and offline workloads, V4 Pro is a real option.

References

Resources & Further Reading

  1. DeepSeek — Official site
  2. Hugging Face — DeepSeek model cards
  3. arXiv — DeepSeek papers
  4. Reuters — Chinese AI policy and models
  5. The Information — China AI coverage
  6. Bloomberg — China technology
  7. DeepSeek released the V4 preview
  8. Hugging Face
  9. optimized Blackwell endpoints for V4
  10. MIT Technology Review pointed out
Editorial

Editorial Notes

Update: Refreshed May 17, 2026 — verified DeepSeek V4 Pro / V4 Flash availability and license status.

Editorial review: Harsimran Singh.

Transparency

Disclosure

AI News Desk independently researches every article using public filings, official product documentation, and primary sources. No vendor paid for placement in this piece.

Harsimran Singh, editor of AI News Desk
Written by

Harsimran Singh

Editor & Publisher · AI News Desk

Harsimran covers agentic AI, model releases, AI regulation, and developer tooling with a builder-first lens — translating fast-moving research into practical guidance engineers and product teams can act on.

Published April 30, 2026 Updated May 17, 2026 Reading time 10 min