Building With Claude Opus 4.7 — What the New API Features Actually Do
Opus 4.7 looks like a point release. It isn't. Here's what changes when you actually build with it — through the lens of integrating it into Tally Digital's Website Auditor.
Claude Opus 4.7 landed on 16 April 2026. If you glance at the version number, it reads like a routine point release. The benchmarks say otherwise.
SWE-bench Verified jumped from 80.8% to 87.6%. CursorBench went from 58% to 70%. XBOW's visual-acuity test climbed from 54.5% to 98.5% — the single jump most likely to make 4.7 feel like a new model. Rakuten reports 4.7 resolves three times as many tasks on their internal SWE benchmark as 4.6.
That is not the story of an incremental release. This post is about what actually lands in your code when you upgrade, told through the lens of integrating 4.7 into Tally Digital's Website Auditor — the lead-gen tool I'm building where visitors submit a URL, Playwright audits it, and Claude writes the recommendations.
The Release in One Paragraph
Model ID: claude-opus-4-7. Generally available on claude.ai, the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing is unchanged from 4.6 — $5 per million input tokens, $25 per million output. Context window: 1M input, 128K output. Pricing parity is the headline: this is the first Anthropic model in a while where you get a meaningful capability jump without paying more per token. The catch — and we'll get to it — is the updated tokenizer.
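Pricing parity makes per-call cost easy to reason about. A trivial helper for estimating a call at list price (my own sketch, not part of the SDK):

```typescript
// Estimate one call's cost at Opus 4.7 list pricing:
// $5 per million input tokens, $25 per million output tokens.
const INPUT_USD_PER_MTOK = 5
const OUTPUT_USD_PER_MTOK = 25

function callCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) /
    1_000_000
  )
}

// A 100K-token audit payload plus a 4K-token response costs $0.60.
console.log(callCostUSD(100_000, 4_000)) // 0.6
```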
The Website Auditor, Briefly
Context for the rest of this post. A visitor lands on tallydigital.io, enters their site URL and email, verifies the email, and kicks off an audit. An Inngest function orchestrates ten steps: a Railway Docker container runs Playwright and Chromium against six categories, lightweight steps on Vercel calculate scores, Gemini does research enrichment, Claude writes the final personalised recommendations, and Resend ships a report by email. Results live in Supabase with Realtime pushing progress to the visitor.
Every requirement the auditor places on the Claude call maps to something new in 4.7: long agent work, image analysis, hard cost limits per user request, and structured output that has to be right on the first pass. Here is how each new feature slots in.
The xhigh Effort Level
Effort levels let you trade speed for reasoning depth on a single call. 4.6 shipped with low, medium, high, and max. 4.7 adds xhigh — "extra high" — sitting between high and max.
That tier matters because max is expensive and slow, and high leaves real quality on the table for synthesis-heavy work. In the auditor, step nine is where Claude reads the entire audit payload (Playwright findings, Lighthouse scores, Gemini research, the user's submitted URL and email) and writes the recommendations that double as a sales pitch. Running that on high produced generic output. Running it on max was overkill and slow enough to push our p95 past the verification email arrival time. xhigh is the sweet spot.
```typescript
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()

const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  effort: 'xhigh',
  max_tokens: 4096,
  messages: [
    { role: 'user', content: buildRecommendationPrompt(audit) },
  ],
})
```

Reach for xhigh on planning and synthesis steps where the model needs to weigh a lot of context before speaking. Keep high for execution steps — the per-tool-call step of an agent, or a classifier. Only pick max when you have watched xhigh fail on the same input.
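That heuristic condenses to a small routing function. The step categories are my own labels, not anything in the SDK:

```typescript
// Route a pipeline step to an effort tier. 'synthesis' covers planning
// and report writing; 'execution' covers per-tool-call steps and
// classifiers; 'last-resort' is reserved for inputs xhigh has failed on.
type StepKind = 'synthesis' | 'execution' | 'last-resort'

function effortFor(kind: StepKind): 'high' | 'xhigh' | 'max' {
  switch (kind) {
    case 'synthesis':
      return 'xhigh'
    case 'execution':
      return 'high'
    case 'last-resort':
      return 'max'
  }
}
```

Centralising the choice means a future effort tier is a one-line change instead of a grep across every call site.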
Task Budgets (Public Beta)
Task budgets are the feature I've been waiting for. You set a hard token ceiling on a call — effectively a per-request cost cap — and the model has to operate inside it. The public beta launched with 4.7.
Why this matters for the auditor: the whole funnel is built around "free audit". If a visitor submits a sprawling e-commerce site with 400 pages, an uncapped Claude call in the recommendation step can quietly burn through my margin on a single request. Without task budgets, the defensive play is aggressive truncation of input, which degrades output quality. With task budgets, I let the model see everything and trust it to prioritise inside the budget.
```typescript
const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  effort: 'xhigh',
  max_tokens: 4096,
  task_budget: 60_000, // ~$1.75 ceiling at Opus 4.7 pricing
  messages: [
    { role: 'user', content: buildRecommendationPrompt(audit) },
  ],
})
```

Anthropic's release post confirms the public beta but doesn't settle on a final field name, so double-check the SDK before you copy this verbatim. The shape is what matters: one parameter, one ceiling, no extra plumbing. Anything that scales cost per end-user request — audits, reviews, generated reports — stops being a margin lottery.
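For sizing the ceiling, I work backwards from a dollar cap. How a token budget maps to dollars depends on the input/output mix of the call, so this sketch prices every budgeted token at the output rate, the worst case. The helper is mine, not the SDK's:

```typescript
// Convert a per-request dollar cap into a token budget, assuming the
// worst case where every budgeted token bills at the $25/M output rate.
// Real calls mix in cheaper $5/M input tokens, so this is conservative.
const OUTPUT_USD_PER_MTOK = 25

function taskBudgetForDollarCap(usd: number): number {
  return Math.floor((usd * 1_000_000) / OUTPUT_USD_PER_MTOK)
}

// A $1.50 cap allows a 60,000-token ceiling under this assumption.
console.log(taskBudgetForDollarCap(1.5)) // 60000
```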
Higher-Resolution Image Input
4.7 accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels. That is more than three times as many pixels as prior Claude models.
For the auditor this is not a nice-to-have. Earlier models forced us to downscale Playwright screenshots aggressively. Small text — exactly the text that matters for accessibility audits, mobile usability, and CTA analysis — vanished into the compression. We were running OCR separately and passing extracted text alongside a reduced image. Ugly, brittle, and it meant the model couldn't reason about visual hierarchy.
XBOW's visual-acuity benchmark jumped from 54.5% on Opus 4.6 to 98.5% on 4.7 — Anthropic cites this directly in the announcement. That is not a resolution bump; it is a capability change. The model can now see the interface the user actually sees.
```typescript
import { readFileSync } from 'fs'

const screenshot = readFileSync('audit/homepage.png')

const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  effort: 'xhigh',
  max_tokens: 2048,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'base64',
            media_type: 'image/png',
            data: screenshot.toString('base64'),
          },
        },
        {
          type: 'text',
          text: 'Audit this landing page for mobile usability issues. Quote specific text where you see problems.',
        },
      ],
    },
  ],
})
```

Same number of tokens per image as before, so the practical cost is unchanged — you are just getting sharper eyes for the same money.
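To stay inside the new limit without throwing away pixels, I only scale when the long edge exceeds 2,576. A minimal sketch; wire the returned dimensions into whatever image library you resize with:

```typescript
// Fit a screenshot inside Opus 4.7's 2,576px long-edge limit,
// preserving aspect ratio and leaving smaller images untouched.
const MAX_LONG_EDGE = 2576

function fitToLongEdge(
  width: number,
  height: number
): { width: number; height: number } {
  const longEdge = Math.max(width, height)
  if (longEdge <= MAX_LONG_EDGE) return { width, height }
  const scale = MAX_LONG_EDGE / longEdge
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  }
}

// A 4K capture (3840x2160) scales to 2576x1449; 1280x800 passes through.
```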
File-System Memory Across Sessions
4.7 is materially better at file-system-based memory across multi-session work. In a ten-step pipeline like the auditor, that is the difference between stuffing full audit state into every prompt and letting the model read what it needs, when it needs it.
Before 4.7, my working pattern was to concatenate enrichment findings, Playwright output, and previous step results into a growing system prompt. Input token counts ballooned. With file-system memory the model can pull the page-by-page Playwright report when it's writing the mobile section, the Lighthouse JSON when it's writing the performance section, and skip everything else. The orchestration stays in Inngest; the context management moves closer to the model.
I'm not showing code here because the API surface is still settling and the post ages better without brittle detail. What I'll say is this: the moment your agent runs for minutes rather than seconds, file-system memory stops being an optimisation and starts being the thing that makes the architecture legible.
The Gotcha: Literal Instructions and a New Tokenizer
Two changes in 4.7 compound. If you flip your production model ID from claude-opus-4-6 to claude-opus-4-7 and do nothing else, the honest outcome is that some of your prompts will get worse.
First: 4.7 follows instructions much more literally. Prompts that relied on 4.6 reading between the lines — the soft implicit guidance that used to "just work" — land differently. If you said "summarise this concisely", 4.6 would deliver a reasonable paragraph. 4.7 may give you one sentence because you said "concisely" and it took you at your word.
Second: the tokenizer was updated. The same input text can tokenize to between 1.0× and 1.35× as many tokens as on 4.6, depending on content. Per-token pricing is identical, but effective cost per call is not.
```typescript
// 4.6-era prompt — implicit, assumes the model fills gaps
const v46 = `Summarise this audit for the business owner. Keep it professional.`

// 4.7 retune — explicit structure, length, and audience framing
const v47 = `You are writing a summary for a non-technical business owner.
Structure: one-paragraph overview (3–4 sentences), then three bullet points
on the most important fixes. Do not mention technical metrics by name
unless the owner will have heard of them. Do not use the word "optimise".`
```
The two changes compound: re-measure token counts before assuming your bill stays flat, and budget a retune pass for every production prompt before flipping the model ID.
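When re-measuring, I budget against the worst case. The 1.35x factor is the upper bound quoted above; the per-call figures are illustrative and the helper is my own:

```typescript
// Worst-case input cost on 4.7 for a prompt measured at `tokensOn46`
// tokens under the old tokenizer, applying the 1.35x upper bound.
const INPUT_USD_PER_MTOK = 5
const WORST_CASE_TOKENIZER_FACTOR = 1.35

function worstCaseInputCostUSD(tokensOn46: number): number {
  const tokensOn47 = Math.round(tokensOn46 * WORST_CASE_TOKENIZER_FACTOR)
  return (tokensOn47 * INPUT_USD_PER_MTOK) / 1_000_000
}

// A 100K-token prompt can cost up to $0.675 in input on 4.7,
// versus $0.50 on 4.6 at the same per-token price.
console.log(worstCaseInputCostUSD(100_000)) // 0.675
```

Run this against your p95 prompt sizes per endpoint and you have the upper bound of the bill change before you flip a single model ID.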
Claude Code Gets /ultrareview
A small one, but I've been using it enough to mention. Claude Code ships a new slash command — /ultrareview — that runs a dedicated code review pass using 4.7. Pro and Max users get three of them free; Auto mode extends to Max users.
The ones I've run on the auditor caught a concurrency bug in the Inngest step that validated email tokens, and an over-abstracted wrapper around the Supabase client that I'd have merged without thinking about it. It is more opinionated than the default review and it takes long enough to run that you will feel the effort tier working. Worth it on anything touching money, authentication, or state machines.
The Benchmarks in One Table
| Benchmark | Opus 4.7 | Opus 4.6 |
|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% |
| Terminal-Bench 2.0 | 69.4% | 65.4% |
| GPQA Diamond | 94.2% | 91.3% |
| Finance Agent | 64.4% | 60.7% |
| CursorBench | 70% | 58% |
| XBOW visual acuity | 98.5% | 54.5% |
| Databricks OfficeQA Pro | 21% fewer errors | baseline |
A few caveats before you put too much weight on any one row. XBOW's visual-acuity figure comes from XBOW's own testing, quoted by Anthropic. Databricks OfficeQA Pro is a relative error reduction, not an absolute score. CursorBench is quoted without published methodology. SWE-bench Verified, Terminal-Bench 2.0, GPQA Diamond, and Finance Agent match the numbers in Anthropic's charts.
The only benchmark that actually matters is your workload. Mine is the auditor, and the quality jump on the recommendation step was visible without running a formal eval. That is rare.
What I'm Actually Changing
I came into this upgrade expecting a modest lift. A few benchmark points, maybe a vision tweak. What I got instead is a model that changes two architectural decisions I'd already made for the auditor.
Decision one: recommendation quality at scale. Task budgets mean I can keep the "free audit" as a real lead-gen offer without capping input aggressively. The unit economics work now.
Decision two: image analysis as a first-class step. Higher-res image input plus the visual-acuity capability jump means I can stop running OCR alongside screenshots. One Claude call does what three services used to do.
What I'm watching: instruction-following regressions on prompts I haven't retuned yet, and tokenizer-driven cost drift on high-volume endpoints. If xhigh becomes my default over high for synthesis work — and it looks like it will — the shape of my Anthropic bill changes too. The per-token price didn't move; the per-call price will.
Point releases don't usually force a retune. This one does. Budget a day for it and you come out ahead.
At Tally Digital we build AI-powered tools for NZ businesses — including the Website Auditor this post is built around. If you're looking at integrating the Claude API into your product and want to skip the trial-and-error, get in touch.