Gemini 3.5 Flash Lands Sixth on Android Bench, Costs the Most

Google’s Gemini 3.5 Flash scored 63.7 and ranked sixth on Android Bench, behind OpenAI’s GPT 5.5, while costing $147.1 per run, the highest in the ranking.

Google’s newest coding-focused Gemini model is the costliest entry on Google’s own Android development leaderboard. Gemini 3.5 Flash scored 63.7 on Android Bench and ranked sixth, behind OpenAI’s GPT 5.5 and Google’s older Gemini 3.1 Pro Preview. The model also carries the highest average cost per run at $147.1 among the top six in the ranking Google refreshed on June 9, 2026.

The result puts Google in an unusual spot. The Flash badge has long signaled faster answers and a smaller bill than its Pro siblings, but Android Bench inverts that pitch. The new model used more than four times the tokens of its predecessor and falls behind it on the percentage of test cases it solves. For Android developers picking a default coding assistant today, the math looks different than the Flash launch positioning suggested.

What the Android Bench Leaderboard Shows

Android Bench is Google’s official ranking of AI models on Android development tasks. The score column reflects the average percentage of 100 test cases a model successfully resolved across 10 runs, with a confidence interval printed beside each entry. Google published the latest refresh on June 9, 2026, adding Gemini 3.5 Flash and moving older entries (GPT 5.3 Codex, Claude Opus 4.6, and several Gemini models) to an archive page. The leaderboard is built from real GitHub issues in Android repositories, which Google says gives developers a model-agnostic reference for picking a coding assistant. The full methodology, including the test harness on GitHub, is published alongside the Android Bench rankings.
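The scoring rule described above (average percentage of 100 test cases resolved across 10 runs, with a confidence interval) can be sketched roughly as follows. This is an illustration only: the per-run counts are invented, the function name `summarize_runs` is hypothetical, and the normal-approximation 95% interval is an assumption, since the article does not say how Google computes its intervals.

```python
import math
import statistics


def summarize_runs(resolved_counts, total_cases=100, z=1.96):
    """Return (mean_score, half_width) for a list of per-run resolved counts.

    Each run's score is the percentage of test cases resolved. The half-width
    is a normal-approximation interval on the mean (an assumed method, not
    necessarily the one Android Bench uses).
    """
    scores = [100.0 * count / total_cases for count in resolved_counts]
    mean = statistics.mean(scores)
    # Sample standard deviation needs at least two runs.
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    half_width = z * stdev / math.sqrt(len(scores))
    return mean, half_width


# Invented counts for 10 runs of one model (NOT real leaderboard data),
# chosen only so the mean lands on a leaderboard-style figure.
example_counts = [62, 65, 63, 64, 66, 61, 64, 63, 65, 64]
mean_score, ci_half_width = summarize_runs(example_counts)
print(f"{mean_score:.1f} ± {ci_half_width:.1f}")
```

The point of the sketch is that a single published score hides run-to-run variation, which is why the interval printed beside each entry matters when two models sit close together.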

GPT 5.5 leads at 74.0, with GPT 5.4 and Gemini 3.1 Pro Preview tied at 72.4, then Claude Opus 4.7 at 68.7 and Claude Opus 4.6 at 66.6. Flash is the only model in the top six that scores below 66.

Gemini 3.5 Flash lands sixth, between Claude Opus 4.6 above and GLM 5.1 below. Here is how the top of the table breaks down, drawn directly from Google’s published figures. Cost and latency columns tell the deeper part of the story.

Model                  | Score | Avg Latency (h) | Avg Total Tokens (M) | Avg Cost ($)
GPT 5.5                | 74.0  | 15.7            | 64.7                 | 134.2
GPT 5.4                | 72.4  | 21.2            | 64.2                 | 91.7
Gemini 3.1 Pro Preview | 72.4  | 11.1            | 73.3                 | 47.9
Claude Opus 4.7        | 68.7  | 11.6            | 90.0                 | 124.3
Claude Opus 4.6        | 66.6  | 9.9             | 69.5                 | 84.4
Gemini 3.5 Flash       | 63.7  | 14.2            | 355.9                | 147.1
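One way to read the score and cost columns together is to divide average cost by score. The sketch below does that with the figures from the table; "cost per point" is an illustrative metric introduced here, not a column Google publishes, and the names `ROWS` and `cost_per_point` are hypothetical.

```python
# Figures copied from the leaderboard table above.
ROWS = [
    # (model, score, latency_h, tokens_m, cost_usd)
    ("GPT 5.5", 74.0, 15.7, 64.7, 134.2),
    ("GPT 5.4", 72.4, 21.2, 64.2, 91.7),
    ("Gemini 3.1 Pro Preview", 72.4, 11.1, 73.3, 47.9),
    ("Claude Opus 4.7", 68.7, 11.6, 90.0, 124.3),
    ("Claude Opus 4.6", 66.6, 9.9, 69.5, 84.4),
    ("Gemini 3.5 Flash", 63.7, 14.2, 355.9, 147.1),
]


def cost_per_point(row):
    """Average dollars per run divided by score (lower is better)."""
    _, score, _, _, cost_usd = row
    return cost_usd / score


# Cheapest cost per point first.
ranked = sorted(ROWS, key=cost_per_point)
for row in ranked:
    print(f"{row[0]:<24} ${cost_per_point(row):.2f} per point")
```

On this rough measure Gemini 3.1 Pro Preview comes out cheapest and Gemini 3.5 Flash most expensive, though a ratio like this ignores latency and the confidence intervals around each score.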

The Flash Name Now Costs More

Gemini 3.5 Flash’s pricing data sits at the top of the leaderboard in a way Google did not advertise at launch. The model used an average of 355.9 million total tokens per Android Bench run, more than any other entry in the top ten. By comparison, Gemini 3.1 Pro Preview used 73.3 million tokens for the same workload, and OpenAI’s GPT 5.5 used 64.7 million. The 5.5x token gap over GPT 5.5 translates into the highest average cost in the top six: $147.1 per full benchmark run.

  • 355.9M average tokens per Android Bench run (highest in top ten)
  • $147.1 average cost per run (highest in top six)
  • 14.2 hours average latency (third-slowest in top six, behind GPT 5.4 and GPT 5.5)

The Flash badge, historically tied to Google’s speed-and-value play, sits at the opposite end of both columns here. Per 9to5Google’s June 12 breakdown, the pricing picture is the headline finding. The same report finds GPT 5.5 ranks similarly in raw cost per run, but Flash used about 5.5 times as many tokens to get there.
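The token and cost figures quoted above also imply a blended price per million tokens for each model. The sketch below derives it along with the token ratios; it assumes the leaderboard's "total tokens" and "cost" columns cover the same set of tokens, which the article does not confirm, and the names `RUNS` and `blended_rate` are hypothetical.

```python
# (average total tokens in millions, average cost in dollars) per run,
# taken from the figures quoted in this article.
RUNS = {
    "Gemini 3.5 Flash": (355.9, 147.1),
    "Gemini 3.1 Pro Preview": (73.3, 47.9),
    "GPT 5.5": (64.7, 134.2),
}


def blended_rate(tokens_m, cost_usd):
    """Dollars per million tokens, averaged over all token types."""
    return cost_usd / tokens_m


flash_tokens, _ = RUNS["Gemini 3.5 Flash"]
# How many times more tokens Flash used than each model.
token_ratio = {name: flash_tokens / tokens for name, (tokens, _) in RUNS.items()}
# Implied blended dollars per million tokens for each model.
rates = {name: blended_rate(tokens, cost) for name, (tokens, cost) in RUNS.items()}

for name in RUNS:
    print(f"{name:<24} {token_ratio[name]:.1f}x fewer tokens than Flash, "
          f"${rates[name]:.2f} per million tokens")
```

Under that assumption Flash is the cheapest of the three per token but the most expensive per run, which is consistent with the article's point that token volume, not unit price, drives its bill.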

An Older Gemini Already Beats It

Google does not need an external rival to find a stronger option on this list. Gemini 3.1 Pro Preview, the model Gemini 3.5 Flash was positioned to succeed, scored 72.4 on Android Bench against Flash’s 63.7. It also completed the 100-task run in 11.1 hours versus Flash’s 14.2. The older preview was tested on February 28, 2026, more than two months before Flash’s May 20, 2026 entry, and it still costs an average of $47.9 per run.

For an Android developer weighing which model to default to, the comparison is unusually clean. The newer Flash model is slower, less accurate, and more expensive on Google’s own benchmark. Google framed Flash at launch as ‘a cheaper and faster alternative to Gemini 3.1 Pro,’ per a breakdown of the Android Bench refresh. Android Bench shows Flash trailing by 8.7 percentage points in test cases resolved, with the older preview running at around a third of Flash’s per-run cost.
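The head-to-head numbers can be checked with a few lines of arithmetic. The sketch below separates the absolute gap (percentage points of test cases resolved) from the relative shortfall, two figures that are easy to conflate; the `Entry` class and variable names are hypothetical, and the inputs are the leaderboard figures quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score: float      # average % of test cases resolved
    latency_h: float  # average hours per run
    cost_usd: float   # average dollars per run


flash = Entry("Gemini 3.5 Flash", 63.7, 14.2, 147.1)
pro = Entry("Gemini 3.1 Pro Preview", 72.4, 11.1, 47.9)

# Absolute gap, in percentage points of test cases resolved.
point_gap = pro.score - flash.score
# Relative shortfall: the same gap as a share of the older model's score.
relative_gap = point_gap / pro.score
# The older model's cost as a fraction of Flash's cost.
cost_fraction = pro.cost_usd / flash.cost_usd
# Extra hours Flash needs per run.
extra_hours = flash.latency_h - pro.latency_h

print(f"score gap:  {point_gap:.1f} points ({relative_gap:.1%} relative)")
print(f"cost:       {pro.name} runs at {cost_fraction:.0%} of Flash's cost")
print(f"latency:    Flash takes {extra_hours:.1f} hours longer per run")
```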

Google has not publicly responded to the Android Bench placement of Gemini 3.5 Flash. The model remains the default in the Gemini app and AI Mode, and Gemini 3.5 Pro sits on Google’s internal roadmap.

What Google Pitched on Launch Day

Google introduced Gemini 3.5 Flash on May 19, 2026 at Google I/O, positioning it as the opening release of a new flagship family. Sundar Pichai, CEO of Google and Alphabet, announced the model during the keynote, calling it the first of the 3.5 series and confirming Gemini 3.5 Pro would follow.

Google’s own Gemini 3.5 launch announcement lays out a different set of strengths than Android Bench captures. The company says Flash ‘outperforms Gemini 3.1 Pro on challenging coding and agentic benchmarks like Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo) and MCP Atlas (83.6%), and lead in multimodal understanding (84.2% on CharXiv Reasoning).’ Google also claims Flash is 4 times faster than other frontier models when measured in output tokens per second.

The internal benchmarks cover agentic coding harnesses and multimodal reasoning, not Android-specific code generation. Android Bench scores code written against real GitHub issues in Android repositories, a narrower and more production-shaped workload.

[Gemini] 3.5 Flash is especially good when deploying multiple agents simultaneously and completing long-running tasks with massive improvements in coding and tool use. It can independently execute complex coding pipelines or manage iterative research projects entirely by itself.

The quote comes from Koray Kavukcuoglu, CTO of Google DeepMind and Chief AI Architect at Google, at a pre-Google I/O 2026 media briefing, as reported by Mashable. Google’s launch post pairs that pitch with internal benchmark wins on Terminal-Bench 2.1 (76.2%) and the GDPval-AA agentic eval. Android Bench, which Google also runs, gives a different reading on Android coding performance.

The Limits of Any Public Leaderboard

Public Android coding benchmarks carry known constraints. Zencoder CEO Andrew Filev told The New Stack that open benchmarks like Android Bench are valuable but face data contamination, since ‘public repositories leak into training,’ and that small framing changes can reorder rankings. Filev said his team has seen models that cluster within a few points on public evals spread dramatically on private benchmarks built to mimic the same workload. In Zencoder’s own research, a small change in how the team framed test cases shifted the model spread from six percentage points to 26 and completely reordered the rankings.

The Android Bench workload itself is built from public GitHub Android repositories, which makes the data contamination concern concrete. Google’s leaderboard page says the score reflects real challenges of varying difficulty, sourced from public Android repositories, so the source material overlaps with what models are trained on. For Android developers choosing between Flash and 3.1 Pro Preview, the leaderboard gives a clean head-to-head, but it remains one test of one workload.

What Android Bench Hasn’t Tested Yet

Google has not published Android Bench results for Claude Opus 4.8 or the Fable 5 model that Anthropic and other labs shipped in June 2026, so neither appears on the current leaderboard. The leaderboard is updated when Google tests new entrants, so future refreshes could move Flash up or down as those models are evaluated. Sundar Pichai said at the pre-briefing that Gemini 3.5 Pro is being used internally and will roll out ‘next month,’ putting its Android Bench entry in the same June-July window as Claude Opus 4.8. Gemini’s role across the wider Android stack is also expanding. Google is pushing Gemini Go into 2GB budget phones in 180 countries, and Apple is paying Google to power its rebuilt Siri on Gemini.

For Android developers choosing a default model today, the data points to the older Gemini 3.1 Pro Preview or one of the GPT 5.x models over Flash for raw Android coding work. On Google’s own leaderboard, all five models ranked above Flash beat it on cost as well as score.

Flash may still hold advantages on agentic harnesses and multimodal tasks outside the Android Bench workload, and Google’s launch partners (Shopify, Macquarie Bank, Salesforce, Ramp, Xero, Databricks) are piloting it in those contexts. The benchmark gap is also a moving target as new entries arrive. Android Bench will publish its next refresh once those models are tested.

Frequently Asked Questions

What is Android Bench?

Android Bench is Google’s official leaderboard for ranking AI models on Android development tasks. The score column reflects the average percentage of 100 test cases a model successfully resolved across 10 runs, with confidence intervals beside each entry. Google builds the workload from real GitHub issues in Android repositories. The full test harness is also published on GitHub.

How did Gemini 3.5 Flash score on Android Bench?

Gemini 3.5 Flash scored 63.7 and ranks sixth on the leaderboard Google refreshed on June 9, 2026. GPT 5.5 leads the pack at 74.0, with GPT 5.4 and Gemini 3.1 Pro Preview sitting at 72.4 each. Flash also runs with a 14.2-hour average latency, slower than three of the five models ahead of it.

Why does Gemini 3.5 Flash cost more than other models on the leaderboard?

Flash used an average of 355.9 million tokens per Android Bench run, the highest in the top ten. By comparison, GPT 5.5 used 64.7 million and Gemini 3.1 Pro Preview used 73.3 million. That token volume drove an average cost of $147.1 per full benchmark run, the highest in the top six. The leaderboard also lists DeepSeek V4 Pro at the lowest cost ($13.7) for comparison.

When will Gemini 3.5 Pro be available?

Sundar Pichai said at the Google I/O 2026 pre-briefing that Gemini 3.5 Pro is being used internally and ‘will roll out to everyone next month.’ Google’s launch post confirms Pro is in internal use ahead of a wider release. The exact date for the public rollout has not been announced.

Should Android developers use Gemini 3.5 Flash?

On Google’s own Android Bench numbers, no. The older Gemini 3.1 Pro Preview scored higher, ran faster, and cost around a third as much per run. For raw Android coding today, the data points to Gemini 3.1 Pro Preview or one of the GPT 5.x models over Flash. Flash may still hold advantages on agentic and multimodal tasks outside the Android Bench workload.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and freelance bylines covering Android devices, privacy-focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.
