
Google Launches Gemini 3.1 Flash-Lite: Adjustable Thinking at One-Eighth the Cost of Pro

Google released Gemini 3.1 Flash-Lite on March 3, 2026, available in preview through the Gemini API. It is the fastest and most cost-efficient model in Google's lineup, designed for high-volume production workloads where cost per token matters as much as quality.

Benchmark Performance

Despite being positioned as a budget model, Flash-Lite posts strong numbers across reasoning and knowledge benchmarks. It scores 86.9% on GPQA Diamond, a graduate-level science reasoning test. It reaches 76.8% on MMMU Pro for multimodal understanding and 88.9% on MMMLU for multilingual question answering. On the Arena.ai leaderboard, it achieved an Elo score of 1432.

These scores surpass larger Gemini models from previous generations, including Gemini 2.5 Flash. A model that costs a fraction of its predecessor while outperforming it on key benchmarks represents a meaningful shift in the cost-performance curve.

Adjustable Thinking Levels

The most distinctive feature is adjustable thinking. In both AI Studio and Vertex AI, developers can select how much reasoning the model applies to each task. Lower thinking levels produce faster, cheaper responses for straightforward queries; higher levels engage deeper reasoning for complex problems.

This gives developers granular control over the speed-cost-quality tradeoff at the API level. A content moderation pipeline processing millions of messages per day can use minimal thinking. A code generation task that requires careful logic can use maximum thinking. The same model handles both, with the developer choosing the appropriate level per request.
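The per-request pattern described above can be sketched in Python. This is an illustration, not official SDK code: the model id, the `thinking_level` field name, and the call shape are assumptions based on Google's announcement, so check the Gemini API documentation for the exact parameters.

```python
# Sketch: routing each request to a thinking level by task type.
# The routing logic is real Python; the API call below it is a
# hypothetical shape, not the confirmed google-genai signature.

def thinking_level_for(task: str) -> str:
    """Pick a thinking level per request based on task type."""
    levels = {
        "moderation": "low",       # millions of messages/day: speed and cost win
        "translation": "low",
        "classification": "low",
        "codegen": "high",         # careful logic: spend tokens on reasoning
        "analysis": "high",
    }
    return levels.get(task, "medium")


def generate(client, prompt: str, task: str):
    # Assumed call shape for the Gemini API (pip install google-genai);
    # the model id is the preview name guessed from the article.
    return client.models.generate_content(
        model="gemini-3.1-flash-lite-preview",
        contents=prompt,
        config={"thinking_level": thinking_level_for(task)},
    )
```

The point of the design is that one deployment serves both ends of the spectrum: the router decides per request, so no second model or endpoint is needed.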

Speed and Pricing

Flash-Lite runs 2.5 times faster than Gemini 2.5 Flash. Google priced it at $0.25 per million input tokens and $1.50 per million output tokens, roughly one-eighth the cost of the Pro model. At this price point, it becomes viable for use cases that were previously too expensive to run through a frontier model.
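The published preview prices make the economics easy to estimate. A minimal back-of-envelope calculator, using the $0.25/$1.50 per-million rates from the announcement (the traffic figures in the example are made up for illustration):

```python
# Flash-Lite preview pricing from the announcement.
INPUT_PER_M = 0.25    # USD per million input tokens
OUTPUT_PER_M = 1.50   # USD per million output tokens

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given token volume."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Example: a day of chat traffic with 10M input and 2M output tokens.
daily = cost_usd(10_000_000, 2_000_000)  # 2.50 + 3.00 = 5.50 USD
```

At roughly one-eighth the price, routing the same traffic through the Pro model would cost on the order of $44 per day, which is the gap that makes high-volume workloads viable on Flash-Lite.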

High-volume translation, bulk content moderation, document classification, and real-time chat applications are all workloads where per-token cost is the primary constraint. Flash-Lite targets these scenarios directly.

Production Use Cases

Google highlights several intended applications: high-volume translation and localization, content moderation at scale, UI and dashboard generation, simulation creation, and complex instruction following. The model is multimodal, handling text, images, and structured data.

The combination of low cost and strong multilingual performance (88.9% MMMLU) makes it particularly relevant for applications serving global audiences. Localization workflows that previously required separate models or expensive API calls can now run through a single low-cost endpoint.

Where It Fits in Google's Model Lineup

Google's Gemini 3 family now spans a wide range: Flash-Lite for cost-sensitive high-volume work, Flash for balanced performance, and Pro for maximum capability. Each tier serves different production requirements, and the adjustable thinking feature blurs the boundaries between them by letting a cheaper model handle harder tasks when needed.

The competitive implication is clear. The floor for what a budget AI model can deliver keeps rising. Tasks that required a frontier model six months ago can now be handled by a model costing a fraction of the price, at higher speed, with comparable or better accuracy. For anyone building AI-powered products, the economics just shifted again.
