AI

Google AI Edge Puts Gemma 4 12B on the Mac for Local Apps

Google AI Edge Gallery for macOS brings Gemma 4 12B, Eloquent and LiteRT-LM to Apple Silicon, with local apps now facing a hardware test.

Published

2 months ago

June 4, 2026

Logan Pierce

Google AI Edge Gallery for macOS now lets developers run Gemma 4 12B locally on Apple Silicon, bringing a 12B multimodal model, the Eloquent dictation app and LiteRT-LM serving into workflows that can process text, audio and images without a cloud round trip. In its Gemma 4 12B launch note, Google said the model is built for laptops and released under an Apache 2.0 license.

The timing puts Google directly on Apple’s desktop at the moment Mac developers already have Apple’s own Foundation Models framework in the operating system. Local AI apps now have a larger Google model to test against the default tools built into the Mac.

The Mac Release Lands With a 12B Model

Google’s official developer blog said Wednesday, June 3, that AI Edge Gallery is now available on macOS and can use the 12B model to generate Python, run it locally and return charts inside the chat bubble. The same Gemma 4 Mac workflow post said AI Edge Eloquent is now on macOS with local dictation, text rewriting by voice command and transcription for audio or video files.

LiteRT-LM, the runtime layer behind the release, also gained a serve command that lets a Mac expose a local OpenAI-compatible endpoint. That means a coding tool can send a prompt to the machine’s own model through an application programming interface (API, a software connection point) instead of sending it to a remote service.

The release arrives two months after the broader Gemma 4 family introduced smaller E2B and E4B edge models, a 26B mixture-of-experts model and a 31B dense model. The new 12B entry fills the laptop slot between tiny mobile models and workstations. Oton’s coverage of Samsung and Google’s smart-glasses push showed one consumer surface for Gemini-era AI; the Mac release is the workbench.

Google AI Edge Gallery Gemma 4 12B macOS local AI

A 16GB Floor Keeps This in Developer Territory

Google says the model is small enough to run locally on dedicated graphics processing unit (GPU, a chip used for parallel math) laptops with 16GB of video random access memory or unified memory. On a Mac, unified memory is the shared pool used by the processor, GPU and neural hardware, so the requirement lands differently from a Windows gaming laptop with a separate graphics card.

The LiteRT-LM 12B model card gives a closer look at the Mac target. Its benchmark lists a MacBook Pro M4 GPU with a 6,235 MB model file, 7,763 MB of GPU memory use, 243.55 prefill tokens per second, 29.56 decode tokens per second and a 4.2 second time to first token.

16GB is Google’s local laptop floor for the full model.
6,235 MB is the LiteRT-LM file size listed on Hugging Face.
7,763 MB is the measured GPU memory use on the MacBook Pro M4 benchmark.
29.56 tokens/sec is the listed decode speed for that Mac benchmark.

Those numbers are good enough for local experiments and slower than cloud chat feels when a data-center model is already warm. They also explain why Google is leading with developers, demo apps and command-line serving before pitching the model as a mainstream desktop assistant.

Which Mac Workflows Get the New Tools?

Google is shipping three Mac entry points and a model distribution path at the same time. Each part touches a different habit: chat testing, dictation, terminal workflows and model choice.

Mac Piece	What Google Is Shipping	Local Job It Targets
AI Edge Gallery for Mac	A local showcase app with chat, visual input and sandboxed Python execution	Data analysis, code tests and chart creation on the machine
AI Edge Eloquent	A macOS dictation and editing app with hotkey access and Voice Edit	Speech cleanup, rewriting selected text and transcribing local files
LiteRT-LM command line interface	A terminal tool with local chat and a serve command	Connecting editors, agents and other tools to a local endpoint
Model weights	Pre-trained and instruction-tuned checkpoints through developer model hubs	Picking a runtime such as LM Studio, Ollama, llama.cpp or MLX

For developers, the command line interface (CLI, a terminal tool) is the piece that makes the Mac feel less like a demo box. It gives the model a way to sit behind tools that already expect a remote chat service, then keeps the prompt, files and response path on the machine.

The Encoder-Free Design Cuts Out Extra Model Pieces

Google’s new developer guide describes a dense multimodal model with a unified, encoder-free architecture. In plain English, the model avoids some of the helper models that usually preprocess pictures or audio before the large language model (LLM, software trained to generate and analyze language) sees them.

Vision input uses a 35M-parameter embedder. Raw 48 by 48 pixel patches are projected into the model’s hidden dimension with a single matrix multiplication.
Audio input drops the separate audio encoder. Raw 16 kHz audio is sliced into 40 ms frames with 640 floats each, then projected into the same input space as text tokens.
Fine-tuning can update the shared multimodal path because text, vision and audio feed the same weight set.

Google also ships multi-token prediction (MTP, a draft-model method for speculative decoding) with Gemma 4. In the launch note, the company says the drafter is meant to reduce latency while preserving output quality, a claim developers will test on their own files and prompts.

Apple Owns the Host Platform Google Is Targeting

Apple controls the system framework on the machines Google is targeting. Its Foundation Models framework update says developers can call a 3 billion parameter on-device model from Swift across iOS 26, iPadOS 26 and macOS 26, with offline availability and no inference charge when Apple Intelligence is enabled.

Google’s route gives Mac developers open weights, outside runtimes and a 12B model they can run through LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, SGLang, vLLM or Unsloth. Apple gives developers the shortest path to App Store-native features. Microsoft is nearby too: the Phi open model page lists Phi-4-mini and Phi-4-multimodal as small language models with text, audio and vision support available through Azure AI Foundry, Hugging Face and Ollama.

macOS gives Google a capable local machine and a demanding comparison point. Apple Silicon already runs the platform owner’s model, so any separate Google runtime has to earn space through model choice, tooling and workloads Apple does not cover well enough for developers.

Local Serving Brings the API Fight Home

With litert-lm serve, LiteRT-LM can act as a local endpoint for tools that already know how to talk to OpenAI-style services. Google’s Mac workflow post names Continue, Aider, OpenClaw, Hermes and OpenCode as examples that can point at a local endpoint.

That changes the cost and privacy habit for small jobs. A developer can keep a client CSV on a laptop, ask for a chart, inspect the generated Python and rerun the work offline. A writer can highlight a paragraph in another app and use Eloquent’s Voice Edit to reshape it without sending the audio to a cloud dictation service.

Data analysis on private files, with Python execution staying in the local Gallery sandbox.
Code help in editors that already support OpenAI-compatible endpoints.
Speech cleanup through a hotkey, including local transcription of audio and video files.
Offline demos for teams that cannot depend on conference Wi-Fi or company network access.

Hosted AI still absorbs huge capital; Oton’s recent look at Anthropic’s valuation race with OpenAI shows how much money is chasing the cloud model layer. Google’s Mac release gives developers a way to reserve hosted calls for jobs that outgrow the laptop.

The Downloads Are Available Now

Google is making the new model available through Hugging Face and Kaggle, with pre-trained and instruction-tuned checkpoints. The company also points Mac users toward Gallery, Eloquent, LiteRT-LM, LM Studio and Ollama, a spread that makes the launch feel more like a stack than a single app drop.

For Mac users, the first choice is the app path: Gallery for prompts, visual input and code; Eloquent for dictation; LiteRT-LM for a terminal endpoint developers can wire into their own stack. The release still asks users to manage model downloads, hardware limits and app maturity.

Google’s Mac downloads and model weights are available now through its AI Edge pages, Hugging Face and Kaggle.