India, June 7 -- Google has announced the launch of Gemma 4 12B, a dense multimodal model featuring a unified, encoder-free architecture.

Gemma 4 12B marks several milestones for local AI development. According to Google's blog post, the model uses an encoder-free multimodal design that removes the need for heavy, multi-stage vision and audio encoders. Multimodal inputs are instead fed directly into the LLM backbone, which Google says helps reduce latency when processing images, audio, and other data types.
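
For readers unfamiliar with the term, the general idea behind an encoder-free design can be sketched in a few lines of Python. This is an illustration only and not Google's implementation: the announcement as summarized here does not describe the mechanism, so the sketch assumes one common approach in which raw image patches are mapped into the model's embedding space by a single linear projection and placed in the same sequence as text tokens. Every name and size below (`patchify`, `build_sequence`, `D_MODEL`, `PATCH`, `VOCAB`) is hypothetical.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def linear(vec, weights):
    """Project vec with a (len(vec) x d_model) weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weights))
            for j in range(len(weights[0]))]

def build_sequence(text_ids, image, embed_table, patch_proj, patch):
    """Build one input sequence for the backbone with no separate vision encoder."""
    # Text tokens are looked up in the embedding table as usual.
    text_tokens = [embed_table[i] for i in text_ids]
    # Assumption: each raw patch becomes a token via one linear projection.
    image_tokens = [linear(p, patch_proj) for p in patchify(image, patch)]
    return image_tokens + text_tokens

# Toy sizes chosen for illustration; real models are far larger.
rng = random.Random(0)
D_MODEL, PATCH, VOCAB = 8, 2, 16
embed_table = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
patch_proj = [[rng.gauss(0, 1) for _ in range(D_MODEL)]
              for _ in range(PATCH * PATCH)]
image = [[rng.random() for _ in range(4)] for _ in range(4)]

sequence = build_sequence([3, 7, 1], image, embed_table, patch_proj, PATCH)
print(len(sequence), len(sequence[0]))  # 7 8: four patch tokens plus three text tokens
```

In this kind of setup the only image-specific component is one projection matrix, which is why skipping a multi-stage encoder can cut work before the backbone runs.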

The company also described it as its first medium-sized model with native audio input. Within the Gemma family, audio capabilities were previously limited to smaller edge-focused models such as E4B. With Gemma 4 12B, Google expands audio understand...