Gemma 4 Goes Apache 2.0: Google's Multimodal On-Device AI Model Explained

Niels, Co-founder
Published 11 Apr 2026 · Updated 13 Apr 2026

Google DeepMind just made a major move with Gemma 4: four open-weight models released under the Apache 2.0 license, capable of processing text, images, video, and audio, and designed to run on everything from data centers to mobile phones. This is a significant shift. The Gemma family existed before, but under a restrictive license that held back many businesses. By switching to Apache 2.0, Google removes the commercial and redistribution barriers: you can now modify, redistribute, and commercialize these models without field-of-use restrictions.

The announcement, authored by Clement Farabet (VP of Research at Google DeepMind) and Olivier Lacombe (Group Product Manager), explicitly positions Gemma 4 as a response to developer demand for more "digital sovereignty" over their data, infrastructure, and deployment choices. This is not just marketing talk: Apache 2.0 is the same license used by projects like Kubernetes and TensorFlow, and it permits commercial use, modification, and redistribution with no conditions beyond standard attribution and notice requirements.

The Gemma ecosystem has already passed 400 million downloads since its inception and counts more than 100,000 community variants in the "Gemmaverse." Gemma 4 builds on that momentum with a substantial leap in benchmark performance, multimodality, and on-device efficiency.

What Are the 4 Gemma 4 Models and Which One Should You Pick?

Gemma 4 – official Google AI page

| Model | Effective params | Total params | Architecture | Context | Typical use |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2.3B | 5.1B | Dense | 128K | Mobile, embedded |
| Gemma 4 E4B | 4.5B | 8B | Dense | 128K | Lightweight apps, edge |
| Gemma 4 31B | 30.7B | 30.7B | Dense | 256K | Workstation, single GPU |
| Gemma 4 26B A4B | 3.8B active | 25.2B | MoE | 256K | Server, batch, production |

Gemma 4 comes in four sizes spanning two architecture families: dense and Mixture-of-Experts (MoE).

The dense models include E2B (2.3 billion effective parameters, 5.1 billion with embeddings), E4B (4.5 billion effective, 8 billion with embeddings), and 31B (30.7 billion parameters). The first two are built for edge and mobile deployment. They accept text, images, video, and audio as input. The 31B is the most powerful dense model, limited to text and images.

The fourth model is the 26B A4B, a Mixture-of-Experts with 25.2 billion total parameters but only 3.8 billion active per inference token. It uses a configuration of 128 routed experts with 8 active per token, plus one shared expert. This architecture delivers near-31B performance while activating only a fraction of the parameters.
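
To make that routing concrete, here is a minimal sketch of the general top-k MoE pattern: score every expert, keep the best k, mix their outputs with softmax weights, and always add the shared expert. This is an illustration of the technique, not Gemma 4's actual implementation; the toy linear experts and dimensions are invented for the example.

```python
import numpy as np

def moe_layer(x, router_w, experts, shared_expert, k=8):
    """x: (d,) token activation; router_w: (d, n_experts)."""
    logits = x @ router_w                  # one routing score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()            # softmax over the selected experts only
    out = shared_expert(x)                 # the shared expert sees every token
    for g, i in zip(gates, top):
        out = out + g * experts[i](x)      # weighted sum of routed expert outputs
    return out

# Toy usage: 128 tiny linear "experts", 8 active per token.
d, n_experts = 64, 128
rng = np.random.default_rng(0)
experts = [(lambda v, W=rng.normal(size=(d, d)) / d: v @ W) for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), rng.normal(size=(d, n_experts)), experts, lambda v: v)
print(y.shape)  # (64,)
```

The cost argument follows directly: each token pays for 8 routed experts plus the shared one instead of all 128, which is how 25.2 billion stored parameters yield only 3.8 billion active ones.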

How to choose: if you are building a mobile or embedded application, E2B or E4B are your candidates. If you have a workstation with a decent GPU (24 GB VRAM is sufficient when quantized), the 26B A4B offers the best performance-to-cost ratio. If you want maximum quality locally, the 31B is the obvious choice, but it requires more memory.

For context length, the small models support 128K tokens and the large ones 256K tokens, enabling processing of very long documents or extended conversations.

How to Deploy Gemma 4 Locally on Your Own Hardware

One of Gemma 4's main selling points is how easy it is to deploy locally. Google worked with the major inference frameworks to guarantee day-one support.

Weights are available on Hugging Face, Kaggle, and Ollama. Framework support covers vLLM, llama.cpp, MLX (for Apple Silicon), Transformers, and even ONNX exports for browser or edge deployment.
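
If you already use the Transformers library, getting a first response takes a few lines. A minimal sketch, assuming the checkpoints follow Google's usual Hugging Face naming; the model ID below is a guess, so check the official Gemma 4 collection for the exact repository names:

```python
from transformers import pipeline

# Hypothetical model ID for the E4B variant -- verify against the official collection.
generator = pipeline(
    "text-generation",
    model="google/gemma-4-e4b",
    device_map="auto",   # place weights on GPU if one is available
    torch_dtype="auto",  # use the checkpoint's native precision
)

messages = [{"role": "user", "content": "Summarize the Apache 2.0 license in two sentences."}]
out = generator(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```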

Here are the approximate memory requirements for weights only:

| Model | BF16 | SFP8 | Q4 |
|---|---|---|---|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 26B A4B | 48 GB | 25 GB | 15.6 GB |
| 31B | 58.3 GB | 30.4 GB | 17.4 GB |

These figures do not include the KV cache for long context or software overheads. In practice, with an RTX 4090 GPU (24 GB), you can comfortably run the 26B A4B in Q4 quantization.
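
The figures roughly follow from the rule of thumb weights ≈ parameter count × bits per weight / 8. A quick sanity check in Python (a sketch: published numbers differ by a few percent because quantized formats also store per-block scales, roughly 5 effective bits per weight for Q4):

```python
# Back-of-the-envelope weight-memory estimate, reported in GiB.
GIB = 2**30

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

for name, total_params in [("E2B", 5.1), ("E4B", 8.0), ("26B A4B", 25.2), ("31B", 30.7)]:
    print(f"{name}: BF16 ~{weight_gib(total_params, 16):.1f} GB, "
          f"8-bit ~{weight_gib(total_params, 8):.1f} GB, "
          f"Q4 ~{weight_gib(total_params, 5):.1f} GB")  # ~5 bits incl. scales
```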

For Apple Silicon users, MLX supports TurboQuant, which promises the same accuracy as the uncompressed baseline with roughly 4x less active memory and significantly faster performance. Deployment via llama.cpp also works directly for local applications like LM Studio or Jan.
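
On a Mac, the mlx-lm package (`pip install mlx-lm`) gives a similarly short path. A sketch, assuming an MLX-converted checkpoint gets published under a community name like the one below:

```python
from mlx_lm import load, generate

# Hypothetical repository name -- substitute the actual MLX conversion once available.
model, tokenizer = load("mlx-community/gemma-4-e4b-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain Mixture-of-Experts in one paragraph."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```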

Google also distributes Gemma 4 through AI Edge Gallery and Android AICore Developer Preview for on-device mobile use cases. This means you can integrate a capable multimodal model directly into an Android application.

Is Gemma 4 Truly Multimodal? Text, Image, Video, and Audio Capabilities

Yes, and this is one of the most significant advances over previous generations.

All Gemma 4 models accept text and images as input. The E2B and E4B variants add native audio support. Google states that all models can process video (via frame extraction), though implementation details vary by size.

The vision encoder uses learned 2D positions and multidimensional RoPE. It preserves original aspect ratios and can encode images with different token budgets (70, 140, 280, 560, 1120), letting you find the optimal trade-off between speed, memory, and quality.

In practice, testing shows Gemma 4 handles OCR, object detection, GUI pointing, audio transcription, and even video scene description. The model natively responds in JSON for detection tasks without requiring specific instructions.
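
Here is what a multimodal call could look like through the Transformers image-text-to-text pipeline. The pipeline itself is standard; the model ID, the image URL, and the assumption that the checkpoint is served through this task are ours to verify against the official documentation:

```python
from transformers import pipeline

vlm = pipeline(
    "image-text-to-text",
    model="google/gemma-4-e4b",  # hypothetical multimodal checkpoint ID
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder image
        {"type": "text", "text": "Extract the total amount as JSON."},
    ],
}]
print(vlm(text=messages, max_new_tokens=128)[0]["generated_text"])
```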

On vision benchmarks: the 31B reaches 76.9% on MMMU Pro and 85.6% on MATH-Vision. Even the small E4B scores 52.6% on MMMU Pro, which is remarkable for a model of that size. For long context, the 31B achieves 66.4% on MRCR v2 (8 needles, 128K), compared to just 13.5% for Gemma 3 27B.

Gemma 4 Benchmarks: How Does It Compare to Llama, Qwen, and Closed Models?

Gemma 4's official benchmarks are impressive and position the family as the most performant among open models of comparable size.

On MMLU Pro, the 31B reaches 85.2% and the 26B A4B hits 82.6%. On AIME 2026 (mathematical reasoning without tools), the 31B scores 89.2% and the 26B A4B 88.3%, while Gemma 3 27B only managed 20.8%. On LiveCodeBench v6 (coding), the 31B gets 80% and achieves a Codeforces Elo of 2150.

On GPQA Diamond (doctoral-level questions), the 31B scores 84.3%. On agentic benchmarks (Tau2), the 31B reaches 76.9% versus just 16.2% for Gemma 3 27B.

Google claims positions 3 and 6 among open models on the Arena AI leaderboard for the 31B and 26B variants, with Elo scores of 1452 and 1441 respectively.

These numbers put Gemma 4 in direct competition with the best open models (Qwen, Llama, DeepSeek) and sometimes beyond. The 26B A4B is particularly interesting because it achieves near-31B performance with only 3.8 billion active parameters, drastically reducing inference costs.

Why Apache 2.0 Is a Game-Changer for Enterprise AI Projects

The shift to Apache 2.0 is not a licensing detail. It is a fundamental change in what you can do with these models.

Under Gemma's previous license (and those of many competitors like Llama), there were restrictions on commercial use and redistribution, plus acceptable-use terms that created legal uncertainty. Apache 2.0 removes those barriers: beyond standard attribution and notice requirements, you can integrate Gemma 4 into a commercial product, redistribute modified versions, and fine-tune it for your use case without any sharing obligations.

For businesses deploying AI solutions on-premise, in regulated environments, or with data sovereignty constraints, this is a decisive factor. You retain full control over your data, infrastructure, and code. No dependency on a cloud API, no data leaving your servers.

The combination of a performant multimodal model, a permissive license, and easy local deployment is rare. It positions Gemma 4 as the default choice for any AI project that needs performance without compromising sovereignty.

Consider the practical implications. A healthcare company that needs to analyze medical images on-premise can now use a model that scores 61.3% on MedXPertQA multimodal, without sending any patient data to the cloud. A legal firm processing thousands of document pages can leverage the 256K context window locally, with full control over sensitive materials. A robotics startup building physical AI agents can deploy the E2B variant on NVIDIA Jetson devices for real-time multimodal inference at the edge.

The architectural innovations, particularly Per-Layer Embeddings, shared KV cache, and the MoE expert routing, are not just academic improvements. They translate directly into the ability to run sophisticated AI on hardware you already own. The E2B runs comfortably on a Raspberry Pi-class device. The 26B A4B fits on a single consumer GPU. The 31B runs on a workstation.

With an ecosystem of over 400 million downloads and massive day-one tooling support across vLLM, llama.cpp, MLX, Ollama, and ONNX, adoption should be swift. Google has also ensured compatibility with emerging agent frameworks like OpenClaw, Hermes, and Pi, meaning Gemma 4 slots directly into the agentic AI workflows that are rapidly becoming the standard development paradigm.

What Does Gemma 4 Mean for the Future of Edge AI and On-Device Intelligence?

Gemma 4 is not just a better model. It represents a structural shift in how AI capabilities are distributed. For the first time, a single model family covers the entire deployment spectrum from a Raspberry Pi to a data center GPU, all under a license that imposes zero restrictions.

The implications for physical AI are significant. NVIDIA has already confirmed that Jetson Orin Nano supports the E2B and E4B variants, enabling multimodal inference on small, embedded, and power-constrained systems. The same model family scales across the Jetson platform up to Jetson Thor, supporting robotics, smart machines, and industrial automation use cases that depend on low-latency, on-device intelligence.

For mobile developers, the Android AICore Developer Preview and AI Edge Gallery distribution mean that Gemma 4 can be integrated into Android apps with minimal friction. A shopping app that identifies products from photos, a translation app that handles speech and text, or a field service app that reads equipment labels and provides maintenance instructions, all become feasible with a single on-device model.

The question is no longer whether open models are competitive. It is which ones to use, and Gemma 4 just made the answer significantly simpler.
