DeepMind

Gemini MoE Architecture -- Sparse expert routing for efficient trillion-parameter models

```mermaid
graph TD
    subgraph INPUT["Input Pipeline"]
        A["Input Tokens"] --> B["Embedding Layer"]
        B --> C["Transformer Layers x N"]
    end
    subgraph MOE_LAYER["Sparse MoE Layer (replaces dense FFN)"]
        C --> R["Router Network (Learned Gating)"]
        R --> |"Softmax over logits"| G["Top-K Selection (e.g. K=2 of 64)"]
        G --> E1["Expert FFN 1"]
        G --> E2["Expert FFN 2"]
        G -.-> |"Not activated"| E3["Expert FFN 3...64"]
        E1 --> W["Weighted Combination (gate scores)"]
        E2 --> W
        W --> OUT["Layer Output"]
        R --> |"Auxiliary loss"| LB["Load Balancing Loss"]
    end
    subgraph TRAINING["Training Infrastructure"]
        TPU["TPU v5p / Ironwood Pods"] --> JP["Jupiter Network Fabric"]
        JP --> DP["Data Parallelism"]
        JP --> EP["Expert Parallelism (experts across devices)"]
        DP --> JAX["XLA/JAX Compiler"]
        EP --> JAX
        JAX --> MP["Mixed Precision Training"]
        AC["AlphaChip AI-Designed Layout"] -.-> TPU
    end
    subgraph SERVING["Serving Pipeline"]
        PW["Pathways Distributed Runtime"] --> MH["Multi-Host Inferencing"]
        MH --> SD["Speculative Decoding"]
        SD --> CH{"MoE Challenge: Draft tokens activate more experts"}
        CH --> |"Verification overhead 2-3x"| S1["Adaptive Grouped Speculation"]
        CH --> S2["Context-Aware Scheduling"]
        CH --> S3["Confidence-Based Deferral"]
        MH --> VL["vLLM on TPU (PyTorch)"]
    end
    OUT --> PW
    style INPUT fill:#1c2333,stroke:#58a6ff,color:#e6edf3
    style MOE_LAYER fill:#1c2333,stroke:#3fb950,color:#e6edf3
    style TRAINING fill:#1c2333,stroke:#bc8cff,color:#e6edf3
    style SERVING fill:#1c2333,stroke:#d29922,color:#e6edf3
    style R fill:#222d3f,stroke:#3fb950,color:#e6edf3
    style G fill:#222d3f,stroke:#3fb950,color:#e6edf3
    style CH fill:#222d3f,stroke:#f85149,color:#e6edf3
    style LB fill:#222d3f,stroke:#d29922,color:#e6edf3
    style E3 fill:#161b22,stroke:#6e7681,color:#6e7681
```

Problem Formulation

How do you scale a language model to trillions of parameters without proportionally scaling inference cost? Dense transformers face a fundamental constraint: every token must pass through every parameter, making compute cost linear in model size. The goal is to decouple total model capacity from per-token compute.

  • Core insight: Not all parameters are needed for every token -- different tokens require different "expertise"
  • ML framing: Replace the dense FFN in each transformer block with a set of expert FFNs, gated by a learned router
  • Target metric: Maintain quality of a dense model with N parameters while only activating a fraction (e.g., 10%) per token
  • Example scale: A model with 1 trillion total parameters might activate only ~100 billion per token, achieving near-dense quality at a fraction of the FLOPs
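The capacity/compute split above is easy to see with back-of-envelope arithmetic. The sizes below (hidden size, expert inner size, expert count) are assumed example values for illustration, not Gemini's actual configuration:

```python
# Back-of-envelope: decoupling total capacity from per-token compute.
# All sizes are assumed example values, not Gemini's real configuration.

d_model = 8192        # hidden size (assumed)
d_ff = 32768          # expert FFN inner size (assumed)
num_experts = 64
top_k = 2

params_per_expert = 2 * d_model * d_ff          # up + down projection matrices
total_expert_params = num_experts * params_per_expert
active_expert_params = top_k * params_per_expert

print(f"total expert params / layer : {total_expert_params / 1e9:.1f}B")   # 34.4B
print(f"active params / token / layer: {active_expert_params / 1e9:.2f}B "  # 1.07B
      f"({100 * top_k / num_experts:.1f}% of total)")                       # 3.1%
```

Per-token FFN compute scales with the ~3% of expert parameters actually activated, while total capacity scales with all of them.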

MoE Architecture Deep Dive

In a standard (dense) transformer, every token passes through the same FFN layer. In a sparse MoE transformer, that FFN is replaced with multiple expert FFNs plus a router.

| Aspect | Dense Transformer | Sparse MoE Transformer |
|---|---|---|
| FFN structure | Single FFN per layer; all tokens use it | N expert FFNs per layer; router selects top-K |
| Params activated per token | 100% of model parameters | ~2-10% of total parameters (top-K experts only) |
| FLOPs per token | Proportional to total params | Proportional to K × expert size (much smaller) |
| Total capacity | Limited by compute budget | Can be 10-100x larger for the same per-token cost |
| Memory footprint | All params must be loaded | All params must be loaded (a serving challenge) |
| Expert specialization | N/A | Experts naturally specialize (e.g., code, math, language) |

Router mechanism step by step:

  • Step 1 -- Router logits: Each token embedding is multiplied by a learned router weight matrix to produce logits for each expert
  • Step 2 -- Softmax gating: Softmax is applied to get a probability distribution over experts
  • Step 3 -- Top-K selection: The K experts with highest probabilities are selected (typically K=1 or K=2)
  • Step 4 -- Expert processing: Selected expert FFNs process the token independently
  • Step 5 -- Weighted output: Expert outputs are combined using the gating probabilities as weights
  • Load balancing loss: An auxiliary loss term penalizes uneven expert utilization, preventing "expert collapse" where the router sends all tokens to a few experts
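The five routing steps plus the auxiliary loss can be sketched in a few lines of NumPy. The shapes and the loss form below follow the common Switch-Transformer-style formulation and are illustrative, not Gemini's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, top_k = 4, 8, 6, 2

tokens = rng.standard_normal((num_tokens, d_model))
w_router = rng.standard_normal((d_model, num_experts))

# Step 1: router logits; Step 2: softmax gating (numerically stable)
logits = tokens @ w_router
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# Step 3: top-K selection
topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]           # (tokens, K)
topk_gate = np.take_along_axis(probs, topk_idx, axis=-1)    # gate scores

# Steps 4-5: run each selected expert "FFN" and combine, weighted by gates
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
out = np.zeros_like(tokens)
for t in range(num_tokens):
    for k in range(top_k):
        e = topk_idx[t, k]
        out[t] += topk_gate[t, k] * (tokens[t] @ experts[e])

# Auxiliary load-balancing loss (Switch-style): f_i = fraction of tokens
# dispatched to expert i, p_i = mean router probability for expert i.
# Minimized when routing is uniform, penalizing expert collapse.
dispatch = np.zeros(num_experts)
np.add.at(dispatch, topk_idx.ravel(), 1.0)
f = dispatch / (num_tokens * top_k)
p = probs.mean(axis=0)
aux_loss = num_experts * np.sum(f * p)
print("aux load-balancing loss:", aux_loss)
```

In training, `aux_loss` is scaled by a small weight and added to the language-modeling loss, giving the router gradient pressure toward even utilization.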

Training Infrastructure

Training a sparse MoE model at Gemini scale requires specialized hardware and parallelism strategies beyond standard data parallelism.

  • TPU v5p / Ironwood pods: Google's custom AI accelerators, with Ironwood chips designed using AlphaChip (AI-driven chip layout optimization)
  • Jupiter network fabric: Custom high-bandwidth interconnect enabling fast all-to-all communication between TPU hosts -- critical for expert parallelism
  • Expert parallelism: Different experts are placed on different devices. When a token is routed to an expert, the token embedding must be sent to that expert's device. This requires all-to-all communication patterns that differ fundamentally from standard data or tensor parallelism
  • Data parallelism: Standard data parallelism is combined with expert parallelism -- each data-parallel replica has a full copy of shared parameters (attention, embeddings) but only a subset of experts
  • XLA/JAX stack: JAX with XLA compilation enables automatic optimization of the complex parallelism patterns, including sharding experts across pods
  • Mixed precision: BFloat16/FP8 training reduces memory and communication bandwidth requirements
  • AI Hypercomputer: Google's integrated architecture combining TPU hardware + Jupiter network + XLA/JAX software stack, purpose-built for models like Gemini
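The all-to-all pattern of expert parallelism can be illustrated with a toy single-process model: each "device" holds some tokens plus one expert, tokens are exchanged so every token is processed on the device owning its routed expert, and outputs return to their home devices. This is purely illustrative; real systems implement the exchange as a hardware all-to-all over the interconnect:

```python
import numpy as np

# Toy model of expert-parallel dispatch: one expert per "device".
rng = np.random.default_rng(1)
num_devices = 4
tokens_per_device = 3
d = 5

tokens = rng.standard_normal((num_devices, tokens_per_device, d))
route = rng.integers(0, num_devices, size=(num_devices, tokens_per_device))
# Trivial stand-in experts: expert e scales its input by (e + 1).
experts = [np.eye(d) * (e + 1) for e in range(num_devices)]

out = np.zeros_like(tokens)
for e in range(num_devices):
    # "All-to-all" dispatch: expert e gathers its tokens from every device...
    src = np.argwhere(route == e)               # (device, slot) pairs
    for dev, slot in src:
        # ...processes them on its own device, and the result is returned
        # to the token's home (device, slot) by the reverse all-to-all.
        out[dev, slot] = tokens[dev, slot] @ experts[e]
```

Note the contrast with tensor parallelism: no weight matrix is split here; instead whole experts live on distinct devices and the *tokens* move, which is why bisection bandwidth dominates.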

Serving and Inference Challenges

MoE models present unique serving challenges compared to dense models, particularly around memory, data movement, and speculative decoding.

Core serving challenge -- memory vs. compute:

  • All expert weights must reside in memory even though only a fraction are activated per token
  • This makes MoE models memory-bound rather than compute-bound during inference
  • Data movement (loading expert weights from HBM to compute units) becomes the bottleneck
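A roofline-style estimate makes the memory-bound claim concrete. The hardware figures below are rough, assumed numbers for a generic accelerator, not a specific TPU spec:

```python
# Why single-token MoE decoding is memory-bandwidth-bound (illustrative).
peak_flops = 400e12    # ~400 TFLOP/s bf16 compute (assumed)
hbm_bw = 1.2e12        # ~1.2 TB/s HBM bandwidth (assumed)

# One layer's activated experts at batch size 1 (example sizes):
active_params = 2 * (2 * 8192 * 32768)   # top-2 experts, up+down projections
bytes_moved = active_params * 2          # bf16 weights streamed from HBM
flops_needed = 2 * active_params         # one multiply-add per weight

t_compute = flops_needed / peak_flops
t_memory = bytes_moved / hbm_bw
print(f"compute: {t_compute * 1e6:.1f} us, memory: {t_memory * 1e6:.1f} us")
# Memory time exceeds compute time by roughly peak_flops / hbm_bw (~300x here):
# the chip spends its time streaming expert weights, not doing math.
```

Larger batches amortize the weight loads, which is exactly the lever the speculation-aware techniques below try to pull.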

Pathways distributed runtime:

  • Multi-host inferencing: Distributes model across multiple TPU hosts, enabling models too large for a single host
  • Dynamic scaling: Pathways can allocate compute resources dynamically based on load
  • Efficient routing: Implements the token routing at the system level, managing expert-to-device mapping

The speculative decoding + MoE problem:

  • Standard speculative decoding: A small draft model cheaply generates a run of candidate tokens, then the large verifier model accepts or rejects them in a single forward pass -- yielding speedups of 2-3x for dense models
  • Why it breaks with MoE: In standard autoregressive decoding, each token activates only its top-K experts. But during verification, the batch of draft tokens collectively activates many more unique experts (potentially all of them), dramatically increasing the required data movement
  • Quantified impact: Verification time can increase 2-3x because loading additional expert weights from HBM dominates compute time. When throughput gains fail to offset this overhead, speculative decoding can cause slowdowns up to 1.5x

Solutions to the MoE speculation problem:

  • Adaptive grouped speculative decoding: Group draft tokens that route to the same experts, process them together to minimize expert loading overhead
  • Context-aware scheduling: Predict which experts will be needed based on the prompt context and pre-load them
  • Divided rollout: Dynamic load balancing across devices during verification to prevent hotspots when certain experts are disproportionately activated
  • Confidence-based deferral: Only speculate when the draft model is highly confident, reducing the number of wasted expert activations from rejected tokens
  • Blockwise parallel decoding: Parallelize generation at the block level to amortize expert loading costs over multiple tokens
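The grouping idea behind adaptive grouped speculation can be sketched directly: bucket draft tokens by the set of experts they route to, so each distinct expert set is loaded once per group rather than once per token. The routing decisions below are hypothetical example data, not output from a real model:

```python
from collections import defaultdict

# token id -> frozenset of routed experts (top-2), hypothetical example data
draft_routes = {
    0: frozenset({3, 17}),
    1: frozenset({3, 17}),
    2: frozenset({5, 42}),
    3: frozenset({3, 17}),
    4: frozenset({5, 42}),
}

# Bucket tokens sharing an expert set so those experts are loaded once.
groups = defaultdict(list)
for tok, experts in draft_routes.items():
    groups[experts].append(tok)

# Naive verification loads each token's experts separately; grouped
# verification loads each distinct expert set once.
naive_loads = sum(len(e) for e in draft_routes.values())
grouped_loads = sum(len(e) for e in groups)
print(f"expert loads: naive={naive_loads}, grouped={grouped_loads}")
# → expert loads: naive=10, grouped=4
```

Real schedulers also overlap groups that share individual experts; this sketch only shows the exact-match case.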

Key Design Decisions

| Decision | Choice | Tradeoff |
|---|---|---|
| Sparse MoE vs. dense | Sparse MoE for Gemini 2.5+ | 10x more capacity at the same per-token cost, but requires all expert weights in memory and introduces routing complexity |
| Top-K selection value | Typically K=2 out of 64 experts | K=1 is cheapest but risks information loss; K=2 balances quality and efficiency; higher K approaches dense cost |
| Load balancing mechanism | Auxiliary loss in training objective | Prevents expert collapse but adds a hyperparameter (loss weight) and can slightly hurt quality if too strong |
| TPU custom silicon vs. commodity GPUs | TPU pods with custom Jupiter interconnect | Optimized for the all-to-all communication of expert parallelism; less ecosystem flexibility than NVIDIA GPUs |
| Expert parallelism strategy | Experts distributed across devices | Enables larger expert counts but requires high-bandwidth interconnect (Jupiter) for cross-device token routing |
| Speculative decoding approach | Adaptive grouped speculation with MoE-aware scheduling | Recovers speedups that naive speculation loses with MoE; adds system complexity for routing-aware batching |

Interview Talking Points

  • Capacity vs. compute decoupling: "MoE lets you build a 1T-parameter model that only uses 100B parameters per token. The key insight is that different tokens need different expertise -- a math token doesn't need the poetry weights."
  • Router as a learned component: "The router is itself a neural network trained end-to-end. It learns which experts specialize in what, and the load balancing loss prevents degenerate solutions where all tokens go to one expert."
  • Memory-bound inference: "The counterintuitive thing about MoE serving is that even though you activate fewer FLOPs per token, you need ALL expert weights in memory. This makes MoE models memory-bandwidth-bound rather than compute-bound -- the bottleneck is loading expert weights from HBM, not the actual math."
  • Speculative decoding breaks with MoE: "Speculative decoding assumes verification is cheap because you batch draft tokens. But with MoE, those draft tokens collectively activate way more experts than single-token generation -- you go from loading 2 experts to potentially dozens. The data movement overhead can actually make it 1.5x slower."
  • Expert parallelism is not tensor parallelism: "Expert parallelism has a fundamentally different communication pattern. Tensor parallelism splits a single weight matrix; expert parallelism places whole experts on different devices and requires all-to-all token routing. That's why Google built Jupiter -- you need massive bisection bandwidth."
  • Hardware-software co-design: "Google's advantage is vertical integration: AlphaChip designs the TPU layout, Jupiter provides the interconnect for expert routing, XLA/JAX compiles the parallelism strategy, and Pathways orchestrates serving. Each layer is optimized for the MoE workload pattern."
  • Why not just train a bigger dense model? "Dense models hit a wall where doubling parameters doubles both training and inference cost. MoE breaks this -- you can scale capacity sublinearly. The cost is system complexity: routing, load balancing, expert parallelism, and MoE-aware serving."
  • Practical implication for system design: "When designing for MoE serving, think about expert placement, caching hot experts, routing-aware batching, and your interconnect bandwidth. The system design is fundamentally different from dense model serving."