HAG-MoE introduces a novel architectural primitive: a Mixture-of-Experts layer in which the routing hierarchy is derived entirely from the pre-existing multi-head attention structure — not from separately learned gate networks — and in which the dynamic number of activated experts per token is controlled by the Shannon entropy of the attention distribution. Expert selection feeds back to modulate attention output through lightweight identity embeddings, creating a principled bidirectional coupling between contextual focus and computational allocation.
Mixture-of-Experts (MoE) architectures achieve computational sparsity by routing each token to a small subset of expert FFNs. State-of-the-art routers are scalar linear projections that operate independently at each layer, treating all tokens uniformly in terms of the number of activated experts, and having no bidirectional coupling with the attention mechanism that precedes them. We identify three structural problems: (1) the routing hierarchy is an artificially added learned component, separate from the rich representational structure already encoded in multi-head attention; (2) the fixed top-k cardinality ignores token-level contextual uncertainty, as measured by attention entropy; and (3) the feedforward path is unidirectional — expert selection never informs the attention computation.
We propose HAG-MoE (Hierarchical Attention-Gated MoE), which addresses all three through a unified principle: use the attention mechanism itself, rather than auxiliary learned gates, to govern expert routing. HAG-MoE (i) partitions existing attention heads into coarse and fine sets, using their averaged distributions as a natural two-level routing hierarchy with zero added gate parameters; (ii) uses per-token Shannon entropy of the coarse attention distribution as the dynamic cardinality signal; and (iii) feeds expert selection back through lightweight expert identity embeddings to modulate the expert output.
- Motivation and Prior Art Gap
- Related Work and Positioning
- Architecture
- Mathematical Formulation
- Theoretical Properties
- Training Objective
- Implementation Details
- Repository Structure
- Ablation Roadmap
- References
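Standard sparse MoE layers (Shazeer et al., 2017; Fedus et al., 2022) replace the FFN sublayer with a gated sum over a small set of expert FFNs, where a learned router scores the experts for each token and only the top-$k$ are evaluated.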
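Problem 1: Artificial routing hierarchy. Hierarchical MoE models (HMoE, arXiv:2410.02935; SAGE, 2025) introduce two-stage learned gates: a coarse gate that selects an expert group and a fine gate that selects experts within that group. Both are added learned components, separate from the representational structure already encoded in multi-head attention.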
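Problem 2: Uniform activation cardinality. Most MoE systems activate a fixed number $k$ of experts per token, ignoring token-level contextual uncertainty as measured by attention entropy.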
Problem 3: Unidirectional attention-expert relationship. Every prior work is one-directional: attention output informs routing (DA-MoE, DASG-MoE, RMoE) or routing modifies attention structure (SwitchHead). None creates a within-layer bidirectional coupling where expert assignment feeds back to calibrate the attention-derived output. RMoE (Qiu et al., ICLR 2025) creates cross-layer routing state via GRU — this is recurrence across layers, not within-layer expert-to-attention feedback.
HAG-MoE addresses all three simultaneously.
| Model | Hierarchical Routing | Dynamic $K$ | Attn↔Expert Feedback |
|---|---|---|---|
| Switch Transformer (2022) | ✗ flat, top-1 | ✗ | ✗ |
| Mixtral (2024) | ✗ flat, top-2 | ✗ | ✗ |
| DeepSeekMoE (2024) | Partial (shared+routed) | ✗ | Separate MLA |
| HMoE (2410.02935) | ✓ 2 learned gates | ✗ | ✗ |
| DA-MoE (2409.06669) | ✗ | ✓ importance-based | One-directional |
| RMoE (2408.06793, ICLR 2025) | ✗ | ✗ | Cross-layer GRU |
| SAGE (2511.18493) | ✓ shared+fine-grained | ✗ | ✗ |
| DASG-MoE (2509.10530) | ✗ | ✗ | Attn→routing only |
| SMoE-Attention (2505.00792) | ✗ | ✗ | Graph similarity |
| Quadratic MoE (2410.11222) | ✗ | ✗ | Math unification |
| GateTS (2508.17515) | ✗ | ✗ | Inspired, time-series |
| HAG-MoE (ours) | ✓ from MHA heads, zero extra gates | ✓ entropy-based | ✓ bidirectional within-layer |
Key distinctions from the closest neighbors:
vs. DA-MoE: DA-MoE uses raw attention weight magnitudes for token importance → variable K. HAG-MoE uses attention entropy (high when diffuse, not correlated with magnitude) for cardinality + attention distributions from partitioned head groups for hierarchical routing. Three different mechanisms.
vs. HMoE and SAGE: Both use separately trained coarse gates to define routing hierarchy. HAG-MoE derives hierarchy from the pre-existing head structure — zero additional gate parameters at the hierarchy level. The hierarchical signal source is fundamentally different.
vs. RMoE: RMoE propagates routing state across layers via GRU hidden states. HAG-MoE's feedback is within a single layer (expert embeddings modulate FFN output). These are orthogonal and composable — HAG-MoE can be stacked on top of RMoE.
vs. DASG-MoE: Uses grouped MHA attention weights and forwards them to MoE routing (one-directional). Does not use entropy for cardinality, does not partition heads into coarse/fine, no feedback modulation.
A standard transformer block:
```
X'  = LayerNorm(MultiHeadAttn(X) + X)
X'' = LayerNorm(FFN(X') + X')
```
HAG-MoE replaces FFN with a Hierarchical Attention-Gated MoE block. The attention sublayer is untouched, but its internal attention weight matrices {A^(h)} are extracted and used to govern routing:
```
X'  = LayerNorm(MultiHeadAttn(X) + X)        # standard; extract A^(h)
X'' = LayerNorm(HAGMoE(X', {A^(h)}) + X')    # replaces FFN
```
Seven operations inside HAGMoE:
1. Head partition: Hc, Hf ← partition(H heads)
2. Attention aggregation: a_i^c, a_i^f ← aggregate(A^h, Hc, Hf)
3. Entropy cardinality: K_i ← EntropyGate(H(a_i^c))
4. Coarse routing: g_i ← CoarseGate(a_i^c, X')
5. Fine routing: {e, s} ← FineGate(a_i^f, X', g_i, K_i)
6. Expert compute: o_i ← sum_e s_e * E_e(x_i')
7. Feedback modulation: o_i ← o_i * (1 + gamma * tanh(r_i))
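Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the fixed half/half split shown in the architecture diagram and a simple per-token mean over each head group. The names `partition_heads` and `aggregate` are hypothetical and do not come from the repository.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # one attention matrix A^(h): n rows, each a distribution over n keys


def partition_heads(num_heads: int) -> Tuple[List[int], List[int]]:
    """Fixed split: the first half of the heads is coarse (Hc), the second half fine (Hf)."""
    half = num_heads // 2
    return list(range(half)), list(range(half, num_heads))


def aggregate(attn: List[Matrix], heads: List[int]) -> Matrix:
    """Average the attention rows of the given heads, token by token."""
    n = len(attn[0])
    return [
        [sum(attn[h][i][j] for h in heads) / len(heads) for j in range(n)]
        for i in range(n)
    ]


# Toy input: H = 4 heads, n = 2 tokens (values are illustrative only).
attn = [
    [[1.0, 0.0], [0.5, 0.5]],  # head 0
    [[0.5, 0.5], [0.5, 0.5]],  # head 1
    [[0.0, 1.0], [1.0, 0.0]],  # head 2
    [[0.0, 1.0], [0.5, 0.5]],  # head 3
]
coarse_heads, fine_heads = partition_heads(len(attn))
a_coarse = aggregate(attn, coarse_heads)  # a_i^c for every token i
a_fine = aggregate(attn, fine_heads)      # a_i^f for every token i
```

Because every input row is a probability distribution, each averaged row is one as well, which is what allows the entropy in step 3 to be computed directly on `a_coarse`.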
Input X ∈ ℝⁿˣᵈ
│
▼
┌───────────────────────────────────────────┐
│ Multi-Head Attention │
│ H heads → A^(h) ∈ ℝⁿˣⁿ (extracted) │
│ Output X' (with residual) │
└───────────┬───────────────────────────────┘
│ X' │ {A^(h)}
│ ▼
│ ┌─────────────────────────────┐
│ │ HEAD PARTITION │
│ │ Hc = {1,...,H/2} │
│ │ Hf = {H/2+1,...,H} │
│ └──────┬──────────┬───────────┘
│ │ │
│ a_i^c (coarse) a_i^f (fine)
│ │ │
│ ┌─────── ▼──┐ ┌────▼──────────────────┐
│ │ ENTROPY │ │ COARSE GATE │
│ │ H(a_i^c) │ │ g_i = TopG(W_g · c_i^c)│
│ └─────────┬─┘ └────────────┬───────────┘
│ │ │
│ K_i ←──┘ Expert group g_i
│ (dynamic cardinality) │
│ ┌──────────▼──────────────┐
│ │ FINE GATE (in g_i) │
│ │ {e,s} = TopK(W_e·c_i^f)│
│ └──────────┬──────────────┘
│ │
│ ┌──────────▼──────────────┐
│ │ EXPERT COMPUTE │
│ │ o_i = Σ s_e · E_e(x_i')│
│ └──────────┬──────────────┘
│ │
│ ┌──────────▼──────────────┐
│ │ FEEDBACK MODULATION │
│ │ r_i = W_r(Σ s_e · w_e) │
│ │ õ_i = o_i ⊙(1+γ·tanh r_i)│
│ └──────────┬──────────────┘
└──────────────────┬───────────┘
▼
LayerNorm(õ + X') → Output X''
| Symbol | Meaning |
|---|---|
| $n$ | Sequence length |
| $d$ | Model dimension |
| $H$ | Number of attention heads |
| | Total number of experts |
| | Number of expert groups |
| | Experts per group |
| $K_{\min}$, $K_{\max}$ | Min/max active experts per token |
| $\mathbf{A}^{(h)} \in \mathbb{R}^{n \times n}$ | Attention weight matrix for head $h$ |
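Partition the $H$ attention heads into a coarse set $H_c = \{1, \dots, H/2\}$ and a fine set $H_f = \{H/2+1, \dots, H\}$.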
The partition is fixed (not learned). It is motivated by the empirical finding that attention heads exhibit functional specialization (Clark et al., 2019; Voita et al., 2019). The index-based split additionally assumes that lower-indexed heads in each layer tend to track positional/syntactic structure while higher-indexed heads capture semantic content; this assumption should be validated per layer with probing classifiers.
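For token $i$, average the attention rows over each head group:

$$\mathbf{a}_i^c = \frac{1}{|H_c|} \sum_{h \in H_c} \mathbf{A}^{(h)}_{i,:}, \qquad \mathbf{a}_i^f = \frac{1}{|H_f|} \sum_{h \in H_f} \mathbf{A}^{(h)}_{i,:}$$

Compute the attention-weighted context representations:

$$\mathbf{c}_i^c = \sum_{j=1}^{n} a_{ij}^c \, \mathbf{x}'_j, \qquad \mathbf{c}_i^f = \sum_{j=1}^{n} a_{ij}^f \, \mathbf{x}'_j$$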
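For token $i$, compute the Shannon entropy of the coarse attention distribution:

$$H(\mathbf{a}_i^c) = -\sum_{j=1}^{n} a_{ij}^c \log a_{ij}^c$$

Normalize to $[0, 1]$:

$$\tilde{H}_i = \frac{H(\mathbf{a}_i^c)}{\log n}$$

Compute dynamic expert cardinality:

$$K_i = K_{\min} + \left\lfloor (K_{\max} - K_{\min}) \, \sigma\big(\alpha (\tilde{H}_i - \mu_H)\big) \right\rfloor$$

where $\sigma$ is the logistic sigmoid, $\alpha$ sets the steepness of the transition, and $\mu_H$ is the normalized-entropy value at its midpoint.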
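The entropy gate can be traced by hand on two toy attention rows. The sketch below mirrors, in plain Python, the PyTorch snippet given under Implementation Details. `entropy_cardinality` is a hypothetical helper, and the values of `alpha` and `mu_h` are assumptions chosen only for illustration.

```python
import math
from typing import List


def entropy_cardinality(a_c: List[float], k_min: int, k_max: int,
                        alpha: float, mu_h: float) -> int:
    """Map one coarse attention row to an expert count K_i."""
    n = len(a_c)
    # Shannon entropy, with the same 1e-8 clamp as the PyTorch snippet.
    entropy = -sum(p * math.log(max(p, 1e-8)) for p in a_c)
    norm_entropy = entropy / math.log(n)  # in [0, 1]
    gate = 1.0 / (1.0 + math.exp(-alpha * (norm_entropy - mu_h)))
    return k_min + math.floor((k_max - k_min) * gate)


# alpha and mu_h are illustrative values, not settings taken from the source.
peaked = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]
k_peaked = entropy_cardinality(peaked, k_min=1, k_max=4, alpha=10.0, mu_h=0.5)
k_diffuse = entropy_cardinality(diffuse, k_min=1, k_max=4, alpha=10.0, mu_h=0.5)
```

With these assumed settings the peaked row stays at $K_{\min} = 1$ and the uniform row gets 3 experts. Because the rule floors a sigmoid that stays strictly below 1, $K_{\max}$ is only reached if the sigmoid saturates numerically, which is worth keeping in mind when choosing $\alpha$.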
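Coarse gate (expert group selection):

$$g_i = \text{TopG}(\mathbf{W}_g \, \mathbf{c}_i^c)$$

Fine gate (expert selection within group $g_i$, keeping $K_i$ experts):

$$\{e, s_e\} = \text{TopK}(\mathbf{W}_e \, \mathbf{c}_i^f)$$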
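Each expert $E_e$ is a SwiGLU feed-forward network (Shazeer, 2020).

Mixed output:

$$\mathbf{o}_i = \sum_{e} s_e \, E_e(\mathbf{x}_i')$$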
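The coarse-to-fine selection and the mixed output can be illustrated with a small sketch. It makes several assumptions that the text does not confirm: a single group is selected per token, the scores $s_e$ are a softmax over the selected experts only, and the gate logits are passed in directly instead of being computed from $\mathbf{c}_i^c$ and $\mathbf{c}_i^f$. `route` and `mix` are hypothetical names, and scalar lambdas stand in for the SwiGLU experts.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]  # toy scalar experts stand in for SwiGLU FFNs


def route(group_logits: Sequence[float],
          expert_logits: Sequence[Sequence[float]],
          k_i: int) -> Tuple[int, List[Tuple[int, float]]]:
    """Pick the best group, then the top-k_i experts inside it with softmax scores."""
    g = max(range(len(group_logits)), key=lambda idx: group_logits[idx])
    logits = expert_logits[g]
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k_i]
    # Assumption: scores are a softmax over the selected experts only.
    exps = [math.exp(logits[e] - logits[top[0]]) for e in top]
    total = sum(exps)
    return g, [(e, x / total) for e, x in zip(top, exps)]


def mix(x: float, experts: Sequence[Sequence[Expert]], g: int,
        selected: List[Tuple[int, float]]) -> float:
    """o_i = sum_e s_e * E_e(x_i') over the selected experts of group g."""
    return sum(s * experts[g][e](x) for e, s in selected)


# Two groups of two toy experts each (illustrative only).
experts = [
    [lambda x: x + 1.0, lambda x: 2.0 * x],
    [lambda x: -x, lambda x: x * x],
]
g, selected = route(group_logits=[0.2, 1.5],
                    expert_logits=[[0.0, 0.0], [1.0, 1.0]], k_i=2)
o = mix(3.0, experts, g, selected)
```

Here group 1 wins the coarse gate, both of its experts are kept with score 0.5 each, and the mixed output is the average of $-3$ and $9$.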
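Assign each expert $e$ a learned identity embedding $\mathbf{w}_e$ and project the score-weighted sum of the selected embeddings:

$$\mathbf{r}_i = \mathbf{W}_r \Big( \sum_{e} s_e \, \mathbf{w}_e \Big)$$

Apply as output modulation:

$$\tilde{\mathbf{o}}_i = \mathbf{o}_i \odot (1 + \gamma \tanh \mathbf{r}_i)$$

where $\gamma$ is a scalar initialized to $0.0$, so that the block equals a standard MoE layer at step 0.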
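The feedback path shown in the architecture diagram, $\tilde{o}_i = o_i \odot (1 + \gamma \tanh r_i)$ with $r_i = W_r(\sum_e s_e w_e)$, can be checked numerically with a short sketch. It assumes that $W_r$ maps from the feedback dimension to the model dimension so that the product is elementwise. `feedback_modulate` is a hypothetical name, and all values are toy data.

```python
import math
from typing import List, Sequence, Tuple


def feedback_modulate(o: Sequence[float],
                      selected: Sequence[Tuple[int, float]],
                      identity: Sequence[Sequence[float]],
                      w_r: Sequence[Sequence[float]],
                      gamma: float) -> List[float]:
    """Return o * (1 + gamma * tanh(r)), with r = W_r (sum_e s_e * w_e)."""
    fb_dim = len(identity[0])
    # Score-weighted sum of the identity embeddings of the selected experts.
    pooled = [sum(s * identity[e][k] for e, s in selected) for k in range(fb_dim)]
    # Assumed projection from the feedback dimension to the model dimension.
    r = [sum(row[k] * pooled[k] for k in range(fb_dim)) for row in w_r]
    return [o_j * (1.0 + gamma * math.tanh(r_j)) for o_j, r_j in zip(o, r)]


# Toy sizes: model dim 2, feedback dim 2, three experts (illustrative only).
identity = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # w_e, one row per expert
w_r = [[1.0, 0.0], [0.0, -1.0]]                  # W_r, model dim x feedback dim
o = [2.0, 4.0]
selected = [(0, 0.5), (1, 0.5)]                  # (expert index, score s_e)

unchanged = feedback_modulate(o, selected, identity, w_r, gamma=0.0)
modulated = feedback_modulate(o, selected, identity, w_r, gamma=0.1)
```

With `gamma=0.0` the output is returned untouched, which is the property `tests/test_feedback.py` is described as checking. Because `tanh` is bounded, a nonzero `gamma` rescales each coordinate by a factor between `1 - gamma` and `1 + gamma`.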
Standard HMoE adds: coarse gate
HAG-MoE's routing overhead: coarse gate
Proposition. Under the Bayesian mixture interpretation of MoE, where
Intuition. High entropy in the attention distribution reflects a token that contextually depends on many positions — a broader and less determinate context. This corresponds to a wider posterior
The feedback term creates an additional gradient pathway from output loss to routing scores
The second term is a gradient path that flows directly from output quality to routing scores via the expert identity embeddings
| Configuration | HAG-MoE reduces to |
|---|---|
| | Standard SMoE with attention-derived fine gate |
| Learned coarse gate | Standard HMoE |
| Head partition off, importance-based $K$ | DA-MoE variant |
| All off | Switch Transformer |
HAG-MoE strictly generalizes all four: each reduction corresponds to ablating one or more of the three novel contributions, so these prior architectures are special cases of the same design.
Language modeling loss
Group load balancing
Expert load balancing
Head divergence regularizer
Minimizing
Feedback modulation regularizer
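Hyperparameter schedule: linearly ramp all loss weights ($\lambda_{\text{LB}}, \lambda_{\text{div}}, \lambda_\gamma$) over the first 5000 steps.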
| Parameter | Small | Medium | Large |
|---|---|---|---|
| Layers | 12 | 24 | 36 |
| Model dim | 512 | 1024 | 2048 |
| Attention heads | 8 | 16 | 32 |
| Expert groups | 4 | 8 | 16 |
| Experts per group | 4 | 8 | 16 |
| Total experts | 16 | 64 | 256 |
| $K_{\min}$ / $K_{\max}$ | 1 / 4 | 1 / 6 | 1 / 8 |
| Feedback dim | 64 | 128 | 256 |
| Initial $\gamma$ | 0.0 | 0.0 | 0.0 |
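The three model sizes can be captured in a small configuration object. This is a sketch only: `HAGMoEConfig` and `PRESETS` are hypothetical names and are not taken from the repository's `configs/*.yaml` files. Reading the two unlabeled rows as $K_{\min}$ / $K_{\max}$ and the initial $\gamma$ is an assumption based on the surrounding text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAGMoEConfig:
    """Hypothetical container for the sizes listed in the table above."""
    layers: int
    model_dim: int
    attention_heads: int
    expert_groups: int
    experts_per_group: int
    k_min: int
    k_max: int
    feedback_dim: int
    gamma_init: float = 0.0

    @property
    def total_experts(self) -> int:
        # The "Total experts" row is groups times experts per group.
        return self.expert_groups * self.experts_per_group


PRESETS = {
    "small": HAGMoEConfig(12, 512, 8, 4, 4, 1, 4, 64),
    "medium": HAGMoEConfig(24, 1024, 16, 8, 8, 1, 6, 128),
    "large": HAGMoEConfig(36, 2048, 32, 16, 16, 1, 8, 256),
}
```

Deriving `total_experts` instead of storing it keeps the three expert-count rows of the table consistent by construction.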
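HAG-MoE requires the per-head attention weight matrices $\mathbf{A}^{(h)}$, which fused kernels such as FlashAttention do not return. Two options: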
Option A (exact): Use standard `nn.MultiheadAttention` and call it with `need_weights=True, average_attn_weights=False` so that it returns per-head weights. Approx. 15% overhead vs FlashAttention, but the weights are exact.
Option B (efficient): Run a secondary lightweight attention computation in parallel using only
```python
import math
import torch

entropy = -(attn_c * torch.log(attn_c.clamp(min=1e-8))).sum(dim=-1)  # [B, n]
norm_entropy = entropy / math.log(n)  # normalize to [0, 1]
K_i = K_min + ((K_max - K_min) * torch.sigmoid(alpha * (norm_entropy - mu_H))).floor().int()
```

- Initialize $\gamma = 0.0$ (HAG-MoE = standard MoE at step 0)
- Gradient clip: max norm 1.0 on expert identity embeddings $\mathbf{w}_e$
- Use constant $K_i = K_{\min}$ for first 1000 steps
- Ramp $\lambda_{\text{LB}}, \lambda_{\text{div}}, \lambda_\gamma$ linearly over first 5000 steps
- Apply router z-loss to prevent logit explosion in coarse/fine gates
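The two step-based rules in the list above (constant $K_i$ for the first 1000 steps, linear ramp of the loss weights over the first 5000) can be written as small pure functions. This sketch assumes that the ramp starts at 0 and that each weight is scaled toward a target value. The function names and `lambda_lb_target` are hypothetical.

```python
def lambda_scale(step: int, ramp_steps: int = 5000) -> float:
    """Linear ramp factor for the auxiliary-loss weights (assumed to start at 0)."""
    return min(1.0, max(0.0, step / ramp_steps))


def use_entropy_gate(step: int, constant_k_steps: int = 1000) -> bool:
    """K_i is held at K_min until the constant-K phase is over."""
    return step >= constant_k_steps


def effective_k(step: int, k_entropy: int, k_min: int) -> int:
    """Expert count actually used at a given training step."""
    return k_entropy if use_entropy_gate(step) else k_min


# Hypothetical target weight for one auxiliary loss term.
lambda_lb_target = 0.01
lambda_lb_at_2500 = lambda_lb_target * lambda_scale(2500)
```

Keeping the schedules as functions of `step` alone makes them easy to unit-test and to log next to the training loss.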
Dynamic
Capacity factor padding: Set per-token capacity to
Sorted dispatch: Sort tokens by
hag-moe/
├── README.md
├── LICENSE
├── setup.py / pyproject.toml
├── requirements.txt
│
├── hagmoe/
│ ├── core/
│ │ ├── model.py # HAGMoETransformer (full model)
│ │ ├── block.py # HAGMoEBlock (single transformer block)
│ │ ├── attention.py # MultiHeadAttention with weight extraction
│ │ ├── routing.py # CoarseGate, FineGate, EntropyGate
│ │ ├── experts.py # Expert FFNs (SwiGLU)
│ │ └── feedback.py # BidirectionalFeedback module
│ │
│ ├── training/
│ │ ├── losses.py # LB loss, div regularizer, gamma regularizer
│ │ ├── trainer.py # Training loop with lambda warm-up
│ │ └── schedulers.py # LR and lambda schedules
│ │
│ ├── analysis/
│ │ ├── entropy_viz.py # Per-token entropy distribution visualization
│ │ ├── routing_viz.py # Expert routing heatmaps per layer
│ │ ├── head_partition.py # Head specialization probing classifiers
│ │ └── cardinality_stats.py # K_i distribution analysis
│ │
│ └── research/
│ ├── special_cases.py # Reduction to prior architectures
│ ├── entropy_theory.py # Entropy-cardinality theory tests
│ └── gradient_analysis.py # Feedback gradient path analysis
│
├── configs/
│ ├── small.yaml # 12L/512d/16 experts
│ ├── medium.yaml # 24L/1024d/64 experts
│ └── large.yaml # 36L/2048d/256 experts
│
├── scripts/
│ ├── train.py # Pre-training script
│ ├── evaluate.py # Evaluation on benchmarks
│ └── ablate.py # Automated ablation runner
│
└── tests/
    ├── test_routing.py         # Routing correctness
    ├── test_entropy_gate.py    # K_i bounds [K_min, K_max]
    ├── test_feedback.py        # gamma=0 init reduces to standard MoE
    └── test_special_cases.py   # All four reduction checks
The three novel contributions of HAG-MoE are independently ablatable:
| Variant | Head Partition | Entropy $K$ | Feedback | Expected insight |
|---|---|---|---|---|
| HAG-MoE (full) | ✓ | ✓ | ✓ | Full architecture |
| HAG-MoE-noFB | ✓ | ✓ | ✗ | Value of feedback path |
| HAG-MoE-fixK | ✓ | ✗ | ✓ | Value of entropy cardinality |
| HAG-MoE-learnedHier | ✗ learned gate | ✓ | ✓ | Value of head partition |
| HAG-MoE-fixK-noFB | ✓ | ✗ | ✗ | Head partition alone |
| DA-MoE baseline | ✗ | importance-$K$ | ✗ | Prior attention-informed routing |
| Standard SMoE | ✗ | ✗ | ✗ | Baseline |
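The ablation grid can be encoded as three boolean switches per variant, which makes it easy to verify that a comparison isolates a single contribution. This is a sketch with hypothetical names (`AblationFlags`, `VARIANTS`, `differs_in_one`) and is not the interface of `scripts/ablate.py`. The DA-MoE baseline is left out because its importance-based $K$ does not map onto a boolean.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AblationFlags:
    """Hypothetical switches for the three contributions."""
    head_partition: bool
    entropy_k: bool
    feedback: bool


VARIANTS: Dict[str, AblationFlags] = {
    "HAG-MoE (full)": AblationFlags(True, True, True),
    "HAG-MoE-noFB": AblationFlags(True, True, False),
    "HAG-MoE-fixK": AblationFlags(True, False, True),
    "HAG-MoE-learnedHier": AblationFlags(False, True, True),
    "HAG-MoE-fixK-noFB": AblationFlags(True, False, False),
    "Standard SMoE": AblationFlags(False, False, False),
}


def differs_in_one(a: AblationFlags, b: AblationFlags) -> bool:
    """True when two variants differ in exactly one contribution."""
    pairs = zip((a.head_partition, a.entropy_k, a.feedback),
                (b.head_partition, b.entropy_k, b.feedback))
    return sum(x != y for x, y in pairs) == 1
```

For example, `differs_in_one(VARIANTS["HAG-MoE (full)"], VARIANTS["HAG-MoE-noFB"])` is true, so that pair measures the feedback path alone.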
Evaluation benchmarks: enwik8 (BPC), WikiText-103 (PPL), MMLU (5-shot accuracy), GSM8K (accuracy), HumanEval (pass@1), GLUE average.
Key hypotheses: (1) Head partition ≥ learned coarse gate with zero added parameters; (2) entropy-based
Foundational MoE
- Jacobs et al. (1991). Adaptive Mixtures of Local Experts. Neural Computation.
- Shazeer et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. ICLR.
- Fedus, Zoph & Shazeer (2022). Switch Transformers: Scaling to Trillion Parameter Models. JMLR.
Recent MoE Architectures
- Jiang et al. (2024). Mixtral of Experts. arXiv:2401.04088.
- DeepSeek-AI et al. (2024). DeepSeek-V2/V3. arXiv:2405.04434 / 2412.19437.
- Muennighoff et al. (2024). OLMoE. arXiv:2409.02060.
Hierarchical MoE
- arXiv:2410.02935. Hierarchical MoE with Two-Stage Gating. 2024.
- Zhu et al. (2025). SAGE: Shape-Adapting Gated Experts. arXiv:2511.18493.
Attention-Informed Routing
- Aghdam et al. (2024). DA-MoE: Dynamic Expert Allocation via Attention-Derived Token Importance. arXiv:2409.06669.
- Shi et al. (2025). GateTS: Attention-Inspired Gating for Time-Series MoE. arXiv:2508.17515.
- Nguyen et al. (2025). Improving SMoE Routing with Graph of Tokens. arXiv:2505.00792.
Cross-Layer Routing
- Qiu et al. (ICLR 2025). Layerwise Recurrent Router for MoE (RMoE). arXiv:2408.06793.
Attention-MoE Unification
- arXiv:2410.11222. Quadratic Gating Functions in MoE. 2024.
- arXiv:2506.16419. Optimizing MoE Routers. 2025.
- Yang et al. (2025). Gated Attention for LLMs. arXiv:2505.06708.
Transformers and Attention Head Analysis
- Vaswani et al. (2017). Attention is All You Need. NeurIPS.
- Clark et al. (2019). What Does BERT Look At? Analysis of BERT's Attention. BlackboxNLP.
- Voita et al. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting. ACL.
- Shazeer (2020). GLU Variants Improve Transformers. arXiv:2002.05202.
Devanik Debnath
B.Tech, Electronics & Communication Engineering
National Institute of Technology Agartala
Open source under the Apache 2.0 License.
```bibtex
@article{debnath2025hagmoe,
  title     = {HAG-MoE: Hierarchical Attention-Gated Mixture of Experts},
  author    = {Debnath, Devanik},
  year      = {2025},
  note      = {Preprint. https://github.com/Devanik21/HAG-MoE},
  institute = {National Institute of Technology Agartala}
}
```

Conceived from the observation that the attention mechanism is not just a preprocessor for routing: it is the routing. HAG-MoE makes that relationship explicit, bidirectional, and mathematically grounded.