Happy Horse Wiki

Architecture Deep Dive

How the 40-layer unified Transformer works under the hood.

Architecture at a Glance

Input:  [Text Tokens] [Image Latent] [Noisy Video] [Noisy Audio]
                            │
                    ┌───────▼───────┐
                    │  Layers 1-4   │  ← Modality-Specific
                    │  (per-modal)  │
                    ├───────────────┤
                    │  Layers 5-36  │  ← Shared Parameters
                    │  (32 layers)  │     Per-Head Gating
                    ├───────────────┤
                    │  Layers 37-40 │  ← Modality-Specific
                    │  (per-modal)  │
                    └───────┬───────┘
                            │
Output: [Denoised Video Tokens] [Denoised Audio Tokens]
                            │
                    ┌───────▼───────┐
                    │  Super-Res    │  ← 256p/540p → 1080p
                    └───────────────┘

Sandwich Architecture

The 40-layer Transformer uses a sandwich design: the first and last 4 layers use modality-specific projections (separate parameters for text, video, and audio tokens), while the middle 32 layers share parameters across all modalities. This design allows each modality to have specialized input/output processing while sharing the bulk of the model's capacity.
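The routing described above can be sketched in a few lines. This is a minimal numpy toy, not the real model: the parameter names (`per_modal_in`, `shared`, `per_modal_out`) are hypothetical, and each "layer" is a stand-in residual map rather than a full Transformer block. Only the 4 + 32 + 4 layer layout comes from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

# Hypothetical parameter layout: separate weights per modality for the
# outer layers, one shared stack for the middle 32 layers.
modalities = ["text", "video", "audio"]
per_modal_in  = {m: [rng.standard_normal((D, D)) * 0.1 for _ in range(4)] for m in modalities}
shared        = [rng.standard_normal((D, D)) * 0.1 for _ in range(32)]
per_modal_out = {m: [rng.standard_normal((D, D)) * 0.1 for _ in range(4)] for m in modalities}

def layer(x, w):
    # Stand-in for a full Transformer block: a residual linear map.
    return x + x @ w

def forward(x, modality):
    for w in per_modal_in[modality]:   # layers 1-4: modality-specific
        x = layer(x, w)
    for w in shared:                   # layers 5-36: shared parameters
        x = layer(x, w)
    for w in per_modal_out[modality]:  # layers 37-40: modality-specific
        x = layer(x, w)
    return x

video_tokens = rng.standard_normal((5, D))
out = forward(video_tokens, "video")
print(out.shape)  # (5, 8)
```

Note that every token passes through exactly 40 layers; only which parameters the outer 8 use depends on the token's modality.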

Single-Stream Processing

Unlike multi-stream architectures that process video and audio separately, Happy Horse feeds all tokens — text, reference image latents, noisy video tokens, and noisy audio tokens — into a single unified token sequence. Joint self-attention across all modalities enables natural cross-modal alignment without explicit conditioning branches.
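A toy illustration of the single-stream idea, assuming made-up token counts and a single-head, unmasked attention for brevity; the real model's sequence lengths and attention details are not specified here:

```python
import numpy as np

D = 8
rng = np.random.default_rng(0)

# Hypothetical token counts for one example.
text  = rng.standard_normal((12, D))   # text tokens
image = rng.standard_normal((16, D))   # reference image latents
video = rng.standard_normal((64, D))   # noisy video tokens
audio = rng.standard_normal((20, D))   # noisy audio tokens

# One unified sequence; every token can attend to every other token,
# so cross-modal alignment needs no separate conditioning branch.
seq = np.concatenate([text, image, video, audio], axis=0)

# Toy joint self-attention over the full sequence (single head, no mask).
scores = seq @ seq.T / np.sqrt(D)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ seq

print(seq.shape, attended.shape)  # (112, 8) (112, 8)
```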

Per-Head Gating

Each attention head has a learned scalar gate with sigmoid activation. This stabilizes training by allowing the model to smoothly control how much each head contributes, preventing training instabilities common in large multimodal models.
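A per-head scalar gate can be sketched as follows. The gate values here are invented for illustration; in the actual model they are learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H, T, Dh = 4, 6, 8           # heads, tokens, per-head dim
rng = np.random.default_rng(0)
head_out = rng.standard_normal((H, T, Dh))  # per-head attention outputs

# One scalar logit per head; sigmoid keeps each gate in (0, 1), so a
# head's contribution can be smoothly attenuated but never sign-flipped.
gate_logits = np.array([2.0, 0.0, -2.0, 5.0])  # hypothetical learned values
gates = sigmoid(gate_logits)

# Scale each head's output by its gate before the output projection.
gated = gates[:, None, None] * head_out

print(np.round(gates, 3))
```

Because each gate starts near a neutral value and moves smoothly, a misbehaving head can be down-weighted by gradient descent instead of destabilizing the whole block.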

Timestep-Free Denoising

Unlike most diffusion models, Happy Horse does not use explicit timestep embeddings. The model infers the current denoising state directly from the input latents. This simplifies the architecture and reduces the parameter count dedicated to timestep conditioning.

DMD-2 Distillation

The base model is distilled using Distribution Matching Distillation v2 (DMD-2) to produce a student model that generates high-quality output in just 8 denoising steps, without requiring Classifier-Free Guidance (CFG). This reduces inference cost by roughly 6-8x compared to the 50-step base model.
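The shape of an 8-step distilled sampling loop looks roughly like this. This is a structural sketch only: `student_denoise` is a dummy stand-in, the noise schedule is invented, and DMD-2's actual training and sampling details are not reproduced. It does reflect two facts from the source: one forward pass per step with no CFG, and (per the timestep-free design above) no timestep argument to the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def student_denoise(x):
    # Stand-in for the distilled student (hypothetical): pull the latent
    # toward a fixed "clean" target in one forward pass. There is no
    # timestep argument -- per the timestep-free design, the model infers
    # the noise level from the latents themselves.
    target = np.ones_like(x)
    return target + 0.1 * (x - target)

# Hypothetical noise schedule: 8 steps from pure noise down to zero.
sigmas = np.linspace(1.0, 0.0, 9)

x = rng.standard_normal((4, 4))          # start from pure noise
for i in range(8):
    x0_hat = student_denoise(x)          # one pass per step, no CFG
    # Re-noise the clean estimate to the next (lower) noise level.
    x = x0_hat + sigmas[i + 1] * rng.standard_normal(x.shape)

print(x.shape)  # (4, 4)
```

Eight passes of the student versus ~100 for a 50-step base model with CFG (two passes per step) is where the quoted 6-8x cost reduction comes from.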

MagiCompiler

An in-house full-graph compiler that fuses operators across Transformer layers for approximately 1.2x end-to-end speedup. It optimizes attention kernels, memory access patterns, and layer-to-layer data flow for NVIDIA Hopper architecture GPUs.

Super-Resolution Module

A separate upscaling module that takes 256p or 540p output and upscales to 1080p. This two-stage approach allows fast iteration at low resolution with a final quality pass, reducing the compute burden of generating at full 1080p natively.

Inference Pipeline

Stage              Resolution   Time (H100)   Notes
Base Generation    256p         2.0s          8-step DMD-2, no CFG
Super-Resolution   540p         +6.0s         First upscale pass
Super-Resolution   1080p        +30.4s        Final quality pass
Total              1080p        38.4s         5-second clip, end-to-end
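As a quick sanity check, the per-stage times sum to the quoted end-to-end total:

```python
# Stage times from the table above (seconds, H100, 5-second clip).
stages = {
    "base_generation_256p": 2.0,
    "super_res_540p": 6.0,
    "super_res_1080p": 30.4,
}
total = sum(stages.values())
print(round(total, 1))  # 38.4
```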