Happy Horse AI Video Model
The 15B-parameter open-source model that generates video + audio from text. It beat Seedance 2.0 in blind human tests — and it's going fully open source.

Weights are not yet released. Try the model now via the AI Video Arena blind test; the full open-source release is expected around April 10.
Parameters: 15B
AI Video Arena rank: #1
Elo score: 1333
Lip-sync languages: 7
Denoising steps: 8
Max resolution: 1080p
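For a sense of what an Elo gap means in a blind arena, the standard Elo formula converts a rating difference into an expected win rate. The sketch below assumes the arena uses the conventional 400-point logistic scale, which this page does not confirm, and the 1233-rated opponent is purely hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


HAPPY_HORSE_ELO = 1333            # Elo score quoted on this page
HYPOTHETICAL_OPPONENT_ELO = 1233  # assumption: an opponent rated 100 points lower

win_rate = elo_expected_score(HAPPY_HORSE_ELO, HYPOTHETICAL_OPPONENT_ELO)
print(f"Expected win rate: {win_rate:.3f}")
```

Under those assumptions, a 100-point lead corresponds to winning roughly 64% of head-to-head comparisons.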
See It in Action
Real outputs from Happy Horse 1.0.
Kid holds out the rest of her cookie, smiles, says "Love you mommy." Cookie offering, sweet smile, little voice.
A cobblestone street after rain, looking dark and glossy, reflecting the yellow streetlamps perfectly.
A candid, handheld camera shot follows a young woman bundled in a thick, charcoal wool coat, speed-walking hunched over down a slushy Manhattan sidewalk at 7:30 AM. Her breath plumes in thick white clouds against the freezing grey air, and her nose is bright red from the cold.
What Makes It Special
Unified Transformer
40-layer self-attention network with 4 modality-specific layers on each end and 32 shared layers — single-stream processing with per-head gating for stable training.
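The layer layout described above can be written out as a simple plan. This is a minimal sketch of the counts only: the names `LayerSpec` and `build_layer_plan`, and the reading of "each end" as the first four and last four layers, are illustrative assumptions rather than the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    index: int
    kind: str  # "modality_specific" or "shared"


def build_layer_plan(total: int = 40, specific_per_end: int = 4) -> list[LayerSpec]:
    """Label each layer as modality-specific (at either end) or shared (middle)."""
    plan = []
    for i in range(total):
        at_edge = i < specific_per_end or i >= total - specific_per_end
        plan.append(LayerSpec(i, "modality_specific" if at_edge else "shared"))
    return plan


plan = build_layer_plan()
shared = sum(1 for layer in plan if layer.kind == "shared")
print(len(plan), shared)  # 40 32
```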
Joint Video + Audio
Generates synchronized dialogue, ambient sound, and Foley alongside video frames — no post-production dubbing required.
8-Step DMD-2 Distillation
Reduces denoising to just 8 steps without classifier-free guidance, accelerated further by the in-house MagiCompiler runtime.
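To see why dropping classifier-free guidance matters, count network evaluations: guidance typically needs a conditional and an unconditional evaluation at every step (often batched together). The sketch below does that arithmetic. The 50-step guided baseline is a hypothetical comparison point, not a figure from this page.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Network evaluations per sample: CFG doubles the work at each step."""
    passes_per_step = 2 if uses_cfg else 1
    return steps * passes_per_step


distilled = forward_passes(steps=8, uses_cfg=False)
# Assumption: a generic, undistilled sampler with 50 guided steps.
hypothetical_baseline = forward_passes(steps=50, uses_cfg=True)

print(distilled, hypothetical_baseline, hypothetical_baseline / distilled)
```

Against that assumed baseline, the distilled sampler needs 8 evaluations instead of 100.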
Multilingual Lip-Sync
Native support for 7 languages, with a Word Error Rate of 14.6%, the lowest among the models in the Benchmarks table.
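Word Error Rate is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. A minimal sketch of that standard definition follows. How the 14.6% figure was actually measured (speech recognizer, test set, text normalization) is not stated on this page, and the sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev holds the previous row of the word-level edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
sample_wer = word_error_rate("love you mommy", "love you mummy")
print(f"{sample_wer:.1%}")
```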
1080p Output
Clips of 5 to 8 seconds at 1080p in standard aspect ratios (16:9, 9:16), suitable for social, advertising, and cinematic use cases.
Open & Self-Hostable
Base model, distilled model, super-resolution module, and inference code are slated for open release with commercial-use permission.
Benchmarks
Based on 2,000 human-rated comparisons on the Artificial Analysis Video Arena.
| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Realism ↑ | WER (%) ↓ |
|---|---|---|---|---|
| Happy Horse 1.0 | 4.80 | 4.18 | 4.52 | 14.60 |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23 |
| OVI 1.1 | 4.73 | 4.10 | 4.41 | 40.45 |
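The table can be read column by column. The sketch below copies the figures above into a small structure and picks the leader on each metric, treating WER as lower-is-better. The dictionary layout and key names are illustrative.

```python
# Figures copied from the benchmark table; keys are illustrative short names.
SCORES = {
    "Happy Horse 1.0": {"visual": 4.80, "text": 4.18, "physics": 4.52, "wer": 14.60},
    "LTX 2.3":         {"visual": 4.76, "text": 4.12, "physics": 4.56, "wer": 19.23},
    "OVI 1.1":         {"visual": 4.73, "text": 4.10, "physics": 4.41, "wer": 40.45},
}
LOWER_IS_BETTER = {"wer"}


def leader(metric: str) -> str:
    """Return the model with the best value for the given metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(SCORES, key=lambda model: SCORES[model][metric])


leaders = {metric: leader(metric) for metric in ("visual", "text", "physics", "wer")}
for metric, model in leaders.items():
    print(f"{metric}: {model}")
```

On these numbers Happy Horse 1.0 leads three of the four columns, while LTX 2.3 is slightly ahead on physical realism (4.56 vs 4.52).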
Inference Speed
On a single NVIDIA H100, generating a 5-second video clip.
| Resolution | Generation time | Notes |
|---|---|---|
| 256p | 2.0 s | 5-second clip on H100 |
| 540p | 8.0 s | with super-resolution |
| 1080p | 38.4 s | full quality |
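These timings can be restated as compute seconds per second of generated video. A small sketch using only the figures above (the variable and function names are illustrative):

```python
CLIP_SECONDS = 5.0  # clip length used for the timings on this page

# Generation time in seconds on a single H100, by output resolution.
GENERATION_SECONDS = {"256p": 2.0, "540p": 8.0, "1080p": 38.4}


def realtime_factor(generation_seconds: float, clip_seconds: float = CLIP_SECONDS) -> float:
    """Compute seconds needed per second of video; below 1.0 is faster than real time."""
    return generation_seconds / clip_seconds


factors = {res: realtime_factor(t) for res, t in GENERATION_SECONDS.items()}
for res, factor in factors.items():
    print(f"{res}: {factor:.2f}x")
```

By this measure the 256p path runs faster than real time (0.40x), while full-quality 1080p takes about 7.7 seconds of compute per second of video.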