What Is Happy Horse 1.0?

The open-source AI video generation model that topped the AI Video Arena.

Happy Horse 1.0 is a 15B-parameter unified Transformer model that jointly generates video and synchronized audio from text or image prompts. Released in early 2026, it achieved the #1 ranking on the Artificial Analysis AI Video Arena with an Elo score of 1333, surpassing Seedance 2.0 and other leading models in blind human evaluations.

Key Specifications

Parameters	15B
Architecture	40-layer Unified Single-Stream Transformer
Distillation	DMD-2 8-step distillation (no CFG)
Compiler	MagiCompiler full-graph compilation (~1.2x speedup)
Inputs	Text, Image
Outputs	Video, Synchronized Audio
Resolution	1080p (16:9, 9:16)
Duration	5-8 seconds
Lip-Sync Languages	English, Mandarin, Cantonese, Japanese, Korean, German, French
Hardware	NVIDIA H100 or A100 (≥48GB VRAM), FP8 quantization supported
License	Open Source (Commercial Use)

Core Capabilities

Unified Transformer

40-layer self-attention network with 4 modality-specific layers on each end and 32 shared layers — single-stream processing with per-head gating for stable training.

Joint Video + Audio

Generates synchronized dialogue, ambient sound, and Foley alongside video frames — no post-production dubbing required.

8-Step DMD-2 Distillation

Reduces denoising to just 8 steps without classifier-free guidance, accelerated further by the in-house MagiCompiler runtime.

Multilingual Lip-Sync

Native support for 7 languages with industry-leading low Word Error Rate (14.6%).

1080p Output

5-8 second clips at 1080p in standard aspect ratios (16:9, 9:16) — suitable for social, advertising, and cinematic use cases.

Open & Self-Hostable

Base model, distilled model, super-resolution module, and inference code released openly with commercial-use permission.

Arena Elo Rankings

Elo scores from the Artificial Analysis AI Video Arena (April 8, 2026).

Text-to-Video

Text-to-Video with Audio

Image-to-Video

Image-to-Video with Audio

Known Limitations

Based on community testing and independent evaluations.

Best at single-character portrait scenarios; quality drops with multiple people or complex scenes
Requires H100/A100 GPU (≥48GB VRAM); consumer GPUs cannot run it currently
Generation length limited to ~10 seconds before quality degrades
High-definition output still benefits from super-resolution post-processing
Community quantization solutions are in progress but not yet mature for local deployment