Happy Horse Wiki

What Is Happy Horse 1.0?

The open-source AI video generation model that topped the AI Video Arena.

Happy Horse 1.0 is a 15B-parameter unified Transformer model that jointly generates video and synchronized audio from text or image prompts. Released in early 2026, it achieved the #1 ranking on the Artificial Analysis AI Video Arena with an Elo score of 1333, surpassing Seedance 2.0 and other leading models in blind human evaluations.

Key Specifications

Parameters: 15B
Architecture: 40-layer Unified Single-Stream Transformer
Distillation: DMD-2 8-step distillation (no CFG)
Compiler: MagiCompiler full-graph compilation (~1.2x speedup)
Inputs: Text, Image
Outputs: Video, Synchronized Audio
Resolution: 1080p (16:9, 9:16)
Duration: 5-8 seconds
Lip-Sync Languages: English, Mandarin, Cantonese, Japanese, Korean, German, French
Hardware: NVIDIA H100 or A100 (≥48GB VRAM), FP8 quantization supported
License: Open Source (Commercial Use)

Core Capabilities

Unified Transformer

A 40-layer self-attention network with 4 modality-specific layers at each end and 32 shared layers in between. Processing is single-stream, with per-head gating for stable training.
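The layer split described above (4 + 32 + 4 = 40) can be sketched as a simple list of layer roles. This is only an illustration of the arithmetic, not the model's actual code: the function name `layer_plan` and the role labels are hypothetical.

```python
from collections import Counter


def layer_plan(total_layers=40, specific_per_end=4):
    """Return layer roles for a 'sandwich' layout (hypothetical helper).

    Modality-specific layers sit at both ends; the remaining
    layers in the middle are shared across modalities.
    """
    shared = total_layers - 2 * specific_per_end
    if shared < 0:
        raise ValueError("more modality-specific layers than total layers")
    return (
        ["modality_specific"] * specific_per_end
        + ["shared"] * shared
        + ["modality_specific"] * specific_per_end
    )


plan = layer_plan()
counts = Counter(plan)
print(len(plan), counts["shared"], counts["modality_specific"])  # 40 32 8
```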

Joint Video + Audio

Generates synchronized dialogue, ambient sound, and Foley alongside video frames — no post-production dubbing required.

8-Step DMD-2 Distillation

Reduces denoising to just 8 steps without classifier-free guidance, accelerated further by the in-house MagiCompiler runtime.
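To see why dropping classifier-free guidance matters alongside the lower step count, it helps to count network evaluations per generated sample. With CFG, each denoising step typically evaluates the network twice (conditional and unconditional); without it, once. The sketch below is a back-of-the-envelope count only; the 50-step baseline is a hypothetical figure chosen for illustration, not a number from the Happy Horse release.

```python
def forward_passes(steps, use_cfg):
    """Network evaluations needed for one sample.

    Classifier-free guidance evaluates both a conditional and an
    unconditional branch at every step, doubling the count.
    """
    return steps * (2 if use_cfg else 1)


BASELINE_STEPS = 50  # hypothetical undistilled sampler, for illustration only

distilled = forward_passes(8, use_cfg=False)
baseline = forward_passes(BASELINE_STEPS, use_cfg=True)
print(distilled, baseline, baseline / distilled)  # 8 100 12.5
```

Under that assumed baseline, the distilled sampler needs 12.5x fewer network evaluations. Actual wall-clock gains depend on the real baseline and on the runtime.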

Multilingual Lip-Sync

Native lip-sync support for 7 languages, with a low Word Error Rate (WER) of 14.6%, described as industry-leading.
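For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows the standard computation. How the 14.6% figure was measured (for example, which speech recognizer transcribed the generated audio) is not stated here, so the example sentences are purely illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("horse" -> "house") and one deletion ("fast")
# over four reference words.
print(word_error_rate("the horse runs fast", "the house runs"))  # 0.5
```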

1080p Output

5-8 second clips at 1080p in standard aspect ratios (16:9, 9:16) — suitable for social, advertising, and cinematic use cases.

Open & Self-Hostable

Base model, distilled model, super-resolution module, and inference code released openly with commercial-use permission.

Arena Elo Rankings

Elo scores from the Artificial Analysis AI Video Arena (April 8, 2026).

[Charts: Elo scores from Artificial Analysis for four categories: Text-to-Video, Text-to-Video with Audio, Image-to-Video, and Image-to-Video with Audio]
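An Elo gap can be read as an expected head-to-head preference rate. The sketch below uses the conventional logistic Elo formula with a 400-point scale. Whether the arena uses exactly this formulation is an assumption, and the opponent rating of 1300 is a hypothetical value for illustration, not a published score.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


HAPPY_HORSE_ELO = 1333      # figure quoted in the overview above
HYPOTHETICAL_RIVAL = 1300   # made-up opponent rating, for illustration only

p = expected_win_rate(HAPPY_HORSE_ELO, HYPOTHETICAL_RIVAL)
print(f"Expected preference rate: {p:.1%}")
```

Under this model a gap of a few dozen points corresponds to only a modest edge in blind comparisons, while a 400-point gap corresponds to roughly a 10:1 preference.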

Known Limitations

Based on community testing and independent evaluations.

  • Best at single-character portrait scenarios; quality drops with multiple people or complex scenes
  • Requires H100/A100 GPU (≥48GB VRAM); consumer GPUs cannot run it currently
  • Generation length limited to ~10 seconds before quality degrades
  • High-definition output still benefits from super-resolution post-processing
  • Community quantization solutions are in progress but not yet mature for local deployment
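The hardware requirement is easier to interpret with a rough estimate of weight memory alone. The sketch below multiplies the 15B parameter count by bytes per parameter at common precisions. It deliberately ignores activations, video and audio latents, and any auxiliary modules, all of which add to the total, so treat it as a lower bound rather than a sizing guide.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 15e9  # 15B parameters, per the specifications above

for dtype in ("fp32", "bf16", "fp8"):
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
# fp32: 60 GB
# bf16: 30 GB
# fp8: 15 GB
```

At 16-bit precision the weights alone come to about 30 GB, which exceeds the 24 GB found on typical high-end consumer cards and is consistent with the ≥48GB VRAM requirement once working memory is included.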