#1 on AI Video Arena (Elo 1333)

Happy Horse AI Video Model

The 15B-parameter open-source model that generates video + audio from text. It beat Seedance 2.0 in blind human tests — and it's going fully open source.


Confirmed: an Alibaba project, developed by Taotian Group and led by ex-Kuaishou VP Zhang Di. Public release is expected around mid-April.

15B parameters

Elo score of 1333 on AI Video Arena

7 lip-sync languages

8 denoising steps

1080p max resolution

See It in Action

Real outputs from Happy Horse 1.0.


Kid holds out the rest of her cookie, smiles, says "Love you mommy." Cookie offering, sweet smile, little voice.

A cobblestone street after rain, looking dark and glossy, reflecting the yellow streetlamps perfectly.

A candid, handheld camera shot follows a young woman bundled in a thick, charcoal wool coat, speed-walking hunched over down a slushy Manhattan sidewalk at 7:30 AM. Her breath plumes in thick white clouds against the freezing grey air, and her nose is bright red from the cold.


What Makes It Special

Unified Transformer

40-layer self-attention network with 4 modality-specific layers on each end and 32 shared layers — single-stream processing with per-head gating for stable training.
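As a rough illustration of that layout, the sketch below lays out 40 layer slots and marks the first and last 4 as modality-specific, which leaves 32 shared layers in the middle. The `LayerSpec` class and `build_layer_plan` function are hypothetical names used only for this example; they are not taken from the Happy Horse code, and the sketch says nothing about gating or attention internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    """One slot in the layer stack (hypothetical helper type)."""
    index: int
    kind: str  # "modality_specific" or "shared"


def build_layer_plan(total_layers: int = 40, specific_per_end: int = 4) -> list[LayerSpec]:
    """Mark the first and last `specific_per_end` layers as modality-specific."""
    if total_layers < 2 * specific_per_end:
        raise ValueError("not enough layers for the requested edge blocks")
    plan = []
    for i in range(total_layers):
        is_edge = i < specific_per_end or i >= total_layers - specific_per_end
        plan.append(LayerSpec(i, "modality_specific" if is_edge else "shared"))
    return plan


plan = build_layer_plan()
shared_count = sum(1 for layer in plan if layer.kind == "shared")
print(len(plan), shared_count)  # 40 32
```

The counts follow directly from the description: 4 + 32 + 4 = 40.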

Joint Video + Audio

Generates synchronized dialogue, ambient sound, and Foley alongside video frames — no post-production dubbing required.

8-Step DMD-2 Distillation

Reduces denoising to just 8 steps without classifier-free guidance, accelerated further by the in-house MagiCompiler runtime.
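To see why dropping classifier-free guidance matters, it helps to count model evaluations. A sampler that uses classifier-free guidance typically runs two forward passes per step (conditional and unconditional), while a guidance-free sampler runs one. The sketch below does that arithmetic; the 50-step guided baseline is a hypothetical comparison point chosen for illustration, not a figure from this page, and evaluation counts are only a proxy for wall-clock time.

```python
def model_evaluations(steps: int, uses_cfg: bool) -> int:
    """Forward passes needed for one sample.

    Assumes the common CFG setup of two passes per step
    (conditional + unconditional); guidance-free sampling needs one.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    passes_per_step = 2 if uses_cfg else 1
    return steps * passes_per_step


# 8 steps without guidance, as described for the distilled model.
distilled = model_evaluations(steps=8, uses_cfg=False)

# Hypothetical undistilled baseline: 50 steps with guidance (assumed, not from the source).
baseline = model_evaluations(steps=50, uses_cfg=True)

print(distilled, baseline, baseline / distilled)  # 8 100 12.5
```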

Multilingual Lip-Sync

Native support for 7 languages, with a Word Error Rate of 14.6%, the lowest of the three models in the benchmark table.
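Word Error Rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The page does not say how its figure was measured, so the sketch below shows only the standard metric; lowercasing and whitespace tokenization are simplifying assumptions, and `word_error_rate` is a name chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of three reference words.
print(word_error_rate("love you mommy", "love you mummy"))
```

On this scale, 14.6% corresponds to roughly one word error in every seven reference words.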

1080p Output

5-8 second clips at 1080p in standard aspect ratios (16:9, 9:16) — suitable for social, advertising, and cinematic use cases.

Open & Self-Hostable

Base model, distilled model, super-resolution module, and inference code released openly with commercial-use permission.

Benchmarks

Based on 2,000 human-rated comparisons on the Artificial Analysis Video Arena.

Model	Visual Quality ↑	Text Alignment ↑	Physical Realism ↑	WER (%) ↓
Happy Horse 1.0	4.80	4.18	4.52	14.60
LTX 2.3	4.76	4.12	4.56	19.23
OVI 1.1	4.73	4.10	4.41	40.45

80% win rate vs OVI 1.1

60.9% win rate vs LTX 2.3
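For readers who want to relate pairwise win rates to Elo-style ratings, the sketch below applies the standard Elo logistic formula, under which a win probability p corresponds to a rating gap of 400 · log10(p / (1 − p)). This is only an illustration: it ignores ties, and the arena's actual rating procedure is not described on this page and may differ. The function names are chosen for this example.

```python
import math


def expected_score(rating_gap: float) -> float:
    """Expected win probability for the higher-rated side under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def implied_elo_gap(win_rate: float) -> float:
    """Rating gap that would produce `win_rate` under standard Elo (ties ignored)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


for opponent, win_rate in [("OVI 1.1", 0.80), ("LTX 2.3", 0.609)]:
    gap = implied_elo_gap(win_rate)
    print(f"{opponent}: {win_rate:.1%} win rate ~ {gap:.0f} Elo points")
```

Under these assumptions, an 80% win rate corresponds to a gap of about 241 points and a 60.9% win rate to about 77 points.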


Inference Speed

On a single NVIDIA H100, generating a 5-second video clip.

2.0s

256p

5-sec clip on H100

8.0s

540p

with super-resolution

38.4s