How I Actually Judge AI Video Models

Most people evaluate AI video models on visual quality. Here's the 6-point production-first framework I use—covering native voice quality, physics pacing, text morphing, character control, voice integration, and frame-level control. Plus: what actually happened with Seedance 2.0.

12 April 2026

A production-first framework, with Seedance 2.0 as the cautionary tale

Most people evaluate AI video models on visual quality. I judge them differently, on the things that actually matter when you're shipping production work.

Here's my current framework. I originally wrote this with Seedance 2.0 as the reference point because it was what everyone was talking about. Now that it's out, I've added a section at the end on what actually happened.

1. Native Voice Quality

Veo, Kling, Seedance 1.5, and most other video models with built-in voice generation produce what I'd call a metallic voice. The speech is clear, but there's a slightly high-pitched quality riding on top of it, a non-soothing metallic texture that's hard to miss once you've heard it.

What was interesting in the early Seedance 2.0 demos was that I wasn't seeing this. The voice felt different. More on whether that held up below.
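If you want something other than your ears to go on, one rough approach is to compare a spectrogram of the generated voice against a natural recording of similar speech; the metallic texture tends to show up as oddly regular energy in the higher frequencies. A minimal sketch, assuming librosa and matplotlib are installed; the file names are placeholders:

```python
# Quick spectrogram comparison: model-generated voice vs. a natural recording.
# File paths are placeholders; requires librosa, numpy, matplotlib.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_spectrogram(path: str, ax, title: str) -> None:
    y, sr = librosa.load(path, sr=None, mono=True)
    db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
plot_spectrogram("generated_voice.wav", ax1, "Model voice")
plot_spectrogram("reference_voice.wav", ax2, "Natural recording")
plt.tight_layout()
plt.show()
```

It's not a metric, just a fast way to see whether the thing you're hearing has a visible counterpart.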

2. Physics and Movement Pace

This is about whether the video matches real-world pace at a subconscious level.

Kling, for example, inherently feels slow. When you animate an image with it, the animation rarely plays at real-world pace; it tends to feel like slow motion. They've improved this in newer releases, but the tendency is still there.

On the other end, models like Sora are inherently fast, which again feels non-real, just in the opposite direction.

What I'm looking for is a model that matches the actual pace of how things move in the real world, so the video makes sense to your brain without you consciously analyzing it. Early Seedance footage felt like it landed exactly at that pace.
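If you want a rough sanity check beyond gut feel, mean optical-flow magnitude per frame is a crude proxy for how fast things move on screen. Compare a generated clip against real footage of a similar scene at a similar resolution and frame rate. A sketch with OpenCV; the file names are placeholders:

```python
# Rough motion-pace proxy: mean optical-flow magnitude per frame.
# Requires opencv-python and numpy; file names are placeholders.
import cv2
import numpy as np

def mean_flow_magnitude(path: str, max_frames: int = 120) -> float:
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"Could not read {path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(float(np.mean(mag)))
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes))

print("generated:", mean_flow_magnitude("generated_clip.mp4"))
print("real     :", mean_flow_magnitude("real_reference.mp4"))
```

If the generated clip's number sits well below the real footage, you're probably looking at that slow-motion feel; well above it, and the clip reads as rushed.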

3. Text Morphing

This is a gap I haven't seen any model solve yet.

When you animate an image that has text in it, whether it's part of a UI, a subtitle, or anything else in the frame, the text gets morphed during animation. It degrades as the model tries to animate around it.

The problem multiplies when the text is in a non-English language for American or European models, and a non-Chinese language for Chinese models. Think about animating a product shot with Hindi text on screen. An American-trained model will morph and destroy that Hindi text far worse than it would with English. There's a clear training bias at play. But even setting that aside, text morphing during animation is a significant unsolved problem that affects real production work.
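One way to put a number on this instead of eyeballing it is to OCR the first and last frames of the animation and compare what survives. A rough sketch with pytesseract; the paths are placeholders, and the Hindi language pack is just an example of the non-English case:

```python
# Rough text-degradation check: OCR the first and last frames of an animated
# clip and compare the recognized strings. Requires opencv-python, pytesseract,
# and a Tesseract install with the relevant language pack; paths are placeholders.
import cv2
import difflib
import pytesseract

def frame_text(video_path: str, frame_index: int, lang: str = "eng") -> str:
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read frame {frame_index} from {video_path}")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray, lang=lang).strip()

cap = cv2.VideoCapture("animated_product_shot.mp4")
last = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - 1
cap.release()

start_text = frame_text("animated_product_shot.mp4", 0, lang="hin")
end_text = frame_text("animated_product_shot.mp4", last, lang="hin")
similarity = difflib.SequenceMatcher(None, start_text, end_text).ratio()
print(f"start: {start_text!r}\nend:   {end_text!r}\nsimilarity: {similarity:.2f}")
```

OCR itself is imperfect, especially on stylized text, so treat the similarity score as a signal rather than a verdict.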

4. Character Control

When I pass a character as a first frame or reference image, I want that character's expressions to accurately reflect how they behave in the real world. This is solved in only two specific cases.

The first is non-real characters. Because nobody actually knows how a fictional character looks when they're sad, happy, or cheering, the model gets away with it.

The second is well-known personalities. Think Cillian Murphy. Models hold his face really well because they've been trained on enough data of him. This is exactly what we saw in the early Seedance demos. The viral videos were almost all of well-known personalities, because people were optimizing for views.

For everyone else, character control is unsolved. There are two approaches to fixing it. The first is training, fine-tuning, or otherwise changing model weights, which works but isn't scalable, and feels increasingly old-school given where the field is moving. The second is in-context learning. The term sounds like a heavy ML concept, but it's actually simple: it's just about giving the model better context as input. Right now, Kling supports reference images for this, but providing 3-4 images clearly isn't accurate enough. We've faced this with actual clients, where the person whose video we're animating doesn't even recognize themselves in the output.

The real issue here is an input problem, not an output problem. The ideal input would be a complete 3D mesh of the character's face.
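Until the input side gets richer, what you can at least do is measure the drift. One approach is to compare a face embedding from the reference image against embeddings from sampled frames of the output. A sketch using the face_recognition library; the paths are placeholders and the sampling rate is arbitrary:

```python
# Rough identity-drift check: compare a face embedding from the reference
# image against embeddings from frames of the generated video.
# Requires face_recognition (dlib) and opencv-python; paths are placeholders.
import cv2
import face_recognition
import numpy as np

reference = face_recognition.load_image_file("reference_character.jpg")
ref_encoding = face_recognition.face_encodings(reference)[0]  # assumes one clear face

cap = cv2.VideoCapture("generated_clip.mp4")
distances = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 15 == 0:  # sample roughly twice a second at 30 fps
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        encodings = face_recognition.face_encodings(rgb)
        if encodings:
            distances.append(face_recognition.face_distance([ref_encoding], encodings[0])[0])
    frame_idx += 1
cap.release()

# Lower is better; ~0.6 is the library's usual "same person" cutoff.
print(f"mean distance: {np.mean(distances):.3f}, worst: {np.max(distances):.3f}")
```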

5. Voice Control and the Integration Problem

This is a dependency problem between voice models and video models.

Say I love a particular voice on ElevenLabs. How do I pass that ElevenLabs voice ID into Kling? Kling does provide a voice option, but it's Kling-specific; there's no cross-platform voice ID system. Voice models and video models are independent ecosystems, each excelling in their own area. You can solve this partially by replacing the audio track afterwards, but that workaround breaks down in specific cases.

For example, ElevenLabs supports Gujarati. Kling almost certainly doesn't, and it's unlikely they'll train their video model on it because that's not their focus. So for regional language use cases, you're stuck, and there's no clean way to bridge the two systems right now.
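Where the workaround does apply, the mechanics are straightforward: treat the generated video as a silent clip, generate the voiceover separately, and mux the two together. A minimal sketch calling ffmpeg from Python; the file names are placeholders, and note that this does nothing for lip sync, which is exactly why it breaks down for cases like Gujarati dialogue:

```python
# Minimal audio swap: keep the generated video stream, replace the audio track
# with a separately generated voiceover (e.g. from ElevenLabs).
# Requires ffmpeg on PATH; file names are placeholders.
import subprocess

def replace_audio(video_in: str, audio_in: str, video_out: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_in,      # generated clip (video stream kept)
            "-i", audio_in,      # voiceover from the voice model
            "-map", "0:v:0",     # take video from the first input
            "-map", "1:a:0",     # take audio from the second input
            "-c:v", "copy",      # don't re-encode the video
            "-shortest",         # trim to the shorter of the two
            video_out,
        ],
        check=True,
    )

replace_audio("kling_clip.mp4", "elevenlabs_voiceover.mp3", "final_clip.mp4")
```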

6. Frame-Level Control

A few models have come out with multi-shot prompting: Kling and Wan explicitly, while Veo and Sora essentially do it by default without calling it that, and with zero actual control given to the user.

For Kling and Wan, the control exists, but only through prompting. That means you can't control a specific action at a specific frame. And when you try to edit a particular moment in an already-generated video, you can't just change the prompt, because in practice these models won't reproduce the same output even when given the same prompt, the same seed, and the same temperature settings.

This is worth a separate discussion on its own. But the core point is that granular, frame-level control knobs are the need of the hour, and as far as I know, no model has solved this yet.
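To make "control knobs" concrete, here's roughly the shape of input I'd want to be able to hand a model: constraints pinned to specific frames, plus a way to lock everything I didn't touch. This is a hypothetical payload, not any model's actual API:

```python
# Hypothetical frame-level control payload. No current model accepts anything
# like this; it's a sketch of the kind of input I'd want to be able to send.
edit_request = {
    "source_generation_id": "gen_12345",  # re-edit an existing output, not a fresh roll
    "keyframes": [
        {"frame": 0,  "constraint": "character holds the product at chest height"},
        {"frame": 48, "constraint": "camera begins a slow push-in"},
        {"frame": 96, "constraint": "character smiles and looks at the lens"},
    ],
    "locked_regions": [
        # Everything outside the edited span should stay identical to the original.
        {"frames": [0, 47],   "mode": "freeze"},
        {"frames": [97, 240], "mode": "freeze"},
    ],
}
```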

What Actually Happened With Seedance 2.0

When I first wrote this framework, Seedance 2.0 was the model everyone was watching. The pre-release footage looked genuinely different on the things I care about. The voice didn't have the metallic texture. The pacing felt real. Character likeness on well-known faces was holding up better than anything else I'd seen.

Then it shipped, and most of that is gone.

After the Hollywood likeness allegations, the model got nerfed pretty heavily. It no longer picks up faces from image-to-video or reference-to-video inputs in any meaningful way. Quality is capped at 720p. The overall vibe, that thing in the early footage that made you stop and look twice, is gone.

The model that's live today is a different product from the one that was being demoed. Cap the resolution, strip the face handling, and you've removed the two things that were making people pay attention in the first place. At this point, Seedance 2.0 is almost on par with Kling V3 on most fronts. The one place it still pulls ahead is sound design: the way it picks SFX and BGM is genuinely superb, and that's the part nobody else is doing as well right now.

So the framework still stands. The six things above are still the right things to judge a video model on. I just don't have a clean reference point anymore.