
"Video understanding" is helping media companies describe and search their vast video archives at unprecedented speeds.
Dr. Yannis Tevissen, Head of Science at Moments Lab, explains how the concept differs from generative AI and why the goal is to empower human storytellers by helping them find and reuse existing content.
To be effective, such an approach requires training AI models on high-quality, structured data with human oversight to produce reliable and unbiased results.
The tech world is abuzz over AI that generates synthetic video from a simple text prompt. But in production houses, a quieter, more practical AI revolution is taking hold. Known as "video understanding," this technology doesn’t create new footage. Instead, it focuses on describing and cataloging existing content, allowing storytellers to find and repurpose key moments from the vast archives they already own.
For an expert's perspective, we spoke with Dr. Yannis Tevissen, the Head of Science at Moments Lab and a leading voice in the field of video understanding. With a PhD in AI from the Institut Polytechnique de Paris, Dr. Tevissen has built a career on the practical use of AI. He sees video understanding, rather than the generation of synthetic content, as the more impactful application.
"Video understanding is even more powerful than generative AI because you can use it to describe and search the content you've already shot. Then, you can reuse that video content to create a brand new story," Tevissen says. By unlocking the value of existing footage, he explains, video understanding empowers human storytellers rather than replacing them.
Real over robotic: For Tevissen, the same human-first principle applies to generative AI. "As someone who consumes videos, I don't really want to see videos 100% generated by AI. I want to see real videos, maybe augmented by AI, but the end product is something that a human recorded. A human still guided the story. It was just made with the help of AI. I think that's a very different thing."
Today, that human-guided philosophy is the engine behind the "Discovery Agent," a tool designed to bridge the "knowledge gap" and uncover new, untold stories hidden within vast media archives. By augmenting a producer’s creative process—from acting as a subject matter expert to structuring a pitch and combing archives—it collapses the time between idea and execution.
For global production houses like Banijay, which owns the rights to everything from Top Chef to Black Mirror, the solution makes a previously impractical task feasible: a user can instantly search for every shot of, say, a chicken dish from 15 seasons of a cooking show—a feat of recall no human could manage. But making that search reliable is the hard part, Tevissen explains.
Scraping the ceiling: Contrary to popular belief, messy, unstructured data is not better for training, he says. "Scraping everything from the web doesn't work for video, because you have a lot of noise. You have bizarre videos, things you don't want your models trained on." For the nuanced task of video understanding, that scrape-everything approach is reaching its limits.
At Moments Lab, the solution is a multi-layered process designed to address technical bias and subjective human context. First, the team mitigates the risk of any single person's interpretation skewing the data by using a large and diverse pool of annotators. "We use video understanding to understand the video and help create a first description of it. We also have annotators who review and correct the initial annotations written by AI. We gain a lot of time this way, but it also requires a lot of oversight and a lot of knowledge about bias mitigation for AI models."
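That human-in-the-loop layer can be pictured as a simple record that carries both the machine's first draft and the annotators' corrections. The sketch below is purely illustrative: the class, field names, and two-reviewer rule are assumptions for the example, not Moments Lab's actual pipeline. It only shows the shape of an AI-draft-then-human-review workflow.

```python
# Illustrative only: a minimal record for an AI-drafted, human-reviewed
# video annotation. Names, fields, and the two-reviewer rule are assumptions,
# not a description of Moments Lab's real system.
from dataclasses import dataclass, field

@dataclass
class ClipAnnotation:
    clip_id: str
    ai_draft: str                                   # first description written by the model
    human_final: str | None = None                  # corrected text after annotator review
    reviewers: list[str] = field(default_factory=list)

    def review(self, annotator: str, corrected: str) -> None:
        """Record an annotator's correction of the AI-generated draft."""
        self.reviewers.append(annotator)
        self.human_final = corrected

    def approved(self) -> bool:
        # Require more than one independent reviewer so no single person's
        # interpretation skews the training data on its own.
        return self.human_final is not None and len(self.reviewers) >= 2
```

The value of a structure like this is traceability: the model's draft and every human correction stay attached to the clip, which makes it easier to review the annotations later for systematic bias.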
Just the facts: To do so effectively, they employ a disciplined, rules-based approach. "Annotating a video for AI training is actually quite tricky. For instance, you might need to write, 'I see a bunch of people in a big room, suited and with their laptops.' But you don't want to include the context, like the fact that it's a NATO conference. You know that as a human, but the AI wouldn't be able to understand this because it doesn't have the broader context. So you really need to define the rules."
Don't blink: Another technical challenge is "frame sampling bias," Tevissen explains. "When you try to understand a video by extracting only a few frames, you can miss the important action. If there is a very quick handshake between two political figures, for example, you might miss it entirely if you don't sample the right frame."
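The problem is easy to demonstrate in code. The short sketch below is a hypothetical illustration (the timestamps, clip length, and handshake window are invented): it samples frames from a clip at two rates and shows how a half-second event can fall entirely between sparsely sampled frames.

```python
# Illustrative only: shows how sparse, uniform frame sampling can skip a
# brief event. All timestamps and the event window are invented.

def sample_times(duration_s: float, fps_sampled: float) -> list[float]:
    """Timestamps (in seconds) of the frames a uniform sampler would keep."""
    step = 1.0 / fps_sampled
    times, t = [], 0.0
    while t < duration_s:
        times.append(round(t, 3))
        t += step
    return times

def event_captured(times: list[float], start: float, end: float) -> bool:
    """True if any sampled frame falls inside the event's time window."""
    return any(start <= t <= end for t in times)

clip_len = 60.0                    # a one-minute news clip
handshake = (12.30, 12.75)         # a 0.45-second handshake

sparse = sample_times(clip_len, fps_sampled=0.5)   # one frame every two seconds
dense = sample_times(clip_len, fps_sampled=5.0)    # five frames per second

print(event_captured(sparse, *handshake))  # False: the handshake is missed entirely
print(event_captured(dense, *handshake))   # True: denser sampling catches it
```

Denser sampling reduces the risk but raises compute cost, which is why sampling strategy is a real design decision for video understanding models rather than a detail.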
Pressed again on the industry's faith in unstructured data at scale, Tevissen holds his line. From his perspective, video AI is still in its infancy: it must be built on a strong foundation before it can "graduate" to more chaotic inputs. "You must first build the core of your model with very structured, quality data. We haven't reached this point yet for video content. We are still too early in developing video understanding models, so we still need very qualitative and well-annotated data."
Looking ahead, Tevissen sees the field branching along two clear paths. The near-term future will be about creating more holistic models that incorporate other "senses," especially audio, to gain a more complete understanding of a scene. The long-term vision is about building ecosystems where multiple, specialized AI agents collaborate on multi-layered creative tasks. "When these agents can talk to each other," he concludes, "that's when we'll see the real power this technology holds."