
Netflix’s AI Deletes Reality

Netflix just released an AI that doesn't just erase actors from scenes—it erases their impact on reality itself. This groundbreaking open-source tool is changing video editing forever, and we break down how it works.


TL;DR / Key Takeaways

- Netflix, in collaboration with INSAIT, has open-sourced VOID (Video Object and Interaction Deletion) under an Apache 2.0 license.
- Unlike standard inpainting tools, VOID deletes an object *and* its physical consequences, eliminating "ghost interactions" like pins toppling for a bowling ball that is no longer there.
- A two-pass pipeline pairs a Vision Language Model and Meta's SAM 2 for causal reasoning with a fine-tuned video diffusion model that rewrites the scene as if the object never existed.
- Hands-on tests show near-flawless results on occlusion-heavy scenes, clear limits on choreographed human interaction, and a setup that demands 40GB+ of VRAM and real patience.

The "Ghost in the Machine" is Dead

Existing AI video tools excel at erasing objects, but they routinely fail at deleting the consequences of those objects. This fundamental flaw creates jarring ghost interactions, where the physical effects of a removed item inexplicably persist. Consider a bowling ball: remove it from a scene, and the pins still topple over for no discernible reason. Erase a person making a smoothie, and the blender continues to spin and churn, devoid of an operator. Current models merely patch pixels, fixing appearance while utterly ignoring the underlying physics and causal relationships of the surrounding environment. They are content-aware fill on steroids, but little more.

Netflix just released VOID (Video Object and Interaction Deletion), a groundbreaking open-source AI framework that directly confronts this pervasive problem. VOID doesn't just paint over missing pixels; it intelligently rewrites the scene's physics, generating a counterfactual reality where the removed object never existed. This innovative model understands cause and effect, modifying video content based on the absence of specific elements to ensure logical consistency. It promises to eliminate the implausible remnants left by prior technologies.

Released on April 3, 2026, under an Apache 2.0 license, and developed in collaboration with INSAIT, VOID represents a monumental leap beyond simple video inpainting. This is a paradigm shift, transitioning from cosmetic pixel-level adjustments to sophisticated causal reasoning within video. Instead of merely guessing what lies behind a removed object, VOID’s two-pass reasoning system first identifies what else would be causally affected by its absence.

During its initial reasoning phase, VOID employs a Vision Language Model and Meta's SAM 2 (Segment Anything Model 2) to not only track the object for removal but also to identify all causally affected elements. It then constructs a "quadmask," a detailed map that instructs the subsequent video diffusion model not just where to erase, but precisely where to rewrite the physics of the surrounding area. Trained on synthetic paired data generated using Google's Kubric and HUMOTO, VOID learned the intricate relationships between object presence and environmental impact. This meticulous approach allows VOID to generate footage that is not only visually coherent but also physically consistent, redefining the possibilities for dynamic video manipulation and production workflows.

Beyond Pixels: An AI That Understands Physics


Netflix’s VOID framework redefines video object removal, transcending simple pixel erasure to fundamentally reimagine a scene’s physics. Unlike standard AI tools that merely attempt to fill a void, VOID generates a counterfactual reality, meticulously recreating the video as if the target object or person never existed. This innovative approach directly tackles the pervasive "ghost interaction" problem, where removed elements leave behind inexplicable physical consequences, such as falling pins without a bowling ball or a spinning blender with no one operating it.

VOID initiates its sophisticated two-pass process with a crucial reasoning phase. Employing a Vision Language Model alongside Meta’s SAM 2 (Segment Anything Model 2), the AI meticulously analyzes the entire scene. It doesn't just identify the object for removal; it critically asks, "If I remove this, what else changes?" This query drives the model to pinpoint all other elements in the scene that would be causally affected by the target object’s absence. For instance, removing a single domino from a stack prompts VOID to identify all subsequent dominoes as physically interdependent, requiring a complete re-simulation of their interaction.
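
In code, that causal query might look something like the sketch below, written against the google-generativeai client (the setup section later notes a Gemini API key requirement). The model name, prompt wording, and output format are illustrative assumptions, not VOID's actual prompt.

```python
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name

# Video uploads are processed asynchronously by the File API.
clip = genai.upload_file("clips/bowling.mp4")
while clip.state.name == "PROCESSING":
    time.sleep(2)
    clip = genai.get_file(clip.name)

response = model.generate_content([
    clip,
    "The bowling ball will be removed from this video. List every other "
    "object whose motion or state is caused by it, as a JSON array of names.",
])
affected = response.text  # e.g. '["pin_1", "pin_2", ...]', fed to segmentation
```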

This analytical step culminates in the creation of a quadmask, a highly precise, AI-generated map. This quadmask serves as a critical instructional guide for the subsequent video diffusion model. It dictates not only where pixels must be erased to remove the target object but, crucially, where the physics of the surrounding environment must be entirely rewritten. The map directs the model to alter motions, forces, and inter-object relationships in a physically plausible manner, ensuring the regenerated scene maintains absolute verisimilitude.

This methodology marks a profound paradigm shift from conventional AI video inpainting. Older content-aware fill algorithms operate solely on pattern recognition, guessing pixels based on surrounding visual data without any understanding of physical laws. VOID, however, demonstrates a rudimentary but powerful form of world understanding, grasping the intricate cause-and-effect relationships inherent in physical interactions. Its extensive training on synthetic environments, like Google’s Kubric and HUMOTO, provided vast paired datasets. These datasets included "before" and "after" versions of thousands of physics simulations, one with an interaction and one where the object was never present.

By learning from these meticulously crafted synthetic realities, VOID developed the capacity to infer the precise relationship between an object’s presence and its profound impact on the environment. This deep understanding allows VOID to produce coherent, physically consistent video without the tell-tale signs of AI manipulation, moving beyond surface-level visual fixes to a deeper, physics-aware reconstruction of reality.

Inside the Two-Pass Pipeline

VOID’s innovative approach relies on a two-pass system to achieve its physics-aware deletions, fundamentally altering a scene’s reality. This sophisticated pipeline moves beyond simple pixel manipulation, first understanding the scene’s causal fabric, then intelligently reconstructing it with fidelity.

The initial Reasoning Phase leverages a powerful combination of advanced AI models. A Vision Language Model, akin to Google’s Gemini, meticulously analyzes the scene to interpret complex context, identify potential causal relationships, and understand the object’s role. Concurrently, Meta’s Segment Anything Model 2 (SAM 2) precisely identifies and tracks the target object across every frame, creating a pixel-perfect mask for its removal.
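
To make the tracking half concrete, here is a minimal sketch built on Meta's published sam2 package; the checkpoint paths, frames directory, and click prompt are placeholders, and VOID's actual wiring may differ.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",   # config shipped with sam2
    "checkpoints/sam2.1_hiera_large.pt",    # downloaded checkpoint
)

with torch.inference_mode():
    # init_state accepts a directory of JPEG frames extracted from the clip.
    state = predictor.init_state(video_path="clips/bowling_frames/")
    # Prompt the target object with a single foreground click on frame 0.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[480, 320]], dtype=np.float32),  # (x, y) on the ball
        labels=np.array([1], dtype=np.int32),             # 1 = foreground
    )
    # Propagate the prompt forward to get a mask for every frame.
    target_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        target_masks[frame_idx] = (mask_logits[0, 0] > 0).cpu().numpy()
```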

During this crucial phase, the AI doesn't merely locate pixels for erasure. It actively queries what fundamental changes would occur if the object never existed, moving beyond visual appearance to physical consequence. This process culminates in the generation of a specialized "quadmask," a detailed map that instructs the subsequent diffusion model not only where to erase pixels but, critically, where to rewrite the physics and interactions of the surrounding environment.
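
The article doesn't spell out the quadmask's exact encoding, so the sketch below assumes one label per pixel per frame; the four labels, the dilation-based blending band, and its width are all illustrative choices.

```python
import numpy as np
from scipy.ndimage import binary_dilation

# Assumed quadmask semantics (not confirmed by the paper):
#   0 = keep untouched            1 = erase (the target object)
#   2 = rewrite physics           3 = blending band around the edits
def build_quadmask(target_mask, affected_masks, band_px=8):
    quad = np.zeros(target_mask.shape, dtype=np.uint8)  # 0: keep
    for mask in affected_masks:                         # from the causal query
        quad[mask] = 2                                  # 2: re-simulate motion
    quad[target_mask] = 1                               # 1: delete outright
    # A dilated border gives the diffusion model room to blend its edits
    # into the untouched pixels around them.
    band = binary_dilation(quad > 0, iterations=band_px) & (quad == 0)
    quad[band] = 3
    return quad  # one such map per frame, later encoded as quadmask_0.mp4
```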

Following this deep reasoning, the Generation and Refinement Phase takes over. A robust video diffusion model, specifically Alibaba’s fine-tuned CogVideoX-Fun-V1.5-5b-InP, generates the new footage. This model synthesizes the counterfactual reality based on the quadmask’s intricate instructions, intelligently filling the void left by the removed object while maintaining a consistent visual aesthetic.
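
The repository's real entry points aren't reproduced in this article, so the driver below is purely hypothetical: `VoidInpaintPipeline`, the model id, and every argument are illustrative stand-ins for whatever scripts the repo actually ships.

```python
from void import VoidInpaintPipeline  # hypothetical import

# Hypothetical shape of the generation pass; none of these names are
# taken from the actual repository.
pipe = VoidInpaintPipeline.from_pretrained(
    "netflix/void-cogvideox-fun-v1.5-5b-inp"  # placeholder model id
).to("cuda")

result = pipe(
    video="clips/bowling.mp4",
    quadmask="masks/quadmask_0.mp4",  # output of the reasoning pass
    num_inference_steps=50,
)
result.save("clips/bowling_no_ball.mp4")
```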

Diffusion models, while powerful, can sometimes introduce subtle visual inconsistencies or shape distortions in generated content. To combat this, VOID incorporates an optional yet vital refinement step. It employs a technique involving 'flow-warped noise' to lock remaining objects into their correct shapes and positions, ensuring temporal consistency. This process makes them feel solid and unwavering, even as the scene’s underlying physics have been radically altered.
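
Netflix's exact refinement recipe isn't detailed here, but flow-warped noise is a known trick: estimate optical flow between consecutive frames and drag the previous frame's noise along it, so static objects see temporally coherent noise. A sketch using torchvision's RAFT model, with the function name and tensor conventions our own:

```python
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

raft = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()

@torch.no_grad()
def flow_warped_noise(next_frame, prev_frame, prev_noise):
    """Frames: (1, 3, H, W), preprocessed per Raft_Large_Weights.transforms();
    prev_noise: (1, C, H, W), the noise used for the previous frame."""
    # For each pixel of next_frame, estimate where it sat in prev_frame.
    flow = raft(next_frame, prev_frame)[-1]  # finest prediction, (1, 2, H, W)
    _, _, h, w = flow.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    src = torch.stack((xs, ys), dim=-1) + flow[0].permute(1, 2, 0)
    # grid_sample expects sampling positions normalized to [-1, 1].
    src[..., 0] = src[..., 0] / (w - 1) * 2 - 1
    src[..., 1] = src[..., 1] / (h - 1) * 2 - 1
    return F.grid_sample(prev_noise, src.unsqueeze(0), align_corners=True)
```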

VOID’s unparalleled capability stems from its highly effective hybrid architecture, integrating cutting-edge models from diverse industry leaders. This collaborative approach strategically combines:

- Meta’s SAM 2 for precise object segmentation and tracking.
- A Gemini-like Vision Language Model for deep contextual understanding and causal inference.
- Alibaba’s CogVideoX for high-quality, interaction-aware video generation.

Further technical details and the open-source implementation can be explored via Netflix/void-model - GitHub. This blend of specialized AI components creates a remarkably coherent and physically plausible output.

How to Teach an AI What Never Happened

Training VOID required overcoming a fundamental data problem: how to teach an AI about events that *didn't* happen. Real-world footage cannot provide before-and-after comparisons of a car crash that *didn't* occur, or a glass that *never* shattered. This absence of ground truth for counterfactual realities posed a significant hurdle for traditional supervised learning.

Netflix and INSAIT ingeniously circumvented this limitation using synthetic environments. Researchers leveraged platforms like Google's Kubric to generate thousands of meticulously controlled physics simulations. These digital sandboxes allowed for the creation of perfectly paired video sequences.

Each pair consisted of two versions of the same scene: one depicting an object interacting with its environment (e.g., a ball hitting pins), and another where the object was entirely absent, with all subsequent physical effects correctly removed. By feeding the AI both versions side-by-side, it learned the intricate causal relationships between an object's presence and its precise physical impact on the surrounding scene.
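
As a rough illustration, the sketch below renders the same toy scene twice in Kubric, once with the ball and once without. It follows the spirit of Kubric's published examples, but constructor arguments and helper names may differ across versions, and the cube "pins" are purely illustrative.

```python
import kubric as kb
from kubric.renderer.blender import Blender
from kubric.simulator import PyBullet

def render_pair_member(with_ball: bool, out_dir: str):
    scene = kb.Scene(resolution=(256, 256), frame_start=0, frame_end=48)
    scene += kb.Cube(name="floor", scale=(10, 10, 0.1),
                     position=(0, 0, -0.1), static=True)
    for i in range(3):  # identical "pins" in both renders
        scene += kb.Cube(name=f"pin_{i}", scale=(0.1, 0.1, 0.5),
                         position=(0.4 * i - 0.4, 2, 0.5))
    if with_ball:
        # Factual version: the ball rolls in and topples the pins. In the
        # counterfactual version it never exists, so the pins stay upright.
        scene += kb.Sphere(name="ball", scale=0.4,
                           position=(0, -3, 0.4), velocity=(0, 6, 0))
    scene += kb.DirectionalLight(name="sun", position=(-1, -0.5, 3),
                                 look_at=(0, 0, 0), intensity=1.5)
    scene.camera = kb.PerspectiveCamera(name="camera", position=(6, -4, 3),
                                        look_at=(0, 0, 0.5))
    PyBullet(scene).run()             # rigid-body simulation sets keyframes
    frames = Blender(scene).render()  # renders every frame to arrays
    for t, rgba in enumerate(frames["rgba"]):
        kb.write_png(rgba, f"{out_dir}/frame_{t:04d}.png")

render_pair_member(with_ball=True, out_dir="pairs/0001/factual")
render_pair_member(with_ball=False, out_dir="pairs/0001/counterfactual")
```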

This extensive synthetic dataset enabled VOID to internalize the complex interplay of forces and reactions, developing an intuitive understanding of physical causality. For more intricate scenarios involving human-object interactions, the team further utilized specialized datasets like HUMOTO, rendered in Blender, ensuring the AI could accurately model nuanced movements and their consequences in a counterfactual reality.

The Open-Source Hurdle: Our Hands-On Test


Netflix’s release of VOID as an open-source framework, while revolutionary, presents significant practical hurdles for users attempting to implement it. Better Stack’s hands-on experience revealed a landscape far from plug-and-play, underscoring the complexities inherent in cutting-edge AI deployment. Setup is "not straightforward at all," demanding considerable technical acumen.

Documentation gaps represent a primary stumbling block. The official GitHub repository frequently omits crucial details and contains misleading information, leading to failed commands and obscure errors. For instance, the initial setup instructions fail to mention that the segmentation step explicitly requires the SAM 3 model, Meta's newer successor to the SAM 2 described in the paper, and a critical dependency for the whole procedure.

Strict naming conventions further complicate the process. Quadmasks, central to VOID’s operation, demand precise naming as `quadmask_0.mp4` to function correctly. Without these explicit guidelines, users encounter silent failures or unexpected behavior, necessitating deep dives into the codebase or external resources to resolve seemingly minor issues.
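
A small pre-flight check can surface the naming problem before a run silently fails; the `masks/` directory below is an assumption, while the `quadmask_<index>.mp4` pattern is the convention described above.

```python
from pathlib import Path

mask_dir = Path("masks")  # assumed location of the generated quadmasks
for i, path in enumerate(sorted(mask_dir.glob("*.mp4"))):
    expected = f"quadmask_{i}.mp4"
    if path.name != expected:
        raise SystemExit(f"Rename {path.name} -> {expected}: "
                         "VOID matches these names exactly.")
```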

The hardware requirements alone place VOID beyond the reach of most local setups. The model demands a beefy GPU with 40GB+ of VRAM, making an NVIDIA H100 or equivalent all but mandatory for efficient processing. That pushes most users onto cloud GPU platforms like RunPod, which adds another layer of setup complexity around container configuration and port exposure (e.g., 8998 for web apps).

Beyond hardware, access itself is gated and multi-layered. Users require multiple API keys and tokens to even begin inference. A Hugging Face token is essential for downloading the various models, while access to the SAM 3 repository is restricted, requiring users to request permission. Furthermore, the initial segmentation step, which leverages a Vision Language Model for precise pose estimation and quadmask generation, demands a Gemini API key. This intricate credential requirement underlines that VOID, in its current open-source form, targets expert users with robust infrastructure and a high tolerance for configuration. It is far from a simple, accessible tool for casual experimentation.
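
A few lines of Python can at least fail fast when credentials are missing; the environment-variable names here are assumptions, not the repo's documented configuration.

```python
import os

from huggingface_hub import login

# Variable names are assumptions; the credentials themselves are the
# ones the setup above requires.
for var in ("HF_TOKEN", "GEMINI_API_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"Set {var} before running VOID.")

login(token=os.environ["HF_TOKEN"])  # authorizes gated downloads, incl. SAM 3
```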

Failure & Success: The Matrix Test

Netflix’s VOID AI faced a pivotal test in one of cinema’s most iconic scenes: removing Neo from his sparring match with Morpheus in *The Matrix*. The model flawlessly excised Neo’s physical presence, demonstrating its remarkable ability to erase an actor with pixel-perfect precision. This initial success highlighted VOID’s core capability in generating a counterfactual reality where the target object never existed.

However, the resulting footage revealed the current boundaries of even this sophisticated AI. Morpheus continued his intricate martial arts choreography, throwing punches and kicks into an empty dojo. The effect was unsettling: Morpheus appeared to be engaged in a desperate fight against an invisible opponent, creating an undeniable ghost interaction that VOID explicitly aims to eliminate.

This outcome underscores a critical distinction. VOID excels at rewriting the physics of objects directly affected by a removal – like a bowling ball's impact on pins. Yet, Morpheus's movements were not merely physical reactions; they were highly choreographed, intentional actions directly *dependent* on Neo's presence and performance. For VOID to plausibly rewrite Morpheus’s actions, it would need to infer an entirely new, non-combative performance, fundamentally altering the scene's narrative and motion.

The AI, despite its groundbreaking prowess in understanding causal dependencies, cannot invent entirely new human intent or rewrite a character’s entire performance from scratch. It operates within the inherent logic of the source footage, capable of modifying physical interactions but not radically re-scripting complex human behaviors. This limitation, explored further in research like VOID: Video Object and Interaction Deletion (arXiv), marks both VOID’s power and its current ceiling. It is a formidable tool, but not yet magic.

Hitting the High Note: The La La Land Test

A triumphant demonstration of VOID’s capabilities arrived with the La La Land test, where Better Stack’s team challenged the model to remove Emma Stone from a vibrant dance sequence with Ryan Gosling. This particular scene, rich with dynamic movement and complex occlusions as the characters weave around each other, presented a stringent test of VOID's ability to maintain continuity and rewrite reality without leaving artifacts. The result was remarkably seamless, presenting a compelling vision of what the AI can achieve under optimal conditions.

VOID’s output for the La La Land scene proved nearly flawless. As Ryan Gosling moved through the frame, passing directly in front of where Emma Stone had been, the AI maintained perfect continuity and a ghost-free reconstruction. The model accurately inferred the obscured background, including intricate details of the set and lighting, seamlessly stitching them into the foreground. Crucially, none of the "ghost interactions"—like lingering shadows or inexplicable environmental shifts—that plagued earlier, more physically entangled attempts manifested here.

This resounding success offers critical insight into VOID’s current strengths. Unlike the direct physical cause-and-effect scenarios in *The Matrix*, where Neo’s punches fundamentally altered his opponent’s state, the La La Land dance primarily involved two characters moving in close proximity with minimal direct physical interaction. The core challenge became cleanly separating these two moving figures and accurately filling complex occlusions, rather than re-simulating physical consequences.

The model’s ability to generate a convincing counterfactual reality where Emma Stone never existed in that dance, while preserving Ryan Gosling's fluid movements and the scene's romantic ambiance, stands as a prime example of its immense potential. This test demonstrates VOID's robust performance in scenarios prioritizing visual continuity and the disentanglement of moving, non-interactive elements, offering a compelling glimpse into its future applications for cinematic editing and visual effects.

Into the Uncanny Valley: The Titanic Test


Netflix’s VOID faced its most romantic challenge: erasing Leonardo DiCaprio from the iconic 'I'm flying' scene in *Titanic*. The Better Stack team attempted to remove Jack Dawson, leaving Rose DeWitt Bukater alone at the bow of the ship. While VOID largely succeeded in vanishing DiCaprio’s figure, the results were decidedly mixed, revealing the persistent challenges of even advanced AI.

Creepy artifacts marred the otherwise impressive deletion. A disembodied hand, clearly belonging to DiCaprio, remained eerily clasped around Kate Winslet’s arm. This phantom limb underscored a critical dependency: VOID’s powerful physics-aware generation relies heavily on precise initial segmentation. The user's imperfect mask, rather than a failing of VOID's core physics engine, likely caused this persistent "ghost" interaction.

The incident highlights a crucial user-side hurdle. Even with robust tools like SAM 2 for tracking, generating a pixel-perfect initial mask across complex, moving scenes remains a challenging manual or semi-manual task. Any imprecision in defining the object to be removed directly impacts the quality of VOID's output, demonstrating that even groundbreaking AI requires meticulous input.

Beyond the phantom hand, a more subtle, yet unsettling, artifact emerged. Winslet’s face displayed a slight morphing, a common phenomenon in AI-generated video where facial features subtly distort or shift. This minute alteration pushed the result directly into the uncanny valley, where the image is almost human-like but just enough off to trigger unease. It serves as a stark reminder that while VOID can reshape reality, achieving perfect photorealism, especially with human subjects, remains an elusive goal.

How VOID Crushes the Competition

VOID fundamentally redefines the landscape of video inpainting, dramatically outperforming both commercial giants like RunwayML and Adobe, and open-source alternatives such as ProPainter and DiffuEraser. While these tools excel at simple object removal or static scene manipulation, their limitations become starkly apparent when confronted with physics-dependent interactions or complex occlusions. VOID’s core innovation lies in its ability to understand and rewrite cause-and-effect, not just fill pixels.

Human preference data backs up VOID’s superior fidelity and realism. A comprehensive study, detailed in Netflix’s original paper, revealed that users preferred VOID’s output 64.8% of the time over results from a suite of leading competitors, including state-of-the-art methods. This decisive preference underscores its breakthrough capability in generating believable, counterfactual realities, where the absence of an object feels natural and physically consistent.

VOID’s true competitive edge isn't merely higher quality, but its specific mastery over complex scenarios that baffle other models. Where competitors often leave "ghost interactions"—like a blender inexplicably spinning after a person is removed, or water splashing without a diver—VOID meticulously reconstructs the scene's physics. This enables the seamless deletion of objects even in highly dynamic environments, ensuring remaining elements react as if the removed object never existed, preserving physical plausibility across frames. This unique ability to infer and simulate missing physical interactions sets it apart from traditional content-aware fill approaches.

Netflix's decision to release VOID under an Apache 2.0 open-source license is a strategic maneuver designed to accelerate adoption and establish it as an industry standard. This open approach fosters broad community development, allowing researchers and developers worldwide to build upon its sophisticated foundation, integrate it into new workflows, and even contribute improvements. By democratizing this advanced, physics-aware technology, Netflix aims to drive innovation across the entire video production and post-production ecosystem, potentially revolutionizing how content is created and modified. For further reading on its broader industry implications, see Netflix Launches VOID AI That Rewrites Video Scenes After Filming - Forbes. This move positions VOID not just as a tool, but as a foundational technology for the future of interactive video.

The Future of Film: Interactive & AI-Driven

VOID’s capabilities extend far beyond simple object removal, promising a radical shift in media production and consumption. Netflix, having open-sourced VOID, stands to benefit immensely from integrating such a powerful tool into its content pipeline. Imagine eliminating costly reshoots for minor continuity errors or removing unwanted background elements with unprecedented physical accuracy, saving millions in post-production costs.

Industry-wide, VOID unlocks new creative avenues. Filmmakers could iterate on scenes, testing different character compositions or object placements without ever needing to re-film. This digital malleability transforms the editing suite into a dynamic creation hub, where directors can truly sculpt a counterfactual reality for any given sequence.

Crucially, VOID redefines interactive storytelling. A future *Black Mirror: Bandersnatch* could dynamically alter character presence based on viewer choices, making narrative branches physically consistent. If a user chooses for a character to never appear, VOID ensures their absence is not just visual but affects the scene’s physics and other characters’ interactions, deepening immersion.

This level of control over visual narratives carries profound implications. Netflix’s framework provides an unmatched "undo" button for visual effects, fundamentally changing workflows for VFX artists and editors. Removing a boom mic reflection or a misplaced prop becomes a precise, physics-aware operation, drastically reducing manual rotoscoping and inpainting efforts.

However, the power to seamlessly rewrite visual history presents a significant ethical dilemma. A tool capable of creating such convincing alternate realities also becomes a potent instrument for disinformation. The same technology that removes an actor from a scene can just as easily fabricate their presence, fueling the proliferation of deepfakes and eroding trust in visual media.

Safeguards, such as robust content authentication and digital watermarking, will become imperative. As AI-generated content becomes indistinguishable from reality, the industry must proactively develop mechanisms to verify media provenance. VOID represents a monumental leap in AI video manipulation, demanding both creative exploration and rigorous ethical consideration.

Frequently Asked Questions

What is Netflix's VOID model?

VOID (Video Object and Interaction Deletion) is an open-source AI framework from Netflix that removes objects or actors from video and intelligently rewrites the scene's physics to account for their absence, eliminating 'ghost interactions'.

How is VOID different from other AI video editors?

While other tools erase pixels, they often leave behind the physical consequences of the removed object (e.g., a shadow remains). VOID uses a two-pass system to understand cause-and-effect, rewriting the scene as if the object never existed.

Can I run the VOID model on my personal computer?

It's unlikely for most users. VOID requires a powerful cloud GPU with at least 40GB of VRAM, such as an NVIDIA A100 or H100, making it inaccessible for standard consumer hardware.

Is Netflix using VOID in its own movies and shows?

Netflix has released VOID as a research project and has not yet announced official plans to integrate it into its production pipelines. However, its potential for post-production cost savings is significant.


Topics Covered

#netflix, #video editing, #generative ai, #open source, #void model