1 University of Maryland, College Park, 2 University of Southern California, 3 Massachusetts Institute of Technology. *Equal Contributions. †Corresponding Author.
What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20–50 training examples with no architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
Qualitative Comparison with Baselines
We compare our method against three strong baselines built on the Wan architecture with 14 billion parameters: Wan2.2-14B-I2V [28], VACE [12], and SkyReels-A2 [7].
Wan2.2-14B-I2V is our base I2V model, to which we apply our lightweight adaptation to invoke its innate subject-mixing and scene-transition capabilities.
Text Prompt: "Professional-quality video with rich details. The video features a charming Teddy Bear sipping apple juice from a bottle using its hand, while delicately holds a vibrant red rose using its hand, admiring its beauty, perhaps as an offering or a gesture of affection"
Excessive Number of References: 5 References
Text Prompt: "A video scene unfolds in a vast, golden wheat field under a blue sky, with a red smoke flare billowing from a supply crate in the mid-ground. In the foreground, the character Wukong, in his ornate armor, stands next to a soldier wearing a helmet and tactical gear. The soldier holds a blue iPhone, gesturing with it, while Wukong holds a VR headset. The two characters are positioned side-by-side, appearing to be in an animated discussion, comparing and talking about the two different technology products they are holding."
Text Prompt: "A man wearing a short-sleeved shirt gently hands over a toy rocket to another man dressed in a black blazer over a dark graphic T-shirt. The man dressed in a black blazer looks at the rocket with mild curiosity and intrigue."
Results Across Diverse Applications
For our 81-frame generated videos, the first 4 frames serve as subject-mixing inputs, aligned with the temporal compression ratio of the 3D VAE in Wan2.2-I2V-A14B, while the remaining 77 frames comprise the clean, customized video output.
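The frame bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `split_generated_clip` and the parameter `temporal_compression` are hypothetical, and the value 4 is taken from the statement that the 4 subject-mixing frames match the VAE's temporal compression ratio.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_generated_clip(
    frames: Sequence[Frame],
    temporal_compression: int = 4,  # assumed from the text: 4 mixing frames per clip
) -> Tuple[List[Frame], List[Frame]]:
    """Split a generated clip into subject-mixing frames and the clean output.

    The first `temporal_compression` frames are treated as the subject-mixing
    segment; everything after them is the customized video.
    """
    if temporal_compression <= 0:
        raise ValueError("temporal_compression must be positive")
    if len(frames) <= temporal_compression:
        raise ValueError("clip is too short to contain any output frames")
    mixing = list(frames[:temporal_compression])
    output = list(frames[temporal_compression:])
    return mixing, output


# Usage: an 81-frame clip, with integers standing in for decoded frames.
clip = list(range(81))
mixing_frames, output_frames = split_generated_clip(clip)
print(len(mixing_frames), len(output_frames))
```

For an 81-frame clip this leaves 81 - 4 = 77 output frames, matching the counts stated above.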
Video 1: Five-Character Anime Crossover (6 References)
Text Prompt: "The video begins with a wide shot of a cherry blossom-lined street, petals gently floating through the air as sunlight filters between the trees. Five iconic anime characters walk side by side from the distance toward the camera. On the far left, Naruto Uzumaki in an orange jumpsuit walks confidently with arms folded. Next to Naruto Uzumaki, Monkey D. Luffy wearing a red open shirt, blue shorts, and a straw hat smiles energetically, waving with one hand while a cheerful yellow creature, Pikachu, sits on his shoulder, flicking its tail with sparks of playful energy. To the right of Monkey D. Luffy, Goku with spiky black hair and a cheerful expression, wearing an orange martial arts gi with a blue undershirt, belt, wristbands, and boots. To the right of Goku, Mario with a red cap featuring the letter "M," a matching red shirt, blue overalls with yellow buttons, white gloves, and brown shoes, is clapping while walking. As they move closer, the camera slowly pans upward, capturing their friendly gestures and animated expressions under the falling petals. The motion and color transitions highlight a sense of camaraderie and lighthearted adventure against the serene spring backdrop."
Reference Image
Generated Video
Video 2: Robot Manipulation, Liquid Pouring (3 References)
Text Prompt: "The right robotic hand holds a transparent glass bottle partially filled with yellow liquid resembling juice. The camera focuses on the intricate mechanical joints as the right robotic hand tilts the bottle with fine motor control, pouring the liquid smoothly into the red mug on the small table without spilling."
Reference Image
Generated Video
Video 3: Aerial View Simulation (3 References)
Text Prompt: "A high-angle, third-person aerial tracking shot follows a silver Tesla Cybertruck as it drives along a winding, S-curved asphalt road through a dense, green forest. Flying in perfect formation directly above the vehicle is a person wearing a blue and black wingsuit. The camera maintains its high vantage point, capturing both the Cybertruck navigating the curves of the road below and the wingsuit flyer mirroring its exact path in the air."
Reference Image
Generated Video
Video 4: Multi-Vehicle Autonomous Driving (3 References)
Text Prompt: "The scene begins from a **first-person perspective inside a modern autonomous electric car**, showing the sleek interior with a large digital dashboard displaying a navigation route and vehicle speed. An ambulance car has a white body with red horizontal stripes and medical symbols (blue star of life) on the sides, with flashing red and blue lights leads the flow of traffic. On the left, a red convertible car is following the ambulance car."
Reference Image
Generated Video
Video 5: Filmmaking, Triple-Product Group Showcase (4 References)
Text Prompt: "The video begins with a wide shot of three young individuals standing side by side in front of an aged stone corridor. Each of them is dressed in dark robes with red-and-gold accents, creating a cohesive visual identity. The young man on the left, with short dark hair and round glasses, raises his left hand to present a sleek black Apple iPhone—its reflective dual cameras glint under natural light, highlighting precision engineering. The woman in the center, her light brown hair tied loosely, holds a mixed reality headset with both hands, turning it slightly to show its curved glass visor and fabric headband as light passes across its surface. The young man on the right, with medium-length hair, places a silver MacBook on the moss-covered stone ledge, opening it partway to reveal its metallic sheen and engraved logo."
Reference Image
Generated Video
Video 6: Filmmaking, Human-Animal Interaction (2 References)
Text Prompt: "Cinematic medium shot tracking through a dense, lush jungle environment filled with large trees and green foliage. A large, powerfully built tiger with orange and black stripes walks steadily through the undergrowth. Riding bareback on the tiger is the muscular, shirtless man with long hair, wearing simple brown pants. He sits upright with a commanding and confident posture, moving effortlessly with the animal. The camera follows them, capturing the man's focused expression and his clear dominion over the powerful predator, portraying him as the undisputed king of the forest."
Reference Image
Generated Video
Video 7: Underwater Exploration (3 References)
Text Prompt: "The video begins from a first-person perspective, simulating the view of someone snorkeling underwater. The setting is a vibrant, clear-blue coral reef, with sunlight filtering through the water's surface from above. As the viewer "swims" forward, a large shark, suddenly glides past from one side of the frame to the other. Following this, a sea turtle with a brown patterned shell, gracefully swims by in the opposite direction."
Reference Image
Generated Video
Video 8: Product Showcasing (2 References)
Text Prompt: "A medium shot features Akira Toriyama, smiling and wearing glasses, holding the red-haired Super Saiyan God Goku figure. The figure, depicted in its orange gi, stands just a bit larger than Akira's hands, appearing slightly larger than his hand as he holds it. Toriyama gestures towards the figure, looking directly at the camera with an enthusiastic expression, praising it by saying, "it is very good," clearly promoting it for sale to the audience."
Reference Image
Generated Video
Video 9: Robot-Human Interaction (2 References)
Text Prompt: "A wide shot captures a presentation stage. On the left, Sam Altman stands wearing a grey sweatshirt and dark jeans. On the right, Microsoft CEO Satya Nadella stands wearing a blue t-shirt and dark jeans. Positioned in the center of the stage, between the two men, is a black and white humanoid robot. The robot begins to move, turning and walking smoothly towards Sam Altman on the left. The robot stops in front of him and extends its right hand. Sam Altman reaches out with his right hand, and they engage in a handshake, while Satya Nadella looks on from the right side of the stage."
Reference Image
Generated Video
Video 10: Historical Figures Meeting (3 References)
Text Prompt: "The video opens in a bright, modern conference room with a long white table. The entire scene is rendered in black and white. Seated at the table are Albert Einstein, and Mahatma Gandhi. The camera focuses on a medium shot of the two figures, slowly panning and cutting between them as they engage in an earnest conversation about world peace. Einstein is shown gesturing thoughtfully, while Gandhi, wearing his glasses and robes, speaks with a calm and focused demeanor, their expressions conveying the importance of their discussion. The stark, contemporary setting contrasts with the historical, black and white figures, highlighting the timeless significance of their ideas."
Reference Image
Generated Video
Video 11: First-Person Driving (2 References)
Text Prompt: "The video opens from a first-person driver's perspective, navigating a multi-lane highway during a heavy downpour. The road is slick with rain, and visibility is poor due to fog and the spray from surrounding cars. Suddenly, a bright red Ferrari F8 Tributo aggressively accelerates from an adjacent lane. The sports car swerves sharply, cutting directly in front of the camera's vehicle with very little space. The camera focuses on the rear of the red Ferrari, its distinctive twin taillights glowing, as it kicks up a massive plume of water and road spray, momentarily blinding the driver before it speeds off into the misty conditions."
Reference Image
Generated Video
Video 12: Anime Battle Scene (3 References)
Text Prompt: "Film-quality, high-frame-rate animation with dynamic energy effects and strong motion emphasis. The scene takes place in an otherworldly alien landscape featuring green skies, floating clouds, and sparse blue trees extending across the rocky terrain. In the center-left, a muscular fighter with spiky blue hair and an orange martial arts uniform powers up, his aura glowing intensely with blue light that distorts the air around him. He thrusts his open palm forward, unleashing a rapid barrage of energy punches and shockwaves that ripple through the ground. On the right, a bald hero dressed in a yellow jumpsuit with red gloves and boots stands calmly with arms raised defensively, his white cape fluttering from the impact. The camera alternates between wide shots showing explosive energy waves colliding, and close-ups capturing the fierce expressions and precise hand movements of both fighters. The lighting shifts with each attack—blue flashes illuminating the scene while dust and debris swirl under their immense power, emphasizing their superhuman speed and strength in a high-intensity duel."
Reference Image
Generated Video
Video 13: Blending of Game and Real-world (2 References)
Text Prompt: "High-quality, cinematic video. The scene opens on a wide shot identical to the main image: Arthur Morgan, in his cowboy hat, sits atop his brown and white pinto horse on a grassy overlook, rifle in hand. He gazes over a valley toward distant, snow-capped mountains. A strange, deafening mechanical roar echoes through the valley. Arthur's head snaps up, his expression turning from calm to confused as he searches the sky. The camera cuts to follow the source of the noise: a modern United Airlines passenger jet, starkly out of place, flying erratically and dangerously low. Smoke billows from one of its engines. A medium shot captures Arthur's utter disbelief. The camera then tracks the airplane as it careens downward, disappearing behind a distant mountain ridge. A moment of silence is broken by a massive, fiery explosion, sending a plume of thick black smoke billowing into the sky. The video ends on a tight shot of Arthur's face, frozen in shock as he witnesses the impossible crash."
Reference Image
Generated Video
Video 14: Robot-Animal Interaction (3 References)
Text Prompt: "High-quality, cinematic video. The scene opens with a wide shot of the humanoid Tesla robot standing in the middle of a vast, rolling green field under a clear blue sky. A small, golden retriever puppy is sitting on the grass near the robot's feet. The camera zooms in to a medium shot as the robot slowly and deliberately raises its articulated hand and lowers it toward the puppy. A close-up shot follows, focusing on the robot's black and white mechanical fingers making gentle contact and stroking the puppy's soft, light-colored fur. The camera may pan slightly to show the puppy's reaction, such as wagging its tail or leaning into the touch."
Reference Image
Generated Video
Video 15: Character Showdown (3 References)
Text Prompt: "A wide shot establishes a bright, sunlit landscape of rolling green hills and a grassy field with a dirt path, under a blue sky with white clouds. The camera moves forward to focus on two small, 3D animated figures standing in the field. On the left, Crayon Shin-chan, wearing a yellow shirt, blue shorts, and a green cape, strikes a determined, heroic pose with his right fist raised. On the right, the figure of Nezha stands in a low, intense martial arts stance, ready for action. Both characters hold their distinct, dynamic poses against the peaceful, scenic backdrop."
Reference Image
Generated Video
Video 16: Action Figure Commercial (2 References)
Text Prompt: "A professional, commercial-style shot opens with the One Punch Man action figure standing heroically in the center of a desolate, destroyed city. The background is filled with smoldering rubble, debris, and the silhouettes of crumbling buildings. The camera slowly pans around the figure, highlighting its iconic yellow suit, red gloves, and flowing white cape. It then pushes in for a dramatic low-angle shot, focusing on the figure's powerful clenched-fist posture, capturing the meticulous detail of the product against the chaotic backdrop."
Reference Image
Generated Video
Video 17: First-Person Underwater Swimming (2 References)
Text Prompt: "A dynamic first-person point of view from a swimmer moving through clear, turquoise ocean water. Sunlight filters down from the surface, casting light patterns on the swimmer's arms as they pull through the water over a sandy seabed. As the swimmer glides forward, a large tiger shark is seen swimming gracefully in the mid-distance. In the same frame, a green sea turtle with a distinctive brown and orange shell glides calmly. The camera maintains this first-person perspective as the swimmer continues to swim forward, passing by the remarkable sight of the shark and turtle coexisting in the water."
Reference Image
Generated Video
Video 18: First-Person Shooter Gameplay (2 References)
Text Prompt: "The video begins from an intense, first-person perspective, capturing a player holding an assault rifle on a grassy slope overlooking a coastal settlement, reminiscent of a PUBG setting. The camera pans slightly, scanning the area, before a vibrant red and black sports car suddenly speeds into the frame on a path below. The player's view immediately snaps towards the car, bringing the rifle's iron sights into focus. The camera shakes violently with simulated recoil as the player fires at the vehicle, attempting to track and neutralize the moving target."
Reference Image
Generated Video
Video 19: Racing Game Simulation (3 References)
Text Prompt: "A thrilling first-person perspective from the cockpit of a race car, showing the driver's gloved hands gripping the steering wheel. The car speeds down a wide racetrack, rapidly closing in on two cars ahead: a white sedan and a yellow modified hatchback. These two cars are actively dueling for position, with the yellow modified hatchback seen attempting to weave and overtake the white sedan as the POV driver follows closely behind the battle."
Reference Image
Generated Video
Video 20: Driving Simulation (3 References)
Text Prompt: "A high-speed, first-person video captured from the driver's seat of a car, showing the dashboard and hands gripping the steering wheel. The camera pursues two other vehicles driving aggressively through city streets. The targets are a vibrant red supercar, seen from the rear with its distinctive louvered engine cover and wide black grilles, and a sleek grey sports coupe, also seen from the rear with its quad exhaust pipes. These two cars weave rapidly through traffic, accelerating and braking hard as if in a race, while the camera car attempts to follow closely behind them."
Reference Image
Generated Video
Video 21: Rare Cases in Autonomous Driving, with Aircraft (3 References)
Text Prompt: "The scene begins from a **first-person perspective inside a modern autonomous electric car**, showing the sleek interior with a large digital dashboard displaying a navigation route and vehicle speed. The car travels rapidly along a multi-lane single-direction highway under clear daylight. Ahead, an **ambulance with flashing red and blue emergency lights** speeds in the same direction. Far ahead, an F-22 Raptor fighter jet aircraft descends rapidly, deploying its landing gear and aligning precisely with the center of the highway. The car's sensors and autopilot react instantly, adjusting speed and lane position to accommodate the incoming aircraft."
Reference Image
Generated Video
Video 22: Lunar Encounter Scene (2 References)
Text Prompt: "High-definition, film quality. The video opens on the desolate, grey landscape of the Moon, with the Lunar Module and American flag visible in the background. An astronaut in a bulky white spacesuit is seen taking slow, bounding steps across the dusty surface. The camera pans slightly to reveal a second figure entering the frame: a tall, slender, grey alien with a large head and oversized black eyes. The alien walks steadily and directly towards the astronaut and then shakes the hand with the astronaut."
Reference Image
Generated Video
Video 23: Robotic Manipulation (2 References)
Text Prompt: "Film-quality, laboratory lighting with realistic robotic motion and object interaction. A **wooden cube block** rests on the tabletop in front of the robot. The camera captures a close-up as the robotic arm activates, its white and gray joints rotating smoothly while the orange-tipped gripper opens precisely. The robot's gripper lowers toward the wooden cube, aligning its claws with the edges for an accurate grasp. After gripping securely, the arm lifts the cube vertically and moves the cube above a **wooden tray**, gently lowering it into the center before releasing and retracting. The video emphasizes the robot's precision, dexterity, and coordinated motion control throughout the entire pick-and-place sequence."
Reference Image
Generated Video
Video 24: Game Simulation, Prehistoric Hunting Scene (3 References)
Text Prompt: "A cinematic shot, set in the vast, muddy, prehistoric landscape, captures a surreal scene. The cowboy is firmly seated on the back of the massive, roaring T-rex, using it as a mount. The camera, Arthur, tracks them from a low angle as the T-rex walks with heavy, purposeful steps towards the towering, long-necked Patagotitan mayorums (sauropods) seen in the background by the water. Arthur raises his revolver, takes aim at one of the Patagotitan mayorums, and fires, the action clear and focused."
Reference Image
Generated Video
Video 25: Product Handoff Scene (3 References)
Text Prompt: "A man wearing a red T-shirt layered over a long-sleeve purple shirt gently and slowly hands over a graphics card labeled "RTX 5090" to another man dressed in dark blue long-sleeved shirt and black pants. The man dressed in dark blue long-sleeved shirt and black pants looks at the graphics card labeled "RTX 5090" with mild curiosity and intrigue."
Reference Image
Generated Video
Video 26: Three-Vehicle Autonomous Driving (4 References)
Text Prompt: "The video begins from a first-person view inside an autonomous car speeding along a multi-lane same-direction highway. There are three vehicles driving in the same direction. Ahead, an ambulance with flashing red and blue lights leads the flow of traffic. A red Ferrari convertible sports car accelerates rapidly in the leftmost lane. On the left of the red Ferrari convertible sports car, a gas tanker truck has a large cylindrical metal tank mounted on the back, driving slowly on the road."
Reference Image
Generated Video
Video 27: Real-world, Dual-Product Demonstration (3 References)
Text Prompt: "Film-quality, studio-lit scene in a robotics lab environment. The young man, wearing a light blue button-up shirt, sits confidently at his workspace surrounded by robotics hardware, simulation software, and coding monitors. He uses his **left hand to prominently showcase a sleek black Apple iPhone**, its reflective surface and dual-camera module catching subtle light highlights. With his **right hand, he presents a mixed reality headset**, its curved visor and intricate strap details gleaming under the lab's warm illumination."
Reference Image
Generated Video
Video 28: Multi-Product Comparison (3 References)
Text Prompt: "In a vibrant shoe store, a woman holds up an Adidas ZX Flux sneaker with a colorful geometric pattern, while two men on either side showcase a neon green Nike Flyknit and another shoe. The focus remains on the Adidas sneaker, which appears to be the central point of discussion and analysis. The Nike Flyknit adds to the variety of footwear options presented in the video."
Reference Image
Generated Video
Video 29: Filmmaking, Dual-Product Showcase (3 References)
Text Prompt: "The young man, dressed in a dark sweater layered over a collared shirt and a red-and-gold striped tie, sits at a long wooden table with an open book in front of him. He raises his left hand to clearly demonstrate a sleek black Apple iPhone, showcasing its dual-camera design and reflective surface under the warm indoor light. In his right hand, he presents a futuristic mixed-reality headset with a curved glossy visor and soft fabric strap, angled to emphasize its modern design and immersive capability. The camera focuses closely on the contrast between the classic academic setting and the cutting-edge technology, capturing the subtle expressions on his face as he explains and interacts with both devices in detail."
Reference Image
Generated Video
Video 30: Road Scene Simulation (2 References)
Text Prompt: "Professional-quality video with rich details. In the background, a green military tank with visible treads and a white star insignia moves slowly from left to right across the road behind the stop sign. The tank's treads rotate smoothly, kicking up light dust as it passes, while the sunlight reflects subtly off its metal surface. Power lines, trees, and parked cars frame the street, giving a sense of scale and normalcy contrasted with the imposing motion of the armored vehicle. The camera remains steady as the tank continues its traversal, maintaining the stop sign prominently in the foreground."
Reference Image
Generated Video
Video 31: Dual-Object Handling (3 References)
Text Prompt: "An older man with curly gray hair, dressed in a brown blazer and red sweater, using his right hand to drink an apple juice while using his left hand to hold a Teddy Bear"