AdaState: Self-Evolving Anchors for Streaming Video Generation

Abstract

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.

Motivation

Attention focuses on the first frame and the recent content, accumulating over time.

Causal video diffusion models concentrate attention at the first and most recent cache positions. The first position acts as a privileged static anchor that locks the scene's composition, and this bias persists as the window size n grows.

Motivation figure: attention bias and qualitative comparison of reference strategies

Figure. (a) Share of context attention vs. cache position for various window sizes n: attention spikes at the first (anchor) and most recent positions, a pattern preserved across n. (b) Consequences across reference strategies: No Reference (Self-Forcing) allows change but remains dependent on the initial composition; Static Reference (Rolling-Forcing) freezes the scene; Adaptive Reference (AdaState) sustains motion and coherence.
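The quantity plotted in panel (a) can be approximated as follows. This is a minimal sketch, not the authors' measurement code: the function name `attention_share_by_position`, the `tokens_per_frame` grouping, and the synthetic logits are all assumptions made for illustration. It sums softmax attention weights over the tokens of each cached frame and averages over queries.

```python
import numpy as np

def attention_share_by_position(attn, tokens_per_frame):
    """Share of attention mass received by each cached frame position.

    attn: (num_queries, num_cache_tokens) softmax weights, each row sums to 1.
    tokens_per_frame: number of cache tokens per frame (hypothetical layout).
    Returns a vector of length num_frames that sums to 1.
    """
    attn = np.asarray(attn, dtype=np.float64)
    num_queries, num_tokens = attn.shape
    if num_tokens % tokens_per_frame != 0:
        raise ValueError("cache length must be a multiple of tokens_per_frame")
    num_frames = num_tokens // tokens_per_frame
    # Sum the attention mass inside each frame, then average over queries.
    per_frame = attn.reshape(num_queries, num_frames, tokens_per_frame).sum(axis=2)
    return per_frame.mean(axis=0)

# Synthetic example (assumed values): bias the logits toward the first and
# the most recent frame to mimic the pattern described above.
n, tokens_per_frame = 6, 4
logits = np.zeros((8, n * tokens_per_frame))
logits[:, :tokens_per_frame] += 2.0    # first (anchor) position
logits[:, -tokens_per_frame:] += 2.0   # most recent position
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

share = attention_share_by_position(weights, tokens_per_frame)
```

With real attention maps, the same reduction would be applied per layer and head before averaging; how the reported figure aggregates them is not stated here.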

Method

A self-evolving anchor, denoised in-place.

AdaState reserves a hidden state slot in the KV cache that the causal DiT denoises alongside content at every chunk. The state is updated in place within the cache as the rollout progresses.

AdaState framework: per-chunk causal DiT with adaptive state recurrence

Figure. AdaState reserves an adaptive state slot in the KV cache that the causal DiT denoises alongside video content tokens at every chunk. The state is never rendered: it propagates as a recurrence (green dashed) while the sliding window (blue dashed) carries recent content forward. Decoded states surface denoising errors but are never emitted as frames.
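The recurrence in the figure can be sketched as a toy rollout loop. This is an illustration under stated assumptions, not the AdaState implementation: `toy_denoise` is a stand-in for the causal DiT, the cache holds raw latents rather than keys and values, and `chunk_len`, `window`, and `dim` are arbitrary. What it is meant to show is the data flow: the state slot is denoised jointly with the content, written back for the next chunk, and never appended to the output.

```python
from collections import deque
import numpy as np

def toy_denoise(noisy, context):
    """Stand-in for the causal DiT (assumption): every noisy latent is pulled
    toward the mean of the cached context and the current chunk, a crude
    proxy for attending to both."""
    mix = np.concatenate([context, noisy], axis=0).mean(axis=0, keepdims=True)
    return 0.5 * noisy + 0.5 * mix

def rollout(num_chunks, chunk_len=3, window=2, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    state = np.zeros((1, dim))      # hidden adaptive state slot
    recent = deque(maxlen=window)   # sliding window of recent content chunks
    emitted, states = [], []
    for _ in range(num_chunks):
        # Context = previous state + recent content carried by the window.
        context = np.concatenate([state, *recent], axis=0)
        # One noisy slot for the state plus chunk_len noisy content latents.
        noisy = rng.normal(size=(1 + chunk_len, dim))
        clean = toy_denoise(noisy, context)
        state, content = clean[:1], clean[1:]
        recent.append(content)      # window carries recent content forward
        emitted.append(content)     # only content is emitted as frames
        states.append(state)        # state is kept for the next chunk only
    return np.concatenate(emitted, axis=0), states

frames, states = rollout(num_chunks=4)
```

Note that the loop body does not depend on the chunk index, which mirrors the claim that the state transition is identical at every chunk.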

1. Hidden adaptive state

A latent slot is reserved inside the KV cache. The model denoises it alongside content at every chunk but never emits it as a generated frame.

2. Continuous-time anchor

A static first-frame anchor freezes the reference at t₀, creating a growing discontinuity with the present scene. AdaState's anchor instead changes continuously across chunks, staying in step with where the rollout actually is.

3. Robustness under error accumulation

The decoded state surfaces and amplifies the model's denoising errors (as seen in zoomed patches), feeding them back as supervision on the effective position. AdaState learns to correct against them at training time.
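The abstract's claim that every generation step sees the same positional structure can be illustrated with position indices. The sketch below is one possible reading, not the scheme used by AdaState: both helper functions, the slot layout (state at index 0, then the visible content), and the parameter values are assumptions. It contrasts absolute indices, which grow with the rollout, with relative indices, which depend only on the slot.

```python
def absolute_position_ids(step, chunk_len, window):
    """Absolute indexing (assumed baseline): the anchor keeps id 0 while the
    visible content frames carry their global frame index, offset by 1."""
    start = max(0, step - window) * chunk_len
    stop = (step + 1) * chunk_len
    return [0] + [1 + i for i in range(start, stop)]

def relative_position_ids(step, chunk_len, window):
    """Relative indexing (assumed): slot 0 is the state, and the visible
    content is numbered by its slot in the window, not by elapsed time."""
    visible = (min(step, window) + 1) * chunk_len
    return list(range(visible + 1))

# Once the window is full, relative ids are identical at every step,
# while the gap between the anchor and the content grows under absolute ids.
early_rel = relative_position_ids(step=5, chunk_len=3, window=2)
late_rel = relative_position_ids(step=50, chunk_len=3, window=2)
early_abs = absolute_position_ids(step=5, chunk_len=3, window=2)
late_abs = absolute_position_ids(step=50, chunk_len=3, window=2)
```

Under the absolute scheme the distance between id 0 and the newest content is unbounded, which is one way to picture the growing discontinuity described in item 2.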

Generation Gallery

Headline rollouts.

Within-horizon rollouts run 5s (21 latent frames); beyond-horizon rollouts run 12s (51 latent frames).

"An extreme close-up shot of an ant emerging from its nest… As the camera pulls back, we see a picturesque neighborhood beyond the hill."

"A dramatic exploration scene in a dark, mysterious cave, where an intrepid explorer lumbers forward, flashlight beam casting shadows on ancient murals…"

"A vibrant manga-style illustration of a sheep dressed as a ninja, stealthily navigating through a barnyard obstacle course…"

"A cinematic scene in the style of a fantasy drama, depicting a person walking through a serene field filled with floating lanterns…"

"An adorable kangaroo in a green dress and sun hat taking a leisurely stroll in Johannesburg during a breathtaking sunset…"

"A realistic photo of a llama wearing colorful pajamas dancing energetically on a stage under vibrant disco lighting…"

"A movie trailer in a classic cinematic style, featuring the adventurous journey of a 30-year-old space man wearing a vibrant red wool knitted motorcycle helmet…"

"A dynamic and lively tour through an art gallery, showcasing a diverse array of beautiful works in various styles…"

"A cinematic 35mm film-style extreme close-up of a gray-haired man in his 60s, deeply engrossed in thought at a Parisian café…"

"An aerial view of Santorini during the blue hour, capturing white Cycladic buildings with blue domes against the twilight sky…"

"A highly detailed close-up shot focusing on dew droplets glistening on the delicate petals of a blue rose…"

"A stylish woman in a black leather jacket and red dress strolls down a neon-lit Tokyo street, the wet pavement reflecting vibrant signs…"

30-second comparisons

At 6× training horizon, baselines either freeze or collapse.

Prompt: "A dynamic FPV aerial view of a vast mountain range at sunrise, snow-capped peaks rising above a sea of soft low clouds…"

Self-Forcing: no anchor

LongLive: static anchor

Reward-Forcing: EMA anchor

AdaState (Ours): adaptive anchor

Prompt: "A sweeping aerial drone shot above dramatic coastal cliffs at golden hour, deep blue waves crashing against the rocks below and sending up white spray…"

Causal-Forcing: no anchor

Rolling-Forcing: static anchor

MemRoPE: EMA anchor

AdaState (Ours): adaptive anchor

Prompt: "A dynamic FPV aerial flight through a vibrant underwater coral city, where colorful corals line the streets and ancient stone ruins rise from the seabed…"