AdaState:
Self-Evolving Anchors for
Streaming Video Generation

Replace the frozen first-frame anchor with an adaptive state that the model denoises alongside each chunk — unlocking streaming video generation with richer dynamics without sacrificing coherence.

VT
Virginia Tech
Dept. of Computer Science
AdaState rollout — long-horizon streaming generation with continuous camera motion and evolving scene state.
TL;DR
Streaming video models anchor on the first frame, suppressing dynamics and freezing the scene. AdaState instead denoises a hidden anchor alongside each chunk — so the reference evolves with the content. Result: longer rollouts, real camera motion, no extra modules.
Abstract
Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an \emph{adaptive state}, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.
Motivation

Attention focuses on the first frame and the recent content, accumulating over time.

Causal video diffusion models concentrate attention at the first and most recent cache positions. The first position acts as a privileged static anchor that locks the scene's composition; the bias persists across expanding window size n.

Motivation figure: attention bias and qualitative comparison of reference strategies

Figure. (a) Share of context attention vs. cache position for various window sizes n: attention spikes at the first (anchor) and most recent positions — a pattern preserved across n. (b) Consequences across reference strategies: No Reference (Self-Forcing) enables change while staying dependent on the initial composition, Static Reference (Rolling-Forcing) freezes the scene, while Adaptive Reference (AdaState) sustains motion and coherence.

Method

A self-evolving anchor, denoised in-place.

AdaState reserves a hidden state slot in the KV cache that the causal DiT denoises alongside content at every chunk. The state gets updated within the cache as the rollout progresses.

AdaState framework: per-chunk causal DiT with adaptive state recurrence

Figure. AdaState reserves an adaptive state slot in the KV cache that the causal DiT denoises alongside video content tokens at every chunk. The state is never rendered: it propagates as a recurrence (green dashed) while the sliding window (blue dashed) carries recent content forward. Decoded states surface denoising errors but are never emitted as frames.

1

Hidden adaptive state

A latent slot is reserved inside the KV cache. The model denoises it alongside content at every chunk but never emits it as a generated frame.

2

Continuous-time anchor

A static first-frame anchor freezes the reference at t0 — a growing discontinuity with the present scene. AdaState's anchor instead changes continuously across chunks, staying in step with where the rollout actually is.

3

Robustness under error accumulation

The decoded state surfaces and amplifies the model's denoising errors (as seen in zoomed patches), feeding them back as supervision on the effective position. AdaState learns to correct against them at training time.

30-second comparisons

At 6× training horizon, baselines either freeze or collapse.

Hover any video to pause · click to lock the pause

Prompt"A dynamic FPV aerial view of a vast mountain range at sunrise, snow-capped peaks rising above a sea of soft low clouds…"

Self-Forcingno anchor
LongLivestatic anchor
Reward-ForcingEMA anchor
AdaState (Ours)adaptive anchor

Prompt"A sweeping aerial drone shot above dramatic coastal cliffs at golden hour, deep blue waves crashing against the rocks below and sending up white spray…"

Causal-Forcingno anchor
Rolling-Forcingstatic anchor
MemRoPEEMA anchor
AdaState (Ours)adaptive anchor

Prompt"A dynamic FPV aerial flight through a vibrant underwater coral city, where colorful corals line the streets and ancient stone ruins rise from the seabed…"

CausVidno anchor
Infinity-RoPEstatic anchor
Rolling-Sinkheuristic anchor
AdaState (Ours)adaptive anchor
Ablations

Two knobs, one design.

State size & cache window, and the horizon weight α that re-balances within- vs. beyond-horizon supervision.

Hover any video to pause · click to lock the pause

Prompt"An astronaut running through a narrow alley in Rio de Janeiro… helmet reflects sunlight…"

Fs=3, Wp+F=9more state, wider window
Fs=1, Wp+F=9less state, wider window
Fs=1, Wp+F=6AdaState (final)

Prompt"An old man in blue jeans and a white T-shirt takes a stroll in Mumbai during a winter storm…"

Fs=3, Wp+F=9more state, wider window
Fs=1, Wp+F=9less state, wider window
Fs=1, Wp+F=6AdaState (final)

Prompt (5s)"A handheld shot following a young child running through a field of tall grass…"

No weightingα = 0
α = 2best at 5s
α = 4long-horizon model

Prompt (30s)"A dynamic FPV aerial flight above Niagara Falls, the camera racing toward the thundering edge…"

No weightingα = 0
α = 2within-horizon model
α = 4best at 30s