Video generation models have demonstrated remarkable potential for applications ranging from filmmaking and advertising to embodied AI world modeling. However, current systems suffer from several fundamental bottlenecks: prohibitively high training and inference costs, insufficiently accurate physics modeling, and limited generation length, typically restricted to short clips.
We argue that these limitations largely stem from inefficient and ill-suited video representations, particularly those based on conventional Video VAEs. Existing Video VAEs often require more than 10k tokens per second of high-quality video, producing extremely long token sequences for downstream transformer models, whose computational cost scales quadratically with sequence length. In addition, the spatio-temporal patch tokens produced by VAEs are treated as independent of one another, with no modeled correlations, and therefore do not form a physics-related latent space.
In this paper, we introduce DeltaToken, a novel and highly efficient video representation that achieves comparable reconstruction quality while using up to 192× fewer tokens. DeltaToken is built upon a causal, streaming encoder that compactly encodes inter-frame delta information (motion) into one-dimensional token sequences, paired with a causal streaming decoder and a diffusion-based decoder to reconstruct high-fidelity videos. We believe this is key to next-generation world models, enabling much lower training cost, much faster real-time interactive inference, much longer context and memory, and much better physics modeling.
Extensive experiments demonstrate that DeltaToken significantly improves both reconstruction efficiency and generation scalability, enabling longer video generation at substantially reduced computational cost. Our results suggest that rethinking video representations through delta-based tokenization is a promising direction for scalable video generation models.
Our approach is grounded in three key observations that challenge current practice in video tokenization.
Current Video VAEs compress only a few neighboring frames, ignoring strong temporal correlations over longer time spans. Drawing on 25+ years of video codec research, we reintroduce the concept of motion blocks, the most important component in DeltaToken for reducing token count.
The success of TiTok shows that 1D tokenization has strong potential for visual tasks. 1D tokens enable better communication across the sequence and allocate capacity dynamically (fewer tokens for low-entropy regions, more for high-complexity areas), outperforming 2D grid-based tokens.
Almost all VAEs and VQGANs use an adversarial loss to recover high-frequency details. In generative modeling, however, diffusion models capture probability distributions far better than GANs. We replace the adversarial loss with a diffusion loss as the training objective, leveraging the generative prior as a powerful decoder.
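To make the motion-block idea from the first observation concrete, here is a minimal sketch of classic codec-style block matching. It is not the DeltaToken encoder. The helper names (`make_frame`, `sad`, `best_motion_vector`), the toy 8 × 8 frames, and the search radius are all assumptions chosen for illustration. The point is only that a moving block can be described by one small displacement instead of its pixels.

```python
def make_frame(size, top, left, block=2, value=255):
    """Toy grayscale frame: black background with one bright square (hypothetical data)."""
    frame = [[0] * size for _ in range(size)]
    for y in range(top, top + block):
        for x in range(left, left + block):
            frame[y][x] = value
    return frame


def sad(prev, curr, by, bx, dy, dx, size):
    """Sum of absolute differences between a block in curr and a shifted block in prev."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(curr[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total


def best_motion_vector(prev, curr, by, bx, size, search):
    """Exhaustive search for the (dy, dx) shift into prev that best predicts the curr block."""
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip shifts that would read outside the previous frame.
            if not (0 <= by + dy and by + dy + size <= height):
                continue
            if not (0 <= bx + dx and bx + dx + size <= width):
                continue
            cost = sad(prev, curr, by, bx, dy, dx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# The bright square sits at (row 2, col 2) in the previous frame
# and at (row 3, col 4) in the current frame.
prev_frame = make_frame(8, top=2, left=2)
curr_frame = make_frame(8, top=3, left=4)

cost, dy, dx = best_motion_vector(prev_frame, curr_frame, by=3, bx=4, size=2, search=2)
print(cost, dy, dx)  # the block is predicted exactly by looking 1 row up and 2 columns left
```

In this toy case the block is fully described by the pair `(dy, dx)` with zero residual, which is the kind of redundancy a delta-based representation can exploit.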
A 100× more efficient tokenizer reshapes what is possible in commercial video generation.
Beyond a certain sequence length, transformer compute scales quadratically with token count. A 100× more efficient tokenizer therefore yields well above 100× savings in both training and inference — reducing world-model pretraining to a scale comparable to today's post-training.
A highly compact latent space enables dramatically longer context windows. With the same token budget, a 100× more efficient tokenizer supports 100× longer durations, extending today's 5–15 s commercial APIs toward content on the scale of 10 minutes.
A 5 s clip takes 30–50k VAE tokens, dwarfing the 200–500 text tokens of its caption, and joint text-video training collapses under this imbalance. DeltaToken compresses a 5 s clip to ~1k visual tokens, bringing visual and text token counts into balance for LLM-native video generation.
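The compute argument above can be checked with a rough back-of-the-envelope sketch. It uses the common per-layer forward estimate of 24·n·d² FLOPs for the linear layers plus 4·n²·d for attention. The model width `D_MODEL` is a hypothetical value, and the baseline sequence length is simply taken as 100× the 976 tokens quoted for a 5 s clip. Neither is a measured configuration.

```python
def layer_flops(n_tokens, d_model):
    """Approximate forward FLOPs of one transformer layer (common textbook estimate)."""
    linear = 24 * n_tokens * d_model ** 2     # QKV/output projections plus a 4x-wide MLP
    attention = 4 * n_tokens ** 2 * d_model   # QK^T scores and the attention-weighted sum
    return linear + attention


D_MODEL = 2048                           # hypothetical model width, not from the source
COMPACT_TOKENS = 976                     # tokens per 5 s clip, from the text
BASELINE_TOKENS = 100 * COMPACT_TOKENS   # illustrative sequence that is 100x longer

# Attention overtakes the linear layers once 4*n^2*d > 24*n*d^2, i.e. n > 6*d.
crossover_tokens = 6 * D_MODEL

savings = layer_flops(BASELINE_TOKENS, D_MODEL) / layer_flops(COMPACT_TOKENS, D_MODEL)
print(f"attention dominates beyond {crossover_tokens} tokens")
print(f"per-layer compute savings: {savings:.0f}x")
```

Under these assumptions the savings land between 100× (purely linear scaling) and 10,000× (purely quadratic), at roughly 830×. The exact figure depends on the model width and on how far the baseline sequence sits beyond the crossover point.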
Compact, physics-aligned, and architecturally clean — built for the next generation of robotics.
Robotics inference runs under strict power and cost budgets, where on-device GPU compute is severely limited. Existing VAE representations force compromises in resolution and model size. DeltaToken removes this tradeoff, allowing larger models without sacrificing resolution.
Conventional VAE latent spaces are physically opaque — they bear no meaningful relationship to physical dynamics. Because DeltaToken explicitly models inter-frame deltas and motion, its latent space is naturally aligned with physical change, since physics in the visual domain is fundamentally about temporal differences.
Text, audio, and action are naturally 1D, while image/video VAE outputs are 2D/3D — making multimodal modeling awkward. DeltaToken unifies all modalities in a single 1D space. Video and action can even be concatenated along the channel dimension for joint generation, already aligned on the temporal axis.
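As a minimal sketch of the channel-wise concatenation described above, the snippet below joins video tokens and action vectors position by position. The 976 × 16 shape comes from the text. The action width `ACTION_CHANNELS`, the placeholder zero values, and the assumption of exactly one action vector per token position are hypothetical.

```python
TOKENS, VIDEO_CHANNELS = 976, 16   # per 5 s clip, from the text
ACTION_CHANNELS = 7                # hypothetical action dimensionality

# Placeholder data: one row per token position along the shared temporal axis.
video_tokens = [[0.0] * VIDEO_CHANNELS for _ in range(TOKENS)]
action_tokens = [[0.0] * ACTION_CHANNELS for _ in range(TOKENS)]

# Concatenate along the channel dimension, position by position.
joint_tokens = [video + action for video, action in zip(video_tokens, action_tokens)]

print(len(joint_tokens), len(joint_tokens[0]))  # sequence length, joint channel width
```

The sequence length is unchanged and only the channel width grows, so a single 1D sequence model could consume both modalities, assuming they are aligned as described.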
We evaluate DeltaToken in two complementary domains: content generation and embodied AI world modeling. The model is trained and tested on 5 s videos at 480 × 832 resolution and 24 fps, and each video is compressed into 976 tokens with 16 channels each. In each video below, the top is the original and the bottom is the reconstruction from our tokenizer.
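The evaluation setting implies the following arithmetic, sketched here for reference. The resolution, frame rate, duration, token count, and channel count come from the text. The three color channels and the baseline VAE strides (8× spatial, 4× temporal, no extra patchification) are assumptions, since the source does not say which baseline its token-reduction figure refers to.

```python
HEIGHT, WIDTH = 480, 832       # from the text; orientation does not change the counts
FPS, SECONDS = 24, 5           # from the text
TOKENS, CHANNELS = 976, 16     # from the text

frames = FPS * SECONDS
raw_values = HEIGHT * WIDTH * 3 * frames      # RGB samples; 3 color channels assumed
latent_values = TOKENS * CHANNELS
value_compression = raw_values / latent_values
tokens_per_second = TOKENS / SECONDS

# Hypothetical baseline Video VAE: 8x spatial and 4x temporal downsampling.
SPATIAL_STRIDE, TEMPORAL_STRIDE = 8, 4
baseline_tokens = (
    (HEIGHT // SPATIAL_STRIDE) * (WIDTH // SPATIAL_STRIDE) * (frames // TEMPORAL_STRIDE)
)
token_reduction = baseline_tokens / TOKENS

print(f"{frames} frames, {raw_values:,} raw values -> {latent_values:,} latent values")
print(f"value compression: {value_compression:.0f}x, tokens per second: {tokens_per_second:.1f}")
print(f"assumed baseline: {baseline_tokens:,} tokens, reduction: {token_reduction:.1f}x")
```

Under the assumed strides the token reduction comes out just under 192×. This is close to the "up to 192×" figure quoted earlier, but the match is only suggestive because the baseline is a guess.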
Film & TV-style footage — for media production and creative content.
Human action videos — for embodied AI world modeling.
@article{zhang2026deltatoken,
title = {DeltaToken: 100x Token-Efficient Video Representation for World Models},
author = {Zhang, Qiang},
journal = {Preprint},
year = {2026}
}