DeltaToken: 100× Token-Efficient Video Representation for World Models

Rethinking Video Representations

Video generation models have demonstrated remarkable potential for applications ranging from filmmaking and advertising to embodied AI world modeling. However, current systems suffer from several fundamental bottlenecks: prohibitively high training and inference costs, insufficiently accurate physics modeling, and limited generation length, typically constrained to short clips.

We argue that these limitations stem largely from inefficient and inappropriate video representations, particularly those based on conventional Video VAEs. Existing Video VAEs often require more than 10k tokens per second of high-quality video, producing extremely long token sequences for downstream transformer models, whose attention cost scales quadratically with sequence length. In addition, the spatio-temporal patch tokens produced by VAEs are treated as independent of one another, with no explicit modeling of the correlations between them, so they do not form a physics-related latent space.

In this paper, we introduce DeltaToken, a novel and highly efficient video representation that achieves comparable reconstruction quality while using up to 192× fewer tokens. DeltaToken is built upon a causal, streaming encoder that compactly encodes inter-frame delta information (motion) into one-dimensional token sequences, paired with a causal streaming decoder and a diffusion-based decoder to reconstruct high-fidelity videos. We believe this is key to next-generation world models, enabling much lower training cost, much faster real-time interactive inference, much longer context length and memory, and much better physics modeling.

Extensive experiments demonstrate that DeltaToken significantly improves both reconstruction efficiency and generation scalability, enabling longer video generation at substantially reduced computational cost. Our results suggest that rethinking video representations through delta-based tokenization is a promising direction for scalable video generation models.

Intuition and Components

Our approach is grounded in three key observations that address the current limitations of video tokenization.

Long-range Motion Correlation

Current Video-VAEs compress only a few neighboring frames, ignoring strong temporal correlations across longer distances. Drawing on 25+ years of video codec research, we reintegrate the concept of motion blocks — the most important component in DeltaToken for reducing token count.
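
The motion-block idea borrowed from video codecs can be illustrated with classic exhaustive block matching. The following is a minimal sketch of that codec technique, not DeltaToken's encoder: the function names, the 4×4 block size, the ±2 pixel search range, and the toy frames are all assumptions chosen for illustration.

```python
def sad(prev, curr, py, px, cy, cx, size):
    """Sum of absolute differences between a block of prev and a block of curr."""
    return sum(
        abs(prev[py + dy][px + dx] - curr[cy + dy][cx + dx])
        for dy in range(size)
        for dx in range(size)
    )


def estimate_motion(prev, curr, size=4, search=2):
    """Map each block origin (y, x) in curr to the offset (dy, dx) into prev
    of its best-matching block, found by exhaustive search."""
    height, width = len(curr), len(curr[0])
    vectors = {}
    for cy in range(0, height - size + 1, size):
        for cx in range(0, width - size + 1, size):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = cy + dy, cx + dx
                    if 0 <= py <= height - size and 0 <= px <= width - size:
                        cost = sad(prev, curr, py, px, cy, cx, size)
                        # Prefer lower cost; break ties with the smaller displacement.
                        key = (cost, abs(dy) + abs(dx))
                        if best is None or key < best[0]:
                            best = (key, (dy, dx))
            vectors[(cy, cx)] = best[1]
    return vectors


def make_frame(top, left, height=12, width=12, size=4):
    """Toy frame: black background with one textured size x size square."""
    frame = [[0] * width for _ in range(height)]
    for dy in range(size):
        for dx in range(size):
            frame[top + dy][left + dx] = 50 + 10 * dy + dx
    return frame


prev_frame = make_frame(top=4, left=3)
curr_frame = make_frame(top=4, left=4)  # the square moved one pixel to the right
motion = estimate_motion(prev_frame, curr_frame)
print(motion[(4, 4)])  # (0, -1): this block is found one pixel to the left in prev
```

In this sketch a 16-pixel block is summarized by a single two-number offset, which is the kind of redundancy removal that motion blocks provide in codecs.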

1D Fused Representations

TiTok showed that 1D tokenization has strong potential for visual tasks. 1D tokens enable better cross-sequence communication and can allocate capacity dynamically (fewer tokens for low-entropy regions, more for high-complexity areas), which lets them outperform 2D grid-based tokens.

Diffusion Loss for Training

Almost all VAEs/VQGANs use an adversarial loss to recover high-frequency details. In generative modeling, however, diffusion models capture probability distributions far better than GANs. We therefore replace the adversarial loss with a diffusion loss as the training objective, leveraging the generative prior as a powerful decoder.
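
To make the term concrete, here is a minimal sketch of a generic DDPM-style noise-prediction loss. The exact diffusion formulation used by DeltaToken is not specified here, so this objective, the function names, and the two toy denoisers are assumptions for illustration only. In a tokenizer, the denoiser would be conditioned on the encoder's tokens; the "oracle" below stands in for perfectly informative tokens.

```python
import math
import random


def diffusion_loss(x0, denoiser, alpha_bar, rng):
    """Noise-prediction loss for one sample at one noise level.

    x0        : clean target values (list of floats), e.g. pixels or latents
    denoiser  : callable (x_t, alpha_bar) -> predicted noise, same length as x0
    alpha_bar : signal fraction in (0, 1); smaller means noisier
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Forward process: blend the clean signal with Gaussian noise.
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    eps_hat = denoiser(x_t, alpha_bar)
    # Mean squared error between the true and the predicted noise.
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x0)


def oracle_denoiser(x0):
    """Toy stand-in for a decoder whose conditioning carries x0 perfectly."""
    def predict(x_t, alpha_bar):
        return [(xt - math.sqrt(alpha_bar) * x) / math.sqrt(1.0 - alpha_bar)
                for xt, x in zip(x_t, x0)]
    return predict


def blind_denoiser(x_t, alpha_bar):
    """Toy stand-in for a decoder that receives no useful conditioning."""
    return [0.0] * len(x_t)


rng = random.Random(0)
clean = [math.sin(i) for i in range(64)]
loss_oracle = diffusion_loss(clean, oracle_denoiser(clean), 0.5, rng)
loss_blind = diffusion_loss(clean, blind_denoiser, 0.5, rng)
print(loss_oracle < loss_blind)
```

The loss falls toward zero only when the conditioning lets the denoiser recover the clean signal, which is what would push an encoder trained under such an objective to produce informative tokens.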

Benefits for Entertainment World Model

A 100× more efficient tokenizer reshapes what is possible in commercial video generation.

Training Efficiency & Real-Time Interactive Inference

Beyond a certain sequence length, transformer compute scales quadratically with token count. A 100× more efficient tokenizer therefore yields well above 100× savings in both training and inference — reducing world-model pretraining to a scale comparable to today's post-training.
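
The "well above 100×" claim can be checked with a back-of-the-envelope FLOP count for one standard transformer layer. This is a rough sketch under stated assumptions: the 24·n·d² + 4·n²·d estimate applies to a vanilla layer with a 4×-wide MLP, and the hidden size and sequence lengths below are illustrative values, not figures from this work.

```python
def layer_flops(n_tokens, d_model):
    """Approximate forward FLOPs of one standard transformer layer.

    24*n*d^2 covers the QKV/output projections and a 4x-wide MLP (linear in n);
    4*n^2*d covers the attention score and value products (quadratic in n).
    """
    linear = 24 * n_tokens * d_model ** 2
    quadratic = 4 * n_tokens ** 2 * d_model
    return linear + quadratic


D_MODEL = 2048                  # assumed hidden size, not from the source
LONG, SHORT = 100_000, 1_000    # a 100x difference in sequence length

savings = layer_flops(LONG, D_MODEL) / layer_flops(SHORT, D_MODEL)
print(f"token ratio: {LONG // SHORT}x, per-layer FLOP ratio: {savings:.0f}x")
```

Under this estimate the quadratic term starts to dominate once the sequence length exceeds roughly 6× the hidden size, which is the regime in which shortening the sequence saves more than proportionally.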

Much Longer Context Length

A highly compact latent space enables dramatically longer context windows. With the same token budget, a 100× efficient tokenizer supports 100× longer durations, extending today's 5–15 s commercial APIs toward the 10-minute-scale content era.
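
As a rough worked example, the sketch below converts a fixed token budget into video duration, using the per-clip token counts quoted in the reconstruction comparison further down this page. It assumes token count grows linearly with duration, which is a simplification; the helper name is hypothetical.

```python
CLIP_SECONDS = 5
VAE_TOKENS_PER_CLIP = 187_200    # Wan VAE baseline figure quoted on this page
DELTA_TOKENS_PER_CLIP = 976      # DeltaToken figure quoted on this page


def seconds_within_budget(token_budget, tokens_per_clip, clip_seconds=CLIP_SECONDS):
    """Video duration that fits in a token budget, assuming linear scaling."""
    return token_budget * clip_seconds / tokens_per_clip


budget = VAE_TOKENS_PER_CLIP     # the context that one 5 s VAE clip occupies
vae_seconds = seconds_within_budget(budget, VAE_TOKENS_PER_CLIP)
delta_seconds = seconds_within_budget(budget, DELTA_TOKENS_PER_CLIP)
print(f"VAE: {vae_seconds:.0f} s, DeltaToken: {delta_seconds / 60:.1f} min")
```

With these numbers, the budget that holds a single 5 s VAE clip holds roughly 16 minutes of DeltaToken video.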

LLM-Native Video Generation

A 5 s clip takes more than 30–50k VAE tokens, dwarfing the 200–500 text tokens of its caption, and joint text-video training collapses under this imbalance. DeltaToken compresses a 5 s clip to ~1k visual tokens, bringing the visual token count to the same order of magnitude as the text and making LLM-native video generation practical.

Benefits for Embodied World Model

Compact, physics-aligned, and architecturally clean — built for the next generation of robotics.

Training Efficiency & On-device Inference

Robotics inference runs under strict power and cost budgets, where on-device GPU compute is severely limited. Existing VAE representations force compromises in resolution and model size. DeltaToken removes this tradeoff, allowing larger models without sacrificing resolution.

Better Alignment with Physical Modeling

Conventional VAE latent spaces are physically opaque — they bear no meaningful relationship to physical dynamics. Because DeltaToken explicitly models inter-frame deltas and motion, its latent space is naturally aligned with physical change, since physics in the visual domain is fundamentally about temporal differences.

Elegant World Modeling Architecture

Text, audio, and action are naturally 1D, while image/video VAE outputs are 2D/3D — making multimodal modeling awkward. DeltaToken unifies all modalities in a single 1D space. Video and action can even be concatenated along the channel dimension for joint generation, already aligned on the temporal axis.
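
The channel-wise concatenation can be sketched in a few lines. The token count and channel width are the figures reported on this page; the 7-dimensional action vector and the assumption of one action vector per video token (actions resampled to the token rate) are hypothetical choices for illustration.

```python
NUM_TOKENS = 976        # tokens per 5 s clip, from this page
VIDEO_CHANNELS = 16     # channels per token, from this page
ACTION_DIM = 7          # hypothetical action size (e.g. a 7-DoF arm command)

# Placeholder values; a real pipeline would use encoder outputs and logged actions.
video_tokens = [[0.0] * VIDEO_CHANNELS for _ in range(NUM_TOKENS)]
action_tokens = [[1.0] * ACTION_DIM for _ in range(NUM_TOKENS)]


def concat_channels(video, action):
    """Join two temporally aligned 1D sequences along the channel axis."""
    if len(video) != len(action):
        raise ValueError("sequences must have the same length")
    return [v + a for v, a in zip(video, action)]


joint = concat_channels(video_tokens, action_tokens)
print(len(joint), len(joint[0]))  # 976 23
```

The sequence length is unchanged, so a downstream model sees one joint token per time step and pays no extra attention cost for the added modality.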

Video Reconstruction Results

We evaluate DeltaToken in two complementary domains: content generation and embodied AI world modeling. The model is trained and tested on 5 s videos at 480 × 832 and 24 fps, and each video is compressed into 976 tokens with 16 channels. In each video below, the top is the original and the bottom is the reconstruction from our tokenizer.

Wan VAE (baseline): 187,200 tokens @ 16 channels (480/8 × 832/8 × 24/4 × 5)

Wan VAE (2×2 patch size): 46,800 tokens @ 64 channels (480/8/2 × 832/8/2 × 24/4 × 5)

DeltaToken (ours): 976 tokens @ 16 channels (1D causal streaming, delta-based)

~192× fewer tokens than the Wan VAE baseline, with comparable reconstruction quality and drastically shorter sequences for downstream transformers.
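
The token counts in this comparison follow directly from the downsampling factors shown. The sketch below reproduces the arithmetic exactly as written on this page (8× spatial stride, 4× temporal stride, optional 2×2 patchify); the helper name is hypothetical, and any extra frames a particular VAE implementation may add are ignored.

```python
def vae_token_count(height, width, fps, seconds,
                    spatial_stride=8, temporal_stride=4, patch=1):
    """Latent positions after spatial/temporal downsampling and optional patchify."""
    rows = height // spatial_stride // patch
    cols = width // spatial_stride // patch
    latent_frames = fps // temporal_stride * seconds
    return rows * cols * latent_frames


baseline = vae_token_count(480, 832, 24, 5)           # 60 * 104 * 30
patched = vae_token_count(480, 832, 24, 5, patch=2)   # 30 * 52 * 30
DELTATOKEN = 976

print(baseline, patched)                               # 187200 46800
print(f"{baseline / DELTATOKEN:.1f}x, {patched / DELTATOKEN:.1f}x")  # 191.8x, 48.0x
```

Note that the 2×2-patch variant reduces the token count fourfold but widens each token from 16 to 64 channels, so the ~192× figure is measured against the 16-channel baseline.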

Content Generation Domain

Film & TV-style footage — for media production and creative content.

DEMO 01

Embodied AI World Model

Human action videos, for embodied AI world modeling.

DEMO 02

Video Generation Results

Trained from scratch with roughly 1/1000 of the data and compute of state-of-the-art video generation models. This prototype is not aimed at producing top-quality generations; it is a controlled experiment to demonstrate that DeltaToken's latent space is fully learnable by a downstream generative model. The samples below are generated by our DeltaToken-based world model, each at 480 × 832, 24 fps, and 5 s.

Task: Text + Image → Video. I2V builds on a similar foundation to T2V.

Model: 1.6B DiT parameters, self-attention-only architecture.

Training data: 3.2M video clips, each around 5 s long.

Compute: 32 B200 GPU-days (≈ $3–4k rental cost).

Note 1: These preliminary results come from a self-funded research project, so data and compute are very limited (although the data pipeline and training infrastructure are designed for large-scale experiments). We believe that scaling to 8× the model size, 64× the data, and 32× the compute would yield a top-tier, competitive foundation model. This would take around $0.6M and 4 weeks, and would still use 15× less data and 30× less compute than state-of-the-art models. If you are interested in collaborating, feel free to reach out to qiangzhang0123@gmail.com.

Note 2 (disclaimer): All tokenizer development, data processing, model implementation, model training, experimentation, and presentation were done by Qiang from scratch while not employed by any company; all legal IP belongs to Qiang himself.