Image & Video Generation Models
Training Data Compilation, Filtering Pipelines & Training Recipes
Overview Comparison
| Model | Type | Architecture | Params | Data Scale | Text Encoder | Loss | Open? |
|---|---|---|---|---|---|---|---|
| Sora (Turbo/2) | Video | DiT (spacetime patches) | Undisclosed | Undisclosed | DALL-E 3 recaptioner | Diffusion (likely flow matching) | No |
| Imagen 1 | Image | Cascaded pixel-space U-Net | 3B total | ~860M pairs | Frozen T5-XXL | Noise prediction | No |
| Imagen 3 | Image | Latent Diffusion Transformer | Undisclosed | "Billions" | T5-XXL (likely) | Undisclosed | No |
| FLUX.1 | Image | MMDiT (rectified flow) | 12B | Undisclosed | CLIP-L + T5-XXL | Rectified flow (velocity) | Partial |
| Flux Kontext | Image Edit | MMDiT + 3D RoPE | 12B | Millions of pairs | CLIP-L + T5-XXL | Flow matching + LADD | Partial |
| Seedream 2.0 | Image | MMDiT | 3.9B | ~250M pairs | Bilingual LLM + Glyph-ByT5 | Score matching | No |
| Seedream 3.0 | Image | MMDiT | ~3.9B | ~300M (expanded) | Bilingual LLM + Glyph-ByT5 | Flow matching | No |
| Qwen-Image | Image | MMDiT (20B) | 20B | 5.6B pairs | Qwen2.5-VL-7B | Flow matching | Yes |
| Wan 2.1 | Video | DiT + 3D Causal VAE | 14B / 1.3B | ~1.5B vid + 10B img | UMT5-XXL (cross-attn) | Flow matching | Yes |
| MiniMax Allegro | Video | DiT | Undisclosed | 107M img + 66M vid | Undisclosed | Diffusion | Yes |
| Hailuo 02 | Video | DiT + NCR | 3x Hailuo 01 | 4x Hailuo 01 | Undisclosed | Undisclosed | No |
1. OpenAI Sora
Architecture
Diffusion Transformer (DiT) operating on spacetime patches of video latent codes. A VAE compresses raw video temporally and spatially; compressed tokens are arranged as a grid of spacetime patches fed into a transformer with self-attention across both time and space. Images are treated as single-frame videos for joint training.
Data Pipeline
```mermaid
flowchart LR
A["Raw Data Sources\n• Shutterstock (licensed)\n• Pond5 stock footage\n• Public datasets\n• Custom/commissioned"] --> B["Pre-Training Filters\n• CSAM classifier\n• Explicit/violent removal\n• Quality/consistency checks\n• Human review (ambiguous)"]
B --> C["Recaptioning\n(DALL-E 3 technique)\nHighly descriptive\ncaptions for all data"]
C --> D["Training Data\nImages + Videos\n(variable res/duration/AR)"]
```
Training Recipe
```mermaid
flowchart TB
S1["Stage 1: Image Pretraining\nSpatial understanding"] --> S2["Stage 2: Low-Res Video\nTemporal dynamics"]
S2 --> S3["Stage 3: Progressive\nResolution Scaling"]
S3 --> S4["Stage 4: Joint Image+Video\nNative aspect ratios"]
S4 --> S5["Stage 5: Safety Alignment\nClassifiers + fine-tuning"]
```
Key insight: Quality emerges from scale — 3D consistency, object permanence, and basic physics simulation appear without explicit inductive biases. The DALL-E 3 recaptioning technique (replacing weak alt-text with rich descriptions) is critical for prompt-following fidelity.
Sora Evolution
| Version | Date | Capabilities |
|---|---|---|
| Sora Preview | Feb 2024 | Up to 60s, text-to-video only, no audio |
| Sora Turbo | Dec 2024 | 1080p, 20s, T2V/I2V/V2V, storyboard tool |
| Sora 2 | 2025 | Native audio, multi-shot consistency (LCT), improved physics |
2. Google Imagen
Architecture Evolution
```mermaid
flowchart LR
subgraph Imagen1["Imagen 1 (2022)"]
direction TB
I1A["T5-XXL\n(frozen, 4.6B)"] --> I1B["Base U-Net\n64×64 · 2B params"]
I1B --> I1C["SR Stage 1\n256×256 · 600M"]
I1C --> I1D["SR Stage 2\n1024×1024 · 400M"]
end
subgraph Imagen3["Imagen 3 (2024)"]
direction TB
I3A["T5-XXL + Gemini\nSynthetic Captions"] --> I3B["Latent Diffusion\nTransformer"]
I3B --> I3C["Upsampling\n2x / 4x / 8x"]
end
Imagen1 -->|"evolution"| Imagen3
```
Data Pipeline (Imagen 3)
```mermaid
flowchart LR
A["Billions of\nimage-text pairs\n(web + licensed)"] --> B["Quality Filter"]
B --> C["Safety Filter\n(violent, NSFW)"]
C --> D["AI-Generated\nContent Removal"]
D --> E["Deduplication +\nDown-weighting"]
E --> F["Caption Safety\n(PII removal)"]
F --> G["Dual Captioning\n• Original caption\n• Gemini synthetic caption"]
G --> H["Training Set"]
```
Training Details (Imagen 1 — most documented)
| Parameter | Value |
|---|---|
| Batch size | 2048 |
| Training steps | 2.5M (all 3 stages) |
| Optimizer | Adafactor (base) / Adam (SR) |
| Hardware | 256 TPU-v4 (base), 128 TPU-v4 (each SR) |
| CFG dropout | 10% |
| CFG weight | 1.35 (base), 8.0 (SR) |
| FID (COCO zero-shot) | 7.27 |
Key Techniques (Imagen 1)
- Dynamic Thresholding: At each sampling step, compute the p-th percentile of absolute pixel values as threshold s. If s > 1, clip to [−s, s] and divide by s. This prevents pixel saturation at high guidance weights and enables much larger CFG scales than static thresholding.
- Noise Conditioning Augmentation: Super-resolution models are conditioned on noisy low-res inputs (Gaussian noise with variance-preserving schedule). The augmentation level is randomly sampled during training and swept at inference. Critical for preventing train/test mismatch in cascaded pipelines.
- Efficient U-Net: Parameters shifted from high-res to low-res blocks, skip connections scaled by 1/√2, reversed up/downsampling order → 2-3× faster.
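Dynamic thresholding reduces to a percentile clip and rescale. A minimal NumPy sketch of the rule described above (the default p = 99.5 is a commonly cited setting; treat it as illustrative):

```python
import numpy as np

def dynamic_threshold(x0_pred: np.ndarray, p: float = 99.5) -> np.ndarray:
    """Clip a predicted image to the p-th percentile of its absolute
    pixel values, then rescale back into [-1, 1] (a sketch of Imagen's
    dynamic thresholding; applied at every sampling step)."""
    s = float(np.percentile(np.abs(x0_pred), p))
    s = max(s, 1.0)  # only rescale when pixels actually saturate
    return np.clip(x0_pred, -s, s) / s
```

Because s tracks the current sample rather than a fixed constant, large guidance weights no longer push pixels into flat saturated regions, which is what makes CFG scales like 8.0 in the SR stages usable.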
Key Innovation (Imagen 2)
- Aesthetic Score Conditioning: A dedicated aesthetics model (trained on human preferences for lighting, framing, exposure, sharpness) scores each image. This score is used as a conditioning signal during training — not just for filtering — giving more weight to human-preferred qualities.
Key insight: Scaling the text encoder (T5-XXL) was far more impactful than scaling the U-Net. Imagen 2's aesthetic score conditioning (not just filtering) was influential. Imagen 3's removal of AI-generated images from training data is a distinctive curation step.
3. FLUX.1 & Flux Kontext (Black Forest Labs)
Architecture
```mermaid
flowchart TB
subgraph TextEnc["Text Encoders"]
TE1["CLIP-L\n(pooled embedding → modulation)"]
TE2["T5-XXL\n(dense token embeddings)"]
end
subgraph Backbone["12B MMDiT Backbone"]
DS["19 Double-Stream Blocks\n(separate text/image weights)"]
SS["38 Single-Stream Blocks\n(shared weights, parallel attn+MLP)"]
DS --> SS
end
VAE["16-channel VAE\n(custom, adversarial training)"]
TextEnc --> Backbone
VAE -->|"encode"| Backbone
Backbone -->|"decode"| VAE
```
Data Pipeline
```mermaid
flowchart LR
A["Large-scale\nimage-text data"] --> B["NSFW Removal"]
B --> C["Aesthetic Filtering\n(score ≥ 6.5)"]
C --> D["Perceptual\nDeduplication\n(cluster-based)"]
D --> E["Synthetic Recaptioning\n(CogVLM)\n50/50 original:synthetic"]
E --> F["Precompute Embeddings\n(frozen encoders)"]
F --> G["Training Set"]
```
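The 50/50 original-to-synthetic caption mix in the recaptioning stage amounts to a per-sample coin flip at training time. A sketch (function name and signature are ours):

```python
import random

def pick_caption(original: str, synthetic: str,
                 p_synthetic: float = 0.5, rng=random) -> str:
    """Choose between the original alt-text and a VLM-written synthetic
    caption for one training sample. Keeping a share of original captions
    preserves the noisier phrasing users actually type at inference."""
    return synthetic if rng.random() < p_synthetic else original
```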
Training Recipe
```mermaid
flowchart TB
S1["Pre-train at 256×256\nbatch 4096, 500k steps"] --> S2["High-res fine-tune\n(QK-Norm for stability)"]
S2 --> S3["Resolution-dependent\ntimestep shifting"]
S3 --> S4["Guidance Distillation\n(teacher → student)\n→ FLUX.1 dev"]
S3 --> S5["Speed Distillation\n→ FLUX.1 schnell"]
```
Key Techniques
- Rectified Flow: Velocity prediction objective (not noise prediction). Logit-normal timestep sampling (m=0, s=1) outperforms uniform.
- QK-Normalization: RMSNorm on Q/K prevents attention logit explosion at high resolutions, enabling bf16 training.
- Timestep Shifting: t_m = (√(m/n) · t_n) / (1 + (√(m/n)−1) · t_n) — more pixels need proportionally more noise.
- RoPE: Rotary positional embeddings generalize to unseen sequence lengths (vs. learned absolute encodings).
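The timestep-shift formula above translates directly to code. A sketch in which `m` and `n` are interpreted as pixel (or token) counts at the target and reference resolutions:

```python
import math

def shift_timestep(t: float, m: int, n: int) -> float:
    """Map a timestep calibrated for n pixels to the equivalent timestep
    for m pixels: t_m = (sqrt(m/n) * t_n) / (1 + (sqrt(m/n) - 1) * t_n).
    For m > n the schedule bends toward higher noise, while t = 0 and
    t = 1 stay fixed."""
    a = math.sqrt(m / n)
    return (a * t) / (1 + (a - 1) * t)
```

At m = n the map is the identity, so the shift only takes effect during high-resolution fine-tuning.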
Flux Kontext (Image Editing)
- Fine-tuned from FLUX.1 checkpoint on millions of relational (input, output, instruction) pairs.
- Context + target tokens concatenated sequentially (not channel-wise — channel-wise performed worse).
- 3D RoPE: Factorized (t, h, w) coordinates — context images indexed by i, supporting multi-reference.
- Two-stage: flow matching → adversarial distillation (LADD) for fewer sampling steps.
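How sequential concatenation with factorized indices stays unambiguous can be illustrated with explicit (t, h, w) coordinates. This is a sketch under an assumed layout (target image at index 0, context image i at index i + 1), not Kontext's exact indexing:

```python
def kontext_position_ids(h: int, w: int, n_context: int):
    """Build factorized (t, h, w) RoPE indices for one target image plus
    n_context reference images. Giving each context image its own first
    coordinate keeps tokens distinguishable after concatenation."""
    def grid(t: int):
        return [(t, y, x) for y in range(h) for x in range(w)]
    ids = grid(0)                  # target tokens at t = 0 (assumed)
    for i in range(n_context):
        ids += grid(i + 1)         # context image i offset to t = i + 1
    return ids
```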
| Variant | License | Notes |
|---|---|---|
| FLUX.1 [pro] | Proprietary (API) | Full model, best quality |
| FLUX.1 [dev] | Non-commercial | Guidance-distilled, 12B |
| FLUX.1 [schnell] | Apache 2.0 | Speed-optimized, <2s generation |
| Kontext [pro/dev/max] | Mixed | Image editing variants |
4. ByteDance Seedream
Architecture
MMDiT (3.9B params in v2.0) with a custom bilingual LLM text encoder + Glyph-Aligned ByT5 for character-level text rendering. Uses 2D RoPE for image tokens and Cross-Modality RoPE (v3.0+) for text tokens.
Data Pipeline
```mermaid
flowchart LR
A["~250M image-text pairs\n70% Chinese / 30% English"] --> B["Multi-Stage Cleaning\n• Watermark removal\n• Artifact filtering\n• Aesthetic scoring"]
B --> C["Distribution Balancing\n• Downsample over-represented\n• Hierarchical clustering"]
C --> D["Active Learning\n• Find challenging examples\n• Iterative refinement"]
D --> E["Two-Tier Captioning\n• Generic (short+long)\n• Multi-perspective rich captions\n• Chinese + English"]
E --> F["Training Set"]
```
Defect-Aware Training (Seedream 3.0)
```mermaid
flowchart LR
A["Previously Excluded\nImages (with defects)"] --> B["Defect Detector\n(trained on 15K\nannotated samples)"]
B --> C{"Defect area\n< 20%?"}
C -->|Yes| D["Retain image +\nSpatial attention mask\n(exclude defect from loss)"]
C -->|No| E["Discard"]
D --> F["+21.7% more\ntraining data"]
```
Training Stages
```mermaid
flowchart TB
S1["Pre-train: 256×256\n(various aspect ratios)"] --> S2["Fine-tune: 512→2048px\n(Seedream 3.0)\nor 512→4096px (4.0)"]
S2 --> S3["Continuing Training (CT)\nDiversified aesthetic captions"]
S3 --> S4["Supervised Fine-Tuning\n(styles, text rendering, aesthetics)"]
S4 --> S5["RLHF\nVLM reward model (>20B params)\nMultiple iterations"]
S5 --> S6["Prompt Engineering (PE)\nFinal alignment"]
```
Key insight: The defect-aware training paradigm is unique — using spatial masks to exclude watermarked regions from loss computation lets the model learn from imperfect data without absorbing artifacts. The bilingual LLM text encoder enables native Chinese cultural knowledge. RLHF with a >20B VLM reward model is one of the largest reported for image generation.
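The spatial-mask mechanism amounts to excluding detector-flagged pixels from the regression loss. A minimal NumPy sketch (the 20% discard threshold comes from the pipeline above; all names and the MSE form are our assumptions):

```python
import numpy as np

def masked_diffusion_loss(pred, target, defect_mask, max_defect_frac=0.2):
    """Average squared error over clean pixels only. defect_mask is 1.0
    where a watermark/artifact detector fired; images whose flagged area
    reaches the threshold are discarded entirely (returns None)."""
    if defect_mask.mean() >= max_defect_frac:
        return None                 # too much defect: drop the sample
    keep = 1.0 - defect_mask
    return float(((pred - target) ** 2 * keep).sum() / keep.sum())
```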
Version Comparison
| Aspect | Seedream 2.0 | Seedream 3.0 | Seedream 4.0 |
|---|---|---|---|
| Loss | Score matching | Flow matching | Adaptive flow matching |
| Data | ~250M pairs | ~300M (+21.7% defect-aware) | Billions |
| Max resolution | Not specified | 2048×2048 | 4096×4096 |
| Post-training | SFT + RLHF | CT + SFT + RLHF + PE | CT + SFT + RLHF (joint T2I+editing) |
| Inference speed | — | 3s (1K) | 1.8s (2K) |
5. Qwen-Image & Wan (Alibaba)
Qwen-Image (Text-to-Image, 20B)
Data Pipeline
```mermaid
flowchart LR
A["5.6B image-text pairs\n55% nature, 27% design\n13% people, 5% synthetic text"] --> B["Stage 1\nResolution/corruption\nDedup, NSFW"]
B --> C["Stage 2\nImage Enhancement\nBlur/brightness/texture"]
C --> D["Stage 3\nCaption Alignment\nRaw + recaption + fused"]
D --> E["Stage 4\nSynthetic Text Data\n• Pure rendering\n• Compositional\n• Complex layouts"]
E --> F["Stage 5\nHigh-res refinement\n640p+ quality/aesthetic filter"]
F --> G["Training Set"]
```
Architecture
- 20B param MMDiT with double-stream (vision patches + text tokens)
- Text encoder: Frozen Qwen2.5-VL-7B-Instruct (a multimodal LLM — not CLIP/T5)
- Positional encoding: Multimodal Scalable RoPE (MSRoPE)
- Loss: Flow matching with exponential timestep shift
- Curriculum: Non-text images → simple text → paragraph/slide-level text compositions
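The flow-matching objective in the list above can be sketched as a single training step (the interpolation convention, names, and uniform t are our assumptions; the exponential timestep shift is omitted for brevity):

```python
import numpy as np

def flow_matching_step(model, x, rng):
    """One flow-matching training step on a single sample x: draw noise
    eps and a timestep t, form x_t = (1 - t) * x + t * eps on the straight
    path, and regress the model onto the constant velocity eps - x."""
    eps = rng.standard_normal(x.shape)
    t = rng.uniform()
    x_t = (1 - t) * x + t * eps
    v_target = eps - x
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```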
Wan 2.1 (Video Generation, 14B)
Data Pipeline
```mermaid
flowchart LR
A["~1.5B videos\n+ ~10B images"] --> B["Step 1: Fundamental\n(removes ~50%)\n• OCR text coverage\n• LAION aesthetic score\n• NSFW, watermark, blur\n• Duration/resolution"]
B --> C["Step 2: Visual Quality\n100 clusters →\nbalanced sampling →\nmanual scoring →\nexpert assessment model"]
C --> D["Step 3: Motion Quality\n6 tiers: optimal, medium,\nstatic, camera-driven,\nlow-quality, shaky"]
D --> E["Step 4:\nDeduplication"]
E --> F["Dense Captioning\n(Qwen2-VL)\nBilingual CN/EN"]
F --> G["Training Set"]
```
Architecture
- DiT with flow matching, UMT5-XXL text encoder via cross-attention
- Wan-VAE: 3D causal VAE (temporal 4×, spatial 8× compression), unlimited-length 1080p encoding
- Resolution-progressive curriculum with joint image+video training
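The stated compression ratios determine the latent grid directly. A small sketch (the causal first-frame handling, 1 + (T − 1)/4 latent frames, is our assumption based on common 3D causal VAE designs):

```python
def wan_vae_latent_shape(frames: int, height: int, width: int):
    """Latent (t, h, w) grid implied by 4x temporal and 8x spatial
    compression, with the first frame encoded on its own so that a single
    image (frames = 1) maps to one latent frame."""
    assert (frames - 1) % 4 == 0 and height % 8 == 0 and width % 8 == 0
    return (1 + (frames - 1) // 4, height // 8, width // 8)
```

Under this sketch, images fall out as the frames = 1 case, which is what makes joint image+video training a natural fit for the same backbone.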
Key insight: Wan's 4-step filtering pipeline (fundamental → visual quality → motion quality → dedup) is one of the most detailed public descriptions of video data curation. Using a VLM (Qwen2.5-VL) as the text encoder for image generation is a novel departure from CLIP/T5. The 6-tier motion quality classification for videos is uniquely granular.
6. MiniMax / Hailuo (Allegro)
Allegro Data Pipeline (most documented)
```mermaid
flowchart LR
A["412M raw images\n+ video corpus"] --> B["1. Duration/Resolution\n≥360p, ≥2s, ≥23fps"]
B --> C["2. Scene Segmentation\nSingle-scene clips\nTrim first/last 10 frames"]
C --> D["3. Low-Level Metrics\n• DOVER (brightness/clarity)\n• LPIPS (consistency)\n• UniMatch (motion)"]
D --> E["4. Aesthetics\nLAION Aesthetics\nPredictor"]
E --> F["5. Artifact Removal\nCRAFT text detection\nWatermark detection"]
F --> G["6. Coarse Captioning"]
G --> H["7. CLIP Similarity\nFilter"]
H --> I["107M images\n48M vid@360p\n18M vid@720p\n2M HQ vid (fine-tune)"]
```
Training Recipe (Allegro)
```mermaid
flowchart TB
S1["Stage 1: T2I Pre-training\n107M image-text pairs\nVisual fundamentals"] --> S2["Stage 2: T2V Pre-training\n48M@360p + 18M@720p\nTemporal consistency"]
S2 --> S3["Stage 3: T2V Fine-tuning\n2M high-quality clips\n6-16s medium-to-long\nmoderate motion"]
```
Hailuo 02 (Commercial)
- Architecture: DiT + Noise-aware Compute Redistribution (NCR) — dynamically allocates more compute to high-error timesteps
- 3× more parameters and 4× more data than Hailuo 01
- 2.5× training/inference efficiency improvement via NCR
- Native 1080p, up to 10s, 24-30fps
- Incorporates user feedback data from Hailuo 01
VTP (Visual Tokenizer Pre-training)
Key insight: MiniMax's VTP work showed that giving lower weight to pixel reconstruction loss (vs. contrastive + self-supervised losses) actually improved downstream generation quality. Allegro's 7-step video filtering pipeline is one of the most detailed publicly available.
Common Patterns & Takeaways
Universal Data Curation Pipeline
```mermaid
flowchart TB
A["Raw Web-Scale Data\n(millions to billions)"] --> B["Basic Filters\n• Resolution thresholds\n• Duration/FPS (video)\n• Corruption checks"]
B --> C["Safety Filters\n• NSFW classifiers\n• CSAM detection\n• Violence removal"]
C --> D["Quality Assessment\n• Aesthetic scoring\n• Blur/exposure detection\n• Motion quality (video)"]
D --> E["Deduplication\n• Perceptual hashing\n• Cluster-based\n• Semantic dedup"]
E --> F["Recaptioning\n• VLM-generated captions\n• Multi-perspective\n• Bilingual (some)"]
F --> G["Distribution Balancing\n• Downsample majority\n• Preserve long-tail\n• Active learning"]
G --> H["Curated Training Set\n(typically 10-25% of raw data)"]
```
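The filter stages of this pipeline can be sketched as a predicate chain with per-stage survivor counts (sample schema and thresholds are illustrative, echoing those reported for Allegro; recaptioning and balancing rewrite samples rather than drop them, so they are omitted):

```python
def curate(samples, stages):
    """Apply (name, predicate) stages in order, recording how many
    samples survive each stage."""
    kept, report = list(samples), {}
    for name, keep in stages:
        kept = [s for s in kept if keep(s)]
        report[name] = len(kept)
    return kept, report

# Hypothetical clip records; thresholds mirror the basic/safety/quality split.
clips = [
    {"h": 720,  "secs": 8.0,  "nsfw": False, "aesthetic": 5.9},
    {"h": 240,  "secs": 8.0,  "nsfw": False, "aesthetic": 6.1},  # too low-res
    {"h": 720,  "secs": 1.0,  "nsfw": False, "aesthetic": 6.1},  # too short
    {"h": 1080, "secs": 12.0, "nsfw": True,  "aesthetic": 7.0},  # unsafe
]
stages = [
    ("basic",   lambda c: c["h"] >= 360 and c["secs"] >= 2.0),
    ("safety",  lambda c: not c["nsfw"]),
    ("quality", lambda c: c["aesthetic"] >= 5.0),
]
kept, report = curate(clips, stages)
```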
Common Training Recipe Pattern
```mermaid
flowchart TB
S1["Phase 1: Low-Resolution Pre-training\n256×256 or 512×512\nLarge batch, many steps"] --> S2["Phase 2: Progressive Resolution\nGradually increase to\n1024/2048/4096"]
S2 --> S3["Phase 3: Quality Fine-tuning\nHigh-quality subset\nAesthetic optimization"]
S3 --> S4["Phase 4: Alignment\n• RLHF (Seedream)\n• Guidance distillation (Flux)\n• Safety fine-tuning (Sora)"]
```
Key Trends
| Trend | Details |
|---|---|
| Architecture convergence | Nearly all models have converged on DiT/MMDiT backbones with flow matching objectives. U-Nets are gone. |
| Recaptioning is essential | Every model uses VLM-generated synthetic captions (DALL-E 3, CogVLM, Gemini, Qwen2-VL). This is the single highest-impact data processing step. |
| Text encoder = LLM | Evolution from CLIP → T5-XXL → full LLMs (Qwen2.5-VL, bilingual LLMs). Richer language understanding directly improves generation. |
| Progressive resolution | All models train low-res first, then scale up. Timestep shifting compensates for more pixels needing more noise. |
| Flow matching replaces DDPM | Rectified flow / flow matching is now standard. Velocity prediction with logit-normal timestep sampling. |
| Data quality > data quantity | Aggressive filtering (often removing 50-75% of raw data). Seedream's defect-aware masking shows creative approaches to reclaiming borderline data. |
| Post-training matters | RLHF, guidance distillation, and adversarial distillation are increasingly important for final quality and speed. |
| RoPE for positions | 2D/3D RoPE has replaced learned positional embeddings for better resolution generalization. |
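The logit-normal timestep sampler named in the table is two lines. A sketch with `m` and `s` matching the FLUX settings listed earlier:

```python
import math
import random

def logit_normal_t(m: float = 0.0, s: float = 1.0, rng=random) -> float:
    """Draw u ~ N(m, s) and squash it through a sigmoid, yielding a
    timestep in (0, 1) concentrated around the middle of the schedule,
    where intermediate noise levels make the velocity target hardest."""
    u = rng.gauss(m, s)
    return 1.0 / (1.0 + math.exp(-u))
```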
What Remains Proprietary
The most guarded details across all models are: exact dataset composition, compute budgets, model parameter counts (for closed models), and specific filtering thresholds. Open models (Flux dev, Wan, Qwen-Image) provide the most architectural detail but still withhold training data specifics.