Image & Video Generation Models

Training Data Compilation, Filtering Pipelines & Training Recipes

Overview Comparison

| Model | Type | Architecture | Params | Data Scale | Text Encoder | Loss | Open? |
|---|---|---|---|---|---|---|---|
| Sora (Turbo/2) | Video | DiT (spacetime patches) | Undisclosed | Undisclosed | DALL-E 3 recaptioner | Diffusion (likely flow matching) | No |
| Imagen 1 | Image | Cascaded pixel-space U-Net | 3B total | ~860M pairs | Frozen T5-XXL | Noise prediction | No |
| Imagen 3 | Image | Latent Diffusion Transformer | Undisclosed | "Billions" | T5-XXL (likely) | Undisclosed | No |
| FLUX.1 | Image | MMDiT (rectified flow) | 12B | Undisclosed | CLIP-L + T5-XXL | Rectified flow (velocity) | Partial |
| Flux Kontext | Image Edit | MMDiT + 3D RoPE | 12B | Millions of pairs | CLIP-L + T5-XXL | Flow matching + LADD | Partial |
| Seedream 2.0 | Image | MMDiT | 3.9B | ~250M pairs | Bilingual LLM + Glyph-ByT5 | Score matching | No |
| Seedream 3.0 | Image | MMDiT | ~3.9B | ~300M (expanded) | Bilingual LLM + Glyph-ByT5 | Flow matching | No |
| Qwen-Image | Image | MMDiT (20B) | 20B | 5.6B pairs | Qwen2.5-VL-7B | Flow matching | Yes |
| Wan 2.1 | Video | DiT + 3D Causal VAE | 14B / 1.3B | ~1.5B vid + 10B img | UMT5-XXL (cross-attn) | Flow matching | Yes |
| MiniMax Allegro | Video | DiT | Undisclosed | 107M img + 66M vid | Undisclosed | Diffusion | Yes |
| Hailuo 02 | Video | DiT + NCR | 3× Hailuo 01 | 4× Hailuo 01 | Undisclosed | Undisclosed | No |

1. OpenAI Sora

Architecture

Diffusion Transformer (DiT) operating on spacetime patches of video latent codes. A VAE compresses raw video temporally and spatially; compressed tokens are arranged as a grid of spacetime patches fed into a transformer with self-attention across both time and space. Images are treated as single-frame videos for joint training.
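
The spacetime-patch arrangement can be sketched with plain array reshapes. All sizes below are illustrative only: Sora's actual VAE compression ratio and patch dimensions are undisclosed.

```python
import numpy as np

# Hypothetical sizes: Sora does not publish its VAE or patch dimensions.
T, H, W, C = 8, 32, 32, 16        # latent video: frames, height, width, channels
pt, ph, pw = 2, 4, 4              # spacetime patch size (time, height, width)

def to_spacetime_patches(latent):
    """Flatten a latent video into a sequence of spacetime-patch tokens."""
    t, h, w, c = latent.shape
    x = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)    # group the patch axes together
    return x.reshape(-1, pt * ph * pw * c)  # one token per spacetime patch

tokens = to_spacetime_patches(np.random.randn(T, H, W, C))
print(tokens.shape)  # (256, 512): 4*8*8 patches, 2*4*4*16 features each
```

An image is the degenerate case of a single-frame video (with a temporal patch size of 1), which is what makes joint image+video training natural in this design.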

Data Pipeline

```mermaid
flowchart LR
    A["Raw Data Sources\n• Shutterstock (licensed)\n• Pond5 stock footage\n• Public datasets\n• Custom/commissioned"] --> B["Pre-Training Filters\n• CSAM classifier\n• Explicit/violent removal\n• Quality/consistency checks\n• Human review (ambiguous)"]
    B --> C["Recaptioning\n(DALL-E 3 technique)\nHighly descriptive\ncaptions for all data"]
    C --> D["Training Data\nImages + Videos\n(variable res/duration/AR)"]
```

Training Recipe

```mermaid
flowchart TB
    S1["Stage 1: Image Pretraining\nSpatial understanding"] --> S2["Stage 2: Low-Res Video\nTemporal dynamics"]
    S2 --> S3["Stage 3: Progressive\nResolution Scaling"]
    S3 --> S4["Stage 4: Joint Image+Video\nNative aspect ratios"]
    S4 --> S5["Stage 5: Safety Alignment\nClassifiers + fine-tuning"]
```
Key insight: Quality emerges from scale — 3D consistency, object permanence, and basic physics simulation appear without explicit inductive biases. The DALL-E 3 recaptioning technique (replacing weak alt-text with rich descriptions) is critical for prompt-following fidelity.

Sora Evolution

| Version | Date | Capabilities |
|---|---|---|
| Sora Preview | Feb 2024 | Up to 60s, text-to-video only, no audio |
| Sora Turbo | Dec 2024 | 1080p, 20s, T2V/I2V/V2V, storyboard tool |
| Sora 2 | 2025 | Native audio, multi-shot consistency (LCT), improved physics |
Sources: Sora Technical Report · System Card · Sora 2

2. Google Imagen

Architecture Evolution

```mermaid
flowchart LR
    subgraph Imagen1["Imagen 1 (2022)"]
        direction TB
        I1A["T5-XXL\n(frozen, 4.6B)"] --> I1B["Base U-Net\n64×64 · 2B params"]
        I1B --> I1C["SR Stage 1\n256×256 · 600M"]
        I1C --> I1D["SR Stage 2\n1024×1024 · 400M"]
    end
    subgraph Imagen3["Imagen 3 (2024)"]
        direction TB
        I3A["T5-XXL + Gemini\nSynthetic Captions"] --> I3B["Latent Diffusion\nTransformer"]
        I3B --> I3C["Upsampling\n2x / 4x / 8x"]
    end
    Imagen1 -->|"evolution"| Imagen3
```

Data Pipeline (Imagen 3)

```mermaid
flowchart LR
    A["Billions of\nimage-text pairs\n(web + licensed)"] --> B["Quality Filter"]
    B --> C["Safety Filter\n(violent, NSFW)"]
    C --> D["AI-Generated\nContent Removal"]
    D --> E["Deduplication +\nDown-weighting"]
    E --> F["Caption Safety\n(PII removal)"]
    F --> G["Dual Captioning\n• Original caption\n• Gemini synthetic caption"]
    G --> H["Training Set"]
```

Training Details (Imagen 1 — most documented)

| Parameter | Value |
|---|---|
| Batch size | 2048 |
| Training steps | 2.5M (all 3 stages) |
| Optimizer | Adafactor (base) / Adam (SR) |
| Hardware | 256 TPU-v4 (base), 128 TPU-v4 (each SR) |
| CFG dropout | 10% |
| CFG weight | 1.35 (base), 8.0 (SR) |
| FID (COCO zero-shot) | 7.27 |
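
The two CFG rows translate directly into code. A minimal sketch of classifier-free guidance as Imagen describes it: caption dropout at train time, guided extrapolation at sampling time (the arrays and epsilon values here are toy stand-ins for real noise predictions).

```python
import numpy as np

rng = np.random.default_rng(0)

def cfg_drop_caption(caption, p_drop=0.10):
    """Train-time caption dropout: 10% of captions become the null
    condition so the model also learns an unconditional predictor."""
    return "" if rng.random() < p_drop else caption

def cfg_predict(eps_cond, eps_uncond, w=1.35):
    """Sampling-time guidance: extrapolate past the conditional prediction.
    w=1.35 matches the table's base-model setting; the SR stages use w=8."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# With w=1, guidance reduces to the plain conditional prediction.
print(cfg_predict(np.array([1.0]), np.array([0.2]), w=1.0))  # [1.]
```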

Key Techniques (Imagen 1)

Key Innovation (Imagen 2)

Key insight: Scaling the text encoder (T5-XXL) was far more impactful than scaling the U-Net. Imagen 2's aesthetic score conditioning (not just filtering) was influential. Imagen 3's removal of AI-generated images from training data is a distinctive curation step.
Sources: Imagen Paper (2205.11487) · Imagen 3 Paper (2408.07009) · Imagen 2

3. FLUX.1 & Flux Kontext (Black Forest Labs)

Architecture

```mermaid
flowchart TB
    subgraph TextEnc["Text Encoders"]
        TE1["CLIP-L\n(pooled embedding → modulation)"]
        TE2["T5-XXL\n(dense token embeddings)"]
    end
    subgraph Backbone["12B MMDiT Backbone"]
        DS["19 Double-Stream Blocks\n(separate text/image weights)"]
        SS["38 Single-Stream Blocks\n(shared weights, parallel attn+MLP)"]
        DS --> SS
    end
    VAE["16-channel VAE\n(custom, adversarial training)"]
    TextEnc --> Backbone
    VAE -->|"encode"| Backbone
    Backbone -->|"decode"| VAE
```
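
The double-stream idea can be illustrated in a few lines: each modality keeps its own projection weights, but attention runs once over the concatenated sequence. This is a toy single-head numpy sketch, not FLUX's actual implementation (which adds RoPE, modulation, MLPs, and multi-head attention).

```python
import numpy as np

d = 64                               # toy model width; FLUX is far larger
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def double_stream_attention(txt, img, W_txt, W_img):
    """Separate QKV projections per modality, one joint attention over the
    concatenated text+image sequence, then split back into two streams."""
    q_t, k_t, v_t = (txt @ W for W in W_txt)
    q_i, k_i, v_i = (img @ W for W in W_img)
    q = np.concatenate([q_t, q_i])
    k = np.concatenate([k_t, k_i])
    v = np.concatenate([v_t, v_i])
    out = softmax(q @ k.T / np.sqrt(d)) @ v
    return out[:len(txt)], out[len(txt):]

txt = rng.standard_normal((8, d))    # 8 text tokens
img = rng.standard_normal((64, d))   # 64 image-latent tokens
W_txt = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
W_img = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
t_out, i_out = double_stream_attention(txt, img, W_txt, W_img)
print(t_out.shape, i_out.shape)  # (8, 64) (64, 64)
```

The later single-stream blocks drop the per-modality weights and process the concatenated sequence with one shared set, which is cheaper once the two modalities are already mixed.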

Data Pipeline

```mermaid
flowchart LR
    A["Large-scale\nimage-text data"] --> B["NSFW Removal"]
    B --> C["Aesthetic Filtering\n(score ≥ 6.5)"]
    C --> D["Perceptual\nDeduplication\n(cluster-based)"]
    D --> E["Synthetic Recaptioning\n(CogVLM)\n50/50 original:synthetic"]
    E --> F["Precompute Embeddings\n(frozen encoders)"]
    F --> G["Training Set"]
```

Training Recipe

```mermaid
flowchart TB
    S1["Pre-train at 256×256\nbatch 4096, 500k steps"] --> S2["High-res fine-tune\n(QK-Norm for stability)"]
    S2 --> S3["Resolution-dependent\ntimestep shifting"]
    S3 --> S4["Guidance Distillation\n(teacher → student)\n→ FLUX.1 dev"]
    S3 --> S5["Speed Distillation\n→ FLUX.1 schnell"]
```
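
Resolution-dependent timestep shifting follows the rectified-flow paper's formula: with shift factor s = sqrt(m/n) for m target-resolution tokens and n base-resolution tokens, t is remapped as s·t / (1 + (s−1)·t), so the same schedule position corresponds to more noise at higher resolutions. A sketch:

```python
import math

def shift_timestep(t, base_tokens=256 * 256, target_tokens=1024 * 1024):
    """Remap a timestep for a higher resolution: s = sqrt(m/n), then
    t' = s*t / (1 + (s-1)*t), pushing t toward the noisier end."""
    s = math.sqrt(target_tokens / base_tokens)
    return s * t / (1 + (s - 1) * t)

# At 1024px vs a 256px base, s = 4: the mid-schedule t = 0.5 maps to 0.8.
print(shift_timestep(0.5))  # 0.8
```

The intuition: larger images have more redundant pixels, so at a given t less information is destroyed; shifting restores a comparable signal-to-noise ratio across resolutions.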

Key Techniques

Flux Kontext (Image Editing)

| Variant | License | Notes |
|---|---|---|
| FLUX.1 [pro] | Proprietary (API) | Full model, best quality |
| FLUX.1 [dev] | Non-commercial | Guidance-distilled, 12B |
| FLUX.1 [schnell] | Apache 2.0 | Speed-optimized, <2s generation |
| Kontext [pro/dev/max] | Mixed | Image editing variants |
Sources: Rectified Flow Transformers (2403.03206) · Flux Kontext (2506.15742) · HF Model Card

4. ByteDance Seedream

Architecture

MMDiT (3.9B params in v2.0) with a custom bilingual LLM text encoder + Glyph-Aligned ByT5 for character-level text rendering. Uses 2D RoPE for image tokens and Cross-Modality RoPE (v3.0+) for text tokens.
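
The 2D RoPE scheme for image tokens can be sketched in numpy: rotate half the channels by the token's row index and the other half by its column index. The frequency schedule and exact channel split below are assumptions for illustration, not Seedream's published configuration.

```python
import numpy as np

def rope_1d(x, pos, theta=10000.0):
    """Standard 1D RoPE: rotate consecutive channel pairs by pos * freq."""
    d = x.shape[-1]
    freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D RoPE: rotate half the channels by row index, half by column index."""
    d = x.shape[-1]
    return np.concatenate(
        [rope_1d(x[..., : d // 2], rows), rope_1d(x[..., d // 2:], cols)],
        axis=-1,
    )

H = W = 4
d = 8
tokens = np.random.randn(H * W, d)
rows = np.repeat(np.arange(H), W).astype(float)  # row index of each token
cols = np.tile(np.arange(W), H).astype(float)    # column index of each token
out = rope_2d(tokens, rows, cols)
print(out.shape)  # (16, 8)
```

Because RoPE encodes positions as rotations rather than learned vectors, the same weights generalize to grid sizes unseen in training, which is why it recurs across these models.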

Data Pipeline

```mermaid
flowchart LR
    A["~250M image-text pairs\n70% Chinese / 30% English"] --> B["Multi-Stage Cleaning\n• Watermark removal\n• Artifact filtering\n• Aesthetic scoring"]
    B --> C["Distribution Balancing\n• Downsample over-represented\n• Hierarchical clustering"]
    C --> D["Active Learning\n• Find challenging examples\n• Iterative refinement"]
    D --> E["Two-Tier Captioning\n• Generic (short+long)\n• Multi-perspective rich captions\n• Chinese + English"]
    E --> F["Training Set"]
```

Defect-Aware Training (Seedream 3.0)

```mermaid
flowchart LR
    A["Previously Excluded\nImages (with defects)"] --> B["Defect Detector\n(trained on 15K\nannotated samples)"]
    B --> C{"Defect area\n< 20%?"}
    C -->|Yes| D["Retain image +\nSpatial attention mask\n(exclude defect from loss)"]
    C -->|No| E["Discard"]
    D --> F["+21.7% more\ntraining data"]
```
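
The masking idea reduces to a loss that zeroes out defective pixels. This is a simplified pixel-space stand-in for Seedream's latent-space attention mask; all names, shapes, and values are illustrative.

```python
import numpy as np

def masked_diffusion_loss(pred, target, defect_mask):
    """Mean squared error over clean pixels only: defective regions stay
    in the image but contribute nothing to the gradient."""
    keep = 1.0 - defect_mask                  # 1 = clean pixel, 0 = defect
    sq_err = (pred - target) ** 2
    return (sq_err * keep).sum() / keep.sum()

pred = np.zeros((8, 8))                       # toy model output
target = np.ones((8, 8))
target[0, :5] = 100.0                         # watermark corrupts 5/64 pixels (~8%)
mask = np.zeros((8, 8))
mask[0, :5] = 1.0                             # defect area < 20% -> image retained
loss = masked_diffusion_loss(pred, target, mask)
print(loss)  # 1.0 -- the corrupted pixels are ignored
```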

Training Stages

```mermaid
flowchart TB
    S1["Pre-train: 256×256\n(various aspect ratios)"] --> S2["Fine-tune: 512→2048px\n(Seedream 3.0)\nor 512→4096px (4.0)"]
    S2 --> S3["Continuing Training (CT)\nDiversified aesthetic captions"]
    S3 --> S4["Supervised Fine-Tuning\n(styles, text rendering, aesthetics)"]
    S4 --> S5["RLHF\nVLM reward model (>20B params)\nMultiple iterations"]
    S5 --> S6["Prompt Engineering (PE)\nFinal alignment"]
```
Key insight: The defect-aware training paradigm is unique — using spatial masks to exclude watermarked regions from loss computation lets the model learn from imperfect data without absorbing artifacts. The bilingual LLM text encoder enables native Chinese cultural knowledge. RLHF with a >20B VLM reward model is one of the largest reported for image generation.

Version Comparison

| Aspect | Seedream 2.0 | Seedream 3.0 | Seedream 4.0 |
|---|---|---|---|
| Loss | Score matching | Flow matching | Adaptive flow matching |
| Data | ~250M pairs | ~300M (+21.7% defect-aware) | Billions |
| Max resolution | Not specified | 2048×2048 | 4096×4096 |
| Post-training | SFT + RLHF | CT + SFT + RLHF + PE | CT + SFT + RLHF (joint T2I+editing) |
| Inference speed | Not reported | 3s (1K) | 1.8s (2K) |
Sources: Seedream 2.0 (2503.07703) · Seedream 3.0 (2504.11346) · Seedream 4.0 (2509.20427)

5. Qwen-Image & Wan (Alibaba)

Qwen-Image (Text-to-Image, 20B)

Data Pipeline

```mermaid
flowchart LR
    A["5.6B image-text pairs\n55% nature, 27% design\n13% people, 5% synthetic text"] --> B["Stage 1\nResolution/corruption\nDedup, NSFW"]
    B --> C["Stage 2\nImage Enhancement\nBlur/brightness/texture"]
    C --> D["Stage 3\nCaption Alignment\nRaw + recaption + fused"]
    D --> E["Stage 4\nSynthetic Text Data\n• Pure rendering\n• Compositional\n• Complex layouts"]
    E --> F["Stage 5\nHigh-res refinement\n640p+ quality/aesthetic filter"]
    F --> G["Training Set"]
```

Architecture

Wan 2.1 (Video Generation, 14B)

Data Pipeline

```mermaid
flowchart LR
    A["~1.5B videos\n+ ~10B images"] --> B["Step 1: Fundamental\n(removes ~50%)\n• OCR text coverage\n• LAION aesthetic score\n• NSFW, watermark, blur\n• Duration/resolution"]
    B --> C["Step 2: Visual Quality\n100 clusters →\nbalanced sampling →\nmanual scoring →\nexpert assessment model"]
    C --> D["Step 3: Motion Quality\n6 tiers: optimal, medium,\nstatic, camera-driven,\nlow-quality, shaky"]
    D --> E["Step 4:\nDeduplication"]
    E --> F["Dense Captioning\n(Qwen2-VL)\nBilingual CN/EN"]
    F --> G["Training Set"]
```

Architecture

Key insight: Wan's 4-step filtering pipeline (fundamental → visual quality → motion quality → dedup) is one of the most detailed public descriptions of video data curation. Using a VLM (Qwen2.5-VL) as the text encoder for image generation is a novel departure from CLIP/T5. The 6-tier motion quality classification for videos is uniquely granular.
Sources: Qwen-Image (2508.02324) · Wan (2503.20314) · GitHub

6. MiniMax / Hailuo (Allegro)

Allegro Data Pipeline (most documented)

```mermaid
flowchart LR
    A["412M raw images\n+ video corpus"] --> B["1. Duration/Resolution\n≥360p, ≥2s, ≥23fps"]
    B --> C["2. Scene Segmentation\nSingle-scene clips\nTrim first/last 10 frames"]
    C --> D["3. Low-Level Metrics\n• DOVER (brightness/clarity)\n• LPIPS (consistency)\n• UniMatch (motion)"]
    D --> E["4. Aesthetics\nLAION Aesthetics\nPredictor"]
    E --> F["5. Artifact Removal\nCRAFT text detection\nWatermark detection"]
    F --> G["6. Coarse Captioning"]
    G --> H["7. CLIP Similarity\nFilter"]
    H --> I["107M images\n48M vid@360p\n18M vid@720p\n2M HQ vid (fine-tune)"]
```

Training Recipe (Allegro)

```mermaid
flowchart TB
    S1["Stage 1: T2I Pre-training\n107M image-text pairs\nVisual fundamentals"] --> S2["Stage 2: T2V Pre-training\n48M@360p + 18M@720p\nTemporal consistency"]
    S2 --> S3["Stage 3: T2V Fine-tuning\n2M high-quality clips\n6-16s medium-to-long\nmoderate motion"]
```

Hailuo 02 (Commercial)

VTP (Visual Tokenizer Pre-training)

Key insight: MiniMax's VTP work showed that giving lower weight to pixel reconstruction loss (vs. contrastive + self-supervised losses) actually improved downstream generation quality. Allegro's 7-step video filtering pipeline is one of the most detailed publicly available.
Sources: Allegro (2410.15458) · VTP (2512.13687) · Hailuo 02

Common Patterns & Takeaways

Universal Data Curation Pipeline

```mermaid
flowchart TB
    A["Raw Web-Scale Data\n(millions to billions)"] --> B["Basic Filters\n• Resolution thresholds\n• Duration/FPS (video)\n• Corruption checks"]
    B --> C["Safety Filters\n• NSFW classifiers\n• CSAM detection\n• Violence removal"]
    C --> D["Quality Assessment\n• Aesthetic scoring\n• Blur/exposure detection\n• Motion quality (video)"]
    D --> E["Deduplication\n• Perceptual hashing\n• Cluster-based\n• Semantic dedup"]
    E --> F["Recaptioning\n• VLM-generated captions\n• Multi-perspective\n• Bilingual (some)"]
    F --> G["Distribution Balancing\n• Downsample majority\n• Preserve long-tail\n• Active learning"]
    G --> H["Curated Training Set\n(typically 10-25% of raw data)"]
```
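
As a rough illustration, the whole funnel fits in one filter function. Every threshold, field name, and score below is a hypothetical placeholder; real pipelines run each stage as a separate distributed job with dedicated scoring models.

```python
# Each branch mirrors one stage of the curation funnel. All thresholds and
# field names are hypothetical placeholders, not any lab's actual values.

def curate(samples):
    seen, kept = set(), []
    for s in samples:
        if s["width"] < 512 or s["height"] < 512:          # basic filters
            continue
        if s["nsfw_score"] > 0.2:                          # safety filters
            continue
        if s["aesthetic_score"] < 5.0:                     # quality assessment
            continue
        if s["phash"] in seen:                             # deduplication
            continue
        seen.add(s["phash"])
        # Recaptioning: prefer the VLM caption over weak alt-text.
        s["caption"] = s.get("vlm_caption") or s["alt_text"]
        kept.append(s)
    return kept

raw = [
    {"width": 1024, "height": 768, "nsfw_score": 0.01, "aesthetic_score": 6.1,
     "phash": "a1", "alt_text": "img_123.jpg",
     "vlm_caption": "A red bicycle leaning against a brick wall."},
    {"width": 300, "height": 300, "nsfw_score": 0.0, "aesthetic_score": 7.0,
     "phash": "b2", "alt_text": "low-res image"},
]
kept = curate(raw)
print(len(kept))  # 1 -- the 300px image fails the resolution filter
```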

Common Training Recipe Pattern

```mermaid
flowchart TB
    S1["Phase 1: Low-Resolution Pre-training\n256×256 or 512×512\nLarge batch, many steps"] --> S2["Phase 2: Progressive Resolution\nGradually increase to\n1024/2048/4096"]
    S2 --> S3["Phase 3: Quality Fine-tuning\nHigh-quality subset\nAesthetic optimization"]
    S3 --> S4["Phase 4: Alignment\n• RLHF (Seedream)\n• Guidance distillation (Flux)\n• Safety fine-tuning (Sora)"]
```

Key Trends

| Trend | Details |
|---|---|
| Architecture convergence | Nearly all models have converged on DiT/MMDiT backbones with flow matching objectives. U-Nets are gone. |
| Recaptioning is essential | Every model uses VLM-generated synthetic captions (DALL-E 3, CogVLM, Gemini, Qwen2-VL). This is the single highest-impact data processing step. |
| Text encoder = LLM | Evolution from CLIP → T5-XXL → full LLMs (Qwen2.5-VL, bilingual LLMs). Richer language understanding directly improves generation. |
| Progressive resolution | All models train low-res first, then scale up. Timestep shifting compensates for more pixels needing more noise. |
| Flow matching replaces DDPM | Rectified flow / flow matching is now standard. Velocity prediction with logit-normal timestep sampling. |
| Data quality > data quantity | Aggressive filtering (often removing 50-75% of raw data). Seedream's defect-aware masking shows creative approaches to reclaiming borderline data. |
| Post-training matters | RLHF, guidance distillation, and adversarial distillation are increasingly important for final quality and speed. |
| RoPE for positions | 2D/3D RoPE has replaced learned positional embeddings for better resolution generalization. |
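
The flow-matching objective these trends refer to is compact enough to sketch end to end: linear interpolation between data and noise, a constant-velocity regression target, and logit-normal timestep sampling as in the rectified-flow paper. Shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_batch(x0, x1):
    """Build rectified-flow training targets: x_t interpolates linearly
    between data x0 and noise x1, and the network's regression target is
    the constant velocity v = x1 - x0."""
    # Logit-normal timestep sampling: concentrates t around the mid-schedule.
    t = 1.0 / (1.0 + np.exp(-rng.standard_normal(len(x0))))
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, t, v_target

x0 = rng.standard_normal((4, 8))   # "clean" latents
x1 = rng.standard_normal((4, 8))   # Gaussian noise
x_t, t, v = flow_matching_batch(x0, x1)
print(x_t.shape, v.shape)  # (4, 8) (4, 8)
```

Training then minimizes the MSE between the network's predicted velocity at (x_t, t) and v_target; sampling integrates the learned velocity field from noise back to data.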

What Remains Proprietary

The most guarded details across all models are: exact dataset composition, compute budgets, model parameter counts (for closed models), and specific filtering thresholds. Open models (Flux dev, Wan, Qwen-Image) provide the most architectural detail but still withhold training data specifics.