Image & Video Generation Models
Training Data Compilation, Filtering Pipelines & Training Recipes
Overview Comparison
| Model | Type | Architecture | Params | Data Scale | Text Encoder | Loss | Open? |
|---|---|---|---|---|---|---|---|
| Sora (Turbo/2) | Video | DiT (spacetime patches) | Undisclosed | Undisclosed | DALL-E 3 recaptioner | Diffusion (likely flow matching) | No |
| Imagen 1 | Image | Cascaded pixel-space U-Net | 3B total | ~860M pairs | Frozen T5-XXL | Noise prediction | No |
| Imagen 3 | Image | Latent Diffusion Transformer | Undisclosed | "Billions" | T5-XXL (likely) | Undisclosed | No |
| FLUX.1 | Image | MMDiT (rectified flow) | 12B | Undisclosed | CLIP-L + T5-XXL | Rectified flow (velocity) | Partial |
| Flux Kontext | Image Edit | MMDiT + 3D RoPE | 12B | Millions of pairs | CLIP-L + T5-XXL | Flow matching + LADD | Partial |
| Seedream 2.0 | Image | MMDiT | 3.9B | ~250M pairs | Bilingual LLM + Glyph-ByT5 | Score matching | No |
| Seedream 3.0 | Image | MMDiT | ~3.9B | ~300M (expanded) | Bilingual LLM + Glyph-ByT5 | Flow matching | No |
| Qwen-Image | Image | MMDiT (20B) | 20B | 5.6B pairs | Qwen2.5-VL-7B | Flow matching | Yes |
| Wan 2.1 | Video | DiT + 3D Causal VAE | 14B / 1.3B | ~1.5B vid + 10B img | UMT5-XXL (cross-attn) | Flow matching | Yes |
| MiniMax Allegro | Video | DiT | Undisclosed | 107M img + 66M vid | Undisclosed | Diffusion | Yes |
| Hailuo 02 | Video | DiT + NCR | 3x Hailuo 01 | 4x Hailuo 01 | Undisclosed | Undisclosed | No |
1. OpenAI Sora
Architecture
Diffusion Transformer (DiT) operating on spacetime patches of video latent codes. A VAE compresses raw video temporally and spatially; compressed tokens are arranged as a grid of spacetime patches fed into a transformer with self-attention across both time and space. Images are treated as single-frame videos for joint training.
Data Pipeline
```mermaid
flowchart LR
A["Raw Data Sources\n• Shutterstock (licensed)\n• Pond5 stock footage\n• Public datasets\n• Custom/commissioned"] --> B["Pre-Training Filters\n• CSAM classifier\n• Explicit/violent removal\n• Quality/consistency checks\n• Human review (ambiguous)"]
B --> C["Recaptioning\n(DALL-E 3 technique)\nHighly descriptive\ncaptions for all data"]
C --> D["Training Data\nImages + Videos\n(variable res/duration/AR)"]
```
Training Recipe
```mermaid
flowchart TB
S1["Stage 1: Image Pretraining\nSpatial understanding"] --> S2["Stage 2: Low-Res Video\nTemporal dynamics"]
S2 --> S3["Stage 3: Progressive\nResolution Scaling"]
S3 --> S4["Stage 4: Joint Image+Video\nNative aspect ratios"]
S4 --> S5["Stage 5: Safety Alignment\nClassifiers + fine-tuning"]
```
Key insight: Quality emerges from scale — 3D consistency, object permanence, and basic physics simulation appear without explicit inductive biases. The DALL-E 3 recaptioning technique (replacing weak alt-text with rich descriptions) is critical for prompt-following fidelity.
Sora Evolution
| Version | Date | Capabilities |
|---|---|---|
| Sora Preview | Feb 2024 | Up to 60s, text-to-video only, no audio |
| Sora Turbo | Dec 2024 | 1080p, 20s, T2V/I2V/V2V, storyboard tool |
| Sora 2 | 2025 | Native audio, multi-shot consistency (LCT), improved physics |
2. Google Imagen
Architecture Evolution
```mermaid
flowchart LR
subgraph Imagen1["Imagen 1 (2022)"]
direction TB
I1A["T5-XXL\n(frozen, 4.6B)"] --> I1B["Base U-Net\n64×64 · 2B params"]
I1B --> I1C["SR Stage 1\n256×256 · 600M"]
I1C --> I1D["SR Stage 2\n1024×1024 · 400M"]
end
subgraph Imagen3["Imagen 3 (2024)"]
direction TB
I3A["T5-XXL + Gemini\nSynthetic Captions"] --> I3B["Latent Diffusion\nTransformer"]
I3B --> I3C["Upsampling\n2x / 4x / 8x"]
end
Imagen1 -->|"evolution"| Imagen3
```
Data Pipeline (Imagen 3)
```mermaid
flowchart LR
A["Billions of\nimage-text pairs\n(web + licensed)"] --> B["Quality Filter"]
B --> C["Safety Filter\n(violent, NSFW)"]
C --> D["AI-Generated\nContent Removal"]
D --> E["Deduplication +\nDown-weighting"]
E --> F["Caption Safety\n(PII removal)"]
F --> G["Dual Captioning\n• Original caption\n• Gemini synthetic caption"]
G --> H["Training Set"]
```
Training Details (Imagen 1 — most documented)
| Parameter | Value |
|---|---|
| Batch size | 2048 |
| Training steps | 2.5M (all 3 stages) |
| Optimizer | Adafactor (base) / Adam (SR) |
| Hardware | 256 TPU-v4 (base), 128 TPU-v4 (each SR) |
| CFG dropout | 10% |
| CFG weight | 1.35 (base), 8.0 (SR) |
| FID (COCO zero-shot) | 7.27 |
Key Techniques (Imagen 1)
- Dynamic Thresholding: At each sampling step, compute the p-th percentile of absolute pixel values as threshold s. If s > 1, clip to [−s, s] and divide by s. This prevents pixel saturation at high guidance weights and enables much larger CFG scales than static thresholding.
- Noise Conditioning Augmentation: Super-resolution models are conditioned on noisy low-res inputs (Gaussian noise with variance-preserving schedule). The augmentation level is randomly sampled during training and swept at inference. Critical for preventing train/test mismatch in cascaded pipelines.
- Efficient U-Net: Parameters shifted from high-res to low-res blocks, skip connections scaled by 1/√2, reversed up/downsampling order → 2-3× faster.
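Dynamic thresholding reduces to a percentile clip and rescale. A minimal NumPy sketch of the rule described above (the default p = 99.5 is a commonly cited setting; treat it as illustrative):

```python
import numpy as np

def dynamic_threshold(x0_pred: np.ndarray, p: float = 99.5) -> np.ndarray:
    """Clip a predicted image to the p-th percentile of its absolute
    pixel values, then rescale back into [-1, 1] (a sketch of Imagen's
    dynamic thresholding; applied at every sampling step)."""
    s = float(np.percentile(np.abs(x0_pred), p))
    s = max(s, 1.0)  # only rescale when pixels actually saturate
    return np.clip(x0_pred, -s, s) / s
```

Because s tracks the current sample rather than a fixed constant, large guidance weights no longer push pixels into flat saturated regions, which is what makes CFG scales like 8.0 in the SR stages usable.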
Key Innovation (Imagen 2)
- Aesthetic Score Conditioning: A dedicated aesthetics model (trained on human preferences for lighting, framing, exposure, sharpness) scores each image. This score is used as a conditioning signal during training — not just for filtering — giving more weight to human-preferred qualities.
Key insight: Scaling the text encoder (T5-XXL) was far more impactful than scaling the U-Net. Imagen 2's aesthetic score conditioning (not just filtering) was influential. Imagen 3's removal of AI-generated images from training data is a distinctive curation step.
3. FLUX.1 & Flux Kontext (Black Forest Labs)
Architecture
```mermaid
flowchart TB
subgraph TextEnc["Text Encoders"]
TE1["CLIP-L\n(pooled embedding → modulation)"]
TE2["T5-XXL\n(dense token embeddings)"]
end
subgraph Backbone["12B MMDiT Backbone"]
DS["19 Double-Stream Blocks\n(separate text/image weights)"]
SS["38 Single-Stream Blocks\n(shared weights, parallel attn+MLP)"]
DS --> SS
end
VAE["16-channel VAE\n(custom, adversarial training)"]
TextEnc --> Backbone
VAE -->|"encode"| Backbone
Backbone -->|"decode"| VAE
```
Data Pipeline
```mermaid
flowchart LR
A["Large-scale\nimage-text data"] --> B["NSFW Removal"]
B --> C["Aesthetic Filtering\n(score ≥ 6.5)"]
C --> D["Perceptual\nDeduplication\n(cluster-based)"]
D --> E["Synthetic Recaptioning\n(CogVLM)\n50/50 original:synthetic"]
E --> F["Precompute Embeddings\n(frozen encoders)"]
F --> G["Training Set"]
```
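The 50/50 original-to-synthetic caption mix in the recaptioning stage amounts to a per-sample coin flip at training time. A sketch (function name and signature are ours):

```python
import random

def pick_caption(original: str, synthetic: str,
                 p_synthetic: float = 0.5, rng=random) -> str:
    """Choose between the original alt-text and a VLM-written synthetic
    caption for one training sample. Keeping a share of original captions
    preserves the noisier phrasing users actually type at inference."""
    return synthetic if rng.random() < p_synthetic else original
```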
Training Recipe
```mermaid
flowchart TB
S1["Pre-train at 256×256\nbatch 4096, 500k steps"] --> S2["High-res fine-tune\n(QK-Norm for stability)"]
S2 --> S3["Resolution-dependent\ntimestep shifting"]
S3 --> S4["Guidance Distillation\n(teacher → student)\n→ FLUX.1 dev"]
S3 --> S5["Speed Distillation\n→ FLUX.1 schnell"]
```
Key Techniques
- Rectified Flow: Velocity prediction objective (not noise prediction). Logit-normal timestep sampling (m=0, s=1) outperforms uniform.
- QK-Normalization: RMSNorm on Q/K prevents attention logit explosion at high resolutions, enabling bf16 training.
- Timestep Shifting: t_m = (√(m/n) · t_n) / (1 + (√(m/n)−1) · t_n) — more pixels need proportionally more noise.
- RoPE: Rotary positional embeddings generalize to unseen sequence lengths (vs. learned absolute encodings).
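The timestep-shift formula above translates directly to code. A sketch in which `m` and `n` are interpreted as pixel (or token) counts at the target and reference resolutions:

```python
import math

def shift_timestep(t: float, m: int, n: int) -> float:
    """Map a timestep calibrated for n pixels to the equivalent timestep
    for m pixels: t_m = (sqrt(m/n) * t_n) / (1 + (sqrt(m/n) - 1) * t_n).
    For m > n the schedule bends toward higher noise, while t = 0 and
    t = 1 stay fixed."""
    a = math.sqrt(m / n)
    return (a * t) / (1 + (a - 1) * t)
```

At m = n the map is the identity, so the shift only takes effect during high-resolution fine-tuning.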
Flux Kontext (Image Editing)
- Fine-tuned from FLUX.1 checkpoint on millions of relational (input, output, instruction) pairs.
- Context + target tokens concatenated sequentially (not channel-wise — channel-wise performed worse).
- 3D RoPE: Factorized (t, h, w) coordinates — context images indexed by i, supporting multi-reference.
- Two-stage: flow matching → adversarial distillation (LADD) for fewer sampling steps.
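How sequential concatenation with factorized indices stays unambiguous can be illustrated with explicit (t, h, w) coordinates. This is a sketch under an assumed layout (target image at index 0, context image i at index i + 1), not Kontext's exact indexing:

```python
def kontext_position_ids(h: int, w: int, n_context: int):
    """Build factorized (t, h, w) RoPE indices for one target image plus
    n_context reference images. Giving each context image its own first
    coordinate keeps tokens distinguishable after concatenation."""
    def grid(t: int):
        return [(t, y, x) for y in range(h) for x in range(w)]
    ids = grid(0)                  # target tokens at t = 0 (assumed)
    for i in range(n_context):
        ids += grid(i + 1)         # context image i offset to t = i + 1
    return ids
```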
| Variant | License | Notes |
|---|---|---|
| FLUX.1 [pro] | Proprietary (API) | Full model, best quality |
| FLUX.1 [dev] | Non-commercial | Guidance-distilled, 12B |
| FLUX.1 [schnell] | Apache 2.0 | Speed-optimized, <2s generation |
| Kontext [pro/dev/max] | Mixed | Image editing variants |
4. ByteDance Seedream
Architecture
MMDiT (3.9B params in v2.0) with a custom bilingual LLM text encoder + Glyph-Aligned ByT5 for character-level text rendering. Uses 2D RoPE for image tokens and Cross-Modality RoPE (v3.0+) for text tokens.
Data Pipeline
```mermaid
flowchart LR
A["~250M image-text pairs\n70% Chinese / 30% English"] --> B["Multi-Stage Cleaning\n• Watermark removal\n• Artifact filtering\n• Aesthetic scoring"]
B --> C["Distribution Balancing\n• Downsample over-represented\n• Hierarchical clustering"]
C --> D["Active Learning\n• Find challenging examples\n• Iterative refinement"]
D --> E["Two-Tier Captioning\n• Generic (short+long)\n• Multi-perspective rich captions\n• Chinese + English"]
E --> F["Training Set"]
```
Defect-Aware Training (Seedream 3.0)
```mermaid
flowchart LR
A["Previously Excluded\nImages (with defects)"] --> B["Defect Detector\n(trained on 15K\nannotated samples)"]
B --> C{"Defect area\n< 20%?"}
C -->|Yes| D["Retain image +\nSpatial attention mask\n(exclude defect from loss)"]
C -->|No| E["Discard"]
D --> F["+21.7% more\ntraining data"]
```
Training Stages
```mermaid
flowchart TB
S1["Pre-train: 256×256\n(various aspect ratios)"] --> S2["Fine-tune: 512→2048px\n(Seedream 3.0)\nor 512→4096px (4.0)"]
S2 --> S3["Continuing Training (CT)\nDiversified aesthetic captions"]
S3 --> S4["Supervised Fine-Tuning\n(styles, text rendering, aesthetics)"]
S4 --> S5["RLHF\nVLM reward model (>20B params)\nMultiple iterations"]
S5 --> S6["Prompt Engineering (PE)\nFinal alignment"]
```
Key insight: The defect-aware training paradigm is unique — using spatial masks to exclude watermarked regions from loss computation lets the model learn from imperfect data without absorbing artifacts. The bilingual LLM text encoder enables native Chinese cultural knowledge. RLHF with a >20B VLM reward model is one of the largest reported for image generation.
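The spatial-mask mechanism amounts to excluding detector-flagged pixels from the regression loss. A minimal NumPy sketch (the 20% discard threshold comes from the pipeline above; all names and the MSE form are our assumptions):

```python
import numpy as np

def masked_diffusion_loss(pred, target, defect_mask, max_defect_frac=0.2):
    """Average squared error over clean pixels only. defect_mask is 1.0
    where a watermark/artifact detector fired; images whose flagged area
    reaches the threshold are discarded entirely (returns None)."""
    if defect_mask.mean() >= max_defect_frac:
        return None                 # too much defect: drop the sample
    keep = 1.0 - defect_mask
    return float(((pred - target) ** 2 * keep).sum() / keep.sum())
```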
Version Comparison
| Aspect | Seedream 2.0 | Seedream 3.0 | Seedream 4.0 |
|---|---|---|---|
| Loss | Score matching | Flow matching | Adaptive flow matching |
| Data | ~250M pairs | ~300M (+21.7% defect-aware) | Billions |
| Max resolution | Not specified | 2048×2048 | 4096×4096 |
| Post-training | SFT + RLHF | CT + SFT + RLHF + PE | CT + SFT + RLHF (joint T2I+editing) |
| Inference speed | — | 3s (1K) | 1.8s (2K) |
5. Qwen-Image & Wan (Alibaba)
Qwen-Image (Text-to-Image, 20B)
Data Pipeline
```mermaid
flowchart LR
A["5.6B image-text pairs\n55% nature, 27% design\n13% people, 5% synthetic text"] --> B["Stage 1\nResolution/corruption\nDedup, NSFW"]
B --> C["Stage 2\nImage Enhancement\nBlur/brightness/texture"]
C --> D["Stage 3\nCaption Alignment\nRaw + recaption + fused"]
D --> E["Stage 4\nSynthetic Text Data\n• Pure rendering\n• Compositional\n• Complex layouts"]
E --> F["Stage 5\nHigh-res refinement\n640p+ quality/aesthetic filter"]
F --> G["Training Set"]
```
Architecture
- 20B param MMDiT with double-stream (vision patches + text tokens)
- Text encoder: Frozen Qwen2.5-VL-7B-Instruct (a multimodal LLM — not CLIP/T5)
- Positional encoding: Multimodal Scalable RoPE (MSRoPE)
- Loss: Flow matching with exponential timestep shift
- Curriculum: Non-text images → simple text → paragraph/slide-level text compositions
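The flow-matching objective in the list above can be sketched as a single training step (the interpolation convention, names, and uniform t are our assumptions; the exponential timestep shift is omitted for brevity):

```python
import numpy as np

def flow_matching_step(model, x, rng):
    """One flow-matching training step on a single sample x: draw noise
    eps and a timestep t, form x_t = (1 - t) * x + t * eps on the straight
    path, and regress the model onto the constant velocity eps - x."""
    eps = rng.standard_normal(x.shape)
    t = rng.uniform()
    x_t = (1 - t) * x + t * eps
    v_target = eps - x
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```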
Wan 2.1 (Video Generation, 14B)
Data Pipeline
```mermaid
flowchart LR
A["~1.5B videos\n+ ~10B images"] --> B["Step 1: Fundamental\n(removes ~50%)\n• OCR text coverage\n• LAION aesthetic score\n• NSFW, watermark, blur\n• Duration/resolution"]
B --> C["Step 2: Visual Quality\n100 clusters →\nbalanced sampling →\nmanual scoring →\nexpert assessment model"]
C --> D["Step 3: Motion Quality\n6 tiers: optimal, medium,\nstatic, camera-driven,\nlow-quality, shaky"]
D --> E["Step 4:\nDeduplication"]
E --> F["Dense Captioning\n(Qwen2-VL)\nBilingual CN/EN"]
F --> G["Training Set"]
```
Architecture
- DiT with flow matching, UMT5-XXL text encoder via cross-attention
- Wan-VAE: 3D causal VAE (temporal 4×, spatial 8× compression), unlimited-length 1080p encoding
- Resolution-progressive curriculum with joint image+video training
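The stated compression ratios determine the latent grid directly. A small sketch (the causal first-frame handling, 1 + (T − 1)/4 latent frames, is our assumption based on common 3D causal VAE designs):

```python
def wan_vae_latent_shape(frames: int, height: int, width: int):
    """Latent (t, h, w) grid implied by 4x temporal and 8x spatial
    compression, with the first frame encoded on its own so that a single
    image (frames = 1) maps to one latent frame."""
    assert (frames - 1) % 4 == 0 and height % 8 == 0 and width % 8 == 0
    return (1 + (frames - 1) // 4, height // 8, width // 8)
```

Under this sketch, images fall out as the frames = 1 case, which is what makes joint image+video training a natural fit for the same backbone.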
Key insight: Wan's 4-step filtering pipeline (fundamental → visual quality → motion quality → dedup) is one of the most detailed public descriptions of video data curation. Using a VLM (Qwen2.5-VL) as the text encoder for image generation is a novel departure from CLIP/T5. The 6-tier motion quality classification for videos is uniquely granular.
6. MiniMax / Hailuo (Allegro)
Allegro Data Pipeline (most documented)
```mermaid
flowchart LR
A["412M raw images\n+ video corpus"] --> B["1. Duration/Resolution\n≥360p, ≥2s, ≥23fps"]
B --> C["2. Scene Segmentation\nSingle-scene clips\nTrim first/last 10 frames"]
C --> D["3. Low-Level Metrics\n• DOVER (brightness/clarity)\n• LPIPS (consistency)\n• UniMatch (motion)"]
D --> E["4. Aesthetics\nLAION Aesthetics\nPredictor"]
E --> F["5. Artifact Removal\nCRAFT text detection\nWatermark detection"]
F --> G["6. Coarse Captioning"]
G --> H["7. CLIP Similarity\nFilter"]
H --> I["107M images\n48M vid@360p\n18M vid@720p\n2M HQ vid (fine-tune)"]
```
Training Recipe (Allegro)
```mermaid
flowchart TB
S1["Stage 1: T2I Pre-training\n107M image-text pairs\nVisual fundamentals"] --> S2["Stage 2: T2V Pre-training\n48M@360p + 18M@720p\nTemporal consistency"]
S2 --> S3["Stage 3: T2V Fine-tuning\n2M high-quality clips\n6-16s medium-to-long\nmoderate motion"]
```
Hailuo 02 (Commercial)
- Architecture: DiT + Noise-aware Compute Redistribution (NCR) — dynamically allocates more compute to high-error timesteps
- 3× more parameters and 4× more data than Hailuo 01
- 2.5× training/inference efficiency improvement via NCR
- Native 1080p, up to 10s, 24-30fps
- Incorporates user feedback data from Hailuo 01
VTP (Visual Tokenizer Pre-training)
Key insight: MiniMax's VTP work showed that giving lower weight to pixel reconstruction loss (vs. contrastive + self-supervised losses) actually improved downstream generation quality. Allegro's 7-step video filtering pipeline is one of the most detailed publicly available.
Common Patterns & Takeaways
Universal Data Curation Pipeline
```mermaid
flowchart TB
A["Raw Web-Scale Data\n(millions to billions)"] --> B["Basic Filters\n• Resolution thresholds\n• Duration/FPS (video)\n• Corruption checks"]
B --> C["Safety Filters\n• NSFW classifiers\n• CSAM detection\n• Violence removal"]
C --> D["Quality Assessment\n• Aesthetic scoring\n• Blur/exposure detection\n• Motion quality (video)"]
D --> E["Deduplication\n• Perceptual hashing\n• Cluster-based\n• Semantic dedup"]
E --> F["Recaptioning\n• VLM-generated captions\n• Multi-perspective\n• Bilingual (some)"]
F --> G["Distribution Balancing\n• Downsample majority\n• Preserve long-tail\n• Active learning"]
G --> H["Curated Training Set\n(typically 10-25% of raw data)"]
```
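The filter stages of this pipeline can be sketched as a predicate chain with per-stage survivor counts (sample schema and thresholds are illustrative, echoing those reported for Allegro; recaptioning and balancing rewrite samples rather than drop them, so they are omitted):

```python
def curate(samples, stages):
    """Apply (name, predicate) stages in order, recording how many
    samples survive each stage."""
    kept, report = list(samples), {}
    for name, keep in stages:
        kept = [s for s in kept if keep(s)]
        report[name] = len(kept)
    return kept, report

# Hypothetical clip records; thresholds mirror the basic/safety/quality split.
clips = [
    {"h": 720,  "secs": 8.0,  "nsfw": False, "aesthetic": 5.9},
    {"h": 240,  "secs": 8.0,  "nsfw": False, "aesthetic": 6.1},  # too low-res
    {"h": 720,  "secs": 1.0,  "nsfw": False, "aesthetic": 6.1},  # too short
    {"h": 1080, "secs": 12.0, "nsfw": True,  "aesthetic": 7.0},  # unsafe
]
stages = [
    ("basic",   lambda c: c["h"] >= 360 and c["secs"] >= 2.0),
    ("safety",  lambda c: not c["nsfw"]),
    ("quality", lambda c: c["aesthetic"] >= 5.0),
]
kept, report = curate(clips, stages)
```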
Common Training Recipe Pattern
```mermaid
flowchart TB
S1["Phase 1: Low-Resolution Pre-training\n256×256 or 512×512\nLarge batch, many steps"] --> S2["Phase 2: Progressive Resolution\nGradually increase to\n1024/2048/4096"]
S2 --> S3["Phase 3: Quality Fine-tuning\nHigh-quality subset\nAesthetic optimization"]
S3 --> S4["Phase 4: Alignment\n• RLHF (Seedream)\n• Guidance distillation (Flux)\n• Safety fine-tuning (Sora)"]
```
Key Trends
| Trend | Details |
|---|---|
| Architecture convergence | Nearly all models have converged on DiT/MMDiT backbones with flow matching objectives. U-Nets are gone. |
| Recaptioning is essential | Every model uses VLM-generated synthetic captions (DALL-E 3, CogVLM, Gemini, Qwen2-VL). This is the single highest-impact data processing step. |
| Text encoder = LLM | Evolution from CLIP → T5-XXL → full LLMs (Qwen2.5-VL, bilingual LLMs). Richer language understanding directly improves generation. |
| Progressive resolution | All models train low-res first, then scale up. Timestep shifting compensates for more pixels needing more noise. |
| Flow matching replaces DDPM | Rectified flow / flow matching is now standard. Velocity prediction with logit-normal timestep sampling. |
| Data quality > data quantity | Aggressive filtering (often removing 50-75% of raw data). Seedream's defect-aware masking shows creative approaches to reclaiming borderline data. |
| Post-training matters | RLHF, guidance distillation, and adversarial distillation are increasingly important for final quality and speed. |
| RoPE for positions | 2D/3D RoPE has replaced learned positional embeddings for better resolution generalization. |
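The logit-normal timestep sampler named in the table is two lines. A sketch with `m` and `s` matching the FLUX settings listed earlier:

```python
import math
import random

def logit_normal_t(m: float = 0.0, s: float = 1.0, rng=random) -> float:
    """Draw u ~ N(m, s) and squash it through a sigmoid, yielding a
    timestep in (0, 1) concentrated around the middle of the schedule,
    where intermediate noise levels make the velocity target hardest."""
    u = rng.gauss(m, s)
    return 1.0 / (1.0 + math.exp(-u))
```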
What Remains Proprietary
The most guarded details across all models are: exact dataset composition, compute budgets, model parameter counts (for closed models), and specific filtering thresholds. Open models (Flux dev, Wan, Qwen-Image) provide the most architectural detail but still withhold training data specifics.