A white-haired cybernetic figure kneeling with a katana in a rain-slicked neon alley, red visor glowing.

Text-to-Image / A Diffusion Field Guide

Eight Steps to a Picture

Understanding Z-Image-Turbo with Experiments

How a 6-billion-parameter open model paints a photo in eight passes — and what every generation knob actually does to the result.

Tongyi-MAI · 6.15B params · Apache 2.0 · ~8 steps · <16 GB VRAM
Z-Image-Turbo · 9 steps · guidance 0 · 1280×800 · seed 42 · Q8 GGUF · 24 GB L4

PART 01

Generation and Images

Want to generate a professional-looking photo like this for free, as many times as you like, with no rate limits and no “you have 3 credits left”?

Z-Image-Turbo is your friend.

Hero generation

Generated locally on a single 24 GB GPU · Z-Image-Turbo · 9 steps · guidance 0 · seed 42 · Q8 GGUF

There was no studio behind that image. No photographer, no softbox, no stock license, no monthly subscription. It came out of a single rented GPU in roughly 10 seconds — and you could make ten thousand more for the price of the electricity.

So what is this thing, and where does it sit?

Turning a sentence into an image — “text-to-image” generation — is one of the most competitive areas in AI right now. Dozens of capable models have appeared in the last two years, and they all do the same job: you type a description, they produce a matching picture. What really separates them is who controls them and what it takes to run them. Some live inside paid apps and APIs run by large labs. Others ship as open files you can download, inspect, and run on your own machine.

Z-Image is one of the open ones — a 6.15-billion-parameter model from Alibaba’s Tongyi-MAI lab, released in November 2025 under a permissive Apache 2.0 license. The version we’ll use throughout, Z-Image-Turbo, is its speed-optimized variant. To see why it’s worth a close look, let’s place it next to the field:

| Model | Maker | Availability | Parameters | Run it yourself? |
| --- | --- | --- | --- | --- |
| Z-Image-Turbo | Alibaba · Tongyi-MAI | Open (Apache 2.0) | 6.15B | Yes (under 16 GB VRAM) |
| Stable Diffusion 3.5 Large | Stability AI | Open (community license) | ~8B | Yes (consumer GPU) |
| FLUX.1 [dev] | Black Forest Labs | Open (non-commercial) | 12B | Hard (high-end GPU) |
| Qwen-Image | Alibaba · Tongyi | Open (Apache 2.0) | ~20B | Very hard |
| Hunyuan-Image 3.0 | Tencent | Open | ~80B (MoE) | No (datacenter-class) |
| Seedream 4.0 | ByteDance | Closed (API / app) | Undisclosed | No (hosted only) |
| Imagen 4 | Google | Closed (API / app) | Undisclosed | No (hosted only) |
| Gemini 2.5 Flash Image | Google | Closed (API / app) | Undisclosed | No (hosted only) |
| Midjourney v7 | Midjourney | Closed (app) | Undisclosed | No (hosted only) |

Snapshot as of late 2025. Closed-model sizes are not disclosed by their makers; the open-model figures are approximate and worth re-checking against each model’s card.

A few things jump out. Most of the best-known names are closed — usable only through someone else’s service, with no public size and no way to run them yourself. Among the open models, the trend has been toward enormous parameter counts — 20B to 80B — meaning expensive hardware and slow generation. Z-Image-Turbo is the unusual corner of that table: open and permissively licensed, small enough for a single consumer GPU, fast, and still competitive on quality. At release it scored an Elo of 1025 on Alibaba’s own AI Arena — 4th overall, 1st among open models — and was independently rated the #1 open-weights model on the Artificial Analysis leaderboard. Small, fast, free, and near the top is a rare combination — and it’s exactly what lets you run it without limits.

One more thing before we start, because it shapes every experiment that follows. Z-Image comes in two versions: the original base model and the Turbo model we’re using. Base builds an image over roughly 100 refinement passes; Turbo is a distilled, compressed-for-speed copy that reaches comparable quality in just 8. That speedup ships with a few unusual default settings, and getting them right matters — so we won’t take them on faith. Part 2 introduces each from scratch and shows, with pictures, exactly what it does.

Weathered orange bipedal mech standing in a snowy forest
Astronaut pointing toward the viewer under pink and purple light
Anime youth in an orange hoodie and white techwear jacket
Armored sci-fi soldiers patrolling a bioluminescent jungle
Close-up of a white-and-gold humanoid robot with bright blue eyes
Anime girl sitting on a rooftop overlooking a neon city at night
Astronaut whose visor reflects a starfield, on a deep red background
Anime swordsman with a glowing purple katana in a neon city
Young woman holding a blue lightsaber in a sunlit desert market
Stylized robot with green sneakers against bold typographic poster art

Ten more generations from the same model — photoreal mecha, anime, studio portrait, poster typography, cinematic sci-fi — each from a short prompt, on a single GPU.

Here’s the plan. Part 2 hands you the code, then walks knob by knob — steps, guidance, seed, aspect ratio, resolution, model precision — showing what each does to the image. Part 3 opens the hood and traces, in exactly eight steps, the path from prompt to picture.

PART 02

Levers to Control Generation

Generating your first image

Two pieces of software do the work. First, diffusers — Hugging Face’s open-source library that wraps a diffusion model into a ready-to-run pipeline: it downloads the weights, wires up the components (text encoder, transformer, VAE, scheduler), and runs the denoising loop for you, so an image is a few lines of code rather than a re-implementation of the math. Second, the model itself, which we load in quantized form.

Quantized simply means the model’s weights are stored at lower numerical precision (fewer bits per number) to make the file smaller and lighter on memory. Why start here instead of full precision? Z-Image-Turbo at full bfloat16 is about 12.3 GB, while the 8-bit (Q8) version is ~7 GB with no visible loss in quality, which leaves comfortable headroom on a 24 GB GPU for the text encoder, the VAE, and the working memory each image needs. It’s the most practical way to just get going; later, in Experiment 6, we push the quantization much harder to find where quality finally breaks. Here’s the one-time install:

shell
pip install -U git+https://github.com/huggingface/diffusers
pip install -U gguf accelerate transformers spandrel

And here’s the load. We pull the transformer down as a single GGUF file — the single-file format these quantized weights ship in — and assemble the full pipeline around it:

python
import torch
from huggingface_hub import hf_hub_download
from diffusers import ZImagePipeline, ZImageTransformer2DModel, GGUFQuantizationConfig

DTYPE = torch.bfloat16

# 1) Load the quantized transformer (here: Q8) as a single file
gguf = hf_hub_download("unsloth/Z-Image-Turbo-GGUF", "z-image-turbo-Q8_0.gguf")
transformer = ZImageTransformer2DModel.from_single_file(
    gguf,
    quantization_config=GGUFQuantizationConfig(compute_dtype=DTYPE),
    torch_dtype=DTYPE,
)

# 2) Assemble the full pipeline around that transformer
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", transformer=transformer, torch_dtype=DTYPE
)
pipe.to("cuda")

When the pipeline prints itself, you can see the five parts we’ll meet again in Part 3 — a FlowMatchEulerDiscreteScheduler, a Qwen3Model text encoder with a Qwen2Tokenizer, the ZImageTransformer2DModel, and an AutoencoderKL VAE. Generating an image is one call:

python
image = pipe(
    prompt=prompt,
    height=1280, width=720,
    num_inference_steps=9,   # 8 actual DiT forward passes
    guidance_scale=0.0,      # Turbo bakes guidance in — keep this at 0
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

The method: one prompt, one variable at a time

Every experiment below uses the same fixed prompt — a black-helmeted robot, written with explicit subject, clothing, and lighting sections — so that anything that changes in the output is caused by the parameter under test and nothing else. This is the discipline that makes the comparisons fair: hold everything constant, perturb one axis, attribute the change.

python · prompt
prompt = """
A sleek black helmeted robot shown in a three-quarter profile against a solid red background

Subject: A highly detailed mechanical head featuring exposed wiring and internal gears beneath the glossy shell of the faceplate, with intricate metallic textures visible around the neck joint.

Clothing: Wearing a high-collared dark tactical jacket made of matte fabric with reinforced shoulder pads that have subtle mesh detailing and zippered accents along the seams; an illuminated red logo on the left shoulder reads 'Libel' in stylized font.

Lighting: Dramatic spotlight highlighting the curves of the helmet while casting deep shadows within its mechanical core, creating high contrast between reflective surfaces and dark recesses.
"""

What follows are six experiments. For each, I explain what the lever is and why it matters first, then show the sweep and what to look for.

Experiment 01

Inference Steps

The idea. A diffusion model doesn’t paint an image in one shot; it starts from random noise and removes a little of it on each step, refining its guess of the final picture as it goes. Too few steps and the model hasn’t finished cleaning up the noise — the image is blurry, under-formed, or muddy. More steps reduce that error, but with sharply diminishing returns: past a point, extra steps barely change anything and just cost you time. The entire purpose of a “Turbo” model is to push the “good enough” line down to a tiny number of steps.

The sweep. num_inference_steps ∈ {1, 3, 6, 9, 12, 15}, guidance fixed at 0.0, seed fixed.

python
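import os

P, H, W, SEED = 183, 1280, 720, 42          # P is only a label used in the output filename
os.makedirs("EXPERIMENTS", exist_ok=True)   # image.save() does not create the folder
for S in [1, 3, 6, 9, 12, 15]:
    image = pipe(prompt=prompt, height=H, width=W,
                 num_inference_steps=S, guidance_scale=0.0,
                 generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
    image.save(f"EXPERIMENTS/P{P}_STEP{S}_SEED{SEED}_{W}x{H}.png")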

Figure: outputs at 1, 3, 6, 9 (the default), 12, and 15 inference steps. The 1-step frame is a soft, blurred blob; the 3-step frame is already fully formed.

Same prompt, same seed; only the step count changes.

What to look for. The jump from 1 to 3 steps is the dramatic one. At 1 step the image is a soft, smeared blob — the helmet is a vague silhouette and the “Libel” patch is an unreadable red smudge. By 3 steps the model has already resolved almost everything: the glossy shell, the exposed lens-and-gear mechanism at the jaw, the high-collared tactical jacket, and the legible “Libel” logo are all there. From 6 through 15 the differences are marginal — small shifts in how the internal mechanical detail and specular highlights render, closer to re-rolls of the same image than to genuine improvements. In other words, Turbo is essentially converged by about 3 steps and comfortably so at its 9-step (8-NFE) default; pushing to 12 or 15 buys nothing you can see while costing proportionally more time. This is the inverse of the old “always use 30–50 steps” habit. (Why so few steps suffice is the subject of Part 3, Step 7.)

Experiment 02

Guidance Scale

The idea. Classifier-Free Guidance (CFG) is the standard way to make a diffusion model follow your prompt more strongly. Normally the model produces two predictions on each step — one that listens to your prompt and one that ignores it — and the guidance scale controls how far it extrapolates away from the ignore-the-prompt version and toward the prompt. Crank it up and the image obeys the prompt harder, but loses variety and, at high values, starts to look oversaturated and over-cooked. It also normally doubles the cost of every step, since it needs both predictions.
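The extrapolation described above is usually written as a one-line blend of the two predictions. The sketch below uses one common parameterization, chosen so that a scale of 0 returns the prompt-following prediction unchanged, which matches how this article uses guidance_scale=0. The function name and the toy vectors are illustrative, and whether ZImagePipeline uses exactly this form internally is an assumption to check against the pipeline source.

```python
def apply_cfg(cond, uncond, scale):
    """Blend two per-step predictions (plain lists standing in for tensors).

    Assumed parameterization: cond + scale * (cond - uncond).
    scale = 0 -> the conditional prediction alone (no second pass needed).
    scale > 0 -> pushed further away from the unconditional prediction.
    """
    return [c + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy numbers, not model outputs.
cond = [1.0, 2.0, -1.0]     # prediction that listens to the prompt
uncond = [0.5, 2.0, 0.0]    # prediction that ignores the prompt

for scale in (0.0, 0.5, 1.0):
    print(scale, apply_cfg(cond, uncond, scale))
```

Components where the two predictions agree (the middle value here) are left alone at any scale; only the disagreement gets amplified, which is why high scales exaggerate whatever the prompt changes.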

The Turbo twist. Here’s where distillation changes the meaning of the knob. Turbo has CFG baked into its weights — it was built assuming guidance, so at run time it does a single prompt-following pass and you leave guidance_scale=0. Turning the dial up doesn’t sharpen prompt adherence the way it would on a base model; it reintroduces an operation the model was never meant to run at inference, and quality degrades.

The sweep. guidance_scale ∈ {0.0, 0.25, 0.5, 0.75, 1.0} at 9 steps.

python
P, H, W, S, SEED = 183, 1280, 720, 9, 42
for G in [0.0, 0.25, 0.5, 0.75, 1.0]:
    image = pipe(prompt=prompt, height=H, width=W,
                 num_inference_steps=S, guidance_scale=G,
                 generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
    image.save(f"EXPERIMENTS/P{P}_STEP{S}_SEED{SEED}_{W}x{H}_G{G}.png")

Figure: outputs at guidance 0.0, 0.25, 0.5, 0.75, and 1.0. The 0.0 frame is balanced with neutral exposure; the 1.0 frame is darker and higher in contrast, with saturated reds.

Same prompt, same seed, 9 steps; only the guidance scale changes.

What to look for. On a base model, raising guidance from 1 toward 7 visibly tightens prompt adherence. Here it does something different — and milder, since 0–1 is a gentle range. The cleanest, most neutral exposure is at 0; as guidance climbs the image steadily gains contrast and saturation — the blacks deepen, the red rim-light on the neck mechanism intensifies, and by 0.75–1.0 the shadows in the jacket start to crush and the reds go a touch over-cooked. None of this is better prompt-following; it’s the classic over-saturation drift of guidance, just early. The takeaway holds: on Turbo, guidance is a contrast knob, not a quality knob, so the right setting is the one the model was distilled for — 0. This is a concrete demonstration of why distillation rewrites a parameter you thought you understood.

Experiment 03

Seed

The idea. Generation starts from a draw of random noise, and the seed is what makes that draw repeatable. Same prompt + same seed + same settings ⇒ the exact same image, every time — which is what lets you change one other thing and trust the comparison. Different seeds give different starting noise, so the model walks a different path to a different but still prompt-consistent image: the pose, framing, and exact details shift while the prompt’s content stays satisfied. (One caveat for a heavily distilled few-step model: it commits to a composition very fast, so outputs across seeds can feel a little more alike than in a 50-step model.)

The sweep. Eight seeds, run in landscape (1280×720) and again in portrait (720×1280).

python
SEEDS = [0, 7, 42, 123, 777, 2024, 88888, 1234567]
S, G = 9, 0.0
for W, H in [(1280, 720), (720, 1280)]:   # landscape first, then portrait
    for SEED in SEEDS:
        image = pipe(prompt=prompt, height=H, width=W,
                     num_inference_steps=S, guidance_scale=G,
                     generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
        image.save(f"EXPERIMENTS/SEED{SEED}_{W}x{H}.png")

Figure: two grids of eight frames each, portrait (720×1280) and landscape (1280×720), one frame per seed: 0, 7, 42, 123, 777, 2024, 88888, 1234567.

The same prompt and settings across eight seeds, in both shapes.

What to look for. Every seed obeys the prompt — black helmeted robot, exposed jaw mechanism, matte tactical jacket, red backdrop, the “Libel” patch — yet no two are the same picture. The seed reshuffles everything the prompt leaves open: which way the head faces, whether the shell reads glossy or matte, how much of the internal gearwork is exposed, the cut of the jacket and where the “Libel” patches land, even the mood of the lighting. That spread is the model’s diversity, and it’s healthy here — despite committing to a composition in only a handful of steps, the eight seeds come out clearly distinct rather than as minor variations. Compare the two rows and you’ll also see that a seed is not portable across shapes: portrait seed 42 and landscape seed 42 are different pictures, not one image cropped two ways — the wider canvas pulls in more of the shoulders and torso and recomposes the subject around it.
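Both seed behaviours, exact repeatability and non-portability across shapes, can be reproduced with a toy noise generator. This is a minimal sketch using Python's standard library rather than torch; make_noise is a hypothetical helper, and the row-major fill order is an assumption made for illustration (a CUDA generator may lay its samples out differently, but the principle is the same).

```python
import random


def make_noise(seed, height, width):
    """Return a height x width grid of Gaussian samples drawn in row-major order."""
    rng = random.Random(seed)  # seeded generator -> repeatable stream
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


# Same seed, same shape: identical starting noise.
a = make_noise(42, 4, 6)
b = make_noise(42, 4, 6)
print(a == b)                      # True

# Different seed: a different draw.
print(a == make_noise(7, 4, 6))    # False

# Same seed, transposed shape: the stream of samples is the same, but it
# wraps onto the grid differently, so a given (row, col) holds another value.
c = make_noise(42, 6, 4)
print(a[0] == c[0] + c[1][:2])     # True: the first six samples are shared
print(a[1][0] == c[1][0])          # False: same position, different sample
```

From the model's point of view the portrait and landscape canvases start from unrelated noise, so there is no reason for them to converge on the same composition.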

Experiment 04

Aspect Ratio

The idea. A model learns from a distribution of image shapes, and it’s most reliable near shapes it saw often. Aspect ratio is really a lever on composition: a tall canvas nudges the model toward a single vertical subject, a wide one toward environmental or panoramic layouts. Two practical constraints bound your choices — your total pixel budget (compute and VRAM), and the rule that both dimensions must be divisible by 16 (a consequence of how the latent is compressed and tokenized; see Part 3, Step 4).

The sweep. Hold total pixels at roughly 1 megapixel while varying the ratio across 11 shapes, each dimension rounded to a multiple of 16:

  • 21:9 · 1568×672
  • 3:4 · 880×1184
  • 16:9 · 1360×768
  • 4:5 · 912×1152
  • 2:3 · 832×1248
  • 3:2 · 1248×832
  • 5:4 · 1152×912
  • 9:21 · 672×1568
  • 4:3 · 1184×880
  • 1:1 · 1024×1024
  • 9:16 · 768×1360

All eleven ratios at their true proportions, each holding a ~1 MP budget.
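One way to arrive at sizes like these is to solve for the width and height that give the target area at a given ratio, then snap each side to the nearest multiple of 16. The helper below is a sketch of that arithmetic, not necessarily the procedure used for the sweep; dims_for_ratio and its defaults (a 1024×1024 pixel budget) are assumptions.

```python
import math


def dims_for_ratio(ratio_w, ratio_h, target_pixels=1024 * 1024, multiple=16):
    """Pick (width, height) near target_pixels for a given aspect ratio,
    rounding each side to the nearest multiple of `multiple`."""
    ratio = ratio_w / ratio_h
    width = math.sqrt(target_pixels * ratio)
    height = math.sqrt(target_pixels / ratio)

    def snap(x):
        return multiple * round(x / multiple)

    return snap(width), snap(height)


for rw, rh in [(21, 9), (16, 9), (3, 2), (4, 3), (5, 4), (1, 1), (9, 16)]:
    w, h = dims_for_ratio(rw, rh)
    print(f"{rw}:{rh} -> {w}x{h} ({w * h / 1e6:.2f} MP)")
```

With these defaults the function lands on the same dimensions listed above, for example 1568×672 for 21:9 and 1360×768 for 16:9. Because of the snapping, each shape is only approximately at its nominal ratio.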

What to look for. Every shape yields a coherent, well-composed image — the model recomposes the subject to fit rather than stretching or squishing it. Portrait ratios (2:3, 9:16) crop in close on the head and collar; landscape ratios (3:2, 16:9) pull back to reveal more of the jacket and shoulders; the square sits in between. The extremes are the real test: at 21:9 the head sits in a wide field with generous negative space, and at 9:21 the figure stacks tall and narrow — both stay coherent, with no duplicated heads or tiling. (Because each ratio starts from a different noise tensor, the pose and detailing shift from one to the next, exactly like the seed sweep — these are different images, not one image letterboxed.)

Experiment 05

Resolution / Scale

The idea. Beyond shape, the absolute number of pixels matters, because every model has a native training resolution — for Z-Image that’s the 1024-class, around 1 megapixel. Generate well below native and there simply aren’t enough latent tokens to carry the structure, so you lose detail and coherence. Generate well above native and you hit the classic diffusion failure: the model duplicates content — two heads, repeated limbs, tiled textures — because the patterns it learned at training size effectively “tile” when asked to fill far more tokens than it ever saw. (This is why the standard professional workflow is generate near native, then upscale — and exactly the kind of upscaling pipeline worth a follow-up post.)

The sweep. Hold 16:9 fixed and scale total resolution from ~0.33 MP up to ~2.36 MP:

python
BASE_W, BASE_H = 256, 144          # smallest 16:9 box divisible by 16
for n in [3, 4, 5, 6, 7, 8]:       # n=5 is the 1280×720 reference
    W, H = BASE_W * n, BASE_H * n
    image = pipe(prompt=prompt, height=H, width=W,
                 num_inference_steps=9, guidance_scale=0.0,
                 generator=torch.Generator("cuda").manual_seed(42)).images[0]
    image.save(f"EXPERIMENTS/16x9_{W}x{H}.png")

Figure: the six 16:9 frames, 768×432 (0.33 MP), 1024×576 (0.59 MP), 1280×720 (0.92 MP), 1536×864 (1.33 MP), 1792×1008 (1.81 MP), and 2048×1152 (2.36 MP).

The 16:9 sweep, each frame drawn at a size proportional to its true pixel count, from 0.33 MP up to 2.36 MP.

What to look for. Two things. First, detail density climbs with resolution: at 0.33 MP the forms are all correct but soft, and as the pixel budget grows the fine mechanical work — gear teeth, the lens rings, the jacket’s woven texture, the “Libel” stitching — resolves progressively crisper, sharpest in the 2.36 MP frame at the bottom. Second, and more surprising given the theory above: the predicted high-resolution failure doesn’t appear here. Across the whole sweep, including the top end at roughly 2.4× native, the model keeps a single coherent figure — no second head, no tiled texture, no repeated limbs. So while duplication is the real risk when you push far past native, Z-Image’s comfortable range at 16:9 evidently extends well beyond 1 MP — at least to the ~2.36 MP tested; you’d likely have to go considerably higher to trigger the tiling. And as with the earlier sweeps, each resolution is its own noise tensor, so pose and framing drift from rung to rung — these are different images generated at different scales, not one image upscaled.

Experiment 06

Quantization (Model Size)

The idea. A model’s weights are normally stored at 16-bit precision (here, bfloat16). Quantization stores them at lower precision to shrink the file and the memory footprint. GGUF is a single-file model format that came out of the llama.cpp ecosystem for language models and has been adopted for diffusion transformers; its “Q” levels are roughly bits-per-weight:

  • Q8 ≈ 8-bit — within rounding error of full precision; effectively lossless.
  • Q4 ≈ 4-bit — the usual size/quality sweet spot; much smaller, modest quality loss.
  • Q2 ≈ 2-bit — aggressive; “for experiments,” with quality dropping off sharply.

For the actual Z-Image-Turbo files, full bf16 is 12.3 GB, while Q8 = 7.22 GB, Q4 = 5.02 GB, and Q2 = 3.64 GB — which is precisely how a 6B-parameter transformer fits comfortably on a 24 GB card with room to spare. (Modern “K-quants” are smarter than uniform rounding: they keep sensitive layers at higher precision and squeeze tolerant ones harder, so even “Q2” here isn’t a flat 2-bit model.)
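A quick check on those file sizes is to divide each by the parameter count and see how many bits per weight the file actually spends. The sketch below does that arithmetic under two assumptions: that the quoted sizes are decimal gigabytes, and that each file holds only the 6.15B-parameter transformer.

```python
PARAMS = 6.15e9  # parameter count quoted in this article

# File sizes quoted above, read as decimal gigabytes (an assumption).
FILE_SIZES_GB = {"BF16": 12.3, "Q8": 7.22, "Q4": 5.02, "Q2": 3.64}


def effective_bits_per_weight(size_gb, params=PARAMS):
    """Average bits stored per parameter if the whole file were weights."""
    return size_gb * 1e9 * 8 / params


for name, size in FILE_SIZES_GB.items():
    print(f"{name}: {effective_bits_per_weight(size):.1f} bits/weight")
```

The averages come out at about 16, 9.4, 6.5, and 4.7 bits per weight. Every quantized file sits above its nominal bit-width, which is consistent with the point that the “Q” levels are only rough labels and that “Q2” is not a flat 2-bit model.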

The sweep. Re-run the entire experiment suite at Q8, Q4, and Q2 by changing only the file passed to from_single_file:

python
# z-image-turbo-Q8_0.gguf  →  z-image-turbo-Q4_K_M.gguf  →  z-image-turbo-Q2_K.gguf

Figure: the same frame at Q8, Q4, and Q2. Same prompt, seed, and settings; only the quantization changes.

What to look for. Compare Q8 and Q4 first: they’re all but indistinguishable, with the same composition and the same detail, yet Q4 is about 30% smaller on disk (5.02 vs 7.22 GB). That’s the headline: 4-bit quantization is close to a free lunch on this model, which is why it’s the popular default. Q2 is the interesting one. It still produces a coherent, detailed image (the helmet, the neck mechanism, and the jacket weave hold up better than 2-bit’s reputation suggests), but the composition has clearly drifted: the head turns to face you and the framing shifts. So the real cost of aggressive quantization here is less an obvious drop in fidelity than a perturbation of the sampling trajectory: the same seed lands on a different image. Quantization nudges every weight slightly, and below a certain bit-width those tiny errors compound enough to steer the generation elsewhere. Practical read: reach for Q4 freely; treat Q2 as a memory-saver that re-rolls the picture.
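To get a feel for how much “nudging” each bit-width implies, the toy below rounds a list of random weights to 8, 4, and 2 bits and measures the average rounding error. It uses plain uniform round-to-nearest over the min-to-max range, which is deliberately simpler than the block-wise GGUF K-quant schemes, so treat the numbers as an illustration of the trend only.

```python
import random


def quantize_uniform(weights, bits):
    """Round each weight to one of 2**bits evenly spaced levels spanning
    the min..max range, then map back to floats."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # toy "layer"

errors = {bits: mean_abs_error(weights, quantize_uniform(weights, bits))
          for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean abs rounding error {err:.4f}")
```

Each bit removed roughly doubles the spacing between representable values, so the error grows quickly toward the low end. That fits the picture above: small enough at 8 and 4 bits to leave the trajectory alone, large enough at 2 bits to push it somewhere else.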

PART 03

Under the Hood, in Eight Steps

We’ve been treating the model as a black box that turns a sentence into a picture. Now let’s open it. The journey from your prompt to the final image passes through exactly eight stages — and the seventh of them is the eight-pass denoising loop the model is named for. Here’s the whole path, one step at a time.

Diagram: two inputs feed two parallel paths. PROMPT → (1) Tokenize: prompt → tokens → (2) Text encoder: Qwen3-4B, frozen. SEED → (3) Noise latent: seed → latent noise → (4) Patchify: latent → tokens. Both paths then merge → (5) Single stream: DiT, 30 layers → (6) Positions: 3D RoPE → (7) Denoise ×8: flow matching → (8) Decode: VAE decode → PICTURE.

The whole path, top to bottom: two inputs — your prompt and the seed — are encoded in parallel, merged into a single token stream, denoised eight times, then decoded into pixels.

01
tokenization · Qwen2Tokenizer

From Words to Numbers

A neural network can’t read text; it reads numbers. The first thing that happens to your prompt is tokenization — a Qwen2Tokenizer splits your sentence into a sequence of tokens (sub-word chunks) and maps each to an integer ID. Before that, the prompt is wrapped in a short chat template (the <|im_start|>user … <|im_end|> format), because the text encoder is a chat-style language model and expects to be addressed like one. The sequence is capped at 512 tokens, the model’s working budget for how much prompt it will actually read. (A detail that looks like a bug but isn’t: the encoder is from the Qwen3 family but uses the Qwen2 tokenizer — those two generations share tokenizer tooling.)

02
text encoder · Qwen3-4B

A Language Model Reads Your Prompt

Those token IDs are fed into a frozen Qwen3-4B language model acting as the text encoder, which turns them into a rich numerical representation — embeddings that capture not just the words but their meaning and relationships. Why an entire 4-billion-parameter LLM, instead of the smaller CLIP or T5 encoders older models used? Because an instruction-tuned LLM brings genuine language understanding: it handles long, compositional prompts, follows instructions, carries real-world knowledge, and — crucially for Z-Image — is strongly bilingual (Chinese and English), which is a big part of why the model is so good at rendering legible text inside images. This single choice is why Z-Image rewards clear, descriptive prompting: there’s a capable reader on the other end of your sentence. (The pipeline reads conditioning from one of the LLM’s near-final hidden layers rather than its very last output — a detail worth confirming in your own environment.)

03
latent init · the seed

Starting From Pure Noise

While the text path is running, the image path begins — not from a blank canvas, but from a canvas of pure random noise. A tensor is filled with Gaussian noise, and that noise is the raw material the model will sculpt into a picture. This is exactly where the seed from Experiment 3 enters: the seed determines that random draw. Fix the seed and you fix the starting noise, which is why the same seed reproduces the same image down to the pixel. Change it and you hand the model a different lump of clay to start from, so it carves a different — but still prompt-faithful — result. Everything you saw in the seed experiment traces back to this one tensor.

04
latent space · Flux VAE + patchify

Compressing the Canvas

Here’s a subtlety that explains several experiments at once: the model does not work on full-resolution pixels, which would be ruinously expensive. It works in a compressed latent space defined by a Variational Autoencoder (the Flux VAE), which represents an image at 1/8th the width and height using 16 channels of information. The noise from Step 3 is sampled directly in this compressed space. (In pure text-to-image there’s no input picture, so the VAE’s encoder is never used; only its decoder runs, at the very end, in Step 8.) That latent grid is then patchified: cut into a grid of small patches (patch size 2), each becoming one token the transformer can process, just like the text tokens from Step 2. The two compression factors compound: 8× from the VAE and 2× from patchification gives 16× per side, which is the reason every height and width must be divisible by 16 (Experiment 4) and a big part of why pushing far above native ~1 MP can cause tiling (Experiment 5).
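The compression arithmetic is easy to run for any canvas size. The sketch below takes the 8×, 16-channel, and patch-size-2 figures from the paragraph above as given (the appendix suggests confirming them against pipe.vae.config) and reports the latent shape and the number of image tokens; latent_and_tokens is a hypothetical helper, not part of diffusers.

```python
VAE_FACTOR = 8        # spatial downsampling quoted above
LATENT_CHANNELS = 16  # latent channels quoted above
PATCH_SIZE = 2        # patch size quoted above


def latent_and_tokens(width, height):
    """Return the latent shape (channels, h, w) and the image-token count."""
    multiple = VAE_FACTOR * PATCH_SIZE  # 16
    if width % multiple or height % multiple:
        raise ValueError(f"width and height must be divisible by {multiple}")
    lat_w, lat_h = width // VAE_FACTOR, height // VAE_FACTOR
    tokens = (lat_w // PATCH_SIZE) * (lat_h // PATCH_SIZE)
    return (LATENT_CHANNELS, lat_h, lat_w), tokens


for w, h in [(768, 432), (1280, 720), (2048, 1152)]:
    shape, tokens = latent_and_tokens(w, h)
    print(f"{w}x{h}: latent {shape}, {tokens} image tokens")
```

Across the Experiment 5 sweep this goes from 1,296 image tokens at 768×432 to 9,216 at 2048×1152, so the top rung asks the transformer to keep about seven times as many tokens coherent as the bottom one.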

05
backbone · single-stream DiT

One Stream for Words and Pixels

Now both modalities exist as tokens: text tokens from Step 2, image (noise) tokens from Step 4. Z-Image’s backbone is a Diffusion Transformer (DiT), the architecture that replaced the older convolutional U-Net by treating image generation as a sequence-processing problem. Its distinguishing choice is being single-stream. In the dual-stream “MMDiT” design used by SD3 and FLUX, text and image tokens are processed by separate sets of weights that only interact through attention. Z-Image instead concatenates text and image tokens into one unified sequence and runs them through a single tower, so the two modalities interact densely at every layer. The argument, inspired by how decoder-only LLMs scale, is that this is far more parameter-efficient, and it’s how a 6.15B model (30 layers, hidden dimension 3840, 32 attention heads) competes with rivals roughly three to thirteen times its size (Qwen-Image at ~20B, Hunyuan-Image 3.0 at ~80B).

06
positions · 3D Unified RoPE

Teaching the Model Where Things Are

A transformer treats its input as a set of tokens — it has no inherent sense of which token is where. For an image, position is everything (top-left vs. bottom-right is the difference between a coherent picture and noise). Z-Image solves this with 3D Unified RoPE, a positional encoding scheme that tells each token its place: image tokens are positioned across two spatial axes (height and width), while text tokens are laid out along a separate “temporal” axis. With both kinds of tokens sharing one stream (Step 5), this unified positioning is what keeps words and pixels spatially coherent inside a single sequence — the model always knows what is where.
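To make the idea of three position axes concrete, the sketch below assigns a (t, h, w) index triple to every token in a tiny joint sequence. The specific layout is an assumption chosen for illustration: text tokens step along the temporal axis, and all image tokens share a single temporal index placed right after the text. The function is hypothetical and does not reproduce Z-Image's actual indexing code.

```python
def position_ids(num_text_tokens, grid_h, grid_w):
    """Assign a (t, h, w) index triple to every token in the joint sequence.

    Assumed layout, for illustration only:
      - text token i      -> (i, 0, 0): advances along the "temporal" axis
      - image token (r,c) -> (T, r, c): one shared temporal index T, placed
        right after the text, plus its row/column on the patch grid
    """
    text = [(i, 0, 0) for i in range(num_text_tokens)]
    t_image = num_text_tokens
    image = [(t_image, r, c) for r in range(grid_h) for c in range(grid_w)]
    return text + image


ids = position_ids(num_text_tokens=3, grid_h=2, grid_w=2)
for triple in ids:
    print(triple)
```

Every token ends up with a distinct triple, so rotary embeddings computed per axis can tell a word's place in the prompt apart from a patch's place on the canvas, even though both live in one sequence.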

07
sampling · flow matching · 8 NFEs

Eight Passes to Clear the Noise

This is the heart of the model. The transformer’s job, on each pass, is to look at the current noisy latent (plus the text conditioning) and predict how to move it a step closer to a clean image. Z-Image uses flow matching (specifically the rectified-flow idea): instead of the meandering, curved denoising path of older diffusion models, it learns an almost straight-line route from noise to image. Straight routes can be traveled in a few big strides; curved ones need many tiny ones — which is the whole reason modern models can generate in so few steps. The FlowMatchEulerDiscreteScheduler walks that route, taking one stride per step.
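The claim that straight routes need fewer strides can be checked with a toy Euler integrator. In the sketch below the “straight” velocity is constant (pointing from a noise value to an image value), while the “curved” one depends on the current position. Both are made-up scalar examples rather than anything the model computes, and a learned velocity field is only approximately straight.

```python
import math


def euler(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with equal Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x += velocity(x, i * dt) * dt
    return x


NOISE, IMAGE = 3.0, -1.0  # toy scalar endpoints standing in for latents


def straight(x, t):
    """Constant velocity: the path from NOISE to IMAGE is a straight line."""
    return IMAGE - NOISE


def curved(x, t):
    """Position-dependent velocity: the exact path bends (exponential decay)."""
    return -2.0 * x


curved_exact = NOISE * math.exp(-2.0)  # closed-form endpoint of the curved ODE

for n in (1, 8, 50):
    s_err = abs(euler(straight, NOISE, n) - IMAGE)
    c_err = abs(euler(curved, NOISE, n) - curved_exact)
    print(f"{n:>2} steps  straight error {s_err:.4f}  curved error {c_err:.4f}")
```

On the straight path a single stride already lands on the target, and extra steps add nothing. On the curved path the error only shrinks as the step count rises. That is the trade-off flow matching is designed to avoid.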

Now the two halves of the “Turbo” story come together. Why eight steps: the base Z-Image needs ~100 of these passes. Turbo was created through distillation, a process (Tongyi-MAI calls it Decoupled DMD, refined with reinforcement learning) that trains the fast model to reproduce the slow model’s results in a handful of strides. The target was 8 passes, which is why num_inference_steps=9 is recommended (it works out to 8 actual transformer forward passes) and why Experiment 1 showed nothing visible to gain from going beyond it. Why guidance is zero: the distillation uses classifier-free guidance as its training engine and effectively bakes the guidance effect into the weights. So at inference there’s no second, unconditional pass to extrapolate from; guidance is already “inside” the model. That’s why the native setting is guidance_scale=0, and why turning it up in Experiment 2 made things worse.

08
decode · Flux VAE decoder

From Latent Back to Pixels

Finally, the VAE decoder (the other half of the autoencoder from Step 4) takes that clean latent and expands it back into a full-resolution RGB image, undoing the 8× spatial compression and turning 16 abstract channels into the red, green, and blue pixels you actually see. This is the moment the latent becomes a picture. The result is the image you started Part 1 looking at.

Eight steps, start to finish: your words became tokens (1) and were read by a language model (2); a noisy canvas was drawn from your seed (3) and compressed into tokens (4); words and pixels joined one stream (5) and were told where they sit (6); eight denoising passes cleared the noise (7); and the VAE decoded the result into pixels (8). That’s the whole journey — and now every knob in Part 2 has a place to hang.

Appendix — Facts, Sources, and Things to Double-Check

A few numbers in this post are worth verifying against your own run or the primary sources, partly because the model is very new and partly because some details are inferred from code rather than stated in prose.

Solid, citable facts

6.15B parameters; single-stream DiT (30 layers, hidden 3840, 32 heads); Qwen3-4B text encoder; Flux VAE; Decoupled DMD + RL distillation; 8 NFEs vs. ~100 for the base; Elo 1025 — all from the Z-Image technical report (arXiv 2511.22699) and the official GitHub repo / HuggingFace model cards (Tongyi-MAI/Z-Image-Turbo). GGUF file sizes (BF16 12.3 GB / Q8 7.22 GB / Q4 5.02 GB / Q2 3.64 GB) are verified on the unsloth/Z-Image-Turbo-GGUF file tree. “#1 open-weights model” is from the independent Artificial Analysis leaderboard; present the Elo 1025 / AI Arena figure as vendor-reported, since that arena is Alibaba’s own.

Verify in your own environment

  • That num_inference_steps=9 produces 8 scheduler steps — inspect pipe.scheduler.timesteps.
  • The VAE’s exact downsampling factor, channel count, and the patch size — print pipe.vae.config (the 8× / 16-channel / patch-2 figures come from the Flux VAE docs and Z-Image’s code, not the paper’s prose).
  • Which Qwen3 hidden layer supplies the conditioning (reported as the penultimate hidden state).
  • Your exact diffusers version (~0.39.0.dev0) — GGUF support for this model is recent; note it so readers can reproduce.

Keep the two models separate

Every “guidance 0, 8 steps” recommendation here is for Turbo. If you ever write up the base Z-Image, the guidance and step advice flips entirely (≈ guidance 3–5, 28–50 steps, negative prompts useful) — don’t let the two sets of defaults blur together.