Text-to-Image / A Diffusion Field Guide
Understanding Z-Image-Turbo with Experiments
How a 6-billion-parameter open model paints a photo in eight passes — and what every generation knob actually does to the result.
You want to generate a professional-looking photo like this — for free, an unlimited number of times, no rate limits, no “you have 3 credits left,” nothing?
Z-Image-Turbo is your friend.
Generated locally on a single 24 GB GPU · Z-Image-Turbo · 9 steps · guidance 0 · seed 42 · Q8 GGUF
There was no studio behind that image. No photographer, no softbox, no stock license, no monthly subscription. It came out of a single rented GPU in roughly 10 seconds — and you could make ten thousand more for the price of the electricity.
So what is this thing, and where does it sit?
Turning a sentence into an image — “text-to-image” generation — is one of the most competitive areas in AI right now. Dozens of capable models have appeared in the last two years, and they all do the same job: you type a description, they produce a matching picture. What really separates them is who controls them and what it takes to run them. Some live inside paid apps and APIs run by large labs. Others ship as open files you can download, inspect, and run on your own machine.
Z-Image is one of the open ones — a 6.15-billion-parameter model from Alibaba’s Tongyi-MAI lab, released in November 2025 under a permissive Apache 2.0 license. The version we’ll use throughout, Z-Image-Turbo, is its speed-optimized variant. To see why it’s worth a close look, let’s place it next to the field:
| Model | Maker | Availability | Parameters | Run it yourself? |
|---|---|---|---|---|
| Z-Image-Turbo | Alibaba · Tongyi-MAI | Open — Apache 2.0 | 6.15B | Yes — under 16 GB VRAM |
| Stable Diffusion 3.5 Large | Stability AI | Open — community license | ~8B | Yes — consumer GPU |
| FLUX.1 [dev] | Black Forest Labs | Open — non-commercial | 12B | Hard — high-end GPU |
| Qwen-Image | Alibaba · Tongyi | Open — Apache 2.0 | ~20B | Very hard |
| Hunyuan-Image 3.0 | Tencent | Open | ~80B (MoE) | No — datacenter-class |
| Seedream 4.0 | ByteDance | Closed — API / app | Undisclosed | No — hosted only |
| Imagen 4 | Google | Closed — API / app | Undisclosed | No — hosted only |
| Gemini 2.5 Flash Image | Google | Closed — API / app | Undisclosed | No — hosted only |
| Midjourney v7 | Midjourney | Closed — app | Undisclosed | No — hosted only |
Snapshot as of late 2025. Closed-model sizes are not disclosed by their makers; the open-model figures are approximate and worth re-checking against each model’s card.
A few things jump out. Most of the best-known names are closed — usable only through someone else’s service, with no public size and no way to run them yourself. Among the open models, the trend has been toward enormous parameter counts — 20B to 80B — meaning expensive hardware and slow generation. Z-Image-Turbo is the unusual corner of that table: open and permissively licensed, small enough for a single consumer GPU, fast, and still competitive on quality. At release it scored an Elo of 1025 on Alibaba’s own AI Arena — 4th overall, 1st among open models — and was independently rated the #1 open-weights model on the Artificial Analysis leaderboard. Small, fast, free, and near the top is a rare combination — and it’s exactly what lets you run it without limits.
One more thing before we start, because it shapes every experiment that follows. Z-Image comes in two versions: the original base model and the Turbo model we’re using. Base builds an image over roughly 100 refinement passes; Turbo is a distilled, compressed-for-speed copy that reaches comparable quality in just 8. That speedup ships with a few unusual default settings, and getting them right matters — so we won’t take them on faith. Part 2 introduces each from scratch and shows, with pictures, exactly what it does.
Ten more generations from the same model — photoreal mecha, anime, studio portrait, poster typography, cinematic sci-fi — each from a short prompt, on a single GPU.
Here’s the plan. Part 2 hands you the code, then walks knob by knob — steps, guidance, seed, aspect ratio, resolution, model precision — showing what each does to the image. Part 3 opens the hood and traces, in exactly eight steps, the path from prompt to picture.
Part 2: The code and the knobs
Two pieces of software do the work. First, diffusers — Hugging
Face’s open-source library that wraps a diffusion model into a ready-to-run pipeline: it downloads the
weights, wires up the components (text encoder, transformer, VAE, scheduler), and runs the denoising loop
for you, so an image is a few lines of code rather than a re-implementation of the math. Second, the model
itself, which we load in quantized form.
Quantized simply means the model’s weights are stored at lower numerical precision (fewer bits per number) to make the file smaller and lighter on memory. Why start here instead of full precision? Z-Image-Turbo at full bfloat16 is about 12.3 GB, while the 8-bit (Q8) version is about 7.2 GB with no quality loss you can see, which leaves comfortable headroom on a 24 GB GPU for the text encoder, the VAE, and the working memory each image needs. It’s the most practical way to just get going; later, in Experiment 6, we push the quantization much harder to find where quality finally breaks. Here’s the one-time install:
# pip install -U git+https://github.com/huggingface/diffusers
# pip install -U gguf accelerate transformers spandrel
And here’s the load. We pull the transformer down as a single GGUF file — the single-file format these quantized weights ship in — and assemble the full pipeline around it:
import torch
from huggingface_hub import hf_hub_download
from diffusers import ZImagePipeline, ZImageTransformer2DModel, GGUFQuantizationConfig

DTYPE = torch.bfloat16

# 1) Load the quantized transformer (here: Q8) as a single file
gguf = hf_hub_download("unsloth/Z-Image-Turbo-GGUF", "z-image-turbo-Q8_0.gguf")
transformer = ZImageTransformer2DModel.from_single_file(
    gguf,
    quantization_config=GGUFQuantizationConfig(compute_dtype=DTYPE),
    torch_dtype=DTYPE,
)

# 2) Assemble the full pipeline around that transformer
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", transformer=transformer, torch_dtype=DTYPE
)
pipe.to("cuda")
When the pipeline prints itself, you can see the five parts we’ll meet again in Part 3 — a
FlowMatchEulerDiscreteScheduler, a Qwen3Model text encoder with a
Qwen2Tokenizer, the ZImageTransformer2DModel, and an AutoencoderKL
VAE. Generating an image is one call:
image = pipe(
    prompt=prompt,
    height=1280, width=720,
    num_inference_steps=9,  # 8 actual DiT forward passes
    guidance_scale=0.0,     # Turbo bakes guidance in; keep this at 0
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
Every experiment below uses the same fixed prompt — a black-helmeted robot, written with explicit subject, clothing, and lighting sections — so that anything that changes in the output is caused by the parameter under test and nothing else. This is the discipline that makes the comparisons fair: hold everything constant, perturb one axis, attribute the change.
prompt = """
A sleek black helmeted robot shown in a three-quarter profile against a solid red background
Subject: A highly detailed mechanical head featuring exposed wiring and internal gears beneath the glossy shell of the faceplate, with intricate metallic textures visible around the neck joint.
Clothing: Wearing a high-collared dark tactical jacket made of matte fabric with reinforced shoulder pads that have subtle mesh detailing and zippered accents along the seams; an illuminated red logo on the left shoulder reads 'Libel' in stylized font.
Lighting: Dramatic spotlight highlighting the curves of the helmet while casting deep shadows within its mechanical core, creating high contrast between reflective surfaces and dark recesses.
"""
What follows are six experiments. For each, I explain what the lever is and why it matters first, then show the sweep and what to look for.
Experiment 1: Steps
The idea. A diffusion model doesn’t paint an image in one shot; it starts from random noise and removes a little of it on each step, refining its guess of the final picture as it goes. Too few steps and the model hasn’t finished cleaning up the noise — the image is blurry, under-formed, or muddy. More steps reduce that error, but with sharply diminishing returns: past a point, extra steps barely change anything and just cost you time. The entire purpose of a “Turbo” model is to push the “good enough” line down to a tiny number of steps.
The sweep. num_inference_steps ∈ {1, 3, 6, 9, 12, 15}, guidance fixed at
0.0, seed fixed.
import os

P, H, W, SEED = 183, 1280, 720, 42          # P is only a label used in the filenames
os.makedirs("EXPERIMENTS", exist_ok=True)   # image.save() does not create folders

for S in [1, 3, 6, 9, 12, 15]:
    image = pipe(prompt=prompt, height=H, width=W,
                 num_inference_steps=S, guidance_scale=0.0,
                 generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
    image.save(f"EXPERIMENTS/P{P}_STEP{S}_SEED{SEED}_{W}x{H}.png")
Same prompt, same seed; only the step count changes.
What to look for. The jump from 1 to 3 steps is the dramatic one. At 1 step the image is a soft, smeared blob — the helmet is a vague silhouette and the “Libel” patch is an unreadable red smudge. By 3 steps the model has already resolved almost everything: the glossy shell, the exposed lens-and-gear mechanism at the jaw, the high-collared tactical jacket, and the legible “Libel” logo are all there. From 6 through 15 the differences are marginal — small shifts in how the internal mechanical detail and specular highlights render, closer to re-rolls of the same image than to genuine improvements. In other words, Turbo is essentially converged by about 3 steps and comfortably so at its 9-step (8-NFE) default; pushing to 12 or 15 buys nothing you can see while costing proportionally more time. This is the inverse of the old “always use 30–50 steps” habit. (Why so few steps suffice is the subject of Part 3, Step 7.)
Experiment 2: Guidance
The idea. Classifier-Free Guidance (CFG) is the standard way to make a diffusion model follow your prompt more strongly. Normally the model produces two predictions on each step — one that listens to your prompt and one that ignores it — and the guidance scale controls how far it extrapolates away from the ignore-the-prompt version and toward the prompt. Crank it up and the image obeys the prompt harder, but loses variety and, at high values, starts to look oversaturated and over-cooked. It also normally doubles the cost of every step, since it needs both predictions.
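As a rough illustration of that extrapolation, here is the textbook CFG combination applied to toy numbers. The helper name `cfg_combine` and the sample values are made up for this sketch; real pipelines apply the same arithmetic to full prediction tensors, and libraries differ on where they place the "off" point of the scale, so treat the exact convention as an assumption rather than as this pipeline's implementation.

```python
def cfg_combine(uncond, cond, scale):
    """Textbook classifier-free guidance: start at the unconditional
    prediction and move `scale` times the gap toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.10, -0.20, 0.05]  # toy "ignore the prompt" prediction
cond = [0.30, -0.10, 0.25]    # toy "follow the prompt" prediction

for scale in (0.0, 1.0, 3.0, 7.0):
    combined = cfg_combine(uncond, cond, scale)
    print(scale, [round(v, 2) for v in combined])
```

Under this convention a scale of 1.0 reproduces the prompt-following prediction and larger values overshoot it, which is where the over-cooked look comes from. The `guidance_scale=0` default used with Turbo appears to sit on a different zero point, so these toy scales should not be mapped one-to-one onto the pipeline's dial.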
The Turbo twist. Here’s where distillation changes the meaning of the knob. Turbo has CFG
baked into its weights — it was built assuming guidance, so at run time it does a
single prompt-following pass and you leave guidance_scale=0. Turning the dial up doesn’t
sharpen prompt adherence the way it would on a base model; it reintroduces an operation the model was never
meant to run at inference, and quality degrades.
The sweep. guidance_scale ∈ {0.0, 0.25, 0.5, 0.75, 1.0} at 9 steps.
P, H, W, S, SEED = 183, 1280, 720, 9, 42
for G in [0.0, 0.25, 0.5, 0.75, 1.0]:
    image = pipe(prompt=prompt, height=H, width=W,
                 num_inference_steps=S, guidance_scale=G,
                 generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
    image.save(f"EXPERIMENTS/P{P}_STEP{S}_SEED{SEED}_{W}x{H}_G{G}.png")
Same prompt, same seed, 9 steps; only the guidance scale changes.
What to look for. On a base model, raising guidance from 1 toward 7 visibly tightens prompt adherence. Here it does something different — and milder, since 0–1 is a gentle range. The cleanest, most neutral exposure is at 0; as guidance climbs the image steadily gains contrast and saturation — the blacks deepen, the red rim-light on the neck mechanism intensifies, and by 0.75–1.0 the shadows in the jacket start to crush and the reds go a touch over-cooked. None of this is better prompt-following; it’s the classic over-saturation drift of guidance, just early. The takeaway holds: on Turbo, guidance is a contrast knob, not a quality knob, so the right setting is the one the model was distilled for — 0. This is a concrete demonstration of why distillation rewrites a parameter you thought you understood.
Experiment 3: Seed
The idea. Generation starts from a draw of random noise, and the seed is what makes that draw repeatable. Same prompt + same seed + same settings ⇒ the exact same image, every time — which is what lets you change one other thing and trust the comparison. Different seeds give different starting noise, so the model walks a different path to a different but still prompt-consistent image: the pose, framing, and exact details shift while the prompt’s content stays satisfied. (One caveat for a heavily distilled few-step model: it commits to a composition very fast, so outputs across seeds can feel a little more alike than in a 50-step model.)
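The repeatability claim is easy to check in miniature. In the sketch below, Python's `random` module stands in for `torch.Generator` purely so the snippet runs without a GPU; the helper `starting_noise` is hypothetical and only mimics the idea of a seeded Gaussian draw, not the pipeline's actual latent tensor.

```python
import random


def starting_noise(seed, n=6):
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_a = starting_noise(42)
same_b = starting_noise(42)
other = starting_noise(7)

print(same_a == same_b)  # same seed: the two draws are identical
print(same_a == other)   # different seed: the draws differ
```

The same logic is why every sweep in this post builds a fresh generator with a fixed seed inside the loop: each run starts from an identical draw, so only the parameter under test changes.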
The sweep. Eight seeds, run in landscape (1280×720) and again in portrait (720×1280).
SEEDS = [0, 7, 42, 123, 777, 2024, 88888, 1234567]
S, G = 9, 0.0
for W, H in [(1280, 720), (720, 1280)]:   # landscape, then portrait
    for SEED in SEEDS:
        image = pipe(prompt=prompt, height=H, width=W,
                     num_inference_steps=S, guidance_scale=G,
                     generator=torch.Generator("cuda").manual_seed(SEED)).images[0]
        image.save(f"EXPERIMENTS/SEED{SEED}_{W}x{H}.png")
Figure: the same prompt and settings across eight seeds, in two rows: portrait (720×1280) and landscape (1280×720).
What to look for. Every seed obeys the prompt — black helmeted robot, exposed jaw mechanism, matte tactical jacket, red backdrop, the “Libel” patch — yet no two are the same picture. The seed reshuffles everything the prompt leaves open: which way the head faces, whether the shell reads glossy or matte, how much of the internal gearwork is exposed, the cut of the jacket and where the “Libel” patches land, even the mood of the lighting. That spread is the model’s diversity, and it’s healthy here — despite committing to a composition in only a handful of steps, the eight seeds come out clearly distinct rather than as minor variations. Compare the two rows and you’ll also see that a seed is not portable across shapes: portrait seed 42 and landscape seed 42 are different pictures, not one image cropped two ways — the wider canvas pulls in more of the shoulders and torso and recomposes the subject around it.
Experiment 4: Aspect ratio
The idea. A model learns from a distribution of image shapes, and it’s most reliable near shapes it saw often. Aspect ratio is really a lever on composition: a tall canvas nudges the model toward a single vertical subject, a wide one toward environmental or panoramic layouts. Two practical constraints bound your choices — your total pixel budget (compute and VRAM), and the rule that both dimensions must be divisible by 16 (a consequence of how the latent is compressed and tokenized; see Part 3, Step 4).
The sweep. Hold total pixels near 1 megapixel while varying the ratio across 11 shapes, with each dimension rounded to a multiple of 16.
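One way to derive such shapes is sketched below. The helper `dims_for_ratio`, the 1024×1024 pixel target, and the nearest-multiple rounding are all assumptions made for illustration, and the loop covers only the seven ratios named in the discussion that follows, not the full set of eleven.

```python
import math


def dims_for_ratio(ratio_w, ratio_h, target_pixels=1024 * 1024, multiple=16):
    """Pick a width and height close to target_pixels for the given ratio,
    rounding each side to the nearest multiple of `multiple`."""
    height = math.sqrt(target_pixels * ratio_h / ratio_w)
    width = height * ratio_w / ratio_h

    def snap(value):
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for rw, rh in [(9, 21), (9, 16), (2, 3), (1, 1), (3, 2), (16, 9), (21, 9)]:
    w, h = dims_for_ratio(rw, rh)
    print(f"{rw}:{rh} -> {w}x{h} ({w * h / 1e6:.2f} MP)")
```

Rounding to a multiple of 16 shifts the ratio and the pixel count slightly, so the resulting shapes are close to, not exactly, the nominal ratio and budget.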
All eleven ratios at their true proportions, each holding a ~1 MP budget.
What to look for. Every shape yields a coherent, well-composed image — the model recomposes the subject to fit rather than stretching or squishing it. Portrait ratios (2:3, 9:16) crop in close on the head and collar; landscape ratios (3:2, 16:9) pull back to reveal more of the jacket and shoulders; the square sits in between. The extremes are the real test: at 21:9 the head sits in a wide field with generous negative space, and at 9:21 the figure stacks tall and narrow — both stay coherent, with no duplicated heads or tiling. (Because each ratio starts from a different noise tensor, the pose and detailing shift from one to the next, exactly like the seed sweep — these are different images, not one image letterboxed.)
Experiment 5: Resolution
The idea. Beyond shape, the absolute number of pixels matters, because every model has a native training resolution — for Z-Image that’s the 1024-class, around 1 megapixel. Generate well below native and there simply aren’t enough latent tokens to carry the structure, so you lose detail and coherence. Generate well above native and you hit the classic diffusion failure: the model duplicates content — two heads, repeated limbs, tiled textures — because the patterns it learned at training size effectively “tile” when asked to fill far more tokens than it ever saw. (This is why the standard professional workflow is generate near native, then upscale — and exactly the kind of upscaling pipeline worth a follow-up post.)
The sweep. Hold 16:9 fixed and scale total resolution from ~0.33 MP up to ~2.36 MP, in both orientations:
BASE_W, BASE_H = 256, 144  # smallest 16:9 box divisible by 16
for n in [3, 4, 5, 6, 7, 8]:  # n=5 is the 1280×720 reference
    W, H = BASE_W * n, BASE_H * n
    image = pipe(prompt=prompt, height=H, width=W,
                 num_inference_steps=9, guidance_scale=0.0,
                 generator=torch.Generator("cuda").manual_seed(42)).images[0]
    image.save(f"EXPERIMENTS/16x9_{W}x{H}.png")
The 16:9 sweep, each frame drawn at a size proportional to its true pixel count — from 0.33 MP up to 2.36 MP.
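The pixel budgets quoted for this sweep can be checked with a few lines of arithmetic. This is only a sanity check on the numbers above; the helper name `sweep_sizes` is hypothetical, and megapixels are counted as millions of pixels.

```python
def sweep_sizes(base_w=256, base_h=144, multipliers=(3, 4, 5, 6, 7, 8)):
    """Return the (width, height) pairs produced by scaling the 16:9 base box."""
    return [(base_w * n, base_h * n) for n in multipliers]


sizes = sweep_sizes()
for w, h in sizes:
    # Both sides stay divisible by 16 because the base box already is.
    assert w % 16 == 0 and h % 16 == 0
    print(f"{w}x{h} = {w * h / 1e6:.2f} MP")
```

The smallest rung works out to 768×432 and the largest to 2048×1152, matching the 0.33 MP and 2.36 MP endpoints quoted for the sweep.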
What to look for. Two things. First, detail density climbs with resolution: at 0.33 MP the forms are all correct but soft, and as the pixel budget grows the fine mechanical work — gear teeth, the lens rings, the jacket’s woven texture, the “Libel” stitching — resolves progressively crisper, sharpest in the 2.36 MP frame at the bottom. Second, and more surprising given the theory above: the predicted high-resolution failure doesn’t appear here. Across the whole sweep, including the top end at roughly 2.4× native, the model keeps a single coherent figure — no second head, no tiled texture, no repeated limbs. So while duplication is the real risk when you push far past native, Z-Image’s comfortable range at 16:9 evidently extends well beyond 1 MP — at least to the ~2.36 MP tested; you’d likely have to go considerably higher to trigger the tiling. And as with the earlier sweeps, each resolution is its own noise tensor, so pose and framing drift from rung to rung — these are different images generated at different scales, not one image upscaled.
Experiment 6: Model precision
The idea. A model’s weights are normally stored at 16-bit precision (here, bfloat16). Quantization stores them at lower precision to shrink the file and the memory footprint. GGUF is a single-file model format that came out of the llama.cpp ecosystem for language models and has been adopted for diffusion transformers; its “Q” levels are roughly bits-per-weight (Q8 ≈ 8 bits, Q4 ≈ 4, Q2 ≈ 2).
For the actual Z-Image-Turbo files, full bf16 is 12.3 GB, while Q8 = 7.22 GB, Q4 = 5.02 GB, and Q2 = 3.64 GB — which is precisely how a 6B-parameter transformer fits comfortably on a 24 GB card with room to spare. (Modern “K-quants” are smarter than uniform rounding: they keep sensitive layers at higher precision and squeeze tolerant ones harder, so even “Q2” here isn’t a flat 2-bit model.)
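Dividing each file size by the parameter count gives a feel for the effective bits per weight. The sketch below assumes 1 GB = 1e9 bytes and that each file holds essentially just the 6.15B transformer weights; both are simplifications, so read the results as rough averages rather than exact format specifications.

```python
PARAMS = 6.15e9  # transformer parameter count quoted in this post

# File sizes in GB, as listed above
file_sizes_gb = {"BF16": 12.3, "Q8": 7.22, "Q4": 5.02, "Q2": 3.64}


def bits_per_weight(size_gb, params=PARAMS):
    """Average stored bits per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


for name, size in file_sizes_gb.items():
    print(f"{name}: {bits_per_weight(size):.1f} bits per weight")
```

On these figures the BF16 file lands at 16 bits per weight, while the "Q2" file averages well above 2 (roughly 4.7), which is consistent with the point that some layers are kept at higher precision.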
The sweep. Re-run the entire experiment suite at Q8, Q4, and Q2 by
changing only the file passed to from_single_file:
# z-image-turbo-Q8_0.gguf → z-image-turbo-Q4_K_M.gguf → z-image-turbo-Q2_K.gguf
Same prompt, seed, and settings; only the quantization changes, across Q8, Q4, and Q2.
What to look for. Compare Q8 and Q4 first: they’re all but indistinguishable (same composition, same detail), yet Q4 is about 30% smaller on disk (5.0 vs 7.2 GB). That’s the headline: 4-bit quantization is close to a free lunch on this model, which is why it’s the popular default. Q2 is the interesting one. It still produces a coherent, detailed image, and the helmet, the neck mechanism, and the jacket weave hold up better than 2-bit’s reputation suggests, but the composition has clearly drifted: the head turns to face you and the framing shifts. So the real cost of aggressive quantization here is less an obvious drop in fidelity than a perturbation of the sampling trajectory: the same seed lands on a different image. Quantization nudges every weight slightly, and below a certain bit-width those tiny errors compound enough to steer the generation elsewhere. Practical read: reach for Q4 freely; treat Q2 as a memory-saver that re-rolls the picture.
Part 3: Under the hood
We’ve been treating the model as a black box that turns a sentence into a picture. Now let’s open it. The journey from your prompt to the final image passes through exactly eight stages — and the seventh of them is the eight-pass denoising loop the model is named for. Here’s the whole path, one step at a time.
The whole path, top to bottom: two inputs — your prompt and the seed — are encoded in parallel, merged into a single token stream, denoised eight times, then decoded into pixels.
Step 1: Tokenize the prompt
A neural network can’t read text; it reads numbers. The first thing that happens to your prompt is
tokenization — a Qwen2Tokenizer splits your sentence into a sequence of
tokens (sub-word chunks) and maps each to an integer ID. Before that, the prompt is wrapped in
a short chat template (the <|im_start|>user … <|im_end|> format), because the
text encoder is a chat-style language model and expects to be addressed like one. The sequence is capped
at 512 tokens, the model’s working budget for how much prompt it will actually read.
(A detail that looks like a bug but isn’t: the encoder is from the Qwen3 family but uses the Qwen2
tokenizer — those two generations share tokenizer tooling.)
Step 2: Encode the text
Those token IDs are fed into a frozen Qwen3-4B language model acting as the text encoder, which turns them into a rich numerical representation — embeddings that capture not just the words but their meaning and relationships. Why an entire 4-billion-parameter LLM, instead of the smaller CLIP or T5 encoders older models used? Because an instruction-tuned LLM brings genuine language understanding: it handles long, compositional prompts, follows instructions, carries real-world knowledge, and — crucially for Z-Image — is strongly bilingual (Chinese and English), which is a big part of why the model is so good at rendering legible text inside images. This single choice is why Z-Image rewards clear, descriptive prompting: there’s a capable reader on the other end of your sentence. (The pipeline reads conditioning from one of the LLM’s near-final hidden layers rather than its very last output — a detail worth confirming in your own environment.)
Step 3: Sample the starting noise
While the text path is running, the image path begins — not from a blank canvas, but from a canvas of pure random noise. A tensor is filled with Gaussian noise, and that noise is the raw material the model will sculpt into a picture. This is exactly where the seed from Experiment 3 enters: the seed determines that random draw. Fix the seed and you fix the starting noise, which is why the same seed reproduces the same image down to the pixel. Change it and you hand the model a different lump of clay to start from, so it carves a different — but still prompt-faithful — result. Everything you saw in the seed experiment traces back to this one tensor.
Step 4: Work in latent space and patchify
Here’s a subtlety that explains several experiments at once: the model does not work on full-resolution pixels — that would be ruinously expensive. It works in a compressed latent space defined by a Variational Autoencoder (the Flux VAE), which represents an image at 1/8th the width and height using 16 channels of information. The noise from Step 3 is sampled directly in this compressed space. (In pure text-to-image there’s no input picture, so the VAE’s encoder is skipped here — it only reappears at the very end, in Step 8, to decode.) That latent grid is then patchified — cut into a grid of small patches (patch size 2), each becoming one token the transformer can process, just like the text tokens from Step 2. The two compression factors compound: 8× from the VAE and 2× from patchification gives 16× total — the reason every height and width must be divisible by 16 (Experiment 4) and a big part of why pushing far above native ~1 MP causes tiling (Experiment 5).
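The compounding is easiest to see by running the numbers for a few sizes. The sketch below uses the figures quoted above (8× VAE compression, 16 latent channels, patch size 2); the helper `image_token_grid` is hypothetical and ignores the batch dimension and any padding the real pipeline may apply.

```python
def image_token_grid(height, width, vae_factor=8, patch=2, channels=16):
    """Return the latent shape and the number of image tokens for a given
    pixel size, using the compression figures quoted in the text."""
    total = vae_factor * patch
    if height % total or width % total:
        raise ValueError(f"height and width must be divisible by {total}")
    lat_h, lat_w = height // vae_factor, width // vae_factor
    tokens = (lat_h // patch) * (lat_w // patch)
    return (channels, lat_h, lat_w), tokens


for h, w in [(1280, 720), (1024, 1024), (1152, 2048)]:
    latent, tokens = image_token_grid(h, w)
    print(f"{w}x{h}: latent {latent}, {tokens} image tokens")
```

A 720×1280 portrait becomes 3,600 image tokens, a 1024×1024 square 4,096, and the 2048×1152 top of the resolution sweep 9,216, which is the sense in which a larger canvas asks the transformer to fill far more tokens than it usually saw in training.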
Step 5: One single-stream transformer
Now both modalities exist as tokens: text tokens from Step 2, image (noise) tokens from Step 4. Z-Image’s backbone is a Diffusion Transformer (DiT), the architecture that replaced the older convolutional U-Net by treating image generation as a sequence-processing problem. Its distinguishing choice is being single-stream. In the dual-stream “MMDiT” design used by SD3 and FLUX, text and image tokens are processed by separate sets of weights that only interact through attention. Z-Image instead concatenates text and image tokens into one unified sequence and runs them through a single tower, so the two modalities interact densely at every layer. The argument, inspired by how decoder-only LLMs scale, is that this is far more parameter-efficient, and it’s how a 6.15B model (30 layers, hidden dimension 3840, 32 attention heads) competes with open rivals roughly three to thirteen times its size (the 20B to 80B models in the table in Part 1).
Step 6: Position the tokens (3D Unified RoPE)
A transformer treats its input as a set of tokens — it has no inherent sense of which token is where. For an image, position is everything (top-left vs. bottom-right is the difference between a coherent picture and noise). Z-Image solves this with 3D Unified RoPE, a positional encoding scheme that tells each token its place: image tokens are positioned across two spatial axes (height and width), while text tokens are laid out along a separate “temporal” axis. With both kinds of tokens sharing one stream (Step 5), this unified positioning is what keeps words and pixels spatially coherent inside a single sequence — the model always knows what is where.
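To make the three-axis idea concrete, here is a toy assignment of position triples to a short mixed sequence. The layout is an assumption for illustration only: the helper `position_ids` is hypothetical, and the exact offsets and axis ordering in the real implementation may differ. The real model then turns these coordinates into rotation angles applied to the attention queries and keys, which this sketch does not attempt.

```python
def position_ids(num_text_tokens, grid_h, grid_w):
    """Assign each token a (t, h, w) triple: text tokens advance along the
    first axis, image tokens share one first-axis slot and vary in h and w."""
    text = [(t, 0, 0) for t in range(num_text_tokens)]
    image = [(num_text_tokens, r, c)
             for r in range(grid_h) for c in range(grid_w)]
    return text + image


ids = position_ids(num_text_tokens=3, grid_h=2, grid_w=2)
for triple in ids:
    print(triple)
```

Every token ends up with a unique coordinate, so text order and image layout can coexist in one sequence without colliding.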
Step 7: Denoise in eight passes
This is the heart of the model. The transformer’s job, on each pass, is to look at the current noisy latent (plus the text conditioning) and predict how to move it a step closer to a clean image. Z-Image uses flow matching (specifically the rectified-flow idea): instead of the meandering, curved denoising path of older diffusion models, it learns an almost straight-line route from noise to image. Straight routes can be traveled in a few big strides; curved ones need many tiny ones, which is the whole reason modern models can generate in so few steps. The FlowMatchEulerDiscreteScheduler walks that route, taking one stride per step.
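The "straight routes need few strides" argument can be checked with a toy Euler integrator. Everything here is a simplification: the three-number "image", the equal-sized strides, the convention that t=1 is pure noise and t=0 is the clean image, and the perfectly constant velocity are all assumptions for illustration, and the real scheduler uses its own timestep schedule with a learned velocity.

```python
def euler_sample(noise, velocity_fn, num_steps):
    """Integrate from t=1 (pure noise) to t=0 (image) in equal Euler strides."""
    x = list(noise)
    dt = 1.0 / num_steps
    t = 1.0
    for _ in range(num_steps):
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
        t -= dt
    return x


target = [0.8, -0.3, 0.5]  # toy "clean image"
noise = [1.7, 0.4, -1.1]   # toy starting noise


def straight_velocity(x, t):
    # On a straight path x_t = (1 - t) * target + t * noise,
    # the velocity dx/dt = noise - target is the same at every t.
    return [n - c for n, c in zip(noise, target)]


for steps in (1, 8, 100):
    result = euler_sample(noise, straight_velocity, steps)
    print(steps, [round(v, 6) for v in result])
```

With an ideal straight path, one stride lands on the target as accurately as a hundred. A learned velocity field is only approximately straight, which is one way to read why a distilled model still takes a handful of strides rather than a single one.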
Now the two halves of the “Turbo” story come together. Why eight steps: the base
Z-Image needs ~100 of these passes. Turbo was created through distillation — a process
(Tongyi-MAI calls it Decoupled DMD, refined with reinforcement learning) that trains the fast
model to reproduce the slow model’s results in a handful of strides. The target was 8
passes, which is why num_inference_steps=9 is recommended (it works out to 8
actual transformer forward passes) and why Experiment 1 showed quality plateauing right around there.
Why guidance is zero: the distillation uses classifier-free guidance as its
training engine and effectively bakes the guidance effect into the weights. So at inference
there’s no second, unconditional pass to extrapolate from — guidance is already “inside” the model.
That’s why the native setting is guidance_scale=0, and why turning it up in Experiment 2
made things worse.
Step 8: Decode the latent to pixels
Finally, the VAE decoder — the other half of the autoencoder from Step 4 — takes that clean latent and expands it back into a full-resolution RGB image, restoring the 8× spatial compression and turning 16 abstract channels into the red, green, and blue pixels you actually see. This is the moment the latent becomes a picture. The result is the image you started Part 1 looking at.
A few numbers in this post are worth verifying against your own run or the primary sources, partly because the model is very new and partly because some details are inferred from code rather than stated in prose.
6.15B parameters; single-stream DiT (30 layers, hidden 3840, 32 heads); Qwen3-4B text
encoder; Flux VAE; Decoupled DMD + RL distillation; 8 NFEs vs. ~100 for the base; Elo 1025 — all from the
Z-Image technical report (arXiv 2511.22699) and the official GitHub repo /
HuggingFace model cards (Tongyi-MAI/Z-Image-Turbo). GGUF file sizes (BF16 12.3 GB /
Q8 7.22 GB / Q4 5.02 GB / Q2 3.64 GB) are verified on the unsloth/Z-Image-Turbo-GGUF file tree.
“#1 open-weights model” is from the independent Artificial Analysis leaderboard; present
the Elo 1025 / AI Arena figure as vendor-reported, since that arena is Alibaba’s
own.
num_inference_steps=9 produces 8 scheduler steps: inspect pipe.scheduler.timesteps to confirm.
The 8× / 16-channel / patch-2 figures come from the Flux VAE docs and Z-Image’s code, not the paper’s prose: inspect pipe.vae.config to confirm.
diffusers version (~0.39.0.dev0): GGUF support for this model is recent; note it so readers can reproduce.
Every “guidance 0, 8 steps” recommendation here is for Turbo. If you ever write up the base Z-Image, the guidance and step advice flips entirely (≈ guidance 3–5, 28–50 steps, negative prompts useful), so don’t let the two sets of defaults blur together.