Getting Started with Z-Image: A Comprehensive Guide to Installation and Usage
In the world of generative AI, having the right tool for the job is crucial. Sometimes you need blazing speed for real-time applications; other times, you need deep controllability for fine art or model training.
Enter Z-Image, a powerful 6-billion-parameter family of image generation models from Tongyi-MAI. Whether you are a developer integrating API endpoints or an artist looking for a new open-source foundation, this guide will walk you through what Z-Image is and, most importantly, how to get it running on your local machine.
What is Z-Image?
Z-Image (造相) is an efficient image generation foundation model built on a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. Unlike traditional dual-stream models, Z-Image processes text and visual tokens as a unified stream, maximizing parameter efficiency.
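To make the single-stream idea concrete, here is a toy PyTorch sketch (purely illustrative, not the actual Z-Image implementation): text tokens and image latent tokens are concatenated into one sequence and processed by the same transformer blocks, rather than flowing through separate text and image branches.
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    # Toy illustration: one set of weights attends over text + image tokens jointly
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens):
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h)[0]
        return tokens + self.mlp(self.norm2(tokens))

# Text and image tokens share one sequence and one parameter set
text_tokens = torch.randn(1, 77, 512)    # e.g. prompt embeddings
image_tokens = torch.randn(1, 256, 512)  # e.g. noised latent patches
unified_stream = torch.cat([text_tokens, image_tokens], dim=1)
print(SingleStreamBlock(512)(unified_stream).shape)  # torch.Size([1, 333, 512])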
The family consists of distinct variants tailored for specific needs:
- Z-Image-Turbo: The speed specialist. It generates high-quality images in just 8 steps (sub-second on H800 GPUs) and fits on 16GB consumer cards. It excels at photorealism and bilingual text rendering but, as a distilled model, gives up some fine-grained controls such as negative prompting.
- Z-Image (Foundation): The creative powerhouse. This is the non-distilled version that supports Full CFG (Classifier-Free Guidance) and Negative Prompting. It is slower (28-50 steps) but offers higher diversity and is the ideal base for training LoRAs or ControlNets.
Installation Guide
To start generating images, you need to set up your Python environment. Z-Image is fully integrated with the Hugging Face ecosystem, making installation straightforward.
Prerequisites
Ensure you have a Python environment ready (we recommend Python 3.10+) and a GPU with CUDA support.
Step 1: Install Dependencies
You will need the latest version of the diffusers library to support the S3-DiT architecture. We also recommend installing huggingface_hub for efficient model downloading.
Open your terminal and run:
# Install the latest diffusers from source to ensure Z-Image support
pip install git+https://github.com/huggingface/diffusers
# Install other necessary libraries
pip install -U huggingface_hub torch transformers accelerate
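Before downloading several gigabytes of weights, you can quickly confirm the environment is ready; the import below will fail if your diffusers build is too old to include the Z-Image pipeline:
import torch
import diffusers
from diffusers import ZImagePipeline  # fails on older diffusers releases

print("diffusers:", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))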
Step 2: Download the Model
You can download the model weights directly with the Hugging Face CLI (use Tongyi-MAI/Z-Image-Turbo as the repository ID for the Turbo variant). For faster download speeds on large files, you can enable the Xet high-performance transfer mode:
HF_XET_HIGH_PERFORMANCE=1 hf download Tongyi-MAI/Z-Image
Alternatively, the diffusers pipeline will automatically handle the download when you run the code for the first time.
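If you would rather script the download than use the CLI, huggingface_hub exposes the same functionality from Python:
from huggingface_hub import snapshot_download

# Downloads the Foundation weights; swap in "Tongyi-MAI/Z-Image-Turbo" for the Turbo variant
local_dir = snapshot_download(repo_id="Tongyi-MAI/Z-Image")
print("Model downloaded to:", local_dir)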
How to Use Z-Image
Below are usage examples for both the Turbo (speed) and Foundation (control) variants.
Scenario A: High-Speed Generation with Z-Image-Turbo
Use this workflow if you need fast results, photorealism, or accurate text rendering. Note that guidance_scale should be set to 0.0 as this is a distilled model.
Python Code:
import torch
from diffusers import ZImagePipeline
# 1. Load the Turbo Pipeline
# We use bfloat16 for the best balance of speed and precision on modern GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")
# 2. Define your prompt
# Z-Image-Turbo handles complex bilingual descriptions well
prompt = "Young Chinese woman in red Hanfu, intricate embroidery... [Add your details here]"
# 3. Generate
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # Effectively 8 DiT forward passes
    guidance_scale=0.0,     # CRITICAL: Keep at 0 for Turbo
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_output.png")
Optimization Tip: You can enable Flash Attention for even faster inference by adding pipe.transformer.set_attention_backend("flash") after loading the pipeline.
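In context, that tip looks like the snippet below. It assumes the flash-attn package is installed; the commented-out line is an optional, standard diffusers memory saver for cards near their VRAM limit and replaces the pipe.to("cuda") call if you use it.
# After loading the pipeline:
pipe.transformer.set_attention_backend("flash")  # requires the flash-attn package

# Optional: trade a little speed for VRAM on smaller cards (use instead of pipe.to("cuda"))
# pipe.enable_model_cpu_offload()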
Scenario B: Creative Control with Z-Image (Foundation)
Use this workflow if you need to use Negative Prompts, adjust the CFG Scale for style intensity, or require high variability between seeds.
Recommended Parameters:
- Resolution: 512×512 up to 2048×2048
- Inference Steps: 28 – 50
- Guidance Scale: 3.0 – 5.0
Python Code:
import torch
from diffusers import ZImagePipeline
# 1. Load the Foundation Pipeline
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")
# 2. Define Prompts
# Notice we can now use a Negative Prompt to filter out unwanted elements
prompt = "Two young Asian women standing close together, neutral grey texture background..."
negative_prompt = "blurry, low quality, distorted, watermark, text, artifacts"
# 3. Generate with Control
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,  # Powerful control feature
    height=1280,
    width=720,
    cfg_normalization=False,  # False for style, True for realism
    num_inference_steps=50,   # Higher steps for maximum quality
    guidance_scale=4.0,       # Adjusts how strictly the model follows the prompt
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_foundation_output.png")
Community Integrations
If you prefer not to write Python code, the community has already integrated Z-Image into several popular tools:
- ComfyUI: Use the ZImageLatent nodes for a visual node-based workflow.
- stable-diffusion.cpp: Run Z-Image on devices with as little as 4GB VRAM using this C++ implementation.
- DiffSynth-Studio: A great choice if you are interested in training your own LoRA adapters on top of Z-Image.
Conclusion
Z-Image offers a flexible entry point into high-fidelity image generation. Whether you choose the Turbo model for its sub-second speed or the Foundation model for its deep artistic control, the S3-DiT architecture delivers strong quality and efficiency for a 6-billion-parameter model.
For more details, technical reports, and community showcases, be sure to visit the Official GitHub Repository and the Hugging Face Model Card.
