Getting Started with Z-Image: A Comprehensive Guide to Installation and Usage

In the world of generative AI, having the right tool for the job is crucial. Sometimes you need blazing speed for real-time applications; other times, you need deep controllability for fine art or model training.

Enter Z-Image, a powerful 6-billion-parameter family of image generation models from Tongyi-MAI. Whether you are a developer integrating API endpoints or an artist looking for a new open-source foundation, this guide will walk you through what Z-Image is and, most importantly, how to get it running on your local machine.

What is Z-Image?

Z-Image (造相) is an efficient image generation foundation model built on a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. Unlike traditional dual-stream models, Z-Image processes text and visual tokens as a single unified stream, which improves parameter efficiency.

The family consists of distinct variants tailored for specific needs:

  • Z-Image-Turbo: The speed specialist. It generates high-quality images in just 8 steps (sub-second on H800 GPUs) and fits on 16 GB consumer cards. It excels at photorealism and bilingual text rendering but, as a distilled model, lacks fine-grained controls such as negative prompting.
  • Z-Image (Foundation): The creative powerhouse. This is the non-distilled version that supports Full CFG (Classifier-Free Guidance) and Negative Prompting. It is slower (28-50 steps) but offers higher diversity and is the ideal base for training LoRAs or ControlNets.

Installation Guide

To start generating images, you need to set up your Python environment. Z-Image is fully integrated with the Hugging Face ecosystem, making installation straightforward.

Prerequisites

Ensure you have a Python environment ready (we recommend Python 3.10+) and a GPU with CUDA support.

Step 1: Install Dependencies

You will need the latest version of the diffusers library to support the S3-DiT architecture. We also recommend installing huggingface_hub for efficient model downloading.

Open your terminal and run:

# Install the latest diffusers from source to ensure Z-Image support
pip install git+https://github.com/huggingface/diffusers

# Install other necessary libraries
pip install -U huggingface_hub torch transformers accelerate
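
Once the packages are installed, a quick optional sanity check can confirm that PyTorch sees your GPU and that your diffusers build is recent enough to expose the ZImagePipeline class used throughout this guide:

import torch
from diffusers import ZImagePipeline  # raises ImportError if your diffusers build is too old

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")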

Step 2: Download the Model

You can download the model weights directly. For faster download speeds, especially for large files, you can enable the high-performance transfer mode:

HF_XET_HIGH_PERFORMANCE=1 hf download Tongyi-MAI/Z-Image

Alternatively, the diffusers pipeline will automatically handle the download when you run the code for the first time.
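
If you would rather script the download, the huggingface_hub Python API provides the same functionality. A minimal sketch using snapshot_download is shown below; the local_dir path is only an example and can be omitted to use the default Hugging Face cache.

from huggingface_hub import snapshot_download

# Downloads (or resumes) all files from the model repository.
# local_dir is optional; leave it out to keep files in the default cache.
snapshot_download(
    repo_id="Tongyi-MAI/Z-Image",
    local_dir="./Z-Image",  # example path; adjust as needed
)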


How to Use Z-Image

Below are usage examples for both the Turbo (speed) and Foundation (control) variants.

Scenario A: High-Speed Generation with Z-Image-Turbo

Use this workflow if you need fast results, photorealism, or accurate text rendering. Note that guidance_scale should be set to 0.0 as this is a distilled model.

Python Code:

import torch
from diffusers import ZImagePipeline

# 1. Load the Turbo Pipeline
# We use bfloat16 for the best balance of speed and precision on modern GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# 2. Define your prompt
# Z-Image-Turbo handles complex bilingual descriptions well
prompt = "Young Chinese woman in red Hanfu, intricate embroidery... [Add your details here]"

# 3. Generate
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # Effectively 8 DiT forward passes
    guidance_scale=0.0,     # CRITICAL: Keep at 0 for Turbo
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("z_image_turbo_output.png")

Optimization Tip: You can enable Flash Attention for even faster inference by adding pipe.transformer.set_attention_backend("flash") after loading the pipeline.
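
For reference, here is a minimal sketch of where that call fits; it assumes the flash-attn package is installed in your environment (otherwise keep the default attention backend):

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Switch the transformer's attention implementation to FlashAttention.
# Requires the flash-attn package to be installed.
pipe.transformer.set_attention_backend("flash")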

Scenario B: Creative Control with Z-Image (Foundation)

Use this workflow if you need to use Negative Prompts, adjust the CFG Scale for style intensity, or require high variability between seeds.

Recommended Parameters:

  • Resolution: 512×512 up to 2048×2048
  • Inference Steps: 28 – 50
  • Guidance Scale: 3.0 – 5.0

Python Code:

import torch
from diffusers import ZImagePipeline

# 1. Load the Foundation Pipeline
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# 2. Define Prompts
# Notice we can now use a Negative Prompt to filter out unwanted elements
prompt = "Two young Asian women standing close together, neutral grey texture background..."
negative_prompt = "blurry, low quality, distorted, watermark, text, artifacts"

# 3. Generate with Control
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt, # Powerful control feature
    height=1280,
    width=720,
    cfg_normalization=False, # False for style, True for realism
    num_inference_steps=50,  # Higher steps for maximum quality
    guidance_scale=4.0,      # Adjusts how strictly the model follows the prompt
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("z_image_foundation_output.png")

Community Integrations

If you prefer not to write Python code, the community has already integrated Z-Image into several popular tools:

  • ComfyUI: Use the ZImageLatent nodes for a visual node-based workflow.
  • stable-diffusion.cpp: Run Z-Image on devices with as little as 4GB VRAM using this C++ implementation.
  • DiffSynth-Studio: A great choice if you are interested in training your own LoRA adapters on top of Z-Image.

Conclusion

Z-Image offers a flexible entry point into high-fidelity image generation. Whether you choose the Turbo model for its sub-second speed or the Foundation model for its deep artistic control, the shared S3-DiT architecture delivers strong image quality from a compact 6-billion-parameter footprint.

For more details, technical reports, and community showcases, be sure to visit the Official GitHub Repository and the Hugging Face Model Card.