GLM-Image: Auto-regressive for Dense-knowledge and High-fidelity Image Generation

Figure 1: GLM-Image General Showcase

Figure 2: GLM-Image Dense-Knowledge Showcase
Today we are excited to introduce GLM-Image, the first open-source, industrial-grade discrete auto-regressive image generation model. GLM-Image adopts a hybrid architecture that combines an auto-regressive module with a diffusion decoder. The auto-regressive part is partially based on, and initialized from, [GLM-4-9B-0414][1] with 9 billion parameters, while the diffusion decoder follows [CogView4][2] and adopts a single-stream DiT structure with 7 billion parameters. In general image generation quality, GLM-Image is on par with mainstream latent diffusion approaches, but it shows significant advantages in text rendering and knowledge-intensive generation scenarios. It performs especially well in tasks requiring precise semantic understanding and complex information expression, while maintaining strong capabilities in high-fidelity and fine-grained detail generation. In addition to text-to-image generation, GLM-Image also supports a rich set of image-to-image tasks, including image editing, style transfer, identity-preserving generation, and multi-subject consistency.
Background: In recent years, diffusion models have become the mainstream in image generation thanks to their training stability and strong generalization capabilities. Yet even with substantial improvements in diffusion modeling and VAE formulation [3][4][5], end-to-end diffusion models still have shortcomings in complex instruction following and knowledge-intensive scenarios, often falling short in both information expression and semantic alignment. At the same time, some newly released high-quality image generation models have demonstrated outstanding performance in such knowledge-dense cases, producing visually rich detail while exhibiting auto-regressive modeling characteristics. Drawing inspiration from these developments, GLM-Image was designed from the beginning with two decoupled objectives: robust understanding of complex information and the ability to produce high-quality image details. In our approach, the auto-regressive generator produces tokens that carry low-frequency semantic signals, while the diffusion decoder refines high-frequency details to deliver the final image. This hybrid architecture not only performs reliably in general image generation tasks, but also presents noticeable advantages in creative work that demands intricate knowledge representation, pushing image generation toward a new stage that combines artistic aesthetics with precision in conveying information.

Techniques

Figure 3: General Pipeline

Visual Token Selection

In previous visual auto-regressive generation models, the token types used have typically fallen into three categories:
1. Visual codes obtained via discrete reconstruction training (VQVAE [6])
2. Visual codes obtained via discrete semantic training (semantic-VQ [7])
3. Statistical semantic features extracted from 1D vectors (as in DALLE2 [8])
These approaches rank from high to low, in the order above, in terms of information completeness, whereas their semantic relevance tends to increase in the reverse order. For visual generation models, the correlation between tokens (or patches) is a crucial factor influencing both model convergence and final output quality. For latent diffusion models, works such as VAVAE [5] and SSVAE [9] have demonstrated its significance. For auto-regressive generation, a comparison of training losses shows a clear difference in magnitude (~7 vs. ~3) between tokens derived from VQVAE and from semantic-VQ with similar codebook sizes, suggesting that modeling with semantic tokens offers superior convergence properties for visual generation training. On the other hand, 1D vectors suffer from insufficient information completeness and weak correspondence to a specific image, and are more commonly used in subsequent works for tasks such as subject-consistent generation (e.g., FLUX.1 Redux [10]).
Building on these conclusions and observations, GLM-Image adopts semantic-VQ as its primary tokenization strategy. Specifically, we implement the tokenizer scheme from X-Omni [7] for better semantic correlation during token modeling, and pair it with a diffusion decoder that decodes these tokens into the final image outputs.
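To make the notion of a discrete visual token concrete, here is a minimal sketch of the lookup step that VQVAE and semantic-VQ tokenizers share: each feature vector is replaced by the index of its nearest codebook entry. The function name `quantize` and the toy 2-D codebook are illustrative assumptions, not part of GLM-Image or X-Omni. What separates the two tokenizer families is how the encoder and codebook are trained, which this sketch does not cover.

```python
from math import dist

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))
        for vec in features
    ]

# Toy 2-D codebook with three entries (values are illustrative only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Toy encoder outputs, one vector per image patch.
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

token_ids = quantize(features, codebook)
print(token_ids)  # [0, 1, 2]
```

The resulting integer ids are what an auto-regressive model predicts one at a time, in place of word-piece ids.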

Auto-regressive Pre-training

The auto-regressive part of GLM-Image is initialized from GLM-4-9B-0414 and is trained on a combination of text-to-image and image-to-image generation. We freeze the text word embedding layer of the model while keeping the other parts trainable, append an extra vision word embedding layer for vision token projection, and replace the original LM head with a vision LM head for the new task. We use MRoPE as the positional embedding to handle the interleaved images and text that arise in both the text-to-image and image-to-image tasks, as illustrated in the figure.
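The text above does not spell out how position ids are assigned, so the following is only a sketch of one common MRoPE layout, the one described for Qwen2-VL, and it is an assumption that GLM-Image follows the same convention. Text tokens receive the same index on all three axes (temporal, height, width); image tokens share one temporal index and use their row and column as the height and width indices. The function name and segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """Build (t, h, w) position ids for an interleaved sequence.

    segments: list of ("text", n_tokens) or ("image", rows, cols).
    Layout follows the Qwen2-VL style convention (an assumption here).
    """
    ids = []
    next_pos = 0
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            # Text: identical index on all three axes, like ordinary 1D RoPE.
            ids.extend((next_pos + i,) * 3 for i in range(n))
            next_pos += n
        else:
            _, rows, cols = seg
            # Image: constant temporal index, row/column offsets on h and w.
            ids.extend(
                (next_pos, next_pos + r, next_pos + c)
                for r in range(rows)
                for c in range(cols)
            )
            # The next segment starts after the largest index used so far.
            next_pos += max(rows, cols)
    return ids

# Two prompt tokens, a 2x3 image token grid, then one more text token.
ids = mrope_position_ids([("text", 2), ("image", 2, 3), ("text", 1)])
print(ids[:2])   # [(0, 0, 0), (1, 1, 1)]
print(ids[2:8])  # [(2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4)]
print(ids[8])    # (5, 5, 5)
```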
We train the model in multiple resolution stages: 256px, 512px, and a mixed-resolution stage spanning 512px to 1024px. The X-Omni tokenizer patchifies the image with a 16× compression ratio, so the token count per sample is 256, 1024, and 1024 to 4096 for the three training stages, respectively. Since we set the upscaling factor of our diffusion decoder's final output to 32, the resulting image resolution ranges from 1024px to 2048px.
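The token counts and output resolutions quoted above follow from simple arithmetic, sketched below for square images. The two constants come from the text; the function names are illustrative, and reading the 1024px to 2048px range as belonging to the mixed-resolution stage (32×32 to 64×64 token grids) is our interpretation.

```python
PATCH = 16          # tokenizer compression ratio per side (from the text)
DECODER_SCALE = 32  # output pixels per token side (from the text)

def token_grid(height_px, width_px, patch=PATCH):
    """Return the (rows, cols) token grid for an image of the given size."""
    return height_px // patch, width_px // patch

def token_count(height_px, width_px, patch=PATCH):
    """Number of tokens the auto-regressive model must generate."""
    rows, cols = token_grid(height_px, width_px, patch)
    return rows * cols

def output_resolution(rows, cols, scale=DECODER_SCALE):
    """Pixel size of the decoded image for a given token grid."""
    return rows * scale, cols * scale

for side in (256, 512, 1024):
    print(side, token_count(side, side))
# 256 256
# 512 1024
# 1024 4096

print(output_resolution(*token_grid(512, 512)))    # (1024, 1024)
print(output_resolution(*token_grid(1024, 1024)))  # (2048, 2048)
```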
In the initial 256-token stage of training, we used a straightforward raster-scan order for token generation. However, as we advanced to higher-resolution training stages, we observed a drop in the controllability of model outputs when applying the same generation approach. To address this, we adopted a progressive generation strategy [11]: before generating high-resolution image tokens, we first generate approximately 256 tokens with the same aspect ratio, obtained by tokenizing a down-sampled version of the target image. These preliminary tokens largely determine the final image layout, but because they are few in number they might receive insufficient attention during training, so we increased their training weight in subsequent stages, which effectively improved the overall generation quality.
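One way to picture the progressive strategy is as a sequence of the form [preview tokens][full-resolution tokens] with a larger loss weight on the preview part. The sketch below picks a preview grid of roughly 256 tokens at the same aspect ratio and builds per-token loss weights. The rounding rule, the function names, and the weight value `2.0` are assumptions for illustration; the actual values used in training are not stated here.

```python
import math

def preview_grid(rows, cols, target_tokens=256):
    """Pick a small grid with roughly the same aspect ratio and ~target_tokens cells."""
    scale = math.sqrt(target_tokens / (rows * cols))
    return max(1, round(rows * scale)), max(1, round(cols * scale))

def loss_weights(rows, cols, preview_weight=2.0, target_tokens=256):
    """Per-token loss weights for the layout [preview tokens][full-resolution tokens]."""
    p_rows, p_cols = preview_grid(rows, cols, target_tokens)
    n_preview = p_rows * p_cols
    n_full = rows * cols
    # Up-weight the few preview tokens so they are not drowned out
    # by the much longer full-resolution part of the sequence.
    return [preview_weight] * n_preview + [1.0] * n_full

# A 1024x1024 px target gives a 64x64 token grid at 16x compression.
print(preview_grid(64, 64))  # (16, 16)
# A 768x1024 px target gives a 48x64 token grid.
print(preview_grid(48, 64))  # (14, 18)

weights = loss_weights(48, 64)
print(len(weights))  # 3324
```

For non-square targets the preview count lands near 256 rather than exactly on it (252 in the second case), which matches the "approximately 256 tokens" wording above.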

Decoder Formulation

Figure 4: Decoder Formulation
The diffusion decoder receives semantic-VQ tokens generated by the auto-regressive model as conditional inputs to reconstruct the target image. While semantic-VQ tokens carry rich semantic information, they discard high-frequency image details and primarily present relatively low-frequency image layout information. As a result, the diffusion decoder must retain a certain generative capacity to synthesize and recover the missing fine-grained details.
For the backbone design, we follow CogView4 and adopt a single-stream DiT architecture. The decoder employs flow matching as its diffusion scheduling strategy, ensuring stable training and efficient convergence for high-fidelity image generation. For integration, the semantic-VQ tokens are first passed through a projection layer and then concatenated with the VAE latent representation along the channel dimension. This preserves the input sequence length and incurs almost no extra computational overhead. Since the semantic-VQ tokens already provide sufficient semantic information, we remove the prompt input from the decoder's conditioning. This design eliminates the need for a large-parameter text encoder, thereby reducing both memory usage and computational cost. To strengthen the decoder's ability to render complex textual content, particularly Chinese characters, we introduce a lightweight Glyph-byT5 [12] model that performs character-level encoding for rendered text regions. The resulting glyph embeddings are concatenated with the vision embeddings along the sequence dimension.
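The two kinds of concatenation described above affect the decoder input differently, which the toy sketch below illustrates with plain Python lists. Channel-wise concatenation widens each token but keeps the sequence length, while sequence-wise concatenation appends extra tokens. The helper names and all sizes are hypothetical, the projection layers are omitted, and the sketch assumes the projected semantic-VQ tokens have already been aligned one-to-one with the VAE latent tokens.

```python
def concat_channels(latent_tokens, vq_tokens):
    """Concatenate per-token features; the sequence length stays the same."""
    if len(latent_tokens) != len(vq_tokens):
        raise ValueError("both inputs must have one feature vector per token")
    return [lat + vq for lat, vq in zip(latent_tokens, vq_tokens)]

def concat_sequence(vision_embeddings, glyph_embeddings):
    """Append glyph tokens after the vision tokens; the sequence gets longer."""
    return vision_embeddings + glyph_embeddings

# Toy sizes: 6 image tokens, 4 latent channels, 2 projected VQ channels.
latent = [[0.0] * 4 for _ in range(6)]
vq_projected = [[1.0] * 2 for _ in range(6)]
fused = concat_channels(latent, vq_projected)
print(len(fused), len(fused[0]))  # 6 6

# Three glyph tokens with the same width as the fused vision tokens.
glyph = [[2.0] * 6 for _ in range(3)]
sequence = concat_sequence(fused, glyph)
print(len(sequence), len(sequence[0]))  # 9 6
```

Because attention cost grows with sequence length, the channel-wise route for the semantic-VQ condition is the one that adds almost no overhead; only the comparatively short glyph sequence lengthens the input.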
For image editing tasks, it is often critical to preserve the high-frequency details present in the reference images. The semantic information provided solely by semantic-VQ tokens is insufficient for fine-grained detail preservation. Therefore, in GLM-Image we use both the semantic-VQ tokens and the VAE latents of the reference images as additional conditioning inputs for the diffusion decoder, as illustrated in Figure 4. Unlike concurrent image editing models such as Qwen-Image-Edit [13], which apply full attention between the reference images and the generated image, we adopt a block-causal attention mechanism between the reference and the generated image. This follows the attention design pattern used in ControlNet-Reference-Only [14]. Block-causal attention can significantly reduce the computation spent on reference image tokens through KV caching, while keeping competitive detail preservation.
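A block-causal mask of the kind described above can be sketched as follows. Tokens are ordered as [reference..., generated...]; reference tokens attend only to reference tokens, while generated tokens attend to everything. Full attention inside each block is an assumption of this sketch, and the function name is illustrative.

```python
def block_causal_mask(n_ref, n_gen):
    """Return mask[q][k], True when query token q may attend to key token k.

    Tokens are ordered [reference..., generated...].
    """
    n = n_ref + n_gen
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_ref = q < n_ref
            k_is_ref = k < n_ref
            # Reference keys are visible to every query; generated keys
            # are visible only to generated queries.
            mask[q][k] = k_is_ref or not q_is_ref
    return mask

mask = block_causal_mask(n_ref=2, n_gen=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxxxx
# xxxxx
# xxxxx
```

Since no reference row depends on a generated token, the keys and values of the reference block do not change when the noisy generated latents change, so they can be computed once and reused across denoising steps.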

Disentangled Reward for AR + Diffusion Post-training

In the post-training stage, GLM-Image employs a decoupled reinforcement learning strategy to separately optimize its auto-regressive generator and diffusion decoder, enabling improvements to both semantic alignment and visual detail quality. Both modules are trained with GRPO [15] optimization. For the diffusion decoder specifically, GLM-Image adopts flow-GRPO [16], a variant of the standard LLM GRPO adapted for diffusion models.
The auto-regressive module focuses on low-frequency rewards that guide semantic consistency and aesthetics, thereby improving instruction following and artistic expressiveness. It combines multiple reward sources, including HPSv3 [17] for aesthetic scoring, OCR for enhancing text rendering accuracy, and a VLM for the overall semantic correctness of generated content. The decoder module targets high-frequency rewards to refine fine-detail fidelity and text precision. It leverages LPIPS [18] to improve perceptual texture and detail similarity, integrates OCR signals to further boost text accuracy, and employs a dedicated hand-scoring model to enhance the correctness of generated hands.
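As a rough illustration of how several reward sources can feed GRPO, the sketch below mixes per-sample scores into one scalar and then normalizes rewards within a group of samples drawn for the same prompt, which is the group-relative advantage at the core of GRPO [15]. The weighted-sum mixing rule, the weight values, and the score names are assumptions for illustration, not the reported GLM-Image configuration. Using the population standard deviation is also a choice of this sketch.

```python
from statistics import mean, pstdev

def combined_reward(scores, weights):
    """Weighted sum of per-source scores (the mixing rule is hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical weights for the auto-regressive module's reward sources.
weights = {"aesthetic": 0.4, "ocr": 0.3, "vlm": 0.3}

# Three samples generated for the same prompt (scores are made up).
group = [
    {"aesthetic": 0.2, "ocr": 0.2, "vlm": 0.2},
    {"aesthetic": 0.5, "ocr": 0.5, "vlm": 0.5},
    {"aesthetic": 0.8, "ocr": 0.8, "vlm": 0.8},
]
rewards = [combined_reward(s, weights) for s in group]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Samples that score above the group mean receive a positive advantage and are reinforced; those below the mean are pushed down. The same normalization applies to the decoder, with LPIPS, OCR, and hand-scoring signals as the reward sources.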

Eval Metrics

Text-rendering Benchmarks

CVTG-2k
| Model | NED | CLIPScore | Word Accuracy (2 regions) | Word Accuracy (3 regions) | Word Accuracy (4 regions) | Word Accuracy (5 regions) | Word Accuracy (average) |
|---|---|---|---|---|---|---|---|
| GLM-Image | 0.9557 | 0.7877 | 0.9103 | 0.9209 | 0.9169 | 0.8975 | 0.9116 |
| Seedream 4.5 | 0.9483 | 0.8069 | 0.8778 | 0.8952 | 0.9083 | 0.9008 | 0.899 |
| Z-Image | 0.9367 | 0.7969 | 0.9006 | 0.8722 | 0.8652 | 0.8512 | 0.8671 |
| Qwen-Image-2512 | 0.929 | 0.7819 | 0.863 | 0.8571 | 0.861 | 0.8618 | 0.8604 |
| Z-Image-Turbo | 0.9281 | 0.8048 | 0.8872 | 0.8662 | 0.8628 | 0.8347 | 0.8585 |
| GPT Image 1 [High] | 0.9478 | 0.7982 | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 |
| Seedream 4.0 | 0.9224 | 0.7975 | 0.8585 | 0.8484 | 0.8538 | 0.8269 | 0.8451 |
| Qwen-Image | 0.9116 | 0.8017 | 0.837 | 0.8364 | 0.8313 | 0.8158 | 0.8288 |
| Nano Banana 2.0 | 0.8754 | 0.7372 | 0.7368 | 0.7748 | 0.7863 | 0.7926 | 0.7788 |
| TextCrafter | 0.8679 | 0.7868 | 0.7628 | 0.7628 | 0.7406 | 0.6977 | 0.737 |
| SD3.5 Large | 0.847 | 0.7797 | 0.7293 | 0.6825 | 0.6574 | 0.594 | 0.6548 |
| Seedream 3.0 | 0.8537 | 0.7821 | 0.6282 | 0.5962 | 0.6043 | 0.561 | 0.5924 |
| FLUX.1 [dev] | 0.6879 | 0.7401 | 0.6089 | 0.5531 | 0.4661 | 0.4316 | 0.4965 |
| 3DIS | 0.6505 | 0.7767 | 0.4495 | 0.3959 | 0.388 | 0.3303 | 0.3813 |
| RAG-Diffusion | 0.4498 | 0.7797 | 0.4388 | 0.3316 | 0.2116 | 0.191 | 0.2648 |
| TextDiffuser-2 | 0.4353 | 0.6765 | 0.5322 | 0.3255 | 0.1787 | 0.0809 | 0.2326 |
| AnyText | 0.4675 | 0.7432 | 0.0513 | 0.1739 | 0.1948 | 0.2249 | 0.1804 |

LongText-Bench
| Model | LongText-Bench-EN | LongText-Bench-ZH |
|---|---|---|
| Seedream 4.5 | 0.989 | 0.9873 |
| GLM-Image | 0.9524 | 0.9788 |
| Nano Banana 2.0 | 0.9808 | 0.9491 |
| Qwen-Image-2512 | 0.9561 | 0.9647 |
| Qwen-Image | 0.943 | 0.946 |
| Z-Image | 0.935 | 0.936 |
| Seedream 4.0 | 0.9214 | 0.9261 |
| Z-Image-Turbo | 0.917 | 0.926 |
| Seedream 3.0 | 0.896 | 0.878 |
| X-Omni | 0.9 | 0.814 |
| GPT Image 1 [High] | 0.956 | 0.619 |
| Kolors 2.0 | 0.258 | 0.329 |
| BAGEL | 0.373 | 0.31 |
| OmniGen2 | 0.561 | 0.059 |
| HiDream-I1-Full | 0.543 | 0.024 |
| BLIP3-o | 0.021 | 0.018 |
| Janus-Pro | 0.019 | 0.006 |
| FLUX.1 [Dev] | 0.607 | 0.005 |

General Benchmarks

OneIG_EN
| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.888 | 0.944 | 0.334 | 0.481 | 0.245 | 0.578 |
| Seedream 4.5 | 0.891 | 0.998 | 0.35 | 0.434 | 0.207 | 0.576 |
| Seedream 4.0 | 0.892 | 0.983 | 0.347 | 0.453 | 0.191 | 0.573 |
| Z-Image | 0.881 | 0.987 | 0.28 | 0.387 | 0.194 | 0.546 |
| Qwen-Image | 0.882 | 0.891 | 0.306 | 0.418 | 0.197 | 0.539 |
| GPT Image 1 [High] | 0.851 | 0.857 | 0.345 | 0.462 | 0.151 | 0.533 |
| Qwen-Image-2512 | 0.876 | 0.99 | 0.292 | 0.338 | 0.151 | 0.53 |
| Seedream 3.0 | 0.818 | 0.865 | 0.275 | 0.413 | 0.277 | 0.53 |
| GLM-Image | 0.805 | 0.969 | 0.298 | 0.353 | 0.213 | 0.528 |
| Z-Image-Turbo | 0.84 | 0.994 | 0.298 | 0.368 | 0.139 | 0.528 |
| Imagen 4 | 0.857 | 0.805 | 0.338 | 0.377 | 0.199 | 0.515 |
| Recraft V3 | 0.81 | 0.795 | 0.323 | 0.378 | 0.205 | 0.502 |
| HiDream-I1-Full | 0.829 | 0.707 | 0.317 | 0.347 | 0.186 | 0.477 |
| OmniGen2 | 0.804 | 0.68 | 0.271 | 0.377 | 0.242 | 0.475 |
| SD3.5 Large | 0.809 | 0.629 | 0.294 | 0.353 | 0.225 | 0.462 |
| CogView4 | 0.786 | 0.641 | 0.246 | 0.353 | 0.205 | 0.446 |
| FLUX.1 [Dev] | 0.78 | 0.532 | 0.253 | 0.368 | 0.238 | 0.434 |
| Kolors 2.0 | 0.82 | 0.427 | 0.262 | 0.36 | 0.3 | 0.434 |
| Imagen 3 | 0.843 | 0.343 | 0.313 | 0.359 | 0.188 | 0.409 |
| BAGEL | 0.769 | 0.244 | 0.173 | 0.367 | 0.251 | 0.361 |
| Lumina-Image 2.0 | 0.806 | 0.27 | 0.27 | 0.354 | 0.216 | 0.353 |
| SANA-1.5-4.8B | 0.675 | 0.069 | 0.217 | 0.401 | 0.216 | 0.334 |
| SANA-1.5-1.6B | 0.733 | 0.054 | 0.209 | 0.387 | 0.222 | 0.327 |
| BAGEL+CoT | 0.745 | 0.174 | 0.206 | 0.39 | 0.209 | 0.324 |
| SD 1.5 | 0.69 | 0.207 | 0.207 | 0.383 | 0.429 | 0.319 |
| SDXL | 0.688 | 0.029 | 0.237 | 0.332 | 0.296 | 0.316 |
| Show-o2-7B | 0.817 | 0.002 | 0.226 | 0.317 | 0.177 | 0.308 |
| BLIP3-o | 0.711 | 0.133 | 0.223 | 0.361 | 0.229 | 0.307 |
| Show-o2-1.5B | 0.798 | 0.002 | 0.219 | 0.317 | 0.186 | 0.304 |
| Janus-Pro | 0.553 | 0.001 | 0.139 | 0.276 | 0.365 | 0.267 |

OneIG_ZH
| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.843 | 0.983 | 0.311 | 0.461 | 0.236 | 0.567 |
| Seedream 4.0 | 0.836 | 0.986 | 0.304 | 0.443 | 0.2 | 0.554 |
| Seedream 4.5 | 0.832 | 0.986 | 0.3 | 0.426 | 0.213 | 0.551 |
| Qwen-Image | 0.825 | 0.963 | 0.267 | 0.405 | 0.279 | 0.548 |
| Z-Image | 0.793 | 0.988 | 0.266 | 0.386 | 0.243 | 0.535 |
| Seedream 3.0 | 0.793 | 0.928 | 0.281 | 0.397 | 0.243 | 0.528 |
| Qwen-Image-2512 | 0.823 | 0.983 | 0.272 | 0.342 | 0.157 | 0.515 |
| GLM-Image | 0.738 | 0.976 | 0.284 | 0.335 | 0.221 | 0.511 |
| Z-Image-Turbo | 0.782 | 0.982 | 0.276 | 0.361 | 0.134 | 0.507 |
| GPT Image 1 [High] | 0.812 | 0.65 | 0.3 | 0.449 | 0.159 | 0.474 |
| Kolors 2.0 | 0.738 | 0.502 | 0.226 | 0.331 | 0.333 | 0.426 |
| BAGEL | 0.672 | 0.365 | 0.186 | 0.357 | 0.268 | 0.37 |
| CogView4 | 0.7 | 0.193 | 0.236 | 0.348 | 0.214 | 0.338 |
| HiDream-I1-Full | 0.62 | 0.205 | 0.256 | 0.304 | 0.3 | 0.337 |
| Lumina-Image 2.0 | 0.731 | 0.136 | 0.221 | 0.343 | 0.24 | 0.334 |
| BAGEL+CoT | 0.719 | 0.127 | 0.219 | 0.385 | 0.197 | 0.329 |
| BLIP3-o | 0.608 | 0.092 | 0.213 | 0.369 | 0.233 | 0.303 |
| Janus-Pro | 0.324 | 0.148 | 0.104 | 0.264 | 0.358 | 0.24 |

DPGBench
| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Seedream 4.5 | 89.24 | 94.3 | 92.14 | 92.23 | 93.83 | 88.63 |
| Seedream 4.0 | 93.86 | 92.24 | 90.74 | 93.87 | 94.16 | 88.54 |
| Qwen-Image | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Seedream 3.0 | 94.31 | 92.65 | 91.36 | 92.78 | 88.24 | 88.27 |
| Z-Image | 93.39 | 91.22 | 93.16 | 92.22 | 91.52 | 88.14 |
| Qwen-Image-2512 | 89.04 | 91.91 | 92.39 | 90.85 | 93.07 | 87.2 |
| Lumina-Image 2.0 | - | 91.97 | 90.2 | 94.85 | - | 87.2 |
| Nano Banana 2.0 | 91 | 92.85 | 91.56 | 92.39 | 89.93 | 87.16 |
| HiDream-I1-Full | 76.44 | 90.22 | 89.48 | 93.74 | 91.83 | 85.89 |
| GPT Image 1 [High] | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| Z-Image-Turbo | 91.29 | 89.59 | 90.14 | 92.16 | 88.68 | 84.86 |
| GLM-Image | 87.74 | 90.25 | 89.08 | 92.15 | 90.17 | 84.78 |
| Janus-Pro-7B | 86.9 | 88.9 | 89.4 | 89.32 | 89.48 | 84.19 |
| SD3 Medium | 87.9 | 91.01 | 88.83 | 80.7 | 88.68 | 84.08 |
| FLUX.1 [Dev] | 74.35 | 90 | 88.96 | 90.87 | 88.33 | 83.52 |
| DALL-E 3 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.5 |
| Janus-Pro-1B | 87.58 | 88.63 | 88.17 | 89.98 | 88.3 | 82.65 |
| Emu3-Gen | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.6 |
| PixArt-Σ | 86.89 | 82.89 | 88.94 | 86.59 | 87.68 | 80.54 |
| Janus | 82.33 | 87.38 | 87.7 | 85.46 | 86.41 | 79.66 |
| Hunyuan-DiT | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 | 78.47 |
| Playground v2.5 | 83.06 | 82.59 | 81.2 | 84.08 | 83.5 | 75.47 |
| SDXL | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| Lumina-Next | 82.82 | 88.65 | 86.44 | 80.53 | 81.82 | 74.63 |
| PixArt-α | 74.97 | 79.32 | 78.6 | 82.57 | 76.96 | 71.11 |
| SD1.5 | 74.63 | 74.23 | 75.39 | 73.49 | 67.81 | 63.18 |

TIFFBench
| Model | Overall (short) | Overall (long) |
|---|---|---|
| Nano Banana 2.0 | 91 | 88.26 |
| Seedream 4.5 | 90.49 | 88.52 |
| Seedream 4.0 | 90.45 | 88.08 |
| GPT Image 1 [High] | 89.15 | 88.29 |
| Qwen-Image | 86.14 | 86.83 |
| Seedream 3.0 | 86.02 | 84.31 |
| Z-Image | 80.2 | 83.01 |
| Qwen-Image-2512 | 83.24 | 84.93 |
| GLM-Image | 81.01 | 81.02 |
| Z-Image-Turbo | 77.73 | 80.05 |
| DALL-E 3 | 74.96 | 70.81 |
| FLUX.1 [dev] | 71.09 | 71.78 |
| FLUX.1 [Pro] | 67.32 | 69.89 |
| Midjourney V7 | 68.74 | 65.69 |
| SD 3 | 67.46 | 66.09 |
| SANA 1.5 | 67.15 | 65.73 |
| Janus-Pro-7B | 66.5 | 65.01 |
| Infinity | 62.07 | 62.06 |
| PixArt-Σ | 62 | 58.12 |
| Show-o | 59.72 | 58.86 |
| LightGen | 53.22 | 49.41 |
| Hunyuan-DiT | 51.38 | 53.28 |
| Lumina-Next | 50.93 | 52.46 |

References

[1] https://huggingface.co/zai-org/GLM-4-9B-0414
[2] https://huggingface.co/zai-org/CogView4-6B
[3] Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow straight and fast: Learning to generate and transfer data with rectified flow." arXiv preprint arXiv:2209.03003 (2022).
[4] Yu, Sihyun, et al. "Representation alignment for generation: Training diffusion transformers is easier than you think." arXiv preprint arXiv:2410.06940 (2024).
[5] Yao, Jingfeng, Bin Yang, and Xinggang Wang. "Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[6] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).
[7] Geng, Zigang, et al. "X-omni: Reinforcement learning makes discrete autoregressive image generative models great again." arXiv preprint arXiv:2507.22058 (2025).
[8] Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
[9] Liu, Shizhan, et al. "Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability." arXiv preprint arXiv:2512.05394 (2025).
[10] https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev
[11] Zheng, Wendi, et al. "Cogview3: Finer and faster text-to-image generation via relay diffusion." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
[12] Liu, Zeyu, et al. "Glyph-byt5: A customized text encoder for accurate visual text rendering." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
[13] Wu, Chenfei, et al. "Qwen-image technical report." arXiv preprint arXiv:2508.02324 (2025).
[14] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF international conference on computer vision. 2023.
[15] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).
[16] Liu, Jie, et al. "Flow-grpo: Training flow matching models via online rl." arXiv preprint arXiv:2505.05470 (2025).
[17] Ma, Yuhang, et al. "Hpsv3: Towards wide-spectrum human preference score." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025.
[18] Zhang, Richard, et al. "The unreasonable effectiveness of deep features as a perceptual metric." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.