Abstract: As important carriers of historical and cultural heritage, heritage buildings hold significant value for research and conservation. However, their scarcity and ongoing deterioration make complete restoration through traditional methods challenging. While existing text-to-image generation techniques can reconstruct their appearance from textual descriptions, issues such as missing details and suboptimal image quality persist. To address these limitations, this study proposes a method for generating heritage buildings based on an improved diffusion model. A gated residual mechanism is introduced to optimize information flow, mitigate gradient vanishing, and enhance generation stability. A dual attention network combining channel and spatial attention is incorporated to strengthen the modeling of both local details and global structure. Furthermore, VGG19 is employed as a discriminator network to extract multi-level semantic features, and a perceptual loss is introduced to improve the reconstruction of key visual features. Experimental results show that, compared with other diffusion-based models (KNN-diffusion and Simple diffusion), the proposed method reduces FID by 30.39% and improves CLIP-score, IS, and SSIM by 1.08%, 9.01%, and 2.35%, respectively. This study provides a feasible technical approach for generating high-quality images of heritage buildings, contributing to the sustainable research and intelligent conservation of digital cultural heritage.
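The gated residual mechanism and the dual (channel + spatial) attention mentioned in the abstract can be sketched in miniature as follows. This is an illustrative NumPy toy, not the paper's actual architecture: the function names (`channel_attention`, `spatial_attention`, `gated_residual`) and the fixed sigmoid gates are our own assumptions, standing in for the learned layers a real implementation would use.

```python
import numpy as np

def sigmoid(x):
    """Numerically plain logistic gate, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    """Weight each channel by a sigmoid of its global-average-pooled
    descriptor. x has shape (C, H, W); a real block would pass the
    descriptor through learned fully connected layers first."""
    w = sigmoid(x.mean(axis=(1, 2)))          # (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    """Weight each spatial location by a sigmoid of the channel-mean
    map, emphasizing structurally salient regions."""
    m = sigmoid(x.mean(axis=0))               # (H, W)
    return x * m[None, :, :]

def dual_attention(x):
    """Sequential channel-then-spatial gating (CBAM-style ordering)."""
    return spatial_attention(channel_attention(x))

def gated_residual(x, transform, gate_logit=0.0):
    """Residual connection whose branch is scaled by a sigmoid gate,
    so the skip path always carries the identity signal; this is the
    kind of structure that helps mitigate vanishing gradients."""
    return x + sigmoid(gate_logit) * transform(x)

# Toy feature map: 8 channels, 16x16 spatial grid.
feat = np.random.default_rng(0).standard_normal((8, 16, 16))
out = gated_residual(feat, dual_attention)
print(out.shape)  # (8, 16, 16)
```

Because both attention gates lie in (0, 1), the attention branch can only attenuate features, while the residual skip preserves the input; the learned versions in the paper would additionally reweight via trained parameters.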