ABSTRACT

Diffusion models have revolutionized image generation; however, their internal mechanisms remain poorly understood and continue to be an active area of research. In this study, we investigate how fine-tuning affects the quality of attention localization, which we use to assess the cross-modal alignment between textual and visual features in the Stable Diffusion model. We employ the Diffusion Attentive Attribution Maps (DAAM) method to produce spatial attribution maps that reveal how individual tokens influence specific image regions. The Stable Diffusion model is fine-tuned on a limited dataset containing two object classes and then evaluated with the intersection-over-union (IoU) metric computed between attention masks and ground-truth segmentation masks. The evaluation covers both categories used for training and a control category absent from the training data. We find that fine-tuning significantly improves attention localization for the trained classes, yielding 25% and 10% increases in mean IoU, with no significant change for the control class. These findings provide insights into how diffusion models develop domain-specific attention patterns and confirm that fine-tuning enhances attention mechanisms without requiring full model retraining, contributing to explainable AI research. Our code is available at: https://github.com/mickuz/stable-diffusion-properties.
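The evaluation metric mentioned above can be illustrated with a minimal sketch: a token-level heat map (such as one produced by DAAM) is normalized, thresholded into a binary attention mask, and compared against a ground-truth segmentation mask via IoU. The normalization scheme, the threshold value, and the placeholder arrays below are illustrative assumptions, not the exact protocol used in the paper.

```python
import numpy as np

def binarize_attention(attn_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Normalize a token-level heat map to [0, 1] and threshold it into a binary mask.

    The min-max normalization and the 0.5 threshold are illustrative assumptions.
    """
    attn = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    return attn >= threshold

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between a binary attention mask and a ground-truth mask."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / union) if union > 0 else 0.0

# Toy example with placeholder data standing in for a DAAM heat map and a segmentation mask.
heat_map = np.random.rand(64, 64)           # hypothetical attention map for one token
gt_mask = np.zeros((64, 64), dtype=bool)    # hypothetical ground-truth segmentation mask
gt_mask[16:48, 16:48] = True
print(iou(binarize_attention(heat_map), gt_mask))
```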