Abstract: To address CLIP's insufficient fine-grained modeling and lack of spatial sensitivity in weakly supervised semantic segmentation, this study proposes GLE-CLIP, a global-local embedding dynamic fusion framework. The method employs a bidirectional cross-attention (BCA) module to establish interaction between textual semantics and local image features, generating more discriminative local embeddings. In addition, a dynamic attention fusion (DAF) mechanism adaptively balances global semantics and local details through a similarity-driven weight-allocation strategy. Specifically, the framework first extracts multi-scale pixel-level embeddings with a grounding decoder, strengthens cross-modal alignment via cross-attention between text and pixel features, and fuses features across granularity levels through dynamic projection of the global embedding. Experiments on the PASCAL VOC 2012 and MS COCO datasets demonstrate that the proposed method outperforms most existing language-supervised approaches, achieving mIoU scores of 77.9% and 47.9%, respectively. Ablation studies validate the effectiveness of the cross-attention module and the dynamic fusion mechanism, while visualization results further reveal the method's capability to capture high-frequency details.
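To make the two core components concrete, the sketch below illustrates one plausible reading of the BCA module and the similarity-driven dynamic fusion described above. The abstract gives no implementation details, so every dimension, module name, and the exact weighting rule (a sigmoid over cosine similarity) is an assumption for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalCrossAttention(nn.Module):
    """Sketch of a BCA module: text and pixel embeddings attend to each
    other to produce refined, more discriminative local embeddings.
    All shapes and the two-step attention order are assumptions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.text_to_pix = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pix_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, pixels: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) class-name embeddings; pixels: (B, N, D) pixel embeddings
        text_ref, _ = self.text_to_pix(text, pixels, pixels)      # text attends to pixels
        pix_ref, _ = self.pix_to_text(pixels, text_ref, text_ref)  # pixels attend back
        return pix_ref  # refined local embeddings, (B, N, D)


def dynamic_fusion(global_emb: torch.Tensor, local_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of similarity-driven fusion: per-pixel weights, derived from
    the cosine similarity between each local embedding and the broadcast
    global embedding, blend global semantics with local detail."""
    # global_emb: (B, D); local_emb: (B, N, D)
    sim = F.cosine_similarity(local_emb, global_emb.unsqueeze(1), dim=-1)  # (B, N)
    w = torch.sigmoid(sim).unsqueeze(-1)  # per-pixel fusion weight in (0, 1)
    return w * global_emb.unsqueeze(1) + (1.0 - w) * local_emb  # (B, N, D)
```

Pixels whose local embedding already agrees with the global (image-level) embedding receive a larger global weight, while dissimilar pixels keep more of their local detail, which is one way to realize the adaptive global-local balance the abstract describes.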