Abstract: To address CLIP's insufficient fine-grained modeling and lack of spatial sensitivity in weakly supervised semantic segmentation, this study proposes GLE-CLIP, a global-local embedding dynamic fusion framework. The method employs a bidirectional cross-attention (BCA) module to establish interaction between textual semantics and local image features, generating more discriminative local embeddings. In addition, a dynamic attention fusion (DAF) mechanism adaptively balances global semantics and local details through a similarity-driven weight-allocation strategy. Specifically, the framework first extracts multi-scale pixel-level embeddings with a grounding decoder, strengthens cross-modal alignment via cross-attention between text and pixel features, and fuses features across granularity levels through dynamic projection of the global embedding. Experiments on the PASCAL VOC 2012 and MS COCO datasets demonstrate that the proposed method outperforms most existing language-supervised approaches, achieving mIoU scores of 77.9% and 47.9%, respectively. Ablation studies validate the effectiveness of the cross-attention module and the dynamic fusion mechanism, while visualization results further reveal the method's capability to capture high-frequency details.
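To make the two core components concrete, the sketch below illustrates one plausible reading of the BCA module and the similarity-driven dynamic fusion described above. The abstract gives no implementation details, so every dimension, module name, and the exact weighting rule (a sigmoid over cosine similarity) is an assumption for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalCrossAttention(nn.Module):
    """Sketch of a BCA module: text and pixel embeddings attend to each
    other to produce refined, more discriminative local embeddings.
    All shapes and the two-step attention order are assumptions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.text_to_pix = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pix_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, pixels: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) class-name embeddings; pixels: (B, N, D) pixel embeddings
        text_ref, _ = self.text_to_pix(text, pixels, pixels)      # text attends to pixels
        pix_ref, _ = self.pix_to_text(pixels, text_ref, text_ref)  # pixels attend back
        return pix_ref  # refined local embeddings, (B, N, D)


def dynamic_fusion(global_emb: torch.Tensor, local_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of similarity-driven fusion: per-pixel weights, derived from
    the cosine similarity between each local embedding and the broadcast
    global embedding, blend global semantics with local detail."""
    # global_emb: (B, D); local_emb: (B, N, D)
    sim = F.cosine_similarity(local_emb, global_emb.unsqueeze(1), dim=-1)  # (B, N)
    w = torch.sigmoid(sim).unsqueeze(-1)  # per-pixel fusion weight in (0, 1)
    return w * global_emb.unsqueeze(1) + (1.0 - w) * local_emb  # (B, N, D)
```

Pixels whose local embedding already agrees with the global (image-level) embedding receive a larger global weight, while dissimilar pixels keep more of their local detail, which is one way to realize the adaptive global-local balance the abstract describes.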