Abstract: To address the inefficiency, high cost, and safety risks of traditional road and bridge inspection, as well as the large parameter counts of current multimodal detection models and the difficulty of deploying them in real time on unmanned aerial vehicle (UAV) platforms, this study proposes a multimodal feature fusion road and bridge detection model based on cross-distillation. The model adopts a dual-branch teacher network and a single-branch student network. Feature interaction and collaborative distillation between the teacher branches enable efficient transfer of modality-specific knowledge to the student, while an attention-based dynamic feature fusion module strengthens the perception of critical features associated with road and bridge defects. Experimental results show that the proposed model maintains a detection precision of 89.6% mAP@0.5 while reducing the parameter count to 8.2 M and reaching an inference speed of 32.6 f/s, significantly outperforming conventional multimodal fusion and lightweight methods. Compared with strategies based on feature concatenation or post-distillation unimodal fusion, the model shows clear advantages in both detection accuracy and computational efficiency. Ablation studies confirm the effectiveness of the cross-distillation mechanism and the attention-based fusion module. The model achieves high-precision, lightweight detection of road and bridge defects, providing a technical foundation for the engineering application of UAV-based road and bridge inspection.
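The sketch below illustrates, in simplified form, the two mechanisms named in the abstract: distilling the fused representation of a dual-branch teacher into a single-branch student, and dynamically fusing the two modality streams with channel attention. It is not the authors' released code; the module and function names (AttentionFusion, cross_distillation_loss), the choice of RGB and infrared as the two modalities, the loss composition, and all layer sizes and weights are illustrative assumptions.

```python
# Minimal sketch of cross-distillation with attention-based dynamic fusion.
# Assumptions (not from the paper): PyTorch, RGB + infrared teacher branches,
# channel-attention gating, and an L2 + temperature-scaled KL distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """Dynamically re-weights the two modality feature maps with channel attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2, kernel_size=1),  # one scalar weight per modality
        )

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(torch.cat([f_rgb, f_ir], dim=1)), dim=1)
        return w[:, 0:1] * f_rgb + w[:, 1:2] * f_ir


def cross_distillation_loss(student_feat, teacher_rgb, teacher_ir, fusion, tau=2.0):
    """Distils the fused (frozen) teacher representation into the single-branch student."""
    fused = fusion(teacher_rgb, teacher_ir).detach()  # teacher side provides the target
    # feature-level imitation (L2) plus a soft, temperature-scaled channel-statistics term
    imitation = F.mse_loss(student_feat, fused)
    s = F.log_softmax(student_feat.mean(dim=(2, 3)) / tau, dim=1)
    t = F.softmax(fused.mean(dim=(2, 3)) / tau, dim=1)
    soft = F.kl_div(s, t, reduction="batchmean") * tau ** 2
    return imitation + soft


if __name__ == "__main__":
    # Toy shapes: batch of 2, 64-channel feature maps at 80x80.
    fusion = AttentionFusion(channels=64)
    s_feat = torch.randn(2, 64, 80, 80, requires_grad=True)
    t_rgb, t_ir = torch.randn(2, 64, 80, 80), torch.randn(2, 64, 80, 80)
    loss = cross_distillation_loss(s_feat, t_rgb, t_ir, fusion)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In this reading, only the single-branch student is kept at deployment time, which is consistent with the reported reduction in parameters and the real-time inference speed on the UAV platform.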