Abstract: Prompt engineering plays a crucial role in unlocking the potential of large language models. It guides a model’s responses by designing prompt instructions to ensure their relevance, coherence, and accuracy. Prompt engineering requires no fine-tuning of model parameters and can be seamlessly connected with downstream tasks, so various prompt engineering techniques have become a research hotspot in recent years. Accordingly, this study introduces the key steps for creating effective prompts, summarizes basic and advanced prompt engineering techniques such as chain of thought and tree of thought, and explores in depth the advantages and limitations of each method. It also discusses how to evaluate the effectiveness of prompt methods from different perspectives and with different methods. The rapid development of these techniques has enabled large language models to succeed in a variety of applications, ranging from education and healthcare to code generation. Finally, future research directions for prompt engineering are discussed.
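As a concrete illustration of the most basic technique surveyed above, the sketch below builds zero-shot and few-shot chain-of-thought prompts as plain strings. The reasoning trigger, the worked example, and the questions are illustrative choices, and the call to an actual LLM API is deliberately omitted since none is prescribed by the survey.

```python
# Minimal sketch of chain-of-thought (CoT) prompt construction.

def build_zero_shot_cot(question: str) -> str:
    # Zero-shot CoT: append a reasoning trigger so the model
    # externalizes intermediate steps before giving the answer.
    return f"Q: {question}\nA: Let's think step by step."

def build_few_shot_cot(question: str) -> str:
    # Few-shot CoT: prepend a worked example whose step-by-step
    # reasoning the model is expected to imitate.
    example = (
        "Q: A shop sells pens at 3 yuan each. How much do 4 pens cost?\n"
        "A: One pen costs 3 yuan, so 4 pens cost 4 * 3 = 12 yuan. The answer is 12.\n\n"
    )
    return example + f"Q: {question}\nA:"

if __name__ == "__main__":
    print(build_zero_shot_cot("A train travels 60 km in 40 minutes. What is its speed in km/h?"))
    print(build_few_shot_cot("A box holds 6 eggs. How many eggs are in 7 boxes?"))
```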
Abstract: Acute ischemic stroke is the most common type of stroke in clinical practice. Due to its sudden onset and short treatment time window, it is one of the leading causes of disability and death worldwide. With the rapid development of artificial intelligence, deep learning technology shows great potential in the diagnosis and treatment of acute ischemic stroke. Deep learning models can quickly and efficiently segment and detect lesions in patients’ brain images. This study introduces the development history of deep learning models and the public datasets commonly used for stroke research. For the various modalities and scanning sequences derived from computerized tomography (CT) and magnetic resonance imaging (MRI), it elaborates on the research progress of deep learning technology in lesion segmentation and detection for acute ischemic stroke and summarizes and analyzes the improvement ideas of related research. Finally, it points out the existing challenges of deep learning in this field and proposes possible solutions.
Abstract: The research on the classification and identification of microscopic residual oil occurrence states plays a vital role in residual oil exploitation and is of great significance for improving oil field recovery. In recent years, a large number of studies in this field have promoted the development of technologies for identifying microscopic residual oil by introducing deep learning. However, deep learning has not yet established a unified framework for microscopic residual oil identification, nor has a standardized operating procedure been formed. To guide future research, this study reviews existing methods for identifying residual oil and introduces the identification technologies for microscopic residual oil based on machine vision from several aspects, including image acquisition and classification standards, image processing, and residual oil identification methods. Residual oil identification methods are categorized into traditional and deep learning-based methods. The traditional methods are further divided into those based on manual feature extraction and those based on machine learning classification. The deep learning-based methods are divided into single-stage and two-stage methods. Detailed summaries are provided for data augmentation, pre-training, image segmentation, and image classification. Finally, this study discusses the challenges of applying deep learning to microscopic residual oil identification and explores future development trends.
Abstract: Embodied AI requires the ability to interact with and perceive the environment, as well as capabilities such as autonomous planning, decision making, and action taking. Behavior trees (BTs) have become a widely used approach in robotics due to their modularity and efficient control. However, existing behavior tree generation techniques still face certain challenges when dealing with complex tasks. These methods typically rely on domain expertise and have a limited capacity to generate behavior trees. In addition, many existing methods have language comprehension deficiencies or are theoretically unable to guarantee the success of the behavior tree, leading to difficulties in practical robotic applications. In this study, a new method for automatic behavior tree generation is proposed, which generates an initial behavior tree with task goals based on large language models (LLMs) and scene semantic perception. The method designs robot action primitives and related condition nodes based on the robot’s capabilities and then uses these to design prompts that make the LLMs output a behavior plan (generated plan), which is then transformed into an initial behavior tree. Although this study takes this setting as an example, the method has wide applicability and can be applied to other types of robotic tasks according to different needs. Meanwhile, this study applies the method to robot tasks and gives specific implementation methods and examples. While the robot performs a task, the behavior tree can be dynamically updated in response to the robot’s operation errors and environmental changes and has a certain degree of robustness to changes in the external environment. Validation experiments on behavior tree generation are carried out in a simulated robot environment, demonstrating the effectiveness of the proposed method.
Abstract: Deformable 3D medical image registration remains challenging due to irregular deformations of human organs. This study proposes a multi-scale deformable 3D medical image registration method based on Transformer. Firstly, the method adopts a multi-scale strategy to realize multi-level connections and capture information at different levels. A self-attention mechanism is employed to extract global features, and dilated convolution is used to capture broader context and finer local features, so as to enhance the registration network’s ability to fuse global and local features. Secondly, based on the sparse prior of the image gradient, the normalized total gradient is introduced as a loss function, which effectively reduces the interference of noise and artifacts in the registration process and adapts better to different modalities of medical images. The performance of the proposed method is evaluated on publicly available brain MRI datasets (OASIS and LPBA). The results show that the proposed method not only maintains the run-time advantage of learning-based methods but also performs well in terms of mean square error and structural similarity. In addition, ablation results further prove the validity of the proposed network design and the normalized total gradient loss function.
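The abstract names the normalized total gradient (NTG) as the similarity loss but does not spell out the formula. The sketch below implements one common formulation from the NTG registration literature, NTG(A, B) = Σ|∇(A − B)| / (Σ|∇A| + Σ|∇B|), using forward finite differences on 3D volumes; treat it as an assumption-labeled reading rather than the paper’s exact code.

```python
import torch

def normalized_total_gradient(moving: torch.Tensor, fixed: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """One common NTG formulation: sum |grad(A - B)| / (sum |grad A| + sum |grad B|).
    Inputs are (N, C, D, H, W) volumes; gradients are forward finite differences
    along each spatial axis. Lower values mean better gradient alignment."""
    def grads(x):
        return (x[..., 1:, :, :] - x[..., :-1, :, :],
                x[..., :, 1:, :] - x[..., :, :-1, :],
                x[..., :, :, 1:] - x[..., :, :, :-1])

    diff = moving - fixed
    num = sum(g.abs().sum() for g in grads(diff))
    den = sum(g.abs().sum() for g in grads(moving)) + sum(g.abs().sum() for g in grads(fixed))
    return num / (den + eps)

# Typical use: total loss = NTG similarity term + a weighted smoothness regularizer.
vol_a = torch.rand(1, 1, 32, 32, 32, requires_grad=True)
vol_b = torch.rand(1, 1, 32, 32, 32)
print(normalized_total_gradient(vol_a, vol_b))
```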
Abstract: In the current electricity market, the volume of daily spot market clearing data has reached millions or even tens of millions of records. With the increase in trading activities and the complexity of the market structure, ensuring the integrity, transparency, and traceability of trading data has become a key issue to be studied in the field of market clearing in China. Therefore, this study proposes a data provenance method for power market clearing based on the PROV model and smart contracts, aiming to automate the storage and updating of provenance information through smart contracts to improve the transparency of the clearing process and the trust of the participants. The proposed method utilizes the elements of entities, activities, and agents in the PROV model, combined with the hierarchical storage and immutability of blockchain technology, to record and track trading activities and rule changes in the electricity market. The method not only enhances data transparency and trust among market participants but also optimizes data management and storage strategies, reducing operational costs. In addition, the method provides proof of compliance for power market clearing, helping market participants meet increasing regulatory requirements.
Abstract: Facing the complex marine environment, it is extremely challenging to utilize ship radiated noise for hydroacoustic target feature extraction and recognition. In this study, 3D dynamic Mel-frequency cepstrum coefficient (3D-MFCC) features of ship audio signals are fused with 3D dynamic Mel-spectrogram (3D-Mel) features as model inputs. On this basis, a new deep neural network model for underwater target recognition is proposed. The model adopts a serial architecture combining a convolutional neural network (CNN) and long short-term memory (LSTM), in which the traditional CNN is replaced by a multi-scale depthwise convolutional network (MSDC) and multi-scale channel attention (MSCA) is added. The experimental results show that the average recognition rate of this method on the DeepShip and ShipsEar datasets reaches 85.92% and 97.32%, respectively, demonstrating a good classification effect.
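The fused input described above can be pictured with a small feature-extraction sketch: static coefficients plus their first- and second-order deltas stacked as three channels. The channel layout and the n_mfcc value are assumptions; the abstract specifies the feature names but not these details.

```python
import numpy as np
import librosa

def dynamic_mfcc_3d(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Stack static MFCCs with first- and second-order deltas into a
    3-channel "image", one plausible reading of 3D dynamic MFCC (3D-MFCC)
    features. The same recipe applied to a Mel-spectrogram would give
    3D-Mel features."""
    y, sr = librosa.load(path, sr=None)                       # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    d1 = librosa.feature.delta(mfcc, order=1)                 # velocity (dynamic)
    d2 = librosa.feature.delta(mfcc, order=2)                 # acceleration
    return np.stack([mfcc, d1, d2], axis=0)                   # (3, n_mfcc, frames)

# Usage: feats = dynamic_mfcc_3d("ship_recording.wav"); feed to the CNN-LSTM model.
```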
Abstract: Current research on multi-label text classification integrates label information. However, in the field of sentiment analysis, existing methods often overlook the correlations of labels based on the intensity and polarity of the emotions themselves, which are crucial for accurate classification. To address these issues, this study proposes the MGE-BERT model, which features multi-label interaction, graph enhancement, and emotion perception. The model first sorts sentiment labels according to sentiment intensity correlations and hierarchy and then combines these sorted labels with the text data as inputs to the BERT model. During this process, syntactic analysis techniques and sentiment lexicons are employed, and intricate dependency and emotion graphs are built through a unique graph construction method. To further deepen the integration of label information and text features, the study feeds BERT outputs into a graph convolutional network (GCN), enabling it to capture and transmit contextual relationships between nodes more precisely. Experimental results demonstrate that the proposed MGE-BERT model outperforms state-of-the-art models, improving Macro-F1 scores by 1.6% and 2.0% on the SemEval2018 Task-1C and GoEmotions datasets, respectively.
Abstract: Currently, most explainable multimodal fake news detection methods overlook further research on and utilization of explanation data and cross-modal features. As a result, while these explainable fake news detection methods provide explanations for model decisions, their detection performance does not surpass that of advanced multimodal detection methods. To address these issues, this study proposes an iterative explainable multimodal fake news detection framework. The method consists of a main model and an explanation module, both of which receive multimodal news as input. First, the explanation module uses the explanation data calculated by the DeepLIFT algorithm as one of the inputs to the main model, contributing to the decision-making process. Next, the main model calculates cross-modal relevant features and cross-modal supplementary features through a multi-task network framework. It refines the cross-modal supplementary features by re-weighting them with the coarse prediction scores from the cross-modal relevant features and combines multiple features to make the final model decision. Finally, the explanation module is trained by transferring decision knowledge from the main model through knowledge distillation. The main model and the explanation module are trained alternately, forming an iterative framework that enhances detection performance while providing decision explanations. Extensive experiments on two publicly available fake news detection datasets demonstrate that the proposed method outperforms state-of-the-art multimodal fake news detection methods.
Abstract: Learning-based multi-view stereo matching algorithms have achieved remarkable results but still suffer from limited convolutional receptive fields and neglect of image frequency information, which leads to insufficient matching performance on low-texture, repetitive, and non-Lambertian surfaces. To address these problems, this study proposes CAF-MVSNet, a context-enhanced and image-frequency-guided multi-view stereo matching network. First, a context enhancement module is fused into the feature pyramid network in the feature extraction stage to effectively expand the receptive field of the network. Then an image-frequency-guided attention module is introduced to obtain information on the lines, shapes, textures, and colors of the images by encoding their different frequencies, which strengthens long-range contextual connections and further addresses the accurate matching of low-texture, repetitive, and non-Lambertian surfaces for reliable feature matching. Experimental results on the DTU dataset show that CAF-MVSNet reduces the overall error by 12.3% compared with the classical cascade model CasMVSNet, demonstrating excellent performance. In addition, good results are achieved on the Tanks and Temples dataset, reflecting the good generalization ability of CAF-MVSNet.
Abstract: This study improves the UnifiedGesture model to enhance the realism of audio-driven human body animation generation. Firstly, an encoder-decoder architecture is introduced to extract facial features from audio, compensating for the deficiencies of the original model in facial expression generation. Secondly, the cross-local attention mechanism and the multi-head attention mechanism based on Transformer-XL are combined to enhance the temporal dependency within long sequences. Simultaneously, the vector quantized variational autoencoder (VQVAE) is utilized to integrate and generate full-body motion sequences, enhancing the diversity and integrity of the generated motions. Finally, experiments are conducted on the BEAT dataset. The quantitative and qualitative analysis results demonstrate that the improved UnifiedGesture-F model achieves a significant improvement in the synchronicity between audio and human body movements as well as in overall realism compared with the original model.
Abstract: The TensorGCN model is one of the state-of-the-art (SOTA) applications of graph neural networks in text classification. However, in processing text semantic information, the long short-term memory (LSTM) network used by the model has difficulty fully extracting the semantic features of short texts and performs poorly on complex semantic information. At the same time, because long texts contain a large number of semantic and syntactic features, feature sharing is incomplete when heterogeneous information is shared among graphs, which affects the accuracy of text classification. To solve these two problems, the TensorGCN model is improved, and a text classification method based on a tensor graph convolutional network fusing BERT and the self-attention mechanism (BTSGCN) is proposed. Firstly, BERT replaces the LSTM module in the TensorGCN architecture for semantic feature extraction. It captures the dependencies between words by considering the words on both sides of a given word, thus extracting the semantic features of short texts more accurately. Then, the self-attention mechanism is added during propagation among graphs to help the model better capture the features among different graphs and complete the feature fusion. Experimental results on the MR, R8, R52, and 20NG datasets show that BTSGCN achieves higher classification accuracy than other comparison methods.
Abstract: Core images, as a crucial digital image resource in the fields of geology, oil, and gas, are essential for scientific research and engineering practice. Their security is often ensured by adding digital watermarks. During digitization, core images frequently undergo JPEG compression when they are stored, transmitted, or published on Web pages. However, existing deep learning-based image watermarking algorithms still have significant shortcomings in visual quality and robustness under JPEG compression. This study proposes an end-to-end robust image watermarking algorithm to address robust watermark embedding in core images under JPEG compression. To efficiently integrate the features of the host image and the watermark, the study introduces a pyramid efficient multi-scale attention (PEMA) module. Through a unique cross-spatial interaction strategy and channel-wise relationship construction, the module effectively captures long-range dependencies in different directions and feature information at various scales. To achieve visual imperceptibility, the study embeds the digital watermark into the low-frequency components of the host image using the discrete wavelet transform (DWT) and introduces the DWT LL sub-band loss (DLL) function to improve the visual quality of the watermarked image. Experimental results demonstrate that the proposed algorithm outperforms existing mainstream algorithms in both robustness against JPEG compression and visual imperceptibility.
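The LL-sub-band idea can be illustrated with a few lines of pywt: decompose the host image, perturb the low-frequency band with the watermark, and invert the transform. This additive scheme, the Haar wavelet, and the strength parameter alpha are illustrative stand-ins for the paper’s learned encoder.

```python
import numpy as np
import pywt

def embed_ll(host: np.ndarray, watermark_bits: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Minimal additive DWT-domain embedding sketch: the watermark is spread
    over the LL sub-band, whose coefficients survive JPEG compression best.
    The paper's method uses a trained network; this only shows the
    LL-sub-band principle it builds on."""
    ll, (lh, hl, hh) = pywt.dwt2(host.astype(np.float64), "haar")
    wm = np.resize(watermark_bits * 2 - 1, ll.shape)   # map {0,1} bits to {-1,+1}, tiled
    ll_marked = ll + alpha * wm                        # perturb low-frequency band only
    return pywt.idwt2((ll_marked, (lh, hl, hh)), "haar")

host = np.random.randint(0, 256, (128, 128))           # stand-in for a core image
bits = np.random.randint(0, 2, 64)
marked = embed_ll(host, bits)
print(marked.shape)                                    # (128, 128)
```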
Abstract: The Transformer, relying on a self-attention mechanism, exhibits remarkable performance in image super-resolution reconstruction. Nevertheless, the self-attention mechanism also brings a very high computational cost. To address this issue, a lightweight image super-resolution reconstruction model based on a hybrid generalized Transformer is proposed. The model is built on the SwinIR network architecture. Firstly, the rectangular window self-attention (RWSA) mechanism is adopted, which uses horizontal and vertical rectangular windows with different heads instead of the traditional square window pattern, integrating features across different windows. Secondly, the recursive generalized self-attention (RGSA) mechanism is introduced to recursively aggregate input features into representative feature maps, followed by cross-attention to extract global information. Meanwhile, RWSA and RGSA are alternately combined to make more effective use of global context. Finally, to activate more pixels for better recovery, the channel attention mechanism and the self-attention mechanism are used in parallel to extract features from the input image. Test results on five benchmark datasets show that this model achieves better reconstruction performance while keeping the parameter count light.
Abstract: Explainable recommendation algorithms utilize behavioral and other relevant information to not only generate recommendation results but also provide recommendation explanations, thereby increasing the transparency and credibility of recommendations. Traditional explainable recommendation algorithms are often limited to analyzing rating and text data and fail to fully utilize data such as images. They also do not consider effective fusion methods between modalities, making it difficult to fully unearth the intrinsic relationships between different modalities. An explainable recommendation model that fuses multimodal features is proposed to address these issues. The model improves the quality and personalization of recommendation explanations from a multimodal perspective through feature fusion technology. Firstly, a multimodal feature extraction method is designed based on the CLIP image encoder and text encoder to extract the text and image features of users and items, respectively. Secondly, cross-attention is used to achieve cross-modal fusion of text and images, enhancing the semantic correlation between modalities. Finally, multimodal information is combined with interaction information to jointly optimize modal alignment, rating prediction, and explanation generation. Experimental results show that the proposed method exhibits significant advantages on three multimodal recommendation datasets, especially in improving explanation quality.
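The cross-modal fusion step can be sketched with a standard cross-attention block in which text tokens attend to image patches (the mirrored direction works the same way). The 512-dimensional features stand in for CLIP encoder outputs; the head count, residual-plus-norm wiring, and tensor shapes are assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion sketch: query = text tokens, key/value = image
    patches, so each text token gathers visually relevant evidence."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(text_feats, image_feats, image_feats)
        return self.norm(text_feats + fused)   # residual connection + layer norm

text = torch.rand(2, 16, 512)    # (batch, text tokens, dim), stand-in for CLIP text features
image = torch.rand(2, 49, 512)   # (batch, image patches, dim), stand-in for CLIP image features
print(CrossModalFusion()(text, image).shape)  # torch.Size([2, 16, 512])
```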
Abstract: Federated learning is a distributed machine learning technique that allows participants to train models locally and upload updates to a central server. The central server aggregates the updates to generate a better global model, ensuring data privacy and solving the problem of data silos. However, gradient aggregation relies on a central server, which may introduce a single point of failure, and the central server itself is a potential malicious attacker. Therefore, federated learning needs to be decentralized. Existing decentralized solutions ignore external adversaries and the performance bottlenecks caused by data communication. To address these issues, this study proposes a decentralized federated learning method that considers external adversaries. The method applies Shamir’s secret sharing scheme to divide model updates into multiple shares to protect gradient privacy. It proposes a flooding consensus protocol that randomly selects a participant as the central server in each round to complete global aggregation, efficiently achieving the decentralization of federated learning. At the same time, the method introduces BLS aggregate signatures to prevent external adversary attacks and improve verification efficiency. Theoretical analysis and experimental results indicate that the method is secure and efficient, achieving higher efficiency than similar federated learning methods.
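To make the secret-sharing step concrete, the sketch below implements the textbook Shamir primitive over a prime field: a model-update value (assumed quantized to an integer) is split into n shares, any t of which reconstruct it by Lagrange interpolation. The field modulus and the (t, n) choice are illustrative; the paper’s full protocol, flooding consensus, and BLS signatures are not reproduced here.

```python
import random

PRIME = 2**61 - 1  # Mersenne prime field modulus; updates assumed quantized to ints

def share(secret: int, t: int, n: int):
    """Split `secret` into n Shamir shares with threshold t: evaluate a random
    degree-(t-1) polynomial with f(0) = secret at x = 1..n."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):          # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

s = share(123456, t=3, n=5)
print(reconstruct(s[:3]))  # 123456: any 3 of the 5 shares recover the value
```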
Abstract: Terrain classification is a crucial research direction in remote sensing imagery. The technology of joint hyperspectral image and LiDAR data classification has drawn much attention in recent years. The classification performance of existing deep learning models significantly depends on the richness and quality of labeled samples, which often poses a major challenge in practical applications. In addition, many models fail to effectively utilize the complementary information between hyperspectral images and LiDAR data. To solve these problems, this study proposes a semi-supervised double-branch classification network with cross-modal channel weight adjustment. Through the attention mechanism, the similarity between two data channels is analyzed deeply, and the weight of each channel is adaptively adjusted accordingly. At the same time, a semi-supervised approach combining consistency regularization and pseudo-labeling is adopted to effectively utilize the information of unlabeled samples. In experiments on the joint classification of hyperspectral images and LiDAR data on the two iconic joint datasets of Houston and MUUFL, the proposed method shows significant advantages over existing classification models, effectively improving classification accuracy and efficiency.
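The abstract pairs consistency regularization with pseudo-labeling but leaves the exact loss unspecified. A FixMatch-style reading, shown below, is one standard way to combine the two: confident predictions on a weakly augmented view become pseudo-labels for a strongly augmented view of the same sample. The 0.95 confidence threshold and the two-view setup are assumptions.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_weak: torch.Tensor, logits_strong: torch.Tensor,
                         threshold: float = 0.95) -> torch.Tensor:
    """Consistency regularization with confidence-filtered pseudo-labels
    (FixMatch-style sketch). Low-confidence samples are masked out so that
    only reliable pseudo-labels contribute to the unlabeled loss."""
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=-1)
        conf, pseudo = probs.max(dim=-1)          # pseudo-label = argmax class
        mask = (conf >= threshold).float()        # keep confident samples only
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()

weak = torch.randn(8, 10)     # logits on weakly augmented unlabeled samples
strong = torch.randn(8, 10)   # logits on strongly augmented views of the same samples
print(semi_supervised_loss(weak, strong))
```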
Abstract: To address the difficult balance between a global receptive field and efficient computation, as well as unclear details in image reconstruction, an attribute-guided network based on CNN and Mamba (CMANet) is proposed. Firstly, attribute information is introduced during reconstruction, and the interrelationships among these attributes are considered, which helps the model improve the reliability and accuracy of the whole reconstruction process. Secondly, an hourglass state space module is introduced to explore the key features of face images while maintaining the linear-complexity advantage of long-distance dependency modeling. Finally, an adaptive Mamba fusion module is introduced: as image features learn long-distance dependencies in multiple directions, attributes are adaptively supplemented in different directions, and the features supplemented in different directions are adaptively fused, making the model more flexible and efficient in processing diverse images. Extensive experiments prove the superiority of the proposed method.
Abstract: Object detection models for remote sensing images are often ineffective due to complex background interference and densely packed targets. To this end, this study improves the YOLOv5s object detection model. First, a mixed attention mechanism is utilized to improve the convolutional block attention module (CBAM), which is added to the backbone network so that the extracted features contain both local and global information, enhancing the model’s ability to identify targets against complex backgrounds. Then the study uses the ultra-lightweight upsampler DySample to reduce model parameters and improve model performance. Finally, the study employs the EIoU loss function to improve the localization accuracy for targets to be detected. Experimental verification on the RSOD and DIOR datasets shows that the improved YOLOv5s achieves 7.8% higher accuracy than the original model in detecting targets in remote sensing images, meeting the real-time detection requirements for targets in remote sensing images. In addition, the improved model retains its advantages in comparison with other object detection models.
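The EIoU regression loss mentioned above augments IoU with explicit center-distance, width, and height penalties relative to the smallest enclosing box. The sketch below follows the published EIoU formulation; the (x1, y1, x2, y2) box format and the toy boxes are assumptions.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """EIoU for boxes in (x1, y1, x2, y2) format:
    1 - IoU + center distance / enclosing diagonal
            + width and height gaps / enclosing width and height."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box and its squared diagonal.
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1
    c2 = cw**2 + ch**2 + eps

    # Squared distance between box centers.
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    rho2 = dx**2 + dy**2

    # Width and height difference terms.
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    return (1 - iou + rho2 / c2 + dw**2 / (cw**2 + eps) + dh**2 / (ch**2 + eps)).mean()

pred = torch.tensor([[0., 0., 10., 10.]])
gt = torch.tensor([[1., 1., 11., 11.]])
print(eiou_loss(pred, gt))
```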
Abstract: Aiming at noise and pseudo-edge interference in the edge extraction of automobile coating images caused by the complex environment and uneven lighting in production plants, an edge extraction algorithm for automobile coating images with an improved Canny operator is proposed. Firstly, the algorithm adopts a cascade filter composed of a multi-level median rational hybrid filter and a guided filter to denoise and smooth the image, retaining the target edge information during noise reduction. Secondly, the improved Sobel operator convolution template is applied to extract gradient vectors in four directions (horizontal, vertical, 45°, and 135°), so as to improve edge localization accuracy. Finally, in the edge connection stage, the improved Otsu method (maximum interclass variance method) is used to select the high and low thresholds, increasing the adaptability of the algorithm. Experimental results show that in terms of image denoising, compared with traditional median filtering, the algorithm ensures that the peak signal-to-noise ratio of the denoised image is higher than 35 dB and the structural similarity is greater than 0.9; the overall peak signal-to-noise ratio increases by more than 6%, and the structural similarity improves by more than 6.5%. In terms of edge extraction, the algorithm effectively reduces pseudo-edge interference and achieves a high degree of edge connectivity.
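As a point of reference for the adaptive thresholding step, the classic Otsu-Canny pairing is sketched below: Otsu’s threshold serves as Canny’s high threshold and half of it as the low one. The paper’s improved Otsu variant and its cascade filter are not reproduced; the input file name, the median-blur pre-step, and the 0.5 ratio are illustrative.

```python
import cv2

def otsu_canny(gray):
    """Adaptive Canny thresholds from Otsu's method: the Otsu threshold is
    taken as the high threshold and half of it as the low one. A classic
    baseline for the adaptive high/low threshold selection described above."""
    high, _ = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.Canny(gray, 0.5 * high, high)

img = cv2.imread("coating.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
if img is not None:
    edges = otsu_canny(cv2.medianBlur(img, 5))  # simple denoise, then extract edges
    cv2.imwrite("edges.png", edges)
```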
Abstract: In response to the problem that current plug-and-play image restoration methods cannot accurately model image degradation in blind restoration tasks such as low-light image enhancement, this study constructs a solution that combines a plug-and-play splitting algorithm with a guided diffusion model. The solution avoids directly solving the complex data sub-problems caused by complex degradation models. Instead, it uses real degraded images to solve the data sub-problems and takes their solutions as “anchor points” to indirectly constrain and optimize the solving of the prior sub-problems. This ensures that the restoration results approximate the real restoration target more closely. The method is validated on multiple public datasets. The results show that the proposed algorithm achieves an average improvement of 4.89% in PSNR and 9.48% in SSIM over current representative methods. Experiments prove that the proposed method performs better in restoration metrics, validating its effectiveness.
Abstract: The subway system is a core component of urban transportation, so improving its safety and efficiency is of great significance for protecting passengers’ lives and property. Pedestrian gate-breaking behavior can not only cause equipment damage and traffic delays but also threaten the safety of other passengers. Therefore, accurately detecting and recognizing the behavior of pedestrians breaking through subway gates has become an important task in intelligent transportation management. This study proposes a pedestrian gate-breaking threat detection algorithm. Firstly, the algorithm adopts mobile network convolution modules in the feature extractor of the RAFT optical flow method and adds the ECA channel attention mechanism. At the same time, a 3D structure is used in the correlation volume building block and the field radius is reduced, so as to reduce the number of model parameters and improve the detection speed. Experimental results show that the average endpoint error of the proposed algorithm for pedestrian detection is 0.79, the detection speed reaches 55.98 frames per second, and the number of model parameters is reduced by 35.3%. To obtain the threat value of passengers breaking through subway gates, this study uses the improved optical flow method to calculate the motion information between adjacent frames and combines it with the proposed gate-breaking threat calculation formula to obtain the threat value of passengers in the current frame. The method meets the requirements of real-time performance, accuracy, and lightweight design and can be effectively deployed, better meeting the engineering requirements of pedestrian threat detection and emergency management for large passenger flows within stations.
Abstract: To solve the problems that denial of service (DoS) attacks in the Internet of Vehicles are difficult to prevent and the existing supervised learning methods cannot effectively detect zero-day attacks, this study proposes a hybrid DoS attack intrusion detection system. Firstly, the dataset is preprocessed to improve data quality. Secondly, feature selection is used to filter out redundant features, which aims to obtain more representative features. Thirdly, the ensemble learning method is used to integrate five tree-based supervised classifiers through stacking to detect known DoS attacks. Finally, an unsupervised anomaly detection method is proposed, which combines the convolutional denoising autoencoder with the attention mechanism to establish a normal behavior model. It is used to detect unknown DoS attacks that are missed by stacking ensemble models. Experimental results show that for the detection of known DoS attacks, the detection accuracy of the proposed system on the Car-Hacking dataset and the CICIDS2017 dataset is 100% and 99.967%, respectively. For the detection of unknown DoS attacks, the detection accuracy of the proposed system on the above two datasets is 100% and 83.953%, respectively, and the average test time on the two datasets is 0.072 ms and 0.157 ms, respectively, which verifies the effectiveness and feasibility of the proposed system.
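The stacking stage of the system above can be pictured with scikit-learn’s StackingClassifier: five tree-based base learners feed a meta-learner trained on their out-of-fold predictions. The particular five models, the logistic-regression meta-learner, and the synthetic stand-in data are assumptions, since the abstract names the strategy but not the exact classifiers.

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, HistGradientBoostingClassifier,
                              StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Five tree-based base learners stacked under a meta-learner.
base = [("dt", DecisionTreeClassifier()),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("et", ExtraTreesClassifier(n_estimators=100)),
        ("gb", GradientBoostingClassifier()),
        ("hgb", HistGradientBoostingClassifier())]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000))

# Synthetic data stands in for the preprocessed, feature-selected traffic records.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
stack.fit(X_tr, y_tr)
print(f"accuracy: {stack.score(X_te, y_te):.3f}")
```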
Abstract: To address the problems that the traditional artificial potential field (APF) does not fully consider the variability of vehicle collision avoidance risk distribution and that falling into local extrema leads to path planning failure, this study proposes an adaptive elliptic-scope APF based on a gradient statistical mutation quantum genetic algorithm (GSM-QGA). Building on the traditional circular scope of the repulsive field, the study designs a calculation method for a dynamic elliptic scope of the repulsive potential field by analyzing the relative motion state of vehicles and obstacles. At the same time, through analysis of the factors influencing the potential field function, the velocity factor is introduced to complete the design of the repulsive and gravitational potential field functions. The GSM-QGA is used as the local-optimum correction strategy for the improved artificial potential field. When the vehicle falls into a local extremum and moves back and forth, a pseudo-global map is constructed according to the current position of the vehicle, and a feasible path is planned to jump out of the local extremum range. The simulation results show that the path planned by the improved algorithm not only effectively prevents vehicles from getting stuck in local extrema and reduces unnecessary obstacle avoidance operations but also outperforms the traditional APF algorithm and the APF algorithm based on a fixed elliptic scope in terms of path smoothness and path length. The length of the planned path is shortened by 6.37% and 9.14%, respectively.
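For orientation, the classical circular-scope APF that the paper improves on can be written in a dozen lines: an attractive pull toward the goal plus a repulsive push inside a fixed radius around each obstacle. All gains, radii, and positions below are illustrative; note that placing the obstacle exactly on the start-goal line reproduces the local-extremum failure that the paper’s GSM-QGA strategy is designed to escape.

```python
import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, rho0=5.0, step=0.1):
    """One descent step of the classical APF baseline with a circular
    repulsive scope. The paper replaces the circle with a velocity-dependent
    ellipse and adds a GSM-QGA escape strategy, neither reproduced here."""
    force = -k_att * (pos - goal)                     # attractive: pull toward goal
    for obs in obstacles:
        d = pos - obs
        rho = np.linalg.norm(d)
        if 0 < rho < rho0:                            # repulsion only inside the scope
            force += k_rep * (1 / rho - 1 / rho0) / rho**2 * (d / rho)
    return pos + step * force / (np.linalg.norm(force) + 1e-9)

pos, goal = np.array([0.0, 0.0]), np.array([10.0, 10.0])
obstacles = [np.array([5.0, 4.5])]                    # slightly off the direct line
for _ in range(200):
    pos = apf_step(pos, goal, obstacles)
print(pos)  # approaches the goal while skirting the obstacle
```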
Abstract: Digital watermarking algorithms attract widespread attention due to their important applications in copyright protection, content authentication, and data hiding. In practical applications, images with embedded watermarks are often affected by differentiable noise such as image distortion, sharpening, and blurring, while also facing interference from non-differentiable noise such as JPEG compression and transmission errors. Existing studies mostly focus on scheme design for a single noise environment or attempt to use differentiable models to approximate non-differentiable noise, which limits the robustness of watermarking algorithms to a certain extent. To solve this problem, this study proposes an end-to-end one-stage digital watermarking scheme based on an invertible neural network. The scheme uses an invertible neural network to simulate non-differentiable noise, enhancing the algorithm’s adaptability and robustness to real noisy environments. Compared with existing algorithms, this algorithm improves the peak signal-to-noise ratio (PSNR) by 3.12 dB and the average extraction accuracy (ACC) by 35.36% in the case of multiple superposed noises.
Abstract: 3D object recognition and detection based on point clouds is an important research topic in computer vision and autonomous navigation. Deep learning algorithms have greatly improved the accuracy and robustness of 3D point cloud classification, but deep learning networks usually suffer from complex network structures and time-consuming training. This study proposes a 3D point cloud classification network named Point-GBLS, which combines deep learning and a broad learning system. The network structure is simple, and the training time is short. Firstly, point cloud features are extracted by a deep learning-based feature extraction network. Then, an improved broad learning system is used to classify them. Experiments on the ModelNet40 and ScanObjectNN datasets show that the recognition accuracy of Point-GBLS exceeds 92% and 78%, respectively, with a training time less than 50% of that of similar deep learning methods. It is superior to deep learning networks with the same backbone.
Abstract: To address unclear boundaries and incoherent, incomplete, or even missing segmentation results in the semantic segmentation of colon polyp images, a colon polyp image segmentation network based on multi-scale features and contextual aggregation (MFCA-Net) is proposed. The network selects PVTv2 as the backbone for feature extraction. The multi-scale feature complement module (MFCM) is designed to extract rich multi-scale local information and reduce the influence of polyp morphology changes on segmentation results. The global information enhancement module (GIEM) is designed, in which a large-kernel depthwise convolution embedded with positional attention is constructed to accurately locate polyps and improve the network’s ability to distinguish complex backgrounds. The high-level semantic-guided context aggregation module (HSCAM) is designed to guide local features with global features, complement their differences, and cross-fuse shallow details with deep semantic information to improve the coherence and integrity of segmentation. The boundary perception module (BPM) is designed to optimize boundary features by combining traditional image processing and deep learning methods, achieving fine-grained segmentation and clearer boundaries. Experiments show that the proposed network obtains higher mDice and mIoU scores than current mainstream algorithms on the publicly available colon polyp image datasets Kvasir, ClinicDB, ColonDB, and ETIS and has higher segmentation accuracy and robustness.
Abstract: The matrix factorization model is one of the classic models in recommendation systems. It can be used to predict users’ ratings on items and then make recommendations to users to improve user experience. Current matrix factorization models cannot effectively extract the local similarity relationships between users, which leads to poor rating prediction and the cold start problem. With the development of social networks, the trust relationship between users has become an important research tool for recommendation systems. Therefore, this study proposes a local Bayesian probabilistic matrix factorization model based on user trust relationships (TLBPMF) for rating prediction. The model studies users’ ratings by combining their trust relationship information, identifies user groups with similar preferences, and clusters them. According to the clustering results, rating submatrices are obtained, and a probabilistic matrix factorization model is established for each submatrix to deeply explore the local similarity relationships between users. The parameters of this model are estimated by the Gibbs sampling algorithm. A rating dataset from a film website is selected for experiments. The results show that the model is superior to the benchmark model in prediction accuracy and performs better on cold start users.
Abstract: The audio-visual event localization (AVEL) task locates events in a video by observing audio information and corresponding visual information. In this paper, a cross-modal time alignment network named CMTAN is designed for the AVEL task. The network consists of four parts: preprocessing, cross-modal interaction, time alignment, and feature fusion. Specifically, in the preprocessing part, the background and noise in the modal information are reduced by the processing of a new cross-modal audio guidance module and a noise reduction module. Then, in the cross-modal interaction part, the information reinforcement and information complementation modules based on the multi-head attention mechanism are used for cross-modal interaction, and the unimodal information is optimized with global information. In the time alignment part, a time alignment module focusing on the unimodal global information before and after cross-modal interaction is designed to perform feature alignment of modal information. Finally, in the feature fusion process, two kinds of modal information are fused from shallow to deep by a multi-stage fusion module. The fused modal information is ultimately used for event localization. Extensive experiments demonstrate that CMTAN has excellent performance in both weakly and fully supervised AVEL tasks.
Abstract: In research on low-light image enhancement, although existing technologies make progress in improving image brightness, insufficient detail restoration and color distortion still persist. To tackle these problems, this study introduces a dual-attention Retinex-based Transformer network, DARFormer. The network consists of an illumination estimation network and a corruption restoration network and aims to enhance the brightness of low-light images while preserving more details and preventing color distortion. The illumination estimation network uses an image prior to estimate the brightness mapping, which is used to enhance the brightness of low-light images. The corruption restoration network optimizes the quality of the brightness-enhanced image, employing a Transformer architecture with spatial attention and channel attention. Experiments on the public datasets LOL_v1, LOL_v2, and SID show that, compared with prevalent enhancement methods, DARFormer achieves better enhancement results in quantitative and qualitative indicators.
Abstract: With the development of information technology, back-translation plagiarism, such as that carried out with translation tools, has become increasingly complex and covert, posing higher requirements for plagiarism detection methods. For this reason, a plagiarism detection method based on prompt engineering is proposed. This method guides a large language model (LLM) to pay attention to potential semantic-level similarities between sentences by designing prompt words, which can effectively identify highly semantically similar content. Firstly, existing plagiarism detection technologies and applications of prompt engineering are reviewed. On this basis, a back-translation plagiarism detection process based on prompt engineering is designed. Secondly, a prompt template is designed, and a plagiarism detection index based on the sentence compression ratio is proposed by merging and condensing the pairs of sentences to be detected. Finally, experiments demonstrate that the proposed method has significant advantages over traditional methods in detecting back-translation plagiarism.
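One plausible reading of the compression-ratio index is sketched below: an LLM is prompted to merge the two candidate sentences, and the shorter the merge is relative to the pair, the more the sentences overlap semantically. The template wording, the ratio definition, and the decision threshold are all illustrative assumptions; the LLM call itself is omitted since no specific API is prescribed.

```python
PROMPT_TEMPLATE = (
    "Merge the following two sentences into one sentence that preserves all "
    "distinct information and removes any repetition:\n"
    "Sentence 1: {a}\nSentence 2: {b}\nMerged:"
)

def compression_ratio(a: str, b: str, merged: str) -> float:
    """If two sentences say the same thing, their merge is barely longer than
    either one alone, so the ratio drops; independent sentences keep the ratio
    near 1."""
    return len(merged) / (len(a) + len(b))

def is_back_translation_plagiarism(a: str, b: str, merged: str,
                                   threshold: float = 0.6) -> bool:
    return compression_ratio(a, b, merged) < threshold

# `merged` would come from an LLM given PROMPT_TEMPLATE.format(a=..., b=...).
a = "The cat sat quietly on the warm windowsill."
b = "A cat was sitting in silence on the warm sill of the window."
merged = "The cat sat quietly on the warm windowsill."
print(is_back_translation_plagiarism(a, b, merged))  # True: heavy overlap detected
```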
Abstract: Single hyperspectral image super-resolution (HSISR) for remote sensing has made considerable progress in recent years, and methods based on deep convolutional neural networks (CNNs) are widely employed. However, most CNN-based super-resolution models tend to ignore the spectral structure of remote sensing hyperspectral images. Meanwhile, because convolutional networks are limited by the size of their kernels, long-distance feature dependencies are ignored, which in turn affects reconstruction accuracy. To solve these problems, this study proposes a dual-branch remote sensing hyperspectral image super-resolution network based on grouped ConvLSTM and Transformer (DGCTNet), which combines the advantages of the Transformer in capturing long-distance dependencies and of ConvLSTM in extracting sequential features. It enhances the reconstructed image by extracting spatial features while maintaining spectral orderliness. In addition, DGCTNet designs an edge learning network to diffuse edge information into the image space. At the same time, to recalibrate the spectral response, the proposed dual-group level channel self-attention mechanism (DSA) is added. Experiments on the Houston dataset show that DGCTNet outperforms current state-of-the-art comparison models in quantitative evaluation metrics and visual quality across a wide variety of scenarios.
Abstract: To address the issue of image quality decline caused by existing reflection removal algorithms when handling complex scenes, this study proposes a color-aware dual-channel reflection removal algorithm. First, a background color generator is designed to accurately predict the background color information of an image, provide background support for the basic reflection removal network, and generate preliminary reflection removal results. Subsequently, a dual-channel reflection removal network is proposed to further optimize these preliminary results. Additionally, the algorithm designs a sparse Transformer module, a channel attention module, and a feature fusion module within the dual-channel reflection removal network, thereby enhancing the precision and effect of reflection removal. Experimental results demonstrate that this method performs excellently on the RRID and Flash datasets, effectively removing reflected light and significantly enhancing image realism.
Abstract: Faced with insufficient labeled data in the field of video quality assessment, researchers have begun to turn to self-supervised learning methods, aiming to learn video quality assessment models with the help of large amounts of unlabeled data. However, existing self-supervised learning methods primarily focus on video distortion types and content information while ignoring the dynamic information and spatiotemporal features of videos changing over time, which leads to unsatisfactory performance in complex dynamic scenes. To address these issues, a new self-supervised learning method is proposed. By taking playback speed prediction as an auxiliary pretraining task, the model can better capture dynamic changes and spatiotemporal features of videos; combined with distortion type prediction and contrastive learning, the model’s sensitivity to video quality differences is enhanced. At the same time, to capture the spatiotemporal features of videos more comprehensively, a multi-scale spatiotemporal feature extraction module is further designed to strengthen the model’s spatiotemporal modeling capability. Experimental results demonstrate that the proposed method significantly outperforms existing self-supervised learning-based approaches on the LIVE, CSIQ, and LIVE-VQC datasets. On the LIVE-VQC dataset, it achieves an average improvement of 7.90% and a maximum improvement of 17.70% in the PLCC index, and it also shows considerable competitiveness on the KoNViD-1k dataset. These results indicate that the proposed self-supervised learning framework effectively enhances the dynamic feature capture ability of video quality assessment models and exhibits unique advantages in processing complex dynamic videos.
Abstract: In response to challenges in crowd counting such as non-uniform head sizes, uneven crowd density distribution, and complex background interference, a convolutional neural network model that focuses on crowd regions and addresses multi-scale changes (multi-scale feature weighted fusion attention convolutional neural network, MSFANet) is proposed. The front end of the network adopts an improved VGG-16 model to perform coarse-grained feature extraction on the input crowd image. A multi-scale feature extraction module is added in the middle to extract multi-scale feature information, and an attention module then weights the multi-scale features. At the back end, a sawtooth-shaped dilated convolution module (see the sketch below) is adopted to enlarge the receptive field, extract detailed image features, and generate high-quality crowd density maps. Experiments on three public datasets show that on the ShanghaiTech Part B dataset, the mean absolute error (MAE) is reduced to 7.8 and the mean squared error (MSE) to 12.5; on the ShanghaiTech Part A dataset, the MAE is reduced to 64.9 and the MSE to 108.4; and on the UCF_CC_50 dataset, the MAE is reduced to 185.1 and the MSE to 249.8. These results affirm that the proposed model exhibits strong accuracy and robustness.
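The back-end idea can be sketched as a stack of dilated 3x3 convolutions with sawtooth rates: the receptive field grows without pooling, and cycling the rate avoids the gridding artifacts of a fixed dilation. The (1, 2, 3, 1, 2, 3) pattern, channel width, and depth are assumptions; the abstract specifies only the sawtooth shape.

```python
import torch
import torch.nn as nn

class SawtoothDilatedBlock(nn.Module):
    """Back-end sketch for density map regression: dilated 3x3 convolutions
    with sawtooth rates enlarge the receptive field while keeping spatial
    resolution, then a 1x1 head produces a single-channel density map."""
    def __init__(self, channels: int = 64):
        super().__init__()
        layers = []
        for rate in (1, 2, 3, 1, 2, 3):   # sawtooth dilation schedule
            layers += [nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv2d(channels, 1, 1)   # 1-channel crowd density map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))

feat = torch.rand(1, 64, 96, 96)            # stand-in for front-end features
print(SawtoothDilatedBlock()(feat).shape)   # torch.Size([1, 1, 96, 96])
```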