Neural machine translation can automatically translate the semantic information of multiple languages, and it has therefore been applied successfully to binary code similarity detection across instruction set architectures (ISAs). When sequences of assembly instructions are treated as sequences of textual tokens, the order of the instructions matters. For binary basic-block-level similarity detection, neural networks model instruction positions with position embeddings; however, such embeddings fail to capture the ordered relationships (e.g., adjacency and precedence) between instruction positions. To address this problem, this paper models the global absolute positions and the ordered relationships of assembly instructions with a continuous function of instruction positions, thereby generalizing word order embeddings. First, a source-ISA encoder is trained with a Transformer; second, a target-ISA encoder is trained with a triplet loss while the source-ISA encoder is fine-tuned; finally, the Euclidean distances between embedding vectors are mapped to [0, 1] and used as the similarity scores between basic blocks. Experiments on the public dataset MISA show that the proposed method achieves 69.5% on the P@1 metric, which is 4.6% higher than that of the baseline method MIRROR.
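The continuous-function position modeling described above can be illustrated with the sinusoidal positional encoding of the original Transformer, one standard instance of such a function. The abstract does not give the paper's exact formulation, so the following is an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Encode each instruction position with a continuous function.

    Each position p is mapped to a d_model-dimensional vector of smooth
    sin/cos functions of p, so adjacent positions receive similar vectors
    and relative order (precedence) is recoverable from the encodings.
    """
    positions = np.arange(seq_len)[:, None]                 # shape (seq_len, 1)
    dims = np.arange(d_model // 2)[None, :]                 # shape (1, d_model/2)
    angular_freq = 1.0 / (10000.0 ** (2 * dims / d_model))  # per-dimension frequency
    angles = positions * angular_freq                       # shape (seq_len, d_model/2)
    enc = np.empty((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                           # even dims: sine
    enc[:, 1::2] = np.cos(angles)                           # odd dims: cosine
    return enc

pe = sinusoidal_position_encoding(seq_len=16, d_model=64)
# Adjacent instruction positions yield closer encodings than distant ones,
# which is exactly the adjacency/precedence structure plain learned
# position embeddings do not guarantee.
near = np.linalg.norm(pe[3] - pe[4])
far = np.linalg.norm(pe[3] - pe[12])
```

Because the encoding is a continuous function of the position index, it also generalizes to sequence lengths unseen during training, unlike a lookup table of learned position embeddings.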
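The triplet-loss objective and the distance-to-similarity mapping from the pipeline can be sketched as follows. The margin value and the 1/(1+d) mapping into (0, 1] are illustrative assumptions; the abstract only states that Euclidean distances are mapped to [0, 1], not which mapping the paper uses:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on basic-block embeddings.

    Pulls the target-ISA embedding of a semantically equivalent basic
    block (positive) toward the source-ISA anchor, and pushes a
    non-equivalent block (negative) at least `margin` farther away.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def similarity(a, b):
    """Map the Euclidean distance between two embeddings into (0, 1].

    1/(1+d) is one common choice (an assumption here): identical
    embeddings give similarity 1.0, and the score decays toward 0
    as the distance grows.
    """
    return 1.0 / (1.0 + np.linalg.norm(a - b))
```

Under this scheme, cross-ISA basic blocks compiled from the same source code should end up with high similarity scores, and P@1 then measures how often the true match ranks first among the candidates.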