基于序列前缀技术的XML频繁路径挖掘算法

doi:10.15888/j.cnki.csa.006166

微信公众号

网站二维码

首页 > 过刊浏览>2018年第27卷第1期 >78-85. DOI:10.15888/j.cnki.csa.006166

PDF HTML阅读 XML下载导出引用引用提醒

基于序列前缀技术的XML频繁路径挖掘算法增强出版
DOI:
                        10.15888/j.cnki.csa.006166
                    
作者:
                        
                        
                    
作者单位:
作者简介:
通讯作者:
中图分类号:
基金项目:国家自然科学基金（61273293）

Prefix-Based XML Frequent Path Mining Algorithm

Author:

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献

相似文献

引证文献

增强出版

文章评论

摘要:

XML文档是半结构化数据，对其进行频繁路径挖掘可以分为两步：XML文档序列化和序列挖掘阶段. 现有的序列化方式将XML文档表示为Xpath路径集合，其中有大量的节点冗余；序列挖掘阶段采用的类Apriori算法需要多次扫描数据库并产生大量的候选集，采用的PrefixSpan算法会产生大量的投影数据库，占用较大的内存. 针对以往XML频繁路径挖掘算法存在的不足，本文提出一种高效的挖掘算法——基于序列前缀技术的XML频繁路径挖掘算法（PXFP，Prefix-based XML Frequent Path Mining Algorithm）. PXFP算法以广度优先方式遍历XML文档树并将每个节点表示为“节点：父节点”的形式，这种序列化的方式减少了节点冗余. 在序列挖掘阶段借鉴PrefixSpan 算法中前缀的概念，但不产生投影数据库，仅得到直接后缀（即前缀的子节点），通过记录频繁子路径的位置信息逐渐扩大频繁模式的长度，位置信息的引入减少了对数据库的扫描. 实验结果表明，PXFP算法取得了比PrefixSpan算法更高的时间和空间效率.

Abstract:

XML documents are semi-structured data, and XML frequent path mining can be divided into two steps: XML document serialization and sequence mining. The existing serialization method expresses the XML document as a set of Xpath paths with a plenty of node redundancy. Algorithms based on Apriori require multiple scanning of the database and can generate a large number of candidate sets. The PrefixSpan algorithm generates a large number of projection databases, occupying a lot of memory space. In view of the shortcomings of the existing algorithms used in XML frequent path mining, this paper proposes an efficient mining algorithm called Prefix-based XML Frequent Path Mining Algorithm (PXFP). The PXFP algorithm traverses the XML document tree in a breadth-first manner and represents each node as “node: parent node”, which reduces the node redundancy. The PXFP does not generate the projection database, but only gets the sub-node of the prefix, and then increases the length of the frequent pattern by the position information of the frequent sub-path, which reduces scanning the database. The experimental results show that the PXFP algorithm achieves higher time and space efficiency than the PrefixSpan algorithm.

参考文献

相似文献

引证文献

引用本文

张洁,毛国君.基于序列前缀技术的XML频繁路径挖掘算法.计算机系统应用,2018,27(1):78-85

复制

文章指标

点击次数:
下载次数:
HTML阅读次数:
引用次数:

历史

收稿日期:2017-04-09
最后修改日期:2017-05-09
录用日期:
在线发布日期: 2017-12-22
出版日期:

微信公众号

网站二维码

引用本文

分享

文章指标

历史