Published by Data Mining and Knowledge Discovery on April 14, 2004
Ben Kao, Minghua Zhang, Chi Lap Yip, David W. Cheung, Usama M. Fayyad
Data Mining and Knowledge Discovery volume 10, pages 87–116 (2005)
Abstract. We study two problems: (1) mining frequent sequences from a transactional database, and (2) incremental update of frequent sequences when the underlying database changes over time. We review existing sequence mining algorithms including GSP, PrefixSpan, SPADE, and ISM. We point out the large memory requirement of PrefixSpan, SPADE, and ISM, and evaluate the performance of GSP. We discuss the high I/O cost of GSP, particularly when the database contains long frequent sequences. To reduce the I/O requirement, we propose an algorithm MFS, which could be considered as a generalization of GSP. The general strategy of MFS is to first find an approximate solution to the set of frequent sequences and then perform successive refinement until the exact set of frequent sequences is obtained. We show that this successive refinement approach results in a significant improvement in I/O cost. We discuss how MFS can be applied to the incremental update problem. In particular, the result of a previous mining exercise can be used (by MFS) as a good initial approximate solution for the mining of an updated database. This results in an I/O efficient algorithm. To improve processing efficiency, we devise pruning techniques that, when coupled with GSP or MFS, result in algorithms that are both CPU and I/O efficient.
Keywords: data mining, sequence, incremental update.
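
To make the I/O concern in the abstract concrete, the following is a minimal sketch of a level-wise, GSP-style miner. It is not the paper's MFS algorithm and not a faithful GSP implementation: it assumes each sequence element is a single item rather than an itemset, ignores time constraints, and omits GSP's candidate pruning step. The function names and the `min_count` parameter are hypothetical and chosen only for illustration. The point it shows is that every candidate length costs one full pass over the database.

```python
def is_subsequence(candidate, sequence):
    """Return True if candidate's items occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in candidate)


def count_support(candidates, database):
    """One pass over the database: count the sequences containing each candidate."""
    counts = {c: 0 for c in candidates}
    for sequence in database:
        for c in candidates:
            if is_subsequence(c, sequence):
                counts[c] += 1
    return counts


def levelwise_frequent_sequences(database, min_count):
    """Simplified level-wise mining (assumption: single-item elements).

    Returns (frequent sequences mapped to support counts, number of passes).
    """
    items = sorted({item for seq in database for item in seq})
    candidates = [(i,) for i in items]
    frequent = {}
    passes = 0
    while candidates:
        counts = count_support(candidates, database)  # one database scan
        passes += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: a and b combine when a without its first item
        # equals b without its last item.
        candidates = sorted(
            {a + (b[-1],) for a in level for b in level if a[1:] == b[:-1]}
        )
    return frequent, passes


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c"), ("b", "c")]
    found, passes = levelwise_frequent_sequences(db, min_count=2)
    print(found, passes)
```

In this sketch the database is scanned once per candidate level, so the number of passes grows with the length of the longest frequent sequence. That is the cost the abstract attributes to GSP, and the one that starting from an approximate solution and refining it is meant to reduce.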