Using Background Knowledge to Rank Itemsets

Abstract

Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since commonly many of the discovered patterns can be easily explained by background knowledge. The simplest approach to screen uninteresting patterns is to compare the observed frequency against the independence model. Since the parameters for the independence model are the column margins, we can view such screening as a way of using the column margins as background knowledge.

In this paper we study techniques for more flexible approaches for infusing background knowledge. Namely, we show that we can efficiently use additional knowledge such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining.

To infuse the information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
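
The baseline screening described in the abstract, comparing an itemset's observed frequency against the independence model, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `independence_prediction`, the toy matrix, and the use of NumPy are assumptions made only for illustration. The sketch covers just the column-margin baseline, not the row margins, lazarus counts, or maximum entropy models the paper studies.

```python
import numpy as np

def independence_prediction(data, itemset):
    """Return (observed, expected) frequency of an itemset.

    Hypothetical helper: `data` is a binary matrix (rows are transactions,
    columns are items) and `itemset` is a list of column indices.
    """
    data = np.asarray(data, dtype=bool)
    cols = list(itemset)
    # Column margins: fraction of rows in which each item occurs.
    margins = data[:, cols].mean(axis=0)
    # Independence model: predicted frequency is the product of the margins.
    expected = float(np.prod(margins))
    # Observed frequency: fraction of rows containing every item in the set.
    observed = float(data[:, cols].all(axis=1).mean())
    return observed, expected

# Toy data (made up): items 0 and 1 always occur together.
toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

observed, expected = independence_prediction(toy, [0, 1])
print(observed, expected)
```

An itemset whose observed frequency deviates strongly from the expected value is not explained by the column margins alone, which is the sense in which the margins act as background knowledge.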
