Mining Association Rules Between Sets of Items in Large Databases (1993)
March 2, 2024Mining Association Rules between Sets of Items in Large Databases presents one of the earliest approaches for association rules mining. An association rule is an implication of the form \(X \Rightarrow I_j\) where \(X\) is a set of some items, and \(I_j\) is an item that is not in \(X\). For instance, if transactions buying bread and butter also buy milk, that forms an association rule. The paper introduces a method to find association rules and measure confidence and support factors from transactions. The confidence factor measures how many transactions that satisfy \(X\) also satisfy \(I_j\). The support factor of an association rule is defined to be the fraction of transactions that contain the union of items in the rule. The support factor implies statistical significance of the association rule.
The introduced approach takes transactions and a threshold of support. The following pseude code shows the LargeItemsets algorithm. The method ignores the rules with supports smaller than the threshold, and terminates within a feasible timeframe. The “large” in the algorithm denotes to the support factors exceeding the threshold. It iterates over transactions.
procedure LargeItemsets(database tuples, minsupport)
let L = {} // Large set
let F = {{}} // Frontier set
while F != {}:
// make a pass over the database
let C = {} // Candidate set
for t in database tuples:
for f in F:
if t.contains(f):
let C_f = []
Extend(f, t, C_f)
for c_f in C_f:
if c_f in C:
c_f.count++
else:
c_f.count = 0
C += c_f
// consolidate
let F = {}
for c in C:
if c.count / size(database tuples) > minsupport:
L += c
if c should be used as a frontier in the next pass:
F += c
return L
procedure Extend(X: itemset, t: tuple, results: List[itemset])
let item I_j such that forall I_l in X, I_j >= I_l
for t in I_k such that I_k > I_j:
results.append([X, I_k])
if (XI_k) is exptected to be large:
Extend(XI_k, t, results)
Given f is a set of items that can form an association rule when another item is added, the frontier set F is a collection of such f.
F is initialized as an empty set.
For each transaction t containing f, the algorithm produces c_f, which contains the items in f and an item in t but not in f.
If c_f’s support is greater than the threshold, c_f is added to L.
After comparing the running frontier set with the transactions, the frontier set is replaced by a subset of the collection of c_fs.
The support of each c_f in this subset is expected to be below the threshold but it turns out to be higher.
Under the assumption of the statistical independence, the expected support of a c_f is the product of the support of f and the relative frequency of the item not in f.
If the next forntier set is empty, the algorithm terminates.