Logical Analysis of Data
Inessa Epstein, Rutgers University, New Brunswick, NJ
Mentor: Professor Peter Hammer, Rutgers University, New Brunswick, NJ
One of the major goals of LAD (Logical Analysis of Data) is to classify new observations in a way consistent with past classifications. The available information consists of an archive or previous observations. Each observation consists of a vector of attribute values and the outcome of it. A specific characteristic of LAD is the detection of logical patterns which distinguish logical patterns in one class from all other observations. A pattern characteristic for a specific class is a combination of attribute values (or sets of values) occurring together only in some observations from the considered class. An important feature of LAD methodology is the possibility of using patterns in explaining the results of classification to human experts by standard formal reasoning. The general purpose of LAD is to observe data for which a positive or negative result is known and to make predictions for data not in the set. Unfortunately, such predictions usually cannot be 100% accurate. What I am working on now is identifying vectors with attributes that are unlike the original observed data and, as a result, accurate predictions for such data would be difficult to make.
A Boolean vector is a vector with its only possible values being 0 and 1. A Boolean function is a mapping from a Boolean vector to a Boolean value. A Partially defined Boolean function (pdBf) is a Boolean function that is defined over only a subset of all possible vector combinations.
Once we have a partially defined Boolean function for a subset of the vector space, we can then employ a procedure for classifying a set of new points that we can refer to as the testing set. The procedure for classifying points with unknown results is as follows: 1. Start with a training set for which result is known. 2. Use training set to generate positive and negative patterns. 3. For each new point, check whether it is covered by more positive or negative patterns and classify accordingly.
This theory works somewhat well and makes decent predictions, but, unfortunately, sometimes points are misclassified and mistakes are made. My problem this summer is to investigate the nature of such mistakes.
For this purpose, we defined a notion of distance for each point that we call a rank of exclusion. Rank of exclusion is the minimum number of variables needed to prove that a vector is not in the original training set.
x1 x2 x3 x4
1 1 1 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 0
Rank1 = -x1
Rank2 = -x1 v x2 -x3 v -x3 -x4
Rank3 = -x1 v x2 -x3 v -x3 -x4 v x2x3x4
Vector 0 0 0 1 Exclusion rank: 1
Vector 1 1 0 1 Exclusion rank: 2
Vectors with a lower exclusion rank are less similar to the original training set and, intuitively, less information is available about them
Conjecture: Vectors with a higher exclusion rank have a higher probability of being classified correctly.