4. Apriori What does it do? The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions. What are association rules? Association rule learning is a data mining technique for learning correlations and relations among variables in a database. What’s an example of Apriori? Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.
Here’s the best part: by applying the Apriori algorithm, we can learn which grocery items are purchased together, a.k.a. association rules. The power of this is: you can find the items that tend to be purchased together more frequently than other items, the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets. For example: you can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships, especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with! You might be wondering how Apriori works. Before getting into the nitty-gritty of the algorithm, you’ll need to define 3 things:  The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
 The second is your support, or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support threshold is called a frequent itemset.
 The third is your confidence, or the conditional probability of some item given that you have certain other items in your itemset. A good example: given chips in your itemset, there is a 67% confidence of also having soda in the itemset.
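To make these two definitions concrete, here’s a minimal Python sketch. The baskets and the resulting numbers are made up for illustration (they don’t reproduce the article’s 67% figure), and the function names are my own:

```python
# Made-up supermarket baskets (illustrative, not the article's data)
transactions = [
    {"chips", "soda", "dip"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
    {"chips", "soda", "bread"},
    {"milk", "soda"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

chips_soda_support = support({"chips", "soda"}, transactions)   # 3 of 6 baskets
chips_to_soda_conf = confidence({"chips"}, {"soda"}, transactions)
```

With this toy data, {chips, soda} appears in 3 of 6 baskets (support 0.5), and 3 of the 4 baskets containing chips also contain soda (confidence 0.75).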
The basic Apriori algorithm is a 3-step approach:  Join. Scan the whole database for how frequent 1-itemsets are.
 Prune. Those itemsets that satisfy the support and confidence move on to the next round for 2-itemsets.
 Repeat. This is repeated for each itemset level until we reach our previously defined size.
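The join/prune loop above can be sketched in a few lines of Python. One hedge: classic Apriori prunes candidates on support alone and applies the confidence threshold later, when generating rules from the frequent itemsets, so this sketch uses only the support cutoff. The function name and toy baskets are my own:

```python
def apriori_frequent_itemsets(transactions, min_support, max_size):
    """Minimal Apriori sketch: generate candidates level by level (join)
    and keep only those meeting the support threshold (prune)."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    # Level 1: frequent 1-itemsets
    frequent = {
        frozenset([i]) for i in items
        if sum(i in t for t in transactions) / n >= min_support
    }
    all_frequent = {1: frequent}
    k = 1
    while k < max_size and frequent:
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep only candidates meeting the support threshold
        frequent = {
            c for c in candidates
            if sum(c <= t for t in transactions) / n >= min_support
        }
        if frequent:
            all_frequent[k] = frequent
    return all_frequent

# Toy baskets: chips + dip and chips + soda show up together often
baskets = [{"chips", "dip"}, {"chips", "soda"},
           {"chips", "dip", "soda"}, {"milk"}]
result = apriori_frequent_itemsets(baskets, min_support=0.5, max_size=2)
```

On these baskets, the frequent 2-itemsets are exactly {chips, dip} and {chips, soda}; {dip, soda} appears in only 1 of 4 baskets, so the prune step drops it.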
Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships. But wait, there’s more… Apriori can also be modified to do classification based on labeled data. Why use Apriori? Apriori is well understood, easy to implement and has many derivatives. On the other hand… The algorithm can be quite memory-, space- and time-intensive when generating itemsets. Where is it used? Plenty of implementations of Apriori are available. Some popular ones are ARtool, Weka, and Orange. The next algorithm was the most difficult for me to understand. Take a look…
5. EM What does it do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery. In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. OK, hang on while I explain… I’m not a statistician, so hopefully my simplification is both correct and helps with understanding. Here are a few concepts that will make this way easier… What’s a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model. Wait, what’s a distribution? A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade. In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade. Cool, what are the parameters of a model? A parameter describes a distribution which is part of a model.
For example, a bell curve can be described by its mean and variance. Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100. So, all you need to describe a normal distribution are 2 parameters: the mean and the variance. And likelihood? Going back to our previous bell curve example… suppose we have a bunch of grades and are told the grades follow a bell curve. However, we’re not given all the grades… only a sample. Here’s the deal: we don’t know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with the estimated mean and variance results in that bunch of grades. In other words, given a set of measurable outcomes, let’s estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called the likelihood. Remember, it’s the hypothetical probability of the existing grades, not the probability of a future grade. You’re probably wondering, what’s probability then? Using the bell curve example, suppose we know the mean and variance. Then we’re told the grades follow a bell curve. The chance that we observe certain grades, and how often they are observed, is the probability. In more general terms, given the parameters, let’s estimate what outcomes should be observed. That’s what probability does for us. Great! Now, what’s the difference between observed and unobserved data? Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There are a number of reasons the data could be missing (not recorded, ignored, etc.). Here’s the kicker: for data mining and clustering, what’s important to us is treating the class of a data point as missing data. We don’t know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.
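As a concrete sketch of the likelihood idea, the snippet below estimates the two parameters from a made-up sample of grades and computes the log-likelihood of that sample under the resulting bell curve. (The log of the likelihood is used instead of the raw product of probabilities to avoid vanishingly small numbers; the grades themselves are invented for illustration.)

```python
import math

def normal_pdf(x, mean, var):
    """Probability density of a bell curve (normal distribution) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# A made-up sample of exam grades (illustrative)
grades = [78, 85, 92, 88, 76, 81, 95, 85]

# Estimate the two parameters of the bell curve from the sample
mean = sum(grades) / len(grades)
var = sum((g - mean) ** 2 for g in grades) / len(grades)

# Log-likelihood: how well the bell curve with these estimated
# parameters explains the grades we actually observed
log_likelihood = sum(math.log(normal_pdf(g, mean, var)) for g in grades)
```

A different pair of parameters (say, a mean of 50) would give this same sample a much lower likelihood, which is exactly why maximizing likelihood is a sensible way to pick parameters.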
Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now. The best part is… By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points — sounds like clustering to me! How does EM help with clustering? EM begins by making a guess at the model parameters. Then it follows an iterative 3-step process:  E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
 M-step: Update the model parameters based on the cluster assignments from the E-step.
 Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
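Here’s a minimal sketch of that E-step/M-step loop for a mixture of two 1-D bell curves. The initial guesses and helper names are my own, and a production implementation would test for convergence rather than running a fixed number of iterations:

```python
import math

def normal_pdf(x, mean, var):
    """Probability density of a normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=50):
    """EM sketch for a mixture of two 1-D Gaussians (illustrative)."""
    # Initial guesses for the model parameters
    m1, m2 = min(data), max(data)
    v1 = v2 = (max(data) - min(data)) ** 2 / 4 or 1.0
    w1 = w2 = 0.5  # mixing weights
    for _ in range(iters):
        # E-step: probability that each point belongs to cluster 1
        r = []
        for x in data:
            p1 = w1 * normal_pdf(x, m1, v1)
            p2 = w2 * normal_pdf(x, m2, v2)
            r.append(p1 / (p1 + p2))
        # M-step: update parameters from the soft assignments
        n1 = sum(r)
        n2 = len(data) - n1
        m1 = sum(ri * x for ri, x in zip(r, data)) / n1
        m2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        v1 = sum(ri * (x - m1) ** 2 for ri, x in zip(r, data)) / n1 or 1e-6
        v2 = sum((1 - ri) * (x - m2) ** 2 for ri, x in zip(r, data)) / n2 or 1e-6
        w1, w2 = n1 / len(data), n2 / len(data)
    return (m1, v1), (m2, v2), r

# Two obvious clusters, around 2 and around 11
low, high, resp = em_two_gaussians([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
```

After a few iterations the estimated means settle near 2 and 11, and the soft assignments in `resp` harden toward 1 for the low points and 0 for the high ones, which is the “stabilize” condition the Repeat step describes.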
Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning. Why use EM? A key selling point of EM is that it’s simple and straightforward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data. This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it’s possible to reason about what the clusters have in common and which cluster new data belongs to. EM is not without weaknesses though…  First, EM is fast in the early iterations, but slow in the later iterations.
 Second, EM doesn’t always find the optimal parameters; it can get stuck in a local optimum rather than finding the global optimum.
Where is it used? The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its mixture module (GaussianMixture, formerly GMM). What data mining does Google do? Take a look…
