Comparing algorithms – how to decide which one is better?
When implementing an MDM solution, or any matching algorithm, you may be replacing a previous matching algorithm, and you will need to be confident that the new algorithm is an improvement on the old one. However, knowing which algorithm is better, or worse, is a challenging pursuit.
For example, is the algorithm that matches more records the better one? Not necessarily: taken to the extreme, you could simply match everything together, while at the other end of the spectrum, if you matched nothing together, your matching algorithm would be pointless.
Instead, algorithms must be evaluated against the number of incorrect matches they make: both false positives (FP), which are over-matched records, and false negatives (FN), which are under-matched records.
In addition, when comparing algorithms, they must be compared on the same data, so you are comparing apples with apples. Otherwise, differences in matching performance may reflect the data being matched rather than the algorithms themselves.
A process for comparing algorithms
To compare algorithms thoroughly and robustly, we recommend a three-stage comparison process. This will both evaluate and describe the differences between the algorithms and give evidence as to which one matches “better”*.
The three stages are:
1. Aligning false positives
2. Evaluating total matches
3. Highlighting matching differences
*”better” meaning, at the very least, that one of FP or FN is equal and the other is improved.
Aligning false positives
Typically, you will have an existing algorithm, the baseline, which will have already matched your data. In this way, you can quickly and easily figure out whether records have matched or not for this algorithm. The newer algorithm, however, is probably still under development and being “tuned” to achieve maximum accuracy and performance.
Over the total data set, you will want to take a representative sample of record pairs, covering both matched and unmatched pairs. Sometimes, the implementation of an algorithm may have tools to help you gather this sample, such as the IBM MDM tool’s Sample Pairs. In any case, you will want to take a sample, probably of around 1,000 pairs, and hand it over to at least three data stewards (odd numbers are better for tiebreaks), asking them to evaluate whether each pair is a match or not. Combining the data stewards’ answers (e.g. majority decision wins) will give you a baseline answer as to whether each pair is a true match or not.
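The majority-decision step can be sketched in a few lines of Python. This is a minimal illustration of the rule described above, not part of any particular MDM product:

```python
from collections import Counter

def steward_verdict(votes):
    """Combine data-steward answers for one pair by majority decision.

    votes: list of booleans (True = the steward says the pair is a match).
    An odd number of stewards guarantees there is no tie.
    """
    counts = Counter(votes)
    return counts[True] > counts[False]

# Three stewards review a sampled pair; two say "match", one says "no match".
print(steward_verdict([True, True, False]))  # True: majority decision wins
```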
Now, the same pairs of records are fed through both the existing and the new algorithm, recording whether each algorithm considers each pair a match or not. Depending on the data steward answer and the algorithm answer, you can classify each pair as below (for each algorithm):
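As a sketch, the classification and tallying of sampled pairs for a single algorithm could look like this; the function names are illustrative:

```python
from collections import Counter

def classify_pair(steward_says_match, algorithm_says_match):
    """Classify one sampled pair against the data stewards' baseline answer."""
    if algorithm_says_match:
        return "TP" if steward_says_match else "FP"  # FP = over-matched record
    return "FN" if steward_says_match else "TN"      # FN = under-matched record

def tally(pairs):
    """pairs: iterable of (steward_answer, algorithm_answer) booleans."""
    return Counter(classify_pair(s, a) for s, a in pairs)
```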
You can now count each pair’s classification to end up with a small table such as below:
What we can see here is that the two algorithms perform differently. It should also be noted that TP and FN are complements of one another (together they account for every true match in the sample), just as TN and FP are. In this example, we see that Algorithm 1 seems to be doing better on false positives, but worse on false negatives – suggesting no clear winner.
Instead, what we aim to do is tune the new algorithm so that its number of false positives aligns with that of the existing algorithm, which we can assume represents an acceptable level of false positives for the business, since they currently accept it. This may mean changing a PME threshold, or broadening or narrowing a set of deterministic rules. In any case, it is likely that as the FP rate changes, the FN rate will move in the other direction. If we take the above example and assume that Algorithm 2 had to change, we may end up with something more like:
Sometimes it might not be possible to get the number of false positives exactly the same, because of the granularity of configurable elements in the algorithm. However, they should be as close as you can get them. In the above example, we can see that while the false positives improved for Algorithm 2, the number of false negatives got worse.
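For a score-based algorithm (such as a probabilistic matcher with a tunable threshold), aligning false positives amounts to choosing the threshold whose FP count on the sample comes closest to the baseline’s. A hedged sketch, assuming each sampled pair has a numeric score and a steward verdict:

```python
def align_false_positives(scored_pairs, baseline_fp_count):
    """Pick the threshold whose false-positive count best matches the baseline.

    scored_pairs: list of (score, is_true_match) for the sampled pairs, where
    is_true_match comes from the data stewards' majority decision.
    Returns (threshold, fp_count_at_threshold).
    """
    candidates = sorted({score for score, _ in scored_pairs})
    best = None
    for threshold in candidates:
        # Pairs scoring at or above the threshold are declared matches.
        fp = sum(1 for score, truth in scored_pairs
                 if score >= threshold and not truth)
        gap = abs(fp - baseline_fp_count)
        if best is None or gap < best[0]:
            best = (gap, threshold, fp)
    _, threshold, fp = best
    return threshold, fp
```

Because thresholds are only as granular as the observed scores, the closest achievable FP count may not equal the baseline exactly, mirroring the caveat above.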
Evaluating total matches
What we saw in the example in the previous section is that Algorithm 2 seemed to be performing worse than Algorithm 1: they had the same false positive rate, but Algorithm 2 had a worse false negative rate (which meant fewer overall matches). However, this isn’t the entire story. While newer algorithms, such as probabilistic matching or neural networks, may not match two specific pairs of records, they will normally take into account more than just those two records when matching.
For example, in probabilistic matching, you can get transitive matches: while A and C may not match directly, adding a third record, B, such that A and B match and B and C match means that A, B and C all get matched together.
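Transitive matching can be illustrated with a tiny union-find (disjoint-set) sketch; the record names are hypothetical:

```python
def group_transitively(records, matched_pairs):
    """Union-find: records linked by any chain of matches end up in one group."""
    parent = {r: r for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving keeps lookups fast
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

# A and C do not match directly, but A-B and B-C do, so all three group together.
print(group_transitively(["A", "B", "C"], [("A", "B"), ("B", "C")]))
```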
It is for this reason that we don’t rely heavily on the false negative rate in the previous exercise. Instead, we look at the total matching.
Setting the number of false positives to be equal across both algorithms mitigates the risk of one algorithm making more incorrect matches than the other. In other words, by aligning the false positives and then looking at the total number of matches over 100% of your data, you can be confident that while not all of the matches may be accurate, the inaccuracy rate is fairly similar between both algorithms. It can therefore be disregarded when comparing the two (but not when assessing the absolute quality of the algorithms themselves, only the relative quality between the two).
To get the total matches, you need to filter your data down to only those records which can be (or have been) matched by both algorithms.
For instance, in the below diagram, the circle on the left represents the records available to Algorithm 1 and the circle on the right represents the records available to Algorithm 2. The green overlapping records are those which should be used for the following exercise, not those in the blue or yellow sections.
Once you have your data, you want to match it all using both algorithms. Then you can calculate the total number of records (which will be the same across both algorithms) as well as the distinct golden records for each algorithm (a golden record is the grouping of matched records). Let’s look at the example below:
In this example, Algorithm 1 has 100 records and they match into 87 golden records. On the other hand, Algorithm 2 also has 100 records, but they match into 43 golden records. In this example, Algorithm 2 achieves many more matches than Algorithm 1.
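Counting golden records is just counting the transitively matched groups. A sketch reproducing the example’s 87 and 43, with entirely hypothetical match pairs:

```python
def count_golden_records(records, matched_pairs):
    """Count distinct golden records (groups of transitively matched records)."""
    parent = {r: r for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    return len({find(r) for r in records})

records = [f"rec{i}" for i in range(100)]
# Hypothetical match sets: each pair of distinct records merged reduces the
# golden-record count by one, so 13 links give 87 groups and 57 give 43.
algo1_pairs = [(f"rec{i}", f"rec{i+1}") for i in range(13)]
algo2_pairs = [(f"rec{i}", f"rec{i+1}") for i in range(57)]
print(count_golden_records(records, algo1_pairs))  # 87
print(count_golden_records(records, algo2_pairs))  # 43
```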
Knowing that the false positive rate is the same between both algorithms, we can say with some confidence that the matches both algorithms make are at least as accurate as one another. In this example, therefore, both algorithms achieve the same false positive rate, but Algorithm 2 achieves significantly more matches, which means that the number of false negatives is much lower* in Algorithm 2 than in Algorithm 1 – highlighting that Algorithm 2 is superior at matching.
* The reason we can say more matches means fewer false negatives, when the false positive rate is fixed, is that the matches an algorithm makes are either correct or incorrect (when compared against a data steward’s answer). The incorrect matches are false positives, which we fixed or minimised during the first exercise with the pair sampling. The remainder of the matches made by an algorithm must therefore be correct. So, with the FP rate fixed, the algorithm which achieves more matches is achieving the same number of false positives as the other algorithm, but more correct matches – and therefore fewer false negatives.
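The arithmetic behind this footnote can be made explicit: matches = TP + FP, and TP + FN equals the number of true matches. With FP held fixed, more matches means more TP and therefore fewer FN. A worked example with hypothetical numbers:

```python
true_matches = 50   # pairs the data stewards would call a match
fixed_fp = 5        # false positives, aligned across both algorithms

for total_matches in (40, 48):      # matches made by each algorithm
    tp = total_matches - fixed_fp   # matches = TP + FP
    fn = true_matches - tp          # TP + FN = true matches
    print(f"matches={total_matches}: TP={tp}, FN={fn}")
```

The algorithm making 48 matches ends up with 7 false negatives against 15 for the one making 40, at the same false positive count.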
Highlighting matching differences
The above two processes give evidence as to which algorithm is matching “better”, by looking at false positive and false negative rates. And while you may get a clear “this algorithm is better” answer, it is very likely that the algorithms will match differently, not just better or worse. In this case, it is worthwhile looking at the actual similarities and differences between the matching algorithms.
To do this, we can break records down into one of four scenarios, with scenario four being further divisible into sub-scenarios, although those are beyond the scope of this article. The four scenarios are:
1. The matching performed on these records is exactly the same between both algorithms. E.g. Algo1(A, B, C), Algo1(D, E), Algo2(A, B, C) and Algo2(D, E)
2. The matches for Algorithm 1, in this scenario, are partitioned into subsets of matches in Algorithm 2. E.g. Algo1(A, B, C, D), Algo2(A, B) and Algo2(C, D)
3. This is the inverse of Scenario 2. That is, the matches for Algorithm 2 are partitioned into subsets of matches in Algorithm 1. E.g. Algo1(A, B), Algo1(C, D) and Algo2(A, B, C, D)
4. All other records fall into Scenario 4. This includes matches that may “chain together”. E.g. Algo1(A), Algo1(B, C), Algo1(D), Algo2(A, B) and Algo2(C, D)
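The four scenarios can be detected mechanically from the two sets of golden records. A sketch, assuming both algorithms’ golden records cover the same record universe (function names are illustrative):

```python
from collections import defaultdict

def classify_scenarios(algo1_groups, algo2_groups):
    """Bucket combined record clusters into the four comparison scenarios.

    algo1_groups / algo2_groups: each algorithm's golden records, as lists of
    sets of record ids covering the same records.
    Returns {scenario_number: [component representatives]}.
    """
    parent = {}

    def find(r):
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Link records within every group from either algorithm, so each combined
    # component contains all records that either algorithm ties together.
    for group in list(algo1_groups) + list(algo2_groups):
        members = list(group)
        for other in members[1:]:
            parent[find(members[0])] = find(other)

    comp1, comp2 = defaultdict(list), defaultdict(list)
    for g in algo1_groups:
        comp1[find(next(iter(g)))].append(frozenset(g))
    for g in algo2_groups:
        comp2[find(next(iter(g)))].append(frozenset(g))

    scenarios = {1: [], 2: [], 3: [], 4: []}
    for comp in set(comp1) | set(comp2):
        c1, c2 = comp1[comp], comp2[comp]
        if set(c1) == set(c2):
            scenarios[1].append(comp)   # identical matching
        elif len(c1) == 1 and len(c2) > 1:
            scenarios[2].append(comp)   # Algorithm 1 group split by Algorithm 2
        elif len(c2) == 1 and len(c1) > 1:
            scenarios[3].append(comp)   # Algorithm 2 group split by Algorithm 1
        else:
            scenarios[4].append(comp)   # everything else, including chains
    return scenarios
```

Counting the records that land in each scenario bucket is then a simple follow-up, and the scenario 2 and 3 buckets point directly at where the two algorithms carve the same records up differently.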
If you are happy with the existing matching, then by looking at the matching in scenarios 2, 3 and 4 you may be able to find areas where the newer algorithm matches better, and you may also find areas where the new algorithm can improve. Both of these situations are of benefit, as they build confidence in the new algorithm and highlight how much it will improve on the existing one.
If you apply these three techniques you can be confident that you understand how your new algorithm performs in comparison to the old one and also where it is different. This will be important when it comes to applying the results to business outcomes.
If you would like more information on algorithms and matching then look at my previous blogs here and also look out for the next three in the series! Happy matching 🙂