To actually implement that, we first need to correlate (or match, or reconcile) entities we get from each source. Until now, this was a manual step on the part of users. They were supposed to navigate to a certain screen in our product and drag-and-drop entities (for example, their clients’ or employees’ names) into matching groups. Needless to say, nobody would actually do it. The users pay us for the software and expect the software to do the work.
So, our product manager tells us to develop a new feature to automatically match entities across data sources. I know from experience that it is impossible to achieve a perfect match rate automatically and that a combination of computer and human matching works best. I propose a wizard-like interface that to matches that the user can either accept, select a better match or reject outright. The PM does not buy it. Our users wouldn’t do even that, he says. They pay us for the software and expect it to do the work. The matches have to take effect automatically.
Now we have a well defined technical problem. There is a list of entities — say, company names. We are getting new entities, one by one, and have to find the best single match to the given list — or decide that no match is good enough.
How do you approach solving this problem? By doing two things:
1. Measure, measure, measure.
2. Know your building blocks.
Let’s talk about measuring first. As we evaluate different algorithms, it is important to build solid intuition about how they perform, but intuition is no substitute for solid metrics. It is given that no algorithm will be perfect, so we need a statistical way to evaluate and compare them. A note of caution, though: all the statistical metrics are important, but at the end of the day the only thing that matters is how “right” the actual results feel to the user. It may be easy (or not) to explain to them why, say, “InsightSquared” doesn’t get matched to “Insight Squared”, while at the same time “InsightSquared, Inc” gets matched to, say, “Google, Inc”. They probably won’t buy it, and our goal is to make them buy it. So at the end of the development it is important to visually examine the results and make sure they feel right. But still, we need to measure first.
One way to quantify matching is to count true positives, true negatives, false positives and false negatives. Since our matches will be immediately used in production and we don’t expect our users to actively curate them we want to minimize false positives while keeping true positives reasonably high.
Where do we get the actual values for positives and negatives? One needs a high quality manually matched data set. I get lucky. We always eat our own dog food here at InsightSquared, and we have an instance of our system with our own data, all carefully manually matched. So I write a test harness that runs all of our data through any given matching algorithm, compares the results with manual matches and calculates the statistics: true positives, true negatives, false positives and false negatives. After some testing, it turns out that the four statistics of true and false positives and negatives is too many when you keep comparing multiple sets of results. Fortunately, there are two other metrics that will do instead: precision and recall. Precision is the fraction of retrieved entities that are relevant, while recall is the fraction of relevant entities that are retrieved. In our previous terminology, precision is the number of true positives divided by the number of all positives, and recall is the number of true positives divided by the sum of true positives and false negatives. Again, in the case of InsightSquared we want precision to be as close to 100% as possible while keeping recall reasonably close to 100%. (If you feel that even two numbers is too many, there is also one single metric called F-measure that combines recall and precision.)
Now that we know how to measure, let’s talk about the kind of algorithms we might use. First, I build a strawman: an exact string comparison algorithm. Run it on our data, get 66% recall and 99% precision. Not bad already! In addition to being a nice test for the test harness, this gives us a lower bound on recall and an upper bound on precision. (You would expect 100% precision, assuming that all the strings — the company names we are matching — are unique, which is not an entirely trivial assumption.)
The next idea is to do fuzzy, rather than exact, string comparisons. One way to do it is to use Levenshtein distance, which is defined as the number of one character transformations (adding, deleting or changing single character) required to get from one string to another. You compare the entity being matched to all potential matches, the one with the smallest Levenshtein distance wins. This metric is often used in spellcheckers to produce potential corrections to a misspelled word. So we try it, and the result is a disaster. Here is why: turns out that misspellings are not our biggest problem at all! The biggest problem is adding and removing words: “InsightSquared” vs. “InsightSquared Inc”. And, of course, the Levenshtein distance between “ABC” and “ABC Incorporated” is 13, and between “ABC” and “XYZ” it is just 3, so we would match ABC to XYZ. Not good.
Let’s try something else, then. The next approach is to treat this as a search problem. When you type a search query into Google, it essentially matches it against the documents it crawled before, and returns a list of documents sorted by the similarity to your query. Sounds remarkably similar to the matching problem!
Time for a spoiler: this approach works, and works remarkably well. It is also not too hard to build from scratch. We just need a few building blocks.
The first block is the algorithm called TF-IDF, which is used by most modern search engines. This algorithm, given a search query term (for example, a word) and a corpus of documents, produces a numeric measure of relevance of each document to this term. I am not going to describe it here in detail. The general idea, though, is that the relevance measure is a product of term frequency (TF, how frequently the term is used in the document) and inverse document frequency (IDF, one over how often this term is used in the entire document corpus). Intuition: generally common terms are discounted. This is exactly what we need, because we expect to see some very common terms like “Inc.” and “Corporation” which should not be, on their own, used for a definitive match.
In our case both “documents” and “queries” are just company names being matched to each other. How do you extract terms from them? That would be our second building block. One way is just to split them into words. This approach is simple and easy to understand, but maybe it is too simple? For example, it leaves little hope for catching typos and misspellings. Another approach is to use n-grams. N-grams for the given integer n is just a sliding window of sequences of n characters. For example, five-grams for “InsightSquared” would be ["Insig", "nsigh", "sight", "ightS", "ghtSq", "htSqu", "tSqua", "Squar", "quare", "uared"]. N-grams are better than words in catching typos, and they are also a good fit for TF-IDF, which would automatically discount common n-grams and emphasize rare and exotic ones.
Finally, TF-IDF gives you a score per term, but what you really need is a score for the entire query, which can contain multiple terms (and surely will if you are using n-grams). To combine multiple TF-IDF scores into one metric, we are going to use the final building block: cosine similarity. It comes from geometry, where the cosine of the angle between two vectors is used as a measure of their similarity. The cosine is (somewhat surprisingly) easier to calculate than the angle itself, and it has a nice range of [0, 1] (for positive vectors), 1 being the most similar, and 0 the least similar. Again, the formula for cosine similarity can be found in Wikipedia and any number of other places. What is sometimes glossed over is how to actually define the vectors. In the TF-IDF context, when comparing the query to a document from the corpus, get the union of the terms used in the query and in the documents, and then construct two parallel vectors with TF-IDF scores for those terms, one vector for the query and another vector for the document. (In my experiments, using just the query’s terms for the vector, instead of the union, gives very similar results, and is noticeably faster.) Now, we calculate cosine similarities of our query against all documents in the corpus, and pick the document with the largest similarity. Done!
Well, almost. Last, but not least, we need to handle the scenario when there are no good matches at all. In this case something seemingly random will surface as the best (but still bad) match. Something like “InsightSquared, Inc.” vs. “Google, Inc.” from our earlier example, which produces low but non-zero TF-IDF scores (for “Inc.”) and cosine similarity values, but clearly cannot be allowed to happen. We can address this simply by introducing a threshold similarity, below which we will return “no match.” Having similarity values always in the [0, 1] range helps here. This threshold value will be our main tool in achieving the right balance between true positives and false positives — between recall and precision.
And this is it! Armed with our building blocks and measuring tools, we can try different algorithms with different parameters (such as n-gram length and threshold values) and find the one that works best for us.
In our case, Levenshtein distance performs poorly, TF-IDF over words performs remarkably well, and the best variants of TF-IDF over n-grams are slightly better than that. N-grams are slower (more terms per document) and somewhat harder for a human to reason about, but I make the decision to use them because, well, they produce slightly better results and also give us a chance to catch misspellings, which always looks impressive. With 98% precision and 87% recall over the training set, and misspellings already caught in the test runs, we have deployed auto-matching in production this week.
Let’s see how it performs in the wild!
 Wikipedia does it well, and, in fact, it is quite possible to implement everything described in this post by using Wikipedia alone