Hamming distance on database records
Today I needed to quickly compare multiple records in one table with one I already had and find the most similar one, specifically in MSSQL but the query I used can be modified for most other DB systems. I am going to write this up a little formally to make sure that it is clear. Note: that this implementation can be modified to be a dynamically created query or a stored procedure.
For a given record in sourceTable (id = 12), rank records in table, based on how many values of each record match the corresponding column:value pair.
Use a modified hamming distance algorithm in the SQL as follows:
SELECT table.id, CASE WHEN sourceTable.columnName1 = table.columnName1 THEN 0 ELSE 1 END + CASE WHEN sourceTable.columnName2 = table.columnName2 THEN 0 ELSE 1 END + ... CASE WHEN sourceTable.columnNameN = table.columnNameN THEN 0 ELSE 1 END AS hammingDistance FROM table, sourceTable WHERE sourceTable.id = 12 ORDER BY hammingDistance ASC
This implementation of the hamming code algorithm ranks all the records in the [table] table based on how many differences there are between the column values with the same column name in each table. So as output from this query we get the id’s of the table ranked with the most similar first (a hamming distance value of 0 being identical) and the least similar last. I am not sure that this is the most efficient way of doing this but at this point in time it works and can be modified for more complicated comparisons.