Hamming distance on database records

Today I needed to quickly compare multiple records in one table with one I already had and find the most similar one, specifically in MSSQL but the query I used can be modified for most other DB systems. I am going to write this up a little formally to make sure that it is clear. Note: that this implementation can be modified to be a dynamically created query or a stored procedure.

Problem

For a given record in sourceTable (id = 12), rank records in table, based on how many values of each record match the corresponding column:value pair.

Solution

Use a modified hamming distance algorithm in the SQL as follows:

SELECT table.id,

CASE WHEN sourceTable.columnName1 = table.columnName1 THEN 0 ELSE 1 END +
CASE WHEN sourceTable.columnName2 = table.columnName2 THEN 0 ELSE 1 END +
...
CASE WHEN sourceTable.columnNameN = table.columnNameN THEN 0 ELSE 1 END AS hammingDistance
FROM table, sourceTable
WHERE sourceTable.id = 12
 ORDER BY hammingDistance ASC

This implementation of the hamming code algorithm ranks all the records in the [table] table based on how many differences there are between the column values with the same column name in each table. So as output from this query we get the id’s of the table ranked with the most similar first (a hamming distance value of 0 being identical) and the least similar last. I am not sure that this is the most efficient way of doing this but at this point in time it works and can be modified for more complicated comparisons.

About Simeon Cheeseman

I enjoy a wide variety of computer and board games, have a BSc in Computer Science and have played percussion for 18 years.

Posted on January 27, 2012, in Algorithms, Database, SQL and tagged , , , . Bookmark the permalink. 3 Comments.

  1. Hello,

    I am searching what is the best way to apply Hamming search on database’s hashes.

    Did you ever applied your own SQL function on big query such as 1M elements ? How much time did it cost ?

    Thanks you in advance !

    Sebastien D.

    • Hey, no I never tried this on anything big, I did it mostly to help me remember the course content at UNI.

      • Hello Simeon,

        thanks for your response. I will try to use NoSQL database to support scalability and supply performant process times when I do queries.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: