Data matching question (maths problem) [SOLVED]
by Duncan Gray · in Technical Issues · 07/06/2010 (12:03 am) · 16 replies
I want to match a player with 16 other players of similar characteristics out of a potential pool of 20-30 thousand players (and that figure can increase to several million).
All the players have answered 50 yes/no type questions which will be used to profile them, so in theory I just need to score them on the basis that if 80% or more of their answers agree then they are a good match etc. The cut-off percentage can be varied until it produces the top 16 or so matches for each person.
Sounds straightforward enough, right? The problem is that you need to scan each player against 30 000 others, so to find the top 16 matches for each of the 30 000 players you need 30 000 x 30 000 = 900 million comparisons.
I have a few ideas on how to make it more efficient but perhaps someone knows of a better approach? Ideally I want a single pass solution.
#2
07/06/2010 (10:41 am)
If you order your table by score, then this becomes a simple sort algorithm problem... quicksort is usually a good option.
#3
07/06/2010 (12:47 pm)
If you encode the answers as a binary number, you could potentially get some help from your database. You'll need a 64-bit number with your 50 questions, but I don't think that will be a problem as long as you account for it.

First, get the answer number from the first player. We'll pretend the first player's answer is 5.
Next, use a SQL statement like: SELECT playerName, (ans & 5) | (!ans & !5) AS matches FROM yourTable;
Finally, use a fast bit counting algorithm to determine the number of matching answers the players have.
Here's an example:
Pretend that 3 people (A, B, and C) gave answers to 5 questions. A '1' is a yes response and a '0' is a no response.
A: 10110 = 22
B: 10000 = 16
C: 01011 = 11
So your table would look like:
PLAYERNAME  ANS
==========  ===
A            22
B            16
C            11
So, for player A, we have the answer of 22.
SELECT playerName, (ans & 22) | (!ans & !22) AS matches
PLAYERNAME  matches
==========  =======
A                31
B                25
C                 2
To see what's going on, let's compare player A (22) to player B (16)
22 = 10110
16 = 10000
(22 & 16) = 10110 & 10000 = 10000 = 16, the first half
(!22 & !16) = 01001 & 01111 = 01001 = 9, the second half
(16 | 9) = 10000 | 01001 = 11001 = 25, our result!
A bit counting algorithm would return 3 for the number 25 (as 3 bits are set). Therefore, 3 answers match.
That's closer to 30000 queries and 30000 bit counts in code.
#4
07/06/2010 (1:15 pm)
Thanks William, that solution seems brilliant! You and Duncan just helped me solve a problem my HiveMind AI has been plagued with for over 2 years (a way to comprehend and compute a logical solution based on massive amounts of reported data). Oh, and "hey" Duncan, great to see you're still in the Torque world.
#5
07/06/2010 (4:02 pm)
Thanks guys, the community here is still great, which is why I still pop in from time to time even though I'm not a huge fan of some of the business decisions that have taken place here.

Yeah, William, correct, I was going to store the answers as a 64-bit word, which is why the questions required yes/no answers, and I was going to use a bit-counting function to calculate the number of matching bits (answers) as a percentage.
William, I also like your efficient way of bit comparison prior to a final bit count; however, to find compatible players for all 30 000 players still requires one run through the entire table of 30 000 rows for each player, which still comes to 900 million calls to the bit-counting algorithm.
The direction I was leaning towards was to pre-profile each player to some degree. Let's say that I split the 64-bit word into 8 sections of 8 bits each. Next I do a bit count of each 8-bit section and store the count in a separate, indexed table column. I will then have 8 indexed columns, each containing a bit count for an 8-bit section. This will allow me to eliminate big chunks of the database by simply comparing pre-counted bit counts and reasoning that an 8-bit section with a bit count of 2 will never provide an 80% bit match to an 8-bit section with a bit count of 5 or more, etc. I want to chew on it some more, but it's looking like it should be a much faster way of doing it.
The general idea is to only compare against rows with enough bits in them. Sub-dividing the word into 8 sections just gives me a finer degree of elimination before getting down to bit by bit comparison.
#6
07/06/2010 (9:14 pm)
If "no"s are matches (which I assumed above), you should be careful. Say, in 8 bits, you could have
00011111
00000011
One has 5 bits, the other has 2. But they still match on 5 of the answers.
This problem is very interesting; I'll probably give it a little more thought.
#7
07/06/2010 (10:45 pm)
William, yes you are correct. "No"s are also matches. The two-headed monster got me again.

As you say, very interesting problem.
#8
07/07/2010 (10:41 am)
I believe that pattern matching should be the magic solution I have been looking for. After bashing a few ideas around it seems that creating a pattern from a bit reduction technique is the way to go.
Basically I want to reduce the 64-bit int down to an 8-bit int which will be the pattern ID to which that 64-bit int belongs.
I'm still looking for a good mathematical way of generating a pattern, but by way of example, what if I divide the 64-bit int into 8 sections of 8 bits each? I test each 8-bit group with:

    Result <<= 1; Result += (int)value > 128;

Now store Result as an int called patternID in an indexed column of the table.
Now I just need to select rows with the same patternID, which will yield a result set ~1/256 the size of the table, or 256 times faster than a full table scan (assuming the IDs are evenly distributed).
Of course I would still need to bit-count and sort the result set as explained so succinctly by William.
Other bit reduction techniques which might work is sequentially XORing or ANDing bit-pairs until you are left with 8 bits. Needs more research.
#9
07/07/2010 (7:43 pm)
It seems like your "Result += (int)value > 128;" is only testing that the highest single bit is set and at least one other bit is set. I think you'll miss a lot of matches.

For example, somebody with 11000000 [192] would match well with a 01000000 [64] (7 matching answers), but that bit in their patternID would not match at all.
I did a quick test in C++ and found that if you load all of the values into memory it takes less than a minute to get the answer.
#include "stdafx.h"
#include <bitset>
#include <vector>

using namespace System;
using namespace std;

struct Person
{
    int id;
    bitset<64> yeses;
    bitset<64> noes;

    int matches( const Person& other )
    {
        return ((yeses & other.yeses) | (noes & other.noes)).count();
    }
};

int main(array<System::String ^> ^args)
{
    vector<Person> thePeople;
    for( int i = 0; i < 30000; ++i )
    {
        Person p;
        p.id = i;
        for( int a = 0; a < 64; ++a )
            p.yeses.set( a, rand() & 1 );
        p.noes = p.yeses; p.noes.flip(); // Auto-compute the noes.
        thePeople.push_back( p );
    }

    int total = 0;
    for( int i = 0; i < 29999; ++i )
    {
        if( i % 1000 == 0 ) Console::WriteLine( L"i = {0}", i );
        for( int j = i + 1; j < 30000; ++j )
        {
            int matches = thePeople[i].matches( thePeople[j] );
            if( matches >= 56 ) ++total;
        }
    }

    Console::WriteLine( L"{0} matches", total );
    Console::ReadKey();
    return 0;
}

The only optimization I made here is that after I randomly create a set of answers (with yes = 1), I pre-calculate the "no" answers as its own bitset (by copying the yeses and flipping all the bits).
If you could thread this to 4 cores, you could easily get an answer in 20 seconds.
#10
07/07/2010 (8:36 pm)
I like your method and am not faulting it, other than that I still would prefer it to go quicker.

Tying up an entire server for 20-60 seconds at a time is not ideal in my situation because there will be a lot of other needs to cater to as well.
Also consider people who want to use this system in AI applications where they need to rescan constantly due to changing input conditions.
As you pointed out, my optimization idea still has faults but there has to be a mathematical way to pre-group the data based on probability of matches so that only the reduced result set has to run through your nice matching algorithm.
I'm reading through some good data here
#11
07/07/2010 (11:11 pm)
Quick thought.

Seeing as I'm looking for a majority match, I can use the 34 least significant bits as a grouping index, because if two 64-bit numbers have all 34 of those bits in perfect agreement then it's a majority match, and I only need to count how many of the remaining 30 bits match to fine-tune it.
#12
07/08/2010 (11:11 am)
What if I match on 26 of the top 30 bits, but only 33 of the bottom 34? I'd have 59/64 matches, but I would be out of contention.

What if you kept a list of the top 32 matches for each player (stored with the id AND the number of matches)? In addition, you keep a reverse look-up table so that for a given player, you know all of the other players who hold him as a closest match.
To get the top 16 matches for a player, you just have to get their stored list.
When an answer changes for a given player, use the reverse look-up to get the list of everybody who *MIGHT* change. Then recalculate the matches for those people and re-sort their list of matches.
If a new player is added, you'll just scan through the table once and insert him into the appropriate lists (which would just take microseconds).
After X-number of answer changes (or Y minutes), you re-run the whole algorithm to re-build the 32 closest matches. This wouldn't have to be done too often. The idea being that when an answer changes, at worst it can move somebody from the 33rd closest match to the 32nd (which wouldn't be picked up). If one person is changing their answers constantly, then this might need to be done after a few hundred answer changes. If answer changes are done equally by everybody, you could probably get away with waiting for a few thousand answer changes before re-running the whole table algorithm.
#13
07/08/2010 (2:23 pm)
Yep, what you just described was my plan B if I failed to find a good pre-sort shortcut as plan A.

I read through the Rete algorithm a few days ago and it does pretty much what you just described, but it's a lot of additional coding and data hand-holding, which is why I made it plan B.
Quote: "What if I match on 26 of the top 30 bits, but only 33 of the bottom 34? I'd have 59/64 matches, but I would be out of contention."

Correct, I have to find a better way.
#14
07/08/2010 (4:30 pm)
I thought of one more thing...

First, encode each answer into its own column (Q1, Q2, Q3, etc.). Each column defines a coordinate on a 50-dimensional cube. Then you just need to calculate the Manhattan distance between each pair of players. A huge advantage is that all answers are 0 and 1. Therefore, the distance between coordinates is also the XOR of the values.
Let's pretend that there are just 3 questions and that we are trying to find matches for a player with the answers "1 0 1".
SELECT playerId, (Q1 ^ 1) + (Q2 ^ 0) + (Q3 ^ 1) AS distance FROM yourTable ORDER BY distance
You just take the first 16 rows and you have your answer.
If you build the query dynamically, you can make some optimizations. When you XOR with 0, nothing happens to the original value. (Also, when you XOR with 1, the value flips, but I'm not sure you can easily do that in SQL.)
SELECT playerId, (Q1 ^ 1) + Q2 + (Q3 ^ 1) AS distance...
I guess the question here is whether or not the database will be fast enough with the answer.
Whatever you do end up implementing, let us know here... I'm very curious as to what ends up working.
#15
07/08/2010 (6:05 pm)
Hmmm...

Behind the scenes the database engine still has to do a full table scan prior to sorting and returning the top 16, so it's back to square one.
According to the Hamming distance article, you can XOR the words as a whole and then count the 1s afterwards. They give a fast C code example.
unsigned hamdist(unsigned x, unsigned y)
{
    unsigned dist = 0, val = x ^ y;

    // Count the number of set bits
    while( val )
    {
        ++dist;
        val &= val - 1;
    }

    return dist;
}

You could plug the XOR approach into your C++ code's matches() and compare with your previous time, but I'm not convinced it will be significantly different.
I'm still researching whether it's feasible to recognize a trend in the bit pattern and assign it to a group for indexing purposes.
#16
07/09/2010 (5:03 am)
I have been unable to find a suitable pattern to use in a pattern-matching index, so I went old school and did it the 'normal' way.

I used the above bit-counting code in a MySQL "user defined function", a UDF, which, if you don't know what that is, is a way of writing a compiled C function that you can call from within an SQL statement.
It solves my problem nicely because it means I can keep the data on the server, still efficiently do the bit counting, and just return the result set to the client.
For those of you wanting to write your own UDF, have a look at some examples here
Torque Owner Jonas
I see your point in making the system this way, but may I suggest a different approach?
Let's say I want to be paired up with players and the system needs to run 900 million scans: I will get a long wait time, plus it's not 100% certain I will be matched with what I'm looking for. Therefore a system like this would look better:
The first person's % score, let's say 50%, will be passed into the variable "LookPPL", and then the search will look for players with a score of 50% or more. Basing the mechanics this way will reduce the overall time and effort.
Hope it helps!