|

Home :: Articles
:: MySQL Select Random Row Fast |
 |
Host 6 Domains on 1 Bluehost Account $6.95 Per Month
MySQL Select Random Row Fast Last Updated: 2004-10-15 15:06:08
As of this writing, there is no automatic way for MySQL to select a random row from a database table. Extracting a random row from a table can be useful for many reasons. It can pull up random products, random advertisments, or any other random thing that you have stored in your database table.
For most purposes on smaller database tables, the following will work fine:
$random_row = mysql_fetch_row(mysql_query("select * from YOUR_TABLE order by rand() limit 1"));
$random_row will be an array containing the data extracted from the random row. However, when the table is large (over about 10,000 rows) this method of selecting a random row becomes increasingly slow with the size of the table and can create a great load on the server. I tested this on a table I was working that contained 2,394,968 rows. It took 717 seconds (12 minutes!) to return a random row.
I wrote the script below as a workaround for a random row with a large database table. This will return a random row in about 0.05 seconds, regardless of the size of the table. If the max_id of the table is not dynamically changing, the function can be rewritten to only execute one database query instead of two.
I am using this script for a new search engine spidering project I am working on. With this, it is useful to have an tinyint(1) column in the table that is updated as the row is selected so that it will not be chosen again. Since the function uses greater than and less than, it would not return an empty result unless all rows had been updated at which point the script could be ended. For such a script this also allows many updaters to be running at once selecting random rows that will not be selected again because the tinyint will be changed upon selection.
<?php
//CODE FROM WWW.GREGGDEV.COM
function random_row($table, $column) {
$max_sql = "SELECT max(" . $column . ")
AS max_id
FROM " . $table;
$max_row = mysql_fetch_array(mysql_query($max_sql));
$random_number = mt_rand(1, $max_row['max_id']);
$random_sql = "SELECT * FROM " . $table . "
WHERE " . $column . " >= " . $random_number . "
ORDER BY " . $column . " ASC
LIMIT 1";
$random_row = mysql_fetch_row(mysql_query($random_sql));
if (!is_array($random_row)) {
$random_sql = "SELECT * FROM " . $table . "
WHERE " . $column . " < " . $random_number . "
ORDER BY " . $column . " DESC
LIMIT 1";
$random_row = mysql_fetch_row(mysql_query($random_sql));
}
return $random_row;
}
//USAGE
echo '<pre>';
print_r(random_row('YOUR_TABLE', 'YOUR_COLUMN'));
echo '</pre>';
?>
More information on selecting random rows from a MySQL database table can be found here:
MySQL Manual
Visitor Comments:
From: Barney Boisvert
I found your article on selecting a random row from MySQL (http://www.greggdev.com/web/articles.php?id=6) when I was looking for a new solution (i was actually using what you proprosed). The reason I was looking for a new solution was that my present solution (and yours) is only as random as the uniformity of keys across the full key range. So if I delete a large block of keys from the middle, suddenly the items on the border of that deleted block have a much higher chance of being selected than the keys not bordering a deleted block. In most cases, this doesn't really matter, but it is a downside to the method, and something I'd say is worth mentioning in the article, so people don't assume that the behaviour is as random as the PRNG (such as using ORDER BY rand() LIMIT 1 provides).
Response From www.greggdev.com:
Thank you, very good point. I normally use this with columns that are only selected once and marked as selected when they are, this makes the issue you have outlined not have effect.
You could have a column that counts the number of times a row is selected and only allow a certain number by your query to select rows that have a count below this number... increment it when it is selected so it won't be selected again until you increase the count selectable in your query.
Response from Barney Boisvert
If you're using a scheme where you can only select a given row once (I
think you mistyped 'column' instead of 'row'), then the problem still
persists. If your randomization routine returns the id of a row that
has already been returned (and marked as unavailable), then the next
higher row will be returned. This gives that next higher row double
the odds of being returned when compared to a row that is not next to
an unavailable row.
Say I have ids 1-10 in my table. The first call selects 4, and marks
it as unavailable. The next call selects 5 and marks it unavailable.
Now on the third call, numbers 1-3 and 7-10 have a 1:10 chance of
being selected (if the randomization routine returns the specific
number out of the 10 choices), but 6 has a 3:10 chance of being
selected (if the randomization call returns 4, 5, or 6 out of the 10
choices).
Like I said, this clustering is often irrelevant, but in some cases it
can be a big problem.
Response from www.greggdev.com
How 'bout a new indexed column (id2) ... run a script on the table to generate numbers in this column 1,2,3,4.... from the beginning of the table to the end, then query to this column instead of the primary id column. When new blocks of missing id's develop, just rerun the script to resequence the id2 column.
Response from Barney Boisvert
Yeah, that would solve it. You could even keep it up to date in real
time. When you do your select run
UPDATE myTable SET
id2 = id2 - 1
WHERE id2 > $justSelectedId2
Definitely keep things truly random, though with updating the index
all the time, it might be slower than just ORDER BY rand(). If you're
not introducing empty blocks, though, then this would be a very good
solution. For that matter, just rebuilding id2 automatically in the
wee hours of the morning would probably keep you close enough to fully
randomized even if you were making holes throughout the day. Might
take a while with large tables, though.
Quite an interesting idea. Have to stew it over a little more in my
head, but I just might do it.
Response from Barney Boisvert
This topic came up again on a mailing list I subscribe to, and I gave
your page as a reference. Just thought I'd follow up and say that I
did implement the extra column specific for the randomization, and it
works pretty well. The rebuilding is fairly quick (about a minute for
a 45,000 row table), though the SELECT was much slower (from a few
milliseconds with the primary key up to 50-70ms with the non-primary
unique key). I'm guessing that's because the primary key is
clustered, which makes the relative lookups very fast, but the
secondary unique key still has to do some non-sequential searching.
However, what I found was that changing the select to just an equality
comparison brought the speed back.
That equality check made it fast, but it also makes it so that holes
in the numbering provide a failure point (you might pick a random ID
that doens't have a record), rather than just a bias on the
randomization. But, as long as you keep your table pretty well
maintained, it's significantly cheaper to just keep running the
randomized selection until you get a row, rather than trying to beat
it with a non-brute-force solution.
So the overall algorithm looks like this (in psuedocode).
function randomRow(table, column) {
var maxRow = query("SELECT MAX($column) AS maxID FROM $table");
var randomID;
var randomRow;
do {
randomID = randRange(1, maxRow.maxID);
randomRow = query("SELECT * FROM $table WHERE $column = $randomID");
} while (randomRow.recordCount == 0);
return randomRow;
}
That runs in about the same times as the range comparison on the
primary key (rolling around between 0-3 ms). That's all with that
same ~45,000 row table.
Response from www.greggdev.com:
How 'bout using limit 1 with a greater than, less than type query... I think it will be about as fast as using =
e.g. query("SELECT * FROM $table WHERE $column >= $randomID limit 1");
Response from Barney Boisvert
Good call. That works perfectly, and much more elegant than the loop.
From: Colin Guthrie
User comments suggest that if the blocks of the key field are deleted then it all breaks down.
Could you "select count(*) AS x from table".
Get random number y between 1 and x.
Then a "select xxxx from table limit y,1"
That would get you your random row regardless of keys.
I have a 30 row table.
I delete rows 10 - 20.
I now have a 19 row table.
So SELECT COUNT(*) FROM table; gives me 19.
I pick a random number x between 0 and 18.
SELECT * FROM table LIMIT x,1; should then return the xth row of the
table. This has no relation what so ever to the ids in the table (note I
use COUNT(*) not MAX(id))
After I sent my initial mail I noticed this:
http://dev.mysql.com/doc/mysql/en/select.html
<snip>
Posted by David Phillips on April 2 2003 9:15am [Delete] [Edit]
This method of selecting a random row should be fast:
LOCK TABLES foo READ;
SELECT FLOOR(RAND() * COUNT(*)) AS rand_row FROM foo;
SELECT * FROM foo LIMIT $rand_row, 1;
UNLOCK TABLES;
Unfortunately, variables cannot be used in the LIMIT clause, otherwise
the entire thing could be done completely in SQL.
</snip>
So this person agrees with my method!
There was a comment from someone else about including a where clause,
but I couldn't get his logic and I believe the method above would work
for him too.
More Articles... Census Data Resources Google Toolbar PageRank not Displaying Affiliate Data Feeds Future date with PHP Lookup domain names from an IP address How to change web hosts Mozilla Googlebot directs regular Googlebot/2.1 Evaluating Web Hosting Reviews Web Host Review Google's oligarchy of websites Google: 301 Redirects reappear in index after site banned Googlebot/2.1 Mozilla/5.0 not obeying robots.txt SEO - The Other Side Search Engine Submission Tips SEO after Google's Florida Update
Return to Article Menu
|
|

Gregg Website Tools
|