Gregg website development, website maintenance and e-commerce solutions
Services Portfolio Pricing Free Quote Company Contact Us FAQs
Articles

Services
:: Website Development
:: E-Commerce Development
:: Website Management
:: Website Maintenance
:: Website Upgrades
:: Website Search
:: Custom Scripts
:: Search Engine Optimization
:: WEBSITE HOSTING
:: WEBSITE DIRECTORY

Website Tools
:: Website Search
:: Keyword Density
:: HTTP Header Viewer
:: Reverse DNS Lookup
:: DNS Lookup
:: Newsletter

Portfolio

Pricing

Free Quote

Contact Us

Company

FAQs

Links

PowWeb Hosting - $7.77/month!
Recent Articles
Census Data Resources
Google Toolbar PageRank not Displaying
Affiliate Data Feeds
Future date with PHP
Lookup domain names from an IP address
How to change web hosts
Mozilla Googlebot directs regular Googlebot/2.1
Evaluating Web Hosting Reviews
Web Host Review
Google's oligarchy of websites
Google: 301 Redirects reappear in index after site banned
Googlebot/2.1 Mozilla/5.0 not obeying robots.txt
MySQL Select Random Row Fast
SEO - The Other Side
Search Engine Submission Tips
SEO after Google's Florida Update
Read more at Matt Gregg Blog


Home  ::  Articles  ::  MySQL Select Random Row Fast

Host 6 Domains on 1 Bluehost Account $6.95 Per Month

MySQL Select Random Row Fast
Last Updated: 2004-10-15 15:06:08

As of this writing, there is no automatic way for MySQL to select a random row from a database table. Extracting a random row from a table can be useful for many reasons. It can pull up random products, random advertisments, or any other random thing that you have stored in your database table.

For most purposes on smaller database tables, the following will work fine:

$random_row = mysql_fetch_row(mysql_query("select * from YOUR_TABLE order by rand() limit 1"));

$random_row will be an array containing the data extracted from the random row. However, when the table is large (over about 10,000 rows) this method of selecting a random row becomes increasingly slow with the size of the table and can create a great load on the server. I tested this on a table I was working that contained 2,394,968 rows. It took 717 seconds (12 minutes!) to return a random row.

I wrote the script below as a workaround for a random row with a large database table. This will return a random row in about 0.05 seconds, regardless of the size of the table. If the max_id of the table is not dynamically changing, the function can be rewritten to only execute one database query instead of two.

I am using this script for a new search engine spidering project I am working on. With this, it is useful to have an tinyint(1) column in the table that is updated as the row is selected so that it will not be chosen again. Since the function uses greater than and less than, it would not return an empty result unless all rows had been updated at which point the script could be ended. For such a script this also allows many updaters to be running at once selecting random rows that will not be selected again because the tinyint will be changed upon selection.



<?php
  
//CODE FROM WWW.GREGGDEV.COM
  
function random_row($table, $column) {
      
$max_sql = "SELECT max(" . $column . ")
                  AS max_id
                  FROM "
. $table;
      
$max_row = mysql_fetch_array(mysql_query($max_sql));
      
$random_number = mt_rand(1, $max_row['max_id']);
      
$random_sql = "SELECT * FROM " . $table . "
                     WHERE "
. $column . " >= " . $random_number . "
                     ORDER BY "
. $column . " ASC
                     LIMIT 1"
;
      
$random_row = mysql_fetch_row(mysql_query($random_sql));
      if (!
is_array($random_row)) {
          
$random_sql = "SELECT * FROM " . $table . "
                         WHERE "
. $column . " < " . $random_number . "
                         ORDER BY "
. $column . " DESC
                         LIMIT 1"
;
          
$random_row = mysql_fetch_row(mysql_query($random_sql));
      }
      return
$random_row;
  }
  
  
//USAGE
  
echo '<pre>';
  
print_r(random_row('YOUR_TABLE', 'YOUR_COLUMN'));
  echo
'</pre>';
?>
More information on selecting random rows from a MySQL database table can be found here:
MySQL Manual

Visitor Comments:
From: Barney Boisvert
I found your article on selecting a random row from MySQL (http://www.greggdev.com/web/articles.php?id=6) when I was looking for a new solution (i was actually using what you proprosed). The reason I was looking for a new solution was that my present solution (and yours) is only as random as the uniformity of keys across the full key range. So if I delete a large block of keys from the middle, suddenly the items on the border of that deleted block have a much higher chance of being selected than the keys not bordering a deleted block. In most cases, this doesn't really matter, but it is a downside to the method, and something I'd say is worth mentioning in the article, so people don't assume that the behaviour is as random as the PRNG (such as using ORDER BY rand() LIMIT 1 provides).

Response From www.greggdev.com:
Thank you, very good point. I normally use this with columns that are only selected once and marked as selected when they are, this makes the issue you have outlined not have effect. You could have a column that counts the number of times a row is selected and only allow a certain number by your query to select rows that have a count below this number... increment it when it is selected so it won't be selected again until you increase the count selectable in your query.

Response from Barney Boisvert
If you're using a scheme where you can only select a given row once (I think you mistyped 'column' instead of 'row'), then the problem still persists. If your randomization routine returns the id of a row that has already been returned (and marked as unavailable), then the next higher row will be returned. This gives that next higher row double the odds of being returned when compared to a row that is not next to an unavailable row. Say I have ids 1-10 in my table. The first call selects 4, and marks it as unavailable. The next call selects 5 and marks it unavailable. Now on the third call, numbers 1-3 and 7-10 have a 1:10 chance of being selected (if the randomization routine returns the specific number out of the 10 choices), but 6 has a 3:10 chance of being selected (if the randomization call returns 4, 5, or 6 out of the 10 choices). Like I said, this clustering is often irrelevant, but in some cases it can be a big problem.

Response from www.greggdev.com
How 'bout a new indexed column (id2) ... run a script on the table to generate numbers in this column 1,2,3,4.... from the beginning of the table to the end, then query to this column instead of the primary id column. When new blocks of missing id's develop, just rerun the script to resequence the id2 column.

Response from Barney Boisvert
Yeah, that would solve it. You could even keep it up to date in real time. When you do your select run
UPDATE myTable SET
id2 = id2 - 1
WHERE id2 > $justSelectedId2
Definitely keep things truly random, though with updating the index all the time, it might be slower than just ORDER BY rand(). If you're not introducing empty blocks, though, then this would be a very good solution. For that matter, just rebuilding id2 automatically in the wee hours of the morning would probably keep you close enough to fully randomized even if you were making holes throughout the day. Might take a while with large tables, though. Quite an interesting idea. Have to stew it over a little more in my head, but I just might do it.

Response from Barney Boisvert
This topic came up again on a mailing list I subscribe to, and I gave your page as a reference. Just thought I'd follow up and say that I did implement the extra column specific for the randomization, and it works pretty well. The rebuilding is fairly quick (about a minute for a 45,000 row table), though the SELECT was much slower (from a few milliseconds with the primary key up to 50-70ms with the non-primary unique key). I'm guessing that's because the primary key is clustered, which makes the relative lookups very fast, but the secondary unique key still has to do some non-sequential searching. However, what I found was that changing the select to just an equality comparison brought the speed back.

That equality check made it fast, but it also makes it so that holes in the numbering provide a failure point (you might pick a random ID that doens't have a record), rather than just a bias on the randomization. But, as long as you keep your table pretty well maintained, it's significantly cheaper to just keep running the randomized selection until you get a row, rather than trying to beat it with a non-brute-force solution.

So the overall algorithm looks like this (in psuedocode).

function randomRow(table, column) {
var maxRow = query("SELECT MAX($column) AS maxID FROM $table");
var randomID;
var randomRow;
do {
randomID = randRange(1, maxRow.maxID);
randomRow = query("SELECT * FROM $table WHERE $column = $randomID");
} while (randomRow.recordCount == 0); return randomRow;
}

That runs in about the same times as the range comparison on the primary key (rolling around between 0-3 ms). That's all with that same ~45,000 row table.

Response from www.greggdev.com:
How 'bout using limit 1 with a greater than, less than type query... I think it will be about as fast as using =
e.g. query("SELECT * FROM $table WHERE $column >= $randomID limit 1");

Response from Barney Boisvert
Good call. That works perfectly, and much more elegant than the loop.


From: Colin Guthrie
User comments suggest that if the blocks of the key field are deleted then it all breaks down.

Could you "select count(*) AS x from table".
Get random number y between 1 and x.
Then a "select xxxx from table limit y,1"

That would get you your random row regardless of keys.

I have a 30 row table.

I delete rows 10 - 20.

I now have a 19 row table.

So SELECT COUNT(*) FROM table; gives me 19.

I pick a random number x between 0 and 18.

SELECT * FROM table LIMIT x,1; should then return the xth row of the table. This has no relation what so ever to the ids in the table (note I use COUNT(*) not MAX(id))

After I sent my initial mail I noticed this:
http://dev.mysql.com/doc/mysql/en/select.html

<snip> Posted by David Phillips on April 2 2003 9:15am [Delete] [Edit]

This method of selecting a random row should be fast:

LOCK TABLES foo READ; SELECT FLOOR(RAND() * COUNT(*)) AS rand_row FROM foo; SELECT * FROM foo LIMIT $rand_row, 1; UNLOCK TABLES;

Unfortunately, variables cannot be used in the LIMIT clause, otherwise the entire thing could be done completely in SQL. </snip>

So this person agrees with my method!

There was a comment from someone else about including a where clause, but I couldn't get his logic and I believe the method above would work for him too.


More Articles...
Census Data Resources
Google Toolbar PageRank not Displaying
Affiliate Data Feeds
Future date with PHP
Lookup domain names from an IP address
How to change web hosts
Mozilla Googlebot directs regular Googlebot/2.1
Evaluating Web Hosting Reviews
Web Host Review
Google's oligarchy of websites
Google: 301 Redirects reappear in index after site banned
Googlebot/2.1 Mozilla/5.0 not obeying robots.txt
SEO - The Other Side
Search Engine Submission Tips
SEO after Google's Florida Update


Return to Article Menu



Gregg Website Tools
Search Enable Website
Add a search feature to any website.
Keyword Density
Check the keyword density of your website.
HTTP Header Viewer
View the HTTP headers of any webpage.
Reverse DNS Lookup
Get the host name of an IP Address.
DNS Lookup
Get the IP Address of a website.
Spider Simulator
Simulate a SE Spider.
Subscribe Today
Gregg Website Development Newsletter
Featured Sites
EyeBike
EyeBike.com offers an online search engine and price comparison for bicycle parts featuring many online vendors.

Lake Tahoe Real Estate
iPods
Hesperia Real Estate
Host 6 Domains on 1 Bluehost Account $6.95 Per Month
Services :: Portfolio :: Pricing :: Free Quote :: Company :: Contact Us :: FAQs
We're located in the High Desert, Southern California
Copyright © 2004, Gregg Website Development, Privacy Policy
PO Box 400308, Hesperia, CA 92340