

I just realized I typed "bear with me". I have no idea what bearing is actually, but since bears are quite awesome, it must be something totally cool.

On 09/30/2010 03:31 PM, Marijn Vandevoorde wrote:
Whoops. Told some people about this yesterday but left it lying around. As promised: the search for something random.
Because I don't like to do a half-assed job, I went to infinity, beyond and back to find the best way to do this. Bear with me while I sum up a few considerations I made. As we like to stay realistic, imagine these queries being run on a table with 6 kazillion records, each containing a blob field with a 10MP raw image, let's say about 20MB each. I do this shit all the time, but to save myself the wait for results, I quickly generated about 300,000 records in a single table in a test database, totalling 1 GB.
A first idea was quite simple and straightforward:
Select * from myrecords order by rand() limit 1;
Sure enough, this will give you a random record. By raising the value of the limit, even a few random records if you like. Seems simple, and simple is good, right? Not always. What this does, according to info found on the interweb, is read every record of the table (yes, all 20MB of it), generate a random value for each, and add them to a temporary table. Finally, it sorts that table by the random value and picks the first row. Keeping in mind that the values generated with rand() are of course not indexed, so sorting and searching them does take time, the lousy performance should come as no surprise. Up to the scoreboard:
- randomness: about as good as the random generator of mysql, which has proven to be quite alright
- speed: not so much. On the test db: 31.44, 58.10 and 59.07 seconds. A bit long for a pageload, maybe...
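For concreteness, here is what this first method might look like from Java with plain JDBC. This is only a sketch: the connection URL, user and password are placeholders, and the limit of 3 merely shows the several-records variant.

    import java.sql.*;

    public class RandomRecords {
        public static void main(String[] args) throws SQLException {
            // placeholder connection details; adjust to your own setup
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/test", "user", "password");
            // method 1: mysql does all the work (it reads and sorts the entire table!)
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery(
                    "select * from myrecords order by rand() limit 3"); // limit 3: three random records at once
            while (rs.next()) {
                System.out.println("picked id " + rs.getLong("id"));
            }
            rs.close();
            con.close();
        }
    }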
Let's make an improvement and come to a second attempt:
Select * from myrecords where id = (Select id from myrecords order by rand() limit 1);
expectations:
- randomness: same?
- speed: even though we're actually doing two queries here, it should be a lot better, because a lot less data is being read (my wonderful 10MP holiday pics, remember). But generating 6 kazillion random values and then sorting them does take some time.
However, I started this on my test database about 5 minutes ago... still running (and yes, I DID do something else meanwhile). WTF? I'll tell you what the fuck: in mysql, a subquery in the where clause gets re-executed over and over, for every record*. So instead of improving things, we just multiplied the work by the number of records, and on top of that we might end up with really funny results, or none at all, as the result of the nested query is different every time it's tested against the id of a record. So, getting back to the randomness, that should have been:
- randomness: a bit too random!
- time on test db: infinity!
Let's split it up and add some of our favourite programming language (all code fragments are in gypsy pseudocode and might contain syntax errors). Method number 3 then becomes:
long rand = `select id from myrecords order by rand() limit 1`;
mysqlresult randrecord = `select * from myrecords where id = rand`;
That should be better, right? Right!
- randomness: same
- speed: better, as expected. No surprises this time.
As the second query is a single-record select by its (indexed) id, I'm not going to take it into account for the methods that follow. But you can add anything between 0.00 and 0.01 seconds to every result if that makes you feel more at ease. Test setup results (so just for the first query): 1.03, 1.02 and 1.03 seconds.
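In Java, with the same placeholder connection as in the first sketch, method number 3 might look like this; the prepared statement keeps the second lookup on the indexed id column:

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method3(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(
                "select id from myrecords order by rand() limit 1");
        rs.next();
        long rand = rs.getLong(1); // the expensive part: full scan + sort, but over ids only
        rs.close();
        // second query: a cheap single-record lookup on the indexed id
        PreparedStatement ps = con.prepareStatement(
                "select * from myrecords where id = ?");
        ps.setLong(1, rand);
        return ps.executeQuery();
    }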
Let's try from a different angle now for method number 4:
in your favourite programming language:
long maxid = `select max(id) from myrecords`;
mysqlresult result = null;
while (result == null) {
    long rand = random(1, maxid); // draw a fresh rand on every iteration!
    result = `select * from myrecords where id = rand`;
    /* damn it! if result is null, the record with that id was deleted from the
       database! there's nothing we can do about it, so: try again with a new rand. */
}
a bit more code, but:
- randomness: now it's the quality of the random generator of your programming language that matters, not mysql's. But we can consider/hope that to be quite good (at least if your php version is 5 or higher, or you're using seeded mt_rand, or SecureRandom in the case of java).
- speed: on average, a lot better! These are all just simple queries. Of course there's some overhead if you have a lot of deleted records, and in extreme cases that while loop might do quite some iterations. And this assumes our identifier is auto_incrementing, of course. You could also improve it slightly by fetching min(id) as well and doing random(minid, maxid); that helps a bit if the first kazillion records have been erased. On my test db, the first query takes about 0.02 seconds (the first time; all later times it's cached). With a single select somewhere under 0.01 seconds and the in-memory code taking some microseconds at most, this should beat the previous method most of the time. Best case: 0.03 seconds; worst case: 0.02 + 0.01 times (the number of draws - 1), or something like that. But in most cases this will end up somewhere around 0.1 seconds, I think. That is, if you are using auto_increment and most records aren't being deleted.
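A possible Java version of method number 4, under the same assumptions (auto_increment ids, placeholder connection). Drawing a fresh rand on every iteration is the part that matters:

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method4(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select max(id) from myrecords");
        rs.next();
        long maxid = rs.getLong(1);
        rs.close();
        java.util.Random rnd = new java.util.Random(); // or SecureRandom, if you're paranoid
        PreparedStatement ps = con.prepareStatement(
                "select * from myrecords where id = ?");
        while (true) {
            long rand = 1 + (long) (rnd.nextDouble() * maxid); // random(1, maxid), inclusive
            ps.setLong(1, rand);
            ResultSet result = ps.executeQuery();
            if (result.next()) {
                return result; // hit a surviving record
            }
            result.close(); // that id was deleted: draw again
        }
    }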
An alternative to this, method number 5, would be to do count(id) instead and use mysql's offset (because count doesn't tell us anything about the ids, except in the perfect-world scenario where the first record has id 1, the last one has id count(id), and no record is missing in between):
long nrofrecords = `select count(id) from myrecords`; // even though everybody says it doesn't matter, i still won't do a count(*)! deal with it!
long rand = random(0, nrofrecords); // again, [min, max[
result = `select * from myrecords limit rand, 1`;
Do not mistake rand for rand() here; it's merely a poor choice of variable name by yours truly. You're welcome. But on to our expectations:
- randomness: again, depends on your programming language's random generator.
- speed: the first operation is about as fast as the speed of light; mysql (myisam, at least) keeps the row count around in its table metadata. The last query is a simple select. Or is it? It turns out the speed depends on the value of 'rand': the higher rand, the slower, since apparently mysql just walks through the records until the requested offset is reached. On the test database with rand = 254999, execution time rose to a dazzling 12.84 seconds from time to time, although it went as low as 0.20 seconds at times as well. Can be fast, but not too reliable.
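Method number 5 in the same hypothetical Java setup. The offset is concatenated into the query string, which is harmless here because we generated the number ourselves:

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method5(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select count(id) from myrecords");
        rs.next();
        long nrofrecords = rs.getLong(1);
        rs.close();
        // random offset in [0, nrofrecords[
        long rand = (long) (new java.util.Random().nextDouble() * nrofrecords);
        // mysql walks the table up to the offset, so big values of rand will be slow
        return st.executeQuery("select * from myrecords limit " + rand + ", 1");
    }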
A 6th method could be to do this in your favourite programming language, avoiding the subquery mistake altogether:
long[] ids = `select id from myrecords`;
long rand = random(0, ids.length()); // beware of out-of-range errors; i'm supposing random takes a [lower, upper[ interval here
mysqlresult result = `select * from myrecords where id = ids[rand]`;
expectations:
- randomness: same as previous
- speed: pretty good. The first query of course returns a big result set, but we only need the ids, so it should stay manageable. Afterwards we need just one more query, and we are guaranteed a result right away, unlike the previous method. This might be a better idea in case a lot of records are missing, or you're using random identifiers (or string identifiers, for that matter). The first query is obviously where most of the time goes, and it took 0.71 seconds the first time but only 0.27 seconds the next few times. The downside is that, in the case of our 6 kazillion records with insanely long ids (obviously, because int or long won't do in that case), we might need a bit of memory to store the intermediate result. Swapping would work as well, but... yeah, you get the picture.
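And method number 6, same assumptions as the earlier sketches; note the indirection through the ids list (we pick a random position, then use the id stored there):

    // assumes: import java.sql.*; import java.util.*; and a Connection con as in the first sketch
    static ResultSet method6(Connection con) throws SQLException {
        List<Long> ids = new ArrayList<Long>();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select id from myrecords");
        while (rs.next()) {
            ids.add(rs.getLong(1)); // ids only, so the intermediate result stays manageable
        }
        rs.close();
        long id = ids.get(new Random().nextInt(ids.size())); // random position in [0, size[
        PreparedStatement ps = con.prepareStatement(
                "select * from myrecords where id = ?");
        ps.setLong(1, id);
        return ps.executeQuery(); // guaranteed to find a record, unlike method 4
    }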
Happy? Quite! But not as happy as we could be. You know what would be really awesome? To do this in one query!
let's have a go at it. Method number 7:
Select a.* from myrecords as a where id >= (select floor(max(b.id) * rand()) from myrecords as b) order by id limit 1;
We still might have some ordering to do here (what if the random value turns out to be 0? then we have to order the entire table!), but not really: as id is indexed, the work is already mostly done by a nice internal b-tree. However, the first time I ran this query, we were talking minutes instead of seconds. A closer look shows we might have run into the old "subquery in a where clause" trap again.
There is, however, a way to have mysql "cache" the subquery. It's a bit ugly, but you can wrap your subquery in another subquery (sidenote: we could have done this in attempt nr. 2 as well). In this case, our query becomes:
Select a.* from myrecords as a where id >= (select d.edge from (select floor(max(b.id) * rand()) as edge from myrecords as b) as d) order by id limit 1;
I apologize if that hurt a bit, but it does work! Time's back to somewhere between 0.10 and 2.50 seconds on the test db. The big differences spring from the number of records left to order after the subquery. Also note that this is actually a small variation on the 4th method. The loss of time here is caused by the use of >=, which leaves us with (quite) some records to order before we have a result. If we used = instead and dropped the order by and limit parts, we'd be doing exactly what method number 4 does.
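From code, method number 7 then collapses to a single round trip (same placeholder connection; the inner wrap is what makes mysql evaluate the random edge only once):

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method7(Connection con) throws SQLException {
        return con.createStatement().executeQuery(
                "select a.* from myrecords as a"
                + " where a.id >= (select d.edge from"
                + " (select floor(max(b.id) * rand()) as edge from myrecords as b) as d)"
                + " order by a.id limit 1");
    }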
So the bottom line of this search for the best way to randomly pick a record from a database table: don't try to do it too nicely. Aesthetics are good and nice, but they aren't everything.
Remarks are welcome; I'm making a paper out of this!
* I guess this is as good a time as any to add that the behaviour of subqueries in where clauses is one of the weak spots of mysql (amongst some other stuff, alas). If you're using postgres, you might get the expected results right away, without wrapping selects or storing intermediate results outside the database in a programming language.
--
Marijn Vandevoorde
Tel: +32 (0)9 331 36 95
Fax: +32 (0)9 331 38 09
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
marijn.vandevoorde@psb.vib-ugent.be
http://www.psb.vib-ugent.be
==================================================================
"Fat bottomed girls, you make the rockin' world go round." --F. Mercury

Better than "bare with me": http://www.wsu.edu/~brians/errors/bare.html

On 30/09/2010 17:19, Marijn Vandevoorde wrote:
I just realized I typed "bear with me". I have no idea what bearing is actually, but since bears are quite awesome, it must be something totally cool.

I accidentally right!

On 09/30/2010 05:21 PM, Thomas Abeel wrote:
Better than "bare with me"
http://www.wsu.edu/~brians/errors/bare.html

participants:
- Marijn Vandevoorde
- Thomas Abeel