

I just realized I typed "bear with me". I have no idea what bearing is actually, but since bears are quite awesome, it must be something totally cool.

On 09/30/2010 03:31 PM, Marijn Vandevoorde wrote:
Whoops. Told some people about this yesterday but left it lying around. As promised: the search for something random.
Because I don't like to do a half-assed job, I went to infinity, beyond and back to find the best way to do this. Bear with me while I sum up a few considerations I made. As we like to stay realistic, imagine these queries being run on a table with 6 kazillion records, each containing a blob field with a 10MP raw image, let's say about 20MB each. I do this shit all the time, but to save myself the wait for results, I quickly generated about 300,000 records in a single table in a test database, totalling 1 GB.
A first idea was quite simple and straightforward:
Select * from myrecords order by rand() limit 1;
Sure enough, this will give you a random record. By raising the value of the limit, even a few random records if you like. Seems simple, and simple is good, right? Not always. What this does, according to info found on the interweb, is read every record of the table (yes, all 20MB of it), generate a random value for each, and add them to a temporary table. Finally, it sorts that table by the random value and picks the first row. Keeping in mind that the values generated with rand() are of course not indexed, so sorting and searching them does take time, the lousy performance should come as no surprise. Up to the scoreboard:
- randomness: about as good as the random generator of mysql, which has proven to be quite alright
- speed: not so much. On the test db: 31.44, 58.10 and 59.07 seconds. A bit long for a pageload, maybe...
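For concreteness, here is what this first method might look like from Java with plain JDBC. This is only a sketch: the connection URL, user and password are placeholders, and the limit of 3 merely shows the several-records variant.

    import java.sql.*;

    public class RandomRecords {
        public static void main(String[] args) throws SQLException {
            // placeholder connection details; adjust to your own setup
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/test", "user", "password");
            // method 1: mysql does all the work (it reads and sorts the entire table!)
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery(
                    "select * from myrecords order by rand() limit 3"); // limit 3: three random records at once
            while (rs.next()) {
                System.out.println("picked id " + rs.getLong("id"));
            }
            rs.close();
            con.close();
        }
    }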
Let's make an improvement and come to a second attempt:
Select * from myrecords where id = (Select id from myrecords order by rand() limit 1);
expectations:
- randomness: same?
- speed: even though we're actually doing two queries here, it should be a lot better, because a lot less data is being read (my wonderful 10MP holiday pics, remember). But generating 6 kazillion random values and then sorting them does take some time.
However, I started this on my test database about 5 minutes ago... still running (and yes, I DID do something else meanwhile). WTF? I'll tell you what the fuck: in mysql, a subquery in the where clause gets re-executed over and over, for every record*. So instead of improving things, we just multiplied the work by the number of records, and on top of that we might end up with really funny results, or none at all, as the result of the nested query is different every time it's tested against the id of a record. So, getting back to the randomness, that should have been:
- randomness: a bit too random!
- time on test db: infinity!
Let's split it up and add some of our favourite programming language (all code fragments are in gypsy pseudocode and might contain syntax errors). Method number 3 then becomes:
long rand = `select id from myrecords order by rand() limit 1`;
mysqlresult randrecord = `select * from myrecords where id = rand`;
That should be better, right? Right!
- randomness: same
- speed: better, as expected. No surprises this time.
As the second query is a single-record select by its (indexed) id, I'm not going to take it into account for the methods that follow. But you can add anything between 0.00 and 0.01 seconds to every result if that makes you feel more at ease. Test setup results (so just for the first query): 1.03, 1.02 and 1.03 seconds.
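In Java, with the same placeholder connection as in the first sketch, method number 3 might look like this; the prepared statement keeps the second lookup on the indexed id column:

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method3(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(
                "select id from myrecords order by rand() limit 1");
        rs.next();
        long rand = rs.getLong(1); // the expensive part: full scan + sort, but over ids only
        rs.close();
        // second query: a cheap single-record lookup on the indexed id
        PreparedStatement ps = con.prepareStatement(
                "select * from myrecords where id = ?");
        ps.setLong(1, rand);
        return ps.executeQuery();
    }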
Let's try from a different angle now for method number 4:
in your favourite programming language:
long maxid = `select max(id) from myrecords`;
mysqlresult result = null;
while (result == null) {
    long rand = random(1, maxid); // draw a fresh rand on every iteration!
    result = `select * from myrecords where id = rand`;
    /* damn it! if result is null, the record with that id was deleted from the
       database! there's nothing we can do about it, so: try again with a new rand. */
}
a bit more code, but:
- randomness: now it's the quality of the random generator of your programming language that matters, not mysql's. But we can consider/hope that to be quite good (at least if your php version is 5 or higher, or you're using seeded mt_rand, or SecureRandom in the case of java).
- speed: on average, a lot better! These are all just simple queries. Of course there's some overhead if you have a lot of deleted records, and in extreme cases that while loop might do quite some iterations. And this assumes our identifier is auto_incrementing, of course. You could also improve it slightly by fetching min(id) as well and doing random(minid, maxid); that helps a bit if the first kazillion records have been erased. On my test db, the first query takes about 0.02 seconds (the first time; all later times it's cached). With a single select somewhere under 0.01 seconds and the in-memory code taking some microseconds at most, this should beat the previous method most of the time. Best case: 0.03 seconds; worst case: 0.02 + 0.01 times (the number of draws - 1), or something like that. But in most cases this will end up somewhere around 0.1 seconds, I think. That is, if you are using auto_increment and most records aren't being deleted.
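A possible Java version of method number 4, under the same assumptions (auto_increment ids, placeholder connection). Drawing a fresh rand on every iteration is the part that matters:

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method4(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select max(id) from myrecords");
        rs.next();
        long maxid = rs.getLong(1);
        rs.close();
        java.util.Random rnd = new java.util.Random(); // or SecureRandom, if you're paranoid
        PreparedStatement ps = con.prepareStatement(
                "select * from myrecords where id = ?");
        while (true) {
            long rand = 1 + (long) (rnd.nextDouble() * maxid); // random(1, maxid), inclusive
            ps.setLong(1, rand);
            ResultSet result = ps.executeQuery();
            if (result.next()) {
                return result; // hit a surviving record
            }
            result.close(); // that id was deleted: draw again
        }
    }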
An alternative to this, method number 5, would be to do count(id) instead and use mysql's offset (because count doesn't tell us anything about the ids, except in the perfect-world scenario where the first record has id 1, the last one has id count(id), and no record is missing in between):
long nrofrecords = `select count(id) from myrecords`; // even though everybody says it doesn't matter, i still won't do a count(*)! deal with it!
long rand = random(0, nrofrecords); // again, [min, max[
result = `select * from myrecords limit rand, 1`;
Do not mistake rand for rand() here; it's merely a poor choice of variable name by yours truly. You're welcome. But on to our expectations:
- randomness: again, depends on your programming language's random generator.
- speed: the first operation is about as fast as the speed of light; mysql (myisam, at least) keeps the row count around in its table metadata. The last query is a simple select. Or is it? It turns out the speed depends on the value of 'rand': the higher rand, the slower, since apparently mysql just walks through the records until the requested offset is reached. On the test database with rand = 254999, execution time rose to a dazzling 12.84 seconds from time to time, although it went as low as 0.20 seconds at times as well. Can be fast, but not too reliable.
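Method number 5 in the same hypothetical Java setup. The offset is concatenated into the query string, which is harmless here because we generated the number ourselves:

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method5(Connection con) throws SQLException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select count(id) from myrecords");
        rs.next();
        long nrofrecords = rs.getLong(1);
        rs.close();
        // random offset in [0, nrofrecords[
        long rand = (long) (new java.util.Random().nextDouble() * nrofrecords);
        // mysql walks the table up to the offset, so big values of rand will be slow
        return st.executeQuery("select * from myrecords limit " + rand + ", 1");
    }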
A 6th method could be to do this in your favourite programming language, avoiding the subquery mistake altogether:
long[] ids = `select id from myrecords`;
long rand = random(0, ids.length()); // beware of out-of-range errors; i'm supposing random takes a [lower, upper[ interval here
mysqlresult result = `select * from myrecords where id = ids[rand]`;
expectations:
- randomness: same as previous
- speed: pretty good. The first query of course returns a big result set, but we only need the ids, so it should stay manageable. Afterwards we need just one more query, and we are guaranteed a result right away, unlike the previous method. This might be a better idea in case a lot of records are missing, or you're using random identifiers (or string identifiers, for that matter). The first query is obviously where most of the time goes, and it took 0.71 seconds the first time but only 0.27 seconds the next few times. The downside is that, in the case of our 6 kazillion records with insanely long ids (obviously, because int or long won't do in that case), we might need a bit of memory to store the intermediate result. Swapping would work as well, but... yeah, you get the picture.
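And method number 6, same assumptions as the earlier sketches; note the indirection through the ids list (we pick a random position, then use the id stored there):

    // assumes: import java.sql.*; import java.util.*; and a Connection con as in the first sketch
    static ResultSet method6(Connection con) throws SQLException {
        List<Long> ids = new ArrayList<Long>();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select id from myrecords");
        while (rs.next()) {
            ids.add(rs.getLong(1)); // ids only, so the intermediate result stays manageable
        }
        rs.close();
        long id = ids.get(new Random().nextInt(ids.size())); // random position in [0, size[
        PreparedStatement ps = con.prepareStatement(
                "select * from myrecords where id = ?");
        ps.setLong(1, id);
        return ps.executeQuery(); // guaranteed to find a record, unlike method 4
    }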
Happy? Quite! But not as happy as we could be. You know what would be really awesome? To do this in one query!
let's have a go at it. Method number 7:
Select a.* from myrecords as a where id >= (select floor(max(b.id) * rand()) from myrecords as b) order by id limit 1;
We still might have some ordering to do here (what if the random value turns out to be 0? then we have to order the entire table!), but not really: as id is indexed, the work is already mostly done by a nice internal b-tree. However, the first time I ran this query, we were talking minutes instead of seconds. A closer look shows we might have run into the old "subquery in a where clause" trap again.
There is, however, a way to have mysql "cache" the subquery. It's a bit ugly, but you can wrap your subquery in another subquery (sidenote: we could have done this in attempt nr. 2 as well). In this case, our query becomes:
Select a.* from myrecords as a where id >= (select d.edge from (select floor(max(b.id) * rand()) as edge from myrecords as b) as d) order by id limit 1;
I apologize if that hurt a bit, but it does work! Time's back to somewhere between 0.10 and 2.50 seconds on the test db. The big differences spring from the number of records left to order after the subquery. Also note that this is actually a small variation on the 4th method. The loss of time here is caused by the use of >=, which leaves us with (quite) some records to order before we have a result. If we used = instead and dropped the order by and limit parts, we'd be doing exactly what method number 4 does.
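From code, method number 7 then collapses to a single round trip (same placeholder connection; the inner wrap is what makes mysql evaluate the random edge only once):

    // assumes: import java.sql.*; and a Connection con as in the first sketch
    static ResultSet method7(Connection con) throws SQLException {
        return con.createStatement().executeQuery(
                "select a.* from myrecords as a"
                + " where a.id >= (select d.edge from"
                + " (select floor(max(b.id) * rand()) as edge from myrecords as b) as d)"
                + " order by a.id limit 1");
    }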
So the bottom line of this search for the best way to randomly pick a record from a database table: don't try to do it too nicely. Aesthetics are good and nice, but they aren't everything.
Remarks are welcome; I'm making a paper out of this!
* I guess this is as good a time as any to add that the behaviour of subqueries in where clauses is one of the weak spots of mysql (amongst some other stuff, alas). If you're using postgres, you might get the expected results right away, without wrapping selects or storing intermediate results outside the database in a programming language.
--
Marijn Vandevoorde
Tel: +32 (0)9 331 36 95
Fax: +32 (0)9 331 38 09
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
marijn.vandevoorde@psb.vib-ugent.be
http://www.psb.vib-ugent.be
==================================================================
"Fat bottomed girls, you make the rockin' world go round." --F. Mercury

Better than "bare with me": http://www.wsu.edu/~brians/errors/bare.html

On 30/09/2010 17:19, Marijn Vandevoorde wrote:
I just realized I typed "bear with me". I have no idea what bearing is actually, but since bears are quite awesome, it must be something totally cool.

I accidentally right!

On 09/30/2010 05:21 PM, Thomas Abeel wrote:
Better than "bare with me"
http://www.wsu.edu/~brians/errors/bare.html

participants:
- Marijn Vandevoorde
- Thomas Abeel