
Hey people, I've some problems (execution-time related) with some of my queries, involving go-labels. For those not familiar with go-labels: these labels form a directed graph (which can be cyclic), so we have child-parent relationships between the various go-labels. Now, we already have an extra table parents_go (with 2 columns: child_go, parent_go). For each parent_go, all the children (and grandchildren, and great-grandchildren,...) are included as child_go, so we do not have to travel in the graph anymore. In this case, GO:0004402 has 6 children go-labels. This query attempts to find the count of genes (with a certain go-label, in this case GO:0004402), sorted by their evidence tags. Now this is my query: SELECT COUNT(DISTINCT(`gene_go`.`gene_id`)) as 'count',`gene_go`.`evidence` FROM `gene_go` WHERE (`gene_go`.`go`='GO:0004402' OR `gene_go`.`go` IN (SELECT `child_go` FROM `go_parents` WHERE `parent_go`='GO:0004402')) GROUP BY `gene_go`.`evidence` Problem is that it takes a while friggin 27 seconds to perform (which is pretty lame for a website, even with ajax). Okay I thought, lets be smart and first get all the child_go labels from GO:0004402, then loop over them and execute a query each time, counting the genes per evidence tag only for that specific child_go label. This results in a whooping 17 ms execution time... but the results are not the same (apparently some idiots annotated some genes with both the label of a child_go and of a parent_go label). So my questions are: a) Why is the execution time for the first query so goddamn slow compared to the php-loop solution. b) Any idea how to fix the problem? I'd rather not retrieve all the genes and start manually sorting and counting them. Michiel -- ================================================================== Michiel Van Bel PhD student Tel:+32 (0)9 331 36 95 fax:+32 (0)9 3313809 VIB Department of Plant Systems Biology, Ghent University Technologiepark 927, 9052 Gent, BELGIUM mibel@psb.ugent.be http://www.psb.ugent.be ==================================================================

I think the problem with the run time is the use of the IN part. I'll think about how you can do it with a join. This is basically the same thing as Kenny asked the very first SQL challange ;) I use that solution for about everything :) Seb Michiel Van Bel wrote:
Hey people,
I've some problems (execution-time related) with some of my queries, involving go-labels. For those not familiar with go-labels: these labels form a directed graph (which can be cyclic), so we have child-parent relationships between the various go-labels.
Now, we already have an extra table parents_go (with 2 columns: child_go, parent_go). For each parent_go, all the children (and grandchildren, and great-grandchildren,...) are included as child_go, so we do not have to travel in the graph anymore. In this case, GO:0004402 has 6 children go-labels.
This query attempts to find the count of genes (with a certain go-label, in this case GO:0004402), sorted by their evidence tags.
Now this is my query:
SELECT COUNT(DISTINCT(`gene_go`.`gene_id`)) as 'count',`gene_go`.`evidence` FROM `gene_go` WHERE (`gene_go`.`go`='GO:0004402' OR `gene_go`.`go` IN (SELECT `child_go` FROM `go_parents` WHERE `parent_go`='GO:0004402')) GROUP BY `gene_go`.`evidence`
Problem is that it takes a while friggin 27 seconds to perform (which is pretty lame for a website, even with ajax).
Okay I thought, lets be smart and first get all the child_go labels from GO:0004402, then loop over them and execute a query each time, counting the genes per evidence tag only for that specific child_go label. This results in a whooping 17 ms execution time... but the results are not the same (apparently some idiots annotated some genes with both the label of a child_go and of a parent_go label).
So my questions are: a) Why is the execution time for the first query so goddamn slow compared to the php-loop solution. b) Any idea how to fix the problem? I'd rather not retrieve all the genes and start manually sorting and counting them.
Michiel
-- ================================================================== Sebastian Proost PhD Student Tel:+32 (0)9 331 36 92 fax:+32 (0)9 3313809 VIB Department of Plant Systems Biology, Ghent University Technologiepark 927, 9052 Gent, BELGIUM sepro@psb.ugent.be http://www.psb.ugent.be ================================================================== "If I knew what I was doing, it wouldn't be called research." --Albert Einstein

Apparently you're right...the subquery is executed every time again for every gene. The next best thing i can think of is this (and it performs in only 8 ms, so no problems there): * retrieve all child_go labels from a parent_go label * build a string of this type : ('go_1','go_2','go_3',...) * replace the subquery (in the original query below) with this string. Michiel Well, I can also use another solution (rather ugly, but it works as well). Sebastian Proost wrote:
I think the problem with the run time is the use of the IN part. I'll think about how you can do it with a join. This is basically the same thing as Kenny asked the very first SQL challange ;) I use that solution for about everything :)
Seb
Michiel Van Bel wrote:
Hey people,
I've some problems (execution-time related) with some of my queries, involving go-labels. For those not familiar with go-labels: these labels form a directed graph (which can be cyclic), so we have child-parent relationships between the various go-labels.
Now, we already have an extra table parents_go (with 2 columns: child_go, parent_go). For each parent_go, all the children (and grandchildren, and great-grandchildren,...) are included as child_go, so we do not have to travel in the graph anymore. In this case, GO:0004402 has 6 children go-labels.
This query attempts to find the count of genes (with a certain go-label, in this case GO:0004402), sorted by their evidence tags.
Now this is my query:
SELECT COUNT(DISTINCT(`gene_go`.`gene_id`)) as 'count',`gene_go`.`evidence` FROM `gene_go` WHERE (`gene_go`.`go`='GO:0004402' OR `gene_go`.`go` IN (SELECT `child_go` FROM `go_parents` WHERE `parent_go`='GO:0004402')) GROUP BY `gene_go`.`evidence`
Problem is that it takes a while friggin 27 seconds to perform (which is pretty lame for a website, even with ajax).
Okay I thought, lets be smart and first get all the child_go labels from GO:0004402, then loop over them and execute a query each time, counting the genes per evidence tag only for that specific child_go label. This results in a whooping 17 ms execution time... but the results are not the same (apparently some idiots annotated some genes with both the label of a child_go and of a parent_go label).
So my questions are: a) Why is the execution time for the first query so goddamn slow compared to the php-loop solution. b) Any idea how to fix the problem? I'd rather not retrieve all the genes and start manually sorting and counting them.
Michiel
-- ================================================================== Michiel Van Bel PhD student Tel:+32 (0)9 331 36 95 fax:+32 (0)9 3313809 VIB Department of Plant Systems Biology, Ghent University Technologiepark 927, 9052 Gent, BELGIUM mibel@psb.ugent.be http://www.psb.ugent.be ==================================================================
participants (2)
-
Michiel Van Bel
-
Sebastian Proost