
I'm guessing it was tobus :p
Sofie

Michiel Van Bel wrote:
Hall of shame! (Not me though, I haven't used the cluster in months.)
Frederik Delaere wrote:
Hi Cluster people,
As you might have noticed, there were cluster problems over the last two days. The partition where the grid engine is installed was full. I cleaned it up, but it filled up again within a couple of hours, which made the whole grid engine fail, and jobs got lost.
It took me a while to figure out what went wrong, but I found the problem:
Somebody submitted a couple of thousand jobs with 2 megabytes of parameters per job,
e.g.: "qsub somejob.pl <insert 2 megabytes of parameters here>"
Grid engine stores all these commands while the jobs are in the queue, and it stores them on its own partition rather than on a data partition. As a result, the partition went from 21% usage to 99% usage in a couple of hours. To solve this, the jobs had to be changed to read their parameters from a file stored somewhere on the NAS.
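For anyone wondering what "read the parameters from a file" could look like in practice, here is a rough Python sketch. It is only an illustration, not what the fixed jobs actually do: the helper names (write_params_file, build_qsub_command, read_params_file) are made up, the directory you pass in stands for whatever NAS location you use, and it assumes one parameter per line with no newlines inside a parameter.

```python
import tempfile
from pathlib import Path


def write_params_file(params, directory):
    """Write one parameter per line to a new file in `directory`; return its path.

    `directory` is a placeholder for a location on shared storage (assumption).
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # mkstemp creates a uniquely named file, so each job gets its own.
    fd, name = tempfile.mkstemp(prefix="params_", suffix=".txt", dir=directory)
    with open(fd, "w") as handle:
        handle.writelines(param + "\n" for param in params)
    return Path(name)


def build_qsub_command(script, params_path):
    """Build the submit command: only the short file path goes on the command line."""
    return ["qsub", script, str(params_path)]


def read_params_file(params_path):
    """Job side: load the parameters back from the file."""
    with open(params_path) as handle:
        return [line.rstrip("\n") for line in handle]
```

The submit side would pass the list from build_qsub_command to something like subprocess.run, and the job script would read the file named in its first argument. That way the queued command stays a few dozen bytes no matter how large the parameter set is.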
So in the future, tell new people about this, or it will happen again some day.
frederik
-- 
Sofie Van Landeghem
PhD Student
VIB Department of Plant Systems Biology, Ghent University
Bioinformatics and Evolutionary Genomics
Technologiepark 927, 9052 Gent, BELGIUM
Tel: +32 (0)9 331 36 95
Fax: +32 (0)9 3313809
Website: http://bioinformatics.psb.ugent.be