
Hall of shame! (Not me though, I haven't used the cluster in months.)

Frederik Delaere wrote:
Hi Cluster people,
As you might have noticed, there were cluster problems over the last two days. The partition where the grid engine is installed was full. I cleaned it up, but it filled up again within a couple of hours, which made the whole grid engine fail, and jobs got lost.
It took me a while to figure out what went wrong, but I found the problem:
Somebody submitted a couple of thousand jobs, each with about 2 megabytes of parameters on the command line,
e.g.: "qsub somejob.pl <insert 2 megabytes of parameters here>"
Grid engine stores all these commands while the jobs are in the queue, and it stores them on its own partition rather than on a data partition. As a result, the partition went from 21% usage to 99% usage in a couple of hours. To solve this, the jobs had to be changed to read their parameters from a file stored on the NAS.
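For anyone wondering what "read the parameters from a file" could look like in practice, here is a minimal sketch in Python. It is not the fix that was actually applied on the cluster; the helper name `build_qsub_command` and the directory argument are made up for illustration, and it assumes the job script is able to read its parameters from the file path it receives as its only argument. The sketch only builds the short command line and does not call qsub itself.

```python
import os
import tempfile


def build_qsub_command(script, params, param_dir):
    """Write params to a file under param_dir and return a short qsub argv.

    Hypothetical helper: param_dir stands in for a directory on the NAS.
    """
    os.makedirs(param_dir, exist_ok=True)
    # mkstemp gives every job its own uniquely named parameter file.
    fd, param_path = tempfile.mkstemp(prefix="params_", suffix=".txt", dir=param_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(params)
    # Only the file path goes through the grid engine, not the data itself.
    return ["qsub", script, param_path]


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # stand-in for a NAS directory (assumption)
    big_params = "x" * (2 * 1024 * 1024)  # roughly 2 megabytes of parameters
    command = build_qsub_command("somejob.pl", big_params, demo_dir)
    print(" ".join(command))
```

With this approach the command that grid engine has to store per queued job is a few dozen bytes instead of 2 megabytes, while the bulk data sits on a data partition.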
So in the future, tell new people about this, or it will happen again some day.
frederik
--
==================================================================
Michiel Van Bel
PhD student
Tel: +32 (0)9 331 36 95
Fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
mibel@psb.vib-ugent.be
http://www.psb.vib-ugent.be
==================================================================