Hi Cluster people,
As you might have noticed, there were cluster problems over the last
two days. The partition where the grid engine is installed was full.
I cleaned it up, but it was full again within a couple of hours, which
made the complete grid engine fail, and jobs got lost.
It took me a while to figure out what went wrong, but I found the
problem: somebody submitted a couple of thousand jobs, each with 2
megabytes of parameters on the command line,
e.g.: "qsub somejob.pl <insert 2 megabytes of parameters here>"
Grid engine stores all these commands while the jobs are in the queue,
and it stores them on its own partition, not somewhere on a data
partition. The partition went from 21% usage to 99% usage in a couple
of hours. To solve this, the jobs had to be changed to read these
parameters from a file stored somewhere on the NAS.
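A minimal sketch of that kind of change, as an illustration and not the
exact fix that was applied: the parameters go into a file, and only the
short path to that file is passed to qsub. PARAM_DIR, write_params,
generate_params and the file naming are hypothetical, and somejob.pl
would have to be changed to open the file given as its first argument.

```shell
#!/bin/sh
# Sketch only. PARAM_DIR is an assumption: in practice it would point
# at a directory on the NAS, not at /tmp.
PARAM_DIR="${PARAM_DIR:-/tmp/job_params}"

# Write the (large) parameter data from stdin to a per-job file and
# print the path of that file.
write_params() {
    job_name="$1"
    mkdir -p "$PARAM_DIR"
    param_file="$PARAM_DIR/$job_name.params"
    cat > "$param_file"
    printf '%s\n' "$param_file"
}

# Hypothetical usage: qsub only sees the path, not the 2 megabytes.
#   param_file=$(generate_params | write_params job0001)
#   qsub somejob.pl "$param_file"
```

This way grid engine only has to store a short path per queued job,
and the bulky data stays on the data partition.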
So in the future, please tell new people about this, or it will happen
again some day.
frederik