Re: [Binari] cluster problems of last 2 days

Hall of shame! (Not me though, I haven't used the cluster in months.)
Frederik Delaere wrote:
Hi Cluster people,
As you might have noticed, there were cluster problems over the last two days. The partition where the grid engine is installed was full. I cleaned it up, but it filled up again within a couple of hours, which made the whole grid engine fail, and jobs got lost.
It took me a while to figure out what went wrong, but I found the problem:
somebody submitted a couple of thousand jobs with 2 megabytes of parameters per job,
e.g. "qsub somejob.pl <insert 2 megabytes of parameters here>".
Grid engine stores all these commands while the jobs are in the queue, and it stores them on its own partition rather than on a data partition, so the partition went from 21% usage to 99% usage in a couple of hours. To solve this, the jobs had to be changed to read their parameters from a file stored somewhere on the NAS.
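A minimal sketch of that fix, assuming one parameter per line with no embedded newlines. The helper names (write_params, read_params, qsub_command, queued_bytes), the ".params" extension and the figure of 2000 jobs are illustrative assumptions, not details of the original jobs. Python stands in for the submission wrapper, a temporary directory stands in for the NAS, and the command is only printed, never submitted.

```python
import shlex
import tempfile
from pathlib import Path


def write_params(params, params_dir, job_name):
    """Write one job's parameters to a file, one per line, and return its path."""
    path = Path(params_dir) / f"{job_name}.params"  # hypothetical naming scheme
    path.write_text("\n".join(params) + "\n")
    return path


def read_params(params_file):
    """What the job itself would do at start-up: load its parameters from the file."""
    return Path(params_file).read_text().splitlines()


def qsub_command(script, params_file):
    """Build the short command line: only the file path is passed to the job."""
    return ["qsub", script, str(params_file)]


def queued_bytes(n_jobs, bytes_per_command):
    """Rough size of the command lines kept by the scheduler while jobs are queued."""
    return n_jobs * bytes_per_command


if __name__ == "__main__":
    # Stand-in for a directory on the NAS, so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as nas_dir:
        params = [f"--gene=G{i:06d}" for i in range(1000)]  # made-up parameters
        params_file = write_params(params, nas_dir, "job0001")
        cmd = qsub_command("somejob.pl", params_file)
        print(shlex.join(cmd))  # printed only; nothing is submitted

        # Assumed 2000 jobs ("a couple of thousand") at 2 MiB of inline parameters each.
        inline = queued_bytes(2000, 2 * 1024 * 1024)
        by_file = queued_bytes(2000, len(shlex.join(cmd)))
        print(f"inline: ~{inline / 1024**3:.1f} GiB, by file: ~{by_file / 1024:.0f} KiB")
```

With the parameters in a file, each queued command is only as long as the script name plus one path, so the space taken on the grid engine partition stays small no matter how many parameters a job has.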
So in the future, tell new people about this, or it will happen again some day.
frederik
--
Michiel Van Bel
PhD student
Tel: +32 (0)9 331 36 95, fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
mibel@psb.vib-ugent.be
http://www.psb.vib-ugent.be

I'm guessing it was tobus :p
Sofie
Michiel Van Bel wrote:
Hall of shame! (Not me though, I haven't used the cluster in months.)
--
Sofie Van Landeghem
PhD Student
VIB Department of Plant Systems Biology, Ghent University
Bioinformatics and Evolutionary Genomics
Technologiepark 927, 9052 Gent, BELGIUM
Tel: +32 (0)9 331 36 95, fax: +32 (0)9 3313809
Website: http://bioinformatics.psb.ugent.be

Frederik Delaere wrote:
Don't know if he's around much. He sits a couple of tables away from you, he's already on the wall of shame, and he wasn't at the brainstorm. OK, that was an enormous hint :p
Lieven Sterck wrote:
Close to me? Or someone who isn't here very often (and who has done this before ;-) )
Frederik Delaere wrote:
Somebody over there is going to turn red, haha
Sebastian Proost wrote:
I second that motion !
(wasn't me though)
--
Sebastian Proost
PhD Student
Tel: +32 (0)9 33 13 822, fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
sebastian.proost@psb.vib-ugent.be
http://www.psb.ugent.be
"If I knew what I was doing, it wouldn't be called research." --Albert Einstein
_______________________________________________ Binari Implicitly Neglects All Recursive Iterations https://maillist.psb.ugent.be/mailman/listinfo/binari
--
Lieven Sterck, PhD
Tel: +32 (0)9 3313821, Fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, UGent
Bioinformatics and Evolutionary Genomics Division
Technologiepark 927, B-9052 Gent, Belgium
Email: lieven.sterck@psb.vib-ugent.be
Website: http://bioinformatics.psb.ugent.be
"You tried your best and you failed miserably. The lesson is: never try!" - H. Simpson

DAMN.... shot and missed ... :-/
Joost Van Den Cruyce wrote:
Nope, haven't touched the cluster in weeks. Nice try, though.
--
Joost Van Den Cruyce
Tel: +32 (0)9 331 38 92, fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
jocru@psb.vib-ugent.be
http://www.psb.vib-ugent.be
-- Lieven Sterck, PhD

JOOST, JOOST, JOOST!! ?? No..... :-)
-- Lieven Sterck, PhD

Ah, crap, now I've already filled in tobus on the HoS: https://bioinformatics.psb.ugent.be/knowledge/wiki-bioinformatics/Hall_of_sh...
On Fri, 14 Aug 2009, Lieven Sterck wrote:
JOOST, JOOST, JOOST!! ?? No..... :-)
--
Kenny Billiau
Web Developer
Tel: +32 (0)9 331 36 95, fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
kenny.billiau@ugent.be
http://bioinformatics.psb.ugent.be
"What we feel isn't important. The only question is what we do." -- Rohl

Frederik Delaere wrote:
I just saw that you've already added him to the hall of shame :) (Could be that he was there after all, then. I thought a certain entry in that hall was also his, but that one turned out to be Joost's.) Any casualties yet?
Kenny Billiau wrote:
Ah, crap, now I've already filled in tobus on the HoS: https://bioinformatics.psb.ugent.be/knowledge/wiki-bioinformatics/Hall_of_sh...
-- Lieven Sterck, PhD
participants (6)
- Joost Van Den Cruyce
- Kenny Billiau
- Lieven Sterck
- Michiel Van Bel
- Sebastian Proost
- Sofie Van Landeghem