Re: [Binari] some cluster nodes were down this morning

So.... who's going in the wall of shame this time? :) My jobs seem to be still running fine so I hope it's not me :)
Sofie
HelpDesk wrote:
Around 8 cluster nodes were down this morning. We checked the logs and found memory problems: the kernel ran out of memory and started killing processes to survive. This started yesterday evening around 19:39. Eventually the nodes became unreachable and we had to force a reset.
From node psbcls62 (8 GB RAM):
Jul 28 19:39:45 psbcls62 kernel: Out of memory: Killed process 10721 (perl).
Jul 28 19:48:50 psbcls62 kernel: Out of memory: Killed process 10734 (perl).
Jul 28 20:07:24 psbcls62 kernel: Out of memory: Killed process 10828 (perl).
Jul 28 21:25:23 psbcls62 kernel: Out of memory: Killed process 11004 (perl).
Jul 28 21:26:42 psbcls62 kernel: Out of memory: Killed process 11017 (perl).
Jul 28 21:36:15 psbcls62 kernel: Out of memory: Killed process 16474 (perl).
Please check your jobs for memory leaks or memory usage.
Luc
--
Sofie Van Landeghem
PhD Student
VIB Department of Plant Systems Biology, Ghent University
Bioinformatics and Evolutionary Genomics
Technologiepark 927, 9052 Gent, BELGIUM
Tel: +32 (0)9 331 36 95
fax:+32 (0)9 3313809
Website: http://bioinformatics.psb.ugent.be
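For anyone who wants to check which processes the kernel killed on a node, here is a minimal Python sketch that tallies the "Out of memory: Killed process" lines quoted above. The regular expression is an assumption based only on the six log lines in HelpDesk's message, and the function name `tally_oom_kills` is made up for illustration; other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the lines quoted in this thread, e.g.
# "Jul 28 19:39:45 psbcls62 kernel: Out of memory: Killed process 10721 (perl)."
# Assumption: other kernels may format this message differently.
OOM_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def tally_oom_kills(lines):
    """Count OOM kills per (host, process name) and collect the killed PIDs."""
    counts = Counter()
    pids = []
    for line in lines:
        match = OOM_LINE.match(line.strip())
        if match:
            counts[(match["host"], match["name"])] += 1
            pids.append(int(match["pid"]))
    return counts, pids


# The six lines from node psbcls62 as quoted by HelpDesk.
SAMPLE = """\
Jul 28 19:39:45 psbcls62 kernel: Out of memory: Killed process 10721 (perl).
Jul 28 19:48:50 psbcls62 kernel: Out of memory: Killed process 10734 (perl).
Jul 28 20:07:24 psbcls62 kernel: Out of memory: Killed process 10828 (perl).
Jul 28 21:25:23 psbcls62 kernel: Out of memory: Killed process 11004 (perl).
Jul 28 21:26:42 psbcls62 kernel: Out of memory: Killed process 11017 (perl).
Jul 28 21:36:15 psbcls62 kernel: Out of memory: Killed process 16474 (perl).
""".splitlines()

counts, pids = tally_oom_kills(SAMPLE)
for (host, name), n in counts.items():
    print(f"{host}: {n} OOM kills of '{name}'")
```

The PIDs alone do not say whose job it was; matching them to a user would need the scheduler's own records, which are not shown in this thread.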

Not me! I'm still not in the hall of shame :-) Besides, it's only a couple of nodes. Probably not enough to earn a place in the hall of shame.
Sofie Van Landeghem wrote:
So.... who's going in the wall of shame this time? :) My jobs seem to be still running fine so I hope it's not me :)
Sofie
HelpDesk wrote:
Around 8 cluster nodes were down this morning. We checked the logs and found memory problems: the kernel ran out of memory and started killing processes to survive. This started yesterday evening around 19:39. Eventually the nodes became unreachable and we had to force a reset.
From node psbcls62 (8 GB RAM):
Jul 28 19:39:45 psbcls62 kernel: Out of memory: Killed process 10721 (perl).
Jul 28 19:48:50 psbcls62 kernel: Out of memory: Killed process 10734 (perl).
Jul 28 20:07:24 psbcls62 kernel: Out of memory: Killed process 10828 (perl).
Jul 28 21:25:23 psbcls62 kernel: Out of memory: Killed process 11004 (perl).
Jul 28 21:26:42 psbcls62 kernel: Out of memory: Killed process 11017 (perl).
Jul 28 21:36:15 psbcls62 kernel: Out of memory: Killed process 16474 (perl).
Please check your jobs for memory leaks or memory usage.
Luc
--
==================================================================
Michiel Van Bel
PhD student
Tel:+32 (0)9 331 36 95 fax:+32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
mibel@psb.vib-ugent.be
http://www.psb.vib-ugent.be
==================================================================
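On HelpDesk's request to check jobs for memory usage: one rough way to see how much memory a job peaks at is to run it through a small wrapper and read the child's peak resident set size afterwards. The sketch below is only an illustration, not the cluster's official procedure. The helper name `run_and_report` is hypothetical, and it assumes a Linux node, where `ru_maxrss` is reported in kilobytes.

```python
import resource
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd to completion; return (exit code, peak RSS of children in KiB)."""
    result = subprocess.run(cmd)
    # Assumption: Linux, where ru_maxrss is in kilobytes. The value is the
    # peak of the largest child this process has waited for so far, so use
    # a fresh wrapper process for each job you want to measure.
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return result.returncode, peak_kib


# Demo with a throwaway Python child; for a real job the command list might
# look like ["perl", "my_job.pl"] (a hypothetical script name).
code, peak = run_and_report([sys.executable, "-c", "data = [0] * 1_000_000"])
print(f"exit code {code}, peak RSS about {peak / 1024:.1f} MiB")
```

A peak close to the node's 8 GB would suggest the job is a candidate for the kills in the log; a peak that keeps growing with run time would point towards a leak rather than a large but stable working set.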