Red Hat Bugzilla – Bug 817198
Queueing/dequeueing thousands of jobs causes pbs scheduling to fail
Last modified: 2013-02-13 17:26:50 EST
Description of problem:
A user queued and then dequeued thousands of jobs (using qdel). Any jobs queued afterwards fail to run. I can't find the exact reason: systemctl status reports pbs_sched, pbs_server and munge as all OK, and xpbsmon shows the nodes as active.
However, munge appears to be in a bad state, despite systemctl status munge.service saying it's OK. If I run systemctl restart munge.service, the restart fails. So this may well be a munge bug at root. Unfortunately, I can't find any diagnostics directly confirming this.
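A more direct check than systemctl status might be to round-trip a credential through the munge and unmunge command-line tools. The following is only a minimal sketch, assuming both tools are on the PATH of the pbs_server host. The helper name munge_roundtrip_ok and its injectable run parameter are hypothetical and not part of munge.

```python
import subprocess


def munge_roundtrip_ok(run=subprocess.run, timeout=10):
    """Return True if munge can encode a credential and unmunge can decode it.

    `run` is injectable (hypothetical parameter) so the check can be
    exercised on a machine without a munge daemon.
    """
    try:
        # `munge -n` encodes a credential with no payload and writes it to stdout.
        enc = run(["munge", "-n"], capture_output=True, timeout=timeout)
        if enc.returncode != 0:
            return False
        # `unmunge` reads the credential from stdin; non-zero exit means decode failed.
        dec = run(["unmunge"], input=enc.stdout, capture_output=True, timeout=timeout)
        return dec.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # Missing binaries or a hung daemon both count as a failed check.
        return False
```

Calling munge_roundtrip_ok() on the affected host before and after the mass qdel could show whether munge really stops issuing usable credentials, even while systemctl still reports the service as OK.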
This problem is reasonably serious, because at present, the only way I can recover from it is to reboot the host on which pbs_server is running.
I apologise in advance that it's going to be difficult for me to supply much more information on this, because killing our server to recreate the state is rather disruptive. (The user has been asked to stop queueing and dequeueing so many jobs.)
Version-Release number of selected component (if applicable): 3.0.3
How reproducible: Always
Steps to Reproduce:
1. User queues and dequeues thousands of jobs
2. Queue another job
Job fails to run
(i.e. user actions shouldn't be able to crash pbs)
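For anyone with a disposable test server, the steps above could be scripted roughly as follows. This is a hedged sketch, not the exact workload that triggered the problem: the job script, the job count and the function names are assumptions made for illustration. It relies only on qsub reading a job script from standard input and printing the job id, and on qdel taking a job id as its argument.

```python
import subprocess

# Trivial placeholder job; the real user's jobs are unknown.
JOB_SCRIPT = "#!/bin/sh\nsleep 60\n"


def submit_job(run=subprocess.run):
    """Submit one job via qsub (script on stdin) and return the job id it prints."""
    result = run(["qsub"], input=JOB_SCRIPT, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def queue_then_dequeue(count, run=subprocess.run):
    """Queue `count` jobs, then delete each one with qdel. Returns the job ids."""
    job_ids = [submit_job(run) for _ in range(count)]
    for job_id in job_ids:
        # check=False: a job may already have started or finished by the time qdel runs.
        run(["qdel", job_id], capture_output=True, text=True, check=False)
    return job_ids
```

Something like queue_then_dequeue(5000) followed by a single submit_job() would then correspond to steps 1 and 2. Do not run this against a production pbs_server, since recovery currently needs a reboot.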
Not sure what to supply.
In pbs_sched log I see many of:
04/28/2012 10:27:24;0040; pbs_sched;Job;170458.localhost;Not enough of the right type of nodes available
...but actually, there are plenty of nodes available according to xpbsmon
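To quantify how often the scheduler logs this message, the records can be split on semicolons and tallied. The sketch below assumes each record sits on one line and has the six fields seen in the sample above. The field names are my own labels, not taken from the PBS documentation.

```python
from collections import Counter

# Field names are assumed from the sample record, not from PBS documentation.
FIELDS = ("timestamp", "event", "daemon", "obj_type", "obj_name", "message")


def parse_log_line(line):
    """Split a semicolon-delimited log record into a dict, or return None if malformed."""
    # Split at most 5 times so semicolons inside the message text are preserved.
    parts = [p.strip() for p in line.rstrip("\n").split(";", 5)]
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def count_job_messages(lines):
    """Count how often each message text occurs among Job records."""
    counts = Counter()
    for line in lines:
        rec = parse_log_line(line)
        if rec and rec["obj_type"] == "Job":
            counts[rec["message"]] += 1
    return counts
```

Passing an open log file to count_job_messages and printing most_common() would show whether "Not enough of the right type of nodes available" dominates after the mass qdel. Records that have been wrapped across lines would need to be rejoined first.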
I don't see anything unusual at all in pbs_server log.
Related munge bug report is https://bugzilla.redhat.com/show_bug.cgi?id=817199
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '16'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 16's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 16 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" and open it against that version of Fedora.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.