Bug 817198

Summary: Queueing/dequeueing thousands of jobs causes pbs scheduling to fail
Product: [Fedora] Fedora
Reporter: bob mckay <urilabob>
Component: torque
Assignee: Steve Traylen <steve.traylen>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 16
CC: fotis, steve.traylen
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2013-02-13 22:26:46 UTC
Type: Bug

Description bob mckay 2012-04-28 02:04:31 UTC
Description of problem:
A user queues and then dequeues thousands of jobs (using qdel). Any jobs queued afterwards fail to run. I can't find the exact reason: systemctl status reports pbs_sched, pbs_server and munge as OK, and xpbsmon shows the nodes as active.

However, munge appears to be in a bad state, despite systemctl status munge.service saying it's OK: if I run systemctl restart munge.service, the restart fails. So this may well be a munge bug at root. Unfortunately, I can't find any diagnostics directly confirming this.

This problem is reasonably serious, because at present, the only way I can recover from it is to reboot the host on which pbs_server is running.

I apologise in advance that it will be difficult for me to supply much more information on this, because killing our server to recreate the state is rather disruptive. (The user has been asked to stop queueing and dequeueing so many jobs.)

Version-Release number of selected component (if applicable): 3.0.3


How reproducible: Always


Steps to Reproduce:
1. User queues and dequeues thousands of jobs
2. Queue another job
  
Actual results:
Job fails to run

Expected results:
Job runs
(i.e. user actions shouldn't be able to crash pbs)
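
For anyone who wants to try this on a disposable test cluster rather than a production server, the steps above could be sketched roughly as follows. This is only a guess at the user's workload, not what the user actually ran: the job body ("sleep 60"), the job count, and the choice to queue everything before deleting are all assumptions. The helper names (churn_jobs, qsub_sleep, qdel) are made up for this sketch; only the qsub and qdel commands themselves come from the report.

```python
import subprocess


def churn_jobs(n, submit, delete):
    """Queue n jobs, then dequeue every one of them; return the job ids."""
    job_ids = [submit() for _ in range(n)]
    for job_id in job_ids:
        delete(job_id)
    return job_ids


def qsub_sleep():
    """Submit a trivial job; qsub reads the script from stdin and prints the job id."""
    # "sleep 60" is an assumed placeholder job body, not taken from the report.
    result = subprocess.run(
        ["qsub"], input="sleep 60\n",
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def qdel(job_id):
    """Dequeue one job by id."""
    subprocess.run(["qdel", job_id], check=True)


if __name__ == "__main__":
    # 5000 is an assumed stand-in for "thousands"; the report gives no exact count.
    churn_jobs(5000, qsub_sleep, qdel)
```

After the run, step 2 is simply one more qsub, followed by checking whether that job ever starts.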

Additional info:
Not sure what to supply. 

In pbs_sched log I see many of:
04/28/2012 10:27:24;0040; pbs_sched;Job;170458.localhost;Not enough of the right type of nodes available

...but actually, there are plenty of nodes available according to xpbsmon
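
To see how many jobs are hitting this message, and how often, the scheduler log could be tallied with something like the sketch below. It assumes every line has the same six ';'-separated fields as the sample above (timestamp, code, daemon, object type, object id, message); that layout is inferred from the one line quoted here, and the function names are made up for this sketch.

```python
import sys
from collections import Counter

NODE_MSG = "Not enough of the right type of nodes available"


def parse_sched_line(line):
    """Split one pbs_sched log line into its six fields, or return None."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # not in the assumed six-field layout
    keys = ("timestamp", "code", "daemon", "object_type", "object_id", "message")
    return dict(zip(keys, (p.strip() for p in parts)))


def count_node_shortage(lines):
    """Count, per job id, how often the node-shortage message was logged."""
    counts = Counter()
    for line in lines:
        rec = parse_sched_line(line)
        if rec and rec["object_type"] == "Job" and rec["message"] == NODE_MSG:
            counts[rec["object_id"]] += 1
    return counts


if __name__ == "__main__":
    # Pass the path to a pbs_sched log file as the only argument.
    with open(sys.argv[1]) as fh:
        for job_id, n in count_node_shortage(fh).most_common(10):
            print(job_id, n)
```

If the same few job ids account for nearly all of the messages while xpbsmon still shows free nodes, that would at least narrow down which jobs the scheduler is refusing to place.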


I don't see anything unusual at all in pbs_server log.

Comment 1 bob mckay 2012-04-28 02:19:20 UTC
Related munge bug report is https://bugzilla.redhat.com/show_bug.cgi?id=817199

Comment 2 Fedora End Of Life 2013-01-16 17:35:38 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Fedora End Of Life 2013-02-13 22:26:50 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.