Description of problem:

When submitting jobs, they go into a "Q" state and never run. The server log file has these entries:

04/20/2016 10:37:30;0002;PBS_Server.45544;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0
04/20/2016 10:41:32;0040;PBS_Server.45545;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available

This is after configuring the system to use the correct hostname and telling pbs_server to use the correct interface and port for pbs_sched. I don't know why the "as supplied" rpms don't work when installed.

As you can see, the daemons appear to be talking to each other:

[root@redhat-test-02 sched_priv]# netstat -tapn | grep pbs
tcp        0      0 0.0.0.0:15001        0.0.0.0:*              LISTEN      45537/pbs_server
tcp        0      0 0.0.0.0:15002        0.0.0.0:*              LISTEN      45526/pbs_mom
tcp        0      0 0.0.0.0:15003        0.0.0.0:*              LISTEN      45526/pbs_mom
tcp        0      0 10.1.252.42:15004    0.0.0.0:*              LISTEN      45532/pbs_sched
tcp        1      0 10.1.252.42:769      10.1.252.42:15004      CLOSE_WAIT  45537/pbs_server

But pbs_sched never schedules a job to run.

This is the qmgr configuration:

[root@redhat-test-02 server_logs]# qmgr
Max open servers: 9
Qmgr: p s
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 4
set queue batch resources_max.ncpus = 4
set queue batch resources_max.nodes = 1
set queue batch resources_min.ncpus = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue normal
#
create queue normal
set queue normal queue_type = Execution
set queue normal max_running = 4
set queue normal resources_max.ncpus = 4
set queue normal resources_max.nodes = 1
set queue normal resources_min.ncpus = 1
set queue normal resources_default.ncpus = 1
set queue normal resources_default.neednodes = 1:ppn=1
set queue normal resources_default.nodect = 1
set queue normal resources_default.nodes = 1
set queue normal enabled = True
set queue normal started = True
#
# Create and define queue high
#
create queue high
set queue high queue_type = Execution
set queue high max_running = 4
set queue high resources_max.ncpus = 4
set queue high resources_max.nodes = 1
set queue high resources_min.ncpus = 1
set queue high resources_default.ncpus = 1
set queue high resources_default.neednodes = 1:ppn=1
set queue high resources_default.nodect = 1
set queue high resources_default.nodes = 1
set queue high enabled = True
set queue high started = True
#
# Create and define queue critical
#
create queue critical
set queue critical queue_type = Execution
set queue critical max_running = 4
set queue critical resources_max.ncpus = 4
set queue critical resources_max.nodes = 1
set queue critical resources_min.ncpus = 1
set queue critical resources_default.ncpus = 1
set queue critical resources_default.neednodes = 1:ppn=1
set queue critical resources_default.nodect = 1
set queue critical resources_default.nodes = 1
set queue critical enabled = True
set queue critical started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = redhat-test-02.wise.wa-k12.net
set server managers = sysadm.wa-k12.net
set server operators = sysadm.wa-k12.net
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = wem-lmgt-01.wise.wa-k12.net
set server next_job_number = 0
set server authorized_users = dbadm.wa-k12.net
set server moab_array_compatible = True
set server nppcu = 1
Qmgr:

Version-Release number of selected component (if applicable):
torque-scheduler-4.2.10-9.el7.x86_64

How reproducible:
Who knows; the packages are not configured to function correctly when installed.

Steps to Reproduce:
Just install the packages and try to make them work.
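Since the server log says "1 requested, 0 available", one thing worth ruling out before blaming pbs_sched is whether pbs_server sees any execution nodes at all. A diagnostic sketch (the PBS home path is an assumption based on the EPEL packaging, the hostname is taken from the output above, and <jobid> is a placeholder):

# Does the server know about any nodes, and are any of them free?
pbsnodes -a

# The nodes file must list the MOM host; the path assumes the EPEL default layout
cat /var/lib/torque/server_priv/nodes

# Why is a specific job still queued? The comment field usually says.
qstat -f <jobid> | grep -i comment

# Is the MOM reachable and reporting in to the server?
momctl -d 3 -h redhat-test-02.wise.wa-k12.net

If pbsnodes reports the node as down or offline, or the nodes file is missing or empty, that would explain the "0 available" message independently of the scheduler.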
This is also happening on RHEL6. Did you forget to compile in a valid scheduler for pbs_sched in the last update?
Kevin,

Sorry you are having issues trying to schedule nodes. I'm not sure if this may be related to the new NUMA support that I put in a few months ago. I've never run multiple queues before; my testing is primarily with a default queue setup, and MPI jobs can be scheduled and run just fine. It will take me some time to try to reproduce your scheduling environment, so in the meantime, have you run this issue by the mailing list yet?

Thanks,
- David Brown
No, the mailing list is not much help: just vague comments that pbs_sched is broken on 4.2 and vague statements that somebody, somehow, got it working again.

This problem started with the last update from EPEL. Our RHEL5 boxes are working fine, but they weren't updated. We didn't change our configuration; we just updated the software and the scheduler stopped working. Luckily our RHEL6/RHEL7 boxes don't currently use Torque in production, but we are doing a push to RHEL7 this summer and a broken Torque is not good.

Two things I noticed: pbs_sched stopped listening on the loopback device, and I had to use the '-l' switch with pbs_server to force it to communicate with pbs_sched. netstat shows the connections, but the scheduler doesn't seem to want to schedule.

Our RHEL6/RHEL7 boxes show this:

[root@wsipc-scm-01 Resource]# netstat -tapn | grep pbs
tcp        0      0 0.0.0.0:9501          0.0.0.0:*               LISTEN      32891/pbs_server
tcp        0      0 0.0.0.0:9502          0.0.0.0:*               LISTEN      33329/pbs_mom
tcp        0      0 0.0.0.0:9503          0.0.0.0:*               LISTEN      33329/pbs_mom
tcp        0      0 10.1.254.181:9504     0.0.0.0:*               LISTEN      15237/pbs_sched
tcp        1      0 10.1.254.181:843      10.1.254.181:9504       CLOSE_WAIT  32891/pbs_server

Our RHEL5 boxes show this:

[root@redhat-test-03 ~]# netstat -tapn | grep pbs
tcp        0      0 10.1.252.43:9504      0.0.0.0:*               LISTEN      17168/pbs_sched
tcp        0      0 0.0.0.0:9501          0.0.0.0:*               LISTEN      17121/pbs_server
tcp        0      0 0.0.0.0:9502          0.0.0.0:*               LISTEN      17147/pbs_mom
tcp        0      0 0.0.0.0:9503          0.0.0.0:*               LISTEN      17147/pbs_mom
[root@redhat-test-03 ~]#
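A quick way to tell "the server never contacts the scheduler" apart from "the scheduler runs a cycle but declines the job" is to watch the scheduler log while forcing a cycle. A sketch (the log path assumes the EPEL layout; re-setting scheduling to true typically kicks off an immediate scheduling iteration):

# Submit a trivial job
echo "sleep 60" | qsub

# Force a scheduling cycle instead of waiting for scheduler_iteration
qmgr -c "set server scheduling = true"

# If pbs_sched was contacted, the cycle shows up in today's scheduler log
tail -n 50 /var/lib/torque/sched_logs/$(date +%Y%m%d)

# The job's comment field should say why it is still queued
qstat -f | grep -i comment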
Updating to v6.0.1 from the Adaptive Computing web site fixes these problems. I compiled the code using the provided spec file. After some minor configuration changes, it just worked.
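For anyone following the same route, building from the upstream tarball with its bundled spec file is roughly the following (a sketch; the exact tarball name depends on the release you download):

# Build binary RPMs using the spec file shipped inside the tarball
rpmbuild -ta torque-6.0.1.tar.gz

# Install the resulting packages
yum localinstall ~/rpmbuild/RPMS/x86_64/torque-*.rpm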
torque-4.2.10-11.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670
torque-4.2.10-11.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670
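On an EPEL 7 machine with the stock repository configuration, pulling the update from the testing repository typically amounts to:

yum --enablerepo=epel-testing update 'torque*'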
EPEL 7 entered end-of-life (EOL) status on 2024-06-30. EPEL 7 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.