Bug 1328958 - pbs_sched doesn't appear to work
Summary: pbs_sched doesn't appear to work
Keywords:
Status: ON_QA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: torque
Version: epel7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: David Brown
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-20 17:53 UTC by Kevin L. Esteb
Modified: 2017-08-18 20:23 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:



Description Kevin L. Esteb 2016-04-20 17:53:12 UTC
Description of problem:

Submitted jobs go into the "Q" (queued) state and never run. The server log file has these entries:

04/20/2016 10:37:30;0002;PBS_Server.45544;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0
04/20/2016 10:41:32;0040;PBS_Server.45545;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
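
The second log entry above can be picked apart mechanically. The following is a minimal sketch, assuming the server log uses six semicolon-separated fields as seen in the two entries quoted here; the field names (`event_code`, `obj_class`, and so on) and the helper names are illustrative labels, not taken from the Torque source.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LogRecord(NamedTuple):
    # Field names are hypothetical labels for the six columns seen in the log.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_log_line(line: str) -> LogRecord:
    """Split one server log line into its six semicolon-separated fields."""
    # Split on the first five semicolons only, so the message may contain ';'.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    return LogRecord(*parts)


def node_counts(message: str) -> Optional[Tuple[int, int]]:
    """Return (requested, available) from a node_spec message, or None."""
    match = re.search(r"(\d+) requested, (\d+) available", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "04/20/2016 10:41:32;0040;PBS_Server.45545;Req;node_spec;"
    "job allocation request exceeds currently available cluster nodes, "
    "1 requested, 0 available"
)
record = parse_log_line(line)
requested, available = node_counts(record.message)
print(record.obj_name, requested, available)
```

For the entry quoted above, the server sees one node requested and zero available, which points at node availability as seen by pbs_server rather than at the queue limits.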

This happens even after configuring the system to use the correct hostname and telling pbs_server which interface and port to use for pbs_sched. I don't know why the RPMs don't work as supplied when installed.

As you can see, the daemons appear to be talking to each other:

[root@redhat-test-02 sched_priv]# netstat -tapn | grep pbs
tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      45537/pbs_server
tcp        0      0 0.0.0.0:15002           0.0.0.0:*               LISTEN      45526/pbs_mom
tcp        0      0 0.0.0.0:15003           0.0.0.0:*               LISTEN      45526/pbs_mom
tcp        0      0 10.1.252.42:15004       0.0.0.0:*               LISTEN      45532/pbs_sched
tcp        1      0 10.1.252.42:769         10.1.252.42:15004       CLOSE_WAIT  45537/pbs_server

But pbs_sched never schedules a job to run. 
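
One detail in the netstat output is worth a closer look: the pbs_server socket pointing at the scheduler port is in CLOSE_WAIT, which in TCP means the peer has already closed its end and the local process has not yet closed its own. The sketch below, which assumes the `netstat -tapn` column layout shown above and uses hypothetical helper names, flags such connections to a listening pbs_sched.

```python
def parse_netstat(text):
    """Parse `netstat -tapn` style rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expect: proto recv-q send-q local remote state pid/program
        if len(fields) < 7 or fields[0] != "tcp":
            continue
        pid, _, program = fields[6].partition("/")
        rows.append({
            "recv_q": int(fields[1]),
            "send_q": int(fields[2]),
            "local": fields[3],
            "remote": fields[4],
            "state": fields[5],
            "pid": pid,
            "program": program,
        })
    return rows


def stale_scheduler_links(rows, sched_program="pbs_sched"):
    """Return CLOSE_WAIT rows whose remote end is a scheduler listen address."""
    listen_addrs = {
        r["local"] for r in rows
        if r["state"] == "LISTEN" and r["program"] == sched_program
    }
    return [
        r for r in rows
        if r["state"] == "CLOSE_WAIT" and r["remote"] in listen_addrs
    ]


sample = """\
tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      45537/pbs_server
tcp        0      0 0.0.0.0:15002           0.0.0.0:*               LISTEN      45526/pbs_mom
tcp        0      0 0.0.0.0:15003           0.0.0.0:*               LISTEN      45526/pbs_mom
tcp        0      0 10.1.252.42:15004       0.0.0.0:*               LISTEN      45532/pbs_sched
tcp        1      0 10.1.252.42:769         10.1.252.42:15004       CLOSE_WAIT  45537/pbs_server
"""
stale = stale_scheduler_links(parse_netstat(sample))
for row in stale:
    print(row["program"], row["local"], "->", row["remote"], row["state"])
```

On the output above this reports the single pbs_server connection to 10.1.252.42:15004. That may suggest the scheduler accepted the connection and then closed it, rather than the two daemons holding an active conversation.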

This is the qmgr configuration.

[root@redhat-test-02 server_logs]# qmgr
Max open servers: 9
Qmgr: p s
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 4
set queue batch resources_max.ncpus = 4
set queue batch resources_max.nodes = 1
set queue batch resources_min.ncpus = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue normal
#
create queue normal
set queue normal queue_type = Execution
set queue normal max_running = 4
set queue normal resources_max.ncpus = 4
set queue normal resources_max.nodes = 1
set queue normal resources_min.ncpus = 1
set queue normal resources_default.ncpus = 1
set queue normal resources_default.neednodes = 1:ppn=1
set queue normal resources_default.nodect = 1
set queue normal resources_default.nodes = 1
set queue normal enabled = True
set queue normal started = True
#
# Create and define queue high
#
create queue high
set queue high queue_type = Execution
set queue high max_running = 4
set queue high resources_max.ncpus = 4
set queue high resources_max.nodes = 1
set queue high resources_min.ncpus = 1
set queue high resources_default.ncpus = 1
set queue high resources_default.neednodes = 1:ppn=1
set queue high resources_default.nodect = 1
set queue high resources_default.nodes = 1
set queue high enabled = True
set queue high started = True
#
# Create and define queue critical
#
create queue critical
set queue critical queue_type = Execution
set queue critical max_running = 4
set queue critical resources_max.ncpus = 4
set queue critical resources_max.nodes = 1
set queue critical resources_min.ncpus = 1
set queue critical resources_default.ncpus = 1
set queue critical resources_default.neednodes = 1:ppn=1
set queue critical resources_default.nodect = 1
set queue critical resources_default.nodes = 1
set queue critical enabled = True
set queue critical started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = redhat-test-02.wise.wa-k12.net
set server managers = sysadm@redhat-test-02.wise.wa-k12.net
set server operators = sysadm@redhat-test-02.wise.wa-k12.net
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = wem-lmgt-01.wise.wa-k12.net
set server next_job_number = 0
set server authorized_users = dbadm@wem-lmgt-01.wise.wa-k12.net
set server moab_array_compatible = True
set server nppcu = 1
Qmgr:
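
A dump like the one above can be sanity-checked with a small script. This is a rough sketch, assuming only the `create queue` and `set queue|server <attr> = <value>` line shapes seen in this dump; the function names and the particular checks are illustrative and are not part of Torque.

```python
import re

SET_RE = re.compile(r"^set\s+(?:queue\s+(\S+)|server)\s+(\S+)\s*=\s*(.*)$")
CREATE_RE = re.compile(r"^create\s+queue\s+(\S+)$")


def parse_qmgr_dump(text):
    """Return (server_attrs, {queue_name: attrs}) from `qmgr` print output."""
    server, queues = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        created = CREATE_RE.match(line)
        if created:
            queues.setdefault(created.group(1), {})
            continue
        match = SET_RE.match(line)
        if not match:
            continue  # ignore prompts and anything unrecognized
        queue_name, attr, value = match.groups()
        target = server if queue_name is None else queues.setdefault(queue_name, {})
        target[attr] = value.strip()
    return server, queues


def basic_checks(server, queues):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if server.get("scheduling") != "True":
        problems.append("server scheduling is not True")
    default = server.get("default_queue")
    if default not in queues:
        problems.append(f"default_queue {default!r} is not defined")
    else:
        for flag in ("enabled", "started"):
            if queues[default].get(flag) != "True":
                problems.append(f"queue {default} has {flag} != True")
    return problems


dump = """\
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch enabled = True
set queue batch started = True
set server scheduling = True
set server default_queue = batch
"""
server, queues = parse_qmgr_dump(dump)
problems = basic_checks(server, queues)
print(problems or "no problems found by these checks")
```

The excerpt used here passes these checks, which matches the impression that the queue and server settings look reasonable on these points. The checks say nothing about node state, so they do not rule out the "0 available" condition reported in the server log.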



Version-Release number of selected component (if applicable):

torque-scheduler-4.2.10-9.el7.x86_64

How reproducible:

Who knows, the packages are not configured to function correctly when installed.

Steps to Reproduce:

Just install the packages and try to make them work.

Comment 1 Kevin L. Esteb 2016-04-26 15:12:25 UTC
This is also happening on RHEL6. Did you forget to compile in a valid scheduler for pbs_sched in the last update?

Comment 2 David Brown 2016-04-26 16:10:37 UTC
Kevin,

Sorry you are having issues trying to schedule nodes. I'm not sure if this is related to the new NUMA support that I put in a few months ago. I've never run multiple queues before; my testing is primarily with a default queue setup, where MPI jobs can be scheduled and run just fine. It will take me some time to reproduce your scheduling environment, so in the meantime, have you run this issue by the mailing list yet?

Thanks,
- David Brown

Comment 3 Kevin L. Esteb 2016-04-26 16:26:39 UTC
No, the mailing list is not much help, just vague comments that pbs_sched is broken on 4.2 and vague statements that somebody, somehow got it working again.

This problem started with the last update from EPEL. Our RHEL5 boxes are working fine, but they weren't updated. We didn't change our configuration; we just updated the software and the scheduler stopped working.

Luckily our RHEL6/RHEL7 boxes don't currently use Torque in production. But we are doing a push to RHEL7 this summer and a broken Torque is not good. 

I noticed two things: pbs_sched stopped listening on the loopback device, and I had to use the '-l' switch with pbs_server to force it to communicate with pbs_sched. netstat shows the connections, but the scheduler doesn't seem to want to schedule.

Our RHEL6/RHEL7 boxes show this:

[root@wsipc-scm-01 Resource]# netstat -tapn | grep pbs
tcp        0      0 0.0.0.0:9501                0.0.0.0:*                   LISTEN      32891/pbs_server
tcp        0      0 0.0.0.0:9502                0.0.0.0:*                   LISTEN      33329/pbs_mom
tcp        0      0 0.0.0.0:9503                0.0.0.0:*                   LISTEN      33329/pbs_mom
tcp        0      0 10.1.254.181:9504           0.0.0.0:*                   LISTEN      15237/pbs_sched
tcp        1      0 10.1.254.181:843            10.1.254.181:9504           CLOSE_WAIT  32891/pbs_server

Our RHEL5 boxes show this:

[root@redhat-test-03 ~]# netstat -tapn | grep pbs
tcp        0      0 10.1.252.43:9504            0.0.0.0:*                   LISTEN      17168/pbs_sched
tcp        0      0 0.0.0.0:9501                0.0.0.0:*                   LISTEN      17121/pbs_server
tcp        0      0 0.0.0.0:9502                0.0.0.0:*                   LISTEN      17147/pbs_mom
tcp        0      0 0.0.0.0:9503                0.0.0.0:*                   LISTEN      17147/pbs_mom
[root@redhat-test-03 ~]#

Comment 4 Kevin L. Esteb 2016-06-06 23:15:49 UTC
Updating to v6.0.1 from the Adaptive Computing web site fixes these problems. I compiled the code using the provided spec file. After some minor configuration, it just worked.

Comment 5 Fedora Update System 2017-08-17 16:58:32 UTC
torque-4.2.10-11.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670

Comment 6 Fedora Update System 2017-08-18 20:23:30 UTC
torque-4.2.10-11.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670

