Bug 1279565 - Torque will not allocate more than 1 CPU in a cpuset
Status: CLOSED ERRATA
Product: Fedora EPEL
Classification: Fedora
Component: torque
el6
x86_64 Linux
unspecified Severity unspecified
Assigned To: David Brown
Fedora Extras Quality Assurance
: Reopened
Depends On:
Blocks:
Reported: 2015-11-09 13:31 EST by Steven Ford
Modified: 2016-03-16 09:31 EDT
5 users

See Also:
Fixed In Version: torque-4.2.10-8.fc23 torque-4.2.10-8.fc22 torque-4.2.10-8.el5 torque-4.2.10-8.el7 torque-4.2.10-8.el6 torque-4.2.10-9.el5 torque-4.2.10-9.el7 torque-4.2.10-9.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-23 17:30:49 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Steven Ford 2015-11-09 13:31:28 EST
Description of problem:

Jobs that request multiple CPUs are limited to one CPU via cpusets on their assigned node, despite the scheduler allocating multiple CPUs.

Version-Release number of selected component (if applicable):
Torque 4.2.10-5.el6

How reproducible:
I can reproduce this problem by submitting a job that runs 'stress -c 4' with the '#PBS -l ncpus=4' directive. The job is dispatched to a node, but /dev/cpuset/torque/<jobid>/cpus contains '1' instead of '0-3', '0,2,4,6', or any other allocation of 4 processors. The 4 threads spawned by stress can be observed sharing one core, each using 25% of it.


Steps to Reproduce:
1. Submit a job requesting multiple CPUs
2. Check the content of /dev/cpuset/torque/<jobid>/cpus to see which CPUs were assigned to the job
3. Observe that process CPU usage is limited to one core
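The check in step 2 can be automated. A minimal sketch (the file path and the 4-CPU request come from the reproduction above; the helper name is mine) that expands the cpuset 'cpus' notation and compares it against the request:

```python
def expand_cpuset(spec):
    """Expand a cpuset 'cpus' string such as '0-3' or '0,2,4,6'
    into a sorted list of CPU ids."""
    cpus = []
    spec = spec.strip()
    if not spec:
        return cpus
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

# For a job submitted with '#PBS -l ncpus=4', a correct allocation
# expands to 4 CPUs; the buggy behavior reported here yields only 1.
assert len(expand_cpuset("0-3")) == 4        # correct: 4 CPUs
assert len(expand_cpuset("0,2,4,6")) == 4    # correct: 4 CPUs
assert len(expand_cpuset("1")) == 1          # the buggy allocation
```

In practice the `spec` string would be read from /dev/cpuset/torque/<jobid>/cpus on the execution node.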

Actual results:
Job requesting multiple CPUs is placed in a cpuset containing only 1 CPU

Expected results:
Job requesting multiple CPUs is placed in a cpuset containing all of the requested CPUs (e.g., 4 CPUs for 'ncpus=4')

Additional info:
Comment 1 Fedora Update System 2015-12-06 21:05:50 EST
torque-4.2.10-8.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe
Comment 2 Fedora Update System 2015-12-06 21:06:17 EST
torque-4.2.10-8.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94
Comment 3 Fedora Update System 2015-12-06 21:06:46 EST
torque-4.2.10-8.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd
Comment 4 Fedora Update System 2015-12-06 21:18:02 EST
torque-4.2.10-8.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8
Comment 5 Fedora Update System 2015-12-06 21:18:25 EST
torque-4.2.10-8.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e
Comment 6 Fedora Update System 2015-12-07 18:22:12 EST
torque-4.2.10-8.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94
Comment 7 Fedora Update System 2015-12-07 23:32:03 EST
torque-4.2.10-8.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe
Comment 8 Fedora Update System 2015-12-08 02:19:51 EST
torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8
Comment 9 Fedora Update System 2015-12-08 02:21:35 EST
torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd
Comment 10 Fedora Update System 2015-12-08 02:47:50 EST
torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e
Comment 11 Steven Ford 2015-12-08 15:51:42 EST
I installed the latest version from the EPEL testing repo; however, I do not yet know whether it solved my problem.

I have a different problem now. It seems I am running into an error because of torque having been built for NUMA architecture, as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1231148.

I don't have NUMA systems as far as I know, and I am still looking for a way to configure my system to work with this build of Torque despite that.

Any insight would be appreciated.

I will update when I know more.
Comment 12 David Brown 2015-12-08 16:05:16 EST
Yeah I had to make a mom.layout file in the mom_priv directory.

It just has the following in it:

nodes=0

and that's it.

Thanks,
- David Brown
Comment 13 Steven Ford 2015-12-09 11:02:09 EST
Ok, I added "nodes=0" to the mom.layout file, and "num_node_boards=1 numa_board_str=4" to the node properties. PBS mom now starts, and Maui recognizes the one node with 4 processors; however, cpusets are still not working correctly. This time, it seems torque does not detect any CPUs on the node. /dev/cpuset/torque/cpus is empty, as is the cpus file for any job. Before, /dev/cpuset/torque/cpus contained '0-3' and would only assign one CPU to any given job regardless of what it requested.
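For reference, the node definition described above would live in TORQUE's server_priv/nodes file; a sketch with a hypothetical hostname:

```
node01 np=4 num_node_boards=1 numa_board_str=4
```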

Mom logs:

12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;Log;Log opened
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 0
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setpbsserver;torque
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;mom_server_add;server torque added
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setloglevel;7
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setup_program_environment;machine topology contains 0 memory nodes, 4 cpus
12/09/2015 10:51:17;0002;   pbs_mom.3412;node;read_layout_file;nodeboard  0: 1 NUMA nodes: 0
12/09/2015 10:51:17;0002;   pbs_mom.3412;node;read_layout_file;Setting up this mom to function as 1 numa nodes
12/09/2015 10:51:17;0002;   pbs_mom.3412;node;setup_nodeboards;nodeboard  0: 0 cpus (), 1 mems (0)
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;init_torque_cpuset;Init cpuset /dev/cpuset/torque
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;init_torque_cpuset;setting cpus =
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;init_torque_cpuset;setting mems = 0
12/09/2015 10:51:17;0002;   pbs_mom.3413;n/a;initialize;independent
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;dep_initialize;mom is now oom-killer safe
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/09/2015 10:51:17;0002;   pbs_mom.3413;n/a;mom_open_poll;started
12/09/2015 10:51:17;0080;   pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080;   pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0080;   pbs_mom.3413;Svr;pbs_mom;before init_abort_jobs
12/09/2015 10:51:17;0001;   pbs_mom.3413;Svr;pbs_mom;init_abort_jobs: recover=2
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;pbs_mom;Is up
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1449454440
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 7
12/09/2015 10:51:17;0002;   pbs_mom.3413;n/a;mom_server_all_update_stat;composing status update for server
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;nusers;nusers=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;totmem;totmem: total mem=4294438912
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;availmem;availmem: free mem=3807641600
12/09/2015 10:51:17;0002;   pbs_mom.3417;node;ncpus;ncpus=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;node;cpuact;cpuact=0.00
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "numa0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux torque 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10 18:01:38 UTC 2015 x86_64"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "idletime=66466"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "totmem=4193788kb"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "availmem=3718400kb"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "physmem=4193788kb"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "loadave=0.00"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "gres="
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "netload=? 0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs= "
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
12/09/2015 10:51:17;0008;   pbs_mom.3417;Job;read_tcp_reply;protocol: 4  version: 3  command:4  sock:9
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;status update successfully sent to torque
12/09/2015 10:51:17;0008;   pbs_mom.3413;Job;scan_for_terminated;entered
12/09/2015 10:51:17;0080;   pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080;   pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0008;   pbs_mom.3413;Job;scan_for_terminated;pid 3417 not tracked, statloc=0, exitval=0
12/09/2015 10:51:17;0008;   pbs_mom.3413;Job;tcp_request;tcp_request: fd 9 addr 127.0.0.1:522
12/09/2015 10:51:17;0001;   pbs_mom.3413;Job;mom_is_request;stream 9 version 3
12/09/2015 10:51:17;0001;   pbs_mom.3413;Job;mom_is_request;command 2, "CLUSTER_ADDRS", received
12/09/2015 10:51:17;0002;   pbs_mom.3413;node;read_cluster_addresses;Successfully received the mom hierarchy file. My okclients list is '127.0.0.1:0,127.0.0.1:15003', and the hierarchy file is ''


This is all on one test machine running Maui, PBS Server, and PBS Mom.
Comment 14 Fedora Update System 2015-12-14 05:21:45 EST
torque-4.2.10-8.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
Comment 15 Steven Ford 2015-12-22 09:41:14 EST
This bug was not fixed in the latest version.

If an error was made in my configuration that is causing the bug to persist, I would appreciate some insight into what that might be. It seems my comment was disregarded before this version was pushed to the stable repository.
Comment 16 David Brown 2015-12-22 13:51:47 EST
Steven,

Have you tried contacting the torque users mailing list about this?

[torqueusers@supercluster.org]

I've updated the package to build using the numa and cpuset configure arguments, which should allow scheduling of multiple CPUs on a node. I'm not sure about the integration with Maui, as I've only done simple torque configurations to verify that torque can run.

Thanks,
- David Brown
Comment 17 Steven Ford 2015-12-22 14:33:41 EST
David,

Thank you for the info; I will contact the mailing list. The problem might be with the scheduler in my setup, so it might help me isolate this if I test Torque without the Maui scheduler. You mentioned that you use simple Torque configurations to verify it can run; would you mind sharing what that configuration is? Are you testing with a different scheduler than Maui, or is there a way to use Torque without a scheduler?

Thank you for your help,

Steve
Comment 18 David Brown 2015-12-22 16:55:17 EST
Steve,

I don't know how familiar you are with Chef, but I've been using it to deploy test clusters of torque. My Chef cookbook is at https://github.com/dmlb2000/torque-cookbook

I thought torque just uses a first-in, first-out form of scheduling without Maui or Moab, but there may be more to it.

Thanks,
- David Brown
Comment 19 Fedora Update System 2015-12-26 16:52:09 EST
torque-4.2.10-8.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
Comment 20 Fedora Update System 2016-02-21 03:09:37 EST
torque-4.2.10-9.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4
Comment 21 Fedora Update System 2016-02-21 03:10:05 EST
torque-4.2.10-9.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054
Comment 22 Fedora Update System 2016-02-21 03:10:32 EST
torque-4.2.10-9.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca
Comment 23 Fedora Update System 2016-02-21 22:48:28 EST
torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca
Comment 24 Fedora Update System 2016-02-21 22:48:37 EST
torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054
Comment 25 Fedora Update System 2016-02-21 23:51:03 EST
torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4
Comment 26 Fedora Update System 2016-02-23 17:30:40 EST
torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
Comment 27 Fedora Update System 2016-02-24 12:59:50 EST
torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.
Comment 28 Fedora Update System 2016-02-24 13:54:42 EST
torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
Comment 29 Fedora Update System 2016-03-15 18:01:40 EDT
torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
Comment 30 Fedora Update System 2016-03-15 18:47:49 EDT
torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.
Comment 31 Fedora Update System 2016-03-16 09:30:42 EDT
torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
