Bug 1279565 - Torque will not allocate more than 1 CPU in a cpuset
Summary: Torque will not allocate more than 1 CPU in a cpuset
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: torque
Version: el6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: David Brown
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-11-09 18:31 UTC by Steven Ford
Modified: 2016-03-16 13:31 UTC (History)

Fixed In Version: torque-4.2.10-8.fc23 torque-4.2.10-8.fc22 torque-4.2.10-8.el5 torque-4.2.10-8.el7 torque-4.2.10-8.el6 torque-4.2.10-9.el5 torque-4.2.10-9.el7 torque-4.2.10-9.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-23 22:30:49 UTC
Type: Bug



Description Steven Ford 2015-11-09 18:31:28 UTC
Description of problem:

Jobs that request multiple CPUs are confined to a single CPU by their cpuset, even though the scheduler allocates multiple CPUs to them.

Version-Release number of selected component (if applicable):
Torque 4.2.10-5.el6

How reproducible:
I can reproduce this problem by submitting a job that runs 'stress -c 4' with the '#PBS -l ncpus=4' directive. The job is submitted to a node, but /dev/cpuset/torque/<jobid>/cpus contains '1' instead of '0-3', '0,2,4,6', or any other allocation of 4 processors. The 4 threads spawned by stress can be observed sharing one core, each using 25% of it.


Steps to Reproduce:
1. Submit job requesting multiple CPUs
2. Observe content of /dev/cpuset/torque/<jobid>/cpus to see what CPUs are given to the job
3. Observe process CPU usage limited to one core
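
The check in step 2 can be scripted. The following is a minimal sketch in Python, assuming the cpus file uses the usual CPU-list syntax of comma-separated numbers and ranges (as in the '0-3' and '0,2,4,6' examples above). The function names parse_cpu_list and cpus_for_job are hypothetical helpers for illustration and are not part of Torque.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a CPU list such as '0-3' or '0,2,4,6' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # an empty file means no CPUs are assigned
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_for_job(jobid, root="/dev/cpuset/torque"):
    """Read the cpus file of a job's cpuset (path layout as described in this report)."""
    return parse_cpu_list((Path(root) / jobid / "cpus").read_text())


if __name__ == "__main__":
    # With the bug present, a job that requested 4 CPUs ends up with one.
    print(len(parse_cpu_list("1")))        # 1
    print(len(parse_cpu_list("0-3")))      # 4
    print(len(parse_cpu_list("0,2,4,6")))  # 4
```

Comparing len(cpus_for_job(jobid)) against the requested ncpus value shows whether a job received the CPUs it asked for.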

Actual results:
Job requesting multiple CPUs gets placed in a cpuset with only 1 CPU

Expected results:
Job requesting multiple CPUs gets placed in a cpuset containing the number of CPUs it requested

Additional info:

Comment 1 Fedora Update System 2015-12-07 02:05:50 UTC
torque-4.2.10-8.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe

Comment 2 Fedora Update System 2015-12-07 02:06:17 UTC
torque-4.2.10-8.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94

Comment 3 Fedora Update System 2015-12-07 02:06:46 UTC
torque-4.2.10-8.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd

Comment 4 Fedora Update System 2015-12-07 02:18:02 UTC
torque-4.2.10-8.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8

Comment 5 Fedora Update System 2015-12-07 02:18:25 UTC
torque-4.2.10-8.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e

Comment 6 Fedora Update System 2015-12-07 23:22:12 UTC
torque-4.2.10-8.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94

Comment 7 Fedora Update System 2015-12-08 04:32:03 UTC
torque-4.2.10-8.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe

Comment 8 Fedora Update System 2015-12-08 07:19:51 UTC
torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8

Comment 9 Fedora Update System 2015-12-08 07:21:35 UTC
torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd

Comment 10 Fedora Update System 2015-12-08 07:47:50 UTC
torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e

Comment 11 Steven Ford 2015-12-08 20:51:42 UTC
I installed the latest version from the EPEL testing repo; however, I do not yet know if it solved my problem.

I have a different problem now. It seems I am running into an error because Torque has been built for NUMA architecture, as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1231148.

I don't have NUMA systems as far as I know, and I am still looking for a way to configure my system to work with this build of Torque despite that.

Any insight would be appreciated.

I will update when I know more.

Comment 12 David Brown 2015-12-08 21:05:16 UTC
Yeah I had to make a mom.layout file in the mom_priv directory.

It just has the following in it:

nodes=0

and that's it.

Thanks,
- David Brown

Comment 13 Steven Ford 2015-12-09 16:02:09 UTC
OK, I added "nodes=0" to the mom.layout file and "num_node_boards=1 numa_board_str=4" to the node properties. PBS mom now starts and Maui recognizes the one node with 4 processors, but cpusets are still not working correctly. This time, it seems Torque does not detect any CPUs on the node: /dev/cpuset/torque/cpus is empty, as is the cpus file for every job. Previously, /dev/cpuset/torque/cpus contained '0-3' and any given job was assigned only one CPU regardless of what it requested.

Mom logs:

12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;Log;Log opened
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 0
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setpbsserver;torque
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;mom_server_add;server torque added
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setloglevel;7
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setup_program_environment;machine topology contains 0 memory nodes, 4 cpus
12/09/2015 10:51:17;0002;   pbs_mom.3412;node;read_layout_file;nodeboard  0: 1 NUMA nodes: 0
12/09/2015 10:51:17;0002;   pbs_mom.3412;node;read_layout_file;Setting up this mom to function as 1 numa nodes
12/09/2015 10:51:17;0002;   pbs_mom.3412;node;setup_nodeboards;nodeboard  0: 0 cpus (), 1 mems (0)
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;init_torque_cpuset;Init cpuset /dev/cpuset/torque
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;init_torque_cpuset;setting cpus =
12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;init_torque_cpuset;setting mems = 0
12/09/2015 10:51:17;0002;   pbs_mom.3413;n/a;initialize;independent
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;dep_initialize;mom is now oom-killer safe
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/09/2015 10:51:17;0002;   pbs_mom.3413;n/a;mom_open_poll;started
12/09/2015 10:51:17;0080;   pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080;   pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0080;   pbs_mom.3413;Svr;pbs_mom;before init_abort_jobs
12/09/2015 10:51:17;0001;   pbs_mom.3413;Svr;pbs_mom;init_abort_jobs: recover=2
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;pbs_mom;Is up
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1449454440
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 7
12/09/2015 10:51:17;0002;   pbs_mom.3413;n/a;mom_server_all_update_stat;composing status update for server
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;nusers;nusers=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;totmem;totmem: total mem=4294438912
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;availmem;availmem: free mem=3807641600
12/09/2015 10:51:17;0002;   pbs_mom.3417;node;ncpus;ncpus=0
12/09/2015 10:51:17;0002;   pbs_mom.3417;node;cpuact;cpuact=0.00
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "numa0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux torque 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10 18:01:38 UTC 2015 x86_64"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "idletime=66466"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "totmem=4193788kb"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "availmem=3718400kb"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "physmem=4193788kb"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "loadave=0.00"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "gres="
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "netload=? 0"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs= "
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
12/09/2015 10:51:17;0008;   pbs_mom.3417;Job;read_tcp_reply;protocol: 4  version: 3  command:4  sock:9
12/09/2015 10:51:17;0002;   pbs_mom.3417;n/a;mom_server_update_stat;status update successfully sent to torque
12/09/2015 10:51:17;0008;   pbs_mom.3413;Job;scan_for_terminated;entered
12/09/2015 10:51:17;0080;   pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002;   pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080;   pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0008;   pbs_mom.3413;Job;scan_for_terminated;pid 3417 not tracked, statloc=0, exitval=0
12/09/2015 10:51:17;0008;   pbs_mom.3413;Job;tcp_request;tcp_request: fd 9 addr 127.0.0.1:522
12/09/2015 10:51:17;0001;   pbs_mom.3413;Job;mom_is_request;stream 9 version 3
12/09/2015 10:51:17;0001;   pbs_mom.3413;Job;mom_is_request;command 2, "CLUSTER_ADDRS", received
12/09/2015 10:51:17;0002;   pbs_mom.3413;node;read_cluster_addresses;Successfully received the mom hierarchy file. My okclients list is '127.0.0.1:0,127.0.0.1:15003', and the hierarchy file is ''


This is all on one test machine running Maui, PBS Server, and PBS Mom.
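
The relevant entries can be pulled out of a log like the one above with a short script. This is a minimal sketch in Python, assuming each record has six semicolon-separated fields as in the excerpt. The field names (timestamp, code, process, kind, source, message) and the function names are labels chosen here for illustration, not Torque terminology.

```python
from typing import NamedTuple


class MomLogLine(NamedTuple):
    timestamp: str
    code: str
    process: str
    kind: str
    source: str
    message: str


def parse_mom_log_line(line):
    """Split one log record into its six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogLine(*(p.strip() for p in parts))


def reported_ncpus(lines):
    """Collect every value logged by the 'ncpus' entries, in order of appearance."""
    values = []
    for line in lines:
        entry = parse_mom_log_line(line)
        if entry and entry.source == "ncpus" and entry.message.startswith("ncpus="):
            values.append(int(entry.message.split("=", 1)[1]))
    return values


if __name__ == "__main__":
    # Two records copied from the excerpt above.
    sample = [
        "12/09/2015 10:51:17;0002;   pbs_mom.3412;Svr;setup_program_environment;"
        "machine topology contains 0 memory nodes, 4 cpus",
        "12/09/2015 10:51:17;0002;   pbs_mom.3417;node;ncpus;ncpus=0",
    ]
    print(reported_ncpus(sample))  # [0]
```

On the excerpt above this returns 0 for ncpus, even though the topology line from the same startup reports 4 cpus.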

Comment 14 Fedora Update System 2015-12-14 10:21:45 UTC
torque-4.2.10-8.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 15 Steven Ford 2015-12-22 14:41:14 UTC
This bug was not fixed in the latest version.

If an error in my configuration is causing the bug to persist, I would appreciate some insight into what that might be. It seems my comment was disregarded before this version was pushed to the stable repository.

Comment 16 David Brown 2015-12-22 18:51:47 UTC
Steven,

Have you tried contacting the torque users mailing list about this?

[torqueusers@supercluster.org]

I've updated the package to build using the numa and cpuset configure arguments, which should allow for scheduling of multiple CPUs on a node. I'm not sure about the integration with Maui, as I've only done simple Torque configurations to verify that Torque can run.

Thanks,
- David Brown

Comment 17 Steven Ford 2015-12-22 19:33:41 UTC
David,

Thank you for the info; I will contact the mailing list. The problem might be with the scheduler in my setup, and testing Torque without the Maui scheduler might help me isolate it. You mentioned that you use simple Torque configurations to verify it can run; would you mind sharing what that configuration is? Are you testing with a scheduler other than Maui? Or is there a way to use Torque without a scheduler?

Thank you for your help,

Steve

Comment 18 David Brown 2015-12-22 21:55:17 UTC
Steve,

I don't know how familiar you are with Chef but I've been using that to deploy test clusters of torque. My chef cookbook is https://github.com/dmlb2000/torque-cookbook

I thought Torque just uses first-in, first-out scheduling when Maui or Moab is not used, but there may be more to it.

Thanks,
- David Brown

Comment 19 Fedora Update System 2015-12-26 21:52:09 UTC
torque-4.2.10-8.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2016-02-21 08:09:37 UTC
torque-4.2.10-9.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4

Comment 21 Fedora Update System 2016-02-21 08:10:05 UTC
torque-4.2.10-9.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054

Comment 22 Fedora Update System 2016-02-21 08:10:32 UTC
torque-4.2.10-9.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca

Comment 23 Fedora Update System 2016-02-22 03:48:28 UTC
torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca

Comment 24 Fedora Update System 2016-02-22 03:48:37 UTC
torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054

Comment 25 Fedora Update System 2016-02-22 04:51:03 UTC
torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4

Comment 26 Fedora Update System 2016-02-23 22:30:40 UTC
torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

Comment 27 Fedora Update System 2016-02-24 17:59:50 UTC
torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.

Comment 28 Fedora Update System 2016-02-24 18:54:42 UTC
torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.

Comment 29 Fedora Update System 2016-03-15 22:01:40 UTC
torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

Comment 30 Fedora Update System 2016-03-15 22:47:49 UTC
torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.

Comment 31 Fedora Update System 2016-03-16 13:30:42 UTC
torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.

