Description of problem:
Jobs that request multiple CPUs are confined to a single CPU via cpusets, even though the scheduler allocates multiple CPUs on the node.

Version-Release number of selected component (if applicable):
Torque 4.2.10-5.el6

How reproducible:
I can reproduce this problem by submitting a job that runs 'stress -c 4' with the '#PBS -l ncpus=4' directive. The job is dispatched to a node, but /dev/cpuset/torque/<jobid>/cpus contains '1' instead of '0-3', '0,2,4,6', or any other allocation of 4 processors. The 4 threads spawned by stress can be observed sharing one core and using 25% each.

Steps to Reproduce:
1. Submit a job requesting multiple CPUs
2. Observe the content of /dev/cpuset/torque/<jobid>/cpus to see which CPUs are given to the job
3. Observe process CPU usage limited to one core

Actual results:
Job requesting multiple CPUs gets placed in a cpuset with only 1 CPU

Expected results:
Job requesting multiple CPUs gets placed in a cpuset containing all of the CPUs it requested

Additional info:
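The check in step 2 can be scripted. The following is a minimal sketch, assuming the cpus file uses the usual Linux cpuset list notation (comma-separated CPU ids and inclusive ranges, as in the '0-3' and '0,2,4,6' examples above). The function names count_cpus and read_job_cpus are hypothetical helpers, not part of Torque.

```python
def count_cpus(cpu_list: str) -> int:
    """Count the CPUs named in a cpuset list such as '0-3' or '0,2,4,6'."""
    total = 0
    for part in cpu_list.strip().split(","):
        part = part.strip()
        if not part:
            continue  # an empty cpus file names no CPUs
        if "-" in part:
            # Inclusive range, e.g. '0-3' covers CPUs 0, 1, 2 and 3.
            first, last = part.split("-", 1)
            total += int(last) - int(first) + 1
        else:
            int(part)  # raises ValueError if the entry is not a CPU id
            total += 1
    return total


def read_job_cpus(job_id: str, root: str = "/dev/cpuset/torque") -> int:
    """Return how many CPUs the cpuset of the given job contains.

    The default root is the path quoted in this report; adjust it if the
    cpuset filesystem is mounted elsewhere.
    """
    with open(f"{root}/{job_id}/cpus") as handle:
        return count_cpus(handle.read())
```

For the job in this report the cpus file contained '1', so count_cpus gives 1 rather than the 4 CPUs requested with ncpus=4.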
torque-4.2.10-8.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe
torque-4.2.10-8.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94
torque-4.2.10-8.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd
torque-4.2.10-8.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8
torque-4.2.10-8.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e
torque-4.2.10-8.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with $ su -c 'dnf --enablerepo=updates-testing update torque' You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94
torque-4.2.10-8.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with $ su -c 'dnf --enablerepo=updates-testing update torque' You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe
torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with $ su -c 'yum --enablerepo=epel-testing update torque' You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8
torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with $ su -c 'yum --enablerepo=epel-testing update torque' You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd
torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with $ su -c 'yum --enablerepo=epel-testing update torque' You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e
I installed the latest version from the EPEL testing repo; however, I do not yet know if it solved my problem, because I have a different problem now. It seems I am running into an error because Torque has been built with NUMA support, as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1231148. I don't have NUMA systems as far as I know, and I am still looking for a way to configure my system to work with this build of Torque despite that. Any insight would be appreciated. I will update when I know more.
Yeah I had to make a mom.layout file in the mom_priv directory. It just has the following in it:

nodes=0

and that's it.

Thanks,
- David Brown
Ok, I added "nodes=0" to the mom.layout file, and "num_node_boards=1 numa_board_str=4" to the node properties. PBS mom now starts and Maui recognizes the one node with 4 processors; however, cpuset is still not working correctly. This time, it seems torque does not detect any cpus on the node. /dev/cpuset/torque/cpus is empty, as is the cpus file for any job. Before, /dev/cpuset/torque/cpus contained '0-3' and torque would only assign one cpu to any given job regardless of what it requested.

Mom logs:

12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;Log;Log opened
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 0
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;setpbsserver;torque
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;mom_server_add;server torque added
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;setloglevel;7
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;setup_program_environment;machine topology contains 0 memory nodes, 4 cpus
12/09/2015 10:51:17;0002; pbs_mom.3412;node;read_layout_file;nodeboard 0: 1 NUMA nodes: 0
12/09/2015 10:51:17;0002; pbs_mom.3412;node;read_layout_file;Setting up this mom to function as 1 numa nodes
12/09/2015 10:51:17;0002; pbs_mom.3412;node;setup_nodeboards;nodeboard 0: 0 cpus (), 1 mems (0)
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;init_torque_cpuset;Init cpuset /dev/cpuset/torque
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;init_torque_cpuset;setting cpus =
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;init_torque_cpuset;setting mems = 0
12/09/2015 10:51:17;0002; pbs_mom.3413;n/a;initialize;independent
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;dep_initialize;mom is now oom-killer safe
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/09/2015 10:51:17;0002; pbs_mom.3413;n/a;mom_open_poll;started
12/09/2015 10:51:17;0080; pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080; pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0080; pbs_mom.3413;Svr;pbs_mom;before init_abort_jobs
12/09/2015 10:51:17;0001; pbs_mom.3413;Svr;pbs_mom;init_abort_jobs: recover=2
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;pbs_mom;Is up
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1449454440
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 7
12/09/2015 10:51:17;0002; pbs_mom.3413;n/a;mom_server_all_update_stat;composing status update for server
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;nusers;nusers=0
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;totmem;totmem: total mem=4294438912
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;availmem;availmem: free mem=3807641600
12/09/2015 10:51:17;0002; pbs_mom.3417;node;ncpus;ncpus=0
12/09/2015 10:51:17;0002; pbs_mom.3417;node;cpuact;cpuact=0.00
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "numa0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux torque 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10 18:01:38 UTC 2015 x86_64"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "idletime=66466"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "totmem=4193788kb"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "availmem=3718400kb"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "physmem=4193788kb"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "loadave=0.00"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "gres="
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "netload=? 0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs= "
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
12/09/2015 10:51:17;0008; pbs_mom.3417;Job;read_tcp_reply;protocol: 4 version: 3 command:4 sock:9
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;status update successfully sent to torque
12/09/2015 10:51:17;0008; pbs_mom.3413;Job;scan_for_terminated;entered
12/09/2015 10:51:17;0080; pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080; pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0008; pbs_mom.3413;Job;scan_for_terminated;pid 3417 not tracked, statloc=0, exitval=0
12/09/2015 10:51:17;0008; pbs_mom.3413;Job;tcp_request;tcp_request: fd 9 addr 127.0.0.1:522
12/09/2015 10:51:17;0001; pbs_mom.3413;Job;mom_is_request;stream 9 version 3
12/09/2015 10:51:17;0001; pbs_mom.3413;Job;mom_is_request;command 2, "CLUSTER_ADDRS", received
12/09/2015 10:51:17;0002; pbs_mom.3413;node;read_cluster_addresses;Successfully received the mom hierarchy file. My okclients list is '127.0.0.1:0,127.0.0.1:15003', and the hierarchy file is ''

This is all on one test machine running Maui, PBS Server, and PBS Mom.
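The mismatch in the log above (the topology line reports 4 cpus, while the ncpus entry reports 0) can be picked out automatically. This is only a sketch: the semicolon-separated field layout and the field names used here are inferred from the excerpt in this comment, not taken from Torque documentation, and MomLogEntry, parse_entry and cpu_counts are hypothetical names.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple


class MomLogEntry(NamedTuple):
    # Field names are guesses based on the excerpt in this report.
    timestamp: str
    event: str
    process: str
    kind: str
    function: str
    message: str


def parse_entry(line: str) -> Optional[MomLogEntry]:
    """Split one pbs_mom log line into its six ';'-separated fields."""
    fields = line.rstrip("\n").split(";", 5)
    if len(fields) != 6:
        return None  # not a log entry in the expected layout
    return MomLogEntry(*(field.strip() for field in fields))


def cpu_counts(lines: Iterable[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (cpus seen in the machine topology, ncpus value logged by the mom)."""
    detected = reported = None
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        if entry.function == "setup_program_environment":
            match = re.search(r"(\d+) cpus", entry.message)
            if match:
                detected = int(match.group(1))
        elif entry.function == "ncpus":
            match = re.fullmatch(r"ncpus=(\d+)", entry.message)
            if match:
                reported = int(match.group(1))
    return detected, reported
```

Run over the log in this comment, the two values differ, which matches the "nodeboard 0: 0 cpus ()" line: the machine has 4 cpus, but none of them are assigned to the single node board.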
torque-4.2.10-8.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
This bug was not fixed in the latest version. If an error in my configuration is causing the bug to persist, I would appreciate some insight into what that might be. It seems my comment was disregarded before this version was pushed to the stable repository.
Steven,

Have you tried contacting the torque users mailing list about this? [torqueusers]

I've updated the package to build using the numa and cpuset configure arguments, which should allow for scheduling of multiple CPUs on a node. I'm not sure about the integration with Maui as I've only done simple torque configurations to verify torque can run.

Thanks,
- David Brown
David,

Thank you for the info, I will contact the mailing list. The problem might be with the scheduler in my setup. It might help me better isolate this if I test Torque without the Maui scheduler. You mentioned that you use simple Torque configurations to verify it can run. Would you mind sharing what this configuration is? Are you testing it with a different scheduler than Maui? Or is there a way to use Torque without a scheduler?

Thank you for your help,
Steve
Steve,

I don't know how familiar you are with Chef, but I've been using that to deploy test clusters of torque. My chef cookbook is https://github.com/dmlb2000/torque-cookbook

I thought torque just uses a first in, first out form of scheduling without the use of Maui or Moab, but there may be more.

Thanks,
- David Brown
torque-4.2.10-8.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
torque-4.2.10-9.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4
torque-4.2.10-9.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054
torque-4.2.10-9.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca
torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca
torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054
torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4
torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.
torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.
torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.