Bug 1279565
| Summary: | Torque will not allocate more than 1 CPU in a cpuset | | |
|---|---|---|---|
| Product: | [Fedora] Fedora EPEL | Reporter: | Steven Ford <sford123> |
| Component: | torque | Assignee: | David Brown <david.brown> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | el6 | CC: | david.brown, fotis, garrick, karlthered, sford123 |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | torque-4.2.10-8.fc23 torque-4.2.10-8.fc22 torque-4.2.10-8.el5 torque-4.2.10-8.el7 torque-4.2.10-8.el6 torque-4.2.10-9.el5 torque-4.2.10-9.el7 torque-4.2.10-9.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-23 22:30:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Steven Ford
2015-11-09 18:31:28 UTC
torque-4.2.10-8.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe

torque-4.2.10-8.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94

torque-4.2.10-8.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd

torque-4.2.10-8.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8

torque-4.2.10-8.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e

torque-4.2.10-8.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-fed8081a94

torque-4.2.10-8.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-7fab0e17fe

torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-1855821fe8

torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-291545b6fd

torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
$ su -c 'yum --enablerepo=epel-testing update torque'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-22cbfeb20e

I installed the latest version from the EPEL testing repo; however, I do not yet know if it solved my problem. I have a different problem now. It seems I am running into an error because of torque having been built for NUMA architecture, as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1231148. I don't have NUMA systems as far as I know, and I am still looking for a way to configure my system to work with this build of Torque despite that. Any insight would be appreciated. I will update when I know more.

Yeah I had to make a mom.layout file in the mom_priv directory. It just has the following in it:
nodes=0
and that's it.
Thanks,
- David Brown

Ok, I added "nodes=0" to the mom.layout file, and "num_node_boards=1 numa_board_str=4" to the node properties. PBS mom now starts and Maui recognizes the one node with 4 processors; however, cpuset is still not working correctly. This time, it seems torque does not detect any cpus on the node. /dev/cpuset/torque/cpus is empty, as is the cpus file for any job. Before, /dev/cpuset/torque/cpus contained '0-3' and would only assign one cpu to any given job regardless of what it requested.
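For anyone reproducing this, the cpuset state described above (an empty cpus file now, '0-3' before) can be checked with a short script. The following is a minimal Python sketch and not part of Torque: `parse_cpulist` and `read_cpuset_cpus` are hypothetical helper names, the only path assumed is /dev/cpuset/torque as quoted in this report, and the file is assumed to hold a comma-separated list of CPU numbers and ranges such as 0-3.

```python
import os


def parse_cpulist(text):
    """Expand a CPU list string such as '0-3' or '0,2-3' into a list of ints.

    An empty or whitespace-only string yields an empty list, which is the
    state reported in this bug after the NUMA-enabled build was installed.
    """
    cpus = []
    for chunk in text.strip().split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. from an empty file
        if "-" in chunk:
            start, end = chunk.split("-", 1)
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(chunk))
    return cpus


def read_cpuset_cpus(cpuset_dir="/dev/cpuset/torque"):
    """Read and parse the 'cpus' file of a cpuset directory.

    The default directory is the one named in this report; pass a different
    directory to inspect another cpuset.
    """
    with open(os.path.join(cpuset_dir, "cpus")) as handle:
        return parse_cpulist(handle.read())
```

Calling `read_cpuset_cpus()` on the affected node would be expected to return an empty list for the state reported here, and `[0, 1, 2, 3]` for the earlier '0-3' state.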
Mom logs:
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;Log;Log opened
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 0
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;setpbsserver;torque
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;mom_server_add;server torque added
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;setloglevel;7
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;setup_program_environment;machine topology contains 0 memory nodes, 4 cpus
12/09/2015 10:51:17;0002; pbs_mom.3412;node;read_layout_file;nodeboard 0: 1 NUMA nodes: 0
12/09/2015 10:51:17;0002; pbs_mom.3412;node;read_layout_file;Setting up this mom to function as 1 numa nodes
12/09/2015 10:51:17;0002; pbs_mom.3412;node;setup_nodeboards;nodeboard 0: 0 cpus (), 1 mems (0)
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;init_torque_cpuset;Init cpuset /dev/cpuset/torque
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;init_torque_cpuset;setting cpus =
12/09/2015 10:51:17;0002; pbs_mom.3412;Svr;init_torque_cpuset;setting mems = 0
12/09/2015 10:51:17;0002; pbs_mom.3413;n/a;initialize;independent
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;dep_initialize;mom is now oom-killer safe
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/09/2015 10:51:17;0002; pbs_mom.3413;n/a;mom_open_poll;started
12/09/2015 10:51:17;0080; pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080; pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0080; pbs_mom.3413;Svr;pbs_mom;before init_abort_jobs
12/09/2015 10:51:17;0001; pbs_mom.3413;Svr;pbs_mom;init_abort_jobs: recover=2
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;pbs_mom;Is up
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1449454440
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;pbs_mom;Torque Mom Version = 4.2.10, loglevel = 7
12/09/2015 10:51:17;0002; pbs_mom.3413;n/a;mom_server_all_update_stat;composing status update for server
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;sessions;nsessions=0
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;nusers;nusers=0
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;totmem;totmem: total mem=4294438912
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;availmem;availmem: free mem=3807641600
12/09/2015 10:51:17;0002; pbs_mom.3417;node;ncpus;ncpus=0
12/09/2015 10:51:17;0002; pbs_mom.3417;node;cpuact;cpuact=0.00
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "numa0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux torque 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10 18:01:38 UTC 2015 x86_64"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "idletime=66466"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "totmem=4193788kb"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "availmem=3718400kb"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "physmem=4193788kb"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "loadave=0.00"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "gres="
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "netload=? 0"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs= "
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
12/09/2015 10:51:17;0008; pbs_mom.3417;Job;read_tcp_reply;protocol: 4 version: 3 command:4 sock:9
12/09/2015 10:51:17;0002; pbs_mom.3417;n/a;mom_server_update_stat;status update successfully sent to torque
12/09/2015 10:51:17;0008; pbs_mom.3413;Job;scan_for_terminated;entered
12/09/2015 10:51:17;0080; pbs_mom.3413;Svr;mom_get_sample;proc_array load started
12/09/2015 10:51:17;0002; pbs_mom.3413;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
12/09/2015 10:51:17;0080; pbs_mom.3413;n/a;mom_get_sample;proc_array loaded - nproc=0
12/09/2015 10:51:17;0008; pbs_mom.3413;Job;scan_for_terminated;pid 3417 not tracked, statloc=0, exitval=0
12/09/2015 10:51:17;0008; pbs_mom.3413;Job;tcp_request;tcp_request: fd 9 addr 127.0.0.1:522
12/09/2015 10:51:17;0001; pbs_mom.3413;Job;mom_is_request;stream 9 version 3
12/09/2015 10:51:17;0001; pbs_mom.3413;Job;mom_is_request;command 2, "CLUSTER_ADDRS", received
12/09/2015 10:51:17;0002; pbs_mom.3413;node;read_cluster_addresses;Successfully received the mom hierarchy file. My okclients list is '127.0.0.1:0,127.0.0.1:15003', and the hierarchy file is ''

This is all on one test machine running Maui, PBS Server, and PBS Mom.

torque-4.2.10-8.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

This bug was not fixed in the latest version. If an error was made in my configuration that is causing the bug to persist, I would appreciate some insight into what that might be. It seems my comment was disregarded before pushing this version to the stable repository.
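The mismatch visible in the mom log above (4 cpus in the machine topology, but 0 cpus assigned to nodeboard 0) can be picked out mechanically. The following is a minimal Python sketch, assuming only the semicolon-separated line layout seen in the log quoted in this report. The function names and the field labels (timestamp, code, process, and so on) are hypothetical labels chosen for illustration, and the regular expressions match only the message wording quoted here, so other Torque versions may need different patterns.

```python
import re

# Patterns for the two messages quoted in this report.
TOPOLOGY_RE = re.compile(r"machine topology contains (\d+) memory nodes, (\d+) cpus")
NODEBOARD_RE = re.compile(r"nodeboard (\d+): (\d+) cpus \(([^)]*)\), (\d+) mems")


def parse_mom_line(line):
    """Split one mom log line into six fields; return None if it does not fit.

    The message is the sixth field and may itself contain semicolons, so the
    split is limited to the first five separators.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    keys = ("timestamp", "code", "process", "class", "function", "message")
    return dict(zip(keys, (part.strip() for part in parts)))


def cpu_summary(lines):
    """Return (topology_cpus, {nodeboard_index: cpu_count}) found in the log."""
    topology_cpus = None
    nodeboards = {}
    for line in lines:
        entry = parse_mom_line(line)
        if entry is None:
            continue
        match = TOPOLOGY_RE.search(entry["message"])
        if match:
            topology_cpus = int(match.group(2))
        match = NODEBOARD_RE.search(entry["message"])
        if match:
            nodeboards[int(match.group(1))] = int(match.group(2))
    return topology_cpus, nodeboards
```

Fed the log lines quoted above, `cpu_summary` should report 4 topology cpus against 0 cpus on nodeboard 0, which is the symptom this comment describes.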
Steven,
Have you tried contacting the torque users mailing list about this? [torqueusers]
I've updated the package to build using the numa and cpuset configure arguments, which should allow for scheduling of multiple CPUs on a node. I'm not sure about the integration with Maui, as I've only done simple torque configurations to verify torque can run.
Thanks,
- David Brown

David,
Thank you for the info; I will contact the mailing list. The problem might be with the scheduler in my setup. It might help me better isolate this if I test Torque without the Maui scheduler. You mentioned that you use simple Torque configurations to verify it can run; would you mind sharing what this configuration is? Are you testing it with a different scheduler than Maui? Or is there a way to use torque without a scheduler?
Thank you for your help,
Steve

Steve,
I don't know how familiar you are with Chef, but I've been using that to deploy test clusters of torque. My chef cookbook is https://github.com/dmlb2000/torque-cookbook
I thought torque just uses a first in, first out form of scheduling without the use of Maui or Moab, but there may be more.
Thanks,
- David Brown

torque-4.2.10-8.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-9.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4

torque-4.2.10-9.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054

torque-4.2.10-9.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca

torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-f2f3898eca

torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-652d4f3054

torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-80daa121a4

torque-4.2.10-8.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-8.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-8.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-9.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-9.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-9.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.