Created attachment 539211 [details]
core dump of server

Description of problem:
Testing the new version torque-2.5.7-6.el5.x86_64.rpm.

Running pbs_server (with munge) on one machine, and doing pbsnodes from another machine (where munge isn't running) results in a crash of pbs_server.

Version-Release number of selected component (if applicable):
torque-2.5.7-6.el5.x86_64.rpm

How reproducible:
need two machines, one torque server, one client.

Steps to Reproduce:
1. Installed torque, torque-server, libtorque, torque-client from epel-testing

==========================================================================================
 Package              Arch            Version              Repository               Size
==========================================================================================
Updating:
 torque-server        x86_64          2.5.7-6.el5          epel-testing            184 k
Updating for dependencies:
 libtorque            x86_64          2.5.7-6.el5          epel-testing             93 k
 torque               x86_64          2.5.7-6.el5          epel-testing             48 k
 torque-client        x86_64          2.5.7-6.el5          epel-testing            198 k

Transaction Summary
==========================================================================================
Install       0 Package(s)
Upgrade       4 Package(s)

Install torque-mom, torque-client on the other machine:

==========================================================================================
 Package              Arch            Version              Repository               Size
==========================================================================================
Updating:
 torque               x86_64          2.5.7-6.el5          epel-testing             48 k
Updating for dependencies:
 libtorque            x86_64          2.5.7-6.el5          epel-testing             93 k
 torque-client        x86_64          2.5.7-6.el5          epel-testing            198 k
 torque-mom           x86_64          2.5.7-6.el5          epel-testing            164 k

Transaction Summary
==========================================================================================
Install       0 Package(s)
Upgrade       4 Package(s)

This machine is configured as a worker node for the cluster.

2. Starting pbs_server
# /usr/sbin/pbs_server -d /var/torque -D

3. on the other machine, run pbsnodes

Actual results:
On the server:
*** glibc detected *** /usr/sbin/pbs_server: double free or corruption (!prev): 0x00000000095ca1d0 ***
Segmentation fault (core dumped)

On the client:
munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
pbsnodes: PBS_Server System error: Success MSG=munge failure

Expected results:
Node output

Additional info:
Core was generated by `/usr/sbin/pbs_server -d /var/torque -D'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003e2f472d5a in _int_malloc () from /lib64/libc.so.6
(gdb) where
#0  0x0000003e2f472d5a in _int_malloc () from /lib64/libc.so.6
#1  0x0000003e2f474e2e in malloc () from /lib64/libc.so.6
#2  0x0000003e2f00a151 in _dl_new_object () from /lib64/ld-linux-x86-64.so.2
#3  0x0000003e2f005a6c in _dl_map_object_from_fd () from /lib64/ld-linux-x86-64.so.2
#4  0x0000003e2f0077d3 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#5  0x0000003e2f010dc2 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x0000003e2f00d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x0000003e2f0107dc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x0000003e2f509380 in do_dlopen () from /lib64/libc.so.6
#9  0x0000003e2f00d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x0000003e2f5094e7 in __libc_dlopen_mode () from /lib64/libc.so.6
#11 0x0000003e2f4e67cf in backtrace () from /lib64/libc.so.6
#12 0x0000003e2f46a9af in __libc_message () from /lib64/libc.so.6
#13 0x0000003e2f47245f in _int_free () from /lib64/libc.so.6
#14 0x0000003e2f4728bb in free () from /lib64/libc.so.6
#15 0x000000000041dac2 in reply_send (request=0x95ca1d0) at reply_send.c:308
#16 0x000000000041df29 in req_reject (code=15012, aux=0, preq=0x95ca1d0, HostName=0x0, Msg=0x448b07 "munge failure") at reply_send.c:464
#17 0x000000000042026d in req_altauthenuser (preq=0x95ca1d0) at req_getcred.c:474
#18 0x000000000041d02a in process_request (sfds=10) at process_request.c:553
#19 0x00002b45ff87cc0e in wait_request (waittime=<value optimized out>, SState=0x728d38) at ../Libnet/net_server.c:508
#20 0x000000000041b06f in main_loop () at pbsd_main.c:1198
#21 0x000000000041ba87 in main (argc=4, argv=<value optimized out>) at pbsd_main.c:1754
(gdb)
Thanks for checking. Could you give this scratch build a go: http://koji.fedoraproject.org/koji/taskinfo?taskID=3555507 I can't look at this till later. I'm not necessarily releasing 2.5.9, but it's good to check if there's something in there we can take. Many Thanks Steve.
Steve,

after updating both server and client to 2.5.9, I don't get the crash. The message on the client is as expected:

munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
Communication failure.
pbsnodes: cannot connect to server torqueserver.testbed, error=15009 (munge executable not found, unable to authenticate)

This is normal as munge isn't used on worker nodes.
Great, that gives me something to work with, will try and get something done by or on the weekend. If necessary 2.5.9 goes out. We should add a comment to the karma pages about this. Thanks again. Steve.
> pbsnodes: cannot connect to server torqueserver.testbed, error=15009 (munge
> executable not found, unable to authenticate)

Is the munge executable really not on a host where pbsnodes is? That makes no sense.
> is the munge executable really not on a host where pbsnodes is? That makes no
> sense.

Munge isn't required for worker nodes AFAIK. So my 'test' to run pbsnodes was a bit silly in the first place; I did that just out of curiosity.
(In reply to comment #5)
> > is the munge executable really not on a host where pbsnodes is? That makes no
> > sense.
>
> Munge isn't required for worker nodes AFAIK. So my 'test' to run pbsnodes was a
> bit silly in the first place; I did that just out of curiosity.

Okay, but pbsnodes is in a package, so munge should at least be installed even if it's not running.
(In reply to comment #6)
> Okay, but pbsnodes is in a package, so munge should at least be installed even
> if it's not running.

Looks like a missing dependency of torque-client on munge.
> Testing the new version torque-2.5.7-6.el5.x86_64.rpm.
>
> Running pbs_server (with munge) on one machine, and doing pbsnodes from another
> machine (where munge isn't running) results in a crash of pbs_server.

Have reproduced now on el6 with torque-2.5.7-8.el6, which is the same patch set.

Steve.
torque-2.5.7-7.el5 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/torque-2.5.7-7.el5
torque-2.5.7-9.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/torque-2.5.7-9.el6
torque-2.5.7-7.el4 has been submitted as an update for Fedora EPEL 4. https://admin.fedoraproject.org/updates/torque-2.5.7-7.el4
Package torque-2.5.7-9.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing torque-2.5.7-9.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2011-5157/torque-2.5.7-9.el6
then log in and leave karma (feedback).
torque-2.5.7-7.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-7.el4 has been pushed to the Fedora EPEL 4 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-9.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.