Bug 759141

Summary: pbs_server crash on 'pbsnodes' from client without munge
Product: Fedora EPEL
Component: torque
Version: el5
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Dennis van Dok <dennisvd>
Assignee: Steve Traylen <steve.traylen>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: fotis, garrick, maarten.litmaath, steve.traylen
Doc Type: Bug Fix
Fixed In Version: torque-2.5.7-9.el6
Last Closed: 2011-12-20 20:02:38 UTC
Attachments: core dump of server

Description Dennis van Dok 2011-12-01 14:12:48 UTC
Created attachment 539211
core dump of server

Description of problem:

Testing the new version torque-2.5.7-6.el5.x86_64.rpm.

Running pbs_server (with munge) on one machine and running pbsnodes from another machine (where munge isn't running) results in a crash of pbs_server.


Version-Release number of selected component (if applicable):

torque-2.5.7-6.el5.x86_64.rpm

How reproducible:

Two machines are needed: one torque server, one client.

Steps to Reproduce:
1. Install torque, torque-server, libtorque and torque-client from epel-testing:

==========================================================================================
 Package                Arch            Version               Repository             Size
==========================================================================================
Updating:
 torque-server          x86_64          2.5.7-6.el5           epel-testing          184 k
Updating for dependencies:
 libtorque              x86_64          2.5.7-6.el5           epel-testing           93 k
 torque                 x86_64          2.5.7-6.el5           epel-testing           48 k
 torque-client          x86_64          2.5.7-6.el5           epel-testing          198 k

Transaction Summary
==========================================================================================
Install       0 Package(s)
Upgrade       4 Package(s)

Install torque-mom, torque-client on the other machine:
==========================================================================================
 Package                Arch            Version               Repository             Size
==========================================================================================
Updating:
 torque                 x86_64          2.5.7-6.el5           epel-testing           48 k
Updating for dependencies:
 libtorque              x86_64          2.5.7-6.el5           epel-testing           93 k
 torque-client          x86_64          2.5.7-6.el5           epel-testing          198 k
 torque-mom             x86_64          2.5.7-6.el5           epel-testing          164 k

Transaction Summary
==========================================================================================
Install       0 Package(s)
Upgrade       4 Package(s)

This machine is configured as a worker node for the cluster.

2. Start pbs_server:
# /usr/sbin/pbs_server -d /var/torque -D

3. On the other machine, run pbsnodes.
  
Actual results:

On the server:

*** glibc detected *** /usr/sbin/pbs_server: double free or corruption (!prev): 0x00000000095ca1d0 ***
Segmentation fault (core dumped)

On the client:

munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
pbsnodes: PBS_Server System error: Success MSG=munge failure


Expected results:

Node output

Additional info:
Core was generated by `/usr/sbin/pbs_server -d /var/torque -D'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003e2f472d5a in _int_malloc () from /lib64/libc.so.6
(gdb) where
#0  0x0000003e2f472d5a in _int_malloc () from /lib64/libc.so.6
#1  0x0000003e2f474e2e in malloc () from /lib64/libc.so.6
#2  0x0000003e2f00a151 in _dl_new_object () from /lib64/ld-linux-x86-64.so.2
#3  0x0000003e2f005a6c in _dl_map_object_from_fd () from /lib64/ld-linux-x86-64.so.2
#4  0x0000003e2f0077d3 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#5  0x0000003e2f010dc2 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x0000003e2f00d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x0000003e2f0107dc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x0000003e2f509380 in do_dlopen () from /lib64/libc.so.6
#9  0x0000003e2f00d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x0000003e2f5094e7 in __libc_dlopen_mode () from /lib64/libc.so.6
#11 0x0000003e2f4e67cf in backtrace () from /lib64/libc.so.6
#12 0x0000003e2f46a9af in __libc_message () from /lib64/libc.so.6
#13 0x0000003e2f47245f in _int_free () from /lib64/libc.so.6
#14 0x0000003e2f4728bb in free () from /lib64/libc.so.6
#15 0x000000000041dac2 in reply_send (request=0x95ca1d0) at reply_send.c:308
#16 0x000000000041df29 in req_reject (code=15012, aux=0, preq=0x95ca1d0, HostName=0x0, 
    Msg=0x448b07 "munge failure") at reply_send.c:464
#17 0x000000000042026d in req_altauthenuser (preq=0x95ca1d0) at req_getcred.c:474
#18 0x000000000041d02a in process_request (sfds=10) at process_request.c:553
#19 0x00002b45ff87cc0e in wait_request (waittime=<value optimized out>, SState=0x728d38)
    at ../Libnet/net_server.c:508
#20 0x000000000041b06f in main_loop () at pbsd_main.c:1198
#21 0x000000000041ba87 in main (argc=4, argv=<value optimized out>) at pbsd_main.c:1754
(gdb)
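
Reading the frames, glibc's double-free check fires inside the free() that reply_send() (reply_send.c:308) performs on the batch request handed down by req_reject() after the munge failure, i.e. glibc thinks that block was already released. As a purely illustrative aid, here is a minimal, self-contained C sketch of the ownership rule the trace suggests; struct batch_request, reply_send_stub and handle_auth_failure are made-up stand-ins rather than TORQUE source, and only the 15012 code and the "munge failure" text come from the trace.

/*
 * Minimal, self-contained sketch (not TORQUE source code) of the
 * ownership pattern suggested by the backtrace above: the reply path
 * frees the batch request, so a caller that also frees it -- before or
 * after rejecting -- produces exactly the glibc "double free or
 * corruption" abort seen here.
 */
#include <stdio.h>
#include <stdlib.h>

struct batch_request {
    int         code;   /* PBSE_*-style error code */
    const char *msg;    /* static error text, not owned by the struct */
};

/* Reply path: report the rejection and release the request.
 * After this returns, the caller must not touch *preq again. */
static void reply_send_stub(struct batch_request *preq)
{
    fprintf(stderr, "reject: code=%d msg=%s\n", preq->code, preq->msg);
    free(preq);                      /* the request is freed HERE */
}

/* Authentication-failure path: hand the request to the reply path
 * exactly once, then forget the pointer. */
static void handle_auth_failure(struct batch_request **preq)
{
    (*preq)->code = 15012;
    (*preq)->msg  = "munge failure";
    reply_send_stub(*preq);          /* transfers ownership */
    *preq = NULL;                    /* drop the stale reference */
    /* Freeing *preq again here, or freeing it before the call above,
     * is the double-free class of bug this crash points at. */
}

int main(void)
{
    struct batch_request *req = calloc(1, sizeof *req);
    if (req == NULL)
        return 1;

    handle_auth_failure(&req);
    /* req is NULL now, so no later cleanup can free it twice. */
    return 0;
}

In other words, whichever error path hands the request to the reply/reject code has to treat that call as the last use of the pointer; freeing it on either side of that call as well would reproduce this kind of abort.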

Comment 1 Steve Traylen 2011-12-01 14:38:51 UTC
Thanks for checking.

Could you give this scratch build a go:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3555507

I can't look at this till later.

I'm not necessarily releasing 2.5.9, but it's good to check whether there's something in there we can take.

Many Thanks 
Steve.

Comment 2 Dennis van Dok 2011-12-01 15:21:01 UTC
Steve,

after updating both server and client to 2.5.9, I don't get the crash. The message on the client is as expected:

munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
Communication failure.
pbsnodes: cannot connect to server torqueserver.testbed, error=15009 (munge executable not found, unable to authenticate)

This is normal as munge isn't used on worker nodes.

Comment 3 Steve Traylen 2011-12-01 15:27:54 UTC
Great, that gives me something to work with; I'll try to get something done by or on the weekend.

If necessary 2.5.9 goes out.

We should add a comment to the karma pages about this.

Thanks again.

Steve.

Comment 4 Steve Traylen 2011-12-01 15:29:55 UTC
>pbsnodes: cannot connect to server torqueserver.testbed, error=15009 (munge
>executable not found, unable to authenticate)

is the munge executable really not on a host where pbsnodes is? That makes no sense.

Comment 5 Dennis van Dok 2011-12-01 15:37:38 UTC
> is the munge executable really not on a host where pbsnodes is? That makes no
> sense.

Munge isn't required for worker nodes AFAIK. So my 'test' to run pbsnodes was a bit silly in the first place; I did that just out of curiosity.

Comment 6 Steve Traylen 2011-12-01 15:50:49 UTC
(In reply to comment #5)
> > is the munge executable really not on a host where pbsnodes is? That makes no
> > sense.
> 
> Munge isn't required for worker nodes AFAIK. So my 'test' to run pbsnodes was a
> bit silly in the first place; I did that just out of curiosity.

Okay, but if pbsnodes is in a package then munge should at least be installed, even if it's not running.

Comment 7 Dennis van Dok 2011-12-01 21:42:59 UTC
(In reply to comment #6)
> Okay, but if pbsnodes is in a package then munge should at least be installed,
> even if it's not running.
Looks like a missing dependency of torque-client on munge.

Comment 8 Steve Traylen 2011-12-03 17:52:47 UTC
>Testing the new version torque-2.5.7-6.el5.x86_64.rpm.
> 
> Running pbs_server (with munge) on one machine, and doing pbsnodes from another
> machine (where munge isn't running) results in a crash of pbs_server.
> 

I've reproduced this now on el6 with torque-2.5.7-8.el6, which carries the same patch set.

Steve.

Comment 9 Fedora Update System 2011-12-03 18:51:44 UTC
torque-2.5.7-7.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/torque-2.5.7-7.el5

Comment 10 Fedora Update System 2011-12-03 18:52:02 UTC
torque-2.5.7-9.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/torque-2.5.7-9.el6

Comment 11 Fedora Update System 2011-12-03 18:52:20 UTC
torque-2.5.7-7.el4 has been submitted as an update for Fedora EPEL 4.
https://admin.fedoraproject.org/updates/torque-2.5.7-7.el4

Comment 12 Fedora Update System 2011-12-04 20:00:17 UTC
Package torque-2.5.7-9.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing torque-2.5.7-9.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2011-5157/torque-2.5.7-9.el6
then log in and leave karma (feedback).

Comment 13 Fedora Update System 2011-12-20 20:02:38 UTC
torque-2.5.7-7.el5 has been pushed to the Fedora EPEL 5 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2011-12-20 20:03:47 UTC
torque-2.5.7-7.el4 has been pushed to the Fedora EPEL 4 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2011-12-20 20:04:14 UTC
torque-2.5.7-9.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.