Bug 759141 - pbs_server crash on 'pbsnodes' from client without munge
Summary: pbs_server crash on 'pbsnodes' from client without munge
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: torque
Version: el5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Steve Traylen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-12-01 14:12 UTC by Dennis van Dok
Modified: 2011-12-20 20:04 UTC (History)

Fixed In Version: torque-2.5.7-9.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-20 20:02:38 UTC
Type: ---
Embargoed:


Attachments
core dump of server (13.23 MB, application/octet-stream)
2011-12-01 14:12 UTC, Dennis van Dok

Description Dennis van Dok 2011-12-01 14:12:48 UTC
Created attachment 539211
core dump of server

Description of problem:

Testing the new version torque-2.5.7-6.el5.x86_64.rpm.

Running pbs_server (with munge) on one machine, and doing pbsnodes from another machine (where munge isn't running) results in a crash of pbs_server.


Version-Release number of selected component (if applicable):

torque-2.5.7-6.el5.x86_64.rpm

How reproducible:

Needs two machines: one torque server and one client.

Steps to Reproduce:
1. Install torque, torque-server, libtorque and torque-client from epel-testing on the server:

==========================================================================================
 Package                Arch            Version               Repository             Size
==========================================================================================
Updating:
 torque-server          x86_64          2.5.7-6.el5           epel-testing          184 k
Updating for dependencies:
 libtorque              x86_64          2.5.7-6.el5           epel-testing           93 k
 torque                 x86_64          2.5.7-6.el5           epel-testing           48 k
 torque-client          x86_64          2.5.7-6.el5           epel-testing          198 k

Transaction Summary
==========================================================================================
Install       0 Package(s)
Upgrade       4 Package(s)

Install torque-mom, torque-client on the other machine:
==========================================================================================
 Package                Arch            Version               Repository             Size
==========================================================================================
Updating:
 torque                 x86_64          2.5.7-6.el5           epel-testing           48 k
Updating for dependencies:
 libtorque              x86_64          2.5.7-6.el5           epel-testing           93 k
 torque-client          x86_64          2.5.7-6.el5           epel-testing          198 k
 torque-mom             x86_64          2.5.7-6.el5           epel-testing          164 k

Transaction Summary
==========================================================================================
Install       0 Package(s)
Upgrade       4 Package(s)

This machine is configured as a worker node for the cluster.

2. Start pbs_server on the server:
# /usr/sbin/pbs_server -d /var/torque -D

3. On the other machine, run pbsnodes.
  
Actual results:

On the server:

*** glibc detected *** /usr/sbin/pbs_server: double free or corruption (!prev): 0x00000000095ca1d0 ***
Segmentation fault (core dumped)

On the client:

munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
pbsnodes: PBS_Server System error: Success MSG=munge failure


Expected results:

Node output

Additional info:
Core was generated by `/usr/sbin/pbs_server -d /var/torque -D'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003e2f472d5a in _int_malloc () from /lib64/libc.so.6
(gdb) where
#0  0x0000003e2f472d5a in _int_malloc () from /lib64/libc.so.6
#1  0x0000003e2f474e2e in malloc () from /lib64/libc.so.6
#2  0x0000003e2f00a151 in _dl_new_object () from /lib64/ld-linux-x86-64.so.2
#3  0x0000003e2f005a6c in _dl_map_object_from_fd () from /lib64/ld-linux-x86-64.so.2
#4  0x0000003e2f0077d3 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#5  0x0000003e2f010dc2 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x0000003e2f00d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x0000003e2f0107dc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x0000003e2f509380 in do_dlopen () from /lib64/libc.so.6
#9  0x0000003e2f00d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x0000003e2f5094e7 in __libc_dlopen_mode () from /lib64/libc.so.6
#11 0x0000003e2f4e67cf in backtrace () from /lib64/libc.so.6
#12 0x0000003e2f46a9af in __libc_message () from /lib64/libc.so.6
#13 0x0000003e2f47245f in _int_free () from /lib64/libc.so.6
#14 0x0000003e2f4728bb in free () from /lib64/libc.so.6
#15 0x000000000041dac2 in reply_send (request=0x95ca1d0) at reply_send.c:308
#16 0x000000000041df29 in req_reject (code=15012, aux=0, preq=0x95ca1d0, HostName=0x0, 
    Msg=0x448b07 "munge failure") at reply_send.c:464
#17 0x000000000042026d in req_altauthenuser (preq=0x95ca1d0) at req_getcred.c:474
#18 0x000000000041d02a in process_request (sfds=10) at process_request.c:553
#19 0x00002b45ff87cc0e in wait_request (waittime=<value optimized out>, SState=0x728d38)
    at ../Libnet/net_server.c:508
#20 0x000000000041b06f in main_loop () at pbsd_main.c:1198
#21 0x000000000041ba87 in main (argc=4, argv=<value optimized out>) at pbsd_main.c:1754
(gdb)
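
The backtrace above shows free() being reached from reply_send() via req_reject() for request 0x95ca1d0, which is the same address glibc reports in the "double free or corruption" message. One common way this class of bug arises is when two code paths each assume they own the request and both release it. The following is only a minimal sketch of a defensive ownership pattern, not Torque's code and not the fix that was shipped: `struct request`, `request_new`, `request_release`, `request_reject` and `release_count` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a server-side request object.
 * This is NOT Torque's struct batch_request. */
struct request {
    int   reject_code;
    char *msg;
};

/* Counts real releases so the behaviour can be checked; illustration only. */
static int release_count = 0;

/* Allocates a zero-initialised request; returns NULL on allocation failure. */
static struct request *request_new(void)
{
    return calloc(1, sizeof(struct request));
}

/* Frees the request and clears the caller's pointer, so that a second
 * release through the same pointer becomes a harmless no-op. */
static void request_release(struct request **preq)
{
    if (preq == NULL || *preq == NULL)
        return;
    free((*preq)->msg);
    free(*preq);
    *preq = NULL;
    release_count++;
}

/* Hypothetical analogue of a reject path: records the error code and
 * message, would send the reply (omitted here), then releases the request. */
static void request_reject(struct request **preq, int code, const char *msg)
{
    assert(preq != NULL);
    if (*preq == NULL)
        return;
    (*preq)->reject_code = code;
    (*preq)->msg = malloc(strlen(msg) + 1);
    if ((*preq)->msg != NULL)
        strcpy((*preq)->msg, msg);
    request_release(preq);
}
```

With this shape, a caller that rejects a request (for example with code 15012 and the message "munge failure", as in frame #16) and then releases it again on its own error path only triggers one real free(), because the pointer has already been cleared.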

Comment 1 Steve Traylen 2011-12-01 14:38:51 UTC
Thanks for checking.

Could you give this scratch build a go:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3555507

I can't look at this till later.

I'm not necessarily releasing 2.5.9, but it's good to check if there's something in there we can take.

Many Thanks 
Steve.

Comment 2 Dennis van Dok 2011-12-01 15:21:01 UTC
Steve,

after updating both server and client to 2.5.9, I don't get the crash. The message on the client is as expected:

munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
Communication failure.
pbsnodes: cannot connect to server torqueserver.testbed, error=15009 (munge executable not found, unable to authenticate)

This is normal as munge isn't used on worker nodes.

Comment 3 Steve Traylen 2011-12-01 15:27:54 UTC
Great, that gives me something to work with. I will try to get something done by or on the weekend.

If necessary, 2.5.9 goes out.

We should add a comment to the karma pages about this.

Thanks again.

Steve.

Comment 4 Steve Traylen 2011-12-01 15:29:55 UTC
>pbsnodes: cannot connect to server torqueserver.testbed, error=15009 (munge
>executable not found, unable to authenticate)

is the munge executable really not on a host where pbsnodes is? That makes no sense.

Comment 5 Dennis van Dok 2011-12-01 15:37:38 UTC
> is the munge executable really not on a host where pbsnodes is? That makes no
> sense.

Munge isn't required for worker nodes AFAIK. So my 'test' to run pbsnodes was a bit silly in the first place; I did that just out of curiosity.

Comment 6 Steve Traylen 2011-12-01 15:50:49 UTC
(In reply to comment #5)
> > is the munge executable really not on a host where pbsnodes is? That makes no
> > sense.
> 
> Munge isn't required for worker nodes AFAIK. So my 'test' to run pbsnodes was a
> bit silly in the first place; I did that just out of curiosity.

Okay, but since pbsnodes is in a package, munge should at least be installed even if it's not running.

Comment 7 Dennis van Dok 2011-12-01 21:42:59 UTC
(In reply to comment #6)
> Okay but pbsnodes is in a package then munge should at least be installed even
> if it's not running.
Looks like a missing dependency of torque-client on munge.

Comment 8 Steve Traylen 2011-12-03 17:52:47 UTC
>Testing the new version torque-2.5.7-6.el5.x86_64.rpm.
> 
> Running pbs_server (with munge) on one machine, and doing pbsnodes from another
> machine (where munge isn't running) results in a crash of pbs_server.
> 

I have now reproduced this on el6 with torque-2.5.7-8.el6, which has the same patch set.

Steve.

Comment 9 Fedora Update System 2011-12-03 18:51:44 UTC
torque-2.5.7-7.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/torque-2.5.7-7.el5

Comment 10 Fedora Update System 2011-12-03 18:52:02 UTC
torque-2.5.7-9.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/torque-2.5.7-9.el6

Comment 11 Fedora Update System 2011-12-03 18:52:20 UTC
torque-2.5.7-7.el4 has been submitted as an update for Fedora EPEL 4.
https://admin.fedoraproject.org/updates/torque-2.5.7-7.el4

Comment 12 Fedora Update System 2011-12-04 20:00:17 UTC
Package torque-2.5.7-9.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing torque-2.5.7-9.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2011-5157/torque-2.5.7-9.el6
then log in and leave karma (feedback).

Comment 13 Fedora Update System 2011-12-20 20:02:38 UTC
torque-2.5.7-7.el5 has been pushed to the Fedora EPEL 5 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2011-12-20 20:03:47 UTC
torque-2.5.7-7.el4 has been pushed to the Fedora EPEL 4 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2011-12-20 20:04:14 UTC
torque-2.5.7-9.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

