Bug 717069

Summary: nis ypbind crash because SIGPIPE is not caught
Product: Red Hat Enterprise Linux 5
Component: ypbind
Version: 5.5
Hardware: x86_64
OS: Linux
Severity: medium
Priority: unspecified
Keywords: Patch
Status: CLOSED WONTFIX
Reporter: Konstantin Olchanski <olchansk>
Assignee: Honza Horak <hhorak>
QA Contact: qe-baseos-daemons
CC: ovasik
Target Milestone: rc
Doc Type: Bug Fix
Clones: 842228 (view as bug list)
Bug Blocks: 842228
Last Closed: 2013-03-13 17:49:22 UTC
Attachments (all flagged "none"):
- patch to not catch sigsegv signal
- testing build ypbind-1.19-13.nosigseg.el5.x86_64
- testing build ypbind-debuginfo-1.19-13.nosigseg.el5.x86_64
- testing build ypbind-1.19-13.nosigseg.el5.i386
- testing build ypbind-debuginfo-1.19-13.nosigseg.el5.i386
- proposed patch backported from recent version
- patched version for testing purposes only
- patched version for testing purposes only

Description Konstantin Olchanski 2011-06-27 20:52:23 UTC
I have seen this problem before, but I do not see the bugzilla bug report for it, so here it goes again.

Due to a hardware problem, a machine in a remote location lost all network connectivity. Several days later, the network connection was restored, but users cannot log in because ypbind is no longer running, and the machine is unusable.

System examination shows:
/var/log/messages: about 5 hours' worth of "ypbind[2539]: broadcast rpc timed out" messages; they stop well before the network connection is restored (the expected behaviour is for them to continue until the network connection comes back)
/var/run/ypbind.pid: contains the pid 2539, matching the syslog messages
/var/lock/subsys/ypbind: exists, a zero-size file
pid 2539 is not running; there is no ypbind process in the system.

From this I conclude that ypbind crashed - normal shutdown would have cleared the lock and pid files.

Examination of ypbind changelog at http://www.linux-nis.org/nis/ypbind-mt/ChangeLog does not show anything like this problem.

Ideally, ypbind should sit around forever waiting for the network connection to come back.
K.O.

Comment 1 Honza Horak 2011-06-28 12:34:41 UTC
(In reply to comment #0)
> I have seen this problem before, but I do not see the bugzilla bug report for
> it, so here it goes again.

If you find the corresponding bug, it would be superb.
 
> From this I conclude that ypbind crashed - normal shutdown would have cleared
> the lock and pid files.

I'm not able to reproduce this bug so far. If you can provide a backtrace, it could help a lot.

Comment 2 Honza Horak 2012-05-09 10:59:22 UTC
Konstantin, have you encountered this issue only once, or can you reproduce it? If it's reproducible, I'd need some more info, a backtrace for example.

Comment 3 Konstantin Olchanski 2012-05-09 14:52:24 UTC
Yes, I have seen this more than once. It is too bad that you cannot reproduce it; in the lucky case, you would unplug the network cable, wait, and observe ypbind crash or disappear. But perhaps it takes more to see it: maybe expiration of the DHCP lease plays a role, or ypbind crashes when the network comes back, etc. Perhaps I can find time to reproduce it on RHEL/SL 6. I have seen RHEL6 do strange things
after network outages and may look into this more.
K.O.

Comment 4 Honza Horak 2012-05-10 12:56:15 UTC
Is there anything suspicious in the syslog (e.g. DHCP-related) near the point where you expect ypbind crashed?

Comment 5 Konstantin Olchanski 2012-05-14 20:43:59 UTC
Okay, I have reproduced the problem with 32-bit SL5.8:
May 11 15:41 - unplug network cable
15:42 - start of regular messages "ypbind: broadcast: RPC: timed out."
18:50 - last "ypbind" message
login today: ypbind is not running.
no other "ypbind"-related messages in the system log or dmesg, no core dumps, nothing.

I will now try to reproduce it with SL6.2, will file a separate bug if it shows up there, too.
K.O.

Comment 6 Honza Horak 2012-05-15 08:21:30 UTC
Created attachment 584587 [details]
patch to not catch sigsegv signal

(In reply to comment #5)
> no other "ypbind"-related messages in the system log or dmesg, no core dumps,
> nothing.

Core dumps are not generated because the SIGSEGV signal is blocked, but there is a way to get one. The attached patch removes the blocking of SIGSEGV. Could you run ypbind with this patch?

However, that's not enough. The core dump is also not generated when the service is run using the "service ypbind start" command. It is generated correctly only when ypbind is run from the command line (ideally with the -d option, to see more debug messages).

So here is what you can try:
* raise the core file size limit: ulimit -c unlimited
* build ypbind with the patch attached
* run ypbind from the command-line with -d option and redirect all output to a file: 
  # ypbind -d >ypbind.log 2>&1
* do all the things to reproduce the failure

If you're lucky, you'll get the coredump, which can help with resolving the issue.

To test whether core dumps are being generated, you can use "kill -SIGSEGV [pid]".

Comment 7 Honza Horak 2012-06-26 13:42:04 UTC
Any progress in getting the coredump or backtrace?

Comment 8 Konstantin Olchanski 2012-07-07 01:53:22 UTC
I am sorry I do not presently have time available for creating a core-dump enabled ypbind executable. If you can provide an executable, I can run it and post the core dump stack trace here. K.O.

Comment 9 Honza Horak 2012-07-11 13:51:29 UTC
Created attachment 597579 [details]
testing build ypbind-1.19-13.nosigseg.el5.x86_64

This is an unofficial testing build for x86_64, which includes a patch to not catch the SIGSEGV signal. Please note that this build is unsupported and should be used only for testing purposes.

Comment 10 Honza Horak 2012-07-11 13:53:29 UTC
Created attachment 597582 [details]
testing build ypbind-debuginfo-1.19-13.nosigseg.el5.x86_64

This is an unofficial testing build that contains the debug info for the rpm above. Please note that this build is unsupported and should be used only for testing purposes.

Comment 11 Konstantin Olchanski 2012-07-11 18:03:50 UTC
Thanks, I will try it. There is a small blooper, though - my SL5 test machine is 32-bit. But I can find another test machine to test your 64-bit RPMs. Also things are busy here, it will take me a few days to get to this. K.O.

Comment 12 Honza Horak 2012-07-12 06:19:18 UTC
Created attachment 597715 [details]
testing build ypbind-1.19-13.nosigseg.el5.i386

Comment 13 Honza Horak 2012-07-12 06:20:26 UTC
Created attachment 597716 [details]
testing build ypbind-debuginfo-1.19-13.nosigseg.el5.i386

Comment 14 Konstantin Olchanski 2012-07-13 23:30:12 UTC
Got it. I have just crashed ypbind inside gdb. The crash is from SIGPIPE. Here are the stack traces. Anything else I should capture and post here?

...
5226: trylock = success
5226: do_broadcast() for domain 'isac' is called
5226: broadcast: RPC: Timed out.
5226: leave do_broadcast() for domain 'isac'
5230: Pinging all active server.
5230: do_broadcast() for domain 'isac' is called
5226: Status: YPBIND_FAIL_VAL

Program received signal SIGPIPE, Broken pipe.
0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00b0fb8b in write () from /lib/libc.so.6
#2  0x00b475f4 in writetcp () from /lib/libc.so.6
#3  0x00b49d20 in xdrrec_endofrecord_internal () from /lib/libc.so.6
#4  0x00b47510 in svctcp_reply () from /lib/libc.so.6
#5  0x00b45f5c in svc_sendreply_internal () from /lib/libc.so.6
#6  0x0804ba98 in ypbindprog_2 (rqstp=0xbfffe58c, transp=0xb6800590) at ypbind_svc.c:140
#7  0x00b46832 in svc_getreq_common_internal () from /lib/libc.so.6
#8  0x00b463a5 in svc_getreq_poll_internal () from /lib/libc.so.6
#9  0x00b46e0a in svc_run () from /lib/libc.so.6
#10 0x0804b515 in main (argc=3, argv=0xbfffe7e4) at ypbind-mt.c:808


(gdb) info thr
  3 Thread 0xb73e0b90 (LWP 5230)  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
  2 Thread 0xb7de1b90 (LWP 5229)  0x00776402 in __kernel_vsyscall ()
* 1 Thread 0xb7fe26c0 (LWP 5226)  0x00776402 in __kernel_vsyscall ()

(gdb) thr 1
[Switching to thread 1 (Thread 0xb7fe26c0 (LWP 5226))]#0  0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00b0fb8b in write () from /lib/libc.so.6
#2  0x00b475f4 in writetcp () from /lib/libc.so.6
#3  0x00b49d20 in xdrrec_endofrecord_internal () from /lib/libc.so.6
#4  0x00b47510 in svctcp_reply () from /lib/libc.so.6
#5  0x00b45f5c in svc_sendreply_internal () from /lib/libc.so.6
#6  0x0804ba98 in ypbindprog_2 (rqstp=0xbfffe58c, transp=0xb6800590) at ypbind_svc.c:140
#7  0x00b46832 in svc_getreq_common_internal () from /lib/libc.so.6
#8  0x00b463a5 in svc_getreq_poll_internal () from /lib/libc.so.6
#9  0x00b46e0a in svc_run () from /lib/libc.so.6
#10 0x0804b515 in main (argc=3, argv=0xbfffe7e4) at ypbind-mt.c:808

(gdb) thr 2
[Switching to thread 2 (Thread 0xb7de1b90 (LWP 5229))]#0  0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00be8cde in do_sigwait () from /lib/libpthread.so.0
#2  0x00be8d7f in sigwait () from /lib/libpthread.so.0
#3  0x0804a56e in sig_handler (v_param=0x0) at ypbind-mt.c:414
#4  0x00be0852 in start_thread () from /lib/libpthread.so.0
#5  0x00b1f04e in clone () from /lib/libc.so.6

(gdb) thr 3
[Switching to thread 3 (Thread 0xb73e0b90 (LWP 5230))]#0  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
(gdb) where
#0  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
#1  0x00b44ee3 in clnt_broadcast () from /lib/libc.so.6
#2  0x0804d00d in do_broadcast (list=0x80528f8) at serv_list.c:669
#3  0x0804e3cb in test_bindings_once (lastcheck=660, req_domain=0x0) at serv_list.c:1226
#4  0x0804e528 in test_bindings (param=0x0) at serv_list.c:1073
#5  0x00be0852 in start_thread () from /lib/libpthread.so.0
#6  0x00b1f04e in clone () from /lib/libc.so.6

K.O.

Comment 15 Honza Horak 2012-07-16 07:20:07 UTC
Created attachment 598383 [details]
proposed patch backported from recent version

Thanks for the backtrace. I believe the error is caused by missing SIGPIPE handling. When the connection with a client gets broken and svc_sendreply fails with EPIPE, SIGPIPE is delivered to the daemon. We need to catch that signal, since the daemon will die otherwise.

This signal is already caught in the recent upstream version, so we should backport this behavior.

Comment 16 Honza Horak 2012-07-16 12:49:52 UTC
Created attachment 598444 [details]
patched version for testing purposes only

This is an unofficial testing build for x86_64, which includes the proposed patch to catch and ignore the SIGPIPE signal. Please note that this build is unsupported and should be used only for testing purposes.

Konstantin, please, can you verify if it fixes the failure?

Comment 17 Konstantin Olchanski 2012-07-16 15:39:24 UTC
Yes, I can test the 64-bit test package, but it would be better if I can test the 32-bit package to confirm the fix on the same computer. K.O.

Comment 18 Honza Horak 2012-07-16 15:45:44 UTC
Created attachment 598486 [details]
patched version for testing purposes only

Ah, I forgot you need i386. This is an unofficial testing build for i386, which includes the proposed patch to catch and ignore the SIGPIPE signal. Please note that this build is unsupported and should be used only for testing purposes.

Comment 19 Konstantin Olchanski 2012-07-20 20:12:06 UTC
The test is successful. Unplug the network cable, observe loss of network connectivity, wait 24 hours, observe ypbind still running, reconnect the network cable, observe the network connection is up, ypwhich is happy, ypcat is happy, autofs and nfs are happy, and users can log in. (It takes a few minutes for the NFS TCP connections to come back.)

Assuming you can rank the changes as "low risk", any chance this fix can be pushed into the 5.x updates soon?

K.O.

Comment 20 Honza Horak 2012-07-23 06:22:51 UTC
(In reply to comment #19)
> Test is successful. Unplug network cable, observe loss of network
> connectivity, wait 24 hours, observe ypbind still running, reconnect network
> cable, observe network connection is up, ypwhich is happy, ypcat is happy,
> autofs and nfs are happy, users can login. (takes a few minutes for the NFS
> TCP connections to come back).

Thanks for testing.

> Assuming you can rank the changes as "low risk", any chance this fix can be
> pushed into the 5.x updates soon?

I think it could be feasible through the FasTrack process, since this fix is quite simple and testable, so I'm proposing it as such.

Comment 22 Honza Horak 2013-03-13 17:49:22 UTC
There is no longer any chance to fix this in RHEL-5, since RHEL 5.10 is going to include only serious fixes. Thus, closing as WONTFIX.