Bug 842228

Summary: nis ypbind crash because SIGPIPE is not caught
Product: Red Hat Enterprise Linux 6
Reporter: Honza Horak <hhorak>
Component: ypbind
Assignee: Honza Horak <hhorak>
Status: CLOSED ERRATA
QA Contact: Jakub Prokes <jprokes>
Severity: low
Docs Contact:
Priority: low
Version: 6.4
CC: jprokes, mmuzila, olchansk, ovasik, psklenar, tlavigne, todoleza
Target Milestone: rc
Keywords: Patch
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: SIGPIPE is not in the proper signal set
Consequence: ypbind crashes when network connectivity is lost
Fix: Add SIGPIPE to the proper signal set
Result: ypbind does not crash
Story Points: ---
Clone Of: 717069
Environment:
Last Closed: 2015-07-22 06:44:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 717069    
Bug Blocks: 947782, 1159825    
Attachments:
Description Flags
add SIGPIPE to proper signal set none

Description Honza Horak 2012-07-23 08:06:38 UTC
+++ This bug was initially created as a clone of Bug #717069 +++

I have seen this problem before, but I do not see the bugzilla bug report for it, so here it goes again.

Due to a hardware problem, a machine in a remote location lost all network connectivity. Several days later, the network connection was restored, but users cannot log in because ypbind is no longer running; the machine is unusable.

System examination shows:
/var/log/messages: shows about 5 hours' worth of "ypbind[2539]: broadcast rpc timed out" messages; they stop well before the network connection is restored (the expected behaviour would be for them to continue until the network connection comes back)
/var/run/ypbind.pid: contains the pid 2539 matching syslog messages
/var/lock/subsys/ypbind: exists, a zero size file
pid 2539 is not running; there is no ypbind process on the system.

From this I conclude that ypbind crashed - normal shutdown would have cleared the lock and pid files.

Examination of ypbind changelog at http://www.linux-nis.org/nis/ypbind-mt/ChangeLog does not show anything like this problem.

Ideally, ypbind should sit around forever waiting for the network connection to come back.
K.O.

--- Additional comment from hhorak on 2011-06-28 08:34:41 EDT ---

(In reply to comment #0)
> I have seen this problem before, but I do not see the bugzilla bug report for
> it, so here it goes again.

If you find the corresponding bug, it would be superb.
 
> From this I conclude that ypbind crashed - normal shutdown would have cleared
> the lock and pid files.

I'm not able to reproduce this bug so far. If you can provide a backtrace, it could help a lot.

--- Additional comment from hhorak on 2012-05-09 06:59:22 EDT ---

Konstantin, have you encountered this issue only once, or can you reproduce it? If it's reproducible, I'd need some more info, a backtrace for example.

--- Additional comment from olchansk on 2012-05-09 10:52:24 EDT ---

Yes, I have seen this more than once. It is too bad that you cannot reproduce it; in the lucky case, you would unplug the network cable, wait, and observe ypbind crash or disappear. But perhaps it takes more to see it: maybe expiration of a DHCP lease plays a role, or ypbind crashes when the network comes back, etc. Perhaps I can find time to reproduce it on RHEL/SL 6 - I have seen RHEL6 do strange stuff after network outages and may look into this more.
K.O.

--- Additional comment from hhorak on 2012-05-10 08:56:15 EDT ---

Is there anything suspicious in the syslog (e.g. DHCP-related) near the point where you expect ypbind crashed?

--- Additional comment from olchansk on 2012-05-14 16:43:59 EDT ---

Okay, I have reproduced the problem with 32-bit SL5.8:
May 11 15:41 - unplug network cable
15:42 - start of regular messages "ypbind: broadcast: RPC: timed out."
18:50 - last "ypbind" message
login today: ypbind is not running.
no other "ypbind"-related messages in the system log or dmesg, no core dumps, nothing.

I will now try to reproduce it with SL6.2, will file a separate bug if it shows up there, too.
K.O.

--- Additional comment from hhorak on 2012-05-15 04:21:30 EDT ---

Created attachment 584587 [details]
patch to not catch sigsegv signal

(In reply to comment #5)
> no other "ypbind"-related messages in the system log or dmesg, no core dumps,
> nothing.

Core dumps are not generated because the SIGSEGV signal is blocked, but there is a way to get the core dump: the attached patch removes the blocking of SIGSEGV. Could you run ypbind with this patch?

However, that's not enough. The core dump is also not generated when the service is run using the "service ypbind start" command. It is correctly generated only when ypbind is run from the command line (ideally with the -d option to see more debug messages).

So here is what you can try:
* set up ulimit -c
* build ypbind with the patch attached
* run ypbind from the command-line with -d option and redirect all output to a file: 
  # ypbind -d >ypbind.log 2>&1
* do all the things to reproduce the failure

If you're lucky, you'll get a core dump, which can help with resolving the issue.

To test whether a core dump is being generated, you can use "kill -sigsegv [pid]".

--- Additional comment from hhorak on 2012-06-26 09:42:04 EDT ---

Any progress in getting the coredump or backtrace?

--- Additional comment from olchansk on 2012-07-06 21:53:22 EDT ---

I am sorry, I do not presently have time available for creating a core-dump-enabled ypbind executable. If you can provide an executable, I can run it and post the core dump stack trace here. K.O.

--- Additional comment from hhorak on 2012-07-11 09:51:29 EDT ---

Created attachment 597579 [details]
testing build ypbind-1.19-13.nosigseg.el5.x86_64

This is an unofficial testing build for x86_64, which includes a patch to not catch the SIGSEGV signal. Please note that this build is unsupported and should be used for testing purposes only.

--- Additional comment from hhorak on 2012-07-11 09:53:29 EDT ---

Created attachment 597582 [details]
testing build ypbind-debuginfo-1.19-13.nosigseg.el5.x86_64

This is an unofficial testing build that contains the debug info for the rpm above. Please note that this build is unsupported and should be used for testing purposes only.

--- Additional comment from olchansk on 2012-07-11 14:03:50 EDT ---

Thanks, I will try it. There is a small blooper, though - my SL5 test machine is 32-bit. But I can find another test machine to test your 64-bit RPMs. Also things are busy here, it will take me a few days to get to this. K.O.

--- Additional comment from hhorak on 2012-07-12 02:19:18 EDT ---

Created attachment 597715 [details]
testing build ypbind-1.19-13.nosigseg.el5.i386

--- Additional comment from hhorak on 2012-07-12 02:20:26 EDT ---

Created attachment 597716 [details]
testing build ypbind-debuginfo-1.19-13.nosigseg.el5.i386

--- Additional comment from olchansk on 2012-07-13 19:30:12 EDT ---

Got it. I have crashed ypbind inside gdb just now. The crash is from SIGPIPE. Here are the stack traces. Anything else I should capture and post here?

...
5226: trylock = success
5226: do_broadcast() for domain 'isac' is called
5226: broadcast: RPC: Timed out.
5226: leave do_broadcast() for domain 'isac'
5230: Pinging all active server.
5230: do_broadcast() for domain 'isac' is called
5226: Status: YPBIND_FAIL_VAL

Program received signal SIGPIPE, Broken pipe.
0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00b0fb8b in write () from /lib/libc.so.6
#2  0x00b475f4 in writetcp () from /lib/libc.so.6
#3  0x00b49d20 in xdrrec_endofrecord_internal () from /lib/libc.so.6
#4  0x00b47510 in svctcp_reply () from /lib/libc.so.6
#5  0x00b45f5c in svc_sendreply_internal () from /lib/libc.so.6
#6  0x0804ba98 in ypbindprog_2 (rqstp=0xbfffe58c, transp=0xb6800590) at ypbind_svc.c:140
#7  0x00b46832 in svc_getreq_common_internal () from /lib/libc.so.6
#8  0x00b463a5 in svc_getreq_poll_internal () from /lib/libc.so.6
#9  0x00b46e0a in svc_run () from /lib/libc.so.6
#10 0x0804b515 in main (argc=3, argv=0xbfffe7e4) at ypbind-mt.c:808


(gdb) info thr
  3 Thread 0xb73e0b90 (LWP 5230)  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
  2 Thread 0xb7de1b90 (LWP 5229)  0x00776402 in __kernel_vsyscall ()
* 1 Thread 0xb7fe26c0 (LWP 5226)  0x00776402 in __kernel_vsyscall ()

(gdb) thr 1
[Switching to thread 1 (Thread 0xb7fe26c0 (LWP 5226))]#0  0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00b0fb8b in write () from /lib/libc.so.6
#2  0x00b475f4 in writetcp () from /lib/libc.so.6
#3  0x00b49d20 in xdrrec_endofrecord_internal () from /lib/libc.so.6
#4  0x00b47510 in svctcp_reply () from /lib/libc.so.6
#5  0x00b45f5c in svc_sendreply_internal () from /lib/libc.so.6
#6  0x0804ba98 in ypbindprog_2 (rqstp=0xbfffe58c, transp=0xb6800590) at ypbind_svc.c:140
#7  0x00b46832 in svc_getreq_common_internal () from /lib/libc.so.6
#8  0x00b463a5 in svc_getreq_poll_internal () from /lib/libc.so.6
#9  0x00b46e0a in svc_run () from /lib/libc.so.6
#10 0x0804b515 in main (argc=3, argv=0xbfffe7e4) at ypbind-mt.c:808

(gdb) thr 2
[Switching to thread 2 (Thread 0xb7de1b90 (LWP 5229))]#0  0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00be8cde in do_sigwait () from /lib/libpthread.so.0
#2  0x00be8d7f in sigwait () from /lib/libpthread.so.0
#3  0x0804a56e in sig_handler (v_param=0x0) at ypbind-mt.c:414
#4  0x00be0852 in start_thread () from /lib/libpthread.so.0
#5  0x00b1f04e in clone () from /lib/libc.so.6

(gdb) thr 3
[Switching to thread 3 (Thread 0xb73e0b90 (LWP 5230))]#0  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
(gdb) where
#0  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
#1  0x00b44ee3 in clnt_broadcast () from /lib/libc.so.6
#2  0x0804d00d in do_broadcast (list=0x80528f8) at serv_list.c:669
#3  0x0804e3cb in test_bindings_once (lastcheck=660, req_domain=0x0) at serv_list.c:1226
#4  0x0804e528 in test_bindings (param=0x0) at serv_list.c:1073
#5  0x00be0852 in start_thread () from /lib/libpthread.so.0
#6  0x00b1f04e in clone () from /lib/libc.so.6

K.O.

--- Additional comment from hhorak on 2012-07-16 03:20:07 EDT ---

Created attachment 598383 [details]
proposed patch backported from recent version

Thanks for the backtrace. I believe the error is caused by missing SIGPIPE handling. When a connection with a client gets broken and svc_sendreply fails with EPIPE, SIGPIPE is delivered to the daemon. We need to catch that signal, since the daemon will fail otherwise.

This signal is already caught in the recent upstream version, so we should backport this behavior.

--- Additional comment from hhorak on 2012-07-16 08:49:52 EDT ---

Created attachment 598444 [details]
patched version for testing purposes only

This is an unofficial testing build for x86_64, which includes the proposed patch to catch the SIGPIPE signal and ignore it. Please note that this build is unsupported and should be used for testing purposes only.

Konstantin, please, can you verify if it fixes the failure?

--- Additional comment from olchansk on 2012-07-16 11:39:24 EDT ---

Yes, I can test the 64-bit test package, but it would be better if I could test the 32-bit package to confirm the fix on the same computer. K.O.

--- Additional comment from hhorak on 2012-07-16 11:45:44 EDT ---

Created attachment 598486 [details]
patched version for testing purposes only

Ah, I forgot you need i386. This is an unofficial testing build for i386, which includes the proposed patch to catch the SIGPIPE signal and ignore it. Please note that this build is unsupported and should be used for testing purposes only.

--- Additional comment from olchansk on 2012-07-20 16:12:06 EDT ---

Test is successful. Unplug network cable, observe loss of network connectivity, wait 24 hours, observe ypbind still running, reconnect network cable, observe network connection is up, ypwhich is happy, ypcat is happy, autofs and nfs are happy, users can login. (takes a few minutes for the NFS TCP connections to come back).

Assuming you can rank the changes as "low risk", any chance this fix can be pushed into the 5.x updates soon?

K.O.

--- Additional comment from hhorak on 2012-07-23 02:22:51 EDT ---

(In reply to comment #19)
> Test is successful. Unplug network cable, observe loss of network
> connectivity, wait 24 hours, observe ypbind still running, reconnect network
> cable, observe network connection is up, ypwhich is happy, ypcat is happy,
> autofs and nfs are happy, users can login. (takes a few minutes for the NFS
> TCP connections to come back).

Thanks for testing.

> Assuming you can rank the changes as "low risk", any chance this fix can be
> pushed into the 5.x updates soon?

I think it could be feasible through the FasTrack process, since this fix is quite simple and testable, so I'm proposing it as fast.

Comment 2 Honza Horak 2012-07-23 08:33:53 UTC
In RHEL-6, SIGPIPE is already masked, but it is missing from the sigwait call.

Unlike in RHEL-5, where ypbind crashes when receiving SIGPIPE, in RHEL-6 only the "Ignoring SIGPIPE" message is missing; ypbind doesn't crash, because the signal is masked properly. So this fix is only cosmetic, and its absence wouldn't actually be a regression.

Comment 3 Honza Horak 2012-07-23 08:34:41 UTC
Created attachment 599711 [details]
add SIGPIPE to proper signal set

Comment 16 Honza Horak 2015-02-19 14:55:29 UTC
There is no reproducer, but it has been fixed upstream for a long time, so let's take this as a sanity fix.

Comment 19 Konstantin Olchanski 2015-02-19 22:25:44 UTC
I am the original reporter of this problem. Thank you for pushing the patch through.

FWIW, with el6 I see evidence that ypbind sometimes quits when the network connection is down for long periods of time. This is with releases 6.5 and 6.6. I have been unable to reproduce this under controlled conditions yet. Is it the same problem in el7? I cannot tell; I only have one machine in testing.

K.O.

Comment 20 Honza Horak 2015-02-20 10:01:26 UTC
(In reply to Konstantin Olchanski from comment #19)
> FWIW, with el6 I see evidence that ypbind sometimes quits when network
> connection is down for long periods of time. This is with releases 6.5, 6.6.
> Was unable to reproduce this under controlled conditions yet. In el7 same
> problem? Cannot tell, only have one machine in testing.

This is the first time I'm hearing of that, so I'd be glad for any further info, in case you're able to find something.

Comment 22 errata-xmlrpc 2015-07-22 06:44:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1332.html