Bug 717069
Description
Konstantin Olchanski
2011-06-27 20:52:23 UTC
(In reply to comment #0)
> I have seen this problem before, but I do not see the bugzilla bug report for
> it, so here it goes again.

If you find the corresponding bug, it would be superb.

> From this I conclude that ypbind crashed - normal shutdown would have cleared
> the lock and pid files.

I'm not able to reproduce this bug so far. If you can provide a backtrace, it could help a lot.

Konstantin, have you encountered this issue only once, or can you reproduce it? If it's reproducible, I'd need some more info, a backtrace for example.

Yes, I have seen this more than once. It is too bad that you cannot reproduce it; in the lucky case you would just unplug the network cable, wait, and observe ypbind crash or disappear. But perhaps it takes more to see it - maybe expiration of the DHCP lease plays a role, or ypbind crashes when the network comes back, etc. Perhaps I can find time to reproduce it on RHEL/SL 6 - I have seen RHEL6 do strange stuff after network outages and maybe will look more into this. K.O.

Is there anything suspicious in the syslog (e.g. DHCP-related) near the time you expect ypbind crashed?

OK, I have reproduced the problem with 32-bit SL5.8:

May 11
15:41 - unplug network cable
15:42 - start of regular messages "ypbind: broadcast: RPC: timed out."
18:50 - last "ypbind" message

Login today: ypbind is not running. There are no other "ypbind"-related messages in the system log or dmesg, no core dumps, nothing. I will now try to reproduce it with SL6.2 and will file a separate bug if it shows up there, too. K.O.

Created attachment 584587 [details]
patch to not catch sigsegv signal

(In reply to comment #5)
> no other "ypbind"-related messages in the system log or dmesg, no core dumps,
> nothing.

Core dumps are not generated because the SIGSEGV signal is caught by ypbind. But there is a way to get the core dump. The attached patch removes catching of the SIGSEGV signal. Could you run ypbind with this patch? However, that's not enough. The core dump is also not generated when the service is run using the "service ypbind start" command. It is correctly generated only when ypbind is run from the command line (ideally with the -d option to see more debug messages). So here is what you can try:

* set up ulimit -c
* build ypbind with the patch attached
* run ypbind from the command line with the -d option and redirect all output to a file: # ypbind -d >ypbind.log 2>&1
* do all the things needed to reproduce the failure

If you're lucky, you'll get the core dump, which can help with resolving the issue. To test whether the core dump is being generated, you can use "kill -SIGSEGV [pid]".

Any progress in getting the core dump or backtrace?

I am sorry, I do not presently have time available for creating a core-dump-enabled ypbind executable. If you can provide an executable, I can run it and post the core dump stack trace here. K.O.

Created attachment 597579 [details]
testing build ypbind-1.19-13.nosigseg.el5.x86_64
This is an unofficial testing build for x86_64, which includes a patch to not catch the SIGSEGV signal. Please mind that this build is unsupported and should be used only for testing purposes.
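The core-dump capture procedure described above can be sketched as a shell session. This is illustrative, not part of the bug report: the path to the patched binary and the PID handling are assumptions.

```shell
# Allow core dumps of unlimited size in this shell.
ulimit -c unlimited

# Run the patched ypbind from the command line with debug output,
# redirecting everything to a log file (binary path is an assumption).
./ypbind -d > ypbind.log 2>&1 &
YPBIND_PID=$!

# Verify that a core dump is actually produced by forcing a SIGSEGV,
# as suggested in the comment above.
kill -SEGV "$YPBIND_PID"
```

The resulting core file lands in the working directory (or wherever the kernel's core pattern points) and can be loaded into gdb together with the debuginfo package.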
Created attachment 597582 [details]
testing build ypbind-debuginfo-1.19-13.nosigseg.el5.x86_64
This is an unofficial testing build that contains debug info for the rpm above. Please mind that this build is unsupported and should be used only for testing purposes.
Thanks, I will try it. There is a small blooper, though - my SL5 test machine is 32-bit. But I can find another test machine to test your 64-bit RPMs. Also, things are busy here; it will take me a few days to get to this. K.O.

Created attachment 597715 [details]
testing build ypbind-1.19-13.nosigseg.el5.i386
Created attachment 597716 [details]
testing build ypbind-debuginfo-1.19-13.nosigseg.el5.i386
Got it. I have just crashed ypbind inside gdb. The crash is from SIGPIPE. Here are the stack traces. Anything else I should capture and post here?

...
5226: trylock = success
5226: do_broadcast() for domain 'isac' is called
5226: broadcast: RPC: Timed out.
5226: leave do_broadcast() for domain 'isac'
5230: Pinging all active server.
5230: do_broadcast() for domain 'isac' is called
5226: Status: YPBIND_FAIL_VAL

Program received signal SIGPIPE, Broken pipe.
0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00b0fb8b in write () from /lib/libc.so.6
#2  0x00b475f4 in writetcp () from /lib/libc.so.6
#3  0x00b49d20 in xdrrec_endofrecord_internal () from /lib/libc.so.6
#4  0x00b47510 in svctcp_reply () from /lib/libc.so.6
#5  0x00b45f5c in svc_sendreply_internal () from /lib/libc.so.6
#6  0x0804ba98 in ypbindprog_2 (rqstp=0xbfffe58c, transp=0xb6800590) at ypbind_svc.c:140
#7  0x00b46832 in svc_getreq_common_internal () from /lib/libc.so.6
#8  0x00b463a5 in svc_getreq_poll_internal () from /lib/libc.so.6
#9  0x00b46e0a in svc_run () from /lib/libc.so.6
#10 0x0804b515 in main (argc=3, argv=0xbfffe7e4) at ypbind-mt.c:808
(gdb) info thr
  3 Thread 0xb73e0b90 (LWP 5230)  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
  2 Thread 0xb7de1b90 (LWP 5229)  0x00776402 in __kernel_vsyscall ()
* 1 Thread 0xb7fe26c0 (LWP 5226)  0x00776402 in __kernel_vsyscall ()
(gdb) thr 1
[Switching to thread 1 (Thread 0xb7fe26c0 (LWP 5226))]#0  0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00b0fb8b in write () from /lib/libc.so.6
#2  0x00b475f4 in writetcp () from /lib/libc.so.6
#3  0x00b49d20 in xdrrec_endofrecord_internal () from /lib/libc.so.6
#4  0x00b47510 in svctcp_reply () from /lib/libc.so.6
#5  0x00b45f5c in svc_sendreply_internal () from /lib/libc.so.6
#6  0x0804ba98 in ypbindprog_2 (rqstp=0xbfffe58c, transp=0xb6800590) at ypbind_svc.c:140
#7  0x00b46832 in svc_getreq_common_internal () from /lib/libc.so.6
#8  0x00b463a5 in svc_getreq_poll_internal () from /lib/libc.so.6
#9  0x00b46e0a in svc_run () from /lib/libc.so.6
#10 0x0804b515 in main (argc=3, argv=0xbfffe7e4) at ypbind-mt.c:808
(gdb) thr 2
[Switching to thread 2 (Thread 0xb7de1b90 (LWP 5229))]#0  0x00776402 in __kernel_vsyscall ()
(gdb) where
#0  0x00776402 in __kernel_vsyscall ()
#1  0x00be8cde in do_sigwait () from /lib/libpthread.so.0
#2  0x00be8d7f in sigwait () from /lib/libpthread.so.0
#3  0x0804a56e in sig_handler (v_param=0x0) at ypbind-mt.c:414
#4  0x00be0852 in start_thread () from /lib/libpthread.so.0
#5  0x00b1f04e in clone () from /lib/libc.so.6
(gdb) thr 3
[Switching to thread 3 (Thread 0xb73e0b90 (LWP 5230))]#0  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
(gdb) where
#0  0x00b45c02 in xdr_callmsg_internal () from /lib/libc.so.6
#1  0x00b44ee3 in clnt_broadcast () from /lib/libc.so.6
#2  0x0804d00d in do_broadcast (list=0x80528f8) at serv_list.c:669
#3  0x0804e3cb in test_bindings_once (lastcheck=660, req_domain=0x0) at serv_list.c:1226
#4  0x0804e528 in test_bindings (param=0x0) at serv_list.c:1073
#5  0x00be0852 in start_thread () from /lib/libpthread.so.0
#6  0x00b1f04e in clone () from /lib/libc.so.6

K.O.

Created attachment 598383 [details]
proposed patch backported from recent version
Thanks for the backtrace. I believe the error is caused by missing SIGPIPE handling. When a connection with a client gets broken and svc_sendreply fails with EPIPE, SIGPIPE is delivered to the daemon. We need to catch that signal, since the daemon will fail otherwise.
This signal is already caught in the recent version, so we should backport this behavior.
Created attachment 598444 [details]
patched version for testing purposes only
This is an unofficial testing build for x86_64, which includes the proposed patch to catch the SIGPIPE signal and ignore it. Please mind that this build is unsupported and should be used only for testing purposes.
Konstantin, can you please verify whether it fixes the failure?
Yes, I can test the 64-bit test package, but it would be better if I could test a 32-bit package to confirm the fix on the same computer. K.O.

Created attachment 598486 [details]
patched version for testing purposes only
Ah, I forgot you need i386. This is an unofficial testing build for i386, which includes the proposed patch to catch the SIGPIPE signal and ignore it. Please mind that this build is unsupported and should be used only for testing purposes.
Test is successful. Unplug the network cable, observe loss of network connectivity, wait 24 hours, observe ypbind still running, reconnect the network cable, observe that the network connection is up: ypwhich is happy, ypcat is happy, autofs and nfs are happy, users can log in. (It takes a few minutes for the NFS TCP connections to come back.) Assuming you can rank the changes as "low risk", is there any chance this fix can be pushed into the 5.x updates soon? K.O.

(In reply to comment #19)
> Test is successful. Unplug network cable, observe loss of network
> connectivity, wait 24 hours, observe ypbind still running, reconnect network
> cable, observe network connection is up, ypwhich is happy, ypcat is happy,
> autofs and nfs are happy, users can login. (takes a few minutes for the NFS
> TCP connections to come back).

Thanks for testing.

> Assuming you can rank the changes as "low risk", any chance this fix can be
> pushed into the 5.x updates soon?

I think it could be feasible through the fasttrack process, since this fix is quite simple and testable, so proposing as fast.

There is no chance to fix this in RHEL-5 any more, since RHEL 5.10 is going to include only serious fixes. Thus, closing as WONTFIX.