Bug 606761
| Summary: | Interruption of epoll_wait by SIGCHLD immediately leads to segv | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Pete MacKinnon <pmackinn> |
| Component: | qpid-cpp | Assignee: | Andrew Stitcher <astitcher> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | beta | CC: | gsim, jross, matt |
| Target Milestone: | 1.3 | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| URL: | https://issues.apache.org/jira/browse/QPID-2388 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-11 18:55:36 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 578137, 596210 | | |
Description
Pete MacKinnon
2010-06-22 12:37:51 UTC
What the qpid library does around the epoll_wait() may be to blame here.

Currently it unconditionally unblocks ALL signals (on this thread) before entering epoll_wait. Then it enters epoll_wait; at this point ANY signal (even one that the application has blocked for the entire process) may be serviced. Then it unconditionally blocks all signals.

The intention here is that signals may not be processed at all on the I/O thread except when it is waiting for something to happen. So if the application has blocked SIGCHLD (or any other signal) and never expects to have to process it, this logic defeats that.

However this is not so straightforward to fix...

I think the simplest solution is to block all signal delivery completely on the client I/O threads. The library internally uses no signals at all and sets no handlers. However, doing this may have some impact on the broker, as that application does use signals (at least a little). So some consideration will have to be given to the broker to make sure that there are still threads available in the broker to handle signals.

Looks like the broker signal handlers are such:

    void SignalHandler::setBroker(const boost::intrusive_ptr<Broker>& b) {
        broker = b;
        signal(SIGINT, shutdownHandler);
        signal(SIGTERM, shutdownHandler);
        signal(SIGHUP, SIG_IGN); // TODO aconway 2007-07-18: reload config.
        signal(SIGCHLD, SIG_IGN);
    }

More data...

Basically (I think Matt already postulated this), the condor parent thread went through the motions of shutting down the daemon module, including the deletion of the daemonCore object. However, the process has registered signal handlers for SIGCHLD et al. which propagate the signal using the DC object ptr. When the sub-process (procd) terminated, this handler was blocked from receiving SIGCHLD, until the qpid epoll code unblocked signals on its thread. So, condor daemon core gets the "late" SIGCHLD and boom!

    qpid::sys::Poller::wait (this=0xb6a00ce8, timeout=...)
        at qpid/sys/epoll/EpollPoller.cpp:571
    571 pthread_sigmask(SIG_SETMASK, &os, 0);
    (gdb) n
    570 int rc = ::epoll_wait(impl->epollFd, &epe, 1, timeoutMs);
    (gdb) n
    571 pthread_sigmask(SIG_SETMASK, &os, 0);
    (gdb) n
    576 if (rc ==-1 && errno != EINTR) {
    (gdb) n
    667 if (timeoutMs == -1) {
    (gdb) n
    564 PollerHandleDeletionManager.markAllUnusedInThisThread();
    (gdb) n
    569 pthread_sigmask(SIG_SETMASK, &impl->sigMask, &os);
    (gdb) n
    564 PollerHandleDeletionManager.markAllUnusedInThisThread();
    (gdb) n
    569 pthread_sigmask(SIG_SETMASK, &impl->sigMask, &os);
    (gdb) n
    570 int rc = ::epoll_wait(impl->epollFd, &epe, 1, timeoutMs);
    (gdb) n
    Program received signal SIGSEGV, Segmentation fault.
    0x08175f4d in unix_sigchld () at daemon_core_main.cpp:1288
    1288 daemonCore->Send_Signal( daemonCore->getpid(), SIGCHLD );
    (gdb) p daemonCore
    $1 = (DaemonCore *) 0x0

So we have a good explanation for the bug. I think, however, that condor needs fixing as well as qpid.

Just blocking a signal whose handler has effectively been deleted, so that the signal can never safely be delivered, is a bad idea and at the very least a bug waiting to happen in some other circumstance. So instead of just blocking the SIGCHLD signal when the daemonCore is deleted, the code should unregister the handler and set it back to the default handling (in this case ignore).

I think that fixing either qpid or condor would be sufficient, but doing both is better in the long term.

mrg 1.3 qpid code updated with the upstream fix from r957109

Please reopen if this is apparently still an issue.
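
For reference, a minimal sketch of the "block all signal delivery on the I/O threads" idea from the qpid side. This is not the change from r957109, just one common way to do it, assuming the I/O thread is started with std::thread. The names sigchldBlockedHere and spawnIoThreadWithSignalsBlocked are made up for illustration. The creating thread blocks everything, starts the thread (which inherits the mask), then restores its own mask, so there is no window in which the new thread can take a signal.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <thread>

// Hypothetical helper: true if SIGCHLD is blocked on the calling thread.
bool sigchldBlockedHere() {
    sigset_t cur;
    // With a null "set" argument the mask is only queried, not changed.
    int rc = pthread_sigmask(SIG_BLOCK, nullptr, &cur);
    assert(rc == 0);
    (void)rc;
    return sigismember(&cur, SIGCHLD) == 1;
}

// Hypothetical sketch: start an I/O thread that never receives signals.
// Returns whether SIGCHLD was seen as blocked inside the new thread.
bool spawnIoThreadWithSignalsBlocked() {
    sigset_t all, saved;
    sigfillset(&all);  // SIGKILL and SIGSTOP cannot be blocked and are skipped

    // Block everything on the creator; a new thread inherits this mask.
    int rc = pthread_sigmask(SIG_SETMASK, &all, &saved);
    assert(rc == 0);

    bool blockedInWorker = false;
    std::thread io([&blockedInWorker] {
        // A real I/O thread would loop on epoll_wait() here and leave the
        // signal mask alone, so nothing is unblocked around the wait.
        blockedInWorker = sigchldBlockedHere();
    });

    // Restore the creator's original mask so it can still handle signals.
    rc = pthread_sigmask(SIG_SETMASK, &saved, nullptr);
    assert(rc == 0);
    (void)rc;

    io.join();
    return blockedInWorker;
}
```

Calling spawnIoThreadWithSignalsBlocked() from main() should report true. For the broker, the point about keeping some thread available for signals still applies: at least one thread must leave SIGINT and SIGTERM unblocked.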
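
To illustrate the condor side, here is a small sketch (not condor code) of why only blocking the signal is fragile. A SIGCHLD generated while blocked stays pending and reaches the old handler as soon as any code lifts the mask; restoring the default disposition first discards it. The names onSigchld, installHandler, setSigchldBlocked and lateDeliveryCount are hypothetical, raise() stands in for the child process exiting, and the counter stands in for a handler that would dereference a deleted object.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Counts handler invocations; stands in for "handler touched daemonCore".
static volatile sig_atomic_t g_handlerRuns = 0;

static void onSigchld(int) { g_handlerRuns = g_handlerRuns + 1; }

// Install a disposition for SIGCHLD (a handler function or SIG_DFL).
static void installHandler(void (*fn)(int)) {
    struct sigaction sa = {};
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGCHLD, &sa, nullptr);
    assert(rc == 0);
    (void)rc;
}

// Block or unblock SIGCHLD on the calling thread.
static void setSigchldBlocked(bool blocked) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    int rc = pthread_sigmask(blocked ? SIG_BLOCK : SIG_UNBLOCK, &set, nullptr);
    assert(rc == 0);
    (void)rc;
}

// Returns how many times the handler ran after the mask was lifted.
int lateDeliveryCount(bool resetHandlerOnTeardown) {
    g_handlerRuns = 0;
    installHandler(onSigchld);
    setSigchldBlocked(true);   // teardown begins: signal is only blocked
    raise(SIGCHLD);            // "child exits": signal is now pending

    if (resetHandlerOnTeardown) {
        // Default action for SIGCHLD is ignore, so this also discards
        // the pending instance.
        installHandler(SIG_DFL);
    }

    setSigchldBlocked(false);  // some other code unblocks the signal
    installHandler(SIG_DFL);   // leave the process in a clean state
    return g_handlerRuns;
}
```

Under these assumptions lateDeliveryCount(false) sees the handler run once after the unblock, which is the "late" SIGCHLD, while lateDeliveryCount(true) sees it run zero times.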