Both POSIX and glibc have a rather restricted set of functions which can be called from a signal handler. Functions like ctime(), printf() etc. are not permitted and tend to cause additional crashes if something goes wrong, obscuring the actual crash information and reducing the usefulness of coredumps.
It's better to let abrt and similar system-wide mechanism handle such matters, I think.
pmlogger has the same problem if PCP_DEBUG is defined.
Most (if not all) of pmcd signal handling is defered - when a signal is recieved it usually sets a flag, resets the handler routine, and continues on. Then, when pmcd breaks out of select() in its main loop, it tests these flags and acts on the signal recieved earlier.
So, I'm stupidly not seeing the issue here - which signals in particular did you observe the problem with? (all of them, and I've missed something?)
Nathan, for example, src/pmcd/src/pmcd.c SigHupProc can call SignalReloadPMNS from within the signal handler, which does a metric boatload of work, though that appears only #ifndef IS_MINGW. Maybe that's not important.
OTOH, SigBad appears to do too much. How about just letting those signals kill pmcd, and let init/whatnot restart?
re SigHupProc - that's what I mean - only on Windows (MinGW) does it do lotsa work, and on Windows life is totally different (not POSIX, and not glibc).
void SigHupProc(int s)
restart = 1;
SigBad does indeed look ... bad. IIRC, the original reasons for that were to get a usable stack trace into pmcd's logfile. FWIW, I've never observed this code fail in the >10 years I've been using pcp, and I suspect there will be much reluctance with others to not have the stack-trace-in-pmcd.log option and "dumping to core" message in pmcd.log anymore. Its really useful diagnostic information. There must be ways to make this work safely - have observed daemon tomcat (java) processes doing this kind of thing in the past - writing all manner of stack, register, etc state into catalina.out.
Java's a very different environment. With C programs on at least modern Linux distros, we have decent crash-catchers (ABRT etc.) always running, generating backtraces on demand. So pmcd doesn't have to do it itself.
Sorry, I meant java, the program (i.e. the C++ program) not Java the language - that's where I've observed some custom handlers, that are writing to custom log files (redirected to catalina.out in the Apache Tomcat case - which is probably just done via stderr redirecting come to think of it).
But, if it can be done better we should do so, for sure. Some research, code tweaking, testing and educating of others needed.
(In reply to comment #6)
> Sorry, I meant java, the program (i.e. the C++ program) not Java the
> language - that's where I've observed some custom handlers, that are writing
> to custom log files (redirected to catalina.out in the Apache Tomcat case -
> which is probably just done via stderr redirecting come to think of it).
> But, if it can be done better we should do so, for sure. Some research,
> code tweaking, testing and educating of others needed.
POSIX defines a set of async-signal-safe functions which can be used from signal handlers. In general, system calls from section 2 of the manual are safe on Linux, but many functions from section 3 are not.
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.
(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)
More information and reason for this action is here:
Planning to revisit this old chestnut this week and tackle in pcp-3.9.5 timeframe.
Will be reviewing both pmcd and pmlogger, as per Florian's #c1, as well as pmproxy and pmwebd.
This is fixed by upstream commit fc44068f8c50514522a00abb3dce8a0290e7d447.
Those changes look real good. A few nits:
- exit() shouldn't be called from signal handlers either (AS-Unsafe) from
pmlogger's functions; use _exit() instead.
- even in DBG_TRACE_DESPERATE mode, calling fprintf may just unnecessarily
invite deadlocks; consider using plain write(2) to get some diagnostics
Re exit() - ayup - good catch, fixes shall follow shortly (an abort() likewise is still wrong);
Re desperate mode, considered it yesterday but there's no point - the traceback routines allocate memory, so ultimately its unavoidable. In the end I chose to leave as-is since that's worked for kenj & others in the past - there's timestamped log messages in the mix too btw, it just becomes impractical to unravel. "desperate" debugging mode is aptly named.
dev branch commit 3988c4b follows up on #c12 & #c13.
pcp-3.9.5-1.el5 has been submitted as an update for Fedora EPEL 5.
pcp-3.9.5-1.el6 has been submitted as an update for Fedora EPEL 6.
pcp-3.9.5-1.fc19 has been submitted as an update for Fedora 19.
pcp-3.9.5-1.fc20 has been submitted as an update for Fedora 20.
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.9.5-1.el5'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).
pcp-3.9.5-1.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.9.5-1.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.9.5-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
pcp-3.9.5-1.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.