847343 – pcp: pmcd signal handlers are unsafe

Bug 847343 - pcp: pmcd signal handlers are unsafe

Summary: pcp: pmcd signal handlers are unsafe

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	pcp
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Nathan Scott
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	840765
TreeView+	depends on / blocked

Reported:	2012-08-10 15:45 UTC by Florian Weimer
Modified:	2014-07-05 09:34 UTC (History)
CC List:	4 users (show)
Fixed In Version:	pcp-3.9.5-1.el6
Clone Of:
Environment:
Last Closed:	2014-07-01 23:27:16 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Florian Weimer 2012-08-10 15:45:20 UTC

Both POSIX and glibc have a rather restricted set of functions which can be called from a signal handler.  Functions like ctime(), printf() etc. are not permitted and tend to cause additional crashes if something goes wrong, obscuring the actual crash information and reducing the usefulness of coredumps.

It's better to let abrt and similar system-wide mechanism handle such matters, I think.

Comment 1 Florian Weimer 2012-08-10 17:05:31 UTC

pmlogger has the same problem if PCP_DEBUG is defined.

Comment 2 Nathan Scott 2012-08-13 00:07:16 UTC

Most (if not all) of pmcd signal handling is defered - when a signal is recieved it usually sets a flag, resets the handler routine, and continues on.  Then, when pmcd breaks out of select() in its main loop, it tests these flags and acts on the signal recieved earlier.

So, I'm stupidly not seeing the issue here - which signals in particular did you observe the problem with?  (all of them, and I've missed something?)

thanks!

Comment 3 Frank Ch. Eigler 2012-08-13 00:17:18 UTC

Nathan, for example, src/pmcd/src/pmcd.c SigHupProc can call SignalReloadPMNS from within the signal handler, which does a metric boatload of work, though that appears only #ifndef IS_MINGW.  Maybe that's not important.

OTOH, SigBad appears to do too much.  How about just letting those signals kill pmcd, and let init/whatnot restart?

Comment 4 Nathan Scott 2012-08-13 00:30:24 UTC

re SigHupProc - that's what I mean - only on Windows (MinGW) does it do lotsa work, and on Windows life is totally different (not POSIX, and not glibc).

void SigHupProc(int s)
{
#ifndef IS_MINGW
    signal(SIGHUP, SigHupProc);
    restart = 1;
#else
    SignalRestart();
    SignalReloadPMNS();
#endif
}

SigBad does indeed look ... bad. IIRC, the original reasons for that were to get a usable stack trace into pmcd's logfile.  FWIW, I've never observed this code fail in the >10 years I've been using pcp, and I suspect there will be much reluctance with others to not have the stack-trace-in-pmcd.log option and "dumping to core" message in pmcd.log anymore.  Its really useful diagnostic information.  There must be ways to make this work safely - have observed daemon tomcat (java) processes doing this kind of thing in the past - writing all manner of stack, register, etc state into catalina.out.

thanks.

Comment 5 Frank Ch. Eigler 2012-08-13 00:38:47 UTC

Java's a very different environment.  With C programs on at least modern Linux distros, we have decent crash-catchers (ABRT etc.) always running, generating backtraces on demand.  So pmcd doesn't have to do it itself.

Comment 6 Nathan Scott 2012-08-13 00:58:16 UTC

Sorry, I meant java, the program (i.e. the C++ program) not Java the language - that's where I've observed some custom handlers, that are writing to custom log files (redirected to catalina.out in the Apache Tomcat case - which is probably just done via stderr redirecting come to think of it).

But, if it can be done better we should do so, for sure.  Some research, code tweaking, testing and educating of others needed.

Comment 8 Florian Weimer 2012-08-13 08:29:58 UTC

(In reply to comment #6)
> Sorry, I meant java, the program (i.e. the C++ program) not Java the
> language - that's where I've observed some custom handlers, that are writing
> to custom log files (redirected to catalina.out in the Apache Tomcat case -
> which is probably just done via stderr redirecting come to think of it).
> 
> But, if it can be done better we should do so, for sure.  Some research,
> code tweaking, testing and educating of others needed.

POSIX defines a set of async-signal-safe functions which can be used from signal handlers.  In general, system calls from section 2 of the manual are safe on Linux, but many functions from section 3 are not.

Comment 9 Fedora End Of Life 2013-04-03 16:28:52 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 10 Nathan Scott 2014-05-19 01:18:39 UTC

Planning to revisit this old chestnut this week and tackle in pcp-3.9.5 timeframe.
Will be reviewing both pmcd and pmlogger, as per Florian's #c1, as well as pmproxy and pmwebd.

Comment 11 Nathan Scott 2014-05-20 07:46:00 UTC

This is fixed by upstream commit fc44068f8c50514522a00abb3dce8a0290e7d447.

Comment 12 Frank Ch. Eigler 2014-05-20 10:54:36 UTC

Those changes look real good.  A few nits:

- exit() shouldn't be called from signal handlers either (AS-Unsafe) from
  pmlogger's functions; use _exit() instead.
  http://www.gnu.org/software/libc/manual/html_node/Function-Index.html

- even in DBG_TRACE_DESPERATE mode, calling fprintf may just unnecessarily
  invite deadlocks; consider using plain write(2) to get some diagnostics
  out reliably

Comment 13 Nathan Scott 2014-05-21 00:25:46 UTC

Re exit() - ayup - good catch, fixes shall follow shortly (an abort() likewise is still wrong);

Re desperate mode, considered it yesterday but there's no point - the traceback routines allocate memory, so ultimately its unavoidable.  In the end I chose to leave as-is since that's worked for kenj & others in the past - there's timestamped log messages in the mix too btw, it just becomes impractical to unravel.  "desperate" debugging mode is aptly named.

thanks.

Comment 14 Nathan Scott 2014-05-21 07:18:19 UTC

dev branch commit 3988c4b follows up on #c12 & #c13.

Comment 15 Fedora Update System 2014-06-18 20:16:34 UTC

pcp-3.9.5-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.9.5-1.el5

Comment 16 Fedora Update System 2014-06-18 20:19:03 UTC

pcp-3.9.5-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/pcp-3.9.5-1.el6

Comment 17 Fedora Update System 2014-06-18 20:20:57 UTC

pcp-3.9.5-1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/pcp-3.9.5-1.fc19

Comment 18 Fedora Update System 2014-06-18 20:22:37 UTC

pcp-3.9.5-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/pcp-3.9.5-1.fc20

Comment 19 Fedora Update System 2014-06-19 16:37:20 UTC

Package pcp-3.9.5-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.9.5-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-1677/pcp-3.9.5-1.el5
then log in and leave karma (feedback).

Comment 20 Fedora Update System 2014-07-01 23:27:16 UTC

pcp-3.9.5-1.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 21 Fedora Update System 2014-07-01 23:30:17 UTC

pcp-3.9.5-1.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 22 Fedora Update System 2014-07-05 09:33:01 UTC

pcp-3.9.5-1.el5 has been pushed to the Fedora EPEL 5 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 23 Fedora Update System 2014-07-05 09:34:08 UTC

pcp-3.9.5-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.