Bug 706291 - pacemaker daemons consume cpu on crash of lower level components
Summary: pacemaker daemons consume cpu on crash of lower level components
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: pacemaker
Version: 14
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Andrew Beekhof
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-19 22:52 UTC by dan clark
Modified: 2011-09-26 02:45 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-26 02:45:45 UTC
Type: ---
Embargoed:


Attachments
corosync log and corosync config file (10.77 KB, application/x-gzip)
2011-05-31 21:42 UTC, dan clark

Description dan clark 2011-05-19 22:52:28 UTC
Description of problem:
start pacemaker/openais/corosync (service openais restart)
killall -9 corosync
% top
12657 root      RT   0 81476 2112 1724 R 63.5  0.1   0:09.02 stonithd           
12660 hacluste  RT   0 83800 2032 1632 R 63.5  0.1   0:09.00 attrd              
12658 hacluste  RT   0 80956 4084 2204 R 63.1  0.1   0:08.96 cib                
12662 hacluste  RT   0 90080 2592 2124 R 62.8  0.1   0:08.95 crmd        

The remaining pacemaker daemons should handle the exit of the underlying support services gracefully, retrying with backoff until the corosync daemon restarts, rather than spinning at high CPU.  When corosync restarts, the daemons should re-attach to its services.
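
For illustration, a minimal sketch of the retry/backoff behaviour requested above.  This is not pacemaker code: cluster_connect() is a hypothetical stand-in for whatever call re-attaches a daemon to corosync, and the timing constants are arbitrary.

#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the call that (re)attaches to corosync. */
extern bool cluster_connect(void);

#define MAX_DELAY_SECS 30   /* cap on the backoff interval */
#define GIVE_UP_SECS   300  /* total time before giving up */

static bool reconnect_with_backoff(void)
{
    unsigned int delay = 1;   /* seconds */
    unsigned int waited = 0;

    while (waited < GIVE_UP_SECS) {
        if (cluster_connect()) {
            return true;      /* corosync is back: re-attach services */
        }
        sleep(delay);         /* sleep instead of spinning the CPU */
        waited += delay;
        delay *= 2;           /* exponential backoff... */
        if (delay > MAX_DELAY_SECS) {
            delay = MAX_DELAY_SECS;   /* ...capped */
        }
    }
    return false;             /* give up and shut down gracefully */
}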

When 'openais' is used to stop all pacemaker services (service openais stop), all of the services (stonithd, attrd, cib, crmd) may need to be gracefully shut down as well under this configuration.

Version-Release number of selected component (if applicable):


How reproducible:
repeatable

Steps to Reproduce:
1. service openais start
2. killall -9 corosync
  
Actual results:
CPU pegged at 100%

Expected results:
Services retry with backoff for some time, waiting for corosync to restart, and then gracefully shut down.

Additional info:

Comment 1 dan clark 2011-05-23 18:33:18 UTC
See related bug 706289, which provokes this problem using an available test suite program.  Is there a way to clean up the orphaned components under this setup (service xxx restart)?  See bug 706297 for details.

Comment 2 Andrew Beekhof 2011-05-26 10:06:53 UTC
Actually, all components terminate if their connection to corosync is lost.
It's not clear to me what the processes are doing during this time.

Can you attach some logs please?

Comment 3 dan clark 2011-05-31 21:42:09 UTC
Created attachment 502125 [details]
corosync log and corosync config file

Running corosync with the "-p" flag allowed the daemon to be killed without locking up the system, but the pcmk daemons still peg the CPU at 100% (across stonithd, cib, lrmd, attrd, pengine and crmd).

The attached configuration file is for a simple two-node setup, although one node is sufficient and the problem occurs on three nodes as well.

This test was done with a minimal pacemaker configuration on two nodes.
# crm configure show
node tarnfd2.local
node tarnfd3.local
property $id="cib-bootstrap-options" \
	dc-version="1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
	resource-stickiness="100"

gdb on the daemons indicates a loop

Comment 4 dan clark 2011-06-02 20:59:50 UTC
Finishing the truncated message from above:
The backtrace indicates all pacemaker daemons resting on a poll in the main loop.  The strace suggests there may be an unhandled error condition when polling an fd that is in an error state (is it the broken client file descriptor to the corosync daemon?), causing the loop which repeats below; see also the sketch after the trace.  (The trace showed about 600 poll system calls per second, and about 1600 system calls per second in total.)
--- sample of strace, attached initially to a healthy system ---
# corosync was started with -p to avoid running all daemons at high priority
Process 11835 attached - interrupt to quit
13:32:28.460837 restart_syscall(<... resuming interrupted call ...>) = 1 <325.498456>
13:37:53.960530 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000116>
### note - wait 5 minutes with no activity, followed by 'killall -9 corosync'
13:37:53.960936 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000033>
13:37:53.961052 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000047>
13:37:53.961332 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000838>
13:37:53.962481 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000036>
13:37:53.962591 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000033>
13:37:53.962691 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000060>
13:37:53.962828 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000031>
13:37:53.963026 poll([{fd=3, events=POLLIN}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLHUP}]) <0.000098>
13:37:53.963218 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000043>
13:37:53.963338 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000019>
13:37:53.963507 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
13:37:53.963614 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000019>
13:37:53.963733 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
13:37:53.963836 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000019>
13:37:53.963940 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
13:37:53.964057 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000033>
13:37:53.964174 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000023>
13:37:53.964290 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000019>
13:37:53.964395 recvfrom(8, 0x1c5dca5, 3531, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
13:37:53.964567 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000020>
13:37:53.964685 recvfrom(8, 0x1c5dca5, 3531, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
13:37:53.964792 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.964898 recvfrom(8, 0x1c5dca5, 3531, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
13:37:53.964998 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.965098 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000028>
13:37:53.965221 poll([{fd=7, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 4, -1) = 1 ([{fd=3, revents=POLLIN|POLLHUP}]) <0.000021>
13:37:53.965507 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000020>
13:37:53.965734 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000018>
13:37:53.965837 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
13:37:53.965953 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.966055 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
13:37:53.966171 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.966271 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000018>
13:37:53.966389 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000019>
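
For reference, a minimal illustration (not pacemaker source) of how this kind of spin arises when poll() reports POLLHUP on a dead connection but the handler never acts on it; the fd numbers and names here are arbitrary.

#include <poll.h>
#include <stdio.h>

/* Illustrative only: one pass of a dispatch loop.  If the
 * POLLHUP/POLLERR branch below is missing, a hung-up descriptor stays
 * permanently "readable", poll() returns immediately on every
 * iteration, and the main loop burns 100% CPU re-polling it, much
 * like the trace above. */
static int dispatch_once(struct pollfd *pfd)
{
    if (poll(pfd, 1, -1) < 0) {
        return -1;                    /* poll itself failed */
    }
    if (pfd->revents & (POLLHUP | POLLERR | POLLNVAL)) {
        fprintf(stderr, "connection lost on fd %d\n", pfd->fd);
        return -1;                    /* caller should close the fd and disconnect */
    }
    return 0;                         /* real POLLIN: dispatch a message */
}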

Comment 5 Andrew Beekhof 2011-06-17 06:14:05 UTC
I need a coherent set of data, not fragments from unrelated tests.
Please include corosync.log and the time at which corosync was killed.  Be sure to include plenty of logs from after the kill signal is sent.

Also, don't forget that I can't see what you see - if "gdb on the daemons indicates a loop", then it would be helpful to supply that data too.

Comment 6 Andrew Beekhof 2011-07-01 01:53:14 UTC
A colleague at SUSE has experienced the same issue.

Apparently there are two fixes required to corosync and one to Pacemaker.
We'll endeavour to get these applied asap.

Corosync:
   http://corosync.org/git/?p=corosync.git;a=commitdiff_plain;h=2e5dc5f322a5a852d31f5775debbf51e062695ba

and 

https://lists.linux-foundation.org/pipermail/openais/2011-June/016390.html

Pacemaker:

diff -r 75575ce39861 -r 0ac9baf90f8f lib/common/ais.c
--- a/lib/common/ais.c  Tue Jun 21 15:54:20 2011 +0200
+++ b/lib/common/ais.c  Thu Jun 30 16:47:30 2011 +0200
@@ -562,15 +562,15 @@ gboolean ais_dispatch(int sender, gpoint
    do {
       rc = coroipcc_dispatch_get (ais_ipc_handle, (void**)&buffer, 0);

-       if (rc == 0 || buffer == NULL) {
-           /* Zero is a legal "no message afterall" value */
-           return TRUE;
-
-       } else if (rc != CS_OK) {
+       if (rc != CS_OK) {
           crm_perror(LOG_ERR,"Receiving message body failed: (%d) %s", rc, ais_error2text(rc));
           goto bail;
       }
-
+       if (buffer == NULL) {
+           /* NULL is a legal "no message afterall" value */
+           return TRUE;
+       }
+
       good = ais_dispatch_message((AIS_Message*)buffer, dispatch);
       coroipcc_dispatch_put (ais_ipc_handle);
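
In effect: the old ordering returned TRUE ("no message after all, keep the connection") whenever buffer was NULL, even when rc carried a genuine error from coroipcc_dispatch_get().  After corosync died, that error was never acted on, so the mainloop kept re-polling the dead descriptor, which appears to be how the Pacemaker side of the spin arose.  With the checks reordered, any rc != CS_OK bails out first and the daemon disconnects instead of spinning.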

