Description of problem:
Start pacemaker/openais/corosync (service openais restart), then kill corosync:

killall -9 corosync

% top
12657 root     RT  0 81476 2112 1724 R 63.5 0.1 0:09.02 stonithd
12660 hacluste RT  0 83800 2032 1632 R 63.5 0.1 0:09.00 attrd
12658 hacluste RT  0 80956 4084 2204 R 63.1 0.1 0:08.96 cib
12662 hacluste RT  0 90080 2592 2124 R 62.8 0.1 0:08.95 crmd

The remaining pacemaker daemons should handle the exit of the underlying support service gracefully, retrying with a backoff until the corosync daemon restarts, rather than spinning at high CPU. When corosync restarts, the daemons should re-attach to its services. When 'openais' is stopped (service openais stop), all of the pacemaker services (stonithd, attrd, cib, crmd) may need to be gracefully shut down as well under this configuration.

Version-Release number of selected component (if applicable):

How reproducible:
Repeatable.

Steps to Reproduce:
1. service openais start
2. killall -9 corosync

Actual results:
CPU pegged at 100%.

Expected results:
The services retry with backoff for some time, waiting for corosync to restart, then shut down gracefully.

Additional info:
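For illustration only, the sketch below shows the kind of retry/backoff behaviour being asked for. It is an assumption about shape, not Pacemaker code; try_connect_to_corosync() is a hypothetical stand-in, not a real Pacemaker or corosync API.

/* Minimal sketch (illustration only) of the requested behaviour: retry the
 * corosync connection with an increasing delay instead of spinning, and give
 * up cleanly after a while.  try_connect_to_corosync() is a hypothetical
 * stand-in for whatever IPC (re)connect call the daemon actually uses. */
#include <stdbool.h>
#include <unistd.h>

static bool try_connect_to_corosync(void)
{
    return false;   /* placeholder: the real daemon would retry its IPC connect here */
}

static bool reconnect_with_backoff(void)
{
    unsigned int delay = 1;                 /* seconds */
    const unsigned int max_delay = 30;      /* cap on the backoff interval */
    const unsigned int give_up_after = 300; /* then shut down gracefully */
    unsigned int waited = 0;

    while (waited < give_up_after) {
        if (try_connect_to_corosync()) {
            return true;                    /* re-attach to corosync services */
        }
        sleep(delay);                       /* back off instead of busy-polling */
        waited += delay;
        delay *= 2;
        if (delay > max_delay) {
            delay = max_delay;
        }
    }
    return false;                           /* caller should exit cleanly */
}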
See related bug 706289, which provokes this problem using an available test suite program. Is there a way to clean up the orphaned components under this setup (service xxx restart)? See bug 706297 for details.
Actually, all components terminate if their connection to corosync is lost. It's not clear to me what the processes are doing during this time. Can you attach some logs please?
Created attachment 502125 [details]
corosync log and corosync config file

Running corosync with the "-p" flag allowed the daemon to be killed without locking up the system, but the pacemaker daemons still peg the CPU at 100% (across stonithd, cib, lrmd, attrd, pengine and crmd). The attached configuration file is for a simple two-node setup, although one node is sufficient and the problem occurs on three nodes as well. This test was done with a minimal pacemaker configuration on two nodes.

# crm configure show
node tarnfd2.local
node tarnfd3.local
property $id="cib-bootstrap-options" \
	dc-version="1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
	resource-stickiness="100"

gdb on the daemons indicates a loop (continued in the next comment).
Finishing the truncated message from above: the back trace shows all pacemaker daemons sitting in poll() in their main loop. The strace suggests there may be an unhandled error condition when polling an fd that is in an error state (is it the broken client file descriptor to the corosync daemon?), causing the loop that repeats below (the trace showed about 600 poll() system calls per second out of about 1600 system calls per second in total).

--- sample of strace, attached initially to a healthy system ---
# corosync started with -p to avoid running all daemons at high priority

Process 11835 attached - interrupt to quit
13:32:28.460837 restart_syscall(<... resuming interrupted call ...>) = 1 <325.498456>
13:37:53.960530 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000116>
### note - wait 5 minutes with no activity, followed by 'killall -9 corosync'
13:37:53.960936 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000033>
13:37:53.961052 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000047>
13:37:53.961332 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000838>
13:37:53.962481 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000036>
13:37:53.962591 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000033>
13:37:53.962691 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000060>
13:37:53.962828 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000031>
13:37:53.963026 poll([{fd=3, events=POLLIN}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLHUP}]) <0.000098>
13:37:53.963218 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000043>
13:37:53.963338 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000019>
13:37:53.963507 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
13:37:53.963614 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000019>
13:37:53.963733 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
13:37:53.963836 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000019>
13:37:53.963940 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
13:37:53.964057 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000033>
13:37:53.964174 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000023>
13:37:53.964290 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327323 <0.000019>
13:37:53.964395 recvfrom(8, 0x1c5dca5, 3531, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
13:37:53.964567 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000020>
13:37:53.964685 recvfrom(8, 0x1c5dca5, 3531, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
13:37:53.964792 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.964898 recvfrom(8, 0x1c5dca5, 3531, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
13:37:53.964998 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.965098 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000028>
13:37:53.965221 poll([{fd=7, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 4, -1) = 1 ([{fd=3, revents=POLLIN|POLLHUP}]) <0.000021>
13:37:53.965507 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000020>
13:37:53.965734 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000018>
13:37:53.965837 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
13:37:53.965953 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.966055 recvfrom(7, 0x1c5c39d, 3795, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
13:37:53.966171 poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) <0.000018>
13:37:53.966271 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000018>
13:37:53.966389 times({tms_utime=1, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 447327324 <0.000019>
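Aside, as an illustration only (not Pacemaker code): the POLLIN|POLLHUP revents above suggest the loop keeps servicing a hung-up descriptor. A main loop could detect that and stop polling the fd, roughly as in this sketch, where corosync_fd and connection_still_alive() are stand-in names of my own, not real symbols.

/* Illustrative sketch only: treat POLLHUP/POLLERR/POLLNVAL on the IPC fd as
 * "connection gone" and tear it down, rather than retrying recvfrom()/poll()
 * in a tight loop. */
#include <poll.h>
#include <stdbool.h>

static bool connection_still_alive(int corosync_fd)
{
    struct pollfd pfd = { .fd = corosync_fd, .events = POLLIN };

    if (poll(&pfd, 1, 0) < 0) {
        return false;                         /* poll itself failed */
    }
    if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL)) {
        return false;                         /* peer closed or fd invalid: remove the
                                                 fd from the loop and start reconnecting */
    }
    return true;
}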
I need a coherent set of data, not fragments from unrelated tests. Please include corosync.log and the time at which corosync was killed. Be sure to include plenty of logs from after the kill signal is sent. Also, don't forget that I can't see what you see - if "gdb on the daemons indicates a loop", then it would be helpful to supply that data too.
A colleague at SUSE has experienced the same issue. Apparently there are two fixes required to corosync and one to Pacemaker. We'll endeavour to get these applied asap.

Corosync:
http://corosync.org/git/?p=corosync.git;a=commitdiff_plain;h=2e5dc5f322a5a852d31f5775debbf51e062695ba
and
https://lists.linux-foundation.org/pipermail/openais/2011-June/016390.html

Pacemaker:
diff -r 75575ce39861 -r 0ac9baf90f8f lib/common/ais.c
--- a/lib/common/ais.c	Tue Jun 21 15:54:20 2011 +0200
+++ b/lib/common/ais.c	Thu Jun 30 16:47:30 2011 +0200
@@ -562,15 +562,15 @@ gboolean ais_dispatch(int sender, gpoint
     do {
         rc = coroipcc_dispatch_get (ais_ipc_handle, (void**)&buffer, 0);
-        if (rc == 0 || buffer == NULL) {
-            /* Zero is a legal "no message afterall" value */
-            return TRUE;
-
-        } else if (rc != CS_OK) {
+        if (rc != CS_OK) {
             crm_perror(LOG_ERR,"Receiving message body failed: (%d) %s", rc, ais_error2text(rc));
             goto bail;
         }
-
+        if (buffer == NULL) {
+            /* NULL is a legal "no message afterall" value */
+            return TRUE;
+        }
+
         good = ais_dispatch_message((AIS_Message*)buffer, dispatch);
         coroipcc_dispatch_put (ais_ipc_handle);
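As a reading aid (not part of the patch), the corrected ordering amounts to the pattern below: check the return code first so a failed connection reaches the error path, treat a NULL buffer under a successful return as "no message", and only then dispatch. The names here (ipc_get_message(), handle_message(), release_message(), IPC_OK) are hypothetical stand-ins, not the real coroipcc/AIS symbols.

/* Reading aid only - not the actual Pacemaker code.  All identifiers are
 * stand-ins used to show the ordering the patch enforces. */
#include <stdbool.h>
#include <stddef.h>

enum ipc_rc { IPC_OK = 0, IPC_ERR = 1 };

/* Trivial stand-in implementations so the sketch is self-contained. */
static enum ipc_rc ipc_get_message(void **buffer) { *buffer = NULL; return IPC_OK; }
static bool handle_message(void *buffer) { (void)buffer; return true; }
static void release_message(void *buffer) { (void)buffer; }

static bool dispatch_loop(void)
{
    void *buffer = NULL;
    bool good = true;

    do {
        enum ipc_rc rc = ipc_get_message(&buffer);

        if (rc != IPC_OK) {
            return false;        /* real errors (e.g. corosync gone) must reach
                                    the bail-out path, not be silently ignored */
        }
        if (buffer == NULL) {
            return true;         /* success with no message: nothing more to do */
        }
        good = handle_message(buffer);
        release_message(buffer);
    } while (good);

    return good;
}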