Cause: Previously, Pacemaker assumed that a fencing topology level used to unfence a node would still exist when the unfencing result arrived.
Consequence: If a target node had a fencing topology configured with a single level containing a single device, and that level was removed from the configuration after an unfencing action had been initiated but before the result came back, the Pacemaker fencer daemon on the Designated Controller (DC) node could go into an infinite loop, consuming all CPU and not responding to any further requests.
Fix: Pacemaker now checks whether the target's fencing topology still exists when the unfencing result comes back.
Result: The result is recorded properly, and the fencer daemon continues to behave normally.
The fencing device for controller-0 was removed from the configuration while controller-0 was being unfenced. The fencer initiated the unfencing while the device still existed, but the device was gone when the results came back. It then tried to check whether there were any more devices in the topology, and ended up in the infinite loop in stonith_choose_peer().
May 05 13:25:05 controller-1.redhat.local pacemaker-based [62591] (cib_perform_op) info: ++ /cib/configuration/fencing-topology: <fencing-level devices="stonith-fence_ipmilan-5254002c4be9,stonith-fence_compute-fence-nova" index="1" target="controller-0" id="fl-controller-0-1"/>
...
May 05 13:25:39 controller-1.redhat.local pacemaker-fenced [62592] (handle_request) notice: Client pacemaker-controld.62596.af2c977b wants to fence (on) 'controller-0' with device '(any)'
...
May 05 13:25:39 controller-1.redhat.local pacemaker-fenced [62592] (can_fence_host_with_device) notice: stonith-fence_ipmilan-5254002c4be9 is eligible to fence (on) controller-0: static-list
...
May 05 13:25:49 controller-1.redhat.local pacemaker-based [62591] (cib_perform_op) info: -- /cib/configuration/fencing-topology/fencing-level[@id='fl-controller-0-1']
...
May 05 13:25:50 controller-1.redhat.local pacemaker-fenced [62592] (call_remote_stonith) notice: Requesting that database-2 perform 'on' action targeting controller-0 using 'stonith-fence_ipmilan-5254002c4be9' | for client pacemaker-controld.62596 (72s)
...
May 05 13:25:50 controller-1.redhat.local pacemaker-fenced [62592] (process_remote_stonith_exec) notice: Action 'on' targeting controller-0 using stonith-fence_ipmilan-5254002c4be9 on behalf of pacemaker-controld.62596@controller-1: OK | rc=0
That was the final message from the fencer.
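For anyone checking another cluster for the same pattern, the corresponding messages can be pulled from the DC node's logs. This is only an illustration and assumes the messages reach the systemd journal (the default on RHEL 8); adjust the time window to the incident:
# Illustrative only: extract the unfencing request, the fencing-level removal,
# and the final device result from the DC node's journal.
journalctl -u pacemaker --since "1 hour ago" | grep -E "wants to fence \(on\)|fencing-level|Action 'on' targeting"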
The workaround until a fix is available is not to remove fence devices while they are in use.
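As a hedged pre-check (not from the original report), one can confirm that no fencing or unfencing action is still in flight before editing fence devices or topology levels, for example:
# Assumed available on RHEL 8 Pacemaker: list recent and pending fencing
# actions for all nodes before removing a device or level.
stonith_admin --history "*"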
I am able to reproduce this issue on RHEL 8.2. Interestingly, I am not able to reproduce it on RHEL 7, even with the same pacemaker code base, so it may be dependent on some external factor such as glib or corosync version. The issue only occurs if the target node has a single fencing topology level containing a single fencing device, and the level is removed from the configuration after an unfencing action has been requested and before it has completed.
QA: The easiest reproducer is as follows (a consolidated sketch of the commands appears after step 5):
1. Configure a cluster, and choose one node to be the target of the test (referred to as $TARGET below).
2. On every node, make fence_dummy available for use:
/usr/libexec/pacemaker/cts-support install
3. Create a fencing topology for the target consisting of a dummy fence device with a delay (the delay gives time to remove the level while an operation is in progress; the target should not have any other fencing configured):
pcs stonith create target-fencing fence_dummy mode=pass delay=4 mock_dynamic_hosts=$TARGET meta provides=unfencing
pcs stonith level add 1 $TARGET target-fencing
4. Changing the dummy device's parameters is the easiest way to trigger unfencing. It appears only the node running the device will be unfenced (which is likely a separate, unrelated bug), so ensure the device runs on the target:
pcs constraint location target-fencing prefers $TARGET=100
5. Trigger unfencing by changing the delay parameter (here, it's from 4 to 3, but any change will do), then remove the topology:
pcs stonith update target-fencing delay=3 && sleep 1 && pcs stonith level remove 1 $TARGET
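For convenience, here is the consolidated sketch of the steps above. It is illustrative only: $TARGET is a placeholder for the chosen node, the cts-support step still has to be run on every node, and the final ps check simply looks for the runaway fencer described in this bug.
# Hypothetical wrapper around the reproducer steps; assumes a running cluster.
TARGET=controller-0                                # placeholder: the test target node
/usr/libexec/pacemaker/cts-support install         # run this on every node
pcs stonith create target-fencing fence_dummy mode=pass delay=4 mock_dynamic_hosts=$TARGET meta provides=unfencing
pcs stonith level add 1 $TARGET target-fencing
pcs constraint location target-fencing prefers $TARGET=100
pcs stonith update target-fencing delay=3 && sleep 1 && pcs stonith level remove 1 $TARGET
# Without the fix, pacemaker-fenced on the DC should now spin near 100% CPU:
ps -C pacemaker-fenced -o pid,%cpu,cmd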
Michele, are you OK with 8.3 for the fix, or do you want an 8.2.z-stream as well?
Comment 11: Michele Baldessari, 2020-05-13 07:48:53 UTC
I think we should be fine with RHEL 8.3. Once we have the crm_node fix, puppet won't go around changing stonith resources and/or levels.
Thanks,
Michele
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2020:4804
Description of problem:
So we moved our CI jobs to point to RHEL 8.2, and when rebooting one node at a time in an Instance HA environment we got a bunch of odd failures around stonith. We are still not sure about the exact sequence of events that triggered this, but we see pacemaker-fenced stuck at 100% CPU.

So on our IHA env on RHEL 8.2 we see fenced just going crazy with CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
62592 root      20   0  168468  27012  11640 R  99.0  0.1  85:51.63 pacemaker-fence

It is fully stuck in userspace (i.e. no system calls and an empty strace) in a loop inside stonith_choose_peer(). Here are some gdb backtraces of this situation:

(gdb) bt
#0  g_hash_table_lookup_node (hash_return=<synthetic pointer>, key=0x563a19522ce0, hash_table=0x563a1944f4c0) at ghash.c:414
#1  g_hash_table_lookup (hash_table=0x563a1944f4c0, key=key@entry=0x563a19522ce0) at ghash.c:1153
#2  0x0000563a1813aaaf in find_peer_device (device=device@entry=0x563a19522ce0 "stonith-fence_compute-fence-nova", peer=<optimized out>, op=<optimized out>) at fenced_remote.c:180
#3  0x0000563a1813ac41 in grab_peer_device (verified_devices_only=1, device=0x563a19522ce0 "stonith-fence_compute-fence-nova", peer=0x563a194b5ba0, op=0x563a196738e0) at fenced_remote.c:1159
#4  find_best_peer (device=device@entry=0x563a19522ce0 "stonith-fence_compute-fence-nova", op=op@entry=0x563a196738e0, options=options@entry=(FIND_PEER_SKIP_TARGET | FIND_PEER_VERIFIED_ONLY)) at fenced_remote.c:1159
#5  0x0000563a1813cb62 in stonith_choose_peer (op=op@entry=0x563a196738e0) at fenced_remote.c:1193
#6  0x0000563a1813e0e8 in call_remote_stonith (op=op@entry=0x563a196738e0, peer=<optimized out>, peer@entry=0x0) at fenced_remote.c:1472
#7  0x0000563a1813ed15 in advance_op_topology (op=op@entry=0x563a196738e0, device=device@entry=0x563a19226160 "stonith-fence_ipmilan-5254002c4be9", msg=msg@entry=0x563a191c7e70, rc=0) at fenced_remote.c:1436
#8  0x0000563a18140266 in process_remote_stonith_exec (msg=msg@entry=0x563a191c7e70) at fenced_remote.c:2018
#9  0x0000563a18139398 in handle_reply (remote_peer=0x563a192b10f0 "database-2", request=0x563a191c7e70, client=0x0) at fenced_commands.c:2677
#10 stonith_command (client=0x0, id=<optimized out>, flags=0, request=0x563a191c7e70, remote_peer=0x563a192b10f0 "database-2") at fenced_commands.c:2719
#11 0x0000563a1812f39d in stonith_peer_callback (private_data=0x0, msg=0x563a191c7e70) at pacemaker-fenced.c:183
#12 stonith_peer_ais_callback (handle=<optimized out>, groupName=<optimized out>, nodeid=<optimized out>, pid=<optimized out>, msg=<optimized out>, msg_len=<optimized out>) at pacemaker-fenced.c:209
#13 0x00007fdd3c2a6654 in cpg_dispatch () from /lib64/libcpg.so.4
#14 0x00007fdd3da75d5e in pcmk_cpg_dispatch (user_data=0x7ffe3e065e70) at cpg.c:237
#15 0x00007fdd3dca7f27 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x563a18d61ae0) at mainloop.c:805
#16 0x00007fdd3a53067d in g_main_dispatch (context=0x563a18d601f0) at gmain.c:3176
#17 g_main_context_dispatch (context=context@entry=0x563a18d601f0) at gmain.c:3829
#18 0x00007fdd3a530a48 in g_main_context_iterate (context=0x563a18d601f0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3902
#19 0x00007fdd3a530d72 in g_main_loop_run (loop=0x563a18d454e0) at gmain.c:4098
#20 0x0000563a1812da9c in main (argc=<optimized out>, argv=<optimized out>) at pacemaker-fenced.c:1516

(gdb) bt
#0  0x00007fdd39f13024 in __strcasecmp_l_avx () from /lib64/libc.so.6
#1  0x00007fdd3dcb8b10 in crm_str_eq (a=<optimized out>, b=b@entry=0x563a19491150 "controller-0", use_case=use_case@entry=0) at strings.c:237
#2  0x0000563a1813c242 in topology_matches (node=0x563a19491150 "controller-0", tp=0x563a19462f60) at fenced_remote.c:763
#3  find_topology_for_host (host=0x563a19491150 "controller-0") at fenced_remote.c:782
#4  0x0000563a1813c72c in stonith_topology_next (op=op@entry=0x563a196738e0) at fenced_remote.c:808
#5  0x0000563a1813cb13 in stonith_choose_peer (op=op@entry=0x563a196738e0) at ../../include/crm/common/util.h:175
#6  0x0000563a1813e0e8 in call_remote_stonith (op=op@entry=0x563a196738e0, peer=<optimized out>, peer@entry=0x0) at fenced_remote.c:1472
#7  0x0000563a1813ed15 in advance_op_topology (op=op@entry=0x563a196738e0, device=device@entry=0x563a19226160 "stonith-fence_ipmilan-5254002c4be9", msg=msg@entry=0x563a191c7e70, rc=0) at fenced_remote.c:1436
#8  0x0000563a18140266 in process_remote_stonith_exec (msg=msg@entry=0x563a191c7e70) at fenced_remote.c:2018
#9  0x0000563a18139398 in handle_reply (remote_peer=0x563a192b10f0 "database-2", request=0x563a191c7e70, client=0x0) at fenced_commands.c:2677
#10 stonith_command (client=0x0, id=<optimized out>, flags=0, request=0x563a191c7e70, remote_peer=0x563a192b10f0 "database-2") at fenced_commands.c:2719
#11 0x0000563a1812f39d in stonith_peer_callback (private_data=0x0, msg=0x563a191c7e70) at pacemaker-fenced.c:183
#12 stonith_peer_ais_callback (handle=<optimized out>, groupName=<optimized out>, nodeid=<optimized out>, pid=<optimized out>, msg=<optimized out>, msg_len=<optimized out>) at pacemaker-fenced.c:209
#13 0x00007fdd3c2a6654 in cpg_dispatch () from /lib64/libcpg.so.4
#14 0x00007fdd3da75d5e in pcmk_cpg_dispatch (user_data=0x7ffe3e065e70) at cpg.c:237
#15 0x00007fdd3dca7f27 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x563a18d61ae0) at mainloop.c:805
#16 0x00007fdd3a53067d in g_main_dispatch (context=0x563a18d601f0) at gmain.c:3176
#17 g_main_context_dispatch (context=context@entry=0x563a18d601f0) at gmain.c:3829
#18 0x00007fdd3a530a48 in g_main_context_iterate (context=0x563a18d601f0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3902
#19 0x00007fdd3a530d72 in g_main_loop_run (loop=0x563a18d454e0) at gmain.c:4098
#20 0x0000563a1812da9c in main (argc=<optimized out>, argv=<optimized out>) at pacemaker-fenced.c:1516

The ltrace call counts show:

[root@controller-1 ~]# ltrace -p 62592 --demangle -c -w1
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 36.19    0.600346          55     10793 crm_is_callsite_active
 26.05    0.432109          56      7621 crm_str_eq
 18.40    0.305310          56      5397 g_hash_table_lookup
  9.72    0.161241          56      2859 g_hash_table_iter_next
  8.54    0.141687          55      2536 safe_str_neq
  1.10    0.018192          57       317 g_hash_table_iter_init
------ ----------- ----------- --------- --------------------
100.00    1.658885                 29523 total

And a longer ltrace shows:

[root@controller-1 ~]# more /tmp/stuck-ltrace
62592 g_hash_table_lookup(0x563a192bb2a0, 0x563a19522ce0, 0x563a19522ce0, 99) = 0
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 0x563a194d7cd0) = 0
62592 crm_str_eq(0x563a192b0a80, 0x563a19491150, 0, 0x563a194d7cd0) = 0
62592 g_hash_table_lookup(0x563a1944f400, 0x563a19522ce0, 0x563a19522ce0, 99) = 0
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 0) = 0
62592 safe_str_neq(0x563a1947d3e0, 0x563a19491150, 0, 0) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 99) = 0
62592 safe_str_neq(0x563a19695670, 0x563a19491150, 0, 99) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 48) = 0
62592 safe_str_neq(0x563a19691930, 0x563a19491150, 0, 48) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 99) = 0
62592 safe_str_neq(0x563a196880e0, 0x563a19491150, 0, 99) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 99) = 0
62592 safe_str_neq(0x563a19463350, 0x563a19491150, 0, 99) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 99) = 0
62592 safe_str_neq(0x563a194d7a20, 0x563a19491150, 0, 99) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 48) = 0
62592 safe_str_neq(0x563a196970e0, 0x563a19491150, 0, 48) = 1
62592 crm_is_callsite_active(0x563a192bd230, 8, 0, 99) = 0
62592 safe_str_neq(0x563a192b0a80, 0x563a19491150, 0, 99) = 1
62592 g_hash_table_lookup(0x563a194422a0, 0x563a19491150, 0x7fdd39f272e0, 0) = 0
62592 g_hash_table_iter_init(0x7ffe3df64450, 0x563a194422a0, 7, 0x563a194b62e0) = 8
62592 g_hash_table_iter_next(0x7ffe3df64450, 0, 0x7ffe3df64448, 0x563a194b62e0) = 1
62592 crm_is_callsite_active(0x563a19643b20, 8, 0, 0x563a195240e0) = 0
62592 crm_str_eq(0x563a19053b40, 0x563a19491150, 0, 0x563a195240e0) = 0
62592 g_hash_table_iter_next(0x7ffe3df64450, 0, 0x7ffe3df64448, 99) = 1
62592 crm_is_callsite_active(0x563a19643b20, 8, 0, 0x563a194622d0) = 0
62592 crm_str_eq(0x563a19523d70, 0x563a19491150, 0, 0x563a194622d0) = 0
62592 g_hash_table_iter_next(0x7ffe3df64450, 0, 0x7ffe3df64448, 99) = 1
62592 crm_is_callsite_active(0x563a19643b20, 8, 0, 0x563a19659b30) = 0
62592 crm_str_eq(0x563a194f9250, 0x563a19491150, 0, 0x563a19659b30) = 0
62592 g_hash_table_iter_next(0x7ffe3df64450, 0, 0x7ffe3df64448, 48) = 1
62592 crm_is_callsite_active(0x563a19643b20, 8, 0, 0x563a19463440) = 0
62592 crm_str_eq(0x563a18fee780, 0x563a19491150, 0, 0x563a19463440) = 0
62592 g_hash_table_iter_next(0x7ffe3df64450, 0, 0x7ffe3df64448, 48) = 1
62592 crm_is_callsite_active(0x563a19643b20, 8, 0, 0x563a19687d80) = 0
62592 crm_str_eq(0x563a192b5730, 0x563a19491150, 0, 0x563a19687d80) = 0
62592 g_hash_table_iter_next(0x7ffe3df64450, 0, 0x7ffe3df64448, 99) = 1
62592 crm_is_callsite_active(0x563a19643b20, 8, 0, 0x563a19463050) = 0
62592 crm_str_eq(0x563a196619b0, 0x563a19491150, 0, 0x563a19463050) = 0

http://file.rdu.redhat.com/~mbaldess/fence-sosreports/stuck-ltrace has the full looping ltrace.
Sosreports of all nodes are here: http://file.rdu.redhat.com/~mbaldess/fence-sosreports/
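For completeness, the diagnostics above can be gathered with standard tools; the commands below are only a sketch (the output path is an assumption, and 62592 was the fencer's PID in this report; substitute the PID on the affected node):
# Attach gdb non-interactively to grab a backtrace of the spinning fencer.
gdb -p 62592 -batch -ex bt > /tmp/fenced-backtrace.txt 2>&1
# Count library calls as in the report; interrupt with Ctrl-C after a few seconds to get the summary table.
ltrace -p 62592 --demangle -c -w1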