Bug 1831775
Summary: pacemaker-fenced got stuck with 100% CPU inside stonith_choose_peer()
Product: Red Hat Enterprise Linux 8
Component: pacemaker
Version: 8.2
Reporter: Michele Baldessari <michele>
Assignee: Ken Gaillot <kgaillot>
QA Contact: pkomarov
CC: cluster-maint, jeckersb, pkomarov
Status: CLOSED ERRATA
Severity: medium
Priority: high
Target Milestone: rc
Target Release: 8.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-2.0.4-2.el8
Doc Type: Bug Fix
Doc Text:
Cause: Previously, Pacemaker assumed that a fencing topology level used to unfence a node would still exist when the unfencing result arrived.
Consequence: If a target node had a fencing topology configured with a single level containing a single device, and that level was removed from the configuration after an unfencing action had been initiated but before the result came back, the Pacemaker fencer daemon on the Designated Controller (DC) node could go into an infinite loop, consuming all CPU and not responding to any further requests.
Fix: Pacemaker now checks whether the fencing topology still exists when a result comes back.
Result: The result is recorded properly, and the fencer daemon continues to behave normally.
Last Closed: 2020-11-04 04:00:53 UTC
Type: Bug
Description
Michele Baldessari
2020-05-05 15:36:14 UTC
Which node had the issue, and do you know around what time?

It was controller-1 that had pacemaker-fenced at 100% CPU, and the fenced process was still stuck in the loop at sosreport collection time.

The fencing device for controller-0 was removed from the configuration while controller-0 was being unfenced. The fencer initiated the unfencing while the device still existed, but the device was gone when the result came back. It then tried to check whether there were any more devices in the topology, and ended up in the infinite loop in stonith_choose_peer().

May 05 13:25:05 controller-1.redhat.local pacemaker-based [62591] (cib_perform_op) info: ++ /cib/configuration/fencing-topology: <fencing-level devices="stonith-fence_ipmilan-5254002c4be9,stonith-fence_compute-fence-nova" index="1" target="controller-0" id="fl-controller-0-1"/>
...
May 05 13:25:39 controller-1.redhat.local pacemaker-fenced [62592] (handle_request) notice: Client pacemaker-controld.62596.af2c977b wants to fence (on) 'controller-0' with device '(any)'
...
May 05 13:25:39 controller-1.redhat.local pacemaker-fenced [62592] (can_fence_host_with_device) notice: stonith-fence_ipmilan-5254002c4be9 is eligible to fence (on) controller-0: static-list
...
May 05 13:25:49 controller-1.redhat.local pacemaker-based [62591] (cib_perform_op) info: -- /cib/configuration/fencing-topology/fencing-level[@id='fl-controller-0-1']
...
May 05 13:25:50 controller-1.redhat.local pacemaker-fenced [62592] (call_remote_stonith) notice: Requesting that database-2 perform 'on' action targeting controller-0 using 'stonith-fence_ipmilan-5254002c4be9' | for client pacemaker-controld.62596 (72s)
...
May 05 13:25:50 controller-1.redhat.local pacemaker-fenced [62592] (process_remote_stonith_exec) notice: Action 'on' targeting controller-0 using stonith-fence_ipmilan-5254002c4be9 on behalf of pacemaker-controld.62596@controller-1: OK | rc=0

That was the final message from the fencer.
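To check collected logs for this symptom, a small helper like the following (the name `last_fencer_lines` and the approach are my own, not part of pacemaker or the report) prints the last few fencer messages from a saved log file; in the stuck case the output ends with a successful 'on' result and no later fencer activity.

```shell
# last_fencer_lines: print the last N pacemaker-fenced messages from a
# saved log file (e.g. a journal dump from an sosreport). Hypothetical
# diagnostic helper; in the stuck case the final message is the
# successful 'on' (unfence) result shown above, with nothing after it.
last_fencer_lines() {
    file=$1
    n=${2:-5}
    grep 'pacemaker-fenced' "$file" | tail -n "$n"
}
```

Combine with a check that pacemaker-fenced is consuming CPU (e.g. `ps -C pacemaker-fenced -o pid,pcpu`) to confirm the node is affected.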
The workaround until a fix is ready is to not remove fence devices while they are in use.

I am able to reproduce this issue on RHEL 8.2. Interestingly, I am not able to reproduce it on RHEL 7, even with the same pacemaker code base, so it may depend on some external factor such as the glib or corosync version. The issue only occurs if the target node has a single fencing topology level containing a single fencing device, and the level is removed from the configuration after an unfencing action has been requested and before it has completed.

QA: The easiest reproducer is:

1. Configure a cluster, and choose one node to be the target of the test (referred to as $TARGET below).

2. On every node, make fence_dummy available for use:

   /usr/libexec/pacemaker/cts-support install

3. Create a fencing topology for the target consisting of a dummy fence device with a delay (the delay gives time to remove the level while an operation is in progress; the target should not have any other fencing configured):

   pcs stonith create target-fencing fence_dummy mode=pass delay=4 mock_dynamic_hosts=$TARGET meta provides=unfencing
   pcs stonith level add 1 $TARGET target-fencing

4. Changing the dummy device's parameters is the easiest way to trigger unfencing. It appears only the node running the device will be unfenced (which is likely a separate, unrelated bug), so ensure the device runs on the target:

   pcs constraint location target-fencing prefers $TARGET=100

5. Trigger unfencing by changing the delay parameter (here, from 4 to 3, but any change will do), then remove the topology level:

   pcs stonith update target-fencing delay=3 && sleep 1 && pcs stonith level remove 1 $TARGET

Michele, are you OK with 8.3 for the fix, or do you want an 8.2.z-stream as well?

I think we should be fine with RHEL 8.3. Once we have the crm_node fix, Puppet won't go around changing stonith resources and/or levels.
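For convenience, the reproducer steps can be consolidated into one script. This is a sketch under the same assumptions as the steps (a working cluster with pcs, and the cts-support dummy agent installed on every node); the function name `reproduce` and the dry-run mechanism are my own additions, and by default the script only echoes each command so it can be reviewed before being run against a real cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around the QA reproducer steps; not from pacemaker.
# Assumes a working Pacemaker cluster and that the fence_dummy agent has
# been installed via /usr/libexec/pacemaker/cts-support on every node.
# By default each command is only echoed (dry run); pass "" as the second
# argument to actually execute the commands.
reproduce() {
    target=$1
    run=${2-echo}    # ${2-echo}: an explicitly empty 2nd arg stays empty

    # Step 2: make the dummy fence agent available (repeat on every node)
    $run /usr/libexec/pacemaker/cts-support install

    # Step 3: a single topology level with a single delayed dummy device
    $run pcs stonith create target-fencing fence_dummy mode=pass delay=4 \
        mock_dynamic_hosts="$target" meta provides=unfencing
    $run pcs stonith level add 1 "$target" target-fencing

    # Step 4: keep the device on the target so the target gets unfenced
    $run pcs constraint location target-fencing prefers "$target"=100

    # Step 5: trigger unfencing, then remove the level mid-operation
    $run pcs stonith update target-fencing delay=3
    sleep 1
    $run pcs stonith level remove 1 "$target"
}

reproduce controller-0          # dry run: prints the commands
# reproduce controller-0 ""     # real run (careful: triggers the bug)
```

On an affected build, the real run should leave pacemaker-fenced on the DC spinning at 100% CPU, as described above.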
Thanks,
Michele

Fixed upstream by commit 6d15ee56

Verified: the new package passes the basic HA regression test.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4804