Bug 2218292

Summary: Galera unable to recover after single controller failure - Error: local node <leaf1-controller-2> is started, but not in primary mode
Product: Red Hat OpenStack Reporter: Chris Janiszewski <cjanisze>
Component: mariadb-galera    Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED NOTABUG QA Contact: dabarzil
Severity: low Docs Contact:
Priority: low    
Version: 17.0 (Wallaby)    CC: lmiccini, mbayer
Target Milestone: ---    Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2023-08-11 12:56:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Chris Janiszewski 2023-06-28 15:28:16 UTC
Description of problem:
An OOM error caused one of the controller nodes to fail. The Galera cluster on the remaining 2 controller nodes was unable to recover, which brought all the OpenStack services down.


Version-Release number of selected component (if applicable):
OSP17.0.1


How reproducible:
Possibly, by injecting an OOM failure on one of the controller nodes?

Steps to Reproduce:
1. Trigger an OOM failure on one of the controller nodes (one possible way to inject this is sketched after these steps)
2.
3.
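
One possible way to inject such a failure for testing (a rough sketch only; it assumes stress-ng is available on the controller, and the memory percentage/duration are placeholders to adjust to the node's RAM):

# Consume most of the node's memory until the kernel OOM killer fires
stress-ng --vm 4 --vm-bytes 95% --vm-keep --timeout 600s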

Actual results:
OpenStack cluster failure

Expected results:
The 2 remaining controller nodes should be able to keep the cluster up

Additional info:

[root@leaf1-controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: leaf1-controller-1 (version 2.1.2-4.el9_0.3-ada5c3b36e2) - partition with quorum
  * Last updated: Wed Jun 28 10:10:43 2023
  * Last change:  Mon Jun  5 13:16:00 2023 by hacluster via crmd on leaf1-controller-0
  * 12 nodes configured
  * 41 resource instances configured

Node List:
  * Online: [ leaf1-controller-1 leaf1-controller-2 ]
  * OFFLINE: [ leaf1-controller-0 ]
  * GuestOnline: [ galera-bundle-0@leaf1-controller-1 galera-bundle-1@leaf1-controller-2 rabbitmq-bundle-0@leaf1-controller-2 rabbitmq-bundle-1@leaf1-controller-1 redis-bundle-0@leaf1-controller-1 redis-bundle-1@leaf1-controller-2 ]

Full List of Resources:
  * ip-172.20.10.109    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * ip-172.20.11.50     (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * ip-172.20.12.115    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * ip-172.20.16.16     (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * ip-172.20.14.115    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * ip-172.20.15.116    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * Container bundle set: haproxy-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0   (ocf:heartbeat:podman):  Started leaf1-controller-1
    * haproxy-bundle-podman-1   (ocf:heartbeat:podman):  Started leaf1-controller-2
    * haproxy-bundle-podman-2   (ocf:heartbeat:podman):  Stopped
  * Container bundle set: galera-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-mariadb:pcmklatest]:
    * galera-bundle-0   (ocf:heartbeat:galera):  Unpromoted leaf1-controller-1
    * galera-bundle-1   (ocf:heartbeat:galera):  Unpromoted leaf1-controller-2
    * galera-bundle-2   (ocf:heartbeat:galera):  Stopped
  * Container bundle set: rabbitmq-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf:heartbeat:rabbitmq-cluster):        Started leaf1-controller-2
    * rabbitmq-bundle-1 (ocf:heartbeat:rabbitmq-cluster):        Started leaf1-controller-1
    * rabbitmq-bundle-2 (ocf:heartbeat:rabbitmq-cluster):        Stopped
  * Container bundle: openstack-cinder-volume [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0  (ocf:heartbeat:podman):  Started leaf1-controller-1
  * ip-172.20.9.214     (ocf:heartbeat:IPaddr2):         Started leaf1-controller-2
  * ceph-nfs    (systemd:ceph-nfs@pacemaker):    Started leaf1-controller-2
  * Container bundle: openstack-manila-share [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0   (ocf:heartbeat:podman):  Started leaf1-controller-2
  * ip-172.20.12.100    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-1
  * Container bundle set: redis-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-redis:pcmklatest]:
    * redis-bundle-0    (ocf:heartbeat:redis):   Promoted leaf1-controller-1
    * redis-bundle-1    (ocf:heartbeat:redis):   Unpromoted leaf1-controller-2
    * redis-bundle-2    (ocf:heartbeat:redis):   Stopped

Failed Resource Actions:
  * galera 10s-interval monitor on galera-bundle-1 returned 'error' because 'local node <leaf1-controller-2> is started, but not in primary mode. Unknown state.' at Wed Jun 28 09:03:28 2023
  * rabbitmq 10s-interval monitor on rabbitmq-bundle-0 could not be executed (Timed Out) because 'Resource agent did not complete in time' at Wed Jun 28 09:24:56 2023
  * galera 10s-interval monitor on galera-bundle-0 returned 'error' because 'local node <leaf1-controller-1> is started, but not in primary mode. Unknown state.' at Wed Jun 28 09:03:26 2023
  * rabbitmq 10s-interval monitor on rabbitmq-bundle-1 could not be executed (Timed Out) because 'Resource agent did not complete in time' at Wed Jun 28 09:24:57 2023

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
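
For reference on the "not in primary mode" errors above, Galera's own view of the cluster can be queried from inside the galera bundle container on each controller (a minimal sketch; the container name comes from the bundle set above, the -N suffix differs per node, and root access over the local socket is assumed):

# "Primary" means the node is part of a quorate Galera component
podman exec galera-bundle-podman-0 mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
podman exec galera-bundle-podman-0 mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"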


sosreports from all the controllers:
http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-28-nwztdxl.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-28-pskyypk.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj.tar.xz

Comment 1 Luca Miccini 2023-06-29 06:44:15 UTC
Hey Chris,

Something happened around 08:21:20, but unfortunately we can't see the full stack trace; we only have a hint that it may have something to do with libqpid-proton.so.11.13.0 (a ceilometer/STF agent of some sort?):

Jun 28 08:21:20 leaf1-controller-0 kernel: show_signal_msg: 13 callbacks suppressed
Jun 28 08:21:20 leaf1-controller-0 kernel: handle[10844]: segfault at 18 ip 00007f09e061aa8b sp 00007f09b1d809c8 error 4
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: in libqpid-proton.so.11.13.0[7f09e05ff000+38000]
Jun 28 08:21:20 leaf1-controller-0 kernel: Code: 0f 1e fa 89 b7 90 00 00 00 31 c0 c3 0f 1f 00 f3 0f 1e fa 48 8b bf 50 01 00 00 e9 b0 8b fe ff f3 0f 1e fa 31 c0 48 85 ff 74 25 <48> 8b 3f 48 85 ff 74 1d 48 83 ec 08 e8 94 8b fe ff 48 85 c0 0f 95

This server was probably stuck from 08:09, because there are no log messages between 08:09 and 08:21.

OOM event:

Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0

The monitor action for galera failed at 08:21:21:

Jun 28 08:21:21.921 leaf1-controller-1 pacemaker-based     [3425] (cib_perform_op)      info: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='galera-bundle-podman-2']:  <lrm_rsc_op id="galera-bundle-podman-2_last_failure_0" operation_key="galera-bundle-podman-2_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.13.0" transition-key="71:3:0:ea13fbb0-974f-4d00-960a-5eaa0d05233a" transition-magic="2:1;71:3:0:ea13fbb0-974f-4d00-960a-5eaa0d05233a" exit-reason="Resource agent did not complete in time" on_node="leaf1-controller-0" call-id="108" rc-code="1" op-status="2" interval="60000" last-rc-change="1687954879" exec-time="0" queue-time="0" op-digest="9d4ebfd6b97d042c103f49aa5ea7c6f2"/>

The cluster tried to soft-recover by stopping and restarting the container a couple of seconds later:

Jun 28 08:21:23.233 leaf1-controller-1 pacemaker-schedulerd[3429] (log_list_item)       notice: Actions: Recover    galera-bundle-podman-2              (                       leaf1-controller-0 )
Jun 28 08:21:23.233 leaf1-controller-1 pacemaker-schedulerd[3429] (log_list_item)       notice: Actions: Restart    galera-bundle-2                     (                       leaf1-controller-0 )  due to required galera-bundle-podman-2 start
Jun 28 08:21:23.233 leaf1-controller-1 pacemaker-schedulerd[3429] (log_list_item)       notice: Actions: Recover    galera:2                            (                 Promoted galera-bundle-2 )
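
(For reference, the fail counts and failed operations that drive this recovery decision can be inspected with pcs, and cleared once the underlying fault is fixed; a minimal sketch:)

# Show accumulated failures for the galera bundle and clear them after the root cause is addressed
pcs resource failcount show galera-bundle
pcs resource cleanup galera-bundle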

While the recovery was still happening, we lost leaf1-controller-2:

Jun 28 08:23:22.196 leaf1-controller-1 pacemaker-attrd     [3428] (node_left)   info: Group attrd event 5: leaf1-controller-2 (node 3 pid 3246) left via cluster exit
Jun 28 08:23:22.196 leaf1-controller-1 pacemaker-based     [3425] (node_left)   info: Group cib event 5: leaf1-controller-2 (node 3 pid 3243) left via cluster exit
Jun 28 08:23:22.196 leaf1-controller-1 pacemaker-attrd     [3428] (crm_update_peer_proc)        info: node_left: Node leaf1-controller-2[3] - corosync-cpg is now offline
Jun 28 08:23:22.196 leaf1-controller-1 pacemaker-based     [3425] (crm_update_peer_proc)        info: node_left: Node leaf1-controller-2[3] - corosync-cpg is now offline
Jun 28 08:23:22.196 leaf1-controller-1 pacemaker-controld  [3430] (node_left)   info: Group crmd event 5: leaf1-controller-2 (node 3 pid 3248) left via cluster exit


So at this stage we have leaf1-controller-0, which went OOM (not really operational, but still reporting UP from a pacemaker perspective), and leaf1-controller-2, which left because of what seems to be a network/load issue. Looking at the corosync log we can see that the heartbeat network is flapping (or the controllers are too slow to process heartbeat messages in time):

Jun 28 08:20:57 [3201] leaf1-controller-2 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 28 08:20:57 [3201] leaf1-controller-2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 08:20:57 [3201] leaf1-controller-2 corosync warning [KNET  ] host: host: 1 has no active links
Jun 28 08:21:01 [3201] leaf1-controller-2 corosync notice  [TOTEM ] Token has not been received in 7987 ms 
Jun 28 08:21:03 [3201] leaf1-controller-2 corosync info    [KNET  ] rx: host: 1 link: 0 is up
Jun 28 08:21:03 [3201] leaf1-controller-2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 08:21:03 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 28 08:21:03 [3201] leaf1-controller-2 corosync notice  [TOTEM ] A new membership (1.5f8) was formed. Members
Jun 28 08:21:03 [3201] leaf1-controller-2 corosync notice  [QUORUM] Members[3]: 1 2 3
Jun 28 08:21:03 [3201] leaf1-controller-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 28 08:23:03 [3201] leaf1-controller-2 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 28 08:23:03 [3201] leaf1-controller-2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 08:23:03 [3201] leaf1-controller-2 corosync warning [KNET  ] host: host: 1 has no active links
Jun 28 08:23:34 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync members[1]: 3
Jun 28 08:23:34 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync left[2]: 1 2

The following is especially problematic because it is a split-brain situation in which leaf1-controller-2 thought it was the last node standing:

Jun 28 08:23:34 [3201] leaf1-controller-2 corosync notice  [TOTEM ] A new membership (3.5fc) was formed. Members left: 1 2
Jun 28 08:23:34 [3201] leaf1-controller-2 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1 2
Jun 28 08:23:47 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync members[1]: 3
Jun 28 08:23:47 [3201] leaf1-controller-2 corosync notice  [TOTEM ] A new membership (3.600) was formed. Members
Jun 28 08:23:47 [3201] leaf1-controller-2 corosync notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 28 08:23:47 [3201] leaf1-controller-2 corosync notice  [QUORUM] Members[1]: 3
Jun 28 08:23:47 [3201] leaf1-controller-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 28 08:24:00 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync members[1]: 3
Jun 28 08:24:00 [3201] leaf1-controller-2 corosync notice  [TOTEM ] A new membership (3.604) was formed. Members
Jun 28 08:24:00 [3201] leaf1-controller-2 corosync notice  [QUORUM] Members[1]: 3
Jun 28 08:24:00 [3201] leaf1-controller-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

The network comes back 10 seconds later:

Jun 28 08:24:10 [3201] leaf1-controller-2 corosync info    [KNET  ] rx: host: 1 link: 0 is up
Jun 28 08:24:10 [3201] leaf1-controller-2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 08:24:11 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 28 08:24:11 [3201] leaf1-controller-2 corosync notice  [QUORUM] Sync joined[2]: 1 2
Jun 28 08:24:11 [3201] leaf1-controller-2 corosync notice  [TOTEM ] A new membership (1.60c) was formed. Members joined: 1 2
Jun 28 08:24:11 [3201] leaf1-controller-2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
Jun 28 08:24:11 [3201] leaf1-controller-2 corosync notice  [QUORUM] Members[3]: 1 2 3
Jun 28 08:24:11 [3201] leaf1-controller-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

There are quite a few more of these network issues over the next 2 hours or so, with various cluster nodes repeatedly dropping out of and rejoining the cluster.
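
A quick way to see corosync's live view of membership and link health while this is happening (a minimal sketch; these are standard corosync commands run directly on each controller):

# Quorum state and current member list as corosync sees them
corosync-quorumtool -s
# Per-link knet status for each configured ring/peer
corosync-cfgtool -s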

The TL;DR is that you experienced a double controller failure, and your cluster is not designed to recover from that, especially since it doesn't have stonith configured. This is why Galera did not recover.
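
For completeness: once connectivity is back, the pacemaker galera resource agent normally handles re-bootstrapping on its own. Galera's documented escape hatch from a non-Primary component is the pc.bootstrap provider option; the sketch below is an illustration only (the container name and socket-based root access are assumptions), to be used on a single, known-consistent node and never as a substitute for fencing:

# Last resort, on one node only, if the component stays non-Primary and the data is known consistent
podman exec galera-bundle-podman-0 mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"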


My suggestions:

1. Enable stonith (see the sketch after this list)
2. Double-check that the controllers are sized appropriately for this kind of workload; IMHO you are a little short on memory. If increasing the RAM is not an option, you could try reducing the number of threads/processes spawned by each service, or disabling a few services.
3. Split the internalapi traffic onto a dedicated interface if possible (not under OVS control)
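
For point 1, a minimal sketch of what enabling fencing could look like (illustration only: on a director-deployed cluster fencing is normally configured via the tripleo EnableFencing/FencingConfig parameters, the right fence agent depends on the BMC hardware, and the addresses/credentials below are placeholders):

# One stonith device per controller; fence_ipmilan shown purely as an example agent
pcs stonith create stonith-leaf1-controller-0 fence_ipmilan \
    ip=<bmc-ip> username=<bmc-user> password=<bmc-password> lanplus=1 \
    pcmk_host_list=leaf1-controller-0
pcs property set stonith-enabled=true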

Comment 2 Chris Janiszewski 2023-06-29 15:02:20 UTC
Thanks, Luca, for looking into it. This BZ was mainly to track the controllers' inability to recover galera/mariadb after a single controller failure. It sounds like my issue could have been prevented with STONITH, which I have purposely not enabled, but I understand it's necessary to prevent split-brain situations like the one that seems to be at play here.

I have also filed another BZ to track down what might be contributing to the OOM issue that occurs every 30 days or so -> https://bugzilla.redhat.com/show_bug.cgi?id=2218295
We've run on the same hardware since OSP 8, and this OSP 17.0 deployment seems to be more memory hungry, or is leaking memory somewhere. In past releases we've been able to stay up for months and even years without a reboot. Thanks. Feel free to close this BZ if you don't think there is any value in pursuing this from the galera/mariadb perspective, unless we think the DB is eating up all the memory.

Comment 3 Luca Miccini 2023-08-11 12:56:04 UTC
Thanks, Chris. I am closing this for now; please reopen if required.