Bug 1985067

Summary: While performing a minor update, the process timed out and it looks like pacemaker can't determine the address for bundles
Product: Red Hat OpenStack
Reporter: David Hill <dhill>
Component: openstack-tripleo-heat-templates
Assignee: OSP Team <rhos-maint>
Status: CLOSED DUPLICATE
QA Contact: Joe H. Rahme <jhakimra>
Severity: high
Docs Contact:
Priority: low
Version: 16.1 (Train)
CC: aschultz, bsawyers, dciabrin, lmiccini, mburns
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-30 07:55:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Hill 2021-07-22 18:36:53 UTC
Description of problem:
While performing a minor update, the process timed out, and it looks like pacemaker can't determine the address for the bundles:

Jul 19 18:01:58 overcloud-controller-2 crmd[313631]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Jul 19 18:01:58 overcloud-controller-2 corosync[313564]: [TOTEM ] A new membership (10.10.10.10:7732) was formed. Members left: 1
Jul 19 18:01:58 overcloud-controller-2 corosync[313564]: [QUORUM] Members[2]: 2 3
Jul 19 18:01:58 overcloud-controller-2 corosync[313564]: [MAIN  ] Completed service synchronization, ready to provide service.
Jul 19 18:01:58 overcloud-controller-2 pacemakerd[313576]:  notice: Node overcloud-controller-0 state is now lost
Jul 19 18:01:59 overcloud-controller-2 dnsmasq[180147]: read /var/lib/neutron/dhcp/90292b43-3cd9-4c98-b008-013208e4d9e4/addn_hosts - 4 addresses
Jul 19 18:01:59 overcloud-controller-2 dnsmasq-dhcp[180147]: read /var/lib/neutron/dhcp/90292b43-3cd9-4c98-b008-013208e4d9e4/host
Jul 19 18:01:59 overcloud-controller-2 dnsmasq-dhcp[180147]: read /var/lib/neutron/dhcp/90292b43-3cd9-4c98-b008-013208e4d9e4/opts
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node galera-bundle-2 state is now member
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node redis-bundle-0 state is now lost
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]: warning: No reason to expect node redis-bundle-0 to be down
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Stonith/shutdown of redis-bundle-0 not matched
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node rabbitmq-bundle-1 state is now lost
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]: warning: No reason to expect node rabbitmq-bundle-1 to be down
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Stonith/shutdown of rabbitmq-bundle-1 not matched
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node galera-bundle-0 state is now lost
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]: warning: No reason to expect node galera-bundle-0 to be down
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Stonith/shutdown of galera-bundle-0 not matched
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node redis-bundle-2 state is now member
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node rabbitmq-bundle-0 state is now member
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Node overcloud-controller-0 state is now lost
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]: warning: No reason to expect node 1 to be down
Jul 19 18:02:00 overcloud-controller-2 crmd[313631]:  notice: Stonith/shutdown of overcloud-controller-0 not matched
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]: warning: Blind faith: not fencing unseen nodes
Jul 19 18:02:00 overcloud-controller-2 cib[313626]: warning: A-Sync reply to crmd failed: No message of desired type
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      rabbitmq-bundle-1     ( overcloud-controller-2 )   due to unrunnable rabbitmq-bundle-docker-1 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      rabbitmq:1            (  rabbitmq-bundle-1 )   due to unrunnable rabbitmq-bundle-docker-1 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      rabbitmq-bundle-2     ( overcloud-controller-2 )   due to unrunnable rabbitmq-bundle-docker-2 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      rabbitmq:2            (  rabbitmq-bundle-2 )   due to unrunnable rabbitmq-bundle-docker-2 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      galera-bundle-0       ( overcloud-controller-1 )   due to unrunnable galera-bundle-docker-0 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      galera:0              (    galera-bundle-0 )   due to unrunnable galera-bundle-docker-0 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      redis-bundle-0        ( overcloud-controller-2 )   due to unrunnable redis-bundle-docker-0 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice:  * Start      redis:0               (     redis-bundle-0 )   due to unrunnable redis-bundle-docker-0 start (blocked)
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection rabbitmq-bundle-1
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection rabbitmq-bundle-2
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection galera-bundle-0
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection redis-bundle-0
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:  notice: Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-2328.bz2
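
For triage, the affected bundle connections can be pulled out of a log excerpt like the one above. The following is a minimal sketch, assuming the journal text has already been captured as a string; the function name `bundles_without_address` is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from typing import List

# Matches the pengine error seen in the log excerpt and captures the
# bundle connection name that follows it.
PATTERN = re.compile(r"Could not determine address for bundle connection (\S+)")


def bundles_without_address(log_text: str) -> List[str]:
    """Return bundle connection names reported by the error, in first-seen order."""
    seen: List[str] = []
    for match in PATTERN.finditer(log_text):
        name = match.group(1)
        if name not in seen:  # the same error can repeat on every transition
            seen.append(name)
    return seen


# Sample lines copied from the log excerpt in this report.
sample_log = """\
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]: warning: Blind faith: not fencing unseen nodes
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection rabbitmq-bundle-1
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection rabbitmq-bundle-2
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection galera-bundle-0
Jul 19 18:02:00 overcloud-controller-2 pengine[313630]:   error: Could not determine address for bundle connection redis-bundle-0
"""

for bundle in bundles_without_address(sample_log):
    print(bundle)
```

In this excerpt the names returned are the same bundle connections that the pengine lists as blocked starts.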



Version-Release number of selected component (if applicable):
pacemaker-libs-1.1.19-8.el7_6.2.x86_64

How reproducible:
This environment

Steps to Reproduce:
1. Run a minor update; it times out.
2. Bundles are not able to start due to "Could not determine address for bundle connection".

Actual results:
The minor update fails.

Expected results:
No failures.

Additional info:

Comment 4 David Hill 2021-07-28 20:09:23 UTC
The resource was banned, and running "pcs resource clear rabbitmq-bundle" resolved that. The issue we have now is that this rabbitmq instance won't join the cluster, and I'm wondering at this stage whether simply restarting the minor update procedure would solve it.
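
To check whether a ban like this is still in place before clearing it, the location constraints can be scanned for ban entries. This is a minimal sketch, assuming the constraint listing (for example the output of `pcs constraint --full`) has been captured as text, and assuming that ban constraints carry ids beginning with `cli-ban-`. The function name and the sample listing are hypothetical and were not taken from the affected environment.

```python
import re
from typing import List

# Assumption: constraints created by a ban have ids starting with "cli-ban-".
BAN_ID = re.compile(r"\bcli-ban-[\w.-]+")


def find_ban_constraints(constraint_listing: str) -> List[str]:
    """Return the ids of ban constraints found in a constraint listing."""
    return BAN_ID.findall(constraint_listing)


# Illustrative listing only; real output may be laid out differently.
sample = """\
Location Constraints:
  Resource: rabbitmq-bundle
    Disabled on: overcloud-controller-0 (score:-INFINITY) (id:cli-ban-rabbitmq-bundle-on-overcloud-controller-0)
"""

for ban_id in find_ban_constraints(sample):
    print(ban_id)
```

If any ids are reported, the corresponding resource is the one to pass to "pcs resource clear", as was done for rabbitmq-bundle here.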

Comment 5 Brandon Sawyers 2021-07-28 20:14:52 UTC
Will the update run with pcs not being in a healthy state, though?