Bug 1391671 - RHOS Upgrade failed: Attempted to promote Master instance of galera before bootstrap node has been detected.
Summary: RHOS Upgrade failed: Attempted to promote Master instance of galera before bootstrap node has been detected.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Michele Baldessari
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-03 17:45 UTC by Yurii Prokulevych
Modified: 2023-02-22 23:02 UTC (History)
CC List: 18 users

Fixed In Version: openstack-tripleo-heat-templates-5.0.0-1.5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 16:29:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1640182 0 None None None 2016-11-08 14:15:03 UTC
OpenStack gerrit 394980 0 None MERGED Mitaka-Newton upgrade fix network disruption during convergence 2020-06-30 10:28:54 UTC
Red Hat Product Errata RHEA-2016:2948 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC

Description Yurii Prokulevych 2016-11-03 17:45:20 UTC
Description of problem:
-----------------------
Upgrade of RHOS-9 to RHOS-10 failed during the convergence step because the galera-master resource failed to start.


pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Thu Nov  3 16:37:52 2016          Last change: Thu Nov  3 16:09:54 2016 by root via cibadmin on controller-0

3 nodes and 19 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 ip-172.17.1.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.10 (ocf::heartbeat:IPaddr2):       Started controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     galera     (ocf::heartbeat:galera):        FAILED Master controller-2 (blocked)
     galera     (ocf::heartbeat:galera):        FAILED Master controller-1 (blocked)
     galera     (ocf::heartbeat:galera):        FAILED Master controller-0 (blocked)
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Started controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-1 ]
     Slaves: [ controller-0 controller-2 ]
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-192.0.2.15  (ocf::heartbeat:IPaddr2):       Started controller-2
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started controller-0

Failed Actions:
* galera_promote_0 on controller-2 'unknown error' (1): call=94, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Thu Nov  3 15:37:13 2016', queued=0ms, exec=245ms
* galera_promote_0 on controller-1 'unknown error' (1): call=107, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Thu Nov  3 15:37:25 2016', queued=0ms, exec=229ms
* galera_promote_0 on controller-0 'unknown error' (1): call=124, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Thu Nov  3 15:37:52 2016', queued=0ms, exec=153ms
* openstack-cinder-volume_monitor_60000 on controller-0 'not running' (7): call=244, status=complete, exitreason='none',
    last-rc-change='Thu Nov  3 16:37:26 2016', queued=0ms, exec=0ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
galera-25.3.5-7.el7ost.x86_64
mariadb-galera-common-5.5.42-5.el7ost.x86_64
mariadb-galera-server-5.5.42-5.el7ost.x86_64
pcs-0.9.152-10.el7.x86_64
pacemaker-cluster-libs-1.1.15-11.el7.x86_64
pacemaker-cli-1.1.15-11.el7.x86_64
puppet-pacemaker-0.3.0-0.20161028103953.f0d2b2a.el7ost.noarch
pacemaker-libs-1.1.15-11.el7.x86_64
pacemaker-1.1.15-11.el7.x86_64
pacemaker-remote-1.1.15-11.el7.x86_64
resource-agents-3.9.5-82.el7.x86_64
python-testresources-0.2.7-6.el7ost.noarch

Steps to Reproduce:
1. Run upgrade from RHOS-9 to RHOS-10

openstack overcloud deploy \
        --templates \
        --libvirt-type kvm \
        --ntp-server clock.redhat.com \
        --neutron-network-type vxlan \
        --neutron-tunnel-types vxlan \
        --control-scale 3 \
        --control-flavor controller-d75f3dec-c770-5f88-9d4c-3fea1bf9c484 \
        --compute-scale 2 \
        --compute-flavor compute-b634c10a-570f-59ba-bdbf-0c313d745a10 \
        --ceph-storage-scale 3 \
        --ceph-storage-flavor ceph-cf1f074b-dadb-5eb8-9eb0-55828273fab7 \
        -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
        -e /home/stack/virt/ceph.yaml \
        -e /home/stack/virt/network/network-environment.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
        -e /home/stack/virt/enable-tls.yaml \
        -e /home/stack/virt/inject-trust-anchor.yaml \
        -e /home/stack/virt/hostnames.yml \
        -e /home/stack/virt/debug.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml


Additional info:
----------------
RHEL-7.3. Virtual setup: 3 controllers + 2 computes + 3 ceph

Comment 2 Damien Ciabrini 2016-11-04 00:50:36 UTC
There were several connection losses between the corosync nodes, which led
them to lose quorum and conservatively stop the resources running locally:

At 15:37:00 all galera resources end up being stopped by pacemaker:

Nov 03 15:37:05 controller-0.localdomain pengine[15637]:  warning: Fencing and resource management disabled due to lack of quorum
Nov 03 15:37:05 controller-0.localdomain pengine[15637]:   notice: Stop    galera:0        (Master controller-0)
Nov 03 15:37:08 controller-0.localdomain crmd[15638]:   notice: Result of demote operation for galera on controller-0: 0 (ok) | call=111 key=galera_demote_0 confirmed=true cib-upd

Nov 03 15:36:58 controller-1.localdomain pengine[25230]:  warning: Fencing and resource management disabled due to lack of quorum
Nov 03 15:36:58 controller-1.localdomain pengine[25230]:   notice: Stop    galera:0        (Master controller-1)
Nov 03 15:37:02 controller-1.localdomain crmd[25231]:   notice: Result of demote operation for galera on controller-1: 0 (ok) | call=89 key=galera_demote_0 confirmed=true cib-upda

Nov 03 15:37:03 controller-2.localdomain pengine[9232]:  warning: Fencing and resource management disabled due to lack of quorum
Nov 03 15:37:03 controller-2.localdomain pengine[9232]:   notice: Stop    galera:0        (Master controller-2)
Nov 03 15:37:08 controller-2.localdomain crmd[9233]:   notice: Result of demote operation for galera on controller-2: 0 (ok) | call=86 key=galera_demote_0 confirmed=true cib-updat

Around 15:37, all galera instances have been stopped by pacemaker intentionally,
as seen in the mysql logs:

161103 15:37:08 [Note] /usr/libexec/mysqld: Shutdown complete
161103 15:37:01 [Note] /usr/libexec/mysqld: Shutdown complete
161103 15:37:07 [Note] /usr/libexec/mysqld: Shutdown complete

Then the corosync nodes seem to restore contact with each other,
controller-0 becomes the DC, and decides to re-promote all galera
resources. For a reason I'm not completely sure of, pacemaker didn't
_start_ the resources, but just _promote_d them:

Nov 03 15:37:13 controller-0.localdomain pengine[15637]:   notice: Promote galera:0        (Slave -> Master controller-2)
Nov 03 15:37:13 controller-0.localdomain pengine[15637]:   notice: Promote galera:1        (Slave -> Master controller-1)
Nov 03 15:37:13 controller-0.localdomain pengine[15637]:   notice: Promote galera:2        (Slave -> Master controller-0)

However, since no galera instances were running at the time (no resource in
Master state), the resource agent has to bail out, because it has to
re-bootstrap the cluster first before being able to start a galera
server as a "joining node":

Nov 03 15:37:53 controller-0.localdomain galera(galera)[12116]: ERROR: Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.
Nov 03 15:37:25 controller-1.localdomain galera(galera)[1154]: ERROR: Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.
Nov 03 15:37:13 controller-2.localdomain galera(galera)[11809]: ERROR: Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.

Yurii, given the profile of the overcloud machines (4 CPUs, 20GB), the loss of quorum may be due to the corosync nodes not being able to communicate with each other fast enough during the upgrade process, which is resource-hungry. I'm not sure if that's a supported configuration.

Andrew, can you confirm that the behaviour of pacemaker is expected w.r.t stopping the resource, i.e. demote only rather than stop?

Also, if it's expected that after an entire corosync cluster loss the galera resource agent should recover gracefully, is there a means to tell pacemaker that it would have to rerun a start or monitor action on every node before trying to promote all the nodes?
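
For anyone triaging a similar state, a few checks that can help confirm the diagnosis above. This is a hedged sketch, assuming a standard pacemaker/corosync/pcs install; the exact attribute names depend on the resource-agents version:

corosync-quorumtool -s        # did the partition regain quorum?
crm_mon -A1                   # one-shot cluster status including transient node
                              # attributes (e.g. galera-last-committed) that the
                              # galera agent uses to pick a bootstrap node
pcs resource cleanup galera   # clear the failed promote actions so the agent
                              # can re-probe and re-elect a bootstrap node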

Comment 3 Damien Ciabrini 2016-11-04 00:57:04 UTC
Small addition to comment #2: the 3 nodes cannot be "promote"d concurrently with the galera resource agent, because the agent's logic expects that the bootstrap node is known and its "promote" operation has finished before the other two nodes can be "promote"d.

Comment 4 Andrew Beekhof 2016-11-04 03:06:31 UTC
(In reply to Damien Ciabrini from comment #2)

> Andrew, can you confirm that the behaviour of pacemaker is expected w.r.t
> stopping the resource, i.e. demote only rather than stop?

The observed behaviour is expected.
If the cluster had not reformed, then each node would have continued shutting down galera. However, we see:

Nov 03 15:37:07 [15638] controller-0.localdomain       crmd:  warning: crmd_ha_msg_filter:	Another DC detected: controller-1 (op=noop)

which aborts any actions we were about to perform.

So while the cluster will try to shut down galera when quorum is lost, you cannot rely on it completing before quorum is reattained and promotion attempted. 

> Also, if it's expected that after an entire corosync cluster loss the galera
> resource agent should recover gracefully, it there a means to tell pacemaker
> that it would have to rerun a start or monitor action on every node before
> trying to promote all the nodes?


What I suspect you need to do is have the agent perform a stop as part of the demote action and return 7 (a.k.a. OCF_NOT_RUNNING).

Is there any case when a demote couldn't safely do a stop?

That should work today, but I'm also planning some extra changes in pacemaker to have it behave more optimally.
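
To make that concrete, here is a minimal sketch of a demote action that stops the instance and reports OCF_NOT_RUNNING. It is purely illustrative, not the actual ocf:heartbeat:galera code; galera_stop stands in for the agent's real stop logic:

: ${OCF_SUCCESS=0} ${OCF_ERR_GENERIC=1} ${OCF_NOT_RUNNING=7}

galera_demote() {
    # Fully stop the local mysqld instead of leaving it around in Slave state,
    # so that a later promote is forced through start/bootstrap detection again.
    galera_stop
    rc=$?
    if [ $rc -ne $OCF_SUCCESS ]; then
        return $OCF_ERR_GENERIC
    fi
    # Report "not running" (7) so pacemaker knows a start is required
    # before any further promote.
    return $OCF_NOT_RUNNING
}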

Comment 7 Michele Baldessari 2016-11-04 17:39:44 UTC
The problem is that the corosync nodes cannot see each other for a certain amount
of time. Something is disrupting the network communication and we need to find out
what it is.

So the theory Damien and I have so far is that during the convergence step there is
something in puppet that breaks the communication between the corosync nodes,
and that is why each node cannot see any of the other nodes and things break down.

The first suspect would seem to be the tripleo firewall class (at least in the
environment Sofer gave to us it was on). Could we do a quick test without the
firewall on (i.e. by setting tripleo::firewall::manage_firewall to false; we need to set ManageFirewall to false in the parameter_defaults)?

If we can never reproduce this with the firewall off, then we have a good focus as to what could be going on. Other hypotheses are:
- Something odd with OVS during the puppet run
- os-net-config somehow disrupting things
- host is overloaded

(Note that we did not see any evidence of the three hypotheses above in the logs
of the system that Sofer gave to us)

Comment 9 Michele Baldessari 2016-11-07 17:01:20 UTC
Thanks Yurii.

Damien and I looked at the sosreports from comment 8 and here is what we observed:

On controller-0 we see the following:
Nov 07 11:15:48 controller-0.localdomain haproxy[22506]: Server glance_api/controller-1 is DOWN, re
Nov 07 11:15:51 controller-0.localdomain systemd[1]: Starting IPv4 firewall with iptables...
Nov 07 11:15:52 controller-0.localdomain systemd[1]: Started IPv4 firewall with iptables.

So from the first line we can deduce that for controller-0, the glance_api service
on controller-1 is not reachable. Right afterwards the firewall gets restarted on controller-0.

The reason for glance_api being down on controller-1 is very likely the restart of iptables:
Nov 07 11:15:39 controller-1.localdomain systemd[1]: Starting IPv4 firewall with iptables...
Nov 07 11:15:39 controller-1.localdomain systemd[1]: Started IPv4 firewall with iptables.

So to recap: the biggest suspect at the moment is the reload of firewall rules, which triggers a network disconnect between all controllers and brings the cluster into a state where it cannot really do much.
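
A quick way to line the two events up on each controller (a hedged sketch; adjust the time window to the incident being investigated):

journalctl --since "2016-11-07 11:14:00" --until "2016-11-07 11:20:00" \
    | grep -E 'iptables|haproxy'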

Yurii, could we run one test without firewall to dispel/confirm this hypothesis please?

Comment 11 Michele Baldessari 2016-11-07 17:33:55 UTC
So ideally we'd test by adding a custom env file to all deploy commands:
cat > disable-firewall.yaml <<EOF
parameter_defaults:
  ManageFirewall: false
EOF

Now there are two outcomes to this:
1) The problem does not appear again
In this case we need someone to help with the puppet-tripleo side of things, specifically the firewall module.

2) The problem does appear again
In this case we're back to the drawing board, but it still means that puppet is messing with the networking during the converge step. In this case we need to loop in some folks from networking.
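
For completeness, the workaround is then just one extra -e appended to the existing deploy command (shortened sketch; in practice keep every option and -e file from the deploy command in the description, and add the new file last so it takes precedence):

openstack overcloud deploy \
        --templates \
        -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
        -e /home/stack/disable-firewall.yaml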

Comment 13 Omri Hochman 2016-11-07 20:26:41 UTC
adding to automation: 

###################
##  WORKAROUND BZ BANDINI ISSUE with GALERA DURING CONVERGENCE STEP####
    cat >  <<EOF
parameter_defaults:
  ManageFirewall: false
EOF 


+ calling all DEPLOY_COMMAND  with -e /home/stack/disable-firewall.yaml

Comment 14 Fabio Massimo Di Nitto 2016-11-07 20:49:38 UTC
(In reply to Omri Hochman from comment #13)
> adding to automation: 
> 
> ###################
> ##  WORKAROUND BZ BANDINI ISSUE with GALERA DURING CONVERGENCE STEP####
>     cat >  <<EOF
> parameter_defaults:
>   ManageFirewall: false
> EOF 
> 
> 
> + calling all DEPLOY_COMMAND  with -e /home/stack/disable-firewall.yaml

Does this mean the workaround is working or are you testing the workaround?

Comment 15 Michele Baldessari 2016-11-08 07:29:01 UTC
While we're still verifying why the network disappears on convergence, we can safely assume that this has nothing to do with resource-agents. Reassigning

Comment 16 Michele Baldessari 2016-11-08 14:15:04 UTC
Ok, so Damien and I think we got to the bottom of this one.

When deploying mitaka/osp9 without a firewall, an overcloud node (a controller, for example) will be
in the following iptables state (everything is ACCEPTed):
[root@overcloud-controller-0 ~]# iptables -nvL
Chain INPUT (policy ACCEPT 818K packets, 147M bytes)
 pkts bytes target prot opt in out source destination
 683K 93M nova-api-INPUT all -- * * 0.0.0.0/0 0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination
    0 0 nova-filter-top all -- * * 0.0.0.0/0 0.0.0.0/0
    0 0 nova-api-FORWARD all -- * * 0.0.0.0/0 0.0.0.0/0

Chain OUTPUT (policy ACCEPT 812K packets, 119M bytes)
 pkts bytes target prot opt in out source destination
 675K 95M nova-filter-top all -- * * 0.0.0.0/0 0.0.0.0/0
 675K 95M nova-api-OUTPUT all -- * * 0.0.0.0/0 0.0.0.0/0

Chain nova-api-FORWARD (1 references)
 pkts bytes target prot opt in out source destination

Chain nova-api-INPUT (1 references)
 pkts bytes target prot opt in out source destination
    0 0 ACCEPT tcp -- * * 0.0.0.0/0 10.0.0.6 tcp dpt:8775

Chain nova-api-OUTPUT (1 references)
 pkts bytes target prot opt in out source destination

Chain nova-api-local (1 references)
 pkts bytes target prot opt in out source destination

Chain nova-filter-top (2 references)
 pkts bytes target prot opt in out source destination
 675K 95M nova-api-local all -- * * 0.0.0.0/0 0.0.0.0/0

But it seems that, at least on RHEL, we have the following in the iptables rules file:
[root@overcloud-controller-0 ~]# more /etc/sysconfig/iptables
# sample configuration for iptables service
# you can edit this manually or use system-config-firewall
# please do not ask us to add additional ports/services to this default configuration
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT

Now what happens is the following. When we run the convergence step during the M->N upgrade (i.e. we run the Newton puppet manifests), we actually have the firewall enabled by default. So what happens is that puppet starts the iptables service before applying the rules. At this point the only permitted traffic is ssh and icmp, which breaks the cluster because each
node is fully isolated. Only after all the rules are added is the traffic permitted again. In our environments this took over a minute, which can break certain resources.
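
One way to see this window live on a controller during the converge step (a hedged check based on the stock rules file shown above):

# If only the stock RHEL rules are loaded (ssh + icmp + established, then a
# blanket REJECT), the node is in the lockout window described above.
iptables -S INPUT
# During the window the output ends with (matching /etc/sysconfig/iptables above):
#   -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
#   -A INPUT -j REJECT --reject-with icmp-host-prohibited
# Once puppet has applied the TripleO rules, the cluster ports are accepted again.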

Comment 17 Yurii Prokulevych 2016-11-08 14:38:13 UTC
(In reply to Michele Baldessari from comment #11)
> So ideally we'd test by adding a custom env file to all deploy commands:
> cat > disable-firewall.yaml
> parameter_defaults:
>   ManageFirewall: false
> 
> Now there are two outcomes to this:
> 1) The problem does not appear again
> In this case we need someone to help with the puppet tripleo side of things.
> Specifically the firewall module
> 
> 2) The problem does appear again
> In this case we're back to the drawing board, but still it means that puppet
> is messing with the networking during the converge step. In this case we
> need to loop in some folks from networking

Running the convergence step with 'ManageFirewall: false' eliminates the issue and the step succeeds.

Comment 19 Jaromir Coufal 2016-11-08 17:56:04 UTC
Moving back to the Lifecycle team, since this has nothing to do with Galera or the cluster in general.

Comment 20 Michele Baldessari 2016-11-08 18:20:07 UTC
Patch merged in stable/newton, moving to POST

Comment 24 Omri Hochman 2016-11-15 20:05:41 UTC
Unable to reproduce with openstack-tripleo-heat-templates-5.0.0-1.7.el7ost.noarch

Comment 26 errata-xmlrpc 2016-12-14 16:29:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

