Bug 1391671
| Summary: | RHOS Upgrade failed: Attempted to promote Master instance of galera before bootstrap node has been detected. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Yurii Prokulevych <yprokule> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Michele Baldessari <michele> |
| Status: | CLOSED ERRATA | QA Contact: | nlevinki <nlevinki> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 10.0 (Newton) | CC: | abeekhof, agk, cluster-maint, dciabrin, fdinitto, jcoufal, jjoyce, jschluet, mandreou, mbayer, mburns, michele, ohochman, rhel-osp-director-maint, royoung, slinaber, tvignaud, yprokule |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-5.0.0-1.5.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-14 16:29:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yurii Prokulevych
2016-11-03 17:45:20 UTC
There were several connection losses between the corosync nodes, which led them to lose quorum and conservatively stop the resources running locally. At 15:37:00 all galera resources ended up being stopped by pacemaker:

```
Nov 03 15:37:05 controller-0.localdomain pengine[15637]: warning: Fencing and resource management disabled due to lack of quorum
Nov 03 15:37:05 controller-0.localdomain pengine[15637]: notice: Stop galera:0 (Master controller-0)
Nov 03 15:37:08 controller-0.localdomain crmd[15638]: notice: Result of demote operation for galera on controller-0: 0 (ok) | call=111 key=galera_demote_0 confirmed=true cib-upd
Nov 03 15:36:58 controller-1.localdomain pengine[25230]: warning: Fencing and resource management disabled due to lack of quorum
Nov 03 15:36:58 controller-1.localdomain pengine[25230]: notice: Stop galera:0 (Master controller-1)
Nov 03 15:37:02 controller-1.localdomain crmd[25231]: notice: Result of demote operation for galera on controller-1: 0 (ok) | call=89 key=galera_demote_0 confirmed=true cib-upda
Nov 03 15:37:03 controller-2.localdomain pengine[9232]: warning: Fencing and resource management disabled due to lack of quorum
Nov 03 15:37:03 controller-2.localdomain pengine[9232]: notice: Stop galera:0 (Master controller-2)
Nov 03 15:37:08 controller-2.localdomain crmd[9233]: notice: Result of demote operation for galera on controller-2: 0 (ok) | call=86 key=galera_demote_0 confirmed=true cib-updat
```

Around 15:37, all nodes were stopped by pacemaker intentionally, as seen in the mysql logs:

```
161103 15:37:08 [Note] /usr/libexec/mysqld: Shutdown complete
161103 15:37:01 [Note] /usr/libexec/mysqld: Shutdown complete
161103 15:37:07 [Note] /usr/libexec/mysqld: Shutdown complete
```

Then the corosync nodes seem to restore contact with each other, controller-0 becomes the DC, and decides to re-promote all galera resources.
For a reason I'm not completely sure of, pacemaker didn't _start_ the resources, but just _promoted_ them:

```
Nov 03 15:37:13 controller-0.localdomain pengine[15637]: notice: Promote galera:0 (Slave -> Master controller-2)
Nov 03 15:37:13 controller-0.localdomain pengine[15637]: notice: Promote galera:1 (Slave -> Master controller-1)
Nov 03 15:37:13 controller-0.localdomain pengine[15637]: notice: Promote galera:2 (Slave -> Master controller-0)
```

However, since no galera instance was running at the time (no resource in Master state), the resource agent had to bail out, because it needs to re-bootstrap the cluster before it can start a galera server as a "joining node":

```
Nov 03 15:37:53 controller-0.localdomain galera(galera)[12116]: ERROR: Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.
Nov 03 15:37:25 controller-1.localdomain galera(galera)[1154]: ERROR: Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.
Nov 03 15:37:13 controller-2.localdomain galera(galera)[11809]: ERROR: Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.
```

Yurii, given the profile of the overcloud machines (4 CPUs, 20 GB), the loss of quorum may be due to the corosync nodes not being able to communicate with each other fast enough during the upgrade process, which is resource-hungry. I'm not sure whether that's a supported configuration.

Andrew, can you confirm that the behaviour of pacemaker is expected w.r.t. stopping the resource, i.e. demote only rather than stop? Also, if it's expected that after an entire corosync cluster loss the galera resource agent should recover gracefully, is there a means to tell pacemaker that it would have to rerun a start or monitor action on every node before trying to promote all the nodes?
Small addition to comment #2: the 3 nodes cannot be "promote"d concurrently with the galera resource agent, because the agent's logic expects that the bootstrap node is known and its "promote" operation has finished before the other two nodes can be "promote"d.

(In reply to Damien Ciabrini from comment #2)
> Andrew, can you confirm that the behaviour of pacemaker is expected w.r.t
> stopping the resource, i.e. demote only rather than stop?

The observed behaviour is expected. If the cluster had not reformed, then each node would have continued shutting down galera. However we see:

```
Nov 03 15:37:07 [15638] controller-0.localdomain crmd: warning: crmd_ha_msg_filter: Another DC detected: controller-1 (op=noop)
```

which aborts any actions we were about to perform. So while the cluster will try to shut down galera when quorum is lost, you cannot rely on it completing before quorum is reattained and promotion attempted.

> Also, if it's expected that after an entire corosync cluster loss the galera
> resource agent should recover gracefully, is there a means to tell pacemaker
> that it would have to rerun a start or monitor action on every node before
> trying to promote all the nodes?

What I suspect you need to do is have the agent perform a stop as part of the demote action and return 7 (aka OCF_NOT_RUNNING). Is there any case in which a demote couldn't safely do a stop? That should work today, but I'm also planning some extra changes in pacemaker to have it behave more optimally.

The problem is that the corosync nodes cannot see each other for a certain amount of time. Something is disrupting the network communication and we need to find out what it is. So the theory Damien and I have so far is that during the convergence step something in puppet breaks the communication between the corosync nodes, and that is why each node cannot see any of the other nodes and things break down.
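The suggestion above can be sketched as follows. This is a minimal, hypothetical fragment, not the actual galera resource agent code: the function names and the stubbed stop action are illustrative only. The idea is that demote performs a full stop and then reports OCF_NOT_RUNNING (7), so pacemaker has to schedule a fresh start/monitor before attempting any promote.

```shell
#!/bin/sh
# Return codes defined by the OCF resource agent specification
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

galera_stop() {
    # A real agent would shut down mysqld here and wait for it to exit;
    # stubbed as a no-op for this sketch
    return $OCF_SUCCESS
}

galera_demote() {
    # Perform a full stop as part of demote, then report NOT_RUNNING so
    # pacemaker must run start/monitor before any promote attempt
    galera_stop || return $?
    return $OCF_NOT_RUNNING
}

rc=0
galera_demote || rc=$?
echo "demote rc=$rc"    # prints: demote rc=7
```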
The first suspect would seem to be the tripleo firewall class (at least in the environment Sofer gave to us it was on). Could we do a quick test without the firewall on? (i.e. by setting tripleo::firewall::manage_firewall to false — we need to set ManageFirewall to false in the parameter_defaults). If we can never reproduce this with the firewall off, then we have a good focus as to what could be going on. Other hypotheses are:
- Something odd with OVS during the puppet run
- os-net-config somehow disrupting things
- host is overloaded

(Note that we did not see any evidence of the three hypotheses above in the logs of the system that Sofer gave to us.)

Thanks Yurii. Damien and I looked at the sosreports from comment 8 and here is what we observed. On controller-0 we see the following:

```
Nov 07 11:15:48 controller-0.localdomain haproxy[22506]: Server glance_api/controller-1 is DOWN, re
Nov 07 11:15:51 controller-0.localdomain systemd[1]: Starting IPv4 firewall with iptables...
Nov 07 11:15:52 controller-0.localdomain systemd[1]: Started IPv4 firewall with iptables.
```

From the first line we can deduce that, for controller-0, the glance_api service on controller-1 is not reachable. Right afterwards the firewall gets restarted on controller-0. The reason for glance_api being down on controller-1 is very likely the restart of iptables:

```
Nov 07 11:15:39 controller-1.localdomain systemd[1]: Starting IPv4 firewall with iptables...
Nov 07 11:15:39 controller-1.localdomain systemd[1]: Started IPv4 firewall with iptables.
```

To recap: the biggest suspect at the moment is the reload of firewall rules, which triggers a network disconnect between all controllers and brings the cluster into a state where it cannot really do much. Yurii, could we run one test without the firewall to dispel/confirm this hypothesis, please?
So ideally we'd test by adding a custom env file to all deploy commands:

```
cat > disable-firewall.yaml <<EOF
parameter_defaults:
  ManageFirewall: false
EOF
```

Now there are two outcomes to this:
1) The problem does not appear again. In this case we need someone to help with the puppet tripleo side of things, specifically the firewall module.
2) The problem does appear again. In this case we're back to the drawing board, but it still means that puppet is messing with the networking during the converge step, and we need to loop in some folks from networking.

adding to automation:

```
###################
## WORKAROUND BZ BANDINI ISSUE with GALERA DURING CONVERGENCE STEP ####
cat > /home/stack/disable-firewall.yaml <<EOF
parameter_defaults:
  ManageFirewall: false
EOF
```

+ calling all DEPLOY_COMMAND with -e /home/stack/disable-firewall.yaml

(In reply to Omri Hochman from comment #13)
> adding to automation:
> [...]
> + calling all DEPLOY_COMMAND with -e /home/stack/disable-firewall.yaml

Does this mean the workaround is working, or are you testing the workaround?

While we're still verifying why the network disappears on convergence, we can safely assume that this has nothing to do with resource-agents. Reassigning.

Ok, so Damien and I think we got to the bottom of this one.
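For reference, the workaround written out as a complete, runnable heredoc. The file is created in the current directory here for illustration; the automation writes it to /home/stack/disable-firewall.yaml and passes that path to every deploy command.

```shell
# Create the environment file that turns off firewall management
# during the converge step
cat > disable-firewall.yaml <<'EOF'
parameter_defaults:
  ManageFirewall: false
EOF

# Every deploy invocation then needs the extra flag, e.g.:
#   openstack overcloud deploy ... -e /home/stack/disable-firewall.yaml
cat disable-firewall.yaml
```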
When deploying mitaka/osp9 without a firewall, an overcloud node (a controller, for example) will be in the following iptables state (everything is ACCEPTed):

```
[root@overcloud-controller-0 ~]# iptables -nvL
Chain INPUT (policy ACCEPT 818K packets, 147M bytes)
 pkts bytes target            prot opt in  out source     destination
 683K   93M nova-api-INPUT    all  --  *   *   0.0.0.0/0  0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target            prot opt in  out source     destination
    0     0 nova-filter-top   all  --  *   *   0.0.0.0/0  0.0.0.0/0
    0     0 nova-api-FORWARD  all  --  *   *   0.0.0.0/0  0.0.0.0/0

Chain OUTPUT (policy ACCEPT 812K packets, 119M bytes)
 pkts bytes target            prot opt in  out source     destination
 675K   95M nova-filter-top   all  --  *   *   0.0.0.0/0  0.0.0.0/0
 675K   95M nova-api-OUTPUT   all  --  *   *   0.0.0.0/0  0.0.0.0/0

Chain nova-api-FORWARD (1 references)
 pkts bytes target            prot opt in  out source     destination

Chain nova-api-INPUT (1 references)
 pkts bytes target            prot opt in  out source     destination
    0     0 ACCEPT            tcp  --  *   *   0.0.0.0/0  10.0.0.6    tcp dpt:8775

Chain nova-api-OUTPUT (1 references)
 pkts bytes target            prot opt in  out source     destination

Chain nova-api-local (1 references)
 pkts bytes target            prot opt in  out source     destination

Chain nova-filter-top (2 references)
 pkts bytes target            prot opt in  out source     destination
 675K   95M nova-api-local    all  --  *   *   0.0.0.0/0  0.0.0.0/0
```

But it seems that, at least on RHEL, we have the following in the iptables rules file:

```
[root@overcloud-controller-0 ~]# more /etc/sysconfig/iptables
# sample configuration for iptables service
# you can edit this manually or use system-config-firewall
# please do not ask us to add additional ports/services to this default configuration
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
```

Now what happens is the following. When we run the convergence step during the M->N upgrade (i.e. we run the newton puppet manifests), we actually have the firewall enabled by default. So puppet starts the iptables service before applying the rules. At this point the only permitted traffic is ssh and icmp, which breaks the cluster because each node is fully isolated. Only after all the rules are added is the traffic permitted again. In our environments this took over a minute, which can break certain resources.

(In reply to Michele Baldessari from comment #11)
> So ideally we'd test by adding a custom env file to all deploy commands:
> [...]

Running the convergence step with 'ManageFirewall: false' eliminates the issue and the step succeeds.

Moving back to the Lifecycle team since it has nothing to do with Galera or the cluster in general.

Patch merged in stable/newton, moving to POST.

Unable to reproduce with: openstack-tripleo-heat-templates-5.0.0-1.7.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
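The isolation window can be illustrated from the rules file alone. This small sketch rewrites the default RHEL rules quoted above to a scratch file and counts what they accept; the /tmp path is just for this illustration.

```shell
# Reproduce the default RHEL rules file quoted above and inspect it,
# to show why a freshly started iptables service isolates the node
# until puppet re-applies the full rule set.
cat > /tmp/default-iptables <<'EOF'
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
EOF

# Only ssh (tcp/22), icmp, loopback and already-established flows get
# through; corosync, galera and haproxy traffic hits the final REJECT.
grep -c -- '-j ACCEPT' /tmp/default-iptables    # prints: 4
```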