Bug 1396904

Summary: rhel-osp-director: minor update: os-net-config package update restarts the network when corosync is already up and running
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Reporter: Alexander Chuzhoy <sasha>
Assignee: Brent Eagles <beagles>
QA Contact: Alexander Chuzhoy <sasha>
Status: CLOSED ERRATA
Severity: unspecified
Priority: high
Version: 10.0 (Newton)
Target Milestone: rc
Target Release: 10.0 (Newton)
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
CC: dbecker, fdinitto, jcoufal, jschluet, jslagle, mandreou, mbayer, mburns, mcornea, michele, morazi, rhel-osp-director-maint, rscarazz, sasha
Fixed In Version: openstack-tripleo-heat-templates-5.1.0-6.el7ost
Last Closed: 2016-12-14 16:34:15 UTC
Type: Bug
Attachments:
- tail -10 overcloud nodes /var/log/yum.log
- list_nodes_status output is clean
- grep for error from description
- some of the controller journal from sos reports in comment #5
- some of the controller journal from sos reports in comment #5

Description Alexander Chuzhoy 2016-11-21 06:24:58 UTC
rhel-osp-director:  OSP10 minor update fails: ERROR galera sync timed out

Environment:
instack-undercloud-5.1.0-2.el7ost.noarch
openstack-puppet-modules-9.3.0-1.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch


Steps to reproduce:
1. Deploy overcloud with composable roles + ipv6:
openstack overcloud deploy --templates $THT \
-r ~/openstack_deployment/roles/roles_data.yaml \
-e $THT/environments/network-isolation-v6.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/tls-endpoints-public-ip.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/public_vip.yaml \
-e ~/openstack_deployment/environments/enable-tls.yaml \
-e ~/openstack_deployment/environments/inject-trust-anchor.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml \
--log-file overcloud_deployment.log &> overcloud_install.log



2. Try a minor update.

Result:
The minor update fails.

Heat shows:
+-----------------------------------------------+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                 | physical_resource_id                                                            | resource_type                                                                                                         | resource_status | updated_time         | stack_name                                                                                                                                                               |
+-----------------------------------------------+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Controller                                    | ca2744b3-4929-4fc7-bfcb-c9e2c9d19429                                            | OS::Heat::ResourceGroup                                                                                               | UPDATE_FAILED   | 2016-11-21T02:11:56Z | overcloud                                                                                                                                                                |
| 2                                             | 1d59538b-0d0e-494b-a190-ce5c1d66e963                                            | OS::TripleO::Controller                                                                                               | UPDATE_FAILED   | 2016-11-21T02:12:04Z | overcloud-Controller-zjofqzf7ebp7                                                                                                                                        |
| UpdateDeployment                              | cccd70cd-cb16-4006-80e3-2539ca8a20ae                                            | OS::Heat::SoftwareDeployment                                                                                          | UPDATE_FAILED   | 2016-11-21T02:40:16Z | overcloud-Controller-zjofqzf7ebp7-2-6p2jzmjmjc55                                                                                                                         |
+-----------------------------------------------+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.
ERROR galera sync timed out
", "deploy_stderr": "", "update_managed_packages": "false", "deploy_status_code": 1 }, "creation_time": "2016-11-21T01:16:40Z", "updated_time": "2016-11-21T03:12:32Z", "input_values": { "update_identifier": "1479694135" }, "action": "UPDATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "cccd70cd-cb16-4006-80e3-2539ca8a20ae" }


Logs on controller show:
INFO: Galera instance wsrep_cluster_status=non-Primary
Nov 21 02:26:57 overcloud-controller-0.localdomain galera(galera)[272198]: 
ERROR: local node <overcloud-controller-0> is started, but not in primary mode. Unknown state


pcs status output:
[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Mon Nov 21 06:24:20 2016          Last change: Mon Nov 21 01:39:41 2016 by root via cibadmin on overcloud-controller-2

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-2001.db8.ca2.4..10  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-fd00.fd00.fd00.3000..17     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-fd00.fd00.fd00.2000..1a     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-2 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-1 ]
 ip-fd00.fd00.fd00.2000..10     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 ip-192.168.0.15        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-fd00.fd00.fd00.4000..19     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Expected result:
Successful update.

Comment 1 Marios Andreou 2016-11-21 07:56:04 UTC
Created attachment 1222314 [details]
tail -10 overcloud nodes /var/log/yum.log

Comment 2 Marios Andreou 2016-11-21 08:02:26 UTC
Created attachment 1222317 [details]
list_nodes_status output is clean

Comment 3 Marios Andreou 2016-11-21 08:11:26 UTC
Created attachment 1222318 [details]
grep for error from description

Comment 4 Marios Andreou 2016-11-21 08:15:48 UTC
(In reply to Alexander Chuzhoy from comment #0)
> rhel-osp-director:  OSP10 minor update fails: ERROR galera sync timed out
> 
> 2. Try to minor update.


To be clear, this is a fresh OSP10 deployment with IPv6, 3 HA controllers, and also 3 service nodes (composable roles). Some clarifications please, sasha; something doesn't add up here:

> 
> Result:
> The minor update fails.
> 
> Heat shows:
> Galera cluster node is not synced.
> ERROR galera sync timed out
> ", "deploy_stderr": "", "update_managed_packages": "false",
> "deploy_status_code": 1 }, "creation_time": "2016-11-21T01:16:40Z",
> "updated_time": "2016-11-21T03:12:32Z", "input_values": {
> "update_identifier": "1479694135" }, "action": "UPDATE", "status_reason":
> "deploy_status_code : Deployment exited with non-zero status code: 1", "id":
> "cccd70cd-cb16-4006-80e3-2539ca8a20ae" }
> 
> 
> Logs on controller show:
> INFO: Galera instance wsrep_cluster_status=non-Primary
> Nov 21 02:26:57 overcloud-controller-0.localdomain galera(galera)[272198]: 
> ERROR: local node <overcloud-controller-0> is started, but not in primary
> mode. Unknown state
> 


Which controller did you see this error on? ^^^ I see this error exactly once, only on controller 0 (https://bugzilla.redhat.com/attachment.cgi?id=1222318), but it has a different timestamp from what is reported above:

       " 19420:Nov 20 21:26:57 host-192-168-0-18 galera(galera)[272133]: ERROR: local node <overcloud-controller-0> is started, but not in primary mode. Unknown state." 

Poking at the undercloud in the env you sent via email, I see 

        -rw-r--r--. 1 stack stack       389 Nov 21 02:04 overcloud_install.log
and
        -rwxr-xr-x. 1 stack stack        55 Nov 21 02:07 update_command

I wonder if this is a case of not giving the cluster enough time to settle and services to start after the deployment (it looks like you started the update ~3 minutes after the deploy ^^). But then I am still confused by that galera error; did it happen during the initial deploy?

Can you give any more details about where the update failed? I mean, which nodes, if any, were already successfully updated? I *think* the controllers were updated (based on tail -10 /var/log/yum.log, https://bugzilla.redhat.com/attachment.cgi?id=1222314), but none of the other nodes.

The cluster (pcs status) looks green to me (no errors/failed services) and the output of list_nodes_status shows nothing, i.e. no heat resources deployed with a status other than 0 (https://bugzilla.redhat.com/attachment.cgi?id=1222317).

So, atm there is nothing to debug here; sasha, please clarify. Does this reproduce?

thanks, marios

Comment 6 Alexander Chuzhoy 2016-11-21 17:42:16 UTC
Reproduced.

Comment 7 Alexander Chuzhoy 2016-11-21 17:53:55 UTC
The job waited for 10 mins before starting the update.

Seems like not all nodes have their RPMs updated:
[stack@undercloud-0 ~]$ for i in `nova list|awk '/ACTIVE/ {print $(NF-1)}'|awk -F"=" '{print $NF}'`; do ssh heat-admin@$i "hostname; sudo tail /var/log/yum.log"; done

overcloud-cephstorage-0.localdomain         
Nov 21 16:31:15 Installed: rhos-release-1.1.8-1.noarch  
Nov 21 16:31:29 Installed: wget-1.14-13.el7.x86_64 


overcloud-compute-0.localdomain 
Nov 21 16:31:32 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:31:46 Installed: wget-1.14-13.el7.x86_64   
Nov 21 16:46:16 Updated: puppet-neutron-9.4.0-4.el7ost.noarch            
Nov 21 16:46:16 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch              
                      
overcloud-controller-0.localdomain 
Nov 21 16:55:28 Updated: 1:openstack-manila-3.0.0-5.el7ost.noarch                     
Nov 21 16:55:28 Updated: 1:openstack-manila-share-3.0.0-5.el7ost.noarch               
Nov 21 16:55:28 Updated: nss-tools-3.21.3-2.el7_3.x86_64                              
Nov 21 16:55:28 Updated: 1:openstack-heat-api-7.0.0-7.el7ost.noarch                   
Nov 21 16:55:28 Updated: 1:openstack-heat-engine-7.0.0-7.el7ost.noarch                
Nov 21 16:55:28 Updated: 1:openstack-heat-api-cloudwatch-7.0.0-7.el7ost.noarch                             
Nov 21 16:55:28 Updated: 1:openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
Nov 21 16:55:28 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
Nov 21 16:55:28 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:55:28 Updated: os-net-config-5.0.0-5.el7ost.noarch


overcloud-controller-1.localdomain
Nov 21 17:02:22 Updated: 1:openstack-manila-3.0.0-5.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-manila-share-3.0.0-5.el7ost.noarch
Nov 21 17:02:22 Updated: nss-tools-3.21.3-2.el7_3.x86_64
Nov 21 17:02:22 Updated: 1:openstack-heat-api-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-heat-engine-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-heat-api-cloudwatch-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
Nov 21 17:02:23 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 17:02:23 Updated: os-net-config-5.0.0-5.el7ost.noarch
overcloud-controller-2.localdomain
Nov 21 16:59:26 Updated: 1:openstack-manila-3.0.0-5.el7ost.noarch
Nov 21 16:59:26 Updated: 1:openstack-manila-share-3.0.0-5.el7ost.noarch
Nov 21 16:59:27 Updated: nss-tools-3.21.3-2.el7_3.x86_64
Nov 21 16:59:27 Updated: 1:openstack-heat-api-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: 1:openstack-heat-engine-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: 1:openstack-heat-api-cloudwatch-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: 1:openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
Nov 21 16:59:27 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:59:27 Updated: os-net-config-5.0.0-5.el7ost.noarch


overcloud-objectstorage-0.localdomain
Nov 21 16:32:48 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:33:03 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:43:26 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:43:26 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch


overcloud-serviceapi-0.localdomain
Nov 21 16:33:06 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:33:23 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:49:29 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:49:29 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch


overcloud-serviceapi-1.localdomain
Nov 21 16:33:26 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:33:43 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:52:19 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:52:19 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch

overcloud-serviceapi-2.localdomain
Nov 21 16:33:47 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:34:01 Installed: wget-1.14-13.el7.x86_64

Comment 9 Marios Andreou 2016-11-22 12:11:17 UTC
Created attachment 1222689 [details]
some of the controller journal from sos reports in comment #5

From the timestamps this looks like it is happening before the minor update is attempted. Are we sure the deployment was in a good state before starting the update?

Comment 10 Marios Andreou 2016-11-22 12:21:55 UTC
Created attachment 1222705 [details]
some of the controller journal from sos reports in comment #5

From the timestamps this looks like it is happening before the minor update is attempted. Are we sure the deployment was in a good state before starting the update?

Comment 11 Marios Andreou 2016-11-22 12:26:01 UTC
The behaviour you are seeing here may be related to BZ 1391671, but at this point the sos reports don't strongly support this theory (I see a restart of the firewall as per bandini's comment https://bugzilla.redhat.com/show_bug.cgi?id=1391671#c9, but it happens far earlier than the update, at "Nov 20 20:23:15 host-192-168-0-18 systemd: Starting IPv4 firewall with iptables...", and only once, so it's not clear yet). I do see the following a lot in the controller logs, however, and it explicitly mentions the firewall:

Nov 21 01:34:49 host-192-168-0-18 corosync[266094]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

Trying to trace when this first started ^^^, I think it is as attached in comment #10, which seems to be before the minor update.

Sasha, can you please verify this recreates on a completely clean/new environment, and be sure to keep the env around so we can debug further. I think we may ultimately escalate to PIDONE.

Comment 14 Michele Baldessari 2016-11-22 18:09:25 UTC
So we looked at this in more detail and here is what happens (taking overcloud-controller-0 as a reference, but the same problem happens on all the controllers):
1) The cluster on the node gets stopped
2) yum update upgrades packages (amongst which os-net-config)
3) The cluster on the node gets started
4) The new os-net-config triggers an ifdown and ifup of the interfaces

Step 4) is the problematic one. Corosync does not cope with interfaces being deleted and re-added (handling this is planned for corosync 3), and we get the logs fully spammed with:
Nov 21 02:28:48  overcloud-controller-0.localdomain corosync[266094]:  [MAIN  ] Totem is  unable to form a cluster because of an operating system or network  fault. The most common cause of thi...

We *must* guarantee that any network restart happens before corosync gets started. Maybe calling os-net-config manually within yum_update.sh? Or somehow making os-net-config smarter about which interfaces it can restart and which ones it cannot.

Marios, how can we make sure that either a) os-net-config runs while the cluster is down, or b) os-net-config is smarter about what it can restart?
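One way to sketch option a) above (a hedged sketch, not the actual upstream patch; the os-net-config flags, config path, and exit-code convention are assumptions) is to run os-net-config between the yum update and the cluster restart, so any interface bounce happens while corosync is still down:

```shell
#!/bin/sh
# Sketch only: run os-net-config while the cluster is stopped, so an
# ifdown/ifup never hits a running corosync. Flags/paths are assumptions.

# With --detailed-exit-codes, os-net-config is assumed to return 0 when
# nothing changed and 2 when it applied (and bounced) interface changes;
# both are acceptable here because the cluster is down.
onc_rc_ok() {
    [ "$1" -eq 0 ] || [ "$1" -eq 2 ]
}

minor_update_node() {
    pcs cluster stop                                  # 1) stop cluster on this node
    yum -y update                                     # 2) update packages (incl. os-net-config)
    os-net-config -c /etc/os-net-config/config.json \
        -v --detailed-exit-codes                      # apply/bounce interfaces now
    onc_rc_ok $? || return 1
    pcs cluster start                                 # 3) start cluster; corosync never
}                                                     #    sees the interface bounce
```

The key design point is only the ordering: the interface bounce is moved inside the window where the cluster is already stopped.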

Comment 17 James Slagle 2016-11-22 20:01:41 UTC
(In reply to Michele Baldessari from comment #14)
> So we looked at this in more detail and here is what happens (taking
> overcloud-controller-0 as a reference, but the same problem happens on all
> the controllers):
> 1) Cluster on the node gets stopped
> 2) yum update upgrade packages (amongst which os-net-config)
> 3) Cluster on the node gets started
> 4) The new os-net-config triggers and ifdown and ifup of the interface
> 
> Step 4) is the problematic one. Corosync does not cope with interfaces being
> deleted (planned for corosync 3) and then readded and we get the logs fully
> spammed with:
> Nov 21 02:28:48  overcloud-controller-0.localdomain corosync[266094]:  [MAIN
> ] Totem is  unable to form a cluster because of an operating system or
> network  fault. The most common cause of thi...
> 
> We *must* guarantee that any network restart happens before corosync gets
> started. Maybe calling os-net-config manually within yum_update.sh ? Or
> somehow making os-net-config smarter about which interfaces it can restart
> and which ones it cannot restart.
> 
> Marios, how can we make sure that either a) os-net-config run while the
> cluster is down or b) os-net-config smarter about what it can restart?

How did you determine that 4 (the os-net-config restart) is happening?

I'd just like to know where to start looking.

Comment 18 James Slagle 2016-11-22 20:03:07 UTC
got the info on irc:

15:01 <   bandini> slagle: https://review.openstack.org/#/c/397405/ see also giulio's comment on patchset 3. So I think that due to the fact that os-net-config gets updated we have this ifdown/ifup sequence
15:01 <   supybot> Title: Gerrit Code Review (at review.openstack.org)
15:01 <   bandini> slagle: makes sense?
15:02 <    slagle> bandini: yea, i think so
15:02 <    slagle> os-net-config detects that there is a change in the interface config, so it has to bounce it

Comment 19 James Slagle 2016-11-22 20:24:08 UTC
beagles said he would look into this one initially at least

Comment 20 Brent Eagles 2016-11-23 03:45:39 UTC
patch currently proposed upstream on master: https://review.openstack.org/#/c/400960/

The snippet works in isolation. I'm trying to get an HA env set up so I can test it in an HA context, but it's taking a while. If anybody has an existing setup that they can pull the patch into to give it a try, it would be much appreciated.

Comment 24 Marios Andreou 2016-11-24 12:35:52 UTC
Waiting for CI to land the master patch; in the meantime I cherry-picked to newton, so updating trackers:

https://review.openstack.org/#/c/401987/ stable/newton

Comment 25 Brent Eagles 2016-11-24 15:14:56 UTC
Description of the "fast version" of how this patch was tested.

On an HA setup:
 - make a copy of yum_update.sh
 - remove the preamble specific to being driven by heat
 - copy yum_update.sh to the controller nodes
 - on one node, leave the interface files alone and run it. What we expect to see:
   - pcs commands to stop cluster services
   - yum update of regular packages
   - yum update for openvswitch
   - os-net-config runs successfully; with no interface changes, it doesn't bounce the interfaces
   - pcs commands to start cluster services
   - cluster settles
  This worked fine. os-net-config checked all of the interfaces and decided no changes were necessary.
 - on another node, edit ifcfg-br-ex to change it in some way (I removed the standalone setting) and run it. We expect to see:
   - pcs commands to stop cluster services
   - yum update of regular packages
   - yum update for openvswitch
   - os-net-config runs successfully, changing the modified interface back to the configured values and bouncing the interfaces
   - pcs commands to start cluster services
   - cluster settles
  This also worked as expected.

I copied the script to a compute node and ran the same test, but we don't expect to see anything relevant there because yum_update.sh exits before it gets to the os-net-config step. I don't think this is an issue: os-refresh-config will run os-net-config and update the OVS bridges configured on that node, but this would only impact compute nodes that are not multi-nic or DVR, and any impact should be short-lived.
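The "cluster settles" check in the test steps above can be scripted; here is a minimal sketch (the grep heuristic is an assumption, based on pcs status printing a "Failed Actions:" section when resources have failed):

```shell
#!/bin/sh
# Sketch: pass "pcs status" output as the argument; succeed only if it
# reports no failures. Heuristic assumption: failures show up as lines
# containing "failed" (e.g. the "Failed Actions:" section).
cluster_settled() {
    ! printf '%s\n' "$1" | grep -qi 'failed'
}

# Usage on a controller after yum_update.sh completes (not run here):
#   cluster_settled "$(sudo pcs status)" && echo "cluster settled"
```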

Comment 27 Alexander Chuzhoy 2016-11-29 19:12:19 UTC
Verified:
Environment:
openstack-tripleo-heat-templates-5.1.0-6.el7ost.noarch

The reported issue doesn't reproduce.
Updated successfully.

Comment 29 errata-xmlrpc 2016-12-14 16:34:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html