Bug 1388546

Summary: Upgrade of openvswitch-2.4.0-1.el7 makes ip disappears. (osp8)
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: openstack-tripleo-heat-templates
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: urgent
Docs Contact:
Priority: medium
Version: 8.0 (Liberty)
CC: achernet, alan_bishop, aloughla, apevec, arkady_kanevsky, audra_cooper, cdevine, christopher_dearborn, chrisw, david_paterson, jcoufal, John_walsh, kazen, kbader, kurt_hey, lbezdick, lhh, mandreou, markmc, mburns, randy_perryman, rhel-osp-director-maint, rhos-maint, rsussman, sathlang, srevivo
Target Milestone: async
Keywords: Reopened, Triaged, ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-0.8.14-23.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1364540
: 1394322
Environment:
Last Closed: 2017-01-05 14:37:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1364540    
Bug Blocks: 1261979, 1305654, 1337794, 1388543, 1394322, 1406478    

Comment 1 Marios Andreou 2016-10-31 14:40:36 UTC
The full 'fix' here consists of two reviews. They were originally tracked as independent fixes for BZ 1364540 (the original ovs upgrade workaround) and then BZ 1388675 (which added --replacepkgs in case ovs was already upgraded, and fixed a syntax nit in the ceph upgrade script).

The fixes are https://review.openstack.org/#/c/389753/ (already in newton; needs to be cherry-picked to mitaka and liberty) and https://review.openstack.org/#/c/390792/ (still in review on master; needs to be cherry-picked to newton, mitaka and liberty).
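
A minimal sketch of what this kind of special-case handling could look like, for readers who want the idea without opening the reviews. It assumes the trigger for the lost IP is a service restart in the openvswitch package's postuninstall scriptlet, and that upgrading with rpm's --nopostun (plus the --replacepkgs mentioned above) avoids it. The function names are hypothetical and this is not the code from the reviews.

```shell
# Reads `rpm -q --scripts <pkg>` output on stdin and succeeds if the
# postuninstall scriptlet contains a `systemctl ... try-restart` call.
postun_has_restart() {
    awk '
        /^postuninstall scriptlet/ { in_postun = 1; next }
        /scriptlet \(using/        { in_postun = 0 }
        in_postun
    ' | grep -q 'systemctl.*try-restart'
}

# Hypothetical wrapper: download the newest openvswitch and upgrade it
# without running the old package's postuninstall scriptlet.
# Assumes yumdownloader (yum-utils) is installed and repos are enabled.
upgrade_ovs_without_postun() {
    if rpm -q --scripts openvswitch | postun_has_restart; then
        dir=$(mktemp -d) || return 1
        yumdownloader --resolve --destdir "$dir" openvswitch || return 1
        # --replacepkgs covers the case where the new version is already installed
        sudo rpm -U --replacepkgs --nopostun "$dir"/*.rpm
    else
        echo "no restart in postun detected, skipping manual upgrade"
    fi
}
```

Nothing runs until upgrade_ovs_without_postun is called explicitly on the node being updated.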

Comment 12 Omri Hochman 2016-12-19 18:17:49 UTC
We need to make sure that OSP8 running on RHEL 7.2 can be updated to the latest OSP8 running on RHEL 7.3 without losing the IP on the overcloud during the openvswitch upgrade.

Comment 13 Alexander Chuzhoy 2016-12-19 22:14:43 UTC
Environment:
openstack-tripleo-heat-templates-0.8.14-23.el7ost.noarch


Deployed 8.0GA with rhel7.2

Updated to latest OSP8 with rhel7.3


The osp version is: openvswitch-2.4.0-2.el7_2.x86_64

The controllers were reachable. 
Rebooted all the nodes in the setup and verified that all nodes are reachable via ctlplane network after reboot.


Marios,
Is this enough to verify this bug, or would you like me to check something else?

Comment 14 Alexander Chuzhoy 2016-12-19 22:16:30 UTC
Slight correction to comment #13;
The openvswitch version is not osp of course, meant to write "ovs".

Note, that after minor update of 8.0 - it's not 2.5

Comment 15 Marios Andreou 2016-12-20 15:11:09 UTC
(In reply to Alexander Chuzhoy from comment #14)
> Slight correction to comment #13;
> The openvswitch version is not osp of course, meant to write "ovs".
> 
> Note, that after minor update of 8.0 - it's not 2.5

Right, this BZ and the special-case 'fix' handling we carry in the review tracker are about upgrading openvswitch 2.4 to 2.5, so if you're not getting to openvswitch 2.5 then it's not verifying this bug, imo. Maybe we need to ping mburns about the downstream build/status of openvswitch 2.5 on OSP 8.

Comment 16 Alexander Chuzhoy 2016-12-20 21:33:09 UTC
FailedQA:
Environment:
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
instack-undercloud-2.2.7-8.el7ost.noarch


Deployed OSP8.0GA:
instack-undercloud-2.2.7-4.el7ost.noarch
openvswitch-2.4.0-2.el7_2.x86_64
Checked the status of services:
● ovirt-guest-agent.service    loaded failed failed    oVirt Guest Agent

Checked the IP with:
ip a s dev br-ctlplane:

5: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:18:86:d3:7b:59 brd ff:ff:ff:ff:ff:ff
    inet 192.0.2.1/24 brd 192.0.2.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::218:86ff:fed3:7b59/64 scope link
       valid_lft forever preferred_lft forever


Ran yum update on the undercloud:
relevant rpms version:
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
instack-undercloud-2.2.7-8.el7ost.noarch

List of failed services:
● httpd.service    loaded failed     failed          The Apache HTTP Server
● openstack-ceilometer-api.service   loaded failed failed  OpenStack ceilometer API service
● openstack-heat-api-cfn.service   loaded failed     failed     Openstack Heat CFN-compatible API Service
● openstack-heat-api-cloudwatch.service  loaded failed  failed   OpenStack Heat CloudWatch API Service
● openstack-heat-api.service  loaded failed  failed  OpenStack Heat API Service
● openstack-ironic-api.service loaded failed failed OpenStack Ironic API service
● ovirt-guest-agent.service  loaded failed   failed          oVirt Guest Agent
● rabbitmq-server.service  loaded failed     failed          RabbitMQ broker

The IP is gone from br-ctlplane interface. Output from ip a s dev br-ctlplane:
12: br-ctlplane: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 00:18:86:d3:7b:59 brd ff:ff:ff:ff:ff:ff
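
The before/after comparison above can be scripted. This is a small sketch, assuming the only thing of interest is whether an IPv4 address is still configured on the interface; has_ipv4 and check_iface are hypothetical helper names.

```shell
# Reads `ip a s dev <iface>` output on stdin; succeeds if at least one
# IPv4 address ("inet a.b.c.d/nn") is listed.
has_ipv4() {
    grep -Eq '^[[:space:]]+inet [0-9]+(\.[0-9]+){3}/'
}

# Convenience wrapper around the live command.
check_iface() {
    ip a s dev "$1" | has_ipv4
}
```

For example, `check_iface br-ctlplane || echo "br-ctlplane lost its IP"` could be run before and after the yum update.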

Comment 17 Alexander Chuzhoy 2016-12-20 21:41:49 UTC
Note: If I reboot the undercloud, the IP "returns" after reboot and the list of failed services is reduced to
● ovirt-guest-agent.service   loaded failed failed    oVirt Guest Agent

Which was exactly the case before the update.



[stack@instack ~]$ ip a s dev br-ctlplane
8: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 00:18:86:d3:7b:59 brd ff:ff:ff:ff:ff:ff
    inet 192.0.2.1/24 brd 192.0.2.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::218:86ff:fed3:7b59/64 scope link
       valid_lft forever preferred_lft forever

Comment 18 Alexander Chuzhoy 2016-12-21 04:02:23 UTC
The subsequent overcloud update failed after a long time.

Checking where the update took place:

[stack@instack ~]$ for i in 192.0.2.{7..12}; do echo $i; ssh heat-admin@$i "hostname; sudo grep -i update /var/log/yum.log"; done
192.0.2.7
overcloud-controller-2.localdomain
192.0.2.8
overcloud-compute-1.localdomain
192.0.2.9
overcloud-compute-0.localdomain
192.0.2.10
overcloud-controller-0.localdomain
192.0.2.11
overcloud-cephstorage-0.localdomain
Dec 20 21:49:19 Updated: 1:openstack-puppet-modules-7.1.5-1.el7ost.noarch
192.0.2.12
overcloud-controller-1.localdomain
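
To make the result easier to read, the loop output can be reduced to the hosts that show no update entries at all. This is a sketch that assumes the exact output layout shown above (an IP line, then the hostname, then any matching yum.log lines); hosts_without_updates is a hypothetical helper name.

```shell
# Reads the loop output on stdin and prints every hostname that has no
# "Updated:" line following it.
hosts_without_updates() {
    awk '
        /^[0-9.]+$/ { if (host != "" && !updated) print host
                      host = ""; updated = 0; want_host = 1; next }
        want_host   { host = $0; want_host = 0; next }
        /Updated:/  { updated = 1 }
        END         { if (host != "" && !updated) print host }
    '
}
```

Piping the loop into hosts_without_updates lists the nodes the update never reached.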

Comment 23 Marios Andreou 2016-12-21 16:54:36 UTC
(In reply to Alexander Chuzhoy from comment #16)
> FailedQA:

> Ran yum update on the undercloud:
> relevant rpms version:
> openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
> instack-undercloud-2.2.7-8.el7ost.noarch
> 


@Sasha as mentioned on irc, losing the IP after upgrading openvswitch on the *undercloud* is a different (though obviously related) issue, and for 9->10 we added an explicit 'sudo systemctl stop openvswitch' before the undercloud upgrade.

Can you please check if it works as a workaround here too? If not, then we probably need to reach out to the ovs guys for more debugging (i.e. file a distinct bug for the openvswitch 2.4 to 2.5 upgrade on the OSP8 undercloud).

WRT the overcloud update as per comment #18 I'd rather first make sure we have a good setup (undercloud upgrade completed fine with the workaround) and then see if the overcloud update fails.

Comment 24 Alexander Chuzhoy 2016-12-22 21:02:18 UTC
Verified:
Environment:
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64



Following comment #23 I reran the update on OSP8 with the following procedure:

1) on the undercloud node:
Stop all services starting with:
openstack-*
neutron-*
openvswitch.service

2) Make sure the updates are available (take care of missing repos if needed)

3) openstack undercloud upgrade


Then I ran the overcloud update normally and it completed successfully.
I was able to ping all OC nodes.
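
Step 1 of the procedure above could be scripted roughly as follows. This is a sketch, assuming the services to stop are exactly the running units whose names start with openstack- or neutron-, plus openvswitch.service; the function names are hypothetical.

```shell
# Filters unit names on stdin down to the ones named in the procedure.
units_to_stop() {
    grep -E '^(openstack-|neutron-).*\.service$|^openvswitch\.service$'
}

# Stops the matching running services on the undercloud.
stop_undercloud_services() {
    systemctl list-units --type=service --state=running --no-legend --plain \
        | awk '{print $1}' \
        | units_to_stop \
        | xargs -r sudo systemctl stop
}
```

After stop_undercloud_services returns, the remaining steps (checking repos, then 'openstack undercloud upgrade') run as described.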

Comment 25 Randy Perryman 2017-01-03 15:33:13 UTC
Steps I am taking:
1. pcs cluster stop
2. systemctl stop openvswitch.service
3. yum update openvswitch*
4. systemctl start openvswitch.service
5. ip a - validate all IP's
6. pcs cluster start
7. pcs status until all nodes are back in the cluster, then repeat on the next controller

--- 
computes
1. systemctl stop neutron/openstack/openvswitch
2. yum update openvswitch*
3. systemctl start openvswitch
4. ip a validate IP's
5. systemctl start neutron/openstack
6.
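
The controller steps above, written as a dry-run script. It is a sketch only: the run helper and the function name are hypothetical, DRY_RUN defaults to printing the commands instead of executing them, and waiting for the cluster to settle in step 7 is left as a manual check of the pcs status output.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One controller at a time, in the order given in the comment.
update_ovs_on_controller() {
    run pcs cluster stop
    run systemctl stop openvswitch.service
    run yum update 'openvswitch*'
    run systemctl start openvswitch.service
    run ip a                 # validate all IPs by eye
    run pcs cluster start
    run pcs status           # repeat until all nodes are back in the cluster
}
```

Calling update_ovs_on_controller with the default DRY_RUN=1 only prints the seven commands, which makes it easy to review the sequence before running it for real on each controller in turn.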

Comment 26 Randy Perryman 2017-01-03 19:37:50 UTC
*** Bug 1406478 has been marked as a duplicate of this bug. ***

Comment 27 Alexander Chuzhoy 2017-01-04 14:40:06 UTC
Note: On a setup with a successful minor update, I get the following openstack-tripleo-heat-templates version after updating the undercloud:
openstack-tripleo-heat-templates-0.8.14-24.el7ost.noarch

Comment 28 Sofer Athlan-Guyot 2017-01-04 16:08:05 UTC
Hi,

Continuing the discussion from https://bugzilla.redhat.com/show_bug.cgi?id=1406478.


> I see that nopostrun is not part of Liberty tag, but the Mitaka
> branch has it.

I made a typo in the original comment.  The command to run is

   grep -r postun ~/pilot/templates

But according to your previous comment, you don't have the latest version of the
tht package.  The code was backported on downstream only as liberty
was EOL at that time and the code couldn't be pushed upstream.

The rpm that holds the necessary code is
openstack-tripleo-heat-templates-0.8.14-24.el7ost.noarch.rpm

Could you try again after having upgraded the openstack-tripleo-heat-templates package?
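
A quick way to check whether the installed templates package is new enough is sketched below. It assumes 0.8.14-24.el7ost (the build named in this comment) is the minimum, and uses GNU sort -V as an approximation of rpm's version ordering; both function names are hypothetical.

```shell
# Succeeds when version $1 is greater than or equal to version $2,
# using GNU sort -V ordering (an approximation of rpm's comparison).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Checks the installed openstack-tripleo-heat-templates build.
tht_has_fix() {
    have=$(rpm -q --qf '%{VERSION}-%{RELEASE}' openstack-tripleo-heat-templates) || return 1
    version_at_least "$have" "0.8.14-24.el7ost"
}
```

If tht_has_fix succeeds, the grep for postun in the templates directory should then find the handling.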

Comment 30 errata-xmlrpc 2017-01-05 14:37:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0026.html