Bug 1364540

Summary:	Upgrade of openvswitch-2.4.0-1.el7 makes IPs disappear. (osp10)
Product:	Red Hat OpenStack	Reporter:	Sofer Athlan-Guyot <sathlang>
Component:	openstack-tripleo-heat-templates	Assignee:	Marios Andreou <mandreou>
Status:	CLOSED ERRATA	QA Contact:	Omri Hochman <ohochman>
Severity:	medium	Docs Contact:
Priority:	urgent
Version:	10.0 (Newton)	CC:	aloughla, apevec, chrisw, jcoufal, jschluet, lbezdick, mandreou, mburns, mlammon, rhel-osp-director-maint, rhos-maint, sathlang, srevivo
Target Milestone:	rc	Keywords:	Reopened, Triaged
Target Release:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-5.1.0-6.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1388543 1388546		Environment:
Last Closed:	2016-12-14 15:49:41 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1337794, 1388543, 1388546, 1394322

Description Sofer Athlan-Guyot 2016-08-05 16:11:39 UTC

Description of problem:

During the upgrade of OSP-9 to OSP-10, upgrading the openvswitch
package removes all IP addresses on the controller, causing the upgrade to fail.

Version-Release number of selected component (if applicable):

    openvswitch-2.4.0-1.el7.x86_64 to openvswitch-2.5.0-3.el7.x86_64

How reproducible: always

Steps to Reproduce:
1. Well, I guess that just upgrading the package should show the problem.

Actual results: all IPs on the controller are removed

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
        link/ether 00:a8:94:21:61:39 brd ff:ff:ff:ff:ff:ff
        inet6 fe80::2a8:94ff:fe21:6139/64 scope link 
           valid_lft forever preferred_lft forever
    3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether aa:b4:d7:24:62:c5 brd ff:ff:ff:ff:ff:ff
    12: vlan10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether d6:4a:74:67:1c:e9 brd ff:ff:ff:ff:ff:ff
    13: vlan20: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether fa:93:78:4b:31:f9 brd ff:ff:ff:ff:ff:ff
    14: vlan40: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether da:b9:ee:92:85:7e brd ff:ff:ff:ff:ff:ff
    15: vlan50: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether 1a:ea:62:67:31:e7 brd ff:ff:ff:ff:ff:ff
    16: vlan30: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether de:4c:10:bd:96:f3 brd ff:ff:ff:ff:ff:ff
    17: br-ex: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether 00:a8:94:21:61:39 brd ff:ff:ff:ff:ff:ff
    18: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether b6:ef:9a:fe:44:42 brd ff:ff:ff:ff:ff:ff
    19: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
        link/ether d6:12:60:1f:a9:45 brd ff:ff:ff:ff:ff:ff


Expected results: successful upgrade of openvswitch

Additional info:

Part of the yum upgrade relevant to the problem:

      Updating   : openvswitch-2.5.0-3.el7.x86_64 149/476
      ...
      Cleanup    : openvswitch-2.4.0-1.el7.x86_64 474/476

        ^^^ at this point the ssh connection died

A simple systemctl restart network from the console makes everything
work again.
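
The console workaround above can be sketched as a small check, assuming the
symptom is that no global-scope IPv4 address is left on the node (as in the
output above, where only lo keeps an address). The function names are
hypothetical and this is only an illustration, not part of the fix:

```shell
#!/bin/sh
# Count global-scope IPv4 addresses in the output of
# `ip -4 -o addr show scope global` read from stdin. In one-line (-o) mode
# each address is one line of the form "<idx>: <ifname> inet <addr>/<len> ...".
count_global_ipv4() {
    awk '$3 == "inet" { n++ } END { print n + 0 }'
}

# Hypothetical helper: restart the "network" service only when the node
# has no global IPv4 address left. Needs root and console access.
restore_network_if_needed() {
    if [ "$(ip -4 -o addr show scope global | count_global_ipv4)" -eq 0 ]; then
        systemctl restart network
    fi
}
```

Calling restore_network_if_needed from the console restarts networking only
when the addresses are actually gone.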

Comment 2 Lukas Bezdicka 2016-08-08 14:46:16 UTC

Upstream bug:
https://bugs.launchpad.net/neutron/+bug/1514056

Comment 3 Sofer Athlan-Guyot 2016-08-10 13:48:36 UTC

I'm still not sure the above patch solves the problem. I upgraded osp-9
using https://review.openstack.org/#/c/348889/. After setup, all my
bridges except br-ex are in secure mode:

ovs-vsctl show | grep -B1 secure
Bridge br-int
fail_mode: secure
--
Bridge br-tun
fail_mode: secure

but the upgrade fails with lost connectivity.
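
A slightly more explicit version of the grep above, as a sketch: it prints
every bridge with its fail mode, so bridges without one (br-ex here) show up
too. The function name and the "unset" label are made up for this example; the
parsing assumes the usual `ovs-vsctl show` layout, with a "Bridge <name>" line
followed by an optional "fail_mode: <mode>" line.

```shell
#!/bin/sh
# Print "<bridge> <fail_mode>" for every bridge found in `ovs-vsctl show`
# output read from stdin. Bridges with no fail_mode line are reported as
# "unset" (a label chosen for this sketch).
bridge_fail_modes() {
    awk '
        $1 == "Bridge" {
            if (name != "") print name, mode
            name = $2
            gsub(/"/, "", name)   # some versions quote the bridge name
            mode = "unset"
        }
        $1 == "fail_mode:" { mode = $2 }
        END { if (name != "") print name, mode }
    '
}
```

Usage: ovs-vsctl show | bridge_fail_modes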

After restoring it, I did:

yum downgrade openvswitch

It installed openvswitch-2.0.0-7.el7.x86_64 and everything was fine

After I did:

yum install ftp://ftp.icm.edu.pl/vol/rzm5/linux-slc/centos/7.1.1503/cloud/x86_64/openstack-kilo/common/openvswitch-2.4.0-1.el7.x86_64.rpm

And everything was ok.

Then I did

yum upgrade

I had to do a systemctl restart network to have this output

Running transaction
Updating : openvswitch-2.5.0-3.el7.x86_64 1/2
Cleanup : openvswitch-2.4.0-1.el7.x86_64 2/2
Verifying : openvswitch-2.5.0-3.el7.x86_64 1/2
Verifying : openvswitch-2.4.0-1.el7.x86_64 2/2

Updated:
openvswitch.x86_64 0:2.5.0-3.el7

without it the connection was lost.

So it's either a new "feature" in openvswitch-2.5 or a problem in the
spec file of the rpm.

Setting br-ex to secure (ovs-vsctl set-fail-mode br-ex secure) is a
no-go, as there is no controller associated with br-ex (if I understand
it all correctly). The net result of running the above command is an
immediate loss of connectivity.

For the time being I'm going to pin openvswitch to 2.4 during the
upgrade.

Comment 4 Sofer Athlan-Guyot 2016-08-12 14:48:29 UTC

I confirm that it has nothing to do with the upstream patch mentioned by
Lukas.

Pinning openvswitch "solves" it.

    --- extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh.orig      2016-08-12 06:01:20.900145477 -0400
    +++ extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh   2016-08-12 10:45:55.870145477 -0400
    @@ -145,7 +145,6 @@
     yum -y install python-zaqarclient  # needed for os-collect-config
    +yum -y install yum-plugin-versionlock
    +yum versionlock openvswitch
     yum -y -q update
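
The pinning step in the diff above can be made re-runnable with a small guard.
This is a sketch only: the function names are hypothetical, and it assumes
that `yum versionlock list` prints entries of the form
"0:openvswitch-2.4.0-1.el7.*" (EPOCH:NAME-VERSION-RELEASE.ARCH).

```shell
#!/bin/sh
# Succeed if the `yum versionlock list` output on stdin already contains a
# lock entry for package $1. Anchoring on "EPOCH:NAME-" avoids matching
# packages such as python-openvswitch.
is_version_locked() {
    grep -Eq "^[0-9]+:$1-[0-9]"
}

# Hypothetical wrapper around the two commands from the diff: install the
# plugin, then lock the package only if it is not locked yet.
pin_package() {
    yum -y install yum-plugin-versionlock
    if ! yum -q versionlock list | is_version_locked "$1"; then
        yum versionlock "$1"
    fi
}
```

Calling pin_package openvswitch before the "yum -y -q update" line keeps the
installed 2.4 package in place during the update.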

Comment 5 Sofer Athlan-Guyot 2016-08-26 18:27:00 UTC

Note, this happens on all upgraded nodes, so:

    upgrade-non-controller.sh --upgrade overcloud-objectstorage-0
    ....
    
      Cleanup    : 1:librados2-0.94.5-14.el7cp.x86_64                       475/481 
      Cleanup    : parted-3.1-23.el7.x86_64                                 476/481 
      Cleanup    : gperftools-libs-2.4-7.el7.x86_64                         477/481 
      Cleanup    : python-pandas-0.17.0-1.el7ost.x86_64                     478/481 
      Cleanup    : openvswitch-2.4.0-1.el7.x86_64                           479/481 
    
    Write failed: Broken pipe

The upgrade hangs at the cleanup stage of openvswitch and then fails.

The same pinning must go into all the upgrade scripts in tripleo-heat-templates:

    extraconfig/tasks/major_upgrade_ceph_storage.sh
    
    extraconfig/tasks/major_upgrade_compute.sh
    
    extraconfig/tasks/major_upgrade_object_storage.sh
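
A quick way to check which of these scripts still lack the pin, as a sketch:
the function name is hypothetical, and it simply looks for the literal
"yum versionlock openvswitch" line from comment 4.

```shell
#!/bin/sh
# Print every script given as an argument that does not contain the
# "yum versionlock openvswitch" line. Unreadable or missing files are
# printed too, since grep fails on them as well.
scripts_missing_pin() {
    for script in "$@"; do
        grep -q 'yum versionlock openvswitch' "$script" 2>/dev/null || echo "$script"
    done
}
```

Usage, from a tripleo-heat-templates checkout:

    scripts_missing_pin extraconfig/tasks/major_upgrade_*.sh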

Comment 6 Panu Matilainen 2016-09-01 07:23:35 UTC

Looks like duplicate of bug 1371840 (or the other way around) to me.

Comment 7 Sofer Athlan-Guyot 2016-09-01 16:43:49 UTC

Hi,

Yep, it is.  So is this a WONTFIX as well?  Is pinning the acceptable solution for the OSP-9 to OSP-10 upgrade?  When will the 2.6 version be available?

Comment 8 Panu Matilainen 2016-09-14 12:59:42 UTC

Closing as dupe, let's keep the related discussion in the main bug.

*** This bug has been marked as a duplicate of bug 1371840 ***

Comment 9 Marios Andreou 2016-10-20 14:17:00 UTC

Hey, reopening this one because the duplicate at https://bugzilla.redhat.com/show_bug.cgi?id=1371840 is marked as 'wontfix'... I'll use this bug to track the workaround we will carry in the upgrade/update to deal with the openvswitch update.

Comment 10 Marios Andreou 2016-10-20 14:19:49 UTC

Also retargeting to tripleo-heat-templates since we'll be carrying a workaround for the issue there.

Comment 11 Marios Andreou 2016-10-21 16:44:08 UTC

changed the upstream review to point to newton @ https://review.openstack.org/#/c/389753/ (master merged a little while ago)

Comment 12 Marios Andreou 2016-10-21 16:49:12 UTC

Please note that the full context around how we came to use this workaround is in BZ 1371840 and also BZ 1385096

Comment 13 Marios Andreou 2016-10-25 15:19:49 UTC

We still need to backport this to mitaka and liberty ...

Comment 15 Marios Andreou 2016-10-31 13:49:08 UTC

Adding a note for reference: there is a related BZ at https://bugzilla.redhat.com/show_bug.cgi?id=1388675 with its own upstream bug and review (linked there), which is a follow-on from the fix landed here. The fix there adds --replacepkgs in case the latest ovs was already installed, and fixes a syntax nit in the ceph upgrade script.
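
For readers without access to the reviews, here is a rough sketch of what a
manual openvswitch upgrade using --replacepkgs might look like. It is not the
merged code. It assumes, which this bug does not state, that the IP loss is
triggered by a service restart in the old package's post-uninstall scriptlet,
and that skipping that scriptlet with rpm's --nopostun avoids it. The function
names are hypothetical.

```shell
#!/bin/sh
# Succeed if the postuninstall scriptlet in `rpm -q --scripts <pkg>` output
# (read from stdin) contains a "systemctl ... try-restart" call.
postun_restarts_service() {
    awk '
        /^[a-z]+ scriptlet/ { in_postun = ($1 == "postuninstall"); next }
        in_postun && /systemctl.*try-restart/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Hypothetical wrapper: when the installed package would restart the service
# on removal, download the new openvswitch rpm and upgrade it by hand.
# --replacepkgs lets rpm proceed even if that version is already installed;
# --nopostun (assumption, see above) skips the post-uninstall scriptlet.
manual_ovs_upgrade() {
    if rpm -q --scripts openvswitch | postun_restarts_service; then
        workdir=$(mktemp -d) || return 1
        yumdownloader --destdir "$workdir" openvswitch || return 1
        rpm -U --replacepkgs --nopostun "$workdir"/*.rpm
    fi
}
```

yumdownloader comes from yum-utils; manual_ovs_upgrade would have to run before
the general "yum -y -q update" so that yum finds openvswitch already current.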

Comment 16 Marios Andreou 2016-10-31 13:57:21 UTC

also adding a link to https://review.openstack.org/#/c/390792/ since you need both reviews for the 'complete' ovs upgrade workaround.

Comment 17 Omri Hochman 2016-11-07 21:55:24 UTC

marios - is that another duplicate of one of these : 

https://bugzilla.redhat.com/show_bug.cgi?id=1386299 - https://bugzilla.redhat.com/show_bug.cgi?id=1385096

Comment 18 Marios Andreou 2016-11-08 07:52:58 UTC

(In reply to Omri Hochman from comment #17)
> marios - is that another duplicate of one of these : 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1386299 -

^^^ No I don't think so, though I did think they may be related at one point (we have discussed this on lifecycle scrum). in BZ 1386299 the problem appears when you reboot. Here the problem was just upgrading openvswitch from 2.4 to 2.5 (no reboot needed, IPs disappeared just by doing the yum update).

> https://bugzilla.redhat.com/show_bug.cgi?id=1385096

^^^ No though they definitely *are* related. BZ 1385096 is tracking the problem we are 'fixing' here against openvswitch. In this bug we worked around the problem with the review linked in external trackers.

thanks

Comment 20 mlammon 2016-11-15 19:06:08 UTC

Deployed RHOS 9 latest
Upgraded to RHOS 10 with latest puddle (2016-11-14.1)

I no longer see this issue.

openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
openstack-neutron-openvswitch-9.1.0-4.el7ost.noarch

Comment 22 Marios Andreou 2016-11-23 18:20:39 UTC

Adding one more fix needed here @ https://review.openstack.org/#/c/401195/. Waiting on CI to merge it to stable/newton, then we can move this back to POST.

Comment 23 Marios Andreou 2016-11-24 10:16:09 UTC

https://review.openstack.org/#/c/401195/ landed in newton, moving to POST.

Comment 25 Omri Hochman 2016-11-29 21:07:31 UTC

Verified with openstack-tripleo-heat-templates-5.1.0-6.el7ost.noarch


After the upgrade to osp10, I rebooted all the OC nodes and checked that all overcloud nodes were still reachable after returning from reboot.

Comment 26 Marios Andreou 2016-11-30 08:35:58 UTC

(In reply to Omri Hochman from comment #25)
> Verified with openstack-tripleo-heat-templates-5.1.0-6.el7ost.noarch
> 
> 
> After upgrade to osp10,  I've rebooted all the OC nodes and check that all
> overcloud nodes are still reachable after return from reboot.

just to be clear, this BZ is about the openvswitch 2.4-2.5 upgrade which causes nodes to lose IPs (the reboot one was also openvswitch but a different issue). However the fact that you successfully upgraded w/out problem (i.e. nodes don't lose IPs during the yum update on a given node) is enough to verify here.

Comment 28 errata-xmlrpc 2016-12-14 15:49:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html