Bug 1371840 - [OVS] - Updating the openvswitch package causes the host to lose its IP
Summary: [OVS] - Updating the openvswitch package causes the host to lose its IP
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: pre-dev-freeze
Assignee: Thadeu Lima de Souza Cascardo
QA Contact: Network QE
URL:
Whiteboard:
Depends On:
Blocks: OpenVswitch_Support 1337794
 
Reported: 2016-08-31 08:50 UTC by Michael Burman
Modified: 2017-11-07 01:00 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-18 16:57:15 UTC


Attachments
Logs (856.05 KB, application/x-gzip)
2016-08-31 08:50 UTC, Michael Burman

Description Michael Burman 2016-08-31 08:50:40 UTC
Created attachment 1196266 [details]
Logs

Description of problem:
[OVS] - Updating the openvswitch package causes the host to lose its IP.

When trying to run yum update openvswitch, the ovirtmgmt bridge goes down and the host loses its IP.

Restarting the network service didn't help.
When trying to reboot the server, it got stuck at 'Stop storage shared leased manager'.

The OVS bridge still exists:

ovs-vsctl show
f890bae1-2a42-48d0-bd8c-91fa37e0fbc1
    Bridge "vdsmbr_vs2eYDRw"
        Port "vdsmbr_vs2eYDRw"
            Interface "vdsmbr_vs2eYDRw"
                type: internal
        Port "enp4s0"
            Interface "enp4s0"
        Port ovirtmgmt
            Interface ovirtmgmt
                type: internal
    ovs_version: "2.5.0"


Version-Release number of selected component (if applicable):
4.1.0-0.0.master.20160828231419.gitdbe44d9.el7.centos
vdsm-4.18.999-464.git15fac93.el7.centos.x86_64
openvswitch-2.4.0-2.el7_2.x86_64 >> openvswitch-2.5.0-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install 4.1 host to rhv-m 4.1 in OVS type cluster
2. Update the openvswitch package on the host(if there is no available update, you can downgrade and then update)

Actual results:
The host loses its IP and the reboot gets stuck.

Expected results:
The update should complete without the host losing its IP.

Comment 1 Michael Burman 2016-08-31 10:20:54 UTC
So it is a bug in openvswitch when upgrading from one minor version to another.
For example from openvswitch-2.4.0-2.el7_2.x86_64 >> openvswitch-2.5.0-2.el7.x86_64.

A bridge that was created via ovs-vsctl goes down and loses its IP when running the update.

For example:

1) Install clean rhel 7.2 host - 3.10.0-327.28.3.el7.x86_64
2) ovs-vsctl add-br ovirt
3) ovs-vsctl add-port ovirt enp6s0
4) ip addr add 5.5.5.5/24 dev ovirt

11: ovirt: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
    link/ether 00:14:5e:fb:11:f2 brd ff:ff:ff:ff:ff:ff
    inet 5.5.5.5/24 scope global ovirt
       valid_lft forever preferred_lft forever
[root@red-vds4 ~]# ovs-vsctl show 
30b21271-399a-4415-a016-701835943b5d
    Bridge ovirt
        Port ovirt
            Interface ovirt
                type: internal
        Port "enp6s0"
            Interface "enp6s0"
    ovs_version: "2.4.0"


5) update openvswitch package 

Updating   : openvswitch-2.5.0-2.el7.x86_64
Cleanup    : openvswitch-2.4.0-2.el7_2.x86_64
Verifying  : openvswitch-2.5.0-2.el7.x86_64
Verifying  : openvswitch-2.4.0-2.el7_2.x86_64

Updated:
  openvswitch.x86_64 0:2.5.0-2.el7

12: ovirt: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
    link/ether 00:14:5e:fb:11:f2 brd ff:ff:ff:ff:ff:ff
[root@red-vds4 yum.repos.d]# ovs-vsctl show 
30b21271-399a-4415-a016-701835943b5d
    Bridge ovirt
        Port ovirt
            Interface ovirt
                type: internal
        Port "enp6s0"
            Interface "enp6s0"
    ovs_version: "2.5.0"

Comment 3 Flavio Leitner 2016-08-31 21:47:49 UTC
Open vSwitch doesn't know/control anything above the interface, so if you upgrade/restart OVS, most probably you need to restart the network service or other related services managing the configuration as well.

There is work in progress to allow that to happen, but the required changes will land only in OVS 2.6 (not released yet) and unfortunately they are not back-portable to our current OVS 2.5.

Therefore, this is a WONTFIX if you are looking for an immediate solution, or we can leave this open until 2.6 and the other changes are available.

Comment 4 Dan Kenigsberg 2016-09-01 10:41:32 UTC
Flavio, could you tell me what is expected to become of an OVS-2.5 interface with an IP address once the OVS daemon is stopped? Does it disappear (how? why?)? Does it lose its address? Does it lose connectivity?

What happens to the interface when OVS is restarted?

Comment 5 Panu Matilainen 2016-09-14 12:59:42 UTC
*** Bug 1364540 has been marked as a duplicate of this bug. ***

Comment 7 Thadeu Lima de Souza Cascardo 2016-09-18 10:09:25 UTC
Hi, Dan.

When using the kernel datapath, after stopping the openvswitch daemon, the ports will still be there, no changes to their link and address.

However, you will lose traffic, as there is no daemon to install new flows or handle packets that don't hit an installed flow.

When the daemon starts again, it will remove some of the ports from the bridges and add them again. I will verify how that might impact the port link and if there is any difference between those versions that might be causing this problem.

Regards.
Cascardo.

Comment 8 Thadeu Lima de Souza Cascardo 2016-09-23 14:00:19 UTC
I can confirm that the upgrade from 2.4.0 to 2.5.0 causes the interface to lose its IP address, while a restart of either version keeps it. I will see what the difference is between these paths.

However, in any case, since different datapaths might suffer from this same problem (interfaces will be destroyed and created again), this needs to be solved in a different way.

Flavio and Aaron have a solution for ifcfg, but if any other program assigns the addresses, it must be prepared for the removal and reappearance of the interfaces.

Cascardo.

Comment 10 Omri Hochman 2016-10-10 14:53:44 UTC
Adding escalate + adding test-blocker flag, as it affects life-cycle upgrade tests.

Comment 11 Flavio Leitner 2016-10-10 15:03:08 UTC
Hi
Could you tell us how the IP is being configured in the first place?
Thanks,
fbl

Comment 12 Thadeu Lima de Souza Cascardo 2016-10-10 16:02:10 UTC
OK, so I moved further in my investigation of why an upgrade would cause this, while a restart wouldn't.

So it happens that we get these two commits upstream on 2.5.

5b5868191c8792726fc3237dfff84dcbbf3da6ae ("ovs-lib: Try to call exit before killing.")
de37dacd7825d345c1cc8579a4c0ac3f26bbff42 ("ovs-vswitchd: Preserve datapath ports across graceful shutdown.")

So, 2.4 ovs-ctl kills vswitchd with SIGTERM, which keeps the datapath ports. 2.4 exit (graceful shutdown) removes the datapath ports.
2.5 ovs-ctl calls exit (graceful shutdown) when restarting, and 2.5 exit will keep the datapath ports.

However, when we use ovs-ctl restart from 2.5 in order to upgrade, it will call exit for the 2.4 vswitchd, which will remove the ports.

Some solutions include: stopping with the 2.4 ovs-ctl during the upgrade, or changing 2.5 ovs-ctl to not call exit if a vswitchd older than 2.5 is running.

However, this does not fix the problem architecturally. Keeping the address is not supported in other scenarios, and should not be supported in this scenario as well. Whatever configures the IP address must be ready to reconfigure it after the ports are gone and back up. I'll check if ifcfg support is working correctly.

Cascardo.

Comment 14 Thadeu Lima de Souza Cascardo 2016-10-10 19:05:47 UTC
In order to help resolve this, we need to understand the situation a little better. We have a good idea how to solve the issue when using the kernel datapath and setting an address by hand on the local port. However, this will not work when using DPDK or the netdev datapath.

We can tell that the problem being seen is specific to upgrading from OVS 2.4 to OVS 2.5. Upgrades between 2.5 releases will work just fine *for this particular setup*. As said, when using the netdev datapath (DPDK), this won't work.

What we need to know is how the interface address is configured. Does it use some OSP code, or ifcfg files? Or is there an expectation that addresses that are set up by running ip addr in a shell will just be kept around after the upgrade?

Thanks.
Cascardo.

Comment 15 Michael Burman 2016-10-11 07:49:31 UTC
(In reply to Flavio Leitner from comment #11)
> Hi
> Could you tell us how the IP is being configured in the first place?
> Thanks,
> fbl

Hi
The IP is being configured using ifcfg files during anaconda installation from foreman.
The bootproto configured in the ifcfg file is dhcp and we get the IP from the dhcp server.

- After the installation is done and I run the upgrade of the openvswitch package (2.4.1>2.5.0), the interface loses its IP and the connection to the server is lost.

Comment 16 Marios Andreou 2016-10-11 10:52:25 UTC
Sorry to confuse the two issues, but we are hitting this for upgrades too, so replying for that use case. It sounds like it is similar to mburman's comment #15 with the use of ifcfg (we could also use the BZ we originally filed at https://bugzilla.redhat.com/show_bug.cgi?id=1364540 if you prefer)...

ifcfg files are written out by os-net-config. The contents of the ifcfg files are governed by the tripleo-heat-templates networking setup (for example https://github.com/openstack/tripleo-heat-templates/blob/master/network/config/single-nic-vlans/controller.yaml) . For a vanilla OSP9 network isolation setup the os-net-config configuration file on a controller looks like:


        [root@overcloud-controller-0 ~]# cat /etc/os-net-config/config.json 
        {"network_config": [{"dns_servers": ["192.168.122.1"], "name": "br-ex", "members": [{"type": "interface", "name": "nic1", "primary": true}, {"routes": [{"default": true, "next_hop": "10.0.0.1"}], "type": "vlan", "addresses": [{"ip_netmask": "10.0.0.6/24"}], "vlan_id": 10}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.2.8/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.1.5/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.3.6/24"}], "vlan_id": 40}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.4/24"}], "vlan_id": 50}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "ovs_bridge", "addresses": [{"ip_netmask": "192.0.2.13/24"}]}]}

os-net-config takes that data ^^^ and writes out the ifcfg- files, e.g.:

        [root@overcloud-controller-0 ~]# cat /etc/sysconfig/network-scripts/ifcfg-vlan40
        # This file is autogenerated by os-net-config
        DEVICE=vlan40
        ONBOOT=yes
        HOTPLUG=no
        NM_CONTROLLED=no
        PEERDNS=no
        DEVICETYPE=ovs
        TYPE=OVSIntPort
        OVS_BRIDGE=br-ex
        OVS_OPTIONS="tag=40"
        BOOTPROTO=static
        IPADDR=172.16.3.6
        NETMASK=255.255.255.0

hope that helps?

thanks, marios

Comment 17 Thadeu Lima de Souza Cascardo 2016-10-11 21:32:05 UTC
Hey, folks.

Here is what I have so far:

1) I could reproduce the problem with ifcfg script, and with the kernel datapath, it's really about doing the appctl exit to stop the 2.4 daemon.

2) However, if we want to fix this for DPDK as well and not depend on the datapath lying around, we should have the interfaces be brought up again whenever we restart and upgrade openvswitch.

3) Aaron Conole's patches for openvswitch systemd unit files do not fix this.

4) I couldn't find a simple way to restart the network service whenever we restart the openvswitch service, if we want to keep changes to openvswitch, and restarting the whole network does not seem right.

5) Restarting only those interfaces that have gone and come back, however, seems the right thing to do, which is what the HOTPLUG option used to be for.

6) So, you should set HOTPLUG=yes, or leave it blank, according to /etc/sysconfig/network-scripts/ifup.

7) However, the code that used to trigger that belonged to a udev rule and an initscripts script that has been gone since 2012; see https://git.fedorahosted.org/cgit/initscripts.git/commit/?id=031a9ddaf6e25fb92959d5c1d7198a69c6ea6fbd.

8) Adding that net.hotplug script back, adding that little snippet to 60-network.rules, and setting HOTPLUG=yes fixed it for me.
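The idea in points 6-8 can be sketched roughly as follows. This is a hypothetical reconstruction of the net.hotplug logic, not the exact removed initscripts script; a udev rule along the lines of `SUBSYSTEM=="net", ACTION=="add", RUN+=".../net.hotplug"` would invoke it when an interface reappears:

```shell
# Hypothetical sketch of a net.hotplug-style handler: bring an interface
# back up when it (re)appears, unless its ifcfg file opts out via HOTPLUG=no.
# IFCFG_DIR is overridable here only to make the sketch easy to exercise.
hotplug_ifup() {
    local iface="$1"
    local cfg="${IFCFG_DIR:-/etc/sysconfig/network-scripts}/ifcfg-${iface}"
    [ -f "$cfg" ] || return 1          # no configuration for this interface
    if grep -q '^HOTPLUG=no' "$cfg"; then
        return 1                       # hotplug explicitly disabled
    fi
    ifup "$iface"                      # re-apply the ifcfg configuration
}
```

With HOTPLUG=yes (or unset) in the ifcfg file, the interface would get its address re-applied after openvswitch recreates it.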

My suggestion: have net.hotplug restored in initscripts and set HOTPLUG=yes. If we can't have that in initscripts, have it in OSP somehow.

Does that seem reasonable? I would be glad to help with that if needed. I will be out tomorrow, and so will Flavio. Can you take this conversation to the initscripts folks, at least so we can plan for the future, and consider a path that allows an upgrade to an initscripts RPM that restores that behavior? Or see what else they propose instead.

Thanks.
Cascardo.

Comment 18 Marios Andreou 2016-10-13 14:11:09 UTC
@dsneddon sorry for needinfo spam... am I correct in recollecting that we don't want to use HOTPLUG in the ifcfg files because of issues with bonds?


Hey Cascardo:

WRT using HOTPLUG=yes, I *think* this caused problems with bonded interfaces if I recall correctly, so that may be a non-starter. I think someone like dsneddon may be able to confirm (adding needinfo). If we *can* use it, then going the route of restoring that functionality with the hotplug script you pointed at is probably worth exploring, but that could take a while to get agreed/merged etc., right?

I am trying to come up with a workaround for now... I took from your points that though a pretty big hammer, restarting network service will probably deal with this. I played with a script like:

        mkdir OVS_UPGRADE && pushd OVS_UPGRADE
        yumdownloader --resolve openvswitch
        echo "***********stopping OVS"
        systemctl stop openvswitch
        echo "**********stopped.... installing rpms:"
        yum -y localinstall ./*.rpm
        echo "done restarting the network"
        systemctl restart network
        echo "done .... restarting the switch"
        systemctl restart openvswitch 
        echo "done"
        popd


I ran this on a controller-2 node and from another terminal was pinging it:

        [stack@instack ~]$ while [ true ]; do ping -c 1 192.0.2.15; date; sleep 1; done

        1 packets transmitted, 1 received, 0% packet loss, time 0ms
        rtt min/avg/max/mdev = 0.156/0.156/0.156/0.000 ms
        Thu Oct 13 09:57:34 EDT 2016

        PING 192.0.2.15 (192.0.2.15) 56(84) bytes of data.

        --- 192.0.2.15 ping statistics ---
        1 packets transmitted, 0 received, 100% packet loss, time 0ms

        Thu Oct 13 09:57:45 EDT 2016

        PING 192.0.2.15 (192.0.2.15) 56(84) bytes of data.

        --- 192.0.2.15 ping statistics ---
        1 packets transmitted, 0 received, 100% packet loss, time 0ms

        Thu Oct 13 09:57:56 EDT 2016

        PING 192.0.2.15 (192.0.2.15) 56(84) bytes of data.
        64 bytes from 192.0.2.15: icmp_seq=1 ttl=64 time=2.92 ms

        --- 192.0.2.15 ping statistics ---
        1 packets transmitted, 1 received, 0% packet loss, time 0ms
        rtt min/avg/max/mdev = 2.929/2.929/2.929/0.000 ms
        Thu Oct 13 09:57:57 EDT 2016


This worked OK with a few seconds of downtime (see above), though I obviously had to re-login to all my terminals. So one question for you: is there a smaller hammer than restarting network that I could use here? In any case, we *can* special-case the openvswitch upgrade as part of the upgrades workflow if we have to, even if that is a workaround for now (e.g. if we ultimately go for the hotplug or another fix).

We can deliver that snippet ^^^ above as part of each node upgrade for mitaka to newton... One concern/unknown for me is whether the momentary network downtime on the given node will cause problems with the heat stack-update? If so, then the workaround will be a non-starter...

Comment 19 Thadeu Lima de Souza Cascardo 2016-10-13 15:10:18 UTC
There are some other options here. But with their drawbacks as well.

A) As we found out that shutting down the 2.4 vswitchd using the 2.5 ovs-ctl script causes the interfaces to go away, you can shut down 2.4 before doing the upgrade and the interfaces will be kept. Basically, this is your solution without restarting network. Just stop openvswitch before doing the upgrade, then restart it.

The drawback is when DPDK is used, as I mentioned. The interfaces will be gone anyway, and the problem is not solved by this method. If no users of this upgrade script will be using DPDK, then you are good. Otherwise, you can verify it and use a different method, if that's the case.

You can run this to verify if any bridge relies on DPDK/netdev datapath type.

ovs-vsctl find bridge datapath_type=netdev

Also, as vswitchd will be stopped during the upgrade, any traffic that requires upcalls to the daemon will be dropped. Notice that flows are usually cleaned up from the kernel after some time unused, so it's possible that all traffic will be dropped while the daemon is down. Of course, that affects any upgrade of openvswitch; it's just that this method will increase the downtime.
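Option A could be sketched as a script like the one below. This is illustrative only, not a tested upgrade procedure; it assumes a yum-based host, the kernel datapath, and the `openvswitch` systemd unit, and the function name is made up here:

```shell
# Sketch of option A: stop the *old* (2.4) openvswitch before upgrading, so
# the 2.4 shutdown path (SIGTERM, datapath ports kept) is used instead of
# 2.5's graceful exit. Run as root on a real host.
upgrade_ovs_stop_first() {
    # Bail out if any bridge uses the netdev/DPDK datapath: those ports
    # will not survive a daemon stop, so this method does not apply.
    if [ -n "$(ovs-vsctl find bridge datapath_type=netdev)" ]; then
        echo "netdev/DPDK bridges present; use a different method" >&2
        return 1
    fi
    systemctl stop openvswitch    # 2.4 stop path keeps the datapath ports
    yum -y update openvswitch     # swap in the new package
    systemctl start openvswitch   # new daemon re-adopts the existing ports
}
```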

B) Only restart (ifdown/ifup) those interfaces that are openvswitch bridges and internal ports. Inspect every ifcfg-* file, get the type, and restart it if the type is OVSBridge, OVSUserBridge, or OVSIntPort. To see how to do that, take a look at /etc/init.d/network.

I am not sure if other routes might be lost. For example, if there is a route that goes through one of those interfaces that is configured at /etc/sysconfig/static-routes, it's possible they might be lost.

There is still some network downtime here, as the interfaces will be gone from when the daemon is stopped until the addresses are re-established. And in that case, instead of packets silently dropping, some errors might be noticeable.
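Option B might be sketched like this (an illustrative sketch, assuming the ifcfg TYPE= line is sufficient to identify OVS devices; the helper names are made up here):

```shell
# Sketch of option B: bounce only ifcfg devices whose TYPE marks them as
# OVS bridges or internal ports. IFCFG_DIR is overridable just so the
# type check is easy to exercise.
IFCFG_DIR="${IFCFG_DIR:-/etc/sysconfig/network-scripts}"

is_ovs_ifcfg() {
    # True if the ifcfg file declares an OVS bridge or internal port type.
    grep -Eq '^TYPE="?(OVSBridge|OVSUserBridge|OVSIntPort)"?$' "$1"
}

restart_ovs_interfaces() {
    local cfg dev
    for cfg in "$IFCFG_DIR"/ifcfg-*; do
        [ -f "$cfg" ] || continue
        if is_ovs_ifcfg "$cfg"; then
            dev="${cfg##*/ifcfg-}"
            ifdown "$dev"; ifup "$dev"   # bounce only the OVS-backed device
        fi
    done
}
```

As noted, this would not restore routes configured outside the ifcfg files themselves.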

C) Use a hotplug strategy. Have some udev rule, just like the one I pointed at, but use a different script that filters for the interface type. Might be done at udev too, only running the script if it's an openvswitch internal port (in the case of netdev datapath type or DPDK, it could be a tap port).

D) Use a mix. Stop the daemon before so routes will be kept, but have the hotplug scripts ready to go, just in case the interfaces are gone and back for some reason.

Cascardo.

Comment 20 Sofer Athlan-Guyot 2016-10-14 12:35:13 UTC
Just a quick note to say that the upgrade process survived the network disconnection caused by the hack described by Marios in comment 18.

I got a successful upgrade on an ha/1compute setup with it and openvswitch at 1:2.5.0-5.git20160628.el7fdb.

Comment 21 Marios Andreou 2016-10-18 16:16:19 UTC
related bug was filed at https://bugzilla.redhat.com/show_bug.cgi?id=1385096

Comment 22 Flavio Leitner 2016-10-18 16:33:43 UTC
Note that if you need to restart OVS, then you should do it _before_ restarting the 'network' service because the 'network' service will add the networking configuration on top of OVS ports.
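A sketch of that ordering, assuming the systemd service names used elsewhere in this bug:

```shell
# Restart OVS first, then the 'network' service, so that 'network'
# re-applies the IP/route configuration on top of the recreated OVS ports.
restart_ovs_then_network() {
    systemctl restart openvswitch
    systemctl restart network
}
```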

Comment 23 Flavio Leitner 2016-10-18 16:57:15 UTC
Hopefully things might change in the future but considering the current situation there is no reliable way for OVS to restore the previous networking state after a service restart. Many things like firewall configuration, traffic shaping, stacked devices, custom setups are completely out of OVS control.  Also because the restart can cause issues (e.g.: bug 1385096), the openvswitch spec file is being fixed to not perform an automatic restart during a package upgrade.

I am closing this. Feel free to re-open if you disagree and have ideas on how it could work.

Thanks
fbl

Comment 24 Marios Andreou 2016-10-19 15:34:01 UTC
Flavio, can you please add a note about the options for the rpm install? I could try it and see if it helps / doesn't need a network restart when I install it that way.

Comment 25 Flavio Leitner 2016-10-19 16:41:50 UTC
The rpm section doing the restart is %postun, so rpm --nopostun should be enough.
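Combined with the download step from comment 18, that could look like the following sketch (illustrative only; --nopostun skips the old package's %postun scriptlet, which is where the restart happens):

```shell
# Sketch: upgrade openvswitch without running the outgoing package's
# %postun scriptlet (the part that restarts the service). Run as root.
upgrade_ovs_nopostun() {
    local dir
    dir=$(mktemp -d)
    pushd "$dir" >/dev/null || return 1
    yumdownloader --resolve openvswitch   # fetch the RPM(s) locally
    rpm -Uvh --nopostun ./*.rpm           # upgrade, skipping %postun
    popd >/dev/null
}
```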

Comment 26 Marios Andreou 2016-10-20 14:17:33 UTC
OK, thanks Flavio. I had some success with this today. I am going to use the original bug we filed at https://bugzilla.redhat.com/show_bug.cgi?id=1364540 to track the workaround we carry.

