Bug 1571647

Summary: neutron-openvswitch-agent cleans up stale flows months after they were created but it does not recreate correct flows and bridge configuration
Product: Red Hat OpenStack Reporter: Irina Petrova <ipetrova>
Component: openstack-neutron    Assignee: Slawek Kaplonski <skaplons>
Status: CLOSED ERRATA QA Contact: Roee Agiman <ragiman>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 8.0 (Liberty)    CC: aguetta, akaris, amuller, brault, chrisw, jlibosva, nyechiel, pablo.iranzo, ragiman, srevivo
Target Milestone: zstream    Keywords: Triaged, ZStream
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-7.2.0-35.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1576248 (view as bug list) Environment:
Last Closed: 2018-07-05 12:28:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1572958, 1576248, 1576256, 1576273, 1576284, 1576286    

Comment 3 Jakub Libosvar 2018-04-25 13:30:45 UTC
We had a call with the customer today; the culprit here is br-ex losing its flow rules.

2018-04-24 14:58:48.946 4154 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.ovs_ofctl.ofswitch [req-edb51234-8ae4-4642-9fd4-895461bc359e - - - - -] Deleting flow cookie=0x0, duration=9076104.394s, table=0, n_packets=3763138657, n_bytes=7350048628976, idle_age=0, hard_age=65534, priority=0 actions=NORMAL

As the network topology uses br-ex for the management network and the bridge is set to secure mode by the ovs agent, the ovs agent cannot talk to neutron-server anymore until flows are configured on br-ex. But to get the information needed to create those flows, it needs to talk to neutron-server first.

The ovs agent has a mechanism, used during ovs-agent restarts, to avoid disrupting the data plane; this mechanism uses cookies on flows. Basically it generates a cookie and cleans only the flows that have an unknown cookie. Such flows are considered stale and are removed on certain events. As we see above, the NORMAL-action flow had its cookie set to 0x0, which is the default when no cookie is specified.
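For reference, the cookies can be checked directly with ovs-ofctl; a quick sketch (the actual cookie value is generated per agent run, 0x0 is just the default):

~~~
# dump all flows on br-ex together with their cookies
ovs-ofctl dump-flows br-ex

# show only the flows whose cookie is exactly 0x0 (cookie/mask syntax)
ovs-ofctl dump-flows br-ex "cookie=0x0/-1"
~~~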

I'm investigating how this flow could have been created. I'll update soon.

Comment 4 Andreas Karis 2018-04-25 14:20:43 UTC
I just realized this; I had completely missed it yesterday (thanks, Irina!)

br-ex when the connectivity is down:

    Bridge br-ex
        Port infra-bond
            Interface "ens3f1"
            Interface "ens3f0"
        Port "vlan160"
            tag: 160
            Interface "vlan160"
                type: internal
        Port "vlan150"
            tag: 150
            Interface "vlan150"
                type: internal
        Port "vlan170"
            tag: 170
            Interface "vlan170"
                type: internal
        Port "vlan110"
            tag: 110
            Interface "vlan110"
                type: internal
        Port br-ex
            Interface br-ex
                type: internal

br-ex after neutron-openvswitch-agent restart:

    Bridge br-ex
        fail_mode: secure                    <<<<<<<<<<<<<<<
        Port infra-bond
            Interface "ens3f1"
            Interface "ens3f0"
        Port phy-br-ex                       <<<<<<<<<<<<<<<
            Interface phy-br-ex              <<<<<<<<<<<<<<<
                type: patch                  <<<<<<<<<<<<<<<
                options: {peer=int-br-ex}    <<<<<<<<<<<<<<<
        Port "vlan160"
            tag: 160
            Interface "vlan160"
                type: internal
        Port "vlan150"
            tag: 150
            Interface "vlan150"
                type: internal
        Port "vlan170"
            tag: 170
            Interface "vlan170"
                type: internal
        Port "vlan110"
            tag: 110
            Interface "vlan110"
                type: internal
        Port br-ex
            Interface br-ex
                type: internal


So this would indicate that something else was wrong on br-ex ... Is it possible that the flow deletion that we see in the logs is only "step 2" of the issue? Step 1 being that somehow br-ex:
* lost its flows and got a default "normal" flow inserted
* lost the fail_mode setting (which in more recent versions of OVS is switched to "secure" by default)
* lost the patch port: phy-br-ex

It's as if br-ex had been reset to defaults...
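
For anyone else checking a node, a quick way to see whether br-ex is still in the state the agent expects (just a sketch, not an official procedure):

~~~
# fail_mode should be "secure" once the ovs agent has programmed the bridge;
# empty output means standalone, the OVS default after the bridge is re-created
ovs-vsctl get-fail-mode br-ex

# the patch port towards br-int should exist
ovs-vsctl list-ports br-ex | grep phy-br-ex

# and the flow table should contain more than a single NORMAL flow with cookie=0x0
ovs-ofctl dump-flows br-ex
~~~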

Comment 5 Jakub Libosvar 2018-04-25 15:11:59 UTC
(In reply to Andreas Karis from comment #4)
[snip]
> 
> So this would indicate that something else was wrong on br-ex ... Is it
> possible that the flow deletion that we see in the logs is only "step 2" of
> the issue? Step 1 being that somehow br-ex:
> * lost its flows and inserted a default "normal" flow
> * lost the fail-mode settings (which in more recent versions of OVS are
> switched to "secure" by default
> * lost the patch port: phy-br-ex
> 
> It's as if br-ex had been reset to defaults...

Yes, that's definitely the cause. When the bridge is switched from secure to standalone, it installs the NORMAL rule with cookie=0x0 (I tested it).

There are two questions we need to find answers for:

 - how did br-ex get to standalone? My theory was that the network service was restarted; that triggers ifup-ovs, which re-creates the bridge in standalone mode because that's the default. A network restart also restarts the ovs agent via systemd, so that would explain the stale flow cleaning.
     - but I didn't find anything in the journal about either the ovs agent or the network service being restarted (see the journalctl sketch below)

 - how did the cleanup get triggered in the ovs agent? The cleanup runs in the first iteration of the agent's periodic loop, unless there are some failed devices. The ovs agent has been running for a long time, and normally the cleanup is only executed at the beginning of the ovs agent process.
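
This is roughly the kind of journal check I mean (a sketch; unit names may differ per deployment):

~~~
# look for start/stop events of the network service and the ovs agent
journalctl -u network -u neutron-openvswitch-agent | grep -iE "starting|stopping|started|stopped"
~~~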

Comment 7 Jakub Libosvar 2018-04-25 16:08:20 UTC
(In reply to Jakub Libosvar from comment #5)
[snip]
> 
> There are two questions we need to find answers for:
> 
>  - how did the br-ex got to standalone? My theory was that network service
> was restarted, that triggers ifcfg-up-ovs which re-creates the bridge - in
> standalone cause it's default. Also network service restarts ovs agent via
> systemd, so that would explain the stale flow cleaning.
>      - but I didn't find in the journal anything about ovs-agent nor network
> being restarted

I found the answers. br-ex has been in standalone mode since Jan 09 13:50:24 because somebody called ifdown and ifup on br-ex, which triggers ifup-ovs, which re-creates the bridge. As the ifcfg-br-ex file doesn't contain any specifics, the bridge was created with the defaults, i.e. in standalone mode with a NORMAL-action flow having cookie=0x0. The bridge was put back to secure mode on Apr 25 00:08:27, which is when the ovs agent was restarted to get rid of the observed connectivity issue.

As per the journalctl logs, I suspect an overcloud update was running at that time, which caused br-ex to be re-created by the network scripts while the ovs agent was kept running. So br-ex was put into standalone mode after the ovs agent had already initialized the bridge, and hence it was never put back into secure mode and its flows were never reconfigured. That means br-ex has had only a single flow, with cookie=0x0 and the NORMAL action, since Jan 9; such a flow is considered stale by the ovs agent when the cleanup is triggered.
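
This is easy to reproduce on a lab node (don't do this on a production compute); a minimal sketch:

~~~
# re-create br-ex through the initscripts, the same path ifup-ovs takes
ifdown br-ex && ifup br-ex

# the bridge comes back with OVS defaults: no fail_mode (i.e. standalone) ...
ovs-vsctl get-fail-mode br-ex

# ... and a single NORMAL flow with cookie=0x0, which is exactly what the
# agent later considers stale and deletes
ovs-ofctl dump-flows br-ex
~~~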

> - how did the cleanup got triggered in ovs agent? The cleanup runs in first 
> iteration of periodical loop in ovs agent, unless there are some failed devices. 
> The ovs agent has been running for a long time and the only way cleanup is 
> executed is at the beginning of ovs agent process.

As the stale flows cleanup happens during the first iteration of rpc_loop, it must have been somehow postponed until Apr 24 (the connectivity failure) and then all agents on the 110 computes triggered the cleanup at the same time. The only way the stale cleanup call can be postponed is when this rpc call [1] returns content in failed_devices_down. That means the neutron-servers had failed devices in their database, and on Apr 24 those were removed. That caused the sync variable here [2] to finally become False, which triggered the cleanup. The source of truth is the database, and once the failed devices were removed, all agents got a response without failed devices when the rpc call [1] was issued. As per the logs, the ovs agent has been running for a long time, and this is the only possible way the cleanup could have been called.


[1] http://git.app.eng.bos.redhat.com/git/neutron.git/tree/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py?h=rhos-8.0-patches#n1495
[2] http://git.app.eng.bos.redhat.com/git/neutron.git/tree/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py?h=rhos-8.0-patches#n1833

Comment 14 Assaf Muller 2018-04-26 18:03:10 UTC
We discussed changing the OVS agent to detect whether provider bridges (in this case br-ex) need to be reprogrammed, using a mechanism similar to the one we already use to detect whether br-int needs to be reprogrammed.
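
The rough idea, sketched here with plain ovs-ofctl (hypothetical table number and cookie, not the actual agent code): keep a canary flow on the provider bridge and reprogram the bridge whenever it disappears.

~~~
# install a canary flow with a known cookie in an otherwise unused table
ovs-ofctl add-flow br-ex "cookie=0xdeadbeef,table=23,priority=0,actions=drop"

# in the agent's periodic loop: if the canary is gone, the bridge was re-created
# behind our back and fail_mode, patch ports and flows must be set up again
ovs-ofctl dump-flows br-ex table=23 | grep -q cookie=0xdeadbeef || echo "br-ex needs reprogramming"
~~~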

Comment 15 Andreas Karis 2018-04-27 15:45:45 UTC
Hi,

Please see https://bugzilla.redhat.com/show_bug.cgi?id=1572698 for a full analysis of how we get into this issue with Director and for the tripleo templates bug that I opened to address this from the Director side.

If you want to fix this from neutron-openvswitch-agent, you would need to monitor and re-push the flows, the patch connections between bridges, the fail_mode, etc. In other words, the ovs agent would have to constantly check the bridges and take action if something manipulates them.

- Andreas

Comment 17 Andreas Karis 2018-05-03 20:50:05 UTC
Hi,

At another customer's site, I ran into a similar issue. It was triggered by a change to the DnsServers configuration.

Between the two template versions, the DnsServers list changes:
~~~
# before 
DnsServers: ["10.236.255.2"]
~~~

~~~
# after
DnsServers: ["10.236.255.2","10.236.255.6"]
~~~

This change triggers a run of os-net-config which propagates it into ifcfg-br-ex:
~~~
# before
[akaris@collab-shell network-scripts]$ cat ifcfg-br-ex
# This file is autogenerated by os-net-config
(...)
DNS1=10.236.255.2
~~~

~~~
# after
[akaris@collab-shell network-scripts]$ cat ifcfg-br-ex
# This file is autogenerated by os-net-config
(...)
DNS1=10.236.255.2
DNS2=10.236.255.6
~~~

os-net-config then deletes and re-creates br-ex, which removes all flows and the virtual patch cord between br-int and br-ex:
~~~
(...)
May 03 19:22:31  os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] running ifdown on bridge: br-ex
May 03 19:22:31  ovs-vsctl[465206]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-br br-ex
May 03 19:22:31  os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-ex
May 03 19:22:31  os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-ex
May 03 19:22:31  os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-ex
May 03 19:22:31  os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] running ifup on bridge: br-ex
(...)
~~~

The flows and the virtual patch cord are only recreated once neutron-openvswitch-agent is restarted. This causes an immediate outage on VLAN networks.
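
Until there's a proper fix, the workaround after such an os-net-config run is to restart the agent so it reprograms the bridge; a sketch (assuming the usual service name on OSP):

~~~
# after os-net-config has re-created br-ex, the patch port and the agent's flows are gone
ovs-vsctl list-ports br-ex | grep phy-br-ex || echo "patch port missing"
ovs-ofctl dump-flows br-ex

# restarting the agent makes it re-set fail_mode and recreate the patch port and flows
systemctl restart neutron-openvswitch-agent
~~~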

- Andreas

Comment 21 Assaf Muller 2018-06-01 18:49:42 UTC
*** Bug 1572959 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2018-07-05 12:28:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2132