Bug 1571647
Summary: | neutron-openvswitch-agent cleans up stale flows months after they were created but does not recreate the correct flows and bridge configuration | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Irina Petrova <ipetrova> | |
Component: | openstack-neutron | Assignee: | Slawek Kaplonski <skaplons> | |
Status: | CLOSED ERRATA | QA Contact: | Roee Agiman <ragiman> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 8.0 (Liberty) | CC: | aguetta, akaris, amuller, brault, chrisw, jlibosva, nyechiel, pablo.iranzo, ragiman, srevivo | |
Target Milestone: | zstream | Keywords: | Triaged, ZStream | |
Target Release: | 8.0 (Liberty) | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | openstack-neutron-7.2.0-35.el7ost | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1576248 (view as bug list) | Environment: | ||
Last Closed: | 2018-07-05 12:28:41 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1572958, 1576248, 1576256, 1576273, 1576284, 1576286 |
Comment 3
Jakub Libosvar
2018-04-25 13:30:45 UTC
I just realized this, I had completely missed it yesterday (thanks, Irina!)

br-ex when the connectivity is down:

~~~
Bridge br-ex
    Port infra-bond
        Interface "ens3f1"
        Interface "ens3f0"
    Port "vlan160"
        tag: 160
        Interface "vlan160"
            type: internal
    Port "vlan150"
        tag: 150
        Interface "vlan150"
            type: internal
    Port "vlan170"
        tag: 170
        Interface "vlan170"
            type: internal
    Port "vlan110"
        tag: 110
        Interface "vlan110"
            type: internal
    Port br-ex
        Interface br-ex
            type: internal
~~~

br-ex after neutron-openvswitch-agent restart:

~~~
Bridge br-ex
    fail_mode: secure                     <<<<<<<<<<<<<<<
    Port infra-bond
        Interface "ens3f1"
        Interface "ens3f0"
    Port phy-br-ex                        <<<<<<<<<<<<<<<
        Interface phy-br-ex               <<<<<<<<<<<<<<<
            type: patch                   <<<<<<<<<<<<<<<
            options: {peer=int-br-ex}     <<<<<<<<<<<<<<<
    Port "vlan160"
        tag: 160
        Interface "vlan160"
            type: internal
    Port "vlan150"
        tag: 150
        Interface "vlan150"
            type: internal
    Port "vlan170"
        tag: 170
        Interface "vlan170"
            type: internal
    Port "vlan110"
        tag: 110
        Interface "vlan110"
            type: internal
    Port br-ex
        Interface br-ex
            type: internal
~~~

So this would indicate that something else was wrong on br-ex... Is it possible that the flow deletion that we see in the logs is only "step 2" of the issue? Step 1 being that somehow br-ex:

* lost its flows and inserted a default "normal" flow
* lost the fail-mode setting (which in more recent versions is switched to "secure" by default)
* lost the patch port phy-br-ex

It's as if br-ex had been reset to defaults...

(In reply to Andreas Karis from comment #4)
> I just realized this, I had completely missed it yesterday (thanks, Irina!)
[snip]
> It's as if br-ex had been reset to defaults...

Yes, that's definitely the cause. When the bridge is switched from secure to standalone, it installs a NORMAL rule with cookie=0x0 (I tested it).

There are two questions we need to find answers for:

- how did br-ex get to standalone? My theory was that the network service was restarted, which triggers ifup-ovs, which re-creates the bridge - in standalone mode, since that's the default. The network service also restarts the ovs agent via systemd, so that would explain the stale flow cleaning.
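The stale flow cleaning works by cookie: the agent stamps the flows it programs with a per-session cookie and, during cleanup, deletes any flow whose cookie it does not recognize. Below is a minimal sketch of that idea; the names and the cookie value are hypothetical, the real logic lives in neutron's ovs_neutron_agent.py.

```python
# Simplified model of cookie-based stale-flow cleanup. Names and the
# cookie value are hypothetical; this only illustrates the mechanism.

AGENT_COOKIE = 0x9A67F4C5  # per-session cookie stamped on agent-programmed flows

def cleanup_stale_flows(flows, known_cookies):
    """Keep only flows stamped with a cookie the agent knows about;
    everything else is treated as stale and dropped."""
    return [f for f in flows if f["cookie"] in known_cookies]

# br-ex after the network scripts re-created it in standalone mode:
# a single default NORMAL flow with cookie=0x0 forwards all traffic.
flows_on_br_ex = [{"cookie": 0x0, "actions": "NORMAL"}]

surviving = cleanup_stale_flows(flows_on_br_ex, known_cookies={AGENT_COOKIE})
print(surviving)  # -> [] : the only forwarding flow on br-ex is gone
```

Because the NORMAL flow left behind by the network scripts carries cookie=0x0, it looks stale to the agent and is removed; and since the agent believes br-ex was already initialized, nothing reprograms the bridge afterwards.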
- but I didn't find anything in the journal about the ovs agent or the network service being restarted
- how did the cleanup get triggered in the ovs agent? The cleanup runs in the first iteration of the periodic loop in the ovs agent, unless there are some failed devices. The ovs agent had been running for a long time, and the only way the cleanup is executed is at the beginning of the ovs agent process.

(In reply to Jakub Libosvar from comment #5)
[snip]
> - how did br-ex get to standalone?
> - but I didn't find anything in the journal about the ovs agent or the
> network service being restarted

I found the answers. br-ex had been in standalone mode since Jan 09 13:50:24, because somebody called ifdown and ifup on br-ex, which triggers ifup-ovs, which re-creates the bridge. As the ifcfg-br-ex file doesn't contain any specifics, the bridge was created with defaults, i.e. in standalone mode with a NORMAL-action flow having cookie=0x0.

The bridge was put back to secure on Apr 25 00:08:27, which is when the ovs agent was restarted to get rid of the observed connectivity issue. Per the journalctl logs, I suspect an overcloud update was running at that time, which triggered br-ex to be re-created by the network scripts while the ovs agent was kept running. So br-ex was put into standalone mode after the ovs agent had already initialized the bridge, and hence it was never put back into secure mode with its flows configured. Which means br-ex had only a single flow with cookie=0x0 and a NORMAL action there since Jan 9 - and such a flow is considered stale by the ovs agent when the cleanup is triggered.

> - how did the cleanup get triggered in the ovs agent?
> The cleanup runs in the first iteration of the periodic loop in the ovs agent,
> unless there are some failed devices. The ovs agent has been running for a
> long time and the only way the cleanup is executed is at the beginning of the
> ovs agent process.

As the stale flows cleanup happens during the first iteration of rpc_loop, it must have somehow been postponed until Apr 24 (the connectivity failure), at which point the agents on all 110 computes triggered the cleanup at the same time. The ONLY way to postpone the stale cleanup call is when the rpc call [1] returns content in failed_devices_down. That means the neutron servers had failed devices in their database, and on Apr 24 those were removed. As a result, the sync variable here [2] finally became False, which triggered the cleanup. The source of truth is the database, and once the failed devices were removed, all agents got a response without failed devices when the rpc call [1] was issued. Per the logs, the neutron-openvswitch agent had been running for a long time, and this is the only possible way the cleanup could have been called.

[1] http://git.app.eng.bos.redhat.com/git/neutron.git/tree/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py?h=rhos-8.0-patches#n1495
[2] http://git.app.eng.bos.redhat.com/git/neutron.git/tree/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py?h=rhos-8.0-patches#n1833

We discussed changing the OVS agent to detect if provider bridges (in this case br-ex) need to be reprogrammed, using a mechanism similar to the one we already use to detect if we need to reprogram br-int.

Hi,

Please see https://bugzilla.redhat.com/show_bug.cgi?id=1572698 for a full analysis of how we get into this issue with Director, and for the tripleo templates bug that I opened to address this from the Director side.

If you want to fix this from neutron-openvswitch-agent, then you would need to monitor the flows and push them, as well as the patch connections between bridges, the fail_mode, etc.
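A minimal sketch of what such a drift check could look like (all names and values are hypothetical; this illustrates the idea being discussed, not the agent's actual implementation):

```python
# Hypothetical sketch of a provider-bridge drift check: compare the state
# the agent programmed against what is actually on the bridge, and flag
# the bridge for reprogramming when anything was lost.

def bridge_needs_reprogramming(expected, actual):
    """True if the live bridge lost its fail_mode, patch ports, or flows."""
    return (
        actual.get("fail_mode") != expected["fail_mode"]
        or not set(expected["patch_ports"]) <= set(actual.get("patch_ports", []))
        or not set(expected["flow_cookies"]) <= set(actual.get("flow_cookies", []))
    )

expected_br_ex = {
    "fail_mode": "secure",
    "patch_ports": ["phy-br-ex"],
    "flow_cookies": [0x9A67F4C5],
}

# br-ex right after the network scripts re-created it with defaults
# (standalone mode, no patch port, only the NORMAL flow with cookie=0x0):
recreated_br_ex = {"fail_mode": None, "patch_ports": [], "flow_cookies": [0x0]}

print(bridge_needs_reprogramming(expected_br_ex, recreated_br_ex))  # -> True
```

In the agent's periodic loop, a True result would have to trigger the same setup path that runs at startup: set the fail_mode back to secure, re-add the patch port, and re-install the flows.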
So ovs-agent would have to constantly check the bridges and take action if something manipulates them.

- Andreas

Hi,

At another customer, I ran into a similar issue. This one was triggered by a change to the DnsServers configuration. Between the templates, the number of DnsServers changed:

~~~
# before
DnsServers: ["10.236.255.2"]
~~~

~~~
# after
DnsServers: ["10.236.255.2","10.236.255.6"]
~~~

This change triggers a run of os-net-config, which propagates it into ifcfg-br-ex:

~~~
# before
[akaris@collab-shell network-scripts]$ cat ifcfg-br-ex
# This file is autogenerated by os-net-config
(...)
DNS1=10.236.255.2
~~~

~~~
# after
[akaris@collab-shell network-scripts]$ cat ifcfg-br-ex
# This file is autogenerated by os-net-config
(...)
DNS1=10.236.255.2
DNS2=10.236.255.6
~~~

os-net-config then deletes and re-creates br-ex, which deletes all of its flows and the virtual patch cord to br-int:

~~~
(...)
May 03 19:22:31 os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] running ifdown on bridge: br-ex
May 03 19:22:31 ovs-vsctl[465206]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-br br-ex
May 03 19:22:31 os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-ex
May 03 19:22:31 os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-ex
May 03 19:22:31 os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-ex
May 03 19:22:31 os-collect-config[3712]: [2018/05/03 07:22:31 PM] [INFO] running ifup on bridge: br-ex
(...)
~~~

The flows and the virtual patch cord are only re-created once neutron-openvswitch-agent is restarted. This causes an immediate outage on VLAN networks.

- Andreas

*** Bug 1572959 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2132