Description of problem:
ovs-vswitchd CPU utilization jumps 50x during our normal Rally API tests (control plane only - no data plane). [1] This can begin to break down the cloud - I have seen services like RabbitMQ lose their heartbeats, become partitioned, and the rest is history.

[1] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?from=1500545668722&to=1500547429680&orgId=1&var-Cloud=openstack-12-ha-noncontainer&var-NodeType=*&var-Node=overcloud-controller-0&var-Interface=interface-br-ex&var-Disk=disk-sda&var-OSD=All&var-Pool=All&var-cpus0=All&var-cpus00=All&panelId=78&fullscreen

Version-Release number of selected component (if applicable):
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.7.0-8.git20170530.el7fdb.x86_64
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64
openvswitch-ovn-host-2.7.0-8.git20170530.el7fdb.x86_64
openvswitch-ovn-common-2.7.0-8.git20170530.el7fdb.x86_64
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-openvswitch-2.7.0-8.git20170530.el7fdb.noarch
[root@overcloud-controller-0 heat-admin]#

How reproducible:
100%

Steps to Reproduce:
1. With a 3x controller HA deployment, execute the Neutron Rally create-list-routers scenario.

Actual results:
ovs-vswitchd CPU utilization jumps ~50x; see the Grafana dashboard linked in the description.

Expected results:
Utilization doesn't jump 50x.

Additional info:

From the vswitchd log (tail):

2017-07-21T11:51:14.589Z|2164542|bridge|INFO|bridge br-int: deleted interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.615Z|2164543|bridge|INFO|bridge br-int: added interface ha-7bba4721-43 on port 995
2017-07-21T11:51:14.639Z|2164544|bridge|INFO|bridge br-int: added interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.713Z|2164545|bridge|INFO|bridge br-int: deleted interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:14.715Z|2164546|bridge|INFO|bridge br-int: deleted interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:14.743Z|2164547|bridge|INFO|bridge br-int: added interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:14.763Z|2164548|bridge|INFO|bridge br-int: added interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:14.833Z|2164549|bridge|INFO|bridge br-int: deleted interface ha-7bba4721-43 on port 995
2017-07-21T11:51:14.835Z|2164550|bridge|INFO|bridge br-int: deleted interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.858Z|2164551|bridge|INFO|bridge br-int: added interface ha-7bba4721-43 on port 995
2017-07-21T11:51:14.885Z|2164552|bridge|INFO|bridge br-int: added interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.961Z|2164553|bridge|INFO|bridge br-int: deleted interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:14.963Z|2164554|bridge|INFO|bridge br-int: deleted interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:14.987Z|2164555|bridge|INFO|bridge br-int: added interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:15.012Z|2164556|bridge|INFO|bridge br-int: added interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:15.081Z|2164557|bridge|INFO|bridge br-int: deleted interface ha-7bba4721-43 on port 995
2017-07-21T11:51:15.083Z|2164558|bridge|INFO|bridge br-int: deleted interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:15.108Z|2164559|bridge|INFO|bridge br-int: added interface ha-7bba4721-43 on port 995

From /var/log/messages (tail):

Jul 21 07:50:16 localhost NetworkManager[1264]: <info> [1500637816.7172] manager: (qr-504e5f81-9a): new Generic device (/org/freedesktop/NetworkManager/Devices/1027729)
Jul 21 07:50:16 localhost kernel: device ha-8ae9c391-16 entered promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info> [1500637816.7434] manager: (ha-8ae9c391-16): new Generic device (/org/freedesktop/NetworkManager/Devices/1027730)
Jul 21 07:50:16 localhost kernel: device ha-7bba4721-43 left promiscuous mode
Jul 21 07:50:16 localhost kernel: device ha-7bba4721-43 entered promiscuous mode
Jul 21 07:50:16 localhost kernel: device ha-8ae9c391-16 left promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info> [1500637816.8710] manager: (ha-7bba4721-43): new Generic device (/org/freedesktop/NetworkManager/Devices/1027731)
Jul 21 07:50:16 localhost kernel: device ha-4f9e2bce-e1 entered promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info> [1500637816.8896] manager: (ha-4f9e2bce-e1): new Generic device (/org/freedesktop/NetworkManager/Devices/1027732)
Jul 21 07:50:16 localhost kernel: device qr-504e5f81-9a left promiscuous mode
Jul 21 07:50:16 localhost kernel: device qr-504e5f81-9a entered promiscuous mode
Jul 21 07:50:16 localhost kernel: device ha-4f9e2bce-e1 left promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info> [1500637816.9925] manager: (qr-504e5f81-9a): new Generic device (/org/freedesktop/NetworkManager/Devices/1027733)
Jul 21 07:50:17 localhost kernel: device ha-8ae9c391-16 entered promiscuous mode
Jul 21 07:50:17 localhost kernel: device ha-7bba4721-43 left promiscuous mode
Jul 21 07:50:17 localhost NetworkManager[1264]: <info> [1500637817.0129] manager: (ha-8ae9c391-16): new Generic device (/org/freedesktop/NetworkManager/Devices/1027734)
Jul 21 07:50:17 localhost kernel: device ha-7bba4721-43 entered promiscuous mode
Jul 21 07:50:17 localhost NetworkManager[1264]: <info> [1500637817.1254] manager: (ha-7bba4721-43): new Generic device (/org/freedesktop/NetworkManager/Devices/1027735)
Jul 21 07:50:17 localhost kernel: device ha-8ae9c391-16 left promiscuous mode
Jul 21 07:50:17 localhost kernel: device ha-4f9e2bce-e1 entered promiscuous mode
Jul 21 07:50:17 localhost kernel: device qr-504e5f81-9a left promiscuous mode
Jul 21 07:50:17 localhost NetworkManager[1264]: <info> [1500637817.1502] manager: (ha-4f9e2bce-e1): new Generic device (/org/freedesktop/NetworkManager/Devices/1027736)

perf report of ovs-vswitchd:

17.11% ovs-vswitchd [kernel.kallsyms] [k] snmp_fold_field
11.09% ovs-vswitchd [kernel.kallsyms] [k] internal_get_stats
8.08% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
4.17% ovs-vswitchd [kernel.kallsyms] [k] find_next_bit
3.16% ovs-vswitchd libc-2.17.so [.] _int_malloc
2.25% ovs-vswitchd libc-2.17.so [.] _int_free
1.39% ovs-vswitchd [kernel.kallsyms] [k] __memcpy
1.35% ovs-vswitchd libc-2.17.so [.] malloc
1.15% ovs-vswitchd [kernel.kallsyms] [k] osq_lock
1.09% ovs-vswitchd libc-2.17.so [.] __strlen_sse2_pminub
0.98% ovs-vswitchd [kernel.kallsyms] [k] strlen
0.93% ovs-vswitchd [kernel.kallsyms] [k] netlink_broadcast_filtered
0.92% ovs-vswitchd libc-2.17.so [.] __libc_calloc
0.90% ovs-vswitchd [kernel.kallsyms] [k] strcpy
0.74% ovs-vswitchd libc-2.17.so [.] malloc_consolidate
0.68% ovs-vswitchd libpthread-2.17.so [.] pthread_mutex_lock
0.65% ovs-vswitchd libpthread-2.17.so [.] pthread_mutex_unlock
0.64% ovs-vswitchd ovs-vswitchd [.] 0x000000000027f3e7
0.59% ovs-vswitchd [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.56% ovs-vswitchd libc-2.17.so [.] __strcmp_sse42
0.54% ovs-vswitchd [kernel.kallsyms] [k] kobject_uevent_env
0.52% ovs-vswitchd ovs-vswitchd [.] 0x00000000002c58d3
0.52% ovs-vswitchd [kernel.kallsyms] [k] __memset
0.50% ovs-vswitchd [kernel.kallsyms] [k] dst_ifdown

Reviewing the controllers, Rally might not be cleaning everything up; however, the leftover namespaces appear to be qdhcp namespaces (not router namespaces).

[root@overcloud-controller-0 heat-admin]# ip netns | grep qdhcp | wc -l
795
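For anyone else chasing the leak, a minimal sketch of how to inventory the leftover Neutron resources per controller (plain iproute2/ovs-vsctl commands; the qdhcp/qrouter/ha-/qr- prefixes are the ones visible in the logs above):

# Count leftover Neutron namespaces by type
ip netns | grep -c qdhcp
ip netns | grep -c qrouter
# Count keepalived HA and router ports still plugged into br-int
ovs-vsctl list-ports br-int | grep -c '^ha-'
ovs-vsctl list-ports br-int | grep -c '^qr-'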
Created attachment 1302316 [details] perf record of vswitchd
Neutron creates an HA port per router (described in Assaf's blog post [1]) to carry the keepalived traffic. Attempting to diagnose why ovs-vswitchd showed the sharp CPU utilization, I began deleting resources off of the switch. Deleting tap and qr devices had zero impact on the utilization. After deleting all of the ha- ports, I witnessed [2], where CPU utilization returned to normal.

[1] https://assafmuller.com/2014/08/16/layer-3-high-availability/
[2] https://snapshot.raintank.io/dashboard/snapshot/M8NiZW2ooIfqyUnAP7Bo3IpgaT81xtQW?panelId=72&fullscreen&orgId=2
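For reference, a minimal sketch of the diagnostic step described above (destructive - it breaks keepalived for every HA router, so it is only useful for narrowing down the trigger); enumerating ports by the ha- prefix is just one way to pick them out:

# Diagnostic only: remove all keepalived HA ports from br-int
for port in $(ovs-vsctl list-ports br-int | grep '^ha-'); do
    ovs-vsctl --if-exists del-port br-int "$port"
done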
Packet capture on one of the HA ports [root@overcloud-controller-1 neutron]# ip l s up ha-0cc675d6-46 [root@overcloud-controller-1 neutron]# tcpdump -vvv -i ha-0cc675d6-46 tcpdump: listening on ha-0cc675d6-46, link-type EN10MB (Ethernet), capture size 262144 bytes 16:56:25.692546 IP6 (hlim 1, next-header Options (0) payload length: 36) overcloud-controller-1.perf.lab.eng.rdu.redhat.com > ff02::16: HBH (rtalert: 0x0000) (padn) [icmp6 sum ok] ICMP6, multicast listener report v2, 1 group record(s) [gaddr ff02::1:ff1f:6769 to_ex { }] 16:56:26.314133 IP (tos 0xc0, ttl 255, id 8109, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 16:56:28.315246 IP (tos 0xc0, ttl 255, id 8110, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 16:56:30.316924 IP (tos 0xc0, ttl 255, id 8111, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 16:56:32.318545 IP (tos 0xc0, ttl 255, id 8112, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 16:56:34.320108 IP (tos 0xc0, ttl 255, id 8113, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 16:56:36.321738 IP (tos 0xc0, ttl 255, id 8114, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 16:56:38.323299 IP (tos 0xc0, ttl 255, id 8115, offset 0, flags [none], proto VRRP (112), length 40) 169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1 ^C 8 packets captured 8 packets received by filter 0 packets dropped by kernel
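If it helps to confirm that the only data-plane traffic on these ports is keepalived, a narrower capture can be taken with a VRRP-only filter (IP protocol 112); the interface name below is just the example port from the capture above:

# Capture only VRRP advertisements on one HA port
tcpdump -nn -i ha-0cc675d6-46 'ip proto 112'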
Hi,

We need a sosreport from the reproducing system to see what is going on. I am particularly interested in the output of 'ovsdb-tool -m show-log' to review the transaction history.

There have been some bug reports related to adding ports to an OVS bridge, and this sounds like just another consequence of the same bug. It's fixed in 2.7.2, so could you please check whether that resolves the issue?
http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch/2.7.2/1.git20170719.el7fdp/

Please capture the requested data before updating OVS, because it might still be useful even if 2.7.2 doesn't show the issue.

Thanks,
fbl
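A sketch of collecting the requested data (the conf.db path below is the usual RHEL location and may differ on other installs; capture it before any OVS update, as requested above):

# Dump the ovsdb transaction history and grab a sosreport
ovsdb-tool -m show-log /etc/openvswitch/conf.db > /tmp/ovsdb-show-log.txt
sosreport --batch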
fbl - ovsdb-tool output: https://gist.github.com/jtaleric/28d908f289f073c102337d02546f1c8b

I will update to 2.7.2 and re-run today.
Hi,

If the output provided in comment #5 was from after or during reproducing the issue, then I think we can rule out external triggers, like an agent adding and removing ports in a loop.

Please try with our latest 2.7 to see if the fix in upstream helps with this issue as well.

fbl
(In reply to Flavio Leitner from comment #6)
> Hi,
> 
> If the output provided in comment#5 was from after or during reproducing the
> issue, then I think we can rule out external triggers, like an agent adding
> and removing ports in a loop.
> 

Yes - was after reproducing the bug.

> Please try with our latest 2.7 to see if the fix in upstream helps with this
> issue as well.
> 
> fbl

Yup. This was my next step.
(In reply to Flavio Leitner from comment #6)
> Hi,
> 
> If the output provided in comment#5 was from after or during reproducing the
> issue, then I think we can rule out external triggers, like an agent adding
> and removing ports in a loop.
> 
> Please try with our latest 2.7 to see if the fix in upstream helps with this
> issue as well.
> 
> fbl

Updating to OVS 2.7.2 did not help. Disabling l3_ha (setting it to false across the controllers) seems to have helped.

1830-~1930: l3_ha = true
2015-~2027: l3_ha = false

https://snapshot.raintank.io/dashboard/snapshot/btt0hoOW1rjKOSteD0ze2BQFNceEXikK?refresh=10s&panelId=72&fullscreen&orgId=2
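For reference, a sketch of the l3_ha toggle used above, assuming the stock /etc/neutron/neutron.conf path, a non-containerized controller (as in this environment), and that crudini is available; l3_ha only affects routers created after the change:

# Disable HA routers for newly created routers (run on each controller)
crudini --set /etc/neutron/neutron.conf DEFAULT l3_ha False
systemctl restart neutron-server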
The only data-plane difference between HA routers and legacy routers would be the VRRP traffic from keepalived.

@Flavio, do you have the info that you need? This is impacting some OSP 12 CI jobs.
@Assaf / @Flavio - Tracing this a bit more[1], ovs-vswitchd is working fine until neutron delete-router begins. When Rally begins to delete the Neutron router resources, we start seeing issues with keystone : 2017-08-23 02:27:15.563 117667 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Unable to establish connection to http://192.168.24.60:35357/v3/auth/tokens: HTTPConnectionPool(host='192.168.24.60', port=35357): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x866bf90>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)): ConnectFailure: Unable to establish connection to http://192.168.24.60:35357/v3/auth/tokens: HTTPConnectionPool(host='192.168.24.60', port=35357): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x866bf90>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)) Which could be related to : https://bugzilla.redhat.com/show_bug.cgi?id=1470363 Is there something we can add to the neutron-agent to clean up these keepalived/vrrp ports even if keystone/auth fails? [1] https://gist.github.com/jtaleric/21dbd06d011435c80a9f7f2b4a3971ad
Any leads here or ongoing work?
(In reply to Assaf Muller from comment #11)
> Any leads here or ongoing work?

My current understanding is that this is more likely to be a neutron, keystone, or keepalived problem than one in openvswitch. The logs referenced in comment 10 seem to indicate that keepalived is failing to create the sockets used for heartbeat messages; if that results in multiple VRRP masters, it would explain the flow thrashing in ovs-vswitchd. Of course, I have little understanding of the neutron/keystone/keepalived interactions, so it would be good to have someone with more knowledge of these components weigh in.
My guess is that the high CPU utilization is not leaving enough CPU time for other processes including keystone, which is causing timeouts in API requests. I think that the keystone errors are red herrings. We could verify this by running the same workloads on the same OSP version, but swapping out OVS versions. @Joe, can you try downgrading OVS to 2.6 and running the Rally scenarios again?
(In reply to Assaf Muller from comment #14)
> My guess is that the high CPU utilization is not leaving enough CPU time for
> other processes including keystone, which is causing timeouts in API
> requests. I think that the keystone errors are red herrings. We could verify
> this by running the same workloads on the same OSP version, but swapping out
> OVS versions. @Joe, can you try downgrading OVS to 2.6 and running the Rally
> scenarios again?

Hey Assaf - I don't think the CPU utilization spike is what's behind the authentication errors... I think the authentication errors are due to how HAProxy is configured, and to the recent switch from UUID to Fernet tokens (more CPU usage due to crypto):
https://bugzilla.redhat.com/show_bug.cgi?id=1470363

Once I have another OSP 12 deployment I will re-run through this.
Moving this to openstack-neutron as I think it is due to neutron not cleaning up properly.
I have rerun this test without OVS used for the control plane -- previously I had TripleO use ovs_bridge for all of the control plane; I have since switched to using a single bond with a VLAN (i.e., no bridges). I have not been able to recreate this bug.

However, I have noticed that not all resources on the controllers are being cleaned up (all are qdhcp namespaces):

[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:15:39 UTC 2017
1188
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:15:44 UTC 2017
1187
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:15:51 UTC 2017
1175
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:17:33 UTC 2017
1001
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:17:45 UTC 2017
983
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:19:30 UTC 2017
761
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:20:04 UTC 2017
749
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:21:31 UTC 2017
749
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:34:30 UTC 2017
749

No OVS for control plane: https://snapshot.raintank.io/dashboard/snapshot/2qRYNCdiUsPslqFJyjAV0j5TIJga8YH1?refresh=10s&panelId=72&fullscreen&orgId=2
OVS as control plane: https://snapshot.raintank.io/dashboard/snapshot/3c2YAifqcpaa7Vfb3tjAQBcI7ZES4jCS?panelId=72&fullscreen&orgId=2
I have been able to recreate this bug using the non-OVS control-plane environment.

[root@overcloud-controller-0 heat-admin]# ps aux | grep dnsmasq | wc -l
113
[root@overcloud-controller-0 heat-admin]# ip netns | wc -l
895
[root@overcloud-controller-0 heat-admin]#

[root@overcloud-controller-1 heat-admin]# ps aux | grep dnsmaq | wc -l
1
[root@overcloud-controller-1 heat-admin]# ip netns | wc -l
1039

[root@overcloud-controller-2 heat-admin]# ps aux | dnsmasq | wc -l
0
[root@overcloud-controller-2 heat-admin]# ip netns | wc -l
1478

Looking at the utilization: https://snapshot.raintank.io/dashboard/snapshot/Dl9XBj6A22aElPtdsTrhJX7EPUjTpI0b?panelId=72&fullscreen&orgId=2
Issuing:

[root@overcloud-controller-2 heat-admin]# for netns in `ip netns`; do ip netns delete $netns; done

I see: https://snapshot.raintank.io/dashboard/snapshot/MaakwJJ2PhzfW8DdS7Q3krIHYN82geqF?panelId=72&fullscreen&orgId=2

Furthermore, running:

[root@overcloud-controller-[0-2] heat-admin]# for port in `ovs-vsctl show | grep "Port \"tap" | awk '{print $2}' | sed s/\"//g`; do ovs-vsctl del-port $port; done

OVS returns to normal operation: https://snapshot.raintank.io/dashboard/snapshot/myovd1lzDJCCfLlRFR4zY0Pc8Gbc10KG?panelId=72&fullscreen&orgId=2
Continuing to diagnose this, I ran Rally multiple times and cleaned up just controller-0 (I will get into how I cleaned it below). Reviewing the last 3 graphs here [1], we can see the qrouter resources growing on controller-1 and never returning (I would have expected this to drop to zero, like the other controllers, at time 16:00). The qdhcp namespaces are growing out of control... the ovs-agent sets these ports to VLAN 4095, which you can see here [2]. The dhcp-agent has a cleanup function [3] which only runs when the dhcp-agent starts; after issuing a restart to the dhcp-agent, I see these namespaces get cleaned up (as you can see on the controller-0 graph at time 17:08).

Could we recommend running the dhcp-agent cleanup on an interval?

[1] https://snapshot.raintank.io/dashboard/snapshot/Uhoh0lRBHR5uRZJbnxd6XOEVns3C8S7D
[2] https://gist.github.com/jtaleric/4b3f728825cc1cc02edcbcb12f1e0089
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L1384
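On the "cleanup on an interval" idea, a hedged sketch of what that could look like with the existing neutron-netns-cleanup tool (config-file paths are the usual OSP defaults); whether --force is safe to run periodically on a busy controller would need to be confirmed:

# Example /etc/cron.d/neutron-netns-cleanup entry to reap stale namespaces hourly
0 * * * * root /usr/bin/neutron-netns-cleanup --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --force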
On a hunch, I suspected the qdhcp resources were not being cleaned up because they were in some sort of weird state due to [1]. That does seem to be the case: once I bumped inotify.max_user_instances and re-ran the exact same test, the qdhcp resources returned to 0 [2]. This still doesn't explain why restarting the neutron-dhcp-agent cleans up the defunct namespaces while the normal clean-up function doesn't.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1474515
[2] https://snapshot.raintank.io/dashboard/snapshot/W2WPViO3DQ5xc592LPfdRfzk7noSdqdi?panelId=13&fullscreen&orgId=2
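For reference, a sketch of the inotify bump mentioned above (the sysctl key is fs.inotify.max_user_instances; the value is just an example - size it to the number of namespaces/agents on the controller):

# Raise the per-user inotify instance limit and persist it across reboots
sysctl -w fs.inotify.max_user_instances=1024
echo 'fs.inotify.max_user_instances = 1024' > /etc/sysctl.d/98-inotify.conf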
I'd like to take a look; both dhcp and router namespaces should've been deleted gracefully. For example, when a router is removed, this code will get executed [0], and if there are stale namespaces, they would eventually be cleaned up when a fullsync is performed [1], which only happens on startup or if an exception occurs.

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/router_info.py#L420
[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L574
Reviewing the l3-agent logs, we do see:

2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DelPortCommand object at 0x184d6290>] exceeded timeout 10 seconds
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 30, in execute
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     txn.add(self)
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     self.gen.next()
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     self._nested_txn = None
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     timeout=self.timeout)
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DelPortCommand object at 0x184d6290>] exceeded timeout 10 seconds
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command

[root@overcloud-controller-2 neutron]# zcat l3-agent.log-20171108.gz | grep "d TimeoutException" | wc -l
1791
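If these DelPortCommand failures are simply ovsdbapp's default 10-second deadline being too tight while ovs-vswitchd is this busy, one knob to experiment with is the agent-side ovsdb timeout; the option name, section, and value below are assumptions to verify against the installed neutron release:

# Hypothetical tuning: give ovsdb transactions more time before they are abandoned
# (verify the option name/section against this neutron version before using)
crudini --set /etc/neutron/neutron.conf ovs ovsdb_timeout 30
systemctl restart neutron-l3-agent neutron-openvswitch-agent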
Also, due to that, the agent did 149 fullsyncs:

# zgrep "Starting fullsync periodic_sync_routers_task" l3-agent*gz | wc -l
149

Example: attempt to delete port qr-c5d287ff-30 from OVS and the associated device:

l3-agent.log-20171108.gz:2017-11-07 13:21:09.587 124640 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelPortCommand(if_exists=True, bridge=br-int, port=qr-c5d287ff-30) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
l3-agent.log-20171108.gz:2017-11-07 13:21:09.656 124640 DEBUG neutron.agent.l3.router_info [-] Deleting stale internal router device: qr-c5d287ff-30 _process_internal_ports /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:603
l3-agent.log-20171108.gz:2017-11-07 13:22:37.646 124640 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelPortCommand(if_exists=True, bridge=br-int, port=qr-c5d287ff-30) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
l3-agent.log-20171108.gz:2017-11-07 13:26:49.033 124640 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelPortCommand(if_exists=True, bridge=br-int, port=qr-c5d287ff-30) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84

This port was added and removed by OVS tons of times:

[root@overcloud-controller-2 neutron]# !43
zgrep c5d287ff ../openvswitch/ovs-vswitchd.log-20171108.gz | grep added | wc -l
492
[root@overcloud-controller-2 neutron]# !44
zgrep c5d287ff ../openvswitch/ovs-vswitchd.log-20171108.gz | grep deleted | wc -l
491

And finally, the port remained there:

[root@overcloud-controller-2 neutron]# zgrep c5d287ff ../openvswitch/ovs-vswitchd.log-20171108.gz
[...]
2017-11-07T13:38:56.166Z|57485|bridge|INFO|bridge br-int: deleted interface qr-c5d287ff-30 on port 11370
2017-11-07T13:38:56.245Z|57489|bridge|INFO|bridge br-int: added interface qr-c5d287ff-30 on port 11370

A lot of stale ports remain after the test run:

[root@overcloud-controller-2 neutron]# ovs-vsctl show | grep Port | wc -l
1340

Also, the ovsdb-server logs are full of connection errors:

2017-11-07T13:46:10.509Z|01018|reconnect|ERR|tcp:127.0.0.1:41730: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:18.193Z|01019|reconnect|ERR|tcp:127.0.0.1:42654: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:21.124Z|01020|reconnect|ERR|tcp:127.0.0.1:43234: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:29.333Z|01021|reconnect|ERR|tcp:127.0.0.1:43796: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:34.055Z|01022|reconnect|ERR|tcp:127.0.0.1:44102: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:40.800Z|01023|reconnect|ERR|tcp:127.0.0.1:46852: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:48.279Z|01024|reconnect|ERR|tcp:127.0.0.1:48370: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:53.494Z|01025|reconnect|ERR|tcp:127.0.0.1:49050: no response to inactivity probe after 5 seconds, disconnecting

This suggests that the neutron-openvswitch-agent is busy and fails to respond to the inactivity probes from the server, so it gets disconnected. We could increase the inactivity probe to a higher value (15s?):

ovs-vsctl -- set controller <_uuid> inactivity_probe=15000

But this would only alleviate the problem, since the root cause is the ovs-agent being "stuck".
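A sketch of applying the suggested probe increase to every Controller record at once (15000 ms is the value proposed above; parsing the UUIDs out of 'ovs-vsctl list controller' is just one way to enumerate the records):

# Raise the OpenFlow controller inactivity probe to 15s for all bridges
for uuid in $(ovs-vsctl list controller | awk '/^_uuid/ {print $3}'); do
    ovs-vsctl set controller "$uuid" inactivity_probe=15000
done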
The below iteration of rpc_loop in ovs agent (where most of the test happens) took 1175 seconds (~20 minutes) to complete: 2017-11-07 13:36:47.955 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:0 - port information retrieved. Elapsed:17.885 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2062 2017-11-07 13:56:05.623 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:0 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 1321, 'removed': 0}}. Elapsed:1175.553 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1804 This iteration above took 20 minutes but right before, the agent crashed and had to perform a full sync which lead to those so many errors. The reason why it crashed is: 2017-11-07 13:35:54.877 597682 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection reset by peer): RuntimeError: OVS transaction timed out 2017-11-07 13:35:54.895 597682 CRITICAL neutron [-] Unhandled error: TypeError: int() can't convert non-string with explicit base 2017-11-07 13:35:54.895 597682 CRITICAL neutron [-] Unhandled error: TypeError: int() can't convert non-string with explicit base 2017-11-07 13:35:54.895 597682 ERROR neutron Traceback (most recent call last): 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 10, in <module> 2017-11-07 13:35:54.895 597682 ERROR neutron sys.exit(main()) 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main 2017-11-07 13:35:54.895 597682 ERROR neutron agent_main.main() 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 51, in main 2017-11-07 13:35:54.895 597682 ERROR neutron mod.main() 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/main.py", line 35, in main 2017-11-07 13:35:54.895 597682 ERROR neutron 'neutron.plugins.ml2.drivers.openvswitch.agent.' 
2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 375, in run_apps 2017-11-07 13:35:54.895 597682 ERROR neutron hub.joinall(services) 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 103, in joinall 2017-11-07 13:35:54.895 597682 ERROR neutron t.wait() 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait 2017-11-07 13:35:54.895 597682 ERROR neutron return self._exit_event.wait() 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait 2017-11-07 13:35:54.895 597682 ERROR neutron current.throw(*self._exc) 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main 2017-11-07 13:35:54.895 597682 ERROR neutron result = function(*args, **kwargs) 2017-11-07 13:35:54.895 597682 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 65, in _launch 2017-11-07 13:35:54.895 597682 ERROR neutron raise e 2017-11-07 13:35:54.895 597682 ERROR neutron TypeError: int() can't convert non-string with explicit base 2017-11-07 13:35:54.895 597682 ERROR neutron And then, systemd starts it again: 2017-11-07 13:35:58.155 17401 INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 11.0.2.dev2 During that period we can see timeout messages and connection drops to ovsdb: [root@overcloud-controller-2 neutron]# grep "connection dropped" bad_iteration.log | wc -l 56 [root@overcloud-controller-2 neutron]# grep "exceeded timeout" bad_iteration.log | wc -l 10 When the agent restarts it has to process 2642 devices at that moment: 17-11-07 13:36:47.956 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Starting to process devices in:{'current': set([u'77072f78-e354-4598-90bc-2a01607f9809', u'34a7559b-aeef-4980-892d-5450cf4fdc2b', u'0271e328-0727-4f83-800f-e67bc161b776', u'52625ba1-6fc2-49f9-83df-fe18d0c136cc', u'7291bb17-575b-4700-92d2-cd85ad08da6a', u'f82af294-68b6-4963-90f2-b18d80f90be4', u'10896f5e-b470-47b5-81fe-fdb98caa93bf', u'00339cff-369b-4460-80bb-44d43c3fb8da', u'488e4157-f1e9-4f02-9cd0-dd418ced9564', u'd1e81972-de50-4c8c-96b5-8d .... [root@overcloud-controller-2 neutron]# grep "Starting to proc" bad_iteration.log | grep -o "u'" | wc -l 2642 Processing all of them takes around 20 minutes (basically this is a full sync) and also leaves resources (OVS ports mainly) behind. I'll try to dig further in the exception that caused the OVS agent to restart so that we can avoid a full sync while the test is running. However, the fact that the agent takes so long to do a full sync with so many ports is due to its current design. I'm pretty sure that we can optimize it but this will require a lot of work.
Further update after a closer look to logs: During the rpc loop execution of the OVS agent, 2642 ports are to be processed and process_network_ports() is called at [0] which gets first called at around: 2017-11-07 13:36:47 As the function is processing the devices, a timeout exception occurs and it doesn't complete: 2017-11-07 13:56:05.619 17401 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent RuntimeError: ofctl request version=0x4,msg_type=0xe,msg_len=0x40,xid=0x12f844db,OFPFlowMod(buffer_id=4294967295,command=0,cookie=7394935990473861018L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],match=OFPMatch(oxm_fields={'in_port': 11484}),out_group=0,out_port=0,priority=2,table_id=0) timed out This exception is raised and the rpc loop iteration finishes: 2017-11-07 13:56:05.619 17401 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Error while processing VIF ports: RuntimeError: ofctl request version=0x4,msg_type=0xe,msg_len=0x40,xid=0x12f844db,OFPFlowMod(buffer_id=4294967295,command=0,cookie=7394935990473861018L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],match=OFPMatch(oxm_fields={'in_port': 11484}),out_group=0,out_port=0,priority=2,table_id=0) timed out 2017-11-07 13:56:05.623 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:0 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 1321, 'removed': 0}}. Elapsed:1175.553 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1804 After this, coming iterations won't succeed because they're unable to communicate with OVS: 2017-11-07 13:59:03.665 17401 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Failed to communicate with the switch: RuntimeError: ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0xae6014dc,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) timed out 2017-11-07 13:59:03.666 17401 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically. And keeps like this until more than 8 hours later: 2017-11-07 22:25:54.633 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:839 - star ting polling. Elapsed:36.183 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2039 2017-11-07 22:25:54.633 17401 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent out of sync with plugin! 2017-11-07 22:25:54.634 17401 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): ListPortsCommand(bridge=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/b ackend/ovs_idl/transaction.py:84 2017-11-07 22:29:38.318 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:839 - port information retrieved. 
Elapsed:259.868 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2062 2017-11-07 22:41:03.055 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] process_network_ports - iteration:839 - treat_devices_added_or_updated completed. Skipped 0 devices of 1321 devices currently available. Time elapsed: 684.726 process_network_ports /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1663 2017-11-07 22:41:20.534 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:839 completed. Processed ports statistics: {'regular': {'updated': 637, 'added': 1321, 'removed': 0}}. Elapsed:962.084 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1804 Iteration 839 took 962 seconds (16 minutes) and looks that it completes successfully. I can't really tell why it stayed like that until 22:41 but in ovs-vswitchd logs we can see that it's adding and deleting devices all the time (DEBUG level would help here): 2017-11-07T20:56:00.688Z|986468|bridge|INFO|bridge br-int: added interface ha-a1a195a9-c9 on port 12043 2017-11-07T20:56:00.735Z|986469|bridge|INFO|bridge br-int: added interface ha-8139fe8c-ed on port 12048 2017-11-07T20:56:00.759Z|986470|bridge|INFO|bridge br-int: added interface qr-63c9d5a3-1f on port 11368 2017-11-07T20:56:00.844Z|986471|bridge|INFO|bridge br-int: added interface ha-a7d06805-bf on port 11098 2017-11-07T20:56:00.984Z|986472|bridge|INFO|bridge br-int: deleted interface ha-87309246-d8 on port 12043 2017-11-07T20:56:00.985Z|986473|bridge|INFO|bridge br-int: deleted interface ha-b5a217b0-48 on port 11098 2017-11-07T20:56:00.986Z|986474|bridge|INFO|bridge br-int: deleted interface ha-0d126e69-92 on port 12048 2017-11-07T20:56:00.987Z|986475|bridge|INFO|bridge br-int: deleted interface qr-4b19305a-77 on port 11368 2017-11-07T20:56:01.025Z|986476|bridge|INFO|bridge br-int: added interface ha-87309246-d8 on port 12043 2017-11-07T20:56:01.111Z|986477|bridge|INFO|bridge br-int: added interface ha-b5a217b0-48 on port 11098 2017-11-07T20:56:01.169Z|986478|bridge|INFO|bridge br-int: added interface ha-0d126e69-92 on port 12048 2017-11-07T20:56:01.207Z|986479|bridge|INFO|bridge br-int: added interface qr-4b19305a-77 on port 11368 2017-11-07T20:56:01.348Z|986480|bridge|INFO|bridge br-int: deleted interface ha-a1a195a9-c9 on port 12043 2017-11-07T20:56:01.351Z|986481|bridge|INFO|bridge br-int: deleted interface ha-8139fe8c-ed on port 12048 2017-11-07T20:56:01.352Z|986482|bridge|INFO|bridge br-int: deleted interface qr-63c9d5a3-1f on port 11368 2017-11-07T20:56:01.353Z|986483|bridge|INFO|bridge br-int: deleted interface ha-a7d06805-bf on port 11098 Still now we can see those traces... tcpdump on port 6633 reveals that it's neutron-openvswitch-agent the one adding and deleting interfaces all the time. 
I'm not sure as why it was able to complete the sync 8 hours later but it's indeed related with l3 agent logs as these errors can be seen until 22:21h being the last error the following one: 2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DelPortCommand object at 0x184d6290>] exceeded timeou t 10 seconds 2017-11-07 22:12:13.221 124640 DEBUG neutron.agent.l3.router_info [-] process router delete process_delete /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:1099 After this ^, the l3 agent idles and everything seems fine (Except for the add/delete loop). This add/delete loop persist now even though I stopped all neutron agents: 2017-11-10T13:40:09.568Z|6103901|jsonrpc|DBG|unix:/var/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"lock":"ovs_vswitchd","op":"assert"},{"where":[["_uuid","==",["uuid","54208b80-2af8-4393-8a44-d6bd51c4606b"]]],"row":{"mac_in_use":"9a:4f:fe:ad:f5:ea","ifindex":2960482},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","70deb264-70cc-46bd-b9c7-ea8c451d03cf"]]],"row":{"mac_in_use":"ba:65:56:71:58:64","ifindex":2960481},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","d6007dca-f989-4c94-b1b5-9cd9b4ff44de"]]],"row":{"mac_in_use":"ce:d1:92:7b:e7:8d","ifindex":2960479},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","2e60fcbe-6b75-406f-be07-787b345aeba0"]]],"row":{"mac_in_use":"6a:2c:ec:23:5c:42","ifindex":2960480},"op":"update","table":"Interface"}], id=1317397 2017-11-10T13:40:09.916Z|6103933|jsonrpc|DBG|unix:/var/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"lock":"ovs_vswitchd","op":"assert"},{"where":[["_uuid","==",["uuid","3a680741-02b8-4e63-a018-827c60071d11"]]],"row":{"mac_in_use":"3a:9c:05:55:c2:ec","ifindex":2960486},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","15a84122-a3fb-4fc1-b5bc-cd918f50880d"]]],"row":{"mac_in_use":"f2:7a:37:5d:52:33","ifindex":2960483},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","7dbf8616-cddd-4d09-9363-4d1d68823a52"]]],"row":{"mac_in_use":"62:39:a7:d6:5a:5f","ifindex":2960484},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","20a86e8f-090f-49c5-9bf0-ddcc1032bd3d"]]],"row":{"mac_in_use":"fe:f2:c5:bc:40:2e","ifindex":2960485},"op":"update","table":"Interface"}], id=1317398 2017-11-10T13:40:09.923Z|6103934|jsonrpc|DBG|unix:/var/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"lock":"ovs_vswitchd","op":"assert"},{"where":[["_uuid","==",["uuid","3a680741-02b8-4e63-a018-827c60071d11"]]],"row":{"mac_in_use":"3a:9c:05:55:c2:ec","ifindex":2960486},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","15a84122-a3fb-4fc1-b5bc-cd918f50880d"]]],"row":{"mac_in_use":"f2:7a:37:5d:52:33","ifindex":2960483},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","7dbf8616-cddd-4d09-9363-4d1d68823a52"]]],"row":{"mac_in_use":"62:39:a7:d6:5a:5f","ifindex":2960484},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","20a86e8f-090f-49c5-9bf0-ddcc1032bd3d"]]],"row":{"mac_in_use":"fe:f2:c5:bc:40:2e","ifindex":2960485},"op":"update","table":"Interface"}], id=1317399 A watch on the added/deleted interfaces reveals that the MAC address of them keeps changing... It is ovs-vswitchd the one responsible for that since noone else is connected to ovsdb-server. 
Restarting openvswitch fixed it and those interfaces are no longer added/deleted. Maybe the OVS folks can take a look at this weird behavior.

[root@overcloud-controller-2 neutron]# rpm -qa | grep openvswitch
openvswitch-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
python-openvswitch-2.7.2-4.git20170719.el7fdp.noarch
openstack-neutron-openvswitch-11.0.2-0.20170926082939.5b0191f.el7ost.noarch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64

[0] https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2082
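Since the agent-side failures above include native OpenFlow (ofctl) request timeouts as well as ovsdb ones, another knob that may be worth experimenting with is the OVS agent's OpenFlow timeouts; I believe these are of_connect_timeout / of_request_timeout in openvswitch_agent.ini, but treat the names and section as assumptions to verify for this release:

# Hypothetical tuning for the native OpenFlow interface timeouts
# (option names assumed; confirm against the installed neutron-openvswitch-agent)
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs of_request_timeout 30
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs of_connect_timeout 60
systemctl restart neutron-openvswitch-agent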
(In reply to Daniel Alvarez Sanchez from comment #27)
> Still now we can see those traces... tcpdump on port 6633 reveals that it's
> neutron-openvswitch-agent the one adding and deleting interfaces all the
> time.

Edit: it was ovs-vswitchd and not the ovs agent. Tcpdump showed the updates going to the agent's monitors and I got confused. After that I stopped all neutron agents and devices were still being added/deleted in OVS by ovs-vswitchd (the dbg traces revealed that).
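To pin down which ovsdb client is driving the Interface updates while the loop is happening, a sketch of the checks that can be run (the db.sock path is the one shown in the jsonrpc traces above):

# See which processes hold connections to ovsdb-server (unix socket and tcp:6640)
ss -xnp | grep db.sock
ss -tnp | grep 6640
# Log every jsonrpc transaction with its source, plus bridge reconfiguration detail
ovs-appctl -t ovsdb-server vlog/set jsonrpc:file:dbg
ovs-appctl -t ovs-vswitchd vlog/set bridge:file:dbg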
This upstream bug may be related to ovsdb timeouts.
Eh, forgot the link to the bug: https://bugs.launchpad.net/neutron/+bug/1627106
One way to test whether this is related to the native ovsdb interface (for instance, deleting interfaces can be slower with it) would be to set ovsdb_interface=vsctl and see if the problem still occurs.
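For reference, a sketch of switching the agent to the vsctl driver for an A/B test (openvswitch_agent.ini path as in a default OSP install, crudini assumed available):

# Fall back to the vsctl-based ovsdb driver to compare behaviour
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs ovsdb_interface vsctl
systemctl restart neutron-openvswitch-agent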
We need to understand why vswitchd keeps doing updates while nothing appears to be running that would push them in.
Hi Anil - can you try and reproduce this with the latest version of OSP 12 that has OVS 2.9? Thanks
Hi Anil, did you have a chance to test this yet?
Recently Joe ran create-list-routers on ML2/OVS on OSP 13 and I don't think he saw any issue with Neutron resources not being cleaned up. Do you still want me to run this on OSP 12? (I don't have a lab setup right now; if this bug is high priority I can request one.)

Thanks,
Anil
Let me ask for a setup to test this. Thanks Anil
Yes, tests are running fine on OSP 13 with OVS 2.9. We can close this. Thanks, Brian.