Bug 1473654 - ovs-vswitchd sharp CPU utilization due to Neutron not cleaning up resources
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 12.0 (Pike)
Priority: high
Severity: high
Assigned To: Assaf Muller
QA Contact: Toni Freger
Keywords: AutomationBlocker, Triaged
Reported: 2017-07-21 07:54 EDT by Joe Talerico
Modified: 2017-11-15 13:09 EST

Type: Bug

Attachments:
perf record of vswitchd (18.58 MB, application/octet-stream), attached 2017-07-21 07:55 EDT by Joe Talerico


External Trackers:
Launchpad 1731494 (last updated 2017-11-10 10:12 EST)

Description Joe Talerico 2017-07-21 07:54:26 EDT
Description of problem:
ovs-vswitchd CPU utilization jumps ~50x during our normal Rally API tests (control plane only, no data plane) [1].

This can begin to break down the cloud: I have seen services like RabbitMQ lose their heartbeats, become partitioned, and the rest is history.


[1] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?from=1500545668722&to=1500547429680&orgId=1&var-Cloud=openstack-12-ha-noncontainer&var-NodeType=*&var-Node=overcloud-controller-0&var-Interface=interface-br-ex&var-Disk=disk-sda&var-OSD=All&var-Pool=All&var-cpus0=All&var-cpus00=All&panelId=78&fullscreen

Version-Release number of selected component (if applicable):
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.7.0-8.git20170530.el7fdb.x86_64
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64
openvswitch-ovn-host-2.7.0-8.git20170530.el7fdb.x86_64
openvswitch-ovn-common-2.7.0-8.git20170530.el7fdb.x86_64
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-openvswitch-2.7.0-8.git20170530.el7fdb.noarch
[root@overcloud-controller-0 heat-admin]# 


How reproducible:
100%

Steps to Reproduce:
1. With a 3x controller HA deployment, execute the Neutron Rally create-list-routers scenario (see the sketch below).
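For reference, a rough sketch of the Rally invocation, assuming the stock NeutronNetworks.create_and_list_routers scenario; the task file name, iteration count, concurrency, and context values below are illustrative rather than the exact ones used for this run:

cat > create-list-routers.json <<'EOF'
{
    "NeutronNetworks.create_and_list_routers": [
        {
            "runner": {"type": "constant", "times": 500, "concurrency": 32},
            "context": {"users": {"tenants": 3, "users_per_tenant": 2}}
        }
    ]
}
EOF
rally task start create-list-routers.json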

Actual results:
ovs-vswitchd CPU utilization spikes ~50x; see the Grafana dashboard linked in the description [1].

Expected results:
CPU utilization does not spike.


Additional info:
From vswitchd log (tail):
2017-07-21T11:51:14.589Z|2164542|bridge|INFO|bridge br-int: deleted interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.615Z|2164543|bridge|INFO|bridge br-int: added interface ha-7bba4721-43 on port 995
2017-07-21T11:51:14.639Z|2164544|bridge|INFO|bridge br-int: added interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.713Z|2164545|bridge|INFO|bridge br-int: deleted interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:14.715Z|2164546|bridge|INFO|bridge br-int: deleted interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:14.743Z|2164547|bridge|INFO|bridge br-int: added interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:14.763Z|2164548|bridge|INFO|bridge br-int: added interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:14.833Z|2164549|bridge|INFO|bridge br-int: deleted interface ha-7bba4721-43 on port 995
2017-07-21T11:51:14.835Z|2164550|bridge|INFO|bridge br-int: deleted interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.858Z|2164551|bridge|INFO|bridge br-int: added interface ha-7bba4721-43 on port 995
2017-07-21T11:51:14.885Z|2164552|bridge|INFO|bridge br-int: added interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:14.961Z|2164553|bridge|INFO|bridge br-int: deleted interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:14.963Z|2164554|bridge|INFO|bridge br-int: deleted interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:14.987Z|2164555|bridge|INFO|bridge br-int: added interface qr-504e5f81-9a on port 1008
2017-07-21T11:51:15.012Z|2164556|bridge|INFO|bridge br-int: added interface ha-8ae9c391-16 on port 995
2017-07-21T11:51:15.081Z|2164557|bridge|INFO|bridge br-int: deleted interface ha-7bba4721-43 on port 995
2017-07-21T11:51:15.083Z|2164558|bridge|INFO|bridge br-int: deleted interface ha-4f9e2bce-e1 on port 1008
2017-07-21T11:51:15.108Z|2164559|bridge|INFO|bridge br-int: added interface ha-7bba4721-43 on port 995

From /var/log/messages (tail):
Jul 21 07:50:16 localhost NetworkManager[1264]: <info>  [1500637816.7172] manager: (qr-504e5f81-9a): new Generic device (/org/freedesktop/NetworkManager/Devices/1027729)
Jul 21 07:50:16 localhost kernel: device ha-8ae9c391-16 entered promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info>  [1500637816.7434] manager: (ha-8ae9c391-16): new Generic device (/org/freedesktop/NetworkManager/Devices/1027730)
Jul 21 07:50:16 localhost kernel: device ha-7bba4721-43 left promiscuous mode
Jul 21 07:50:16 localhost kernel: device ha-7bba4721-43 entered promiscuous mode
Jul 21 07:50:16 localhost kernel: device ha-8ae9c391-16 left promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info>  [1500637816.8710] manager: (ha-7bba4721-43): new Generic device (/org/freedesktop/NetworkManager/Devices/1027731)
Jul 21 07:50:16 localhost kernel: device ha-4f9e2bce-e1 entered promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info>  [1500637816.8896] manager: (ha-4f9e2bce-e1): new Generic device (/org/freedesktop/NetworkManager/Devices/1027732)
Jul 21 07:50:16 localhost kernel: device qr-504e5f81-9a left promiscuous mode
Jul 21 07:50:16 localhost kernel: device qr-504e5f81-9a entered promiscuous mode
Jul 21 07:50:16 localhost kernel: device ha-4f9e2bce-e1 left promiscuous mode
Jul 21 07:50:16 localhost NetworkManager[1264]: <info>  [1500637816.9925] manager: (qr-504e5f81-9a): new Generic device (/org/freedesktop/NetworkManager/Devices/1027733)
Jul 21 07:50:17 localhost kernel: device ha-8ae9c391-16 entered promiscuous mode
Jul 21 07:50:17 localhost kernel: device ha-7bba4721-43 left promiscuous mode
Jul 21 07:50:17 localhost NetworkManager[1264]: <info>  [1500637817.0129] manager: (ha-8ae9c391-16): new Generic device (/org/freedesktop/NetworkManager/Devices/1027734)
Jul 21 07:50:17 localhost kernel: device ha-7bba4721-43 entered promiscuous mode
Jul 21 07:50:17 localhost NetworkManager[1264]: <info>  [1500637817.1254] manager: (ha-7bba4721-43): new Generic device (/org/freedesktop/NetworkManager/Devices/1027735)
Jul 21 07:50:17 localhost kernel: device ha-8ae9c391-16 left promiscuous mode
Jul 21 07:50:17 localhost kernel: device ha-4f9e2bce-e1 entered promiscuous mode
Jul 21 07:50:17 localhost kernel: device qr-504e5f81-9a left promiscuous mode
Jul 21 07:50:17 localhost NetworkManager[1264]: <info>  [1500637817.1502] manager: (ha-4f9e2bce-e1): new Generic device (/org/freedesktop/NetworkManager/Devices/1027736)


perf report of ovs-vswitchd:

  17.11%  ovs-vswitchd    [kernel.kallsyms]   [k] snmp_fold_field                                                                                                                             
  11.09%  ovs-vswitchd    [kernel.kallsyms]   [k] internal_get_stats                                                                                                                          
   8.08%  ovs-vswitchd    [kernel.kallsyms]   [k] mutex_spin_on_owner                                                                                                                         
   4.17%  ovs-vswitchd    [kernel.kallsyms]   [k] find_next_bit                                                                                                                               
   3.16%  ovs-vswitchd    libc-2.17.so        [.] _int_malloc                                                                                                                                 
   2.25%  ovs-vswitchd    libc-2.17.so        [.] _int_free                                                                                                                                   
   1.39%  ovs-vswitchd    [kernel.kallsyms]   [k] __memcpy                                                                                                                                    
   1.35%  ovs-vswitchd    libc-2.17.so        [.] malloc                                                                                                                                      
   1.15%  ovs-vswitchd    [kernel.kallsyms]   [k] osq_lock                                                                                                                                    
   1.09%  ovs-vswitchd    libc-2.17.so        [.] __strlen_sse2_pminub                                                                                                                        
   0.98%  ovs-vswitchd    [kernel.kallsyms]   [k] strlen                                                                                                                                      
   0.93%  ovs-vswitchd    [kernel.kallsyms]   [k] netlink_broadcast_filtered                                                                                                                  
   0.92%  ovs-vswitchd    libc-2.17.so        [.] __libc_calloc                                                                                                                               
   0.90%  ovs-vswitchd    [kernel.kallsyms]   [k] strcpy                                                                                                                                      
   0.74%  ovs-vswitchd    libc-2.17.so        [.] malloc_consolidate                                                                                                                          
   0.68%  ovs-vswitchd    libpthread-2.17.so  [.] pthread_mutex_lock                                                                                                                          
   0.65%  ovs-vswitchd    libpthread-2.17.so  [.] pthread_mutex_unlock                                                                                                                        
   0.64%  ovs-vswitchd    ovs-vswitchd        [.] 0x000000000027f3e7                                                                                                                          
   0.59%  ovs-vswitchd    [kernel.kallsyms]   [k] _raw_spin_lock_irqsave                                                                                                                      
   0.56%  ovs-vswitchd    libc-2.17.so        [.] __strcmp_sse42                                                                                                                              
   0.54%  ovs-vswitchd    [kernel.kallsyms]   [k] kobject_uevent_env                                                                                                                          
   0.52%  ovs-vswitchd    ovs-vswitchd        [.] 0x00000000002c58d3                                                                                                                          
   0.52%  ovs-vswitchd    [kernel.kallsyms]   [k] __memset                                                                                                                                    
   0.50%  ovs-vswitchd    [kernel.kallsyms]   [k] dst_ifdown      

Reviewing the controllers, Rally might not be cleaning up resources; however, the leftovers appear to be qdhcp namespaces (not router namespaces).

[root@overcloud-controller-0 heat-admin]# ip netns | grep qdhcp | wc -l
795
Comment 1 Joe Talerico 2017-07-21 07:55 EDT
Created attachment 1302316 [details]
perf record of vswitchd
Comment 2 Joe Talerico 2017-07-24 11:34:00 EDT
Neutron creates an HA port per router (see Assaf's blog post [1]) to carry keepalived/VRRP traffic.

While trying to diagnose why ovs-vswitchd showed the sharp CPU utilization, I began deleting resources off of the switch. Deleting tap and qr devices had zero impact on the utilization. After deleting all of the HA ports (roughly as sketched at the end of this comment), I saw [2]: CPU utilization returned to normal.

[1] https://assafmuller.com/2014/08/16/layer-3-high-availability/

[2] https://snapshot.raintank.io/dashboard/snapshot/M8NiZW2ooIfqyUnAP7Bo3IpgaT81xtQW?panelId=72&fullscreen&orgId=2
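The HA-port deletion described above can be done along these lines; this is only a sketch, assuming every keepalived HA port sits on br-int and is named ha-*, and it tears down the VRRP transport for every HA router on the node, so it is a diagnostic step rather than a fix:

for port in $(ovs-vsctl list-ports br-int | grep '^ha-'); do
    ovs-vsctl del-port br-int "$port"
done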
Comment 3 Joe Talerico 2017-07-25 13:00:22 EDT
Packet capture on one of the HA ports

[root@overcloud-controller-1 neutron]# ip l s up ha-0cc675d6-46
[root@overcloud-controller-1 neutron]# tcpdump -vvv -i ha-0cc675d6-46                                                                       
tcpdump: listening on ha-0cc675d6-46, link-type EN10MB (Ethernet), capture size 262144 bytes
16:56:25.692546 IP6 (hlim 1, next-header Options (0) payload length: 36) overcloud-controller-1.perf.lab.eng.rdu.redhat.com > ff02::16: HBH (rtalert: 0x0000) (padn) [icmp6 sum ok] ICMP6, multicast listener report v2, 1 group record(s) [gaddr ff02::1:ff1f:6769 to_ex { }]
16:56:26.314133 IP (tos 0xc0, ttl 255, id 8109, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
16:56:28.315246 IP (tos 0xc0, ttl 255, id 8110, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
16:56:30.316924 IP (tos 0xc0, ttl 255, id 8111, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
16:56:32.318545 IP (tos 0xc0, ttl 255, id 8112, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
16:56:34.320108 IP (tos 0xc0, ttl 255, id 8113, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
16:56:36.321738 IP (tos 0xc0, ttl 255, id 8114, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
16:56:38.323299 IP (tos 0xc0, ttl 255, id 8115, offset 0, flags [none], proto VRRP (112), length 40)
    169.254.192.10 > 224.0.0.18: vrrp 169.254.192.10 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20, addrs: 169.254.0.1
^C
8 packets captured
8 packets received by filter
0 packets dropped by kernel
Comment 4 Flavio Leitner 2017-07-26 10:27:23 EDT
Hi,

We need a sosreport from the reproducing system to see what is going on.
I am particularly interested in the output of 'ovsdb-tool -m show-log' to review the transactions history.

There have been some bug reports related to adding ports to an OVS bridge, and this sounds like another consequence of the same bug. It's fixed in 2.7.2, so could you please check whether that resolves the issue?

http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch/2.7.2/1.git20170719.el7fdp/

Please capture the requested data before updating OVS because it might still be useful even if 2.7.2 doesn't show the issue.

Thanks,
fbl
Comment 5 Joe Talerico 2017-08-03 07:27:20 EDT
fbl - ovsdb-tool output: https://gist.github.com/jtaleric/28d908f289f073c102337d02546f1c8b

I will update to 2.7.2 and re-run today.
Comment 6 Flavio Leitner 2017-08-03 08:58:49 EDT
Hi,

If the output provided in comment#5 was from after or during reproducing the issue, then I think we can rule out external triggers, like an agent adding and removing ports in a loop.

Please try with our latest 2.7 to see if the fix in upstream helps with this issue as well.

fbl
Comment 7 Joe Talerico 2017-08-03 10:57:51 EDT
(In reply to Flavio Leitner from comment #6)
> Hi,
> 
> If the output provided in comment#5 was from after or during reproducing the
> issue, then I think we can rule out external triggers, like an agent adding
> and removing ports in a loop.
>

Yes - was after reproducing the bug.
 
> Please try with our latest 2.7 to see if the fix in upstream helps with this
> issue as well.
> 
> fbl

Yup. This was my next step.
Comment 8 Joe Talerico 2017-08-04 16:28:02 EDT
(In reply to Flavio Leitner from comment #6)
> Hi,
> 
> If the output provided in comment#5 was from after or during reproducing the
> issue, then I think we can rule out external triggers, like an agent adding
> and removing ports in a loop.
> 
> Please try with our latest 2.7 to see if the fix in upstream helps with this
> issue as well.
> 
> fbl

Updating to OVS 2.7.2 did not help.

Disabling l3_ha (setting it to false across controllers) seems to have helped.

18:30-~19:30 is l3_ha:true
20:15-~20:27 is l3_ha:false

https://snapshot.raintank.io/dashboard/snapshot/btt0hoOW1rjKOSteD0ze2BQFNceEXikK?refresh=10s&panelId=72&fullscreen&orgId=2
Comment 9 Assaf Muller 2017-08-23 08:07:43 EDT
The only data-plane difference between HA routers and legacy routers would be the VRRP traffic from keepalived.

@Flavio do you have the info that you need? This is impacting some OSP 12 CI jobs.
Comment 10 Joe Talerico 2017-08-23 14:52:07 EDT
@Assaf / @Flavio - Tracing this a bit more [1], ovs-vswitchd is working fine until the Neutron router deletes begin.

When Rally begins to delete the Neutron router resources, we start seeing issues with keystone:

2017-08-23 02:27:15.563 117667 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Unable to establish connection to http://192.168.24.60:35357/v3/auth/tokens: HTTPConnectionPool(host='192.168.24.60', port=35357): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x866bf90>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)): ConnectFailure: Unable to establish connection to http://192.168.24.60:35357/v3/auth/tokens: HTTPConnectionPool(host='192.168.24.60', port=35357): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x866bf90>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))

Which could be related to : https://bugzilla.redhat.com/show_bug.cgi?id=1470363

Is there something we can add to the neutron agent to clean up these keepalived/VRRP ports even if keystone/auth fails?

[1] https://gist.github.com/jtaleric/21dbd06d011435c80a9f7f2b4a3971ad
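A quick way to spot this kind of leak on a controller, assuming the keepalived HA ports all sit on br-int and the router namespaces use the usual qrouter- prefix (each HA router hosted on a node should have one ha- port and one qrouter namespace, so the two counts should stay roughly in step):

ovs-vsctl list-ports br-int | grep -c '^ha-'
ip netns | grep -c '^qrouter-'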
Comment 11 Assaf Muller 2017-08-28 09:42:40 EDT
Any leads here or ongoing work?
Comment 12 Lance Richardson 2017-08-28 09:55:29 EDT
(In reply to Assaf Muller from comment #11)
> Any leads here or ongoing work?

My current understanding is that this is more likely to be a neutron, keystone,
or keepalived problem than an openvswitch one. The logs referenced in comment
10 seem to indicate that keepalived is failing to create the sockets used for
heartbeat messages; if that results in multiple VRRP masters, it would explain
the flow thrashing in ovs-vswitchd. Of course I have little understanding
of the neutron/keystone/keepalived interactions, so it would be good to have
someone with more knowledge of these components weigh in.
Comment 14 Assaf Muller 2017-09-05 07:46:44 EDT
My guess is that the high CPU utilization is not leaving enough CPU time for other processes including keystone, which is causing timeouts in API requests. I think that the keystone errors are red herrings. We could verify this by running the same workloads on the same OSP version, but swapping out OVS versions. @Joe, can you try downgrading OVS to 2.6 and running the Rally scenarios again?
Comment 15 Joe Talerico 2017-09-05 14:33:37 EDT
(In reply to Assaf Muller from comment #14)
> My guess is that the high CPU utilization is not leaving enough CPU time for
> other processes including keystone, which is causing timeouts in API
> requests. I think that the keystone errors are red herrings. We could verify
> this by running the same workloads on the same OSP version, but swapping out
> OVS versions. @Joe, can you try downgrading OVS to 2.6 and running the Rally
> scenarios again?

Hey Assaf - the CPU utilization spike isn't in question when we see the authentication errors... I think the authentication errors are due to how HAProxy is configured, and the recent switch to Fernet tokens vs. UUID tokens (more CPU usage due to crypto): https://bugzilla.redhat.com/show_bug.cgi?id=1470363

Once I have another OSP12 deployment I will re-run through this.
Comment 17 Joe Talerico 2017-10-18 09:22:06 EDT
Moving this to openstack-neutron as I think it is due to neutron not cleaning up properly.
Comment 19 Joe Talerico 2017-10-27 14:38:59 EDT
I have re-run this test without OVS carrying the control plane -- previously I had TripleO use ovs_bridge for all of the control-plane networks; I have since switched to a single bond with VLANs (i.e., no bridges). I have not been able to recreate this bug.

However, I have noticed that not all resources on the controllers are being cleaned up (all leftovers are qdhcp namespaces):
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:15:39 UTC 2017
1188
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:15:44 UTC 2017
1187
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:15:51 UTC 2017
1175
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:17:33 UTC 2017
1001
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:17:45 UTC 2017
983
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:19:30 UTC 2017
761
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:20:04 UTC 2017
749
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:21:31 UTC 2017
749
[root@overcloud-controller-0 heat-admin]# date; ip netns | wc -l
Fri Oct 27 18:34:30 UTC 2017
749
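(For what it's worth, the same drain can be watched more compactly; the interval below is arbitrary:)

watch -n 30 'date; ip netns | wc -l'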

No OVS for control plane:
https://snapshot.raintank.io/dashboard/snapshot/2qRYNCdiUsPslqFJyjAV0j5TIJga8YH1?refresh=10s&panelId=72&fullscreen&orgId=2

OVS as control plane:
https://snapshot.raintank.io/dashboard/snapshot/3c2YAifqcpaa7Vfb3tjAQBcI7ZES4jCS?panelId=72&fullscreen&orgId=2
Comment 20 Joe Talerico 2017-10-30 11:03:57 EDT
I have been able to recreate this bug using the non-ovs control-plane environment. 

[root@overcloud-controller-0 heat-admin]# ps aux | grep dnsmasq | wc -l 
113
[root@overcloud-controller-0 heat-admin]# ip netns | wc -l
895
[root@overcloud-controller-0 heat-admin]# 

[root@overcloud-controller-1 heat-admin]# ps aux | grep dnsmaq | wc -l
1
[root@overcloud-controller-1 heat-admin]# ip netns | wc -l
1039

[root@overcloud-controller-2 heat-admin]# ps aux | dnsmasq | wc -l
0
[root@overcloud-controller-2 heat-admin]# ip netns | wc -l
1478
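Note that the three pipelines above are not identical (the controller-1 grep pattern is misspelled and the controller-2 pipe is missing a grep, which likely explains the counts of 1 and 0). A like-for-like check on each controller would be something such as:

pgrep -c dnsmasq
ip netns | wc -l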

Looking at the utilization:
https://snapshot.raintank.io/dashboard/snapshot/Dl9XBj6A22aElPtdsTrhJX7EPUjTpI0b?panelId=72&fullscreen&orgId=2
Comment 21 Joe Talerico 2017-10-30 12:08:19 EDT
Issuing:
[root@overcloud-controller-2 heat-admin]# for netns in `ip netns`; do ip netns delete $netns; done

I see:
https://snapshot.raintank.io/dashboard/snapshot/MaakwJJ2PhzfW8DdS7Q3krIHYN82geqF?panelId=72&fullscreen&orgId=2

Furthermore, running:

[root@overcloud-controller-[0-2] heat-admin]# for port in `ovs-vsctl show | grep "Port \"tap" | awk '{print $2}' | sed s/\"//g`; do ovs-vsctl del-port $port; done                                                     

OVS returns to normal operation:

https://snapshot.raintank.io/dashboard/snapshot/myovd1lzDJCCfLlRFR4zY0Pc8Gbc10KG?panelId=72&fullscreen&orgId=2
Comment 22 Joe Talerico 2017-11-03 15:04:29 EDT
Continuing to look into this, I ran Rally multiple times and cleaned up only controller-0 (I will get into how I cleaned it).

Reviewing the last three graphs here [1], we can see the qrouter resources growing on controller-1 and never returning (I would have expected this to drop to zero at time 16:00, like the other controllers).

The qdhcp namespaces are growing out of control... the OVS agent sets these ports to VLAN 4095, which you can see here [2].

The dhcp-agent has a cleanup function [3] which only runs when the dhcp-agent starts; after issuing a restart of the dhcp-agent, I see these namespaces get cleaned up (as you can see on the controller-0 graph at time 17:08).

Could we recommend running the dhcp-agent cleanup on an interval? (A rough sketch of such a workaround follows the references below.)

[1] https://snapshot.raintank.io/dashboard/snapshot/Uhoh0lRBHR5uRZJbnxd6XOEVns3C8S7D

[2] https://gist.github.com/jtaleric/4b3f728825cc1cc02edcbcb12f1e0089

[3] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L1384
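As an interim workaround (not a proper fix), something along these lines could be run on an interval; this is only a sketch and assumes that a qdhcp namespace with no tap interface left inside it is safe to delete:

for ns in $(ip netns | awk '/^qdhcp-/ {print $1}'); do
    # only delete namespaces that no longer hold a tap device
    if ! ip netns exec "$ns" ip -o link show | grep -q ': tap'; then
        ip netns delete "$ns"
    fi
done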
Comment 23 Joe Talerico 2017-11-06 08:24:51 EST
On a hunch, I suspected the qdhcp resources were not being cleaned up because they were stuck in some odd state caused by [1]. That does seem to be the case: once I bumped inotify.max_user_instances and re-ran the exact same test, the qdhcp resources returned to 0 [2].

This doesn't explain why restarting neutron-dhcp-agent cleans up the defunct namespaces while the normal cleanup function does not.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1474515

[2] https://snapshot.raintank.io/dashboard/snapshot/W2WPViO3DQ5xc592LPfdRfzk7noSdqdi?panelId=13&fullscreen&orgId=2
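For reference, bumping the limit is a sysctl change; the value below is illustrative rather than the exact one used here:

sysctl -w fs.inotify.max_user_instances=1024
# persist across reboots
echo 'fs.inotify.max_user_instances = 1024' > /etc/sysctl.d/99-inotify.conf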
Comment 24 Daniel Alvarez Sanchez 2017-11-08 09:34:10 EST
I'd like to take a look; both DHCP and router namespaces should have been deleted gracefully. For example, when a router is removed, this code gets executed [0], and if there are stale namespaces they would eventually be cleaned up when a full sync is performed [1], which only happens on startup or after an exception.

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/router_info.py#L420
[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L574
Comment 25 Joe Talerico 2017-11-09 18:09:04 EST
Reviewing the l3-agent logs, we do see:

2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DelPortCommand object at 0x184d6290>] exceeded timeout 10 seconds
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 30, in execute
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     txn.add(self)
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     self.gen.next()
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     self._nested_txn = None
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command     timeout=self.timeout)
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DelPortCommand object at 0x184d6290>] exceeded timeout 10 seconds
2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command 

[root@overcloud-controller-2 neutron]# zcat l3-agent.log-20171108.gz | grep "d TimeoutException" | wc -l
1791
Comment 26 Daniel Alvarez Sanchez 2017-11-10 05:59:12 EST
Also, due to that, the agent did 149 fullsyncs:

# zgrep "Starting fullsync periodic_sync_routers_task" l3-agent*gz | wc -l                                                                               
149


Example: Attempt to delete port qr-c5d287ff-30 from OVS and the associated device:

l3-agent.log-20171108.gz:2017-11-07 13:21:09.587 124640 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelPortCommand(if_exists=True, bridge=br-int, port=qr-c5d287ff-30) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
l3-agent.log-20171108.gz:2017-11-07 13:21:09.656 124640 DEBUG neutron.agent.l3.router_info [-] Deleting stale internal router device: qr-c5d287ff-30 _process_internal_ports /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:603
l3-agent.log-20171108.gz:2017-11-07 13:22:37.646 124640 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelPortCommand(if_exists=True, bridge=br-int, port=qr-c5d287ff-30) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
l3-agent.log-20171108.gz:2017-11-07 13:26:49.033 124640 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelPortCommand(if_exists=True, bridge=br-int, port=qr-c5d287ff-30) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84


This port was added and removed by OVS tons of times:

[root@overcloud-controller-2 neutron]# !43
zgrep c5d287ff  ../openvswitch/ovs-vswitchd.log-20171108.gz | grep added |  wc -l
492
[root@overcloud-controller-2 neutron]# !44
zgrep c5d287ff  ../openvswitch/ovs-vswitchd.log-20171108.gz | grep deleted |  wc -l
491

And finally, the port remained there:

[root@overcloud-controller-2 neutron]# zgrep c5d287ff  ../openvswitch/ovs-vswitchd.log-20171108.gz
[...]
2017-11-07T13:38:56.166Z|57485|bridge|INFO|bridge br-int: deleted interface qr-c5d287ff-30 on port 11370
2017-11-07T13:38:56.245Z|57489|bridge|INFO|bridge br-int: added interface qr-c5d287ff-30 on port 11370


A lot of stale ports remain after the test run:
[root@overcloud-controller-2 neutron]# ovs-vsctl show | grep Port | wc -l
1340


Also, ovsdb-server logs are full of connection errors:
2017-11-07T13:46:10.509Z|01018|reconnect|ERR|tcp:127.0.0.1:41730: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:18.193Z|01019|reconnect|ERR|tcp:127.0.0.1:42654: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:21.124Z|01020|reconnect|ERR|tcp:127.0.0.1:43234: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:29.333Z|01021|reconnect|ERR|tcp:127.0.0.1:43796: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:34.055Z|01022|reconnect|ERR|tcp:127.0.0.1:44102: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:40.800Z|01023|reconnect|ERR|tcp:127.0.0.1:46852: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:48.279Z|01024|reconnect|ERR|tcp:127.0.0.1:48370: no response to inactivity probe after 5 seconds, disconnecting
2017-11-07T13:46:53.494Z|01025|reconnect|ERR|tcp:127.0.0.1:49050: no response to inactivity probe after 5 seconds, disconnecting


This suggests that the neutron-openvswitch-agent is busy and fails to respond to the inactivity probes from the server, so it gets disconnected. We could increase the inactivity probe to a higher value (15s?):

ovs-vsctl -- set controller <_uuid> inactivity_probe=15000
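For example, a sketch that applies the same value to every Controller record on the node (15000 ms is just the example value from above):

ovs-vsctl list Controller | awk '/^_uuid/ {print $3}' | while read -r uuid; do
    ovs-vsctl set Controller "$uuid" inactivity_probe=15000
done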

But this would only alleviate the problem, since the root cause is the ovs-agent being "stuck". The rpc_loop iteration below in the OVS agent (where most of the test happens) took 1175 seconds (~20 minutes) to complete:

2017-11-07 13:36:47.955 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:0 - port information retrieved. Elapsed:17.885 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2062

2017-11-07 13:56:05.623 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:0 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 1321, 'removed': 0}}. Elapsed:1175.553 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1804


The iteration above took 20 minutes, but right before it the agent had crashed and had to perform a full sync, which led to all those errors. The reason it crashed is:

2017-11-07 13:35:54.877 597682 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection reset by peer): RuntimeError: OVS transaction timed out
2017-11-07 13:35:54.895 597682 CRITICAL neutron [-] Unhandled error: TypeError: int() can't convert non-string with explicit base

2017-11-07 13:35:54.895 597682 CRITICAL neutron [-] Unhandled error: TypeError: int() can't convert non-string with explicit base
2017-11-07 13:35:54.895 597682 ERROR neutron Traceback (most recent call last):
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/bin/neutron-openvswitch-agent", line 10, in <module>
2017-11-07 13:35:54.895 597682 ERROR neutron     sys.exit(main())
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main
2017-11-07 13:35:54.895 597682 ERROR neutron     agent_main.main()
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 51, in main
2017-11-07 13:35:54.895 597682 ERROR neutron     mod.main()
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/main.py", line 35, in main
2017-11-07 13:35:54.895 597682 ERROR neutron     'neutron.plugins.ml2.drivers.openvswitch.agent.'
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 375, in run_apps
2017-11-07 13:35:54.895 597682 ERROR neutron     hub.joinall(services)
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 103, in joinall
2017-11-07 13:35:54.895 597682 ERROR neutron     t.wait()
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2017-11-07 13:35:54.895 597682 ERROR neutron     return self._exit_event.wait()
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait
2017-11-07 13:35:54.895 597682 ERROR neutron     current.throw(*self._exc)
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
2017-11-07 13:35:54.895 597682 ERROR neutron     result = function(*args, **kwargs)
2017-11-07 13:35:54.895 597682 ERROR neutron   File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 65, in _launch
2017-11-07 13:35:54.895 597682 ERROR neutron     raise e
2017-11-07 13:35:54.895 597682 ERROR neutron TypeError: int() can't convert non-string with explicit base
2017-11-07 13:35:54.895 597682 ERROR neutron 

And then, systemd starts it again:

2017-11-07 13:35:58.155 17401 INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 11.0.2.dev2


During that period we can see timeout messages and connection drops to ovsdb:

[root@overcloud-controller-2 neutron]# grep "connection dropped" bad_iteration.log  | wc -l
56
[root@overcloud-controller-2 neutron]# grep "exceeded timeout" bad_iteration.log  | wc -l                                                                                                     
10


When the agent restarts it has to process 2642 devices at that moment:

17-11-07 13:36:47.956 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Starting to process devices in:{'current': set([u'77072f78-e354-4598-90bc-2a01607f9809', u'34a7559b-aeef-4980-892d-5450cf4fdc2b', u'0271e328-0727-4f83-800f-e67bc161b776', u'52625ba1-6fc2-49f9-83df-fe18d0c136cc', u'7291bb17-575b-4700-92d2-cd85ad08da6a', u'f82af294-68b6-4963-90f2-b18d80f90be4', u'10896f5e-b470-47b5-81fe-fdb98caa93bf', u'00339cff-369b-4460-80bb-44d43c3fb8da', u'488e4157-f1e9-4f02-9cd0-dd418ced9564', u'd1e81972-de50-4c8c-96b5-8d ....

[root@overcloud-controller-2 neutron]# grep "Starting to proc" bad_iteration.log | grep -o "u'" | wc -l
2642


Processing all of them takes around 20 minutes (basically this is a full sync) and also leaves resources (OVS ports mainly) behind.


I'll try to dig further into the exception that caused the OVS agent to restart so that we can avoid a full sync while the test is running. However, the fact that the agent takes so long to do a full sync with so many ports is down to its current design. I'm pretty sure we can optimize it, but that will require a lot of work.
Comment 27 Daniel Alvarez Sanchez 2017-11-10 10:11:08 EST
Further update after a closer look at the logs:

During the OVS agent's rpc loop, 2642 ports are to be processed and process_network_ports() is called at [0]; it first gets called at around 2017-11-07 13:36:47.

As the function is processing the devices, a timeout exception occurs and it doesn't complete:

2017-11-07 13:56:05.619 17401 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent RuntimeError: ofctl request version=0x4,msg_type=0xe,msg_len=0x40,xid=0x12f844db,OFPFlowMod(buffer_id=4294967295,command=0,cookie=7394935990473861018L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],match=OFPMatch(oxm_fields={'in_port': 11484}),out_group=0,out_port=0,priority=2,table_id=0) timed out


This exception is raised and the rpc loop iteration finishes:

2017-11-07 13:56:05.619 17401 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Error while processing VIF ports: RuntimeError: ofctl request version=0x4,msg_type=0xe,msg_len=0x40,xid=0x12f844db,OFPFlowMod(buffer_id=4294967295,command=0,cookie=7394935990473861018L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],match=OFPMatch(oxm_fields={'in_port': 11484}),out_group=0,out_port=0,priority=2,table_id=0) timed out


2017-11-07 13:56:05.623 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:0 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 1321, 'removed': 0}}. Elapsed:1175.553 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1804

After this, subsequent iterations won't succeed because the agent is unable to communicate with OVS:

2017-11-07 13:59:03.665 17401 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Failed to communicate with the switch: RuntimeError: ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0xae6014dc,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) timed out

2017-11-07 13:59:03.666 17401 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.

And it stays like this until more than 8 hours later:

2017-11-07 22:25:54.633 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:839 - starting polling. Elapsed:36.183 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2039
2017-11-07 22:25:54.633 17401 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent out of sync with plugin!
2017-11-07 22:25:54.634 17401 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): ListPortsCommand(bridge=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/b
ackend/ovs_idl/transaction.py:84

2017-11-07 22:29:38.318 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:839 - port information retrieved. Elapsed:259.868 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2062
2017-11-07 22:41:03.055 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] process_network_ports - iteration:839 - treat_devices_added_or_updated completed. Skipped 0 devices of 1321 devices currently available. Time elapsed: 684.726 process_network_ports /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1663
2017-11-07 22:41:20.534 17401 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-77a402ac-4766-4652-8d95-9b2e8e884512 - - - - -] Agent rpc_loop - iteration:839 completed. Processed ports statistics: {'regular': {'updated': 637, 'added': 1321, 'removed': 0}}. Elapsed:962.084 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1804

Iteration 839 took 962 seconds (16 minutes) and looks like it completed successfully.


I can't really tell why it stayed like that until 22:41, but in the ovs-vswitchd logs we can see that it's adding and deleting devices all the time (DEBUG level would help here):


2017-11-07T20:56:00.688Z|986468|bridge|INFO|bridge br-int: added interface ha-a1a195a9-c9 on port 12043
2017-11-07T20:56:00.735Z|986469|bridge|INFO|bridge br-int: added interface ha-8139fe8c-ed on port 12048
2017-11-07T20:56:00.759Z|986470|bridge|INFO|bridge br-int: added interface qr-63c9d5a3-1f on port 11368
2017-11-07T20:56:00.844Z|986471|bridge|INFO|bridge br-int: added interface ha-a7d06805-bf on port 11098
2017-11-07T20:56:00.984Z|986472|bridge|INFO|bridge br-int: deleted interface ha-87309246-d8 on port 12043
2017-11-07T20:56:00.985Z|986473|bridge|INFO|bridge br-int: deleted interface ha-b5a217b0-48 on port 11098
2017-11-07T20:56:00.986Z|986474|bridge|INFO|bridge br-int: deleted interface ha-0d126e69-92 on port 12048
2017-11-07T20:56:00.987Z|986475|bridge|INFO|bridge br-int: deleted interface qr-4b19305a-77 on port 11368
2017-11-07T20:56:01.025Z|986476|bridge|INFO|bridge br-int: added interface ha-87309246-d8 on port 12043
2017-11-07T20:56:01.111Z|986477|bridge|INFO|bridge br-int: added interface ha-b5a217b0-48 on port 11098
2017-11-07T20:56:01.169Z|986478|bridge|INFO|bridge br-int: added interface ha-0d126e69-92 on port 12048
2017-11-07T20:56:01.207Z|986479|bridge|INFO|bridge br-int: added interface qr-4b19305a-77 on port 11368
2017-11-07T20:56:01.348Z|986480|bridge|INFO|bridge br-int: deleted interface ha-a1a195a9-c9 on port 12043
2017-11-07T20:56:01.351Z|986481|bridge|INFO|bridge br-int: deleted interface ha-8139fe8c-ed on port 12048
2017-11-07T20:56:01.352Z|986482|bridge|INFO|bridge br-int: deleted interface qr-63c9d5a3-1f on port 11368
2017-11-07T20:56:01.353Z|986483|bridge|INFO|bridge br-int: deleted interface ha-a7d06805-bf on port 11098

We can still see those traces now... tcpdump on port 6633 reveals that it is the neutron-openvswitch-agent adding and deleting interfaces all the time (but see the correction in comment 28).
I'm not sure why it was able to complete the sync 8 hours later, but it is indeed related to the l3-agent logs: these errors can be seen until 22:21, the last one being the following:

2017-11-07 22:12:13.219 124640 ERROR ovsdbapp.backend.ovs_idl.command TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DelPortCommand object at 0x184d6290>] exceeded timeout 10 seconds
2017-11-07 22:12:13.221 124640 DEBUG neutron.agent.l3.router_info [-] process router delete process_delete /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:1099

After this ^, the l3 agent idles and everything seems fine (except for the add/delete loop).

This add/delete loop persists even now, although I have stopped all neutron agents:

2017-11-10T13:40:09.568Z|6103901|jsonrpc|DBG|unix:/var/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"lock":"ovs_vswitchd","op":"assert"},{"where":[["_uuid","==",["uuid","54208b80-2af8-4393-8a44-d6bd51c4606b"]]],"row":{"mac_in_use":"9a:4f:fe:ad:f5:ea","ifindex":2960482},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","70deb264-70cc-46bd-b9c7-ea8c451d03cf"]]],"row":{"mac_in_use":"ba:65:56:71:58:64","ifindex":2960481},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","d6007dca-f989-4c94-b1b5-9cd9b4ff44de"]]],"row":{"mac_in_use":"ce:d1:92:7b:e7:8d","ifindex":2960479},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","2e60fcbe-6b75-406f-be07-787b345aeba0"]]],"row":{"mac_in_use":"6a:2c:ec:23:5c:42","ifindex":2960480},"op":"update","table":"Interface"}], id=1317397


2017-11-10T13:40:09.916Z|6103933|jsonrpc|DBG|unix:/var/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"lock":"ovs_vswitchd","op":"assert"},{"where":[["_uuid","==",["uuid","3a680741-02b8-4e63-a018-827c60071d11"]]],"row":{"mac_in_use":"3a:9c:05:55:c2:ec","ifindex":2960486},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","15a84122-a3fb-4fc1-b5bc-cd918f50880d"]]],"row":{"mac_in_use":"f2:7a:37:5d:52:33","ifindex":2960483},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","7dbf8616-cddd-4d09-9363-4d1d68823a52"]]],"row":{"mac_in_use":"62:39:a7:d6:5a:5f","ifindex":2960484},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","20a86e8f-090f-49c5-9bf0-ddcc1032bd3d"]]],"row":{"mac_in_use":"fe:f2:c5:bc:40:2e","ifindex":2960485},"op":"update","table":"Interface"}], id=1317398


2017-11-10T13:40:09.923Z|6103934|jsonrpc|DBG|unix:/var/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"lock":"ovs_vswitchd","op":"assert"},{"where":[["_uuid","==",["uuid","3a680741-02b8-4e63-a018-827c60071d11"]]],"row":{"mac_in_use":"3a:9c:05:55:c2:ec","ifindex":2960486},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","15a84122-a3fb-4fc1-b5bc-cd918f50880d"]]],"row":{"mac_in_use":"f2:7a:37:5d:52:33","ifindex":2960483},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","7dbf8616-cddd-4d09-9363-4d1d68823a52"]]],"row":{"mac_in_use":"62:39:a7:d6:5a:5f","ifindex":2960484},"op":"update","table":"Interface"},{"where":[["_uuid","==",["uuid","20a86e8f-090f-49c5-9bf0-ddcc1032bd3d"]]],"row":{"mac_in_use":"fe:f2:c5:bc:40:2e","ifindex":2960485},"op":"update","table":"Interface"}], id=1317399

A watch on the added/deleted interfaces reveals that their MAC addresses keep changing...

It is ovs-vswitchd that is responsible for this, since no one else is connected to ovsdb-server. Restarting openvswitch fixed it and those interfaces are no longer added/deleted. Maybe the OVS folks can take a look at this weird behavior.


[root@overcloud-controller-2 neutron]# rpm -qa | grep openvswitch
openvswitch-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
python-openvswitch-2.7.2-4.git20170719.el7fdp.noarch
openstack-neutron-openvswitch-11.0.2-0.20170926082939.5b0191f.el7ost.noarch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64




[0] https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2082
Comment 28 Daniel Alvarez Sanchez 2017-11-10 10:18:05 EST
(In reply to Daniel Alvarez Sanchez from comment #27)
> Still now we can see those traces... tcpdump on port 6633 reveals that it's
> neutron-openvswitch-agent the one adding and deleting interfaces all the
> time.

Edit: it was ovs-vswitchd and not the OVS agent. Tcpdump showed the updates going to the agent's monitors and I got confused. After that I stopped all neutron agents and devices were still being added/deleted in OVS by ovs-vswitchd (the debug traces revealed that).
Comment 29 Ihar Hrachyshka 2017-11-15 12:47:05 EST
This upstream bug may be related to ovsdb timeouts.
Comment 30 Ihar Hrachyshka 2017-11-15 12:47:27 EST
Eh, forgot the link to the bug: https://bugs.launchpad.net/neutron/+bug/1627106
Comment 31 Terry Wilson 2017-11-15 13:09:49 EST
One way to test whether this has something to do with the ovsdb native interface (for instance, deleting interfaces can be slower with it) would be to set ovsdb_interface=vsctl and see if the problem still occurs.
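For example, something along these lines on each controller; the config path, the [ovs] section, and the use of crudini are assumptions about this deployment rather than verified specifics:

crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs ovsdb_interface vsctl
systemctl restart neutron-openvswitch-agent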
