Bug 1890974 - [OSP13][OVN] High delay (20-30min) to reach outside network
Summary: [OSP13][OVN] High delay (20-30min) to reach outside network
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.11
Version: FDP 20.G
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Dumitru Ceara
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-23 12:48 UTC by ggrimaux
Modified: 2023-12-15 19:53 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-01 15:07:25 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2020:5309 (Last Updated: 2020-12-01 15:07:35 UTC)

Description ggrimaux 2020-10-23 12:48:13 UTC
Description of problem:
The client recently updated their OSP13 deployment from z10 to z12. They are using OVN in their environment.

They are spawning instances through Heat templates with Tower. Sometimes (currently frequently) the instance is spawned and gets an IP through DHCP, but the next step in the deployment, which curls an internal URL, fails. The curl is retried for roughly 20-30 minutes and keeps failing. After 30 minutes the curl is no longer retried, but by that point the outside network appears to be reachable.
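(Roughly, the failing deployment step boils down to a retry loop like the one below; the URL, retry budget and interval are illustrative placeholders, not taken from the customer's templates:)

# Illustrative sketch only -- hypothetical internal URL and retry budget.
for i in $(seq 1 180); do
    if curl -sf --max-time 5 http://internal.example.local/health; then
        echo "reachable after $i attempts"
        break
    fi
    sleep 10
done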

ovn-controller.log from the compute node around the time the port was created:
2020-10-23T06:07:09.381Z|18828|timeval|WARN|Unreasonably long 40122ms poll interval (40016ms user, 10ms system)
2020-10-23T06:07:09.381Z|18829|timeval|WARN|faults: 10388 minor, 0 major
2020-10-23T06:07:09.381Z|18830|timeval|WARN|context switches: 0 voluntary, 24 involuntary

From one of the controllers:
2020-10-23T06:06:29.937Z|25829|timeval|WARN|Unreasonably long 1140ms poll interval (1136ms user, 0ms system)
2020-10-23T06:06:29.937Z|25830|timeval|WARN|faults: 8 minor, 0 major
2020-10-23T06:06:29.937Z|25831|timeval|WARN|context switches: 0 voluntary, 3 involuntary
2020-10-23T06:07:09.154Z|25832|timeval|WARN|Unreasonably long 37364ms poll interval (37263ms user, 11ms system)
2020-10-23T06:07:09.154Z|25833|timeval|WARN|faults: 2337 minor, 0 major
2020-10-23T06:07:09.154Z|25834|timeval|WARN|context switches: 0 voluntary, 69 involuntary
2020-10-23T06:07:09.154Z|25835|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (97% CPU usage)
2020-10-23T06:07:09.154Z|25836|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.183.12:59198<->10.10.183.11:6642) at lib/stream-fd.c:157 (97% CPU usage)
2020-10-23T06:07:09.171Z|25837|poll_loop|INFO|wakeup due to 0-ms timeout at tcp:10.10.183.11:6642 (97% CPU usage)
2020-10-23T06:07:46.291Z|25838|timeval|WARN|Unreasonably long 37119ms poll interval (37028ms user, 2ms system)
2020-10-23T06:07:46.291Z|25839|timeval|WARN|faults: 1814 minor, 0 major
2020-10-23T06:07:46.291Z|25840|timeval|WARN|context switches: 0 voluntary, 80 involuntary
2020-10-23T06:07:46.291Z|25841|coverage|INFO|Dropped 2 log messages in last 77 seconds (most recently, 37 seconds ago) due to excessive rate
2020-10-23T06:07:46.291Z|25842|coverage|INFO|Skipping details of duplicate event coverage for hash=4bbd7269
2020-10-23T06:07:46.291Z|25843|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-10-23T06:07:46.291Z|25844|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.183.12:59198<->10.10.183.11:6642) at lib/stream-fd.c:157 (99% CPU usage)
2020-10-23T06:07:47.178Z|25845|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-10-23T06:07:53.810Z|25846|timeval|WARN|Unreasonably long 1056ms poll interval (1052ms user, 0ms system)
2020-10-23T06:07:53.810Z|25847|timeval|WARN|context switches: 0 voluntary, 13 involuntary

From nb.log we see these every minute (before and after the instance creation):
2020-10-23T06:06:52.280Z|45205|socket_util|ERR|Dropped 39 log messages in last 58 seconds (most recently, 3 seconds ago) due to excessive rate
2020-10-23T06:06:52.280Z|45206|socket_util|ERR|6641:10.10.183.11: bind: Cannot assign requested address
2020-10-23T06:06:52.280Z|45207|ovsdb_jsonrpc_server|ERR|Dropped 39 log messages in last 58 seconds (most recently, 3 seconds ago) due to excessive rate
2020-10-23T06:06:52.280Z|45208|ovsdb_jsonrpc_server|ERR|ptcp:6641:10.10.183.11: listen failed: Cannot assign requested address

And in sb.log, also every minute:
2020-10-23T06:07:01.932Z|45543|socket_util|ERR|Dropped 34 log messages in last 59 seconds (most recently, 2 seconds ago) due to excessive rate
2020-10-23T06:07:01.932Z|45544|socket_util|ERR|6642:10.10.183.11: bind: Cannot assign requested address
2020-10-23T06:07:01.932Z|45545|ovsdb_jsonrpc_server|ERR|Dropped 34 log messages in last 59 seconds (most recently, 2 seconds ago) due to excessive rate
2020-10-23T06:07:01.932Z|45546|ovsdb_jsonrpc_server|ERR|ptcp:6642:10.10.183.11: listen failed: Cannot assign requested address
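(The bind errors above mean ovsdb-server is trying to listen on 10.10.183.11 while that address is not assigned on this node; in OSP13 the OVN DB VIP is typically hosted on only one controller at a time, so this is often just noise on the other controllers. A quick, generic check on each controller, not taken from the sosreports, would be:)

# Is the VIP assigned on this node, and is anything listening on the OVN DB ports?
ip -4 addr show | grep 10.10.183.11
ss -ltnp | grep -E ':(6641|6642)'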

Interestingly, the IP they are trying to curl is in 10.10.x.x, so maybe it is related to this? Maybe not, since those errors are there all the time.

sosreports from all 3 controllers and 1 compute node, plus the OVN databases, are uploaded.
I will put details about that one instance we are currently investigating in the next private comment.

I need your help to understand why the network is sometimes very slow.

Thank you.


Version-Release number of selected component (if applicable):
OSP13 z12 
Latest containers

How reproducible:
Frequently right now. Some days there are no issues. It looks random.

Steps to Reproduce:
1. Spawn instances

Actual results:
There is a huge delay before the instance is able to reach the outside network.

Expected results:
The instance can reach the outside network "as soon" as the IP is allocated (what you would expect).

Additional info:
sosreports and ovn databases are available.

Comment 21 Jianlin Shi 2020-11-09 06:44:14 UTC
Tested with the script from https://bugzilla.redhat.com/show_bug.cgi?id=1818128#c10:
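(The linked script isn't copied here; the following is a minimal sketch of its timing loop, reconstructed from the set -x traces below. The port-group/ACL and logical-port setup that the real script performs first is omitted.)

#!/bin/bash
# Sketch: wait until the OpenFlow flow count on br-int stops changing,
# and report how long that takes after the test ports/ACLs are created.
date                                   # <=== start time
set -x
flow_num_before=0
sleep 1
while :; do
    flow_num=$(ovs-ofctl dump-flows br-int | wc -l)
    # Stop once two consecutive samples report the same number of flows.
    if [ "$flow_num" -eq "$flow_num_before" ]; then
        break
    fi
    flow_num_before=$flow_num
    sleep 1
done
date                                   # <=== end time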

result in ovn2.11.1-54:

[root@dell-per740-12 bz1890974]# bash rep.sh
from-lport  1001 (inport == @pg && ip) drop
  to-lport  1001 (outport == @pg && ip) drop
Mon Nov  9 01:34:17 EST 2020

<=== start time

+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov  9 01:34:45 EST 2020

<=== end time
it takes 28 seconds

result on ovn2.11.1-55:

[root@dell-per740-12 bz1890974]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.11-2.11.3-76.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.11-2.11.1-55.el7fdp.x86_64
ovn2.11-host-2.11.1-55.el7fdp.x86_64
ovn2.11-central-2.11.1-55.el7fdp.x86_64

[root@dell-per740-12 bz1890974]# bash rep.sh 
from-lport  1001 (inport == @pg && ip) drop
  to-lport  1001 (outport == @pg && ip) drop
Mon Nov  9 01:40:03 EST 2020

<=== start time

+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov  9 01:40:14 EST 2020

<=== end time

it takes 11 seconds.

performance improved.

Comment 22 Jianlin Shi 2020-11-09 07:43:45 UTC
Verified on the RHEL 8 version:

[root@dell-per740-42 bz1890974]# bash rep.sh
from-lport  1001 (inport == @pg && ip) drop                        
  to-lport  1001 (outport == @pg && ip) drop
Mon Nov  9 01:46:57 EST 2020
+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov  9 01:47:19 EST 2020

<=== it takes 22 seconds

[root@dell-per740-42 bz1890974]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.11-2.11.3-73.el8fdp.x86_64
ovn2.11-2.11.1-54.el8fdp.x86_64
ovn2.11-host-2.11.1-54.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.11-central-2.11.1-54.el8fdp.x86_64

on ovn2.11.1-55:

[root@dell-per740-42 bz1890974]# bash rep.sh
from-lport  1001 (inport == @pg && ip) drop
  to-lport  1001 (outport == @pg && ip) drop
Mon Nov  9 01:51:32 EST 2020
+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov  9 01:51:41 EST 2020

<=== it takes 9 seconds

[root@dell-per740-42 bz1890974]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.11-2.11.3-73.el8fdp.x86_64
ovn2.11-2.11.1-55.el8fdp.x86_64
ovn2.11-host-2.11.1-55.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.11-central-2.11.1-55.el8fdp.x86_64

Comment 24 errata-xmlrpc 2020-12-01 15:07:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn2.11 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5309

