Description of problem:

The customer recently updated their OSP13 deployment from z10 to z12. They are using OVN in their environment. Instances are spawned through Heat templates with Tower, and sometimes (currently frequently) the instance is spawned and gets an IP through DHCP, but the next deployment step, which curls an internal URL, fails. The curl is retried for roughly 20-30 minutes and keeps failing. After about 30 minutes the curl is no longer attempted, but by then the instance appears to be able to reach the outside network.

ovn-controller.log from the compute node around the time the port was created:

2020-10-23T06:07:09.381Z|18828|timeval|WARN|Unreasonably long 40122ms poll interval (40016ms user, 10ms system)
2020-10-23T06:07:09.381Z|18829|timeval|WARN|faults: 10388 minor, 0 major
2020-10-23T06:07:09.381Z|18830|timeval|WARN|context switches: 0 voluntary, 24 involuntary

From one of the controllers:

2020-10-23T06:06:29.937Z|25829|timeval|WARN|Unreasonably long 1140ms poll interval (1136ms user, 0ms system)
2020-10-23T06:06:29.937Z|25830|timeval|WARN|faults: 8 minor, 0 major
2020-10-23T06:06:29.937Z|25831|timeval|WARN|context switches: 0 voluntary, 3 involuntary
2020-10-23T06:07:09.154Z|25832|timeval|WARN|Unreasonably long 37364ms poll interval (37263ms user, 11ms system)
2020-10-23T06:07:09.154Z|25833|timeval|WARN|faults: 2337 minor, 0 major
2020-10-23T06:07:09.154Z|25834|timeval|WARN|context switches: 0 voluntary, 69 involuntary
2020-10-23T06:07:09.154Z|25835|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (97% CPU usage)
2020-10-23T06:07:09.154Z|25836|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.183.12:59198<->10.10.183.11:6642) at lib/stream-fd.c:157 (97% CPU usage)
2020-10-23T06:07:09.171Z|25837|poll_loop|INFO|wakeup due to 0-ms timeout at tcp:10.10.183.11:6642 (97% CPU usage)
2020-10-23T06:07:46.291Z|25838|timeval|WARN|Unreasonably long 37119ms poll interval (37028ms user, 2ms system)
2020-10-23T06:07:46.291Z|25839|timeval|WARN|faults: 1814 minor, 0 major
2020-10-23T06:07:46.291Z|25840|timeval|WARN|context switches: 0 voluntary, 80 involuntary
2020-10-23T06:07:46.291Z|25841|coverage|INFO|Dropped 2 log messages in last 77 seconds (most recently, 37 seconds ago) due to excessive rate
2020-10-23T06:07:46.291Z|25842|coverage|INFO|Skipping details of duplicate event coverage for hash=4bbd7269
2020-10-23T06:07:46.291Z|25843|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-10-23T06:07:46.291Z|25844|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.183.12:59198<->10.10.183.11:6642) at lib/stream-fd.c:157 (99% CPU usage)
2020-10-23T06:07:47.178Z|25845|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-10-23T06:07:53.810Z|25846|timeval|WARN|Unreasonably long 1056ms poll interval (1052ms user, 0ms system)
2020-10-23T06:07:53.810Z|25847|timeval|WARN|context switches: 0 voluntary, 13 involuntary

From nb.log we see these every minute (before and after instance creation):

2020-10-23T06:06:52.280Z|45205|socket_util|ERR|Dropped 39 log messages in last 58 seconds (most recently, 3 seconds ago) due to excessive rate
2020-10-23T06:06:52.280Z|45206|socket_util|ERR|6641:10.10.183.11: bind: Cannot assign requested address
2020-10-23T06:06:52.280Z|45207|ovsdb_jsonrpc_server|ERR|Dropped 39 log messages in last 58 seconds (most recently, 3 seconds ago) due to excessive rate
2020-10-23T06:06:52.280Z|45208|ovsdb_jsonrpc_server|ERR|ptcp:6641:10.10.183.11: listen failed: Cannot assign requested address

And in sb.log, also every minute:

2020-10-23T06:07:01.932Z|45543|socket_util|ERR|Dropped 34 log messages in last 59 seconds (most recently, 2 seconds ago) due to excessive rate
2020-10-23T06:07:01.932Z|45544|socket_util|ERR|6642:10.10.183.11: bind: Cannot assign requested address
2020-10-23T06:07:01.932Z|45545|ovsdb_jsonrpc_server|ERR|Dropped 34 log messages in last 59 seconds (most recently, 2 seconds ago) due to excessive rate
2020-10-23T06:07:01.932Z|45546|ovsdb_jsonrpc_server|ERR|ptcp:6642:10.10.183.11: listen failed: Cannot assign requested address

Interestingly, the IP they are trying to curl is also a 10.10.x.x address, so maybe it is related to this? Maybe not, since those errors are present all the time.

sosreports from all 3 controllers, 1 compute node, and the OVN databases have been uploaded. I will put details about the one instance we are currently investigating in the next private comment.

I need your help to understand why the network is sometimes very slow. Thank you.

Version-Release number of selected component (if applicable):
OSP13 z12
Latest containers

How reproducible:
Frequently right now. Some days there are no issues. Looks random.

Steps to Reproduce:
1. Spawn instances
2.
3.

Actual results:
There is a huge delay before the instance is able to reach the outside network.

Expected results:
The instance can reach the outside network "as soon" as the IP is allocated (what you would expect).

Additional info:
sosreports and OVN databases are available.
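The "Cannot assign requested address" bind failures in nb.log and sb.log mean the database server is trying to listen on an address that is not currently configured on that controller (for example, a virtual IP that is held by another node at that moment). A quick way to check this on the controller that logs the errors is sketched below; this is only a suggested diagnostic, with the address 10.10.183.11 and ports 6641/6642 taken from the log excerpts above, and depending on the deployment the ovn-nbctl/ovn-sbctl commands may need to be run inside the OVN database container:

# Is the address the NB/SB databases try to bind to configured on this node?
ip -4 addr show | grep 10.10.183.11

# Is anything listening on the NB (6641) and SB (6642) database ports?
ss -ltnp | grep -E ':(6641|6642)'

# Which listener addresses are the OVN databases configured with?
ovn-nbctl get-connection
ovn-sbctl get-connection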
Tested with the script in https://bugzilla.redhat.com/show_bug.cgi?id=1818128#c10:

Result on ovn2.11.1-54:

[root@dell-per740-12 bz1890974]# bash rep.sh
from-lport 1001 (inport == @pg && ip) drop
to-lport 1001 (outport == @pg && ip) drop
Mon Nov 9 01:34:17 EST 2020   <=== start time
+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov 9 01:34:45 EST 2020   <=== end time

It takes 28 seconds.

Result on ovn2.11.1-55:

[root@dell-per740-12 bz1890974]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.11-2.11.3-76.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.11-2.11.1-55.el7fdp.x86_64
ovn2.11-host-2.11.1-55.el7fdp.x86_64
ovn2.11-central-2.11.1-55.el7fdp.x86_64

[root@dell-per740-12 bz1890974]# bash rep.sh
from-lport 1001 (inport == @pg && ip) drop
to-lport 1001 (outport == @pg && ip) drop
Mon Nov 9 01:40:03 EST 2020   <=== start time
+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov 9 01:40:14 EST 2020   <=== end time

It takes 13 seconds. Performance improved.
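For reference, the core timing loop can be reconstructed from the set -x trace above. This is only a rough sketch inferred from that trace; the actual rep.sh from bug 1818128 comment 10 may differ, for example in how the @pg port group and its two drop ACLs are created before the loop starts:

#!/bin/bash
# Measure how long it takes for the OpenFlow flows on br-int to stabilize
# after the port-group ACLs have been added.
set -x
date                                   # <=== start time
flow_num_before=0
while :; do
    sleep 1
    flow_num=$(ovs-ofctl dump-flows br-int | wc -l)
    # Stop once the flow count is non-zero and unchanged between two polls.
    [ "$flow_num" -ne 0 ] && [ "$flow_num" -eq "$flow_num_before" ] && break
    flow_num_before=$flow_num
done
date                                   # <=== end time

The elapsed time between the two date calls is what the results above report as "it takes N seconds".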
Verified on the rhel8 version.

On ovn2.11.1-54:

[root@dell-per740-42 bz1890974]# bash rep.sh
from-lport 1001 (inport == @pg && ip) drop
to-lport 1001 (outport == @pg && ip) drop
Mon Nov 9 01:46:57 EST 2020
+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov 9 01:47:19 EST 2020   <=== it takes 22 seconds

[root@dell-per740-42 bz1890974]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.11-2.11.3-73.el8fdp.x86_64
ovn2.11-2.11.1-54.el8fdp.x86_64
ovn2.11-host-2.11.1-54.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.11-central-2.11.1-54.el8fdp.x86_64

On ovn2.11.1-55:

[root@dell-per740-42 bz1890974]# bash rep.sh
from-lport 1001 (inport == @pg && ip) drop
to-lport 1001 (outport == @pg && ip) drop
Mon Nov 9 01:51:32 EST 2020
+ flow_num_before=0
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 0 ']'
+ flow_num_before=11223
+ sleep 1
+ :
++ ovs-ofctl dump-flows br-int
++ wc -l
+ flow_num=11223
+ '[' 11223 -eq 11223 ']'
+ break
+ date
Mon Nov 9 01:51:41 EST 2020   <=== it takes 9 seconds

[root@dell-per740-42 bz1890974]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.11-2.11.3-73.el8fdp.x86_64
ovn2.11-2.11.1-55.el8fdp.x86_64
ovn2.11-host-2.11.1-55.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.11-central-2.11.1-55.el8fdp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.11 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5309