Description of problem:
Slow network performance in the OpenStack environment, particularly with file transfers between two instances in the same tenant on different compute nodes. The customer performed a transfer test with a 4GB file. There appears to be a significant network issue: the transfer from one instance to another on a different compute node in the same tenant took about 50 minutes.

[BOS-ED6][edurand@amadeus-gw ~]$ rsync --progress edurand@couch-01:~/CentOS-7-x86_64-DVD-1511.iso CentOS-7-x86_64-DVD-1511.iso
CentOS-7-x86_64-DVD-1511.iso
  4329570304 100%    1.33MB/s    0:51:40 (xfer#1, to-check=0/1)

sent 30 bytes  received 4330098911 bytes  1396580.86 bytes/sec
total size is 4329570304  speedup is 1.00

The ovs-vswitchd process is consistently running at 100%+ CPU; top output below:

top - 14:57:05 up 47 days,  2:42,  2 users,  load average: 4.13, 4.00, 4.22
Tasks: 1083 total,   3 running, 1080 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.3 us,  2.3 sy,  0.0 ni, 86.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65694708 total, 17672964 free, 19198204 used, 28823540 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 45660312 avail Mem

  PID USER  PR  NI    VIRT    RES  SHR S  %CPU %MEM    TIME+ COMMAND
 1306 root  10 -10 1304752 157788 9252 R 102.0  0.2  8552:26 ovs-vswitchd

Version-Release number of selected component (if applicable):
openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch
openvswitch-2.4.0-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Do a file transfer test between two VMs in the same tenant but on different compute nodes
Actual results:
Slow network throughput.

Expected results:
Normal network throughput.

Additional info:

### sosreport-overcloud-compute-0.localdomain.01615486-20160411162738/var/log/openvswitch/ovs-vswitchd.log

The logs are filled with the following messages:

2016-04-11T20:33:50.741Z|1533806|poll_loop|INFO|Dropped 91355 log messages in last 4 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:33:50.741Z|1533807|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (111% CPU usage)
2016-04-11T20:33:56.741Z|1533808|poll_loop|INFO|Dropped 184734 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:33:56.741Z|1533809|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (100% CPU usage)
2016-04-11T20:33:58.824Z|1533810|bond|INFO|bond bond1: shift 5926kB of load (with hash 1) from p3p1 to p3p2 (now carrying 34120kB and 27404kB load, respectively)
2016-04-11T20:33:58.824Z|1533811|bond|INFO|bond bond1: shift 6846kB of load (with hash 41) from em2 to p3p2 (now carrying 30347kB and 34250kB load, respectively)
2016-04-11T20:33:58.824Z|1533812|bond|INFO|bond bond1: shift 2053kB of load (with hash 15) from p3p2 to em2 (now carrying 32197kB and 32400kB load, respectively)
2016-04-11T20:33:58.824Z|1533813|bond|INFO|bond bond1: shift 1766kB of load (with hash 55) from p3p1 to em1 (now carrying 32354kB and 32869kB load, respectively)
2016-04-11T20:34:02.742Z|1533814|poll_loop|INFO|Dropped 134423 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:34:02.742Z|1533815|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (95% CPU usage)

## netstat output from one of the instances doing the file transfer test

[BOS-ED6][edurand@amadeus-gw ~]$ netstat -i
Kernel Interface table
Iface    MTU   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0    1400 19058896     0      0      0 12211967      0      0      0 BMRU
lo     65536   614638     0      0      0   614638      0      0      0 LRU
What's the current status of this? I read in the support ticket about packet loss in the network; is that the current line of thinking?

You can add an internal port to the bridge on each host, assign it an IP address, and do a larger transfer between the hosts to check whether the host-to-host communication is good.

Thanks,
fbl
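The host-to-host check above could be sketched roughly as follows. This is only an illustration: the bridge name (br-ex), port name (test0), and 192.168.200.0/24 addresses are assumptions, not taken from this deployment, and must be adapted to the actual bridge and an unused subnet.

```shell
# On compute-0: add an internal port to the OVS bridge and give it an IP.
# Bridge name, port name, and addresses below are hypothetical examples.
ovs-vsctl add-port br-ex test0 -- set Interface test0 type=internal
ip link set test0 up
ip addr add 192.168.200.1/24 dev test0

# On compute-1: do the same with a peer address on the same subnet.
ovs-vsctl add-port br-ex test0 -- set Interface test0 type=internal
ip link set test0 up
ip addr add 192.168.200.2/24 dev test0

# From compute-0, push a large stream directly between the hosts,
# bypassing the instances, to see whether host-to-host throughput is sane.
dd if=/dev/zero bs=1M count=4096 | ssh 192.168.200.2 'cat > /dev/null'

# Clean up on both hosts when done.
ip addr flush dev test0
ovs-vsctl del-port br-ex test0
```

If the host-to-host transfer runs at line rate while the VM-to-VM transfer stays slow, that would point at the virtual switching path rather than the physical network.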