Bug 1327376

Summary: Slow network performance and high CPU usage by ovs-vswitchd
Product: Red Hat OpenStack
Reporter: Jeremy <jmelvin>
Component: openvswitch
Assignee: Flavio Leitner <fleitner>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Ofer Blaut <oblaut>
Severity: high
Priority: high
Docs Contact:
Version: 7.0 (Kilo)
CC: aloughla, apevec, atragler, chrisw, fleitner, jmelvin, mleitner, rhos-maint, rkhan, srevivo
Target Milestone: ---
Keywords: Unconfirmed, ZStream
Target Release: 7.0 (Kilo)
Flags: fleitner: needinfo? (jmelvin)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-25 04:28:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Jeremy 2016-04-15 02:29:58 UTC
Description of problem: Slow network performance in the OpenStack environment, particularly with file transfers between two instances in the same tenant on different compute nodes.
The customer performed a transfer test with a 4 GB file. There appears to be a significant network issue: transferring the file from one instance to another on a different compute node in the same tenant took about 50 minutes.

[BOS-ED6][edurand@amadeus-gw ~]$ rsync --progress edurand@couch-01:~/CentOS-7-x86_64-DVD-1511.iso CentOS-7-x86_64-DVD-1511.iso
CentOS-7-x86_64-DVD-1511.iso
  4329570304 100%    1.33MB/s    0:51:40 (xfer#1, to-check=0/1)

sent 30 bytes  received 4330098911 bytes  1396580.86 bytes/sec
total size is 4329570304  speedup is 1.00

The ovs-vswitchd process is consistently running at 100%+ CPU; top output below:
top - 14:57:05 up 47 days,  2:42,  2 users,  load average: 4.13, 4.00, 4.22
Tasks: 1083 total,   3 running, 1080 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.3 us,  2.3 sy,  0.0 ni, 86.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65694708 total, 17672964 free, 19198204 used, 28823540 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 45660312 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                          
  1306 root      10 -10 1304752 157788   9252 R 102.0  0.2   8552:26 ovs-vswitchd

Version-Release number of selected component (if applicable):
openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch 
openvswitch-2.4.0-1.el7.x86_64 


How reproducible:
100%

Steps to Reproduce:
1. Transfer a file between two VMs in the same tenant but on different compute nodes
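The reproduction step above can be sketched with iperf3 instead of rsync, which isolates raw network throughput from disk and ssh/cipher overhead (IP addresses below are hypothetical placeholders; assumes iperf3 is installed in both guests):

```shell
# On the receiving instance (compute node B): start an iperf3 server.
iperf3 -s

# On the sending instance (compute node A): run a 60-second TCP test
# toward the receiver, then repeat in the reverse direction (-R).
iperf3 -c 10.0.0.12 -t 60
iperf3 -c 10.0.0.12 -t 60 -R
```

If iperf3 shows full line-rate throughput while rsync stays at ~1.3 MB/s, the bottleneck is not the virtual network path.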

Actual results:
slow network 

Expected results:
normal network throughput

Additional info:

### sosreport-overcloud-compute-0.localdomain.01615486-20160411162738/var/log/openvswitch/ovs-vswitchd.log
The logs are filled with the following messages:
2016-04-11T20:33:50.741Z|1533806|poll_loop|INFO|Dropped 91355 log messages in last 4 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:33:50.741Z|1533807|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (111% CPU usage)
2016-04-11T20:33:56.741Z|1533808|poll_loop|INFO|Dropped 184734 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:33:56.741Z|1533809|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (100% CPU usage)
2016-04-11T20:33:58.824Z|1533810|bond|INFO|bond bond1: shift 5926kB of load (with hash 1) from p3p1 to p3p2 (now carrying 34120kB and 27404kB load, respectively)
2016-04-11T20:33:58.824Z|1533811|bond|INFO|bond bond1: shift 6846kB of load (with hash 41) from em2 to p3p2 (now carrying 30347kB and 34250kB load, respectively)
2016-04-11T20:33:58.824Z|1533812|bond|INFO|bond bond1: shift 2053kB of load (with hash 15) from p3p2 to em2 (now carrying 32197kB and 32400kB load, respectively)
2016-04-11T20:33:58.824Z|1533813|bond|INFO|bond bond1: shift 1766kB of load (with hash 55) from p3p1 to em1 (now carrying 32354kB and 32869kB load, respectively)
2016-04-11T20:34:02.742Z|1533814|poll_loop|INFO|Dropped 134423 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:34:02.742Z|1533815|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (95% CPU usage)
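The repeated "wakeup due to 0-ms timeout at ofproto/bond.c:670" lines indicate the bond rebalancing code is waking ovs-vswitchd in a tight loop. A diagnostic sketch (assuming the bond is named bond1 as in the log; treat the interval change as a workaround to confirm the busy loop, not as a fix):

```shell
# Inspect the bond's members, hashes, and current load distribution.
ovs-appctl bond/show bond1

# Rebalancing is controlled by other_config:bond-rebalance-interval
# (milliseconds; 0 disables rebalancing). If CPU usage drops after this,
# the rebalance path in ofproto/bond.c is confirmed as the busy loop.
ovs-vsctl set port bond1 other_config:bond-rebalance-interval=0
```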


## netstat output from one of the instances doing the file transfer test
[BOS-ED6][edurand@amadeus-gw ~]$netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1400 19058896      0      0      0 12211967      0      0      0 BMRU
lo       65536   614638      0      0      0   614638      0      0      0 LRU

Comment 2 Flavio Leitner 2016-06-28 19:35:10 UTC
What's the current status of this?

I read in the support ticket about packet loss in the network; is that still the current line of thinking?

You can add an internal port to the bridge on each host, assign it an IP address, and run a larger transfer between the hosts to check whether host-to-host communication is healthy.
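That host-to-host check could look roughly like the following (bridge name, port name, and addresses are hypothetical; run the mirror-image commands with .2 on the second host):

```shell
# On host A: add an internal port to the bridge and address it on a
# dedicated test subnet, bypassing the instances entirely.
ovs-vsctl add-port br-ex test0 -- set interface test0 type=internal
ip addr add 192.168.200.1/24 dev test0
ip link set test0 up

# From host B (configured as 192.168.200.2): measure throughput across
# the internal ports, e.g. with iperf3 or a large file copy.
iperf3 -c 192.168.200.1 -t 60
```

If this host-to-host path is also slow, the problem is below the instances, in the host bridges or the bond.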

Thanks,
fbl