Bug 1327376 - Slow network performance and high cpu usage by ovs-vswitchd [NEEDINFO]
Summary: Slow network performance and high cpu usage by ovs-vswitchd
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Flavio Leitner
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-15 02:29 UTC by Jeremy
Modified: 2020-06-11 12:50 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-25 04:28:03 UTC
Target Upstream Version:
fleitner: needinfo? (jmelvin)



Description Jeremy 2016-04-15 02:29:58 UTC
Description of problem: Slow network performance in the OpenStack environment, particularly with file transfers between two instances in the same tenant on different compute nodes.
The customer performed a transfer test with a 4 GB file. There appears to be a significant network issue: transferring the file from one instance to another (different compute nodes, same tenant) took about 50 minutes.

[BOS-ED6][edurand@amadeus-gw ~]$ rsync --progress edurand@couch-01:~/CentOS-7-x86_64-DVD-1511.iso CentOS-7-x86_64-DVD-1511.iso
CentOS-7-x86_64-DVD-1511.iso
  4329570304 100%    1.33MB/s    0:51:40 (xfer#1, to-check=0/1)

sent 30 bytes  received 4330098911 bytes  1396580.86 bytes/sec
total size is 4329570304  speedup is 1.00
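As a sanity check, the reported rate is consistent with the elapsed time: 4,329,570,304 bytes over 51:40 (3100 s) works out to roughly 1.33 MB/s, which can be verified with a quick awk one-liner:

```shell
# Verify the rsync-reported throughput: bytes / elapsed seconds / MiB
awk 'BEGIN { printf "%.2f MB/s\n", 4329570304 / (51*60 + 40) / 1048576 }'
```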

The ovs-vswitchd process is consistently running at over 100% CPU; top output below:
top - 14:57:05 up 47 days,  2:42,  2 users,  load average: 4.13, 4.00, 4.22
Tasks: 1083 total,   3 running, 1080 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.3 us,  2.3 sy,  0.0 ni, 86.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65694708 total, 17672964 free, 19198204 used, 28823540 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 45660312 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                          
  1306 root      10 -10 1304752 157788   9252 R 102.0  0.2   8552:26 ovs-vswitchd

Version-Release number of selected component (if applicable):
openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch 
openvswitch-2.4.0-1.el7.x86_64 


How reproducible:
100%

Steps to Reproduce:
1. Perform a file transfer test between two VMs in the same tenant on different compute nodes.

Actual results:
slow network 

Expected results:
normal network throughput

Additional info:

### sosreport-overcloud-compute-0.localdomain.01615486-20160411162738/var/log/openvswitch/ovs-vswitchd.log
The logs are filled with the following messages:
2016-04-11T20:33:50.741Z|1533806|poll_loop|INFO|Dropped 91355 log messages in last 4 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:33:50.741Z|1533807|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (111% CPU usage)
2016-04-11T20:33:56.741Z|1533808|poll_loop|INFO|Dropped 184734 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:33:56.741Z|1533809|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (100% CPU usage)
2016-04-11T20:33:58.824Z|1533810|bond|INFO|bond bond1: shift 5926kB of load (with hash 1) from p3p1 to p3p2 (now carrying 34120kB and 27404kB load, respectively)
2016-04-11T20:33:58.824Z|1533811|bond|INFO|bond bond1: shift 6846kB of load (with hash 41) from em2 to p3p2 (now carrying 30347kB and 34250kB load, respectively)
2016-04-11T20:33:58.824Z|1533812|bond|INFO|bond bond1: shift 2053kB of load (with hash 15) from p3p2 to em2 (now carrying 32197kB and 32400kB load, respectively)
2016-04-11T20:33:58.824Z|1533813|bond|INFO|bond bond1: shift 1766kB of load (with hash 55) from p3p1 to em1 (now carrying 32354kB and 32869kB load, respectively)
2016-04-11T20:34:02.742Z|1533814|poll_loop|INFO|Dropped 134423 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2016-04-11T20:34:02.742Z|1533815|poll_loop|INFO|wakeup due to 0-ms timeout at ofproto/bond.c:670 (95% CPU usage)
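The repeated 0-ms wakeups at ofproto/bond.c:670 suggest the bond rebalancing logic is spinning in the main loop. A possible diagnostic sketch (the bond name bond1 is taken from the log; verify it matches the actual configuration): inspect the bond state, then temporarily disable load-based rebalancing to see whether CPU usage drops.

```shell
# Inspect the bond mode, members, and per-hash load distribution
ovs-appctl bond/show bond1

# Temporarily disable load-based rebalancing for this balance-slb bond
# (interval in ms; 0 disables rebalancing) to test whether the busy
# loop and high CPU usage go away
ovs-vsctl set port bond1 other_config:bond-rebalance-interval=0
```

If CPU usage returns to normal with rebalancing disabled, that would narrow the problem to the rebalance path rather than the datapath itself.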


## netstat output from one of the instances doing the file transfer test
[BOS-ED6][edurand@amadeus-gw ~]$ netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1400 19058896      0      0 0      12211967      0      0 0 BMRU
lo       65536   614638      0      0 0        614638      0      0 0 LRU

Comment 2 Flavio Leitner 2016-06-28 19:35:10 UTC
What's the current status of this?

I read in the support ticket about packet loss in the network; is that the current line of thinking?

You can add an internal port to the bridge on each host, assign it an IP address, and do a larger transfer between the hosts to check whether host-to-host communication is good.
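The host-to-host check above could look like the following sketch (the bridge name br-ex, the port name test0, and the addresses are assumptions; run the setup commands on each host with its own address, then measure between them):

```shell
# On each compute node: add an internal port to the bridge and assign an address
ovs-vsctl add-port br-ex test0 -- set interface test0 type=internal
ip addr add 192.0.2.1/24 dev test0   # use 192.0.2.2 on the second host
ip link set test0 up

# Measure host-to-host throughput, bypassing the instances entirely
iperf3 -s &                          # on host A
iperf3 -c 192.0.2.1 -t 30            # on host B
```

If host-to-host throughput is normal, the bottleneck is more likely in the virtual switching path (e.g. the spinning bond rebalance loop) than in the physical network.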

Thanks,
fbl

