Bug 1386514

Summary: [fdBeta] ovs-vswitchd takes 100% CPU time
Product: Red Hat Enterprise Linux 7
Reporter: Marcin Mirecki <mmirecki>
Component: openvswitch
Assignee: Lance Richardson <lrichard>
Status: CLOSED CURRENTRELEASE
QA Contact: qding
Severity: urgent
Docs Contact:
Priority: high
Version: 7.1
CC: alukiano, atragler, danken, fleitner, kzhang, lrichard, mkalfon, mmirecki, myakove, qding
Target Milestone: rc
Keywords: AutomationBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch-2.5.0-22.git20160727.el7fdb
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1397502 1397504 1475772 (view as bug list)
Environment:
Last Closed: 2017-01-12 17:16:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1366899, 1397502, 1397504, 1475772, 1477658, 1496194
Attachments (description / flags):
- Result of: strace -Cvfr -p `pidof ovs-vswitchd` -o vswitchd.strace (none)
- patched ovs with fix and debug (none)
- openvswitch logs after patch from 2016-10-30 (none)
- ovs-vswitchd log (none)
- ovs-vswitchd log (none)
- strace log (none)
Description Marcin Mirecki 2016-10-19 07:11:54 UTC
Created attachment 1211993 [details]
Result of: strace -Cvfr -p `pidof ovs-vswitchd` -o vswitchd.strace

When the openvswitch service is started, the ovs-vswitchd process takes 100% of the CPU time.

Top dump:
 1227 root      10 -10   48208  11240   9908 R 100.0  0.1   1534:17 ovs-vswitchd                                                                                
The error appeared without any identifiable cause. Once it started, it now occurs every time the openvswitch service is started.
The result of: strace -Cvfr -p `pidof ovs-vswitchd` -o vswitchd.strace
shows the following sequence recurring in the log at an enormous rate:


1227       0.000013 recvfrom(3, 0x55f5a48f6dcd, 339, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000013 recvmsg(11, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000012 accept(10, 0x7ffc6ea263d0, [128]) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000014 recvmsg(14, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000012 recvmsg(13, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000013 recvmsg(13, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000012 recvmsg(13, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
1227       0.000014 poll([{fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=11, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=3, events=POLLIN}], 6, 0) = 0 (Timeout)
1227       0.000015 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={53405, 379161}, ru_stime={38423, 776487}, ru_maxrss=11240, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=496, ru_majflt=0, ru_nswap=0, ru_inblock=1688, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=53779241, ru_nivcsw=260447}) = 0

Comment 2 Thadeu Lima de Souza Cascardo 2016-10-30 09:09:35 UTC
Created attachment 1215516 [details]
patched ovs with fix and debug

Hi.

Can you test this package and let me know if it fixes the problem? If it doesn't, can you enable debug for poll-loop and send the logs, please?

ovs-appctl vlog/set poll_loop:DBG

Thanks.
Cascardo.

Comment 3 Marcin Mirecki 2016-11-02 13:11:06 UTC
Created attachment 1216532 [details]
openvswitch logs after the patch from 2016-10-30

The patch seems to have fixed the issue.

I am attaching the collected openvswitch logs.

Comment 4 Thadeu Lima de Souza Cascardo 2016-11-03 18:09:50 UTC
Hi, Marcin.

The fix has been applied upstream. Can you test master and see if it fixes your problem? I ask just in case the issue was really fixed by the fix itself and not by the debug messages I added.

Thanks.
Cascardo.

Comment 5 Mor 2016-11-13 08:07:50 UTC
Created attachment 1220123 [details]
ovs-vswitchd log

Comment 6 Mor 2016-11-13 08:10:53 UTC
Hi guys, 
I have reproduced the problem in my environment, on Open vSwitch 2.6.90.

Comment 7 Thadeu Lima de Souza Cascardo 2016-11-14 12:22:26 UTC
The logs seem to indicate timeouts every 500 ms for the revalidator and every 5 s for the stats update, all of which are normal. Also, only 0% CPU use is shown. Either the logs do not cover the time when the issue happened, or there is something that is not being accounted for.

Cascardo.

Comment 8 Mor 2016-11-14 12:52:02 UTC
I grabbed the log only after verifying that the ovs-vswitchd process was consuming high CPU. I also set the OVS logging level to debug (ovs-appctl vlog/set dbg). I will try to upgrade to OVS 2.6.1.

Thanks,
Mor

Comment 9 qding 2016-11-18 10:06:55 UTC
Hi Marcin,

I'm a network QE engineer reviewing this BZ, but I cannot reproduce the issue. Could you please provide more information, such as the kernel version, the openvswitch package, reproduction steps, etc.?

Thanks
QJ

Comment 10 Mor 2016-11-21 13:37:07 UTC
Hi,

It's not always reproducible in my environment, and I have no steps to reproduce it.

I am attaching additional debug level logs taken from OVN host.

Kernel version: 
3.10.0-512.el7.x86_64

OVS packages:
openvswitch-2.6.90-1.fc24.x86_64

Comment 11 Mor 2016-11-21 13:37:52 UTC
Created attachment 1222398 [details]
ovs-vswitchd log

Comment 12 Mor 2016-11-21 17:22:20 UTC
Created attachment 1222447 [details]
strace log

Comment 13 Mor 2016-11-21 17:58:02 UTC
I will try to upgrade to the latest master release and update with my findings.

Comment 14 Mor 2016-11-22 07:21:09 UTC
It seems to be fixed with the latest master release.

Comment 15 qding 2016-11-22 07:39:00 UTC
(In reply to Mor from comment #10)
> 
> Its not always reproducible on my environment. And I have no steps to
> reproduce it.
> 
> I am attaching additional debug level logs taken from OVN host.
> 
> Kernel version: 
> 3.10.0-512.el7.x86_64
> 
> OVS packages:
> openvswitch-2.6.90-1.fc24.x86_64

Thank you for the update.
Without reproduction steps, could you please share something like your command-line history? I'd like to know what you did before the issue happened, to help me reproduce it.

thanks
QJ

Comment 16 Mor 2016-11-22 10:19:21 UTC
In our case, the changes are made by a software provider (which provides CMS access to OVN). This software component uses the OVS API, so unfortunately the command-line history won't help us track the changes.

Comment 18 Flavio Leitner 2016-11-22 16:36:59 UTC
The validated fix is applied to the fdBeta package.
Please open a fresh bug if there is another issue to be resolved.
Thanks!

Comment 19 qding 2016-11-23 01:17:03 UTC
(In reply to Mor from comment #16)
> In our case, the changes are done by a software provider (which provides CMS
> access to OVN). This software component uses the OVS API. So unfortunately,
> the command-line history won't us to track the changes.

The issue is now fixed in openvswitch-2.5.0-22.git20160727.el7fdb, and I can no longer reproduce it. Could you please help verify it?

thank you
QJ

Comment 21 Mor 2016-11-24 07:47:31 UTC
Hi QJ,

Our environment requires OVS 2.6.0 or higher, so I am unable to verify it on 2.5.0.

Thanks.
Mor

Comment 22 qding 2016-11-24 07:59:22 UTC
(In reply to Mor from comment #21)
> Hi QJ,
> 
> In our environment we require OVS 2.6.0 or higher, so I unable to verify it
> on 2.5.0.
> 
It's ok. Thank you anyway.

Comment 24 Marcin Mirecki 2017-06-09 08:21:36 UTC
clearing needinfo (already resolved)