Bug 1074097

Summary: With RHEV-H 6.5, the main vdsm process is using excessive cpu time on an idle system.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gordon Watson <gwatson>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED DUPLICATE
QA Contact: Aharon Canan <acanan>
Severity: urgent
Docs Contact:
Priority: high
Version: 3.3.0
CC: amureini, bazulay, cpelland, cshao, gouyang, gwatson, hadong, huiwa, iheim, leiwang, loberman, lpeer, mkalinin, nsoffer, scohen, s.kieske, tnisan, yaniwang, ybronhei, ycui, yeylon
Target Milestone: ---
Target Release: 3.3.3
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-24 15:04:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (flags: none):
- vgs output on a hypervisor with high CPU
- vgs output on 6.4 hypervisor - normal CPU

Description Gordon Watson 2014-03-07 23:03:59 UTC
Description of problem:

After a customer upgraded RHEV-H to a 6.5 version, the main vdsm process uses excessive CPU time on an otherwise idle system. Two hosts were upgraded and both exhibited this. One was downgraded to 6.4 and the problem no longer occurs; the other hosts in the cluster are still on 6.4 and do not exhibit the problem.


Version-Release number of selected component (if applicable):

RHEV-H 6.5 20140213 and 20140217.


How reproducible:

Only reproducible so far in this customer's environment.


Steps to Reproduce:

Actual results:

On RHEV-H 6.5 with no active VMs, the main vdsm process consumes an abnormally high amount of CPU time.

Expected results:

As seen on our lab systems, when the system is otherwise idle the main vdsm process uses roughly 6% CPU.
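For reference, one quick way to snapshot a process's CPU share is ps (the PID below is a placeholder; on a live host you would use the actual vdsm PID):

```shell
# 25815 is a placeholder PID. Note that ps reports %CPU averaged over
# the process lifetime, not an instantaneous reading like top's.
ps -o pid,pcpu,etime,comm -p 25815
```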


Additional info:

Details will be provided in subsequent updates.

Comment 14 Marina Kalinin 2014-03-18 19:35:42 UTC
Created attachment 876092 [details]
vgs output on a hypervisor with high CPU

Comment 15 Marina Kalinin 2014-03-18 19:38:06 UTC
Created attachment 876093 [details]
vgs output on 6.4 hypervisor - normal CPU

Comment 17 Yeela Kaplan 2014-03-19 07:10:55 UTC
Marina,
Can you please also attach the actual vdsm logs?
Thanks

Comment 22 loberman 2014-03-25 13:03:02 UTC
Hello

Based on the straces taken, we see that vdsm spends most of its time in futex waits. See below.

Would an ltrace be useful for profiling here?

[root@do-rhevh3 vdsm]#  strace -p 25815 -f -v
[pid 25876] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 25861] <... futex resumed> )       = 0
[pid 13289] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4176] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4174] <... futex resumed> )       = 0
[pid  4166] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4163] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4157] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4153] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4150] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4147] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4138] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4137] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4132] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4129] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4122] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4120] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4114] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4073] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4068] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4067] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4064] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4063] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4061] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4058] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4056] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 25876] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 25861] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13289] <... futex resumed> )       = 0
[pid  4176] <... futex resumed> )       = 0
[pid  4174] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4166] <... futex resumed> )       = 0
[pid  4163] <... futex resumed> )       = 0
[pid  4157] <... futex resumed> )       = 0
[pid  4153] <... futex resumed> )       = 0
[pid  4150] <... futex resumed> )       = 0
[pid  4147] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4138] <... futex resumed> )       = 0
[pid  4137] <... futex resumed> )       = 0
[pid  4132] <... futex resumed> )       = 0
[pid  4129] <... futex resumed> )       = 0
[pid  4122] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4120] <... futex resumed> )       = 0
[pid  4114] <... futex resumed> )       = 0
[pid  4073] <... futex resumed> )       = 0
[pid  4068] <... futex resumed> )       = 0
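To quantify the churn in a capture like this, one can tally the futex operations by type from the saved strace output. A minimal sketch (the sample lines are copied from the trace above; a real run would read the full strace log from a file):

```python
import re
from collections import Counter

# Sample lines copied from the strace capture above.
log = """\
[pid 25876] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 13289] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4147] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
"""

# Count futex operations by type; "resumed" continuation lines
# carry no operation name and are skipped by the regex.
ops = Counter(
    m.group(1)
    for line in log.splitlines()
    if (m := re.search(r"futex\(0x[0-9a-f]+, (FUTEX_\w+)", line))
)
print(ops)  # Counter({'FUTEX_WAKE_PRIVATE': 1, 'FUTEX_WAIT_PRIVATE': 1})
```

A heavy skew toward FUTEX_WAKE/FUTEX_WAIT pairs on a single address, as in the trace above, typically points at lock contention between threads (in a Python daemon like vdsm, often the interpreter lock) rather than at any one syscall doing real work.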

Comment 25 loberman 2014-03-27 10:05:49 UTC
Hello

I called the customer in Dubai this morning early.

Theo (the Red Hat consultant) has provisioned and installed two new RHEV-H hosts using the 3PAR storage, and the problem is no longer apparent: the high vdsm CPU usage seen on RHEV-H 6.5 does not occur on these new hosts.

He has one old configuration left, on which he was attempting to install RHEL so he could stage a "fat" RHEV for further debugging, but he has hit some issue there.

We may never get root cause here, and we know we could never reproduce it in-house. If he is able to get RHEL and RHEV-H installed, I have asked him to capture a vmcore and try the older 6.4 kernel, but that may not happen.

So for now I would lower the severity of this issue and accept that there was some anomaly in the original installation that we may never get to the root cause of.

Thank You
Laurence

Comment 33 Nir Soffer 2014-04-24 14:51:27 UTC
This seems to be a duplicate of bug 1090664. Although this bug was opened first, we still have insufficient data here, while in the other bug we have profiling results. I suggest closing this as a duplicate and continuing to track the issue in bug 1090664.