Description of problem:
After a customer upgraded RHEV-H to a 6.5 version, the main vdsm process is using a lot of CPU time on an otherwise idle system. Two hosts were upgraded and both experienced this. One was downgraded to 6.4 and the problem no longer occurs. The other hosts in the cluster are still on 6.4 and do not exhibit this problem.

Version-Release number of selected component (if applicable):
RHEV-H 6.5 20140213 and 20140217.

How reproducible:
Only reproducible so far in this customer's environment.

Steps to Reproduce:
1.
2.
3.

Actual results:
On RHEV-H 6.5 with no active VMs, the main vdsm process is using a high amount of CPU time.

Expected results:
As seen on our lab systems, when the system is otherwise idle the main vdsm process uses something in the range of 6% CPU.

Additional info:
Details will be provided in subsequent updates.
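(For reference only, not from the original report: one way to quantify the idle CPU usage described above. The ps pattern used to find the vdsm main process is an assumption and may need adjusting for the installed version.)

# locate the main vdsm process (pattern is a guess) and sample its CPU use in 5-second intervals
VDSM_PID=$(ps -eo pid,args | awk '/vdsm\/vdsm/ && !/awk/ {print $1; exit}')
pidstat -u -p "$VDSM_PID" 5 6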
Created attachment 876092 [details] vgs output on a hypervisor with high CPU
Created attachment 876093 [details] vgs output on 6.4 hypervisor - normal CPU
Marina,
Can you please also attach the actual vdsm logs?
Thanks
Hello
Based on the straces taken, we see that most of the time in vdsm is spent in futex waits; see below. Would an ltrace be more useful for profiling here?

[root@do-rhevh3 vdsm]# strace -p 25815 -f -v
[pid 25876] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 25861] <... futex resumed> ) = 0
[pid 13289] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4176] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4174] <... futex resumed> ) = 0
[pid  4166] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4163] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4157] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4153] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4150] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4147] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4138] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4137] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4132] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4129] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4122] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4120] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4114] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4073] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4068] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4067] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4064] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4063] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4061] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4058] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4056] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 25876] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 25861] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13289] <... futex resumed> ) = 0
[pid  4176] <... futex resumed> ) = 0
[pid  4174] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4166] <... futex resumed> ) = 0
[pid  4163] <... futex resumed> ) = 0
[pid  4157] <... futex resumed> ) = 0
[pid  4153] <... futex resumed> ) = 0
[pid  4150] <... futex resumed> ) = 0
[pid  4147] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid  4138] <... futex resumed> ) = 0
[pid  4137] <... futex resumed> ) = 0
[pid  4132] <... futex resumed> ) = 0
[pid  4129] <... futex resumed> ) = 0
[pid  4122] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid  4120] <... futex resumed> ) = 0
[pid  4114] <... futex resumed> ) = 0
[pid  4073] <... futex resumed> ) = 0
[pid  4068] <... futex resumed> ) = 0
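(Side note, not from the original comment: instead of a raw trace, strace can aggregate time and call counts per syscall, which may show more directly how much wall time goes to futex. The PID below is the one from the trace above and is otherwise a placeholder.)

# -c prints a per-syscall summary on exit, -f follows the child threads
strace -c -f -p 25815
# let it run for ~30 seconds, then stop with Ctrl-C to print the summary table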
Hello
I called the customer in Dubai early this morning. Theo (the Red Hat consultant) has provisioned and installed two new RHEV-H hosts using the 3PAR storage, and the problem is no longer apparent: the high-CPU vdsm issue on RHEV-H 6.5 does not occur on these new hosts.

He has one host with the old configuration left, on which he was attempting to install RHEL so he could stage a "fat" RHEV setup for further debugging, but he has run into some issues there. We may never get root cause here, and we have not been able to reproduce the problem in-house. If he does get RHEL and RHEV-H installed, I have asked him to capture a vmcore and to try the older 6.4 kernel, but that may not happen.

So for now I would lower the severity of this issue and accept that there was some anomaly in the original installation that we may never get to the root cause of.

Thank You
Laurence
This seems to be a duplicate of bug 1090664. Although this bug was opened first, we still have insufficient data here, while the other bug has profiling results. I suggest closing this one as a duplicate and continuing to track the issue in bug 1090664.