Bug 553249

Summary: hypercall device - VM becomes non-responsive during Sysmark benchmark (when more than 7 VMs are running simultaneously)
Product: Red Hat Enterprise Linux 5
Reporter: RHEL Program Management <pm-rhel>
Component: kvm
Assignee: Gleb Natapov <gleb>
Status: CLOSED ERRATA
QA Contact: Oded Ramraz <oramraz>
Severity: urgent
Priority: high
Version: 5.4
CC: cpelland, danken, knoel, lihuang, llim, mpastern, ohochman, ovirt-maint, pm-eus, Rhev-m-bugs, riek, syeghiay, tburke, virt-maint, ykaul
Target Milestone: rc
Keywords: ZStream
Hardware: All   
OS: Linux   
Fixed In Version: kvm-83-105.el5_4.18
Doc Type: Bug Fix
Last Closed: 2010-02-09 10:02:40 UTC
Bug Depends On: 503759    
Bug Blocks: 553441    

Description RHEL Program Management 2010-01-07 14:27:08 UTC
This bug has been copied from bug #503759 and has been proposed
to be backported to 5.4 z-stream (EUS).

Comment 9 Yaniv Kaul 2010-01-19 16:44:05 UTC
I hope you have the agent + driver on the guest.

Comment 10 lihuang 2010-01-19 16:50:37 UTC
(In reply to comment #9)
> I hope you have the agent + driver on the guest.    

Yes, both are installed and running.
Installed from rhevm-guest-tools-2.1-39917.iso.

Comment 17 Yaniv Kaul 2010-01-20 15:45:23 UTC
1. "-uuid `uuidgen`" means you are giving the VM a different UUID every run. That's strange.
2. "-usbdevice tablet" is not needed if you have the drivers and agents of Spice.

Comment 18 lihuang 2010-01-20 16:01:37 UTC
(In reply to comment #17)
> 1. "-uuid `uuidgen`" means you are giving the VM a different UUID every run.
> That's strange.
Is the UUID saved and reused after I quit and restart the VM? (No migration is involved.) It is difficult for me to remember every UUID I have used when starting VMs directly from the command line.

> 2. "-usbdevice tablet" is not needed if you have the drivers and agents of
> Spice.    

If this affects the current test, I will restart it; otherwise, I will keep the VM running for 12 hours.

Comment 24 Oded Ramraz 2010-01-26 07:33:43 UTC
I don't really understand why you tried to reproduce this bug without the "Sysmark" utility.
I managed to reproduce this issue with 20 1GB VMs running "Sysmark" simultaneously.
I'm reopening this bug.
Please contact me if you need help with reproducing this issue.

Comment 25 Yaniv Kaul 2010-01-26 07:44:19 UTC
(In reply to comment #24)
> I don't really understand why you tried to reproduce this bug without "Sysmark"
> utility.
> I managed to reproduce this issue with 20 1GB VM's which are running "Sysmark"
> utility simultaneously .
> I reopen this bug.
> Please contact me if you need help with reproducing this issue.    

Oded, which KVM version did you use? Also, it's a good idea to try the newer hypercall drivers (which fix the issue on their side). They will only be available in the next build, or you can take them from a nightly build (talk to Barak first; I have already WHQL'ed them).

Comment 26 Oded Ramraz 2010-01-26 07:53:10 UTC
## Host CPU information:

$h.GetCpuStatistics()

Cores  : 4
Model  : Intel(R) Xeon(R) CPU           E5310  @ 1.60GHz
Speed  : 1596

## KVM version:

kvm-83-147.el5

I'm talking to Barak about the new hypercall drivers.

Comment 27 Oded Ramraz 2010-01-26 12:34:01 UTC
I managed to reproduce this issue with the new hypercall driver.
I'm not sure it's the same bug, because the VM is stuck but does not consume 100 percent CPU.

gnatapov is investigating the stuck process.

Comment 28 Eduardo Habkost 2010-01-26 14:36:30 UTC
(In reply to comment #27)
> gnatapov is investigating the stuck process.    

Reassigning to Gleb, then.

Comment 29 Dor Laor 2010-01-27 11:55:06 UTC
Please check whether the driver resets the device when it unloads (assuming it does try to unload).

Comment 30 Oded Ramraz 2010-01-28 08:58:48 UTC
I think there are two different issues here (both occur during "Sysmark" benchmark execution):

In both cases the VM is stuck, but in one case it consumes 100 percent CPU (probably the hypercall issue) and in the other it does not.

The second case is quite common and I'm able to reproduce it (also with the new hypercall drivers).

I have few questions:

1. Do you want me to open a separate bug for the second case?
2. Can you assign someone to look at this issue immediately? Gleb is quite busy with other things, and we have a RHEVM release in a few weeks.

I'll try to attach memory dumps of my stuck guests.

Comment 32 Oded Ramraz 2010-02-07 16:41:23 UTC
When the hypercall issue occurs, the VM consumes 100 percent CPU.
The phenomenon I see is a little different: the VM gets stuck after boot but does not consume 100 percent CPU. I'm moving this bug back to ON_QA and opening a new one.

Comment 34 Oded Ramraz 2010-02-08 10:50:25 UTC
I ran the test with 15 1GB guests (on a 5.5 host, see comment 7).
I didn't manage to reproduce the 100 percent CPU issue.
You can move this bug to VERIFIED.

Comment 35 lihuang 2010-02-08 11:34:00 UTC
5.5 and 5.4-z share the same patch, so setting to VERIFIED.

Comment 38 errata-xmlrpc 2010-02-09 10:02:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0088.html