Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1395916

Summary: [Scale] "VM not responding" during VM start up when 1000 Disks are loaded
Product: [oVirt] ovirt-engine
Component: General
Version: 4.0.5.1
Status: CLOSED DUPLICATE
Severity: medium
Priority: high
Reporter: mlehrer
Assignee: Francesco Romani <fromani>
QA Contact: guy chen <guchen>
CC: bugs, eberman, mgoldboi, michal.skrivanek, mlehrer, oourfali, rgolan, tjelinek
Keywords: Performance
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
oVirt Team: Virt
Last Closed: 2016-12-14 15:12:48 UTC

Description mlehrer 2016-11-17 00:02:39 UTC
Description of problem:
Issue 1: With 1000 disks and 200 VMs loaded, starting a VM results in a 'not responding' status for roughly 20-30 seconds; the VM eventually starts after about 1 minute.

Issue 2: With 1000 disks and 200 VMs loaded, starting VMs in groups of 5 had to be carefully paced, otherwise the VMs would remain in 'not responding' status. Normally we are able to use around 5 virtual users (JMeter threads) to aggressively start VMs in groups of 5; with 1000 disks loaded this was problematic and required spacing the 5 users with 35-second pauses between further VM start requests. We noticed this because many VMs would get stuck in "VM not responding" status and then slowly catch up once all VM start requests were stopped for 10-15 minutes.
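For reference, the paced batch-start workload could be scripted roughly as follows. This is a minimal sketch assuming the oVirt Python SDK v4 (ovirtsdk4); the engine URL, credentials, and VM naming scheme are illustrative placeholders, not the actual JMeter harness used in the test.

import time
import ovirtsdk4 as sdk

BATCH_SIZE = 5     # VMs started per batch, as in the test
BATCH_PAUSE = 35   # seconds between batches; the spacing that avoided the pile-up

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical engine URL
    username='admin@internal',
    password='secret',
    insecure=True,
)
vms_service = connection.system_service().vms_service()

# Hypothetical naming scheme for the 200 test VMs.
vm_names = ['scale_vm_%03d' % i for i in range(1, 201)]

for start in range(0, len(vm_names), BATCH_SIZE):
    for name in vm_names[start:start + BATCH_SIZE]:
        vm = vms_service.list(search='name=%s' % name)[0]
        vms_service.vm_service(vm.id).start()
    # Without this pause, many VMs got stuck in 'not responding'.
    time.sleep(BATCH_PAUSE)

connection.close()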

Version-Release number of selected component (if applicable):
RHEL 7.3
vdsm-4.18.15.2-1
rhevm: 4.0.5

How reproducible:
Very

Steps to Reproduce:
1. Create 200 VMs, each with 5 disks (1000 disks total)
2. Attempt to start 1 VM individually
3. Or attempt to start VMs in batches of 5

Actual results:
When a VM is starting, its status becomes "VM not responding" for 20 seconds or longer before it updates to 'starting' status.

Batching VM starts showed that many VMs would get stuck in "VM not responding" status and then slowly catch up once all VM start requests were stopped for 10-15 minutes.

Expected results:
VMs should not enter 'not responding' status.

Additional info:

Engine, VDSM logs available here
https://drive.google.com/open?id=0B8V1DXeGhPPWQk5jaElsaDFONnM

Comment 1 Yaniv Kaul 2016-11-17 07:08:55 UTC
Is it really not responding? What is the state of the process when this happens?
Do you have libvirt logs? host logs? 
What is the CPU usage of the host when this happens?

Comment 2 Roy Golan 2016-11-17 11:43:35 UTC
Libvirt monitor is unresponsive:

- VM created by engine here:
2016-11-15 12:51:04,609 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (default task-50) [aa00c14] START, CreateVDSCommand(HostName = ucs1-b420-1.eng.lab.tlv.redhat.com, CreateVmVDSCommandParameters:{runAsync='true', hostId='283a78eb-7382-4a58-b8f4-1de6141ee0a1', vmId='75de2ec4-aa46-4c04-aa3b-1494c0b49e8f', vm='VM [run_vm_RFE_-1]'}), log id: 6587d4b1

- Engine got the response and saved the initial state as WaitForLaunch
2016-11-15 12:51:04,789 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-50) [aa00c14] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 53effb24

- ~1 second goes by

- Subsequent engine monitoring detected NonResponding status reported by VDSM
2016-11-15 12:51:05,776 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [] VM '75de2ec4-aa46-4c04-aa3b-1494c0b49e8f'(run_vm_RFE_-1) moved from 'WaitForLaunch' --> 'NotResponding'

- VDSM reports the libvirt monitor is unresponsive
jsonrpc.Executor/1::WARNING::2016-11-15 12:51:05,772::vm::4842::virt.vm::(_setUnresponsiveIfTimeout) vmId=`75de2ec4-aa46-4c04-aa3b-1494c0b49e8f`::monitor become unresponsive (command timeout, age=4379472.67)
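
For context, that WARNING comes from a check comparing the time since the last successful libvirt monitor response against a timeout. Note the implausibly large age (4379472.67 seconds, roughly 50 days), which points to a stale or never-refreshed timestamp rather than a genuine 50-day stall. A simplified sketch of that kind of check follows; this is not VDSM's actual implementation, and the names and 60-second threshold are illustrative only.

import time

MONITOR_TIMEOUT = 60.0  # illustrative threshold, not VDSM's configured value

class VmMonitorState:
    """Simplified model of a per-VM monitor responsiveness check."""

    def __init__(self):
        # Timestamp of the last successful monitor response. If this is
        # never refreshed, the computed age grows without bound --
        # consistent with the huge age value in the log above.
        self._last_response = time.monotonic()
        self.responsive = True

    def on_monitor_response(self):
        self._last_response = time.monotonic()
        self.responsive = True

    def set_unresponsive_if_timeout(self):
        age = time.monotonic() - self._last_response
        if age > MONITOR_TIMEOUT:
            # VDSM would flip the VM to unresponsive here; the engine's
            # VmAnalyzer then reports the 'NotResponding' transition.
            self.responsive = False
        return self.responsive, age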

Comment 3 Francesco Romani 2016-11-17 11:55:36 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1382583 should solve this. Please test again with the fix (it will be in oVirt 4.0.6).

Comment 4 mlehrer 2016-11-17 13:44:48 UTC
(In reply to Yaniv Kaul from comment #1)
> Is it really not responding? What is the state of the process when this
> happens?
> Do you have libvirt logs? host logs? 
> What is the CPU usage of the host when this happens?

I've updated the logs[1] after performing 2 more manual VM starts, although based on the comments above we'll need to retest to see whether this still occurs in that build.

[1] Updated the logs to include libvirt and vdsm logs after performing 2 more manual VM starts. They're found in the shared folder "BZ_1395916".
When performing this test from the UI, the amount of time it takes for the respective libvirt log to update can vary from 1s to 7s, depending on whether the VM has been powered on previously. The "VM not responding" error message occurs each time, as shown in the video/logs found here:

https://drive.google.com/drive/folders/0B8V1DXeGhPPWQk5jaElsaDFONnM?usp=sharing

Comment 5 Tomas Jelinek 2016-11-18 13:30:06 UTC
Could you please try it again once https://bugzilla.redhat.com/show_bug.cgi?id=1382583 is in a build?

Comment 6 Tomas Jelinek 2016-12-14 11:33:17 UTC
So, https://bugzilla.redhat.com/show_bug.cgi?id=1398415 is ON_QA in 4.0.6. Could you please test this again to see if the issue is gone now?

Comment 7 mlehrer 2016-12-14 15:03:25 UTC
(In reply to Tomas Jelinek from comment #6)
> So, https://bugzilla.redhat.com/show_bug.cgi?id=1398415 is ON_QA in
> 4.0.6. Could you please test this again to see if the issue is gone now?

Confirmed that the issue is not reproducible on rhevm: 4.0.6.3-0.1.el7ev

Comment 8 Tomas Jelinek 2016-12-14 15:12:48 UTC
Great, so closing as a duplicate since it is now confirmed.

*** This bug has been marked as a duplicate of bug 1382583 ***