Description of problem:

Issue 1: With 1000 disks and 200 VMs loaded, attempting to start a VM results in the VM showing "not responding" status for 20-30s; the VM eventually starts after about a minute.

Issue 2: With 1000 disks and 200 VMs loaded, launching VMs in groups of 5 had to be carefully paced, otherwise the VMs would remain in "not responding" status. Normally we can use around 5 virtual users (JMeter threads) to aggressively start VMs in groups of 5; with 1000 disks loaded this was problematic and required spacing the 5 users with 35s pauses between further VM start requests. This was noticed because many VMs got stuck in "VM not responding" status and then slowly caught up once all VM start requests were stopped for 10-15 minutes.

Version-Release number of selected component (if applicable):
RHEL 7.3
vdsm-4.18.15.2-1
rhevm: 4.0.5

How reproducible:
Very

Steps to Reproduce:
1. Create 200 VMs, each containing 5 disks
2. Attempt to start 1 VM individually
3. Or attempt to start VMs in batches of 5

Actual results:
When a VM is starting, its status becomes "VM not responding" for 20 seconds or longer until it updates to "starting". When batching VM starts, many VMs get stuck in "VM not responding" status and then slowly catch up once all VM start requests are stopped for 10-15 minutes.

Expected results:
VMs should not enter "not responding" status.

Additional info:
Engine and VDSM logs available here: https://drive.google.com/open?id=0B8V1DXeGhPPWQk5jaElsaDFONnM
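For reference, the batched-start pattern described above can be scripted instead of driven through JMeter. The following is a minimal sketch only, assuming ovirt-engine-sdk-python v4 is installed; the engine URL, credentials, VM name pattern, batch size, and pause length are placeholders rather than values taken from this bug.

# Sketch of starting VMs in paced batches via the oVirt REST API.
# All connection details and the name pattern are hypothetical.
import time
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()

# Collect the test VMs (search pattern is an assumption).
vms = vms_service.list(search='name=run_vm_RFE_*')

BATCH_SIZE = 5   # 5 concurrent "virtual users", as in the report
PAUSE = 35       # seconds between batches, per the workaround above

for i in range(0, len(vms), BATCH_SIZE):
    for vm in vms[i:i + BATCH_SIZE]:
        vms_service.vm_service(vm.id).start()  # request VM start
    time.sleep(PAUSE)                          # pace the next batch

connection.close()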
Is it really not responding? What is the state of the process when this happens?
Do you have libvirt logs? Host logs?
What is the CPU usage of the host when this happens?
Libvirt monitor is unresponsive:

- VM created by engine here:
2016-11-15 12:51:04,609 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (default task-50) [aa00c14] START, CreateVDSCommand(HostName = ucs1-b420-1.eng.lab.tlv.redhat.com, CreateVmVDSCommandParameters:{runAsync='true', hostId='283a78eb-7382-4a58-b8f4-1de6141ee0a1', vmId='75de2ec4-aa46-4c04-aa3b-1494c0b49e8f', vm='VM [run_vm_RFE_-1]'}), log id: 6587d4b1

- Engine got the response and saved the initial state as WaitForLaunch:
2016-11-15 12:51:04,789 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-50) [aa00c14] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 53effb24

- ~1 second goes by

- Subsequent engine monitoring detects the NonResponding status reported by VDSM:
2016-11-15 12:51:05,776 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [] VM '75de2ec4-aa46-4c04-aa3b-1494c0b49e8f'(run_vm_RFE_-1) moved from 'WaitForLaunch' --> 'NotResponding

- VDSM reports the libvirt monitor as unresponsive:
jsonrpc.Executor/1::WARNING::2016-11-15 12:51:05,772::vm::4842::virt.vm::(_setUnresponsiveIfTimeout) vmId=`75de2ec4-aa46-4c04-aa3b-1494c0b49e8f`::monitor become unresponsive (command timeout, age=4379472.67)
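The mechanism behind that warning can be summarized as follows. This is a simplified illustration only, not the actual vdsm source: if the last successful monitor response is older than a configured timeout, the VM is reported as not responding. A stale or never-initialized timestamp produces an enormous "age" like the 4379472.67s seen above, which suggests the value used for the age calculation was not yet populated for the newly started VM.

# Simplified illustration of the check behind _setUnresponsiveIfTimeout;
# names and the threshold value are assumptions for illustration only.
import time

MONITOR_TIMEOUT = 60  # seconds; hypothetical threshold

def set_unresponsive_if_timeout(vm_stats, now=None):
    now = now if now is not None else time.time()
    # An uninitialized timestamp (0 => epoch) makes the age huge,
    # so a freshly started VM is flagged as unresponsive immediately.
    age = now - vm_stats.get('last_monitor_response', 0)
    if age > MONITOR_TIMEOUT:
        vm_stats['monitor_response'] = -1  # engine maps this to NotResponding
    return age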
https://bugzilla.redhat.com/show_bug.cgi?id=1382583 should solve this. Please test again with the fix (it will be in oVirt 4.0.6).
(In reply to Yaniv Kaul from comment #1)
> Is it really not responding? What is the state of the process when this
> happens?
> Do you have libvirt logs? host logs?
> What is the CPU usage of the host when this happens?

I've updated the logs [1] after performing 2 more manual VM starts, although based on the comments above we'll need to retest to see whether this still occurs in that build.

[1] The updated logs include libvirt and vdsm logs captured after performing 2 more manual VM starts. They are in the shared folder "BZ_1395916".

When performing this test from the UI, the time it takes for the respective libvirt log to update varies between 1s and 7s, depending on whether the VM has been powered on previously. The "VM not responding" error occurs each time, as shown in the video/logs found here:
https://drive.google.com/drive/folders/0B8V1DXeGhPPWQk5jaElsaDFONnM?usp=sharing
Could you please try it again once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1382583 is available in a build?
So, https://bugzilla.redhat.com/show_bug.cgi?id=1398415 is ON_QA for 4.0.6. Could you please test this again to see if the issue is gone now?
(In reply to Tomas Jelinek from comment #6)
> So, the https://bugzilla.redhat.com/show_bug.cgi?id=1398415 is ON_QA on
> 4.0.6. Could you please test this again to see if the issue is gone now?

Confirmed that the issue is not reproducible on rhevm 4.0.6.3-0.1.el7ev.
Great, so closing as a duplicate since it is now confirmed.

*** This bug has been marked as a duplicate of bug 1382583 ***