Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 832370

Summary: A live VM's status is changed to "Not responding" during installation.
Product: [Retired] oVirt
Reporter: Mark Wu <wudxw>
Component: ovirt-engine-core
Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED WORKSFORME
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.1 RC
CC: abaron, acathrow, bazulay, danken, dneary, ecohen, iheim, jkt, mgoldboi, michal.skrivanek, oschreib
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-06 11:20:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments: vdsm.log (flags: none)

Description Mark Wu 2012-06-15 09:21:15 UTC
Description of problem:
A VM's status changes to "Not responding" during installation, even though I can see the installation still in progress in the SPICE add-on window. The engine still polls the VM's stats with vmGetStats (per vdsm.log), vdsm reports the VM's status as "Up", and the host status is "up" in the web UI.
Since no related messages were found in engine.log, I suspect this is a bug in the engine webadmin.


./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,994::BindingXMLRPC::859::vds::(wrapper) client [192.168.122.1]::call vmGetStats with ('a2ba8756-1b3c-47f7-9793-7f78aaf7e021',) {}
./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,996::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '192.168.122.1', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '2162.34', 'cpuIdle': '86.68', 'elapsedTime': '4433', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
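
(For anyone triaging similar reports: a small log-scanning sketch that pulls negative monitorResponse samples out of a vdsm.log like the excerpt above. The regex and the line layout it assumes are inferred from this bug's log, not from vdsm's documented format.)

    import re

    # Matches the monitorResponse field in "return vmGetStats" lines,
    # assuming the line layout shown in the excerpt above.
    PATTERN = re.compile(r"return vmGetStats .*'monitorResponse': '(-?\d+)'")

    def scan(path="vdsm.log"):
        """Print the timestamp of every sample with a negative monitorResponse."""
        with open(path) as log:
            for line in log:
                m = PATTERN.search(line)
                if m and int(m.group(1)) < 0:
                    # The timestamp sits between the log level and the module name.
                    timestamp = line.split("::")[2]
                    print(timestamp, "monitorResponse =", m.group(1))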

Version-Release number of selected component (if applicable):
ovirt-engine-webadmin-portal-3.1.0_0001-1.8.fc17.noarch
ovirt-engine-core-3.1.0_0001-1.8.fc17.noarch

How reproducible:
I have seen it twice.

Steps to Reproduce:
Install a VM and wait for about 40 minutes. In my case, the installation was quite slow.
Actual results:
The VM is shown as "Not responding" in the webadmin UI, while the guest installation continues normally and vdsm keeps reporting the VM as "Up".

Expected results:
The VM stays in the "Up" status for as long as vdsm reports it as up.

Additional info:

Comment 1 Barak 2012-06-19 07:23:18 UTC
From the log above, 'monitorResponse' is reported as -1, which indicates that the monitor is responding very slowly. In the engine, a negative value is the trigger for changing the VM's status to "Not Responding":

// If monitorResponse is negative, it means the monitor is in an erroneous state.
if (xmlRpcStruct.contains(VdsProperties.monitorResponse)) {
    int response = Integer.parseInt(xmlRpcStruct.getItem(VdsProperties.monitorResponse).toString());
    if (response < 0) {
        vm.setstatus(VMStatus.NotResponding);
    }
}

As for why this happens: the usual case where I have seen it is when a lot of I/O needs to be written to the disk and the disk performance is not good enough.

Comment 2 Dan Kenigsberg 2012-06-19 08:15:01 UTC
monitorResponse is set to -1 if a call to a libvirt virDomain API has returned VIR_ERR_OPERATION_TIMEOUT. Mark, could you see traces of this in vdsm.log?

Is the state of monitorResponse==-1 persistent, or short and transient?
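
(For context, a minimal sketch of the mapping Dan describes: a stats-collection routine turns a libvirt operation timeout into monitorResponse = -1. This is illustrative, not vdsm's actual code; the function name and the choice of blockStats() as the probed call are assumptions.)

    import libvirt

    def sample_monitor_response(dom):
        """Return 0 if the domain's monitor answered, -1 on a libvirt
        operation timeout. Illustrative sketch, not vdsm's real code."""
        try:
            # Any virDomain call that has to go through the qemu monitor
            # can time out while the monitor is busy; block stats is one.
            dom.blockStats("hda")
            return 0
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                return -1
            raise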

Comment 3 Mark Wu 2012-06-20 07:56:33 UTC
In my test, once the VM entered the "Not responding" state, I never saw it recover. It is probably caused by the stats collection in vdsm: 'statsAge' grew longer than vm_command_timeout (default value: 60 seconds). I am not sure why the stats samples were lost; I will try to reproduce it once I restore my ovirt-engine setup. The log excerpt below shows 'statsAge' just past that threshold, and a sketch of the staleness check follows it.

Thread-13268::DEBUG::2012-06-15 16:30:14,176::libvirtvm::240::vm.Vm::(_getDiskStats) vmId=`a2ba8756-1b3c-47f7-9793-7f78aaf7e021`::Disk hdc stats not available
Thread-13268::DEBUG::2012-06-15 16:30:14,176::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '192.168.122.1', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '62.53', 'cpuIdle': '86.68', 'elapsedTime': '2333', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
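
(A minimal sketch of the staleness check Mark describes: once a sample's age exceeds vm_command_timeout, the stats count as stale. The StatsSample class and is_stale() helper are hypothetical; only the 60-second default comes from the comment above.)

    import time

    VM_COMMAND_TIMEOUT = 60  # vdsm's vm_command_timeout default, in seconds

    class StatsSample:
        """Hypothetical container for one periodically collected sample."""
        def __init__(self, stats):
            self.stats = stats
            self.timestamp = time.time()

        def age(self):
            return time.time() - self.timestamp

    def is_stale(sample):
        # If the sampling thread stops producing fresh samples, the age
        # ('statsAge' in the log) grows past vm_command_timeout and the
        # engine eventually shows the VM as "Not responding".
        return sample.age() > VM_COMMAND_TIMEOUT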

Comment 4 Mark Wu 2012-06-20 07:57:48 UTC
Created attachment 593147 [details]
vdsm.log

Comment 6 Dave Neary 2013-03-18 09:50:26 UTC
Dan, Michael, what action is required here? Is this still an issue?

Thanks,
Dave.

Comment 7 Michal Skrivanek 2014-01-06 11:20:14 UTC
This is quite an old issue, caused either by a slow response or another error in libvirt's stats retrieval.
We also recently decreased the number of network queries, which should lighten the load a bit.
However, under overload conditions it is still possible to see this, and that is expected behavior.

Please reopen if this is still relevant, or if you see it on a non-busy system (in that case, please include the libvirt log and performance stats from that moment).