Bug 832370 - A live Vm's status is changed to "Not responding" during installation.
Product: oVirt
Classification: Community
Component: ovirt-engine-core
Version: 3.1 RC
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Nobody's working on this, feel free to take it
Depends On:
Reported: 2012-06-15 05:21 EDT by Mark Wu
Modified: 2014-01-06 06:20 EST (History)
CC: 11 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-01-06 06:20:14 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
vdsm.log (8.60 MB, application/octet-stream)
2012-06-20 03:57 EDT, Mark Wu

Description Mark Wu 2012-06-15 05:21:15 EDT
Description of problem:
The VM's status changes to "Not responding" during installation, even though I can see the installation still progressing in the SPICE console window. vdsm.log shows that the engine keeps polling the VM's stats with vmGetStats and that vdsm reports the VM's status as Up. The host's status is also 'up' in the web UI.
No related messages appear in engine.log, so I suspect this is a bug in the engine webadmin.

./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,994::BindingXMLRPC::859::vds::(wrapper) client []::call vmGetStats with ('a2ba8756-1b3c-47f7-9793-7f78aaf7e021',) {}
./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,996::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '2162.34', 'cpuIdle': '86.68', 'elapsedTime': '4433', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
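For readers unfamiliar with the reply format, the field that matters here is monitorResponse inside statsList. A minimal sketch of pulling it out of a reply like the one above (illustrative only; vm_responding is a made-up name, not a real vdsm or engine function):

```python
# Illustrative sketch: extract the monitorResponse flag from a vmGetStats
# reply. The function name and structure are hypothetical, not vdsm code.
def vm_responding(reply):
    """Return False when vdsm flags the qemu monitor as unresponsive."""
    stats = reply['statsList'][0]
    return int(stats.get('monitorResponse', '0')) >= 0

# Trimmed-down version of the reply logged above.
reply = {'status': {'message': 'Done', 'code': 0},
         'statsList': [{'status': 'Up', 'monitorResponse': '-1'}]}
print(vm_responding(reply))  # False
```

Note that with monitorResponse at '-1', as in the log above, vdsm still reports the VM as Up while simultaneously flagging its monitor as unresponsive.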

Version-Release number of selected component (if applicable):

How reproducible:
I have seen it twice.

Steps to Reproduce:
Install a VM and wait for about 40 minutes. In my case, the installation was quite slow.
Actual results:

Expected results:

Additional info:
Comment 1 Barak 2012-06-19 03:23:18 EDT
From the above log,

The 'monitorResponse' field is reported as -1, which indicates that the qemu monitor is responding very slowly (or not at all).
In the engine, this is the signal to change the VM's status to "Not Responding":

// if monitorResponse is negative, the monitor is considered erroneous
        if (xmlRpcStruct.contains(VdsProperties.monitorResponse)) {
            int response = Integer.parseInt(xmlRpcStruct.getItem(VdsProperties.monitorResponse).toString());
            if (response < 0) {
                // ... the engine marks the VM as "Not Responding" here (rest of the method elided)
            }
        }

As for why this happens: the usual case in which I have seen it is when a lot of I/O needs to be written to the disk and the disk performance is not good enough.
Comment 2 Dan Kenigsberg 2012-06-19 04:15:01 EDT
monitorResponse is set to -1 if a call to a libvirt virDomain API has returned VIR_ERR_OPERATION_TIMEOUT. Mark, could you see traces of this in vdsm.log?

Is the state of monitorResponse==-1 persistent, or short and transient?
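Dan's description of the mechanism can be sketched as follows. This is a simplified, hypothetical illustration (collect_stats and LibvirtError are made-up names, and the error-code constant is defined locally as a placeholder), not vdsm's actual code:

```python
# Hypothetical sketch of mapping a libvirt operation timeout to
# monitorResponse = -1; names and structure are illustrative only.
VIR_ERR_OPERATION_TIMEOUT = 68  # placeholder for libvirt's timeout error code


class LibvirtError(Exception):
    """Stand-in for libvirt.libvirtError."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code


def collect_stats(query_domain):
    """Poll the domain; report monitorResponse = -1 on a monitor timeout."""
    stats = {'monitorResponse': '0'}
    try:
        stats.update(query_domain())
    except LibvirtError as e:
        if e.code == VIR_ERR_OPERATION_TIMEOUT:
            stats['monitorResponse'] = '-1'  # engine shows "Not Responding"
        else:
            raise
    return stats


def timed_out_call():
    raise LibvirtError(VIR_ERR_OPERATION_TIMEOUT)

print(collect_stats(timed_out_call))  # {'monitorResponse': '-1'}
```

The key point is that a single timed-out libvirt call is enough to flip the flag, which is why Dan asks whether the -1 state is transient or persistent.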
Comment 3 Mark Wu 2012-06-20 03:56:33 EDT
In my test, once the VM entered the "not responding" state, I never saw it recover. It is probably caused by the stats collection in vdsm: 'statsAge' grew longer than vm_command_timeout (default value is 60 seconds). I am not sure why the stats were lost. I will try to reproduce it once I restore my ovirt-engine setup.

Thread-13268::DEBUG::2012-06-15 16:30:14,176::libvirtvm::240::vm.Vm::(_getDiskStats) vmId=`a2ba8756-1b3c-47f7-9793-7f78aaf7e021`::Disk hdc stats not available
Thread-13268::DEBUG::2012-06-15 16:30:14,176::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '62.53', 'cpuIdle': '86.68', 'elapsedTime': '2333', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
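Mark's observation about statsAge exceeding vm_command_timeout can be illustrated with a small sketch (function names are hypothetical, not vdsm's actual API; the constant mirrors the default of 60 seconds mentioned above):

```python
# Illustrative sketch of the staleness check Mark describes; names are
# hypothetical, not vdsm's actual implementation.
VM_COMMAND_TIMEOUT = 60.0  # seconds; vdsm's default vm_command_timeout


def stats_age(last_sample_time, now):
    """Seconds since the last successful stats sample."""
    return now - last_sample_time


def stats_fresh(last_sample_time, now):
    """Samples older than the timeout are treated as stale/unavailable."""
    return stats_age(last_sample_time, now) <= VM_COMMAND_TIMEOUT

# With statsAge = 62.53 as in the log above, the sample counts as stale.
print(stats_fresh(0.0, 62.53))  # False
```

Once the sample goes stale, every subsequent poll reports the same stale data, which would explain why the VM never recovered from "not responding" in Mark's test.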
Comment 4 Mark Wu 2012-06-20 03:57:48 EDT
Created attachment 593147: vdsm.log
Comment 6 Dave Neary 2013-03-18 05:50:26 EDT
Dan, Michael, what action is required here? Is this still an issue?

Comment 7 Michal Skrivanek 2014-01-06 06:20:14 EST
This is quite an old issue, caused either by a slow response or another error in libvirt's stats retrieval.
We also recently decreased the number of network queries, which should lighten the load a bit.
However, under overload conditions this behavior is still possible and is expected.

Please reopen if this is still relevant or if you see it on a non-busy system (please include the libvirt log and performance stats from that moment).
