Description of problem:
The VM's status changes to "Not responding" during OS installation, even though I can see the installation still progressing in the SPICE add-on window. vdsm.log shows that the engine is still polling the VM's stats via vmGetStats and that vdsm reports the VM's status as Up. The host's status is 'Up' in the web UI. There are no related messages in engine.log, so I suspect a bug in the engine webadmin.

./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,994::BindingXMLRPC::859::vds::(wrapper) client [192.168.122.1]::call vmGetStats with ('a2ba8756-1b3c-47f7-9793-7f78aaf7e021',) {}
./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,996::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '192.168.122.1', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '2162.34', 'cpuIdle': '86.68', 'elapsedTime': '4433', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}

Version-Release number of selected component (if applicable):
ovirt-engine-webadmin-portal-3.1.0_0001-1.8.fc17.noarch
ovirt-engine-core-3.1.0_0001-1.8.fc17.noarch

How reproducible:
Seen twice so far.

Steps to Reproduce:
1. Install a VM and wait for about 40 minutes. In my case, the installation is quite slow.

Actual results:
The webadmin shows the VM as "Not Responding" while vdsm reports it as Up.

Expected results:
The VM's status stays Up for as long as vdsm reports it as Up.

Additional info:
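To separate the engine's view from vdsm's, vdsm can also be queried directly over XML-RPC. Below is a minimal Java sketch using Apache XML-RPC. It assumes vdsm is listening unencrypted on port 54321 and that the wire-level method is named getVmStats (the log above shows vdsm's internal wrapper name, vmGetStats); on a default install vdsm requires SSL, so treat this as illustrative only:

    import java.net.URL;

    import org.apache.xmlrpc.client.XmlRpcClient;
    import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

    public class VdsmStatsProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: vdsm's XML-RPC binding listens on port 54321
            // without SSL, and the wire method is "getVmStats".
            XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
            config.setServerURL(new URL("http://my-vdsm-host:54321"));

            XmlRpcClient client = new XmlRpcClient();
            client.setConfig(config);

            Object result = client.execute("getVmStats",
                    new Object[] { "a2ba8756-1b3c-47f7-9793-7f78aaf7e021" });

            // The response mirrors the log lines above: a status struct plus
            // a statsList whose entries carry 'status', 'monitorResponse',
            // 'statsAge', per-disk latencies, and so on.
            System.out.println(result);
        }
    }

Comparing the 'status' and 'monitorResponse' fields in this raw reply against what the webadmin shows is what makes the inconsistency below visible.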
From the above log, 'monitorResponse' is reported as -1, which indicates that the monitor's responsiveness is very poor. In the engine this is the trigger for changing the VM's status to "Not Responding":

    // if monitorResponse returns negative it means it's erroneous
    if (xmlRpcStruct.contains(VdsProperties.monitorResponse)) {
        int response = Integer.parseInt(xmlRpcStruct.getItem(VdsProperties.monitorResponse).toString());
        if (response < 0) {
            vm.setstatus(VMStatus.NotResponding);
        }
    }

As for the reason this happens: the usual case I have seen is when a lot of IO needs to be written to disk and the disk performance is not good enough.
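The reporting side of this handshake lives in vdsm, which is written in Python; purely for illustration, here is a minimal, self-contained Java sketch of the pattern the engine snippet above consumes. Every name in it (MonitorSamplerSketch, Domain, sampleDomainStats, OperationTimeoutException) is hypothetical, not vdsm's real code:

    // Illustrative sketch only -- vdsm is Python; all names here are made up.
    public class MonitorSamplerSketch {

        // Stand-ins for a libvirt domain handle and its timeout error.
        static class Domain {}
        static class OperationTimeoutException extends Exception {}

        static final String RESPONSIVE = "0";
        static final String UNRESPONSIVE = "-1"; // engine maps "-1" to NotResponding

        static String sampleMonitorResponse(Domain dom) {
            try {
                sampleDomainStats(dom);
                return RESPONSIVE;
            } catch (OperationTimeoutException e) {
                // A timeout means the monitor did not answer in time;
                // report the sentinel instead of failing the whole poll.
                return UNRESPONSIVE;
            }
        }

        static void sampleDomainStats(Domain dom) throws OperationTimeoutException {
            // placeholder for the stats calls that can time out
        }

        public static void main(String[] args) {
            System.out.println(sampleMonitorResponse(new Domain())); // "0" here
        }
    }

The design point worth noting is that the timeout is swallowed and encoded as a sentinel inside the normal stats payload, which is why vdsm can keep answering vmGetStats with status 'Up' while the engine flips the VM to "Not Responding".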
monitorResponse is set to -1 if a call to a libvirt virDomain API has returned VIR_ERR_OPERATION_TIMEOUT. Mark, could you see traces of this in vdsm.log? Is the state of monitorResponse==-1 persistent, or short and transient?
In my test, once the VM entered the "Not Responding" state, I never saw it recover. It is probably caused by the stats collection in vdsm: 'statsAge' grew larger than vm_command_timeout (default value 60). I am not sure why the stats were lost. I will try to reproduce it once I restore my ovirt-engine setup.

Thread-13268::DEBUG::2012-06-15 16:30:14,176::libvirtvm::240::vm.Vm::(_getDiskStats) vmId=`a2ba8756-1b3c-47f7-9793-7f78aaf7e021`::Disk hdc stats not available
Thread-13268::DEBUG::2012-06-15 16:30:14,176::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '192.168.122.1', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '62.53', 'cpuIdle': '86.68', 'elapsedTime': '2333', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
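For context, as I understand it 'statsAge' is simply the elapsed time since the last successful stats sample, and a sample older than vm_command_timeout is treated as stale. A hedged Java sketch of that check (the names are illustrative, not vdsm's):

    // Hedged sketch of the staleness check described above; lastSampleTimeMs
    // and VM_COMMAND_TIMEOUT_SEC are illustrative names, not vdsm's.
    public class StatsAgeSketch {
        static final double VM_COMMAND_TIMEOUT_SEC = 60.0; // vdsm default

        static boolean statsAreStale(long lastSampleTimeMs) {
            double statsAgeSec =
                    (System.currentTimeMillis() - lastSampleTimeMs) / 1000.0;
            // In the logs above statsAge reached 62.53 and later 2162.34,
            // both past the 60s cutoff, so the sample counts as stale.
            return statsAgeSec > VM_COMMAND_TIMEOUT_SEC;
        }

        public static void main(String[] args) {
            long lastSample = System.currentTimeMillis() - 62_530; // 62.53s ago
            System.out.println(statsAreStale(lastSample)); // true
        }
    }

This matches what the log shows: once the sampling thread stops refreshing the sample, statsAge only grows, which would explain why the VM never recovered from "Not Responding".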
Created attachment 593147 [details]
vdsm.log
Dan, Michael, what action is required here? Is this still an issue? Thanks, Dave.
This is quite an old issue, caused either by a slow response or by another error in libvirt's stats retrieval. We also recently decreased the number of network queries, which should lighten the load a bit. However, under overload conditions it is still possible to see this, and it is expected.

Please reopen if this is still relevant, or if you see it on a non-busy system (please include the libvirt log and performance stats from that moment).