Description of problem:
The VM's status changes to "Not responding" during OS installation, even though I can see the installation still progressing in the SPICE add-on window. vdsm.log shows that the engine is still polling the VM's stats via vmGetStats and that vdsm reports the VM's status as Up. The host's status is 'Up' in the web UI. There are no related messages in engine.log, so I suspect a bug in the engine webadmin.

./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,994::BindingXMLRPC::859::vds::(wrapper) client [192.168.122.1]::call vmGetStats with ('a2ba8756-1b3c-47f7-9793-7f78aaf7e021',) {}
./vdsm.log:Thread-17105::DEBUG::2012-06-15 17:05:13,996::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '192.168.122.1', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '2162.34', 'cpuIdle': '86.68', 'elapsedTime': '4433', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}

Version-Release number of selected component (if applicable):
ovirt-engine-webadmin-portal-3.1.0_0001-1.8.fc17.noarch
ovirt-engine-core-3.1.0_0001-1.8.fc17.noarch

How reproducible:
Seen twice so far.

Steps to Reproduce:
1. Install a VM and wait for about 40 minutes. In my case, the installation is quite slow.

Actual results:
The webadmin shows the VM as "Not Responding" while vdsm reports it as Up.

Expected results:
The VM's status stays Up for as long as vdsm reports it as Up.

Additional info:
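To separate the engine's view from vdsm's, vdsm can also be queried directly over XML-RPC. Below is a minimal Java sketch using Apache XML-RPC. It assumes vdsm is listening unencrypted on port 54321 and that the wire-level method is named getVmStats (the log above shows vdsm's internal wrapper name, vmGetStats); on a default install vdsm requires SSL, so treat this as illustrative only:

    import java.net.URL;

    import org.apache.xmlrpc.client.XmlRpcClient;
    import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

    public class VdsmStatsProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: vdsm's XML-RPC binding listens on port 54321
            // without SSL, and the wire method is "getVmStats".
            XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
            config.setServerURL(new URL("http://my-vdsm-host:54321"));

            XmlRpcClient client = new XmlRpcClient();
            client.setConfig(config);

            Object result = client.execute("getVmStats",
                    new Object[] { "a2ba8756-1b3c-47f7-9793-7f78aaf7e021" });

            // The response mirrors the log lines above: a status struct plus
            // a statsList whose entries carry 'status', 'monitorResponse',
            // 'statsAge', per-disk latencies, and so on.
            System.out.println(result);
        }
    }

Comparing the 'status' and 'monitorResponse' fields in this raw reply against what the webadmin shows is what makes the inconsistency below visible.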
From the above log, 'monitorResponse' is reported as -1, which indicates that the monitor's responsiveness is very poor. In the engine this is the trigger for changing the VM's status to "Not Responding":

    // if monitorResponse returns negative it means it's erroneous
    if (xmlRpcStruct.contains(VdsProperties.monitorResponse)) {
        int response = Integer.parseInt(xmlRpcStruct.getItem(VdsProperties.monitorResponse).toString());
        if (response < 0) {
            vm.setstatus(VMStatus.NotResponding);
        }
    }

As for the reason this happens: the usual case I have seen is when a lot of IO needs to be written to disk and the disk performance is not good enough.
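The reporting side of this handshake lives in vdsm, which is written in Python; purely for illustration, here is a minimal, self-contained Java sketch of the pattern the engine snippet above consumes. Every name in it (MonitorSamplerSketch, Domain, sampleDomainStats, OperationTimeoutException) is hypothetical, not vdsm's real code:

    // Illustrative sketch only -- vdsm is Python; all names here are made up.
    public class MonitorSamplerSketch {

        // Stand-ins for a libvirt domain handle and its timeout error.
        static class Domain {}
        static class OperationTimeoutException extends Exception {}

        static final String RESPONSIVE = "0";
        static final String UNRESPONSIVE = "-1"; // engine maps "-1" to NotResponding

        static String sampleMonitorResponse(Domain dom) {
            try {
                sampleDomainStats(dom);
                return RESPONSIVE;
            } catch (OperationTimeoutException e) {
                // A timeout means the monitor did not answer in time;
                // report the sentinel instead of failing the whole poll.
                return UNRESPONSIVE;
            }
        }

        static void sampleDomainStats(Domain dom) throws OperationTimeoutException {
            // placeholder for the stats calls that can time out
        }

        public static void main(String[] args) {
            System.out.println(sampleMonitorResponse(new Domain())); // "0" here
        }
    }

The design point worth noting is that the timeout is swallowed and encoded as a sentinel inside the normal stats payload, which is why vdsm can keep answering vmGetStats with status 'Up' while the engine flips the VM to "Not Responding".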
monitorResponse is set to -1 if a call to a libvirt virDomain API has returned VIR_ERR_OPERATION_TIMEOUT. Mark, could you see traces of this in vdsm.log? Is the state of monitorResponse==-1 persistent, or short and transient?
In my test, once the VM entered the "Not Responding" state, I never saw it recover. It is probably caused by the stats collection in vdsm: 'statsAge' grew larger than vm_command_timeout (default value 60). I am not sure why the stats were lost. I will try to reproduce it once I restore my ovirt-engine setup.

Thread-13268::DEBUG::2012-06-15 16:30:14,176::libvirtvm::240::vm.Vm::(_getDiskStats) vmId=`a2ba8756-1b3c-47f7-9793-7f78aaf7e021`::Disk hdc stats not available
Thread-13268::DEBUG::2012-06-15 16:30:14,176::BindingXMLRPC::865::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '6878', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': -2L, 'hash': '5685342449532524952', 'pauseCode': 'NOERR', 'clientIp': '192.168.122.1', 'kvmEnable': 'true', 'network': {}, 'vmId': 'a2ba8756-1b3c-47f7-9793-7f78aaf7e021', 'displayType': 'qxl', 'cpuUser': '13.32', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '6104988', 'apparentsize': '5368709120', 'writeLatency': '43576967', 'imageID': '1152db7d-3e52-4d6f-9e8c-48d31137798e', 'flushLatency': '76625', 'readRate': '24384.89', 'truesize': '5368713216', 'writeRate': '119733.22'}}, 'monitorResponse': '-1', 'statsAge': '62.53', 'cpuIdle': '86.68', 'elapsedTime': '2333', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
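For context, as I understand it 'statsAge' is simply the elapsed time since the last successful stats sample, and a sample older than vm_command_timeout is treated as stale. A hedged Java sketch of that check (the names are illustrative, not vdsm's):

    // Hedged sketch of the staleness check described above; lastSampleTimeMs
    // and VM_COMMAND_TIMEOUT_SEC are illustrative names, not vdsm's.
    public class StatsAgeSketch {
        static final double VM_COMMAND_TIMEOUT_SEC = 60.0; // vdsm default

        static boolean statsAreStale(long lastSampleTimeMs) {
            double statsAgeSec =
                    (System.currentTimeMillis() - lastSampleTimeMs) / 1000.0;
            // In the logs above statsAge reached 62.53 and later 2162.34,
            // both past the 60s cutoff, so the sample counts as stale.
            return statsAgeSec > VM_COMMAND_TIMEOUT_SEC;
        }

        public static void main(String[] args) {
            long lastSample = System.currentTimeMillis() - 62_530; // 62.53s ago
            System.out.println(statsAreStale(lastSample)); // true
        }
    }

This matches what the log shows: once the sampling thread stops refreshing the sample, statsAge only grows, which would explain why the VM never recovered from "Not Responding".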
Created attachment 593147 [details]
vdsm.log
Dan, Michael, what action is required here? Is this still an issue? Thanks, Dave.
This is quite an old issue, caused either by a slow response or by another error in libvirt's stats retrieval. We also recently decreased the number of network queries, which should lighten the load a bit. However, under overload conditions it is still possible to see this, and it is expected.

Please reopen if this is still relevant, or if you see it on a non-busy system (please include the libvirt log and performance stats from that moment).