| Summary: | [vdsm] "vdsClient list" and "vdsClient list table" return different output when there is a VM in "powering down" status | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Elad <ebenahar> | ||||
| Component: | vdsm | Assignee: | Martin Betak <mbetak> | ||||
| Status: | CLOSED NEXTRELEASE | QA Contact: | |||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 3.3.0 | CC: | amureini, bazulay, ebenahar, iheim, john.baldwin, lpeer, mbetak, michal.skrivanek, yeylon | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.5.0 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | virt | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2014-04-03 09:43:09 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | 1045626 | ||||||
| Bug Blocks: | |||||||
| Attachments: | ||||||
Can you please rephrase what you are saying from the UI point of view? If the VM is powering down, the host can't move to maintenance. It should still be reported as such (i.e. still there) in the UI, though. vdsClient output is not relevant; it's not supported.

(In reply to Michal Skrivanek from comment #1)
> can you please rephrase what you are saying from the UI point of view? If
> the VM is powering down the host can't move to maint. It should be reported
> as such (i.e. still there) in the UI though.

Indeed. The host cannot be moved to maintenance because it is reported as running a VM. The engine fails in GetAllVmStatsVDS:

2013-11-06 10:59:55,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-100) Failed to refresh VDS , vds = f12bf7ae-3566-41d8-ad15-60fef1712ecb : nott-vds2, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Virtual machine does not exist, continuing.: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Virtual machine does not exist

I guess the engine still reports that the host runs a VM because it fails to get the VM's status from the host. Only restarting the vdsm service solved the issue.

> vdsClient output is not relevant, it's not supported

So should I change the bug title?

I wonder if we didn't catch the VM down event... or if we did and reported it as such and on the engine side it got lost due to the previous "non responding" state... or something else :)

Elad, still confused about what you're saying. So - were you able to see the VM in the UI? In what state? At the time when vdsm was returning "Virtual machine does not exist".

There are a lot of errors in the log from a couple of days back, like "No space left on device" when talking to postgres, which may corrupt something... are you sure it's not related?
(In reply to Michal Skrivanek from comment #4)
> Elad, still confused about what you're saying.
> So - were you able to see the VM in UI? In what state? At that time when
> vdsm was returning "Virtual machine does not exist".

From the time the VM moved to "not-responding" in the UI until I restarted the vdsm service, the VM appeared in the UI, first as "not-responding" and after a few minutes as "down". During all this time, "list table" returned "Virtual machine does not exist".

> There are lot of errors in the log couple days back, like No space left on
> device when talking to postgres, that may corrupt something….are you sure
> it's not related?

I'm not sure about the "No space left on device" message in the engine log, but I guess it is related to the low disk space on the engine, which was caused by the server.log file growing very big (I deleted the server.log file). This message appears in the log until 2013-10-27 21:49:42,378 and the bug occurred a week after that (4/11), so I'm pretty sure it is not related.

So if I understand you and the logs correctly, Elad, the following took place:
- you had a VM iscsi1-9 running on SPM host nott-vds2
- the storage domain became non-responsive
- as a result the engine marked the VM as "NotResponding" (engine.log 16:18:35)
- you tried to destroy the VM (multiple times) (first attempt 16:27:05)
- vdsm destroyed the VM successfully (libvirt & qemu logs point to engine time 16:27)
- you decided to restart VDSM (I cannot see precisely when from the logs)
- after the VDSM restart the storage domain was successfully reconstructed
- the engine tried to getVmStats but there was no sign of the VM (since it was destroyed and the service restarted), so the VM was marked "Down"

Were your actions consistent with this flow, or have I perhaps misunderstood you? Thank you.

What you described is consistent with my scenario.
> - engine tried to getVmStats but there was no sign of the VM (since it was
> destroyed and the service restarted) so the VM was marked "Down"
Engine reported the VM as Down, "vdsClient list" reported the VM as "Powering down" and "list table" as "Virtual machine does not exist"
So, ignoring the output of vdsClient (which is not supported), the behavior of the system was exactly as expected: you powered off the VM and it went down?

bug 1045626 may be related

I'd also consider moving the VM state to Unknown if we didn't get any VM stats right after the NonResponding state has been resolved.

Only after a long time (around an hour).

It may be the effect of bug 1045626, because of the vdsm restart during the non-responding host state. I'd suggest retesting this in 3.4 to be completely sure…

There's not much else to be investigated here; we hope it is fixed. Please feel free to reopen if you happen to see the same behavior in 3.4+.
Created attachment 819246 [details]
logs

Description of problem:
I've encountered a situation in which a VM was reported as 'not-responding' by the engine and after a few minutes as 'down'. During this whole time, I ran these 2 commands:
- vdsClient -s 0 list table
- vdsClient -s 0 list

The first command (vdsClient -s 0 list table) returned this output:

[root@nott-vds2 ~]# vdsClient -s 0 list table
Virtual machine does not exist

The second command (vdsClient -s 0 list) returned this output (abbreviated):

[root@nott-vds2 ~]# vdsClient -s 0 list

524b35df-c7f9-45d0-a85f-aefa5653f9a4
	Status = Powering down
	acpiEnable = true
	emulatedMachine = rhel6.5.0
	memGuaranteedSize = 1024
	timeOffset = 0
	displaySecurePort = 5901
	spiceSslCipherSuite = DEFAULT
	cpuType = Conroe
	smp = 1
	pauseCode = NOERR
	custom = {}
	vmType = kvm
	memSize = 1024
	smpCoresPerSocket = 1
	vmName = iscsi1-9

The engine reports that the host has a running VM, and therefore moving the host to maintenance leaves it as 'preparing for maintenance' forever. The same VM is reported as being in 'Down' status by the engine.
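The discrepancy reported here is that the two vdsClient views read different state: the plain `list` dumps each VM's last recorded conf/status ("Powering down"), while `list table` depends on the live domain, which libvirt no longer knows. The following is a hypothetical sketch (not actual vdsm code; all names are invented for illustration) of how two such views of the same VM can disagree once the underlying domain is gone:

```python
# Hypothetical illustration: a stale conf-based status view vs. a
# live-domain view that fails when libvirt has forgotten the VM.

class LibvirtGone(Exception):
    """Stand-in for libvirt's 'Domain not found' error."""

def status_from_conf(vm_conf):
    # "vdsClient list" style: return the last recorded status from the
    # VM's config dict, even if the qemu process is already gone.
    return vm_conf.get('status', 'Unknown')

def status_from_live_domain(vm_conf, domain_exists):
    # "vdsClient list table" style: consult the live domain; if libvirt
    # no longer knows the UUID, the lookup fails.
    if not domain_exists:
        raise LibvirtGone("Virtual machine does not exist")
    return vm_conf.get('status', 'Unknown')

vm = {'vmName': 'iscsi1-9', 'status': 'Powering down'}
print(status_from_conf(vm))                 # stale view: Powering down
try:
    status_from_live_domain(vm, domain_exists=False)
except LibvirtGone as e:
    print(e)                                # live view: error
```

Whether vdsm's two list commands actually diverge this way is exactly what this bug is about; the sketch only shows why two views backed by different state stores can contradict each other.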
Also, there is no running qemu process on the host:

[root@nott-vds2 ~]# pgrep qemu
[root@nott-vds2 ~]#
[root@nott-vds2 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------

[root@nott-vds2 ~]#

In vdsm.log:

Thread-11644::DEBUG::2013-11-04 16:18:39,945::BindingXMLRPC::991::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '438', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '0', 'hash': '-7987696128395049696', 'balloonInfo': {'balloon_max': '1048576', 'balloon_target': '1048576', 'balloon_cur': '1048576', 'balloon_min': '1048576'}, 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:75:a8:01', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': '524b35df-c7f9-45d0-a85f-aefa5653f9a4', 'displayType': 'qxl', 'cpuUser': '0.17', 'disks': {u'vda': {'readLatency': '351920', 'apparentsize': '5368709120', 'writeLatency': '0', 'imageID': '83dd143c-a329-463d-adf2-47565177e453', 'flushLatency': '0', 'readRate': '25.47', 'truesize': '5368709120', 'writeRate': '0.00'}, u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '-1', 'statsAge': '64.55', 'elapsedTime': '10556', 'vmType': 'kvm', 'cpuSys': '0.80', 'appsList': [], 'guestIPs': ''}]}

Dummy-2242::DEBUG::2013-11-04 16:18:40,274::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) 'dd if=/rhev/data-center/023b67f5-f02e-4390-9dce-3cb4305f9e98/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000' (cwd None)

Thread-15770::ERROR::2013-11-04 17:59:46,984::BindingXMLRPC::994::vds::(wrapper) libvirt error
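Note the key field in the vmGetStats reply above: the VM's own 'status' is still 'Up', but 'monitorResponse' is '-1', which is what the engine surfaces as "not responding". A hedged sketch of how a client could classify VM health from such a reply (the field names come from the log above; the classification rule itself is an assumption, not engine code):

```python
# Assumed classification of a vmGetStats-style reply; field names match
# the vdsm.log excerpt, the rule is illustrative only.

def classify(stats_reply):
    if stats_reply['status']['code'] != 0:
        # e.g. "Virtual machine does not exist"
        return 'error: ' + stats_reply['status']['message']
    vm_stats = stats_reply['statsList'][0]
    # monitorResponse == '-1' means the qemu monitor is not answering,
    # even though the reported status may still read 'Up'.
    if vm_stats.get('monitorResponse') == '-1':
        return 'not-responding'
    return vm_stats['status']

reply = {'status': {'message': 'Done', 'code': 0},
         'statsList': [{'status': 'Up', 'monitorResponse': '-1'}]}
print(classify(reply))  # not-responding
```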
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 989, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 329, in vmGetStats
    return vm.getStats()
  File "/usr/share/vdsm/API.py", line 348, in getStats
    stats = v.getStats().copy()
  File "/usr/share/vdsm/vm.py", line 2576, in getStats
    stats = self._getStatsInternal()
  File "/usr/share/vdsm/vm.py", line 2676, in _getStatsInternal
    stats['balloonInfo'] = self._getBalloonInfo()
  File "/usr/share/vdsm/vm.py", line 4446, in _getBalloonInfo
    cur_mem = self._dom.info()[2]
  File "/usr/share/vdsm/vm.py", line 836, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1874, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '524b35df-c7f9-45d0-a85f-aefa5653f9a4'

In engine.log:

2013-11-04 16:18:35,798 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-79) [5875ce1a] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM iscsi1-9 is not responding.

Version-Release number of selected component (if applicable):
vdsm-4.13.0-0.5.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
Unknown

Steps to Reproduce:
On a block pool with one host connected:
- run a VM on the SPM
- remove the SD's VG with vgremove so the SD becomes 'Inactive' (I didn't remove the SD's VG in my case; I removed another VG, which is not part of this pool. I'm not sure why the domain became 'Inactive')
- vdsm will report the VM as 'monitorResponse': '-1' in vmGetStats and the engine will report it as 'not-responding'

Actual results:
Detailed above

Expected results:
The 2 commands should report the same VM output.

Additional info:
logs
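The traceback above shows _getStatsInternal dying while gathering balloon info, because self._dom.info() raises a libvirt "Domain not found" error for a VM whose qemu process is already gone. A defensive variant would catch that error and report the VM as gone instead of failing the whole stats call. This is a hypothetical sketch, not the actual vdsm fix; all classes and functions here are stand-ins:

```python
# Hypothetical sketch of guarding a stats call against a vanished
# libvirt domain. libvirtError here is a local stand-in for
# libvirt.libvirtError; get_stats is not real vdsm code.

class libvirtError(Exception):
    pass

def get_balloon_info(dom):
    # Mirrors the failing call in the traceback: virDomainGetInfo()
    # returns (state, maxMem, memory, nrVirtCpu, cpuTime).
    return {'balloon_cur': dom.info()[2]}

def get_stats(dom):
    stats = {'status': 'Powering down'}
    try:
        stats['balloonInfo'] = get_balloon_info(dom)
    except libvirtError:
        # The domain vanished under us: report Down rather than
        # propagating the error out of vmGetStats.
        return {'status': 'Down', 'exitMessage': 'Domain not found'}
    return stats

class GoneDomain:
    def info(self):
        raise libvirtError("Domain not found")

print(get_stats(GoneDomain())['status'])  # Down
```

The point is only that a stats collector which treats "Domain not found" as a terminal VM state would keep `list` and `list table` consistent instead of leaving one view stuck on "Powering down".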