Bug 1026441 - [vdsm] "vdsClient list" and "vdsClient list table" return different output when there is a VM in "powering down" status
[vdsm] "vdsClient list" and "vdsClient list table" return different output wh...
Status: CLOSED NEXTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assigned To: Martin Betak
Whiteboard: virt
Keywords:
Depends On: 1045626
Blocks:
Reported: 2013-11-04 11:14 EST by Elad
Modified: 2015-01-20 16:21 EST (History)
9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-03 05:43:09 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (3.86 MB, application/x-gzip)
2013-11-04 11:14 EST, Elad

Description Elad 2013-11-04 11:14:29 EST
Created attachment 819246 [details]
logs

Description of problem:
I encountered a situation in which a VM was reported as 'not-responding' by engine and, after a few minutes, as 'down'. During this whole time, I ran these 2 commands:
- vdsClient -s 0 list table
- vdsClient -s 0 list

The first command (vdsClient -s 0 list table) returned this output:

[root@nott-vds2 ~]# vdsClient -s 0 list table
Virtual machine does not exist

The second command (vdsClient -s 0 list) returned this output (abbreviated):
[root@nott-vds2 ~]# vdsClient -s 0 list 

524b35df-c7f9-45d0-a85f-aefa5653f9a4
        Status = Powering down      
        acpiEnable = true           
        emulatedMachine = rhel6.5.0 
        memGuaranteedSize = 1024    
        timeOffset = 0              
        displaySecurePort = 5901    
        spiceSslCipherSuite = DEFAULT
        cpuType = Conroe             
        smp = 1
        pauseCode = NOERR
        custom = {}
        vmType = kvm
        memSize = 1024
        smpCoresPerSocket = 1
        vmName = iscsi1-9

Engine reports that the host has a running VM; therefore, moving the host to maintenance leaves it as 'preparing for maintenance' forever. At the same time, engine reports the VM itself as 'Down'.
Also, there is no running qemu process on the host:

[root@nott-vds2 ~]# pgrep qemu
[root@nott-vds2 ~]#
[root@nott-vds2 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------

[root@nott-vds2 ~]
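
The divergence makes sense if the two commands take different code paths. A minimal Python sketch of that assumption (all names here are illustrative, not actual vdsm code): plain "list" dumps the VM configuration cached in vdsm's VM container, while "list table" also asks libvirt for the domain, which fails once the qemu process is gone.

class GoneLibvirt(object):
    # Stand-in for a libvirt connection whose domain has already exited.
    def lookupByUUIDString(self, vm_id):
        raise LookupError("no domain with matching uuid '%s'" % vm_id)

def list_plain(vm_container):
    # "list": dump the cached configuration; works even after qemu exits.
    return [(vm_id, conf['status']) for vm_id, conf in vm_container.items()]

def list_table(vm_container, conn):
    # "list table": needs live stats, so it must find the domain first.
    for vm_id in vm_container:
        conn.lookupByUUIDString(vm_id)  # raises once the domain is gone
    return sorted(vm_container)

vms = {'524b35df-c7f9-45d0-a85f-aefa5653f9a4': {'status': 'Powering down'}}
print(list_plain(vms))                  # still shows the stale 'Powering down' VM
try:
    list_table(vms, GoneLibvirt())
except LookupError:
    print('Virtual machine does not exist')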



In vdsm.log:
Thread-11644::DEBUG::2013-11-04 16:18:39,945::BindingXMLRPC::991::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '438', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '0', 'hash': '-7987696128395049696', 'balloonInfo': {'balloon_max': '1048576', 'balloon_target': '1048576', 'balloon_cur': '1048576', 'balloon_min': '1048576'}, 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:75:a8:01', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': '524b35df-c7f9-45d0-a85f-aefa5653f9a4', 'displayType': 'qxl', 'cpuUser': '0.17', 'disks': {u'vda': {'readLatency': '351920', 'apparentsize': '5368709120', 'writeLatency': '0', 'imageID': '83dd143c-a329-463d-adf2-47565177e453', 'flushLatency': '0', 'readRate': '25.47', 'truesize': '5368709120', 'writeRate': '0.00'}, u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '-1', 'statsAge': '64.55', 'elapsedTime': '10556', 'vmType': 'kvm', 'cpuSys': '0.80', 'appsList': [], 'guestIPs': ''}]}
Dummy-2242::DEBUG::2013-11-04 16:18:40,274::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) 'dd if=/rhev/data-center/023b67f5-f02e-4390-9dce-3cb4305f9e98/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000' (cwd None)





Thread-15770::ERROR::2013-11-04 17:59:46,984::BindingXMLRPC::994::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 989, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 329, in vmGetStats
    return vm.getStats()
  File "/usr/share/vdsm/API.py", line 348, in getStats
    stats = v.getStats().copy()
  File "/usr/share/vdsm/vm.py", line 2576, in getStats
    stats = self._getStatsInternal()
  File "/usr/share/vdsm/vm.py", line 2676, in _getStatsInternal
    stats['balloonInfo'] = self._getBalloonInfo()
  File "/usr/share/vdsm/vm.py", line 4446, in _getBalloonInfo
    cur_mem = self._dom.info()[2]
  File "/usr/share/vdsm/vm.py", line 836, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1874, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '524b35df-c7f9-45d0-a85f-aefa5653f9a4'
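
The failing call is virDomainGetInfo() inside _getBalloonInfo, so the whole getVmStats call fails once libvirt has forgotten the domain. A minimal sketch of a defensive guard, purely illustrative (an assumption about a possible fix, not vdsm's actual code):

import libvirt

def get_balloon_info(dom):
    # Hypothetical guard: treat a vanished domain as "no balloon data"
    # instead of letting the libvirtError escape and fail the stats call.
    try:
        balloon_cur = dom.info()[2]  # same call the traceback shows failing
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            return {}
        raise
    return {'balloon_cur': str(balloon_cur)}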


In engine.log:

2013-11-04 16:18:35,798 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-79) [5875ce1a] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM iscsi1-9 is not responding.

Version-Release number of selected component (if applicable):
vdsm-4.13.0-0.5.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
Unknown

Steps to Reproduce:
On a block pool with one host connected 
- run a VM on SPM
- remove the SD's VG with vgremove so the SD becomes 'Inactive' (in my case I didn't remove the SD's VG; I removed another VG that is not part of this pool, and I'm not sure why the domain became 'Inactive')
- vdsm will report the VM with 'monitorResponse': '-1' in vmGetStats, and engine will report it as 'not-responding'


Actual results:
Detailed above


Expected results:
The two commands should report the same VM status.

Additional info: logs
Comment 1 Michal Skrivanek 2013-11-07 10:38:11 EST
Can you please rephrase what you are saying from the UI point of view? If the VM is powering down, the host can't move to maintenance. It should be reported as such (i.e. still there) in the UI, though.
The vdsClient output is not relevant; it's not supported.
Comment 2 Elad 2013-11-07 11:09:04 EST
(In reply to Michal Skrivanek from comment #1)
> Can you please rephrase what you are saying from the UI point of view? If
> the VM is powering down, the host can't move to maintenance. It should be
> reported as such (i.e. still there) in the UI, though.
Indeed. The host cannot be moved to maintenance because it is reported as running a VM. 

Engine fails in GetAllVmStatsVDS:

2013-11-06 10:59:55,021 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-100) Failed to refresh VDS , vds = f12bf7ae-3566-41d8-ad15-60fef1712ecb : nott-vds2, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Virtual machine does not exist, continuing.: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Virtual machine does not exist

I guess engine still reports that the host runs a VM because it fails to get the VM's status from the host.

Only a restart of the vdsm service solved the issue.

> The vdsClient output is not relevant; it's not supported.

So should I change the bug title?
Comment 3 Michal Skrivanek 2013-11-08 03:39:15 EST
I wonder if we didn't catch the VM down event... or if we did and reported it as such, and on the engine side it got lost due to the previous "non responding" state... or something else :)
Comment 4 Michal Skrivanek 2013-11-25 08:53:48 EST
Elad, I'm still confused about what you're saying.
So, were you able to see the VM in the UI? In what state? At the time when vdsm was returning "Virtual machine does not exist".

There are a lot of errors in the log from a couple of days back, like 'No space left on device' when talking to postgres, that may have corrupted something… are you sure it's not related?
Comment 5 Elad 2013-11-27 03:05:38 EST
(In reply to Michal Skrivanek from comment #4)
> Elad, still confused about what you're saying.
> So - were you able to see the VM in UI? In what state? At that time when
> vdsm was returning "Virtual machine does not exist".
From the time the VM moved to "not-responding" in the UI until I restarted the vdsm service, the VM appeared in the UI, first as "not-responding", and after a few minutes it became "down". During this whole time, "list table" returned "Virtual machine does not exist".
> 
> There are lot of errors in the log couple days back, like No space left on
> device when talking to postgres, that may corrupt something….are you sure
> it's not related?
I'm not sure about the "No space left on device" message in the engine log, but I guess it is related to the low disk space on the engine, which was caused by the server.log file being very big (I deleted the server.log file). This message appears in the log until 2013-10-27 21:49:42,378, and the bug occurred a week after that (2013-11-04), so I'm pretty sure it is not related.
Comment 7 Martin Betak 2014-01-29 12:39:45 EST
So if I understand you and the logs correctly, Elad, the following took place:
- you had a VM iscsi1-9 running on SPM host nott-vds2  
- the storage domain became non-responsive
- as a result the engine marked the VM as "NotResponding" (engine.log 16:18:35)
- you tried to destroy the VM (multiple times) (first attempt 16:27:05)
- vdsm destroyed the VM successfully (libvirt & qemu logs point to engine time 16:27)
- you decided to restart VDSM (I cannot see precisely when from the logs)
- after VDSM restart the storage domain was successfully reconstructed 
- engine tried to getVmStats but there was no sign of the VM (since it was destroyed and the service restarted) so the VM was marked "Down"

Were your actions consistent with this flow or have I perhaps misunderstood you?

Thank you.
Comment 8 Elad 2014-01-30 05:00:18 EST
What you described matches my scenario.

> - engine tried to getVmStats but there was no sign of the VM (since it was
> destroyed and the service restarted) so the VM was marked "Down"
Engine reported the VM as Down, "vdsClient list" reported the VM as "Powering down" and "list table" as "Virtual machine does not exist"
Comment 9 Martin Betak 2014-01-30 05:18:24 EST
So, ignoring the output of vdsClient (which is not supported), the behavior of the system was exactly as expected: you powered off the VM and it went down?
Comment 10 Michal Skrivanek 2014-01-31 05:01:31 EST
bug 1045626 may be related

I'd also consider moving the VM state to Unknown if we didn't get any VM stats right after the NonResponding state has been resolved.
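
A minimal Python sketch of that idea, with invented names (the real engine logic is Java; this only illustrates the intended transition):

def on_stats_poll(vm, stats):
    # Illustrative only: if the poll returns no stats for a VM that was
    # NotResponding, mark it Unknown instead of keeping the stale
    # pre-failure state (e.g. 'Powering down').
    if stats is None:
        if vm.state == 'NotResponding':
            vm.state = 'Unknown'
    else:
        vm.state = stats['status']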
Comment 11 Elad 2014-02-02 04:12:49 EST
Only after a long time (around an hour).
Comment 13 Michal Skrivanek 2014-04-03 05:43:09 EDT
It may be an effect of bug 1045626, because of the vdsm restart during the non-responding host state.
I'd suggest retesting this in 3.4 to be completely sure…

There's not much else to investigate here; we hope it is fixed. Please feel free to reopen if you happen to see the same behavior in 3.4+.
