Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1467696

Summary: Update of headless VM in "PoweringUP" state is not handled properly by VDSM. Multiple tracebacks appear.
Product: [oVirt] vdsm Reporter: Vitalii Yerys <vyerys>
Component: GeneralAssignee: Nobody <nobody>
Status: CLOSED DUPLICATE QA Contact: Vitalii Yerys <vyerys>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.20.0CC: bugs, mavital, michal.skrivanek, mtessun, tjelinek, vyerys
Target Milestone: ovirt-4.2.0Flags: michal.skrivanek: ovirt-4.2?
mtessun: planning_ack+
mtessun: devel_ack?
mtessun: testing_ack?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-11-24 14:53:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Engine and vdsm logs none

Description Vitalii Yerys 2017-07-04 16:03:23 UTC
Created attachment 1294293 [details]
Engine and vdsm logs

Description of problem:
During latest test run, test was trying to set vm to headless mode during 'PoweringUp' state (test case issue), but this caused some issues with IndexError list out of range in the VDSM logs. and following traceback was returned:

VDSM Log:

ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600a09803830347a625d467a79745665|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --refresh adc72f60-9768-4294-a455-b1f9d1619578/leases (cwd None) (commands:70)
2017-06-30 08:14:20,532+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='86bf22fe-e904-4d28-be5c-684fb2b014e4') Error fetching vm stats (vm:1519)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1515, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 47, in produce
    balloon(vm, stats, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 153, in balloon
    balloon_info = vm.get_balloon_info()
  File "/usr/share/vdsm/virt/vm.py", line 4674, in get_balloon_info
    dev = self._devices[hwclass.BALLOON][0]
IndexError: list index out of range

...
...

2017-06-30 08:14:59,080+0300 ERROR (Thread-97) [utils.Callback] qemuGuestAgentCallback failed (utils:405)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 403, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 107, in qemuGuestAgentCallback
    if response.is_error(self.vm.qemuGuestAgentShutdown()):
  File "/usr/lib/python2.7/site-packages/vdsm/common/response.py", line 79, in is_error
    code = res["status"]["code"]
TypeError: 'NoneType' object has no attribute '__getitem__'


Engine Log:

2017-06-30 08:14:30,731+03 WARN  [org.ovirt.engine.core.utils.ObjectIdentityChecker] (default task-25) [vms_update_3e4ffa42-53ea-4092] Field 'defaultDisplayType' can not be updated when status is 'PoweringUp'
2017-06-30 08:14:30,731+03 WARN  [org.ovirt.engine.core.utils.ObjectIdentityChecker] (default task-25) [vms_update_3e4ffa42-53ea-4092] ObjectIdentityChecker.isUpdateValid:: Not updatable field 'defaultDisplayType' was updated
2017-06-30 08:14:30,734+03 WARN  [org.ovirt.engine.core.utils.ObjectIdentityChecker] (default task-25) [vms_update_3e4ffa42-53ea-4092] Field 'balloonEnabled' can not be updated when status is 'PoweringUp'
2017-06-30 08:14:30,734+03 WARN  [org.ovirt.engine.core.utils.ObjectIdentityChecker] (default task-25) [vms_update_3e4ffa42-53ea-4092] Field 'rngDevice' can not be updated when status is 'PoweringUp'
2017-06-30 08:14:30,735+03 WARN  [org.ovirt.engine.core.utils.ObjectIdentityChecker] (default task-25) [vms_update_3e4ffa42-53ea-4092] Field 'soundDeviceEnabled' can not be updated when status is 'PoweringUp'



This issue on it's own is provoking another one, which is reproduced in next case coming after previous one, during VM creation, and not really clear error message is being reproduced:



2017-06-30 08:25:14,555 - JobsSet - api_utils - ERROR - Failed to syncAction element NOT as expected:
	Status: 400
	Reason: Bad Request
	Detail: [Cannot run VM. Selected display type is not supported by the operating system.]

Engine Log:

2017-06-30 08:25:14,497+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-22) [vms_syncAction_22d42b3b-cdbd-49c5] Lock Acquired to object 'EngineLock:{exclusiveLocks='[86bf22fe-e904-4d28-be5c-684fb2b014e4=VM]', sharedLocks=''}'
2017-06-30 08:25:14,544+03 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-22) [vms_syncAction_22d42b3b-cdbd-49c5] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_ILLEGAL_VM_DISPLAY_TYPE_IS_NOT_SUPPORTED_BY_OS
2017-06-30 08:25:14,545+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-22) [vms_syncAction_22d42b3b-cdbd-49c5] Lock freed to object 'EngineLock:{exclusiveLocks='[86bf22fe-e904-4d28-be5c-684fb2b014e4=VM]', sharedLocks=''}'
2017-06-30 08:25:14,551+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-22) [] Operation Failed: [Cannot run VM. Selected display type is not supported by the operating system.]

VDSM Log:

2017-06-30 08:25:30,235+0300 ERROR (Thread-98) [root] Shutdown by QEMU Guest Agent failed (vm:4600)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 4591, in qemuGuestAgentShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 646, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2403, in shutdownFlags
    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
libvirtError: Guest agent is not responding: QEMU guest agent is not connected
2017-06-30 08:25:30,236+0300 ERROR (Thread-98) [utils.Callback] qemuGuestAgentCallback failed (utils:405)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 403, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 107, in qemuGuestAgentCallback
    if response.is_error(self.vm.qemuGuestAgentShutdown()):
  File "/usr/share/vdsm/virt/vm.py", line 4601, in qemuGuestAgentShutdown
    raise exception.NonResponsiveGuestAgent()


Version-Release number of selected component (if applicable):

ovirt-engine-4.2.0-0.0.master.20170627181935.git9424f9b.el7.centos.noarch
vdsm-4.20.1-89.git2912779.el7.centos.x86_64
kernel_release	3.10.0-663.el7.x86_64

How reproducible:

Was not able to reproduce manually. Looks like it is reproduced in some weird raise conditions.

Steps to Reproduce:

Actual results:
Multiple errors in Engine/VDSM logs. QEMU guest agent is not responding.


Expected results:

Such conditions should be handled properly.


Additional info:

libvirt logs can be obtained here:

https://drive.google.com/a/redhat.com/file/d/0B2-a6L1-H35bUV9XVjcwUW9mUEk/view?usp=sharing

Comment 1 Yaniv Kaul 2017-07-05 08:34:22 UTC
If QEMU guest agent is indeed not responding, why not file a bug on it?

Comment 2 Vitalii Yerys 2017-07-05 08:47:57 UTC
As I understood from the tracebacks, qemu becomes not responsive because of the issues that come before it, which I think are vdsm issues, you can see them in the first part of the bugs description. Sorry if I chose incorrect title for this bug.

Comment 3 Yaniv Kaul 2017-07-05 09:03:24 UTC
(In reply to Vitalii Yerys from comment #2)
> As I understood from the tracebacks, qemu becomes not responsive because of
> the issues that come before it, which I think are vdsm issues, you can see
> them in the first part of the bugs description. Sorry if I chose incorrect
> title for this bug.

The title can be edited. Please fix it.
Does the VM have a balloon device, btw?

Comment 4 Vitalii Yerys 2017-07-05 09:31:24 UTC
> Does the VM have a balloon device, btw?

I think Balloon device was enabled, because VM was booted from template which was created from one of our base VMs (golden_env_mixed_virtio_0) which has this option enabled.

Comment 5 Yaniv Kaul 2017-07-05 10:03:03 UTC
The title is still misleading - does that happen with regular VMs, or only headless?

Can you check the qemu command line or libvirt XML and verify if the balloon device is enabled or not?

Comment 6 Vitalii Yerys 2017-07-05 11:13:15 UTC
> The title is still misleading - does that happen with regular VMs, or only 
> headless?

changed.

> Can you check the qemu command line or libvirt XML and verify if the balloon 
> device is enabled or not?

vdsm log:

2017-06-30 08:14:20,746+0300 INFO  (vm/86bf22fe) [virt.vm] (vmId='86bf22fe-e904-4d28-be5c-684fb2b014e4') <?xml version='1.0' encoding='UTF-8'?>
<domain xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0" type="kvm">

...
    <memballoon model="none" />
...

Art log:
    <memory_policy>
        <ballooning>false</ballooning>
        <guaranteed>2147483648</guaranteed>
        <max>4294967296</max>
    </memory_policy>

Looks like ballooning device was not enabled. Nevertheless I don't think we should see such tracebacks, with:

IndexError: list index out of range

TypeError: 'NoneType' object has no attribute '__getitem__'
etc.

I was not able to reproduce it manually, it happened with the headless VM during test-run.

Comment 7 Yaniv Kaul 2017-07-05 11:18:24 UTC
(In reply to Vitalii Yerys from comment #6)
> > The title is still misleading - does that happen with regular VMs, or only 
> > headless?
> 
> changed.
> 
> > Can you check the qemu command line or libvirt XML and verify if the balloon 
> > device is enabled or not?
> 
> vdsm log:
> 
> 2017-06-30 08:14:20,746+0300 INFO  (vm/86bf22fe) [virt.vm]
> (vmId='86bf22fe-e904-4d28-be5c-684fb2b014e4') <?xml version='1.0'
> encoding='UTF-8'?>
> <domain xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0"
> xmlns:ovirt-vm="http://ovirt.org/vm/1.0" type="kvm">
> 
> ...
>     <memballoon model="none" />
> ...

As expected, the headless VM does not enable the balloon device.

> 
> Art log:
>     <memory_policy>
>         <ballooning>false</ballooning>
>         <guaranteed>2147483648</guaranteed>
>         <max>4294967296</max>
>     </memory_policy>
> 
> Looks like ballooning device was not enabled. Nevertheless I don't think we
> should see such tracebacks, with:
> 
> IndexError: list index out of range
> 
> TypeError: 'NoneType' object has no attribute '__getitem__'
> etc.

Of course not, but now we know some more details on the environment.

> 
> I was not able to reproduce it manually, it happened with the headless VM
> during test-run.

Not even when manually running a headless VM?

Comment 8 Vitalii Yerys 2017-07-05 11:29:45 UTC
> As expected, the headless VM does not enable the balloon device.

But why did I see such messages in the engine log?

[org.ovirt.engine.core.utils.ObjectIdentityChecker] (default task-25) [vms_update_3e4ffa42-53ea-4092] Field 'balloonEnabled' can not be updated when status is 'PoweringUp'

> Not even when manually running a headless VM?

I tried doing it manually, but could not catch the proper moment with 'PoweringUp' I think, I will try to reproduce it few more times, if I succeed - I will post exact steps to reproduce.

Comment 9 Michal Skrivanek 2017-07-27 09:34:27 UTC
vdsm ballooon tracebacks are supposedly fixed by bug 1458901

Comment 10 Michal Skrivanek 2017-07-27 09:40:59 UTC
engine's WARN messages on update fields which can't be updated is known and not worth fixing, IMHO, so the only remaining issue should be about the qemu guest agent calls raising unhandled exceptions.
The fact that the VM is headless is unlikely to matter

for the "Cannot run VM. Selected display type is not supported by the operating system." error I need to know what was the original state of the VM, what was it updated with, how did it look before start. - (check specifically the graphics type and OS)

Comment 11 Vitalii Yerys 2017-08-03 11:03:24 UTC
sorry for late response.

state of VM before start can be seen here:

http://pastebin.test.redhat.com/505747

Comment 12 Michal Skrivanek 2017-08-03 12:57:50 UTC
thanks, but I need all what I asked for, before&after and what was POSTed.
note grapshics details are in graphicsconsoles subresource, so either include that one as well or take a look a GUI what does it say

Comment 13 Vitalii Yerys 2017-08-04 13:05:41 UTC
Sorry for incomplete response previously, here you can find VM state and options from the "setup" up to "teardown" state.

http://pastebin.test.redhat.com/506130

I took this info from the log attached. (art_test_runner.log.3 Engine/vdsm/etc logs are also in the attachments). Unfortunately this is all info there is, I was not able to reproduce the issue previously.

Looks like the VM was not able to set to "DOWN" state after previous test case, which caused some trouble, so it was up when graphics console removal attempt was performed.

Hope it helps.

Comment 14 Michal Skrivanek 2017-08-22 08:42:58 UTC
Yes, I do not see the changed VM to be stopped, then you still see the same configuration until it's powered down. That's a test case error then.

So, the only issue is the qemu-ga traceback on powering up, i.e. when it's unlikely to run?

Comment 15 Vitalii Yerys 2017-08-29 15:33:55 UTC
Sorry for late response. 

> Yes, I do not see the changed VM to be stopped, then you still see the same 
> configuration until it's powered down. That's a test case error then.

Actually this does not look like a test case issue, but more likely something went wrong on previous test case teardown. In the logs (http://pastebin.test.redhat.com/512138) we can clearly see the shutdown request and response with complete state even tho VM did go to "powering_down" it never reached "down" and resulted into "UP" on the current test case execution - but I think this is a matter of separate investigation.


> So, the only issue is the qemu-ga traceback on powering up, i.e. when it's 
> unlikely to run?

Yes. Also what about this tracebacks? Both of them got fixed by: https://bugzilla.redhat.com/show_bug.cgi?id=1458901 ?

Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1515, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 47, in produce
    balloon(vm, stats, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 153, in balloon
    balloon_info = vm.get_balloon_info()
  File "/usr/share/vdsm/virt/vm.py", line 4674, in get_balloon_info
    dev = self._devices[hwclass.BALLOON][0]
IndexError: list index out of range

And

2017-06-30 08:14:59,080+0300 ERROR (Thread-97) [utils.Callback] qemuGuestAgentCallback failed (utils:405)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 403, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 107, in qemuGuestAgentCallback
    if response.is_error(self.vm.qemuGuestAgentShutdown()):
  File "/usr/lib/python2.7/site-packages/vdsm/common/response.py", line 79, in is_error
    code = res["status"]["code"]
TypeError: 'NoneType' object has no attribute '__getitem__'

Comment 16 Tomas Jelinek 2017-11-24 14:53:09 UTC
> Yes. Also what about this tracebacks? Both of them got fixed by:
> https://bugzilla.redhat.com/show_bug.cgi?id=1458901 ?
> 

so, closing this bug.

*** This bug has been marked as a duplicate of bug 1458901 ***