Bug 2186668

Summary: Make qga poller code more robust when parsing vcpuinfo
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: vdsm
Assignee: Nobody <nobody>
Status: CLOSED DUPLICATE
QA Contact: Lukas Svaty <lsvaty>
Severity: low
Priority: unspecified
Version: 4.5.3
CC: ldixon, lsurette, srevivo, tgolembi, ycui
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2023-04-18 21:09:03 UTC
Type: Bug

Description Germano Veit Michel 2023-04-14 04:36:01 UTC
Description of problem:

A very old QGA (exact version unknown) on Windows guests apparently returns very little in response to the 'guestvcpus' command, or at least does not return the "online" key.

When that happens, VDSM's QGA poller fails with the following exception every 5 seconds for that VM:

2023-04-12 21:34:09,188-0400 ERROR (qgapoller/3) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f33601920f0>> operation failed (periodic:204)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 202, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 493, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 814, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
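
The traceback shows a membership test (`'online' in vcpus`) running against `None`. A minimal sketch of a defensive check follows, assuming the vCPU info arrives either as a dict or as `None`. The helper name `online_vcpus` and the dict shape are hypothetical and are not taken from VDSM's code.

```python
def online_vcpus(vcpus):
    """Return the 'online' value from a vCPU info dict, or None.

    Hypothetical helper: `vcpus` is assumed to be a dict, or None when
    the guest agent returned nothing usable.
    """
    # Guard against None (and any other non-dict reply) before using `in`
    # or key lookups, which is where the traceback above fails.
    if not isinstance(vcpus, dict):
        return None
    return vcpus.get('online')
```

With a guard like this, a guest that reports no vCPU data simply yields no value for that field instead of raising.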

The problem is that this breaks the entire monitoring cycle, not just polling of the affected VM.

For example, if there are 10 VMs on the host and the 4th VM has this QGA issue, the 6 VMs after it never have their QGA polled, resulting in 6 VMs with missing IPs and other info. With 6 VMs missing info, it is hard to track down which VM actually needs its QGA upgraded to the latest version.
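
The failure mode described above can be illustrated with a small simulation. This is only a sketch and not VDSM's poller: `poll_vm`, `poll_all_unguarded` and `poll_all_guarded` are hypothetical names, and the 4th VM is made to fail with the same `TypeError` as in the traceback.

```python
import logging

log = logging.getLogger("qgapoller-sketch")


def poll_vm(vm_id, broken):
    """Hypothetical per-VM poll; fails like the traceback for broken VMs."""
    if vm_id in broken:
        vcpus = None  # stands in for what the old guest agent yields
        return 'online' in vcpus  # raises TypeError, as in the report
    return True


def poll_all_unguarded(vm_ids, broken):
    """One exception aborts the whole cycle; later VMs are never polled."""
    polled = []
    try:
        for vm_id in vm_ids:
            poll_vm(vm_id, broken)
            polled.append(vm_id)
    except TypeError:
        pass  # stands in for the periodic operation logging the failure
    return polled


def poll_all_guarded(vm_ids, broken):
    """Each VM is polled in its own try block, so failures stay local."""
    polled = []
    for vm_id in vm_ids:
        try:
            poll_vm(vm_id, broken)
        except Exception:
            log.exception("QGA poll failed for VM %s, skipping", vm_id)
            continue
        polled.append(vm_id)
    return polled


vms = list(range(1, 11))  # 10 VMs on the host
print(poll_all_unguarded(vms, broken={4}))  # only the VMs before the 4th
print(poll_all_guarded(vms, broken={4}))    # every VM except the 4th
```

In the guarded variant the log message names the failing VM, which would also address the difficulty of identifying which guest needs its agent upgraded.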

Upgrading QGA to the latest shipped version fixes the issue.

Version-Release number of selected component (if applicable):
RHV 4.4 SP1

How reproducible:
* Still trying to figure out exactly which QGA version that was.

Steps to Reproduce:
* Unknown QGA version so far

Comment 2 Lynn Dixon 2023-04-14 19:55:54 UTC
I spun up a Win2016 VM from the old template, which should have the old version of QGA installed. It's showing the following:
virtio-win-guest-tools 1.9.10
RHEV-Tools 4.43.10
RHV-Spice-Agent64 4.43.3
REV-Application-Provisioning-Tool 4.34.4
QEMU guest agent:  7.6.2

I am not sure where it's getting that version of QEMU guest agent. This is a screenshot of the versions: https://share.getcloudapp.com/yAuJJKY6

When I run the following on the host that is running the VM named lynnwin01.ad.shadowman.dev I see this:

virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf guestvcpus lynnwin01.ad.shadowman.dev
error: internal error: 'can-offline' missing in reply of guest-get-vcpus


So it doesn't appear to be picking up the guest CPUs, but it does report the IP address.
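
The virsh error above says the agent's reply to guest-get-vcpus lacked the `can-offline` field. As a sketch of more tolerant parsing, assume the raw reply is JSON with a `return` list whose entries carry `logical-id`, `online` and an optional `can-offline` key. The function name `summarize_vcpus` and the sample data are hypothetical, and this is not how libvirt or VDSM actually process the reply.

```python
import json


def summarize_vcpus(reply_json):
    """Summarize a guest-get-vcpus style reply without assuming optional keys.

    Hypothetical helper; returns None if the reply is not a usable list.
    """
    try:
        reply = json.loads(reply_json)
    except (TypeError, ValueError):
        return None
    entries = reply.get('return') if isinstance(reply, dict) else None
    if not isinstance(entries, list):
        return None
    online, offlinable = [], []
    for cpu in entries:
        if not isinstance(cpu, dict) or 'logical-id' not in cpu:
            continue  # skip malformed entries instead of failing
        if cpu.get('online', False):
            online.append(cpu['logical-id'])
        # 'can-offline' may be absent with old agents; treat it as False.
        if cpu.get('can-offline', False):
            offlinable.append(cpu['logical-id'])
    return {'online': online, 'offlinable': offlinable}


# Made-up reply resembling what an old agent might send (no 'can-offline').
old_reply = '{"return": [{"logical-id": 0, "online": true}, {"logical-id": 1, "online": true}]}'
print(summarize_vcpus(old_reply))
```

Treating a missing `can-offline` as false is a design choice for this sketch: it keeps the online CPU list usable while never claiming a CPU can be offlined without evidence.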

Similarly, whenever I start a Win2016 VM that is using this template and old QGA, I see these in the vdsm.log on that host:

2023-04-14 15:50:57,060-0400 ERROR (qgapoller/3) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fdaa8fa32e8>> operation failed (periodic:204)
TypeError: argument of type 'NoneType' is not iterable
2023-04-14 15:51:02,076-0400 ERROR (qgapoller/4) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fdaa8fa32e8>> operation failed (periodic:204)
TypeError: argument of type 'NoneType' is not iterable

If I stop the Win2016 VM on the host, those errors stop. The errors return whenever a Win2016 VM is run.

Comment 3 Germano Veit Michel 2023-04-17 03:54:55 UTC
I don't know why I couldn't reproduce it, but it's probably a symptom of this: https://bugzilla.redhat.com/show_bug.cgi?id=1438735

Anyway, I think the VDSM code should be more robust here. If the reported problem only broke guest info from QGA for the problematic guest, that would be acceptable. But it stops the entire monitoring cycle, so many VMs that have nothing to do with the problem and are working fine are not polled and are missing guest info; their data is simply not being collected.

The exception stops the monitoring cycle for all subsequent VMs; that's the problem.

Comment 4 Tomáš Golembiovský 2023-04-18 08:04:34 UTC
This bug was already fixed as part of this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2120381#c12

Comment 5 Germano Veit Michel 2023-04-18 21:09:03 UTC
Ohh, thanks Tomáš! Not sure how I missed that :(

Lynn, turns out you did upgrade to the very latest, but just 2 days later a newer version, with the fix, was available:
vdsm-4.50.2.2-1.el8ev.x86_64                                Tue Mar 28 23:57:02 2023

Fix is here: https://access.redhat.com/errata/RHBA-2022:8694
However, you did the right thing by upgrading that ancient Guest Agent.

Closing as duplicate.

*** This bug has been marked as a duplicate of bug 2120381 ***