Bug 1680398 - QEMU-GA capabilities are reported as none for a few minutes after VM start. Therefore, VM details are not reported by engine
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.8
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: 4.40.14
Assignee: Tomáš Golembiovský
QA Contact: Petr Matyáš
Blocks: 1661283 1740058
 
Reported: 2019-02-24 16:36 UTC by Elad
Modified: 2020-06-17 16:33 UTC
CC List: 11 users

Fixed In Version: vdsm-4.40.14
Doc Type: Enhancement
Doc Text:
The mechanism for polling QEMU-GA in VDSM has been enhanced to query newly started VMs more frequently, so that the stats are retrieved as soon as the agent becomes available in the guest.
Clone Of:
Environment:
Last Closed: 2020-05-20 20:03:03 UTC
oVirt Team: Virt
rbarry: ovirt-4.3?


Attachments
engine, hypervisor and logs from the guest (1.47 MB, application/gzip) - 2019-02-24 16:36 UTC, Elad
first_run.tar (2.72 MB, application/gzip) - 2019-04-04 07:50 UTC, Polina
second_run.tar (1.27 MB, application/gzip) - 2019-04-04 07:51 UTC, Polina
New_Logs (8.45 MB, application/zip) - 2019-05-06 13:38 UTC, Shir Fishbain
New_Logs_4.3.6.5V (1.08 MB, application/zip) - 2019-09-10 13:20 UTC, Shir Fishbain


Links
System ID Status Summary Last Updated
oVirt gerrit 106848 master MERGED qga: rewrite poller 2021-02-09 15:45:46 UTC
oVirt gerrit 106849 master MERGED qga: do not query caps in every iteration 2021-02-09 15:45:46 UTC
oVirt gerrit 106850 master MERGED qga: add libvirt backend 2021-02-09 15:45:46 UTC

Description Elad 2019-02-24 16:36:49 UTC
Created attachment 1538179 [details]
engine, hypervisor and logs from the guest

Description of problem:
After starting a VM with a RHEL8 guest that has the qemu guest agent installed, VM details such as IP and FQDN are not reported by the engine. In vdsm.log I observed that the QEMU-GA capabilities for the VM are commands=[].

Version-Release number of selected component (if applicable):
Hypervisor:
vdsm-4.30.9-1.el7ev.x86_64
qemu-kvm-rhev-2.12.0-21.el7.x86_64
libvirt-4.5.0-10.el7_6.4.x86_64
sanlock-3.6.0-1.el7.x86_64
selinux-policy-3.13.1-229.el7_6.9.noarch
kernel - 3.10.0-957.10.1.el7.x86_64

Guest:
RHEL8
kernel - 4.18.0-67.el8.x86_64
qemu-guest-agent-2.12.0-61.module+el8+2786+5afd5ae3.x86_64
selinux-policy-3.14.1-59.el8.noarch

Engine:
ovirt-engine-4.3.1.1-0.1.el7.noarch

How reproducible:
Sometimes. This causes many failures in RHV QE automation, as we rely on the VM IP for SSH connection.

Steps to Reproduce:
1. Create a VM with RHEL8 installed + qemu GA running in the guest
2. Start the VM 


Actual results:
No QEMU-GA capabilities after VM start

2019-02-24 17:38:50,834+0200 INFO  (qgapoller/1) [vds] New QEMU-GA capabilities for vm_id=568dbc4a-e8ca-43a9-bd48-0cad263ea767, qemu-ga=None, commands=[] (qemuguestagent:171)


In the guest, I observed the following audit messages (which resemble the ones reported in bug 1663092):


[  632.789546] audit: type=1400 audit(1551023330.829:3): avc:  denied  { read } for  pid=653 comm="qemu-ga" name="b252:3" dev="tmpfs" ino=19466 scontext=system_u:system_r:virt_qemu_ga_t:s0 tcontext=system_u:object_r:udev_var_run_t:s0 tclass=file permissive=1
[  632.886759] audit: type=1400 audit(1551023330.829:4): avc:  denied  { open } for  pid=653 comm="qemu-ga" path="/run/udev/data/b252:3" dev="tmpfs" ino=19466 scontext=system_u:system_r:virt_qemu_ga_t:s0 tcontext=system_u:object_r:udev_var_run_t:s0 tclass=file permissive=1
[  632.994361] audit: type=1400 audit(1551023330.830:5): avc:  denied  { getattr } for  pid=653 comm="qemu-ga" path="/run/udev/data/b252:3" dev="tmpfs" ino=19466 scontext=system_u:system_r:virt_qemu_ga_t:s0 tcontext=system_u:object_r:udev_var_run_t:s0 tclass=file permissive=1


Also, in libvirtd.log, I saw this message:

2019-02-24 15:38:50.833+0000: 3047: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected


Only after ~5 minutes do the capabilities become available:

2019-02-24 17:43:50,866+0200 INFO  (qgapoller/0) [vds] New QEMU-GA capabilities for vm_id=568dbc4a-e8ca-43a9-bd48-0cad263ea767, qemu-ga=2.12.0, commands=set([u'guest-get-timezone', u'guest-suspend-hybrid', u'gue
st-get-fsinfo', u'guest-set-time', u'guest-sync', u'guest-get-memory-block-info', u'guest-fsfreeze-freeze', u'guest-network-get-interfaces', u'guest-fstrim', u'guest-get-host-name', u'guest-shutdown', u'guest-ge
t-vcpus', u'guest-sync-delimited', u'guest-set-user-password', u'guest-set-memory-blocks', u'guest-fsfreeze-status', u'guest-ping', u'guest-fsfreeze-freeze-list', u'guest-info', u'guest-get-memory-blocks', u'gue
st-set-vcpus', u'guest-fsfreeze-thaw', u'guest-suspend-disk', u'guest-get-osinfo', u'guest-get-time', u'guest-suspend-ram', u'guest-get-users']) (qemuguestagent:171)
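
For reference, a minimal diagnostic sketch that probes the agent directly from the hypervisor (this is NOT VDSM's poller code; it assumes the libvirt and libvirt-qemu Python bindings are installed, and the VM name below is a placeholder):

# Hypothetical diagnostic sketch: ask libvirt to run "guest-info" in the guest agent.
# The reply carries the agent version and supported commands, i.e. the same data
# VDSM logs as "New QEMU-GA capabilities".
import json
import libvirt
import libvirt_qemu

VM_NAME = "test_rhel8"  # placeholder name, adjust to the VM under test

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(VM_NAME)
try:
    reply = libvirt_qemu.qemuAgentCommand(
        dom, json.dumps({"execute": "guest-info"}), 10, 0)  # 10 s timeout, no flags
    info = json.loads(reply)["return"]
    print("qemu-ga version:", info.get("version"))
    print("commands:", sorted(c["name"] for c in info.get("supported_commands", [])))
except libvirt.libvirtError as err:
    # This is the "QEMU guest agent is not connected" case seen in libvirtd.log.
    print("agent not responding:", err)
finally:
    conn.close()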



Expected results:
QEMU-GA capabilities should be reported as soon as the qemu guest agent is up and running in the guest.


Additional info:
engine, hypervisor and logs from the guest are attached


Guest:
[root@localhost audit]# getenforce 
Permissive


[root@localhost audit]# systemctl status qemu-guest-agent.service 
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-02-24 15:39:15 EST; 4h 3min left
 Main PID: 653 (qemu-ga)
    Tasks: 1 (limit: 4648)
   Memory: 2.6M
   CGroup: /system.slice/qemu-guest-agent.service
           └─653 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file->

Feb 24 15:39:15 localhost.localdomain systemd[1]: Started QEMU Guest Agent.

Comment 1 Ryan Barry 2019-02-24 17:02:59 UTC
Tomas, any ideas?

Comment 2 Michal Skrivanek 2019-02-25 07:47:12 UTC
The capabilities check interval is 5 minutes; the logs above seem to match that. We should consider some simple back-off algorithm for VM start, as sketched below.
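
For illustration, a minimal sketch of such a back-off schedule (not the actual VDSM qgapoller implementation; poll_capabilities and the interval values are hypothetical):

# Illustrative back-off polling for a newly started VM, not VDSM code.
# poll_capabilities() is a hypothetical callback returning the capabilities
# dict, or None while the guest agent is not responding yet.
import time

def poll_until_available(poll_capabilities, initial=5, factor=2, maximum=300):
    """Poll quickly right after VM start, then back off towards the regular
    5-minute capability check interval."""
    interval = initial
    while True:
        caps = poll_capabilities()
        if caps is not None:
            return caps
        time.sleep(interval)
        interval = min(interval * factor, maximum)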

Comment 3 Polina 2019-04-04 07:49:43 UTC
In the latest builds with a RHEL 8.0 guest, I see a problem that could be related to this BZ, though I'm not sure.
I'm describing it here so you can consider whether it is the same problem or a separate bug.

The problem: start a RHEL 8 guest VM on an iSCSI disk, block the storage on the host, and wait for the Paused state. The RHEL 8 VM does not switch to the Paused state. With a RHEL 7.6 VM it works perfectly: less than a minute after the storage is blocked, it switches to the Paused state.

I'm attaching logs from two tests: first_run.tar.gz and second_run.tar.gz.
In test1 I run two iSCSI VMs (rhel7.6 and rhel8) for the first time after creating them, wait for an IP, and block the iSCSI storage on the host. As a result, test_rhel7.6 switches to Paused while test_rhel8 does not.

test2 is the same, except that it is not the first run, so I get the IP sooner; the result is the same: test_rhel8 never switches to the Paused state. I wait until the host becomes non-operational at about 2019-04-04 10:14:51, which causes test_rhel8 to migrate.

This behavior causes automation test failures. It happens a lot, though not in 100% of cases; sometimes it behaves as expected.


### test1 (the first run, getting the IP after the VM was created) starts at:
2019-04-04 09:41:38,079+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel8 started on Host host_mixed_1
2019-04-04 09:41:53,117+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel7.6 started on Host host_mixed_1

2019-04-04 09:48:08,806+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: VM_PAUSED_EIO(145), VM test_rhel7.6 has been paused due to storage I/O problem.

### test 2 (the second run) starts at:

2019-04-04 10:05:55,792+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-90) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel7.6 started on Host host_mixed_1

2019-04-04 10:06:10,905+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel8 started on Host host_mixed_1


At about Thu Apr  4 10:14:51 IDT 2019, host_mixed_1 turns to the non-operational state, which causes test_rhel8 to migrate.

Comment 4 Polina 2019-04-04 07:50:32 UTC
Created attachment 1551703 [details]
first_run.tar

Comment 5 Polina 2019-04-04 07:51:11 UTC
Created attachment 1551704 [details]
second_run.tar

Comment 6 Shir Fishbain 2019-05-06 13:37:12 UTC
This problem also appears in ovirt-engine-4.3.3.6, with a RHEL 8.0 guest.

2019-05-04 03:29:36,156 - MainThread - RemoteExecutor - DEBUG - [root.16.38/qum5net]   OUT: [{u'displayInfo': [{u'tlsPort': u'5901', u'ipAddress': u'10.46.16.38', u'port': u'5900', u'type': u'spice'}], u'memUsage': u'63', u'acpiEnable': u'true', u'guestFQDN': u'', u'vmId': u'43647a6f-bf30-478d-b2f6-9e40db8bda0c', u'session': u'Unknown', u'timeOffset': u'0', u'memoryStats': {u'swap_out': 0, u'majflt': 0, u'minflt': 0, u'mem_free': u'378480', u'swap_in': 0, u'pageflt': 0, u'mem_total': u'772136', u'mem_unused': u'378480'}, u'balloonInfo': {u'balloon_max': u'1048576', u'balloon_cur': u'1048576', u'balloon_target': u'1048576', u'balloon_min': u'1048576'}, u'pauseCode': u'NOERR', u'network': {u'vnet0': {u'macAddr': u'00:1a:4a:16:88:ac', u'rxDropped': u'0', u'tx': u'4149', u'txDropped': u'0', u'rxErrors': u'0', u'rx': u'254188', u'txErrors': u'0', u'state': u'unknown', u'sampleTime': 4319342.34, u'speed': u'1000', u'name': u'vnet0'}}, u'vmType': u'kvm', u'cpuUser': u'0.05', u'elapsedTime': u'311', u'vmJobs': {}, u'cpuSys': u'0.40', u'appsList': [u'qemu-guest-agent-2.12.0'], u'vmName': u'copy_disk_vm_glusterfs', u'vcpuCount': u'1', u'hash': u'-8771708058453342833', u'cpuUsage': u'8600000000', u'vcpuPeriod': 100000

Logs attached

Comment 7 Shir Fishbain 2019-05-06 13:38:19 UTC
Created attachment 1564473 [details]
New_Logs

Comment 8 Michal Skrivanek 2019-09-05 09:49:52 UTC
Not related to RHEL 8 at all; it is related to qemu-ga without ovirt-ga, which can happen on any guest.

Comment 9 Shir Fishbain 2019-09-10 13:20:12 UTC
Created attachment 1613592 [details]
New_Logs_4.3.6.5V

Comment 11 Tomáš Golembiovský 2020-02-11 13:00:57 UTC
Work is being done on fixing the issue

Comment 12 Sandro Bonazzola 2020-04-08 15:00:34 UTC
This bug is on POST for 4.4.1, but the referenced patches are included in vdsm v4.40.13. Can we move this to MODIFIED for 4.4.0?

Comment 13 Petr Matyáš 2020-05-12 16:51:24 UTC
Verified on vdsm-4.40.14-1.el8ev.x86_64

Comment 14 Sandro Bonazzola 2020-05-20 20:03:03 UTC
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

