Bug 1680398

Summary: QEMU-GA capabilities are reported as none for a few minutes after VM start; therefore, VM details are not reported by the engine
Product: [oVirt] vdsm Reporter: Elad <ebenahar>
Component: Core    Assignee: Tomáš Golembiovský <tgolembi>
Status: CLOSED CURRENTRELEASE QA Contact: Petr Matyáš <pmatyas>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.30.8    CC: aefrat, bugs, dholler, fgarciad, lleistne, mburman, michal.skrivanek, pagranat, rbarry, sfishbai, tgolembi
Target Milestone: ovirt-4.4.0    Keywords: Automation, AutomationBlocker
Target Release: 4.40.14    Flags: rbarry: ovirt-4.3?
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: vdsm-4.40.14 Doc Type: Enhancement
Doc Text:
The mechanism for polling QEMU-GA in VDSM has been enhanced to query newly started VMs more often, so that stats are reported as soon as the agent becomes available in the guest.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-20 20:03:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1661283, 1740058    
Attachments (flags: none):
  engine, hypervisor and logs from the guest
  first_run.tar
  second_run.tar
  New_Logs
  New_Logs_4.3.6.5V

Description Elad 2019-02-24 16:36:49 UTC
Created attachment 1538179 [details]
engine, hypervisor and logs from the guest

Description of problem:
After starting a VM with a RHEL 8 guest that has the QEMU guest agent installed, VM details such as IP and FQDN are not reported by the engine. In vdsm.log I observed that the QEMU-GA capabilities for the VM are commands=[].

Version-Release number of selected component (if applicable):
Hypervisor:
vdsm-4.30.9-1.el7ev.x86_64
qemu-kvm-rhev-2.12.0-21.el7.x86_64
libvirt-4.5.0-10.el7_6.4.x86_64
sanlock-3.6.0-1.el7.x86_64
selinux-policy-3.13.1-229.el7_6.9.noarch
kernel - 3.10.0-957.10.1.el7.x86_64

Guest:
RHEL8
kernel - 4.18.0-67.el8.x86_64
qemu-guest-agent-2.12.0-61.module+el8+2786+5afd5ae3.x86_64
selinux-policy-3.14.1-59.el8.noarch

Engine:
ovirt-engine-4.3.1.1-0.1.el7.noarch

How reproducible:
Sometimes. This causes many failures in RHV QE automation, as we rely on the VM IP for the SSH connection.

Steps to Reproduce:
1. Create a VM with RHEL 8 installed and the QEMU guest agent running in the guest
2. Start the VM 


Actual results:
No QEMU-GA capabilities after VM start

2019-02-24 17:38:50,834+0200 INFO  (qgapoller/1) [vds] New QEMU-GA capabilities for vm_id=568dbc4a-e8ca-43a9-bd48-0cad263ea767, qemu-ga=None, commands=[] (qemuguestagent:171)


In the guest, I observed the following audit messages (which resemble the ones reported in bug 1663092):


[  632.789546] audit: type=1400 audit(1551023330.829:3): avc:  denied  { read } for  pid=653 comm="qemu-ga" name="b252:3" dev="tmpfs" ino=19466 scontext=system_u:system_r:virt_qemu_ga_t:s0 tcontext=system_u:object_r:udev_var_run_t:s0 tclass=file permissive=1
[  632.886759] audit: type=1400 audit(1551023330.829:4): avc:  denied  { open } for  pid=653 comm="qemu-ga" path="/run/udev/data/b252:3" dev="tmpfs" ino=19466 scontext=system_u:system_r:virt_qemu_ga_t:s0 tcontext=system_u:object_r:udev_var_run_t:s0 tclass=file permissive=1
[  632.994361] audit: type=1400 audit(1551023330.830:5): avc:  denied  { getattr } for  pid=653 comm="qemu-ga" path="/run/udev/data/b252:3" dev="tmpfs" ino=19466 scontext=system_u:system_r:virt_qemu_ga_t:s0 tcontext=system_u:object_r:udev_var_run_t:s0 tclass=file permissive=1


Also, in libvirtd.log, I saw these messages:

2019-02-24 15:38:50.833+0000: 3047: error : qemuDomainAgentAvailable:9133 : Guest agent is not responding: QEMU guest agent is not connected
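
For reference, agent availability can also be probed directly from the host with the libvirt Python bindings. This is only a minimal sketch, not vdsm's actual polling code; the connection URI and domain name are placeholders:

    import libvirt
    import libvirt_qemu

    def agent_is_connected(domain_name, uri="qemu:///system"):
        """Return True if qemu-ga answers guest-ping, False otherwise."""
        conn = libvirt.open(uri)
        try:
            dom = conn.lookupByName(domain_name)
            # guest-ping returns an empty reply when the agent is connected;
            # libvirt raises libvirtError ("guest agent is not connected")
            # when it is not, matching the message in libvirtd.log above.
            libvirt_qemu.qemuAgentCommand(dom, '{"execute": "guest-ping"}', 5, 0)
            return True
        except libvirt.libvirtError:
            return False
        finally:
            conn.close()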


Only after ~5 minutes are the capabilities available:

2019-02-24 17:43:50,866+0200 INFO  (qgapoller/0) [vds] New QEMU-GA capabilities for vm_id=568dbc4a-e8ca-43a9-bd48-0cad263ea767, qemu-ga=2.12.0, commands=set([u'guest-get-timezone', u'guest-suspend-hybrid', u'guest-get-fsinfo', u'guest-set-time', u'guest-sync', u'guest-get-memory-block-info', u'guest-fsfreeze-freeze', u'guest-network-get-interfaces', u'guest-fstrim', u'guest-get-host-name', u'guest-shutdown', u'guest-get-vcpus', u'guest-sync-delimited', u'guest-set-user-password', u'guest-set-memory-blocks', u'guest-fsfreeze-status', u'guest-ping', u'guest-fsfreeze-freeze-list', u'guest-info', u'guest-get-memory-blocks', u'guest-set-vcpus', u'guest-fsfreeze-thaw', u'guest-suspend-disk', u'guest-get-osinfo', u'guest-get-time', u'guest-suspend-ram', u'guest-get-users']) (qemuguestagent:171)
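
For context, the commands set in the log above is presumably built from the agent's guest-info reply. A minimal sketch of extracting the agent version and the enabled command names from such a reply, assuming the JSON shape defined by the QEMU guest agent protocol (this is illustrative, not vdsm's parsing code):

    import json

    def parse_guest_info(reply_json):
        """Extract agent version and enabled command names from a
        qemu-ga guest-info reply."""
        info = json.loads(reply_json).get("return", {})
        version = info.get("version")
        commands = {
            cmd["name"]
            for cmd in info.get("supported_commands", [])
            if cmd.get("enabled", False)
        }
        return version, commands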



Expected results:
QEMU-GA capabilities should be reported immediately after the qemu GA agent is up and running.


Additional info:
engine, hypervisor and logs from the guest are attached


Guest:
[root@localhost audit]# getenforce 
Permissive


[root@localhost audit]# systemctl status qemu-guest-agent.service 
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-02-24 15:39:15 EST; 4h 3min left
 Main PID: 653 (qemu-ga)
    Tasks: 1 (limit: 4648)
   Memory: 2.6M
   CGroup: /system.slice/qemu-guest-agent.service
           └─653 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file->

Feb 24 15:39:15 localhost.localdomain systemd[1]: Started QEMU Guest Agent.

Comment 1 Ryan Barry 2019-02-24 17:02:59 UTC
Tomas, any ideas?

Comment 2 Michal Skrivanek 2019-02-25 07:47:12 UTC
The capabilities check interval is 5 minutes; the logs above seem to match that. We should consider a simple back-off algorithm for VM start.
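
A minimal sketch of such a back-off, purely for illustration (the constants and function names are made up, not vdsm's actual qgapoller implementation): poll a newly started VM at a short interval and fall back to the regular 5-minute cadence once the agent responds.

    # Illustrative sketch only; names and values are assumptions.
    REGULAR_INTERVAL = 300   # seconds, the regular 5-minute capability check
    INITIAL_INTERVAL = 10    # seconds, first probe after VM start
    BACKOFF_FACTOR = 2       # grow the interval after each failed probe

    def next_poll_interval(current_interval, agent_responded):
        """Return the delay in seconds before the next capabilities query."""
        if agent_responded:
            # Agent is up; switch to the regular cadence.
            return REGULAR_INTERVAL
        # Agent not up yet; back off gradually, capped at the regular interval.
        return min(current_interval * BACKOFF_FACTOR, REGULAR_INTERVAL)

A freshly started VM would then be probed roughly 10 seconds after start, then 20, 40, ... seconds later until qemu-ga answers, instead of waiting for the next 5-minute tick.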

Comment 3 Polina 2019-04-04 07:49:43 UTC
In the latest builds with a RHEL 8.0 guest I see a problem that could be related to this BZ, though I'm not sure. I describe it here for your consideration, in case it is the same problem or a separate bug.

The problem: start a RHEL 8 guest VM on an iSCSI disk, block the storage on the host, and wait for the Paused state. The RHEL 8 VM does not turn to the Paused state. With a RHEL 7.6 VM it works perfectly: less than a minute after the storage is blocked, it turns to the Paused state.

I attach logs from two tests, first_run.tar.gz and second_run.tar.gz.
In test 1 I run two iSCSI VMs, rhel7.6 and rhel8, for the first time after creating them, wait for the IPs, and then block the iSCSI storage on the host. As a result, test_rhel7.6 turns to Paused and test_rhel8 does not.

Test 2 is just the same; the difference is that it is not the first run, so I get the IP sooner, but the result is the same: test_rhel8 never turns to the Paused state. I wait until the host turns non-operational at about 2019-04-04 10:14:51, which causes test_rhel8 to migrate.

This behavior causes automation test failures. It happens a lot, though not in 100% of cases; sometimes it behaves as expected.


### test 1 (the first run, getting the IP after the VM is created) starts at:
2019-04-04 09:41:38,079+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel8 started on Host host_mixed_1
2019-04-04 09:41:53,117+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel7.6 started on Host host_mixed_1

2019-04-04 09:48:08,806+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: VM_PAUSED_EIO(145), VM test_rhel7.6 has been paused due to storage I/O problem.

### test 2 (the second run) starts at:

2019-04-04 10:05:55,792+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-90) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel7.6 started on Host host_mixed_1

2019-04-04 10:06:10,905+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] EVENT_ID: USER_RUN_VM(32), VM test_rhel8 started on Host host_mixed_1


At about Thu Apr  4 10:14:51 IDT 2019, host_mixed_1 turns to the non-operational state, which causes test_rhel8 to migrate.

Comment 4 Polina 2019-04-04 07:50:32 UTC
Created attachment 1551703 [details]
first_run.tar

Comment 5 Polina 2019-04-04 07:51:11 UTC
Created attachment 1551704 [details]
second_run.tar

Comment 6 Shir Fishbain 2019-05-06 13:37:12 UTC
This problem also appears in ovirt-engine-4.3.3.6, with a RHEL 8.0 guest:

2019-05-04 03:29:36,156 - MainThread - RemoteExecutor - DEBUG - [root.16.38/qum5net]   OUT: [{u'displayInfo': [{u'tlsPort': u'5901', u'ipAddress': u'10.46.16.38', u'port': u'5900', u'type': u'spice'}], u'memUsage': u'63', u'acpiEnable': u'true', u'guestFQDN': u'', u'vmId': u'43647a6f-bf30-478d-b2f6-9e40db8bda0c', u'session': u'Unknown', u'timeOffset': u'0', u'memoryStats': {u'swap_out': 0, u'majflt': 0, u'minflt': 0, u'mem_free': u'378480', u'swap_in': 0, u'pageflt': 0, u'mem_total': u'772136', u'mem_unused': u'378480'}, u'balloonInfo': {u'balloon_max': u'1048576', u'balloon_cur': u'1048576', u'balloon_target': u'1048576', u'balloon_min': u'1048576'}, u'pauseCode': u'NOERR', u'network': {u'vnet0': {u'macAddr': u'00:1a:4a:16:88:ac', u'rxDropped': u'0', u'tx': u'4149', u'txDropped': u'0', u'rxErrors': u'0', u'rx': u'254188', u'txErrors': u'0', u'state': u'unknown', u'sampleTime': 4319342.34, u'speed': u'1000', u'name': u'vnet0'}}, u'vmType': u'kvm', u'cpuUser': u'0.05', u'elapsedTime': u'311', u'vmJobs': {}, u'cpuSys': u'0.40', u'appsList': [u'qemu-guest-agent-2.12.0'], u'vmName': u'copy_disk_vm_glusterfs', u'vcpuCount': u'1', u'hash': u'-8771708058453342833', u'cpuUsage': u'8600000000', u'vcpuPeriod': 100000

Logs attached

Comment 7 Shir Fishbain 2019-05-06 13:38:19 UTC
Created attachment 1564473 [details]
New_Logs

Comment 8 Michal Skrivanek 2019-09-05 09:49:52 UTC
Not related to RHEL 8 at all; it is related to qemu-ga running without ovirt-ga, which can happen on any guest.

Comment 9 Shir Fishbain 2019-09-10 13:20:12 UTC
Created attachment 1613592 [details]
New_Logs_4.3.6.5V

Comment 11 Tomáš Golembiovský 2020-02-11 13:00:57 UTC
Work is being done on fixing the issue

Comment 12 Sandro Bonazzola 2020-04-08 15:00:34 UTC
This bug is on POST for 4.4.1, but the referenced patches are included in vdsm v4.40.13. Can we move this to MODIFIED for 4.4.0?

Comment 13 Petr Matyáš 2020-05-12 16:51:24 UTC
Verified on vdsm-4.40.14-1.el8ev.x86_64

Comment 14 Sandro Bonazzola 2020-05-20 20:03:03 UTC
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.