Bug 1300566

Summary: [RFE] expose the status of qemu-guest-agent and ovirt-guest-agent
Product: [oVirt] vdsm
Component: RFEs
Reporter: Daniel Erez <derez>
Assignee: Dan Kenigsberg <danken>
QA Contact: meital avital <mavital>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low
Version: 4.17.19
CC: acanan, bugs, derez, eblake, michal.skrivanek, mtessun, ngavrilo, tnisan, v.astafiev, vfeenstr
Keywords: FutureFeature
Target Milestone: ---
Target Release: ---
Flags: rule-engine: planning_ack?, rule-engine: devel_ack?, rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Clone Of: 1299513
Bug Blocks: 1401183
Last Closed: 2018-07-06 14:28:25 UTC
Type: Bug
oVirt Team: Virt

Description Daniel Erez 2016-01-21 07:32:49 UTC
+++ This bug was initially created as a clone of Bug #1299513 +++

+++ This bug was initially created as a clone of Bug #1297919 +++

Description of problem:

Previewing a snapshot and undoing it causes filesystem corruption on a Cinder-attached disk.

Version-Release number of selected component:

rhevm-3.6.2-0.1.el6.noarch
vdsm-4.17.15-0.el7ev.noarch

How reproducible:
50%.

Steps to Reproduce:

On a VM that has an OS disk and another attached disk (Cinder), perform the following:

1. Create a file on the attached disk, then create a snapshot (perform this step 3 times).
2. Create another file on the attached disk.
3. Preview the first snapshot that was created, then "Undo".
4. Run the VM and view the 4 files that were created in steps 1 and 2.

Actual results:

console:

[root@RHEL7Server ~]# cd /mnt/vdb/
[root@RHEL7Server vdb]# ll
ls: cannot access 4.txt: Input/output error
ls: cannot access 3.txt: Input/output error
ls: cannot access 2.txt: Input/output error
total 20
-rw-r--r--. 1 root root    40 Jan 12 18:08 1.txt
-?????????? ? ?    ?        ?            ? 2.txt
-?????????? ? ?    ?        ?            ? 3.txt
-?????????? ? ?    ?        ?            ? 4.txt
drwx------. 2 root root 16384 Jan 12 18:07 lost+found
[root@RHEL7Server vdb]# 

/var/log/messages:

Jan 12 19:10:25 RHEL7Server kernel: EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 15
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 14
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 16
Jan 12 19:12:44 RHEL7Server systemd-logind: Removed session 3.
Jan 12 19:14:26 RHEL7Server systemd: Starting Session 4 of user root.
Jan 12 19:14:26 RHEL7Server systemd: Started Session 4 of user root.
Jan 12 19:14:26 RHEL7Server systemd-logind: New session 4 of user root.
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 15
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 14
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 16
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 15
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 14
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 16
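
To confirm the on-disk corruption independently of the directory listing, a read-only filesystem check can be run against the attached disk (a sketch; it assumes the disk is /dev/vdb as in the console output above, and that it is unmounted first):

  umount /mnt/vdb
  e2fsck -fn /dev/vdb

The -n flag answers "no" to all repair prompts, so the check itself does not modify the filesystem.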


libvirtd.log:

2016-01-12 16:51:56.477+0000: 32637: info : virDBusCall:1554 : DBUS_METHOD_ERROR: 'org.freedesktop.machine1.Manager.TerminateMachine' on '/org/freedesktop/machine1' at 'org.freedesktop.machine1' error org.freedesktop.machine1.NoSuchMachine: No machine 'qemu-vm_cinder_1' known

Additional info:

Note: there is a time shift in the libvirt log (-2 hours); I don't know why it uses a different time zone (the machine is synced with local time).

--- Additional comment from Allon Mureinik on 2016-01-13 13:39:01 IST ---

There's no guest agent:

"""
periodic/0::WARNING::2016-01-12 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
jsonrpc.Executor/4::WARNING::2016-01-12 18:09:39,126::vm::2995::virt.vm::(freeze) vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest filesystems: Guest agent is not responding: Guest agent not available for now
"""

Without a guest agent, the snapshot isn't consistent, obviously.
I suggest closing as NOTABUG.

Daniel?

--- Additional comment from Daniel Erez on 2016-01-13 14:25:08 IST ---

(In reply to Allon Mureinik from comment #1)
> There's no guest agent:
> 
> """
> periodic/0::WARNING::2016-01-12
> 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> jsonrpc.Executor/4::WARNING::2016-01-12
> 18:09:39,126::vm::2995::virt.vm::(freeze)
> vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> filesystems: Guest agent is not responding: Guest agent not available for now
> """
> 
> Without a guest agent, the snapshot isn't consistent, obviously.
> I suggest closing as NOTABUG.
> 
> Daniel?

Indeed.
@Natalie - is it reproduced with guest agent installed and running?

--- Additional comment from Aharon Canan on 2016-01-13 14:39:14 IST ---

(In reply to Allon Mureinik from comment #1)
> There's no guest agent:
> 
> """
> periodic/0::WARNING::2016-01-12
> 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> jsonrpc.Executor/4::WARNING::2016-01-12
> 18:09:39,126::vm::2995::virt.vm::(freeze)
> vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> filesystems: Guest agent is not responding: Guest agent not available for now
> """
> 
> Without a guest agent, the snapshot isn't consistent, obviously.

So in such case we should block the option if guest agent is not installed.

--- Additional comment from Allon Mureinik on 2016-01-13 15:19:01 IST ---

(In reply to Aharon Canan from comment #3)
> (In reply to Allon Mureinik from comment #1)
> > There's no guest agent:
> > 
> > """
> > periodic/0::WARNING::2016-01-12
> > 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> > not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> > [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> > jsonrpc.Executor/4::WARNING::2016-01-12
> > 18:09:39,126::vm::2995::virt.vm::(freeze)
> > vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> > filesystems: Guest agent is not responding: Guest agent not available for now
> > """
> > 
> > Without a guest agent, the snapshot isn't consistent, obviously.
> 
> So in such case we should block the option if guest agent is not installed.

We have discussed this multiple times, we aren't going to re-discuss it here.

--- Additional comment from Daniel Erez on 2016-01-13 15:27:23 IST ---

Re-adding the needinfo on Natalie (see https://bugzilla.redhat.com/show_bug.cgi?id=1297919#c2)

--- Additional comment from Natalie Gavrielov on 2016-01-13 18:40 IST ---

Reproduced with the guest agent installed and running (same response when running "ll", and /var/log/messages shows the same type of errors).

--- Additional comment from Daniel Erez on 2016-01-13 18:53:20 IST ---

(In reply to Natalie Gavrielov from comment #6)
> Created attachment 1114466 [details]
> logs: engine.log, libvirtd.log, qemu, vdsm, cinder
> 
> Reproduced with guest agent installed and running (same response when
> running "ll" and /var/log/messages shows the same type of errors.

Some more questions for further investigation:
* Did you try it on a new VM with a fresh OS?
* Which OS did you use?
* Is it reproduced with another OS?

--- Additional comment from Daniel Erez on 2016-01-13 19:28:17 IST ---

Another thing, according to the latest logs [1], the guest agent on the VM still isn't responsive.

Can you please check the agent is running by using:
'systemctl status ovirt-guest-agent.service'

BTW, more info on:
* http://www.ovirt.org/Understanding_Guest_Agents_and_Other_Tools
* http://www.ovirt.org/How_to_install_the_guest_agent_in_Fedora

[1] 2016-01-13 15:50:45,629 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID: ecc702e6-720a-482f-8840-0a06a851108e, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest agent is not responding: Guest agent not available for now, code = 19 (Failed with error nonresp and code 19)
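
For reference, a minimal way to install and check both agents inside a RHEL 7 guest (a sketch; the package names below are the upstream oVirt ones, and a RHEV-subscribed guest would use rhevm-guest-agent-common instead, so adjust for your distribution):

  yum install qemu-guest-agent ovirt-guest-agent-common
  systemctl enable qemu-guest-agent.service ovirt-guest-agent.service
  systemctl start qemu-guest-agent.service ovirt-guest-agent.service
  systemctl status qemu-guest-agent.service ovirt-guest-agent.service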

--- Additional comment from Natalie Gavrielov on 2016-01-13 19:32:27 IST ---

(In reply to Daniel Erez from comment #8)
> Another thing, according to the latest logs [1], the guest agent on the VM
> still isn't responsive.
> 
> Can you please check the agent is running by using:
> 'systemctl status ovirt-guest-agent.service'
> 
> BTW, more info on:
> * http://www.ovirt.org/Understanding_Guest_Agents_and_Other_Tools
> * http://www.ovirt.org/How_to_install_the_guest_agent_in_Fedora
> 
> [1] 2016-01-13 15:50:45,629 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID:
> ecc702e6-720a-482f-8840-0a06a851108e, Call Stack:
> org.ovirt.engine.core.common.errors.EngineException: EngineException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest
> agent is not responding: Guest agent not available for now, code = 19
> (Failed with error nonresp and code 19)

I checked that already.

--- Additional comment from Natalie Gavrielov on 2016-01-13 19:51:27 IST ---

In case comment #9 wasn't clear enough: *before* performing any of the snapshots (in the second test, comment #6), I installed the guest agent, started it, and made sure it was active.

--- Additional comment from Aharon Canan on 2016-01-14 09:22:25 IST ---

(In reply to Daniel Erez from comment #7)
 
> Some more question for further investigation:
> * Did you try it on a new VM with a fresh OS?
> * Which OS did you use?

RHEL 7, as you can see in several places in the description.

> * Is it reproduced with another OS?

I am not sure why we should reproduce with another OS; did you check the code and find that it can be relevant to this OS only?

Did you try to talk to Natalie instead of this pingpong? 
Did you try to recreate it as well?


(In reply to Allon Mureinik from comment #4)
 
> We have discussed this multiple times, we aren't going to re-discuss it here.

We never talked about this specifically; maybe in general, which I don't remember either.
Anyway, if the engine can't see the guest agent, it should block all operations that the guest agent is needed for.


(In reply to Allon Mureinik from comment #1)
> 
> Without a guest agent, the snapshot isn't consistent, obviously.
> I suggest closing as NOTABUG.
> 

This is a bug! This flow can cause data corruption; we need to prevent the user from hitting it by fixing it or blocking it.
You can mark it as CANTFIX if you can't fix it, or WONTFIX if you don't want to, but it is a bug for sure; at the very least we need to add a release note if we are not going to do anything about it.

--- Additional comment from Allon Mureinik on 2016-01-14 11:02:26 IST ---

(In reply to Aharon Canan from comment #11)

> (In reply to Allon Mureinik from comment #1)
> > 
> > Without a guest agent, the snapshot isn't consistent, obviously.
> > I suggest closing as NOTABUG.
> > 
> 
> This is a bug! This flow can cause data corruption, we need to prevent the
> user from hitting it by fixing it or blocking it.
> You can mark it as Can't fix if you can't fix or won't fix if you don't want
> to, but it is a bug for sure, at least we need to add release note if we are
> not going to do anything about it.

We don't nanny the user - that's been the definition from day one.

It's completely valid to have a snapshot without a guest agent installed, if the customer decides to do so.
We could maybe add another warning (based on the engine reporting the guest agent's availability - but it still won't be bulletproof), but I struggle to see how that's a productive use of anyone's time, including the QE that will have to verify it later.

--- Additional comment from Aharon Canan on 2016-01-14 11:18:27 IST ---

We have data corruption here, do we agree?
Customers shouldn't hit data corruption, do we agree?
I totally agree we shouldn't block snapshots if no guest agent is installed, but as shown, this has nothing to do with whether the guest agent is installed or not.

Taking a snapshot certainly shouldn't cause data corruption.

I don't know about the day-one definition, but this should be fixed somehow.

--- Additional comment from Ori Gofen on 2016-01-14 13:28:26 IST ---

* clarification *
- This bug causes a kernel panic on a second attempt to launch.
- This bug does not reproduce on "regular" oVirt storage.

--- Additional comment from Daniel Erez on 2016-01-17 13:22:45 IST ---

So the issue is a failure in fsFreeze since the guest agent is reported as non-responsive [1]. However, the ovirt-guest-agent service is actually installed and running.

Not sure whether the issue lies in libvirt or the guest agent.


@Eric - could it be related to communication between libvirt and the agent? I've found the following error(?) in the logs that might be related:

2016-01-11 13:03:18.213+0000: 30086: info : qemuMonitorIOProcess:452 : QEMU_MONITOR_IO_PROCESS: mon=0x7f0fb4012590 buf={"return": [{"frontend-open": false, "filename": "spicevmc", "label": "charchannel2"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.org.qemu.guest_agent.0,server", "label": "charchannel1"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.com.redhat.rhevm.vdsm,server", "label": "charchannel0"}, {"frontend-open": true, "filename": "unix:/var/lib/libvirt/qemu/domain-vm_nfs_6/monitor.sock,server", "label": "charmonitor"}], "id": "libvirt-3"} len=598
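
In the excerpt above, "frontend-open": false on the org.qemu.guest_agent.0 channel means that nothing inside the guest currently has that virtio-serial port open, i.e. qemu-guest-agent is not connected. A rough way to re-check this state from the host is to query the chardevs over QMP (a sketch only; <vm-name> is a placeholder, and on a vdsm-managed host virsh may prompt for SASL credentials):

  virsh list --name
  virsh qemu-monitor-command <vm-name> --pretty '{"execute": "query-chardev"}'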


@Vinzenz - what could cause the mentioned error? Perhaps the ovirt-guest-agent requires a machine reboot after installation? I ask because I've been able to reproduce the issue right after installing the service (i.e. without restarting the VM).



[1] 2016-01-13 15:50:45,629 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID: ecc702e6-720a-482f-8840-0a06a851108e, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest agent is not responding: Guest agent not available for now, code = 19 (Failed with error nonresp and code 19)
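
Since the failing step is the freeze itself, the same path can be exercised by hand from the host, bypassing the engine and vdsm, to see whether libvirt can drive the freeze at all (a sketch only; <vm-name> is a placeholder):

  virsh domfsfreeze <vm-name>
  virsh domfsthaw <vm-name>

If the guest agent channel is down, domfsfreeze is expected to fail with a guest-agent-not-responding error comparable to the FreezeVDS failure quoted in [1].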

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-01-18 09:44:54 EST ---

Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

--- Additional comment from Daniel Erez on 2016-01-19 12:23:41 EST ---

I think it is sufficient (and makes better sense) to add a warning when creating a live snapshot, as the root cause is a failure in the freeze/thaw process.

Note that determining whether the agents (qemu-guest-agent and ovirt-guest-agent/rhevm-guest-agent) are installed is not adequate. We should probably also verify that both agents are running (not sure if the information is already available in vdsm); a quick in-guest check is sketched below.

@Vinzenz - does vdsm expose information regarding the guest agents' statuses? I.e., can we determine whether qemu-guest-agent and ovirt-guest-agent are running?
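
For the "verify both agents are running" part, a quick check from inside the guest is (a sketch; it confirms the services are running, but not that the host can actually reach them over their channels, which is exactly what failed here):

  systemctl is-active qemu-guest-agent.service ovirt-guest-agent.service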

--- Additional comment from Vinzenz Feenstra [evilissimo] on 2016-01-20 02:11:21 EST ---

No, not that I know of - there's an unofficial way to deduce that information for our guest agent, but it relies on something that wasn't made for that purpose.

So, in short: no, it's not exposed.

The qemu-guest-agent can't currently be checked at all from within VDSM - we'd probably have to ask libvirt for it, and I am not sure whether libvirt provides such a thing.
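
As a point of reference, libvirt does offer a passthrough to the qemu guest agent that can serve as a liveness check (a sketch only; <vm-name> is a placeholder, and both commands succeed only when the agent inside the guest answers):

  virsh qemu-agent-command <vm-name> '{"execute": "guest-ping"}'
  virsh qemu-agent-command <vm-name> '{"execute": "guest-info"}'

guest-ping returns an empty "return" object, and guest-info additionally reports the agent version and its supported commands. This still says nothing about the ovirt-guest-agent, which communicates over its own com.redhat.rhevm.vdsm channel.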

Comment 1 Martin Tessun 2018-07-06 14:28:25 UTC
Current RHV tells you whether the GA is up and running (and up to date) in a VM.