Bug 1299513

Summary: Add warning when creating a snapshot on a VM with no guest tools installed
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 3.6.2
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Tal Nisan <tnisan>
Assignee: Tal Nisan <tnisan>
QA Contact: Natalie Gavrielov <ngavrilo>
CC: acanan, amureini, bugs, derez, eblake, gklein, michal.skrivanek, ngavrilo, ogofen, sbonazzo, tnisan, vfeenstr, ylavi
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3.2
Hardware: Unspecified
OS: Unspecified
Flags: rule-engine: ovirt-3.6.z+
       rule-engine: exception+
       rule-engine: planning_ack+
       tnisan: devel_ack+
       acanan: testing_ack+
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Clone Of: 1297919
: 1300566 (view as bug list)
Bug Depends On: 1297919
Bug Blocks: 1135132
Last Closed: 2016-02-23 13:31:35 UTC

Description Tal Nisan 2016-01-18 14:44:53 UTC
+++ This bug was initially created as a clone of Bug #1297919 +++

Description of problem:

Preview snapshot and undo cause filesystem corruption on a cinder-attached disk.

Version-Release number of selected component:

rhevm-3.6.2-0.1.el6.noarch
vdsm-4.17.15-0.el7ev.noarch

How reproducible:
50%.

Steps to Reproduce:

On a VM that has an OS and an additional attached (cinder) disk, perform the following:

1. Create a file on the attached disk, then create a snapshot (perform this step 3 times).
2. Create another file on the attached disk.
3. Preview first snapshot that was created, then "Undo"
4. Run the VM and view the 4 files that were created in steps 1 and 2.
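
For reference, the reproduction flow above can also be driven through the engine's REST API. A minimal sketch in Python, assuming the engine URL, credentials, VM id, and snapshot id (all placeholders), and assuming the 3.x action names previewsnapshot/undosnapshot; verify these against your engine's /api/vms/{id} action list:

# Hypothetical driver for the steps above (not part of the original report).
# ENGINE, AUTH, VM, and SNAP1_ID are placeholders.
import requests

ENGINE = 'https://engine.example.com/api'
AUTH = ('admin@internal', 'password')
VM = 'f1eb086b-7b28-47a4-bf68-075ee2f17442'
HEADERS = {'Content-Type': 'application/xml'}

def post(path, body):
    # POST an XML body to a VM sub-collection or action
    r = requests.post('%s/vms/%s/%s' % (ENGINE, VM, path),
                      data=body, auth=AUTH, headers=HEADERS, verify=False)
    r.raise_for_status()
    return r.text

# Step 1: create a snapshot (run after writing a file on the cinder disk)
post('snapshots', '<snapshot><description>snap1</description></snapshot>')

# Step 3: preview the first snapshot, then undo the preview
post('previewsnapshot', '<action><snapshot id="SNAP1_ID"/></action>')
post('undosnapshot', '<action/>')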

Actual results:

console:

[root@RHEL7Server ~]# cd /mnt/vdb/
[root@RHEL7Server vdb]# ll
ls: cannot access 4.txt: Input/output error
ls: cannot access 3.txt: Input/output error
ls: cannot access 2.txt: Input/output error
total 20
-rw-r--r--. 1 root root    40 Jan 12 18:08 1.txt
-?????????? ? ?    ?        ?            ? 2.txt
-?????????? ? ?    ?        ?            ? 3.txt
-?????????? ? ?    ?        ?            ? 4.txt
drwx------. 2 root root 16384 Jan 12 18:07 lost+found
[root@RHEL7Server vdb]# 

/var/log/messages:

Jan 12 19:10:25 RHEL7Server kernel: EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 15
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 14
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 16
Jan 12 19:12:44 RHEL7Server systemd-logind: Removed session 3.
Jan 12 19:14:26 RHEL7Server systemd: Starting Session 4 of user root.
Jan 12 19:14:26 RHEL7Server systemd: Started Session 4 of user root.
Jan 12 19:14:26 RHEL7Server systemd-logind: New session 4 of user root.
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 15
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 14
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 16
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 15
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 14
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 16


libvirtd.log:

2016-01-12 16:51:56.477+0000: 32637: info : virDBusCall:1554 : DBUS_METHOD_ERROR: 'org.freedesktop.machine1.Manager.TerminateMachine' on '/org/freedesktop/machine1' at 'org.freedesktop.machine1' error org.freedesktop.machine1.NoSuchMachine: No machine 'qemu-vm_cinder_1' known

Additional info:

Note there is a time shift in the libvirt log (-2 hours); I don't know why it uses a different time zone (the machine is synced with local time).

--- Additional comment from Allon Mureinik on 2016-01-13 13:39:01 IST ---

There's no guest agent:

"""
periodic/0::WARNING::2016-01-12 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
jsonrpc.Executor/4::WARNING::2016-01-12 18:09:39,126::vm::2995::virt.vm::(freeze) vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest filesystems: Guest agent is not responding: Guest agent not available for now
"""

Without a guest agent, the snapshot isn't consistent, obviously.
I suggest closing as NOTABUG.

Daniel?
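
For context: the freeze that fails in the vdsm log above is libvirt's virDomainFSFreeze, which vdsm calls before taking a live snapshot. A minimal sketch of that call via libvirt-python - illustrative only, not vdsm's actual code, and the domain name is a placeholder:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_cinder_1')  # placeholder domain name
try:
    # virDomainFSFreeze requires a responsive qemu-guest-agent in the guest
    frozen = dom.fsFreeze()
    print('froze %d filesystems' % frozen)
    dom.fsThaw()
except libvirt.libvirtError as e:
    # surfaces as "Guest agent is not responding" in the vdsm log above
    print('freeze failed: %s' % e)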

--- Additional comment from Daniel Erez on 2016-01-13 14:25:08 IST ---

(In reply to Allon Mureinik from comment #1)
> There's no guest agent:
> 
> """
> periodic/0::WARNING::2016-01-12
> 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> jsonrpc.Executor/4::WARNING::2016-01-12
> 18:09:39,126::vm::2995::virt.vm::(freeze)
> vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> filesystems: Guest agent is not responding: Guest agent not available for now
> """
> 
> Without a guest agent, the snapshot isn't consistent, obviously.
> I suggest closing as NOTABUG.
> 
> Daniel?

Indeed.
@Natalie - is it reproduced with guest agent installed and running?

--- Additional comment from Aharon Canan on 2016-01-13 14:39:14 IST ---

(In reply to Allon Mureinik from comment #1)
> There's no guest agent:
> 
> """
> periodic/0::WARNING::2016-01-12
> 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> jsonrpc.Executor/4::WARNING::2016-01-12
> 18:09:39,126::vm::2995::virt.vm::(freeze)
> vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> filesystems: Guest agent is not responding: Guest agent not available for now
> """
> 
> Without a guest agent, the snapshot isn't consistent, obviously.

So in such case we should block the option if guest agent is not installed.

--- Additional comment from Allon Mureinik on 2016-01-13 15:19:01 IST ---

(In reply to Aharon Canan from comment #3)
> (In reply to Allon Mureinik from comment #1)
> > There's no guest agent:
> > 
> > """
> > periodic/0::WARNING::2016-01-12
> > 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> > not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> > [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> > jsonrpc.Executor/4::WARNING::2016-01-12
> > 18:09:39,126::vm::2995::virt.vm::(freeze)
> > vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> > filesystems: Guest agent is not responding: Guest agent not available for now
> > """
> > 
> > Without a guest agent, the snapshot isn't consistent, obviously.
> 
> So in such case we should block the option if guest agent is not installed.

We have discussed this multiple times, we aren't going to re-discuss it here.

--- Additional comment from Daniel Erez on 2016-01-13 15:27:23 IST ---

Re-adding the needinfo on Natalie (see https://bugzilla.redhat.com/show_bug.cgi?id=1297919#c2)

--- Additional comment from Natalie Gavrielov on 2016-01-13 18:40 IST ---

Created attachment 1114466 [details]
logs: engine.log, libvirtd.log, qemu, vdsm, cinder

Reproduced with the guest agent installed and running (same response when running "ll", and /var/log/messages shows the same type of errors).

--- Additional comment from Daniel Erez on 2016-01-13 18:53:20 IST ---

(In reply to Natalie Gavrielov from comment #6)
> Created attachment 1114466 [details]
> logs: engine.log, libvirtd.log, qemu, vdsm, cinder
> 
> Reproduced with guest agent installed and running (same response when
> running "ll" and /var/log/messages shows the same type of errors.

Some more questions for further investigation:
* Did you try it on a new VM with a fresh OS?
* Which OS did you use?
* Is it reproduced with another OS?

--- Additional comment from Daniel Erez on 2016-01-13 19:28:17 IST ---

Another thing, according to the latest logs [1], the guest agent on the VM still isn't responsive.

Can you please check that the agent is running by using:
'systemctl status ovirt-guest-agent.service'

BTW, more info on:
* http://www.ovirt.org/Understanding_Guest_Agents_and_Other_Tools
* http://www.ovirt.org/How_to_install_the_guest_agent_in_Fedora

[1] 2016-01-13 15:50:45,629 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID: ecc702e6-720a-482f-8840-0a06a851108e, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest agent is not responding: Guest agent not available for now, code = 19 (Failed with error nonresp and code 19)
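
A guest-side sketch of that check in Python, covering both agents; 'systemctl is-active' exits 0 only for a running unit. The unit names are the standard ones; adjust for rhevm-guest-agent where applicable:

import subprocess

for unit in ('ovirt-guest-agent.service', 'qemu-guest-agent.service'):
    # 'systemctl is-active --quiet' returns exit code 0 iff the unit is active
    rc = subprocess.call(['systemctl', 'is-active', '--quiet', unit])
    print('%s: %s' % (unit, 'active' if rc == 0 else 'not active'))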

--- Additional comment from Natalie Gavrielov on 2016-01-13 19:32:27 IST ---

(In reply to Daniel Erez from comment #8)
> Another thing, according to the latest logs [1], the guest agent on the VM
> still isn't responsive.
> 
> Can you please check the agent is running by using:
> 'systemctl status ovirt-guest-agent.service'
> 
> BTW, more info on:
> * http://www.ovirt.org/Understanding_Guest_Agents_and_Other_Tools
> * http://www.ovirt.org/How_to_install_the_guest_agent_in_Fedora
> 
> [1] 2016-01-13 15:50:45,629 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID:
> ecc702e6-720a-482f-8840-0a06a851108e, Call Stack:
> org.ovirt.engine.core.common.errors.EngineException: EngineException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest
> agent is not responding: Guest agent not available for now, code = 19
> (Failed with error nonresp and code 19)

I checked that already.

--- Additional comment from Natalie Gavrielov on 2016-01-13 19:51:27 IST ---

In case comment #9 wasn't clear enough: *before* performing any of the snapshots (in the second test, comment #6),
I installed the guest agent, started it, and made sure it was active.

--- Additional comment from Aharon Canan on 2016-01-14 09:22:25 IST ---

(In reply to Daniel Erez from comment #7)
 
> Some more question for further investigation:
> * Did you try it on a new VM with a fresh OS?
> * Which OS did you use?

RHEL 7, as you can see in several places in the description.

> * Is it reproduced with another OS?

I am not sure why we would reproduce with another OS; did you check the code and find that it can be relevant to this OS only?

Did you try to talk to Natalie instead of this ping-pong?
Did you try to recreate it as well?


(In reply to Allon Mureinik from comment #4)
 
> We have discussed this multiple times, we aren't going to re-discuss it here.

We never talked about this; maybe in general, though I don't remember that either.
Anyway, if the engine can't see the guest agent, it should block all operations that the guest agent is needed for.


(In reply to Allon Mureinik from comment #1)
> 
> Without a guest agent, the snapshot isn't consistent, obviously.
> I suggest closing as NOTABUG.
> 

This is a bug! This flow can cause data corruption; we need to prevent the user from hitting it by fixing it or blocking it.
You can mark it as CANTFIX if you can't fix it, or WONTFIX if you don't want to, but it is a bug for sure; at the very least we need to add a release note if we are not going to do anything about it.

--- Additional comment from Allon Mureinik on 2016-01-14 11:02:26 IST ---

(In reply to Aharon Canan from comment #11)

> (In reply to Allon Mureinik from comment #1)
> > 
> > Without a guest agent, the snapshot isn't consistent, obviously.
> > I suggest closing as NOTABUG.
> > 
> 
> This is a bug! This flow can cause data corruption, we need to prevent the
> user from hitting it by fixing it or blocking it.
> You can mark it as Can't fix if you can't fix or won't fix if you don't want
> to, but it is a bug for sure, at least we need to add release note if we are
> not going to do anything about it.

We don't nanny the user - that's been the definition from day one.

It's completely valid to have a snapshot without a GE installed, if the customer decides to do so.
We can add another warning, maybe (based on the engine reporting the GE's availability - but it still won't be bulletproof), but I struggle to see how that's a productive use of anyone's time, including the QE that will have to verify it later.

--- Additional comment from Aharon Canan on 2016-01-14 11:18:27 IST ---

We have data corruption here, do we agree?
Customers shouldn't hit data corruption, do we agree?
I totally agree we shouldn't block snapshots if no GE is installed, but as proven, it has nothing to do with whether the guest agent is installed or not.

Taking a snapshot certainly shouldn't cause data corruption.

I don't know about the day-one definition, but it should be fixed somehow.

--- Additional comment from Ori Gofen on 2016-01-14 13:28:26 IST ---

* clarification *
- This bug causes a kernel panic on a second attempt to launch.
- This bug does not reproduce on "regular" oVirt storage.

--- Additional comment from Daniel Erez on 2016-01-17 13:22:45 IST ---

So the issue is a failure in fsFreeze since the guest agent is reported as non-responsive [1]. However, the ovirt-guest-agent service is actually installed and running.

Not sure whether the issue lies in libvirt or the guest agent.


@Eric - could it be related to communication between libvirt and the agent? I've found the following error(?) in the logs that might be related:

2016-01-11 13:03:18.213+0000: 30086: info : qemuMonitorIOProcess:452 : QEMU_MONITOR_IO_PROCESS: mon=0x7f0fb4012590 buf={"return": [{"frontend-open": false, "filename": "spicevmc", "label": "charchannel2"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.org.qemu.guest_agent.0,server", "label": "charchannel1"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.com.redhat.rhevm.vdsm,server", "label": "charchannel0"}, {"frontend-open": true, "filename": "unix:/var/lib/libvirt/qemu/domain-vm_nfs_6/monitor.sock,server", "label": "charmonitor"}], "id": "libvirt-3"} len=598
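
The excerpt above is a reply to QEMU's query-chardev command: "frontend-open": false on the guest-agent channel means nothing inside the guest has the virtio-serial port open, which would explain an unresponsive agent. A sketch of reading that reply in Python; the JSON is abbreviated from the log above:

import json

reply = '''{"return": [{"frontend-open": false,
 "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.org.qemu.guest_agent.0,server",
 "label": "charchannel1"}], "id": "libvirt-3"}'''

for dev in json.loads(reply)['return']:
    if 'guest_agent' in dev['filename']:
        # False here means the agent end of the channel is not connected
        print('guest-agent channel open:', dev['frontend-open'])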


@Vinzenz - what could cause the mentioned error? Perhaps the ovirt-guest-agent requires a machine reboot after installation? I've been able to reproduce the issue right after installing the service (i.e., without restarting the VM).



[1] 2016-01-13 15:50:45,629 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID: ecc702e6-720a-482f-8840-0a06a851108e, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest agent is not responding: Guest agent not available for now, code = 19 (Failed with error nonresp and code 19)

Comment 1 Red Hat Bugzilla Rules Engine 2016-01-18 14:44:54 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 2 Daniel Erez 2016-01-19 17:23:41 UTC
I think it's sufficient (and makes more sense) to add a warning when creating a live snapshot, as the root cause is a failure in the freeze/thaw process.

Note that determining whether the agents (qemu-guest-agent and ovirt-guest-agent/rhevm-guest-agent) are installed is not adequate. We should probably also verify that both agents are running (not sure if the information is already available in vdsm).

@Vinzenz - does vdsm expose information regarding the guest agents' statuses? I.e., can we determine whether qemu-guest-agent and ovirt-guest-agent are running?

Comment 3 Vinzenz Feenstra [evilissimo] 2016-01-20 07:11:21 UTC
No, not that I know of - there's an unofficial way to deduce that information for our guest agent, but it relies on something that wasn't made for this purpose.

So, in short: no, it's not exposed.

The qemu-guest-agent can't currently be checked at all from within VDSM - we'd probably have to ask libvirt for it, and I am not sure there's such a thing in it.
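
One way to "ask libvirt", as suggested above: send a guest-ping through the qemu-guest-agent channel. A sketch only - qemuAgentCommand is libvirt's unsupported test API, and the domain name is a placeholder:

import libvirt
import libvirt_qemu

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_cinder_1')  # placeholder domain name
try:
    # raises libvirtError if the agent is not connected/responding
    libvirt_qemu.qemuAgentCommand(dom, '{"execute": "guest-ping"}', 5, 0)
    print('qemu-guest-agent is responding')
except libvirt.libvirtError:
    print('qemu-guest-agent is NOT responding')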

Comment 4 Natalie Gavrielov 2016-02-08 12:30:23 UTC
I've tested the following:

1. When ovirt-guest-agent and qemu-guest-agent are NOT installed - passed, there is a warning.
2. With both installed - passed (no warning).
3. Only qemu-guest-agent installed - no warning. AFAIU there should be a warning; don't "guest tools" include both ovirt-guest-agent and qemu-guest-agent?
4. Is there an RFE or a bug about checking whether these agents are running or not?

Comment 5 Tal Nisan 2016-02-09 15:22:14 UTC
Not sure actually; we basically check vm.getHasAgent() and trust that logic. I'm not sure which agent it checks.
Michal, can you help here?

Comment 6 Daniel Erez 2016-02-09 16:06:29 UTC
(In reply to Natalie Gavrielov from comment #4)
> I've tested the following:
> 
> 1. When ovirt-guest-agent and qemu-guest-agent are NOT installed - passed,
> there is a warning.
> 2. With both installed - passed (no warning).
> 3. Only qemu-guest-agent installed - No warning, AFAIU there should be a
> warning, "guest tools" include both ovirt-guest-agent and qemu-guest-agent?
> 4. Is there RFE or a bug about checking if these agents are running or not?

Yes, I opened RFE 1300566 to cover that ([RFE] expose the status of qemu-guest-agent and ovirt-guest-agent).

Comment 7 Michal Skrivanek 2016-02-09 16:07:31 UTC
Regardless of that RFE, you should be checking whether it's running, not just installed. Don't we have another (albeit ugly) function for that?

Comment 8 Vinzenz Feenstra [evilissimo] 2016-02-10 06:53:42 UTC
(In reply to Michal Skrivanek from comment #7)
> regardless that RFE, you should be checking whether it's running, not just
> installed. Don't we have another (albeit ugly) function for that?

Well, I could make it a dependency of the ovirt-guest-agent service; then, when ovirt-guest-agent is running, that one should also be running.
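
A sketch of that dependency as a hypothetical systemd drop-in shipped in the guest - the path and unit names are assumptions:

# /etc/systemd/system/ovirt-guest-agent.service.d/qemu-ga.conf (hypothetical)
[Unit]
Wants=qemu-guest-agent.service
After=qemu-guest-agent.service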

Comment 9 Natalie Gavrielov 2016-02-10 09:47:35 UTC
Tal,

I haven't received a clear answer to this question:
(In reply to Natalie Gavrielov from comment #4)
> 3. Only qemu-guest-agent installed - No warning, AFAIU there should be a
> warning, "guest tools" include both ovirt-guest-agent and qemu-guest-agent?

So I'm reopening this.

I totally agree with what Michal wrote:

(In reply to Michal Skrivanek from comment #7)
> regardless that RFE, you should be checking whether it's running, not just
> installed.

IMO, the ultimate solution is to check whether these components are installed and running (this issue together with RFE 1300566).
But checking that they're running would be sufficient (because then you can just warn the user that you can't find these processes and suggest that they're not running/installed).
Checking that qemu-guest-agent is installed doesn't really solve the problem.

Comment 10 Red Hat Bugzilla Rules Engine 2016-02-10 09:47:37 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 11 Tal Nisan 2016-02-14 15:41:00 UTC
We've given here the best solution we could for 3.6.3 given the information we have from the guest; it's hacky by nature, but there's no better indication currently, hence the limitations.
When we have more info, as requested in bug 1300566, we can give a better solution; for now, this bug is meant to cover what we currently can.