Bug 1297919 - Preview snapshot and undo causes filesystem corruption to an attached disk
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Assignee: Daniel Erez
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: 1135132 1299513
 
Reported: 2016-01-12 18:47 UTC by Natalie Gavrielov
Modified: 2016-03-10 07:28 UTC (History)
CC: 14 users

Fixed In Version:
Clone Of:
Clones: 1299513 (view as bug list)
Environment:
Last Closed: 2016-01-19 17:13:39 UTC
oVirt Team: Storage
Embargoed:
tnisan: ovirt-3.6.z?
gklein: blocker?
ngavrilo: planning_ack?
ngavrilo: devel_ack?
ngavrilo: testing_ack?


Attachments
logs: engine.log, libvirtd.log, qemu, vdsm, cinder (14.14 MB, application/x-gzip)
2016-01-12 18:47 UTC, Natalie Gavrielov
no flags Details
logs: engine.log, libvirtd.log, qemu, vdsm, cinder (12.03 MB, application/x-gzip)
2016-01-13 16:40 UTC, Natalie Gavrielov
no flags Details

Description Natalie Gavrielov 2016-01-12 18:47:51 UTC
Created attachment 1114088 [details]
logs: engine.log, libvirtd.log, qemu, vdsm, cinder

Description of problem:

Preview snapshot and undo causes filesystem corruption to an attached Cinder disk.

Version-Release number of selected component:

rhevm-3.6.2-0.1.el6.noarch
vdsm-4.17.15-0.el7ev.noarch

How reproducible:
50%.

Steps to Reproduce:

On a VM that has an OS and an additional attached disk (Cinder), perform the following:

1. Create a file on the attached disk, then create a snapshot (repeat this step 3 times; a guest-side sketch of this step follows the list).
2. Create another file on the attached disk.
3. Preview the first snapshot that was created, then click "Undo".
4. Run the VM and view the 4 files that were created in steps 1 and 2.
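
A minimal guest-side sketch of steps 1-2 (the device name /dev/vdb, the mount point, and the file contents are assumptions; the snapshots themselves are taken from the webadmin UI):

# run inside the guest; /dev/vdb is assumed to be the attached Cinder disk
mkfs.ext4 /dev/vdb                        # one-time: create a filesystem on the attached disk
mkdir -p /mnt/vdb && mount /dev/vdb /mnt/vdb
for i in 1 2 3; do
    echo "marker $i" > /mnt/vdb/$i.txt
    sync                                  # flush the page cache, then take snapshot $i in webadmin
done
echo "marker 4" > /mnt/vdb/4.txt && sync  # step 2: one more file after the three snapshots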

Actual results:

console:

[root@RHEL7Server ~]# cd /mnt/vdb/
[root@RHEL7Server vdb]# ll
ls: cannot access 4.txt: Input/output error
ls: cannot access 3.txt: Input/output error
ls: cannot access 2.txt: Input/output error
total 20
-rw-r--r--. 1 root root    40 Jan 12 18:08 1.txt
-?????????? ? ?    ?        ?            ? 2.txt
-?????????? ? ?    ?        ?            ? 3.txt
-?????????? ? ?    ?        ?            ? 4.txt
drwx------. 2 root root 16384 Jan 12 18:07 lost+found
[root@RHEL7Server vdb]# 

/var/log/messages:

Jan 12 19:10:25 RHEL7Server kernel: EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 15
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 14
Jan 12 19:10:34 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 16
Jan 12 19:12:44 RHEL7Server systemd-logind: Removed session 3.
Jan 12 19:14:26 RHEL7Server systemd: Starting Session 4 of user root.
Jan 12 19:14:26 RHEL7Server systemd: Started Session 4 of user root.
Jan 12 19:14:26 RHEL7Server systemd-logind: New session 4 of user root.
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 15
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 14
Jan 12 19:14:32 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm bash: deleted inode referenced: 16
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 15
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 14
Jan 12 19:14:35 RHEL7Server kernel: EXT4-fs error (device vdb): ext4_lookup:1437: inode #2: comm ls: deleted inode referenced: 16
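
For reference, the on-disk state can be inspected without modifying it by running e2fsck in read-only mode inside the guest (a sketch; the device name is an assumption based on the console output above):

umount /mnt/vdb          # the filesystem must not be mounted while checking
e2fsck -fn /dev/vdb      # -f: force a full check, -n: answer "no" to all prompts (read-only)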


libvirtd.log:

2016-01-12 16:51:56.477+0000: 32637: info : virDBusCall:1554 : DBUS_METHOD_ERROR: 'org.freedesktop.machine1.Manager.TerminateMachine' on '/org/freedesktop/machine1' at 'org.freedesktop.machine1' error org.freedesktop.machine1.NoSuchMachine: No machine 'qemu-vm_cinder_1' known

Additional info:

Note: there is a time shift in the libvirt log (-2 hours); I don't know why it uses a different time zone (the machine is synced with local time).

Comment 1 Allon Mureinik 2016-01-13 11:39:01 UTC
There's no guest agent:

"""
periodic/0::WARNING::2016-01-12 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
jsonrpc.Executor/4::WARNING::2016-01-12 18:09:39,126::vm::2995::virt.vm::(freeze) vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest filesystems: Guest agent is not responding: Guest agent not available for now
"""

Without a guest agent, the snapshot isn't consistent, obviously.
I suggest closing as NOTABUG.

Daniel?
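
For reference, the freeze/thaw step that the engine attempts before the snapshot can be exercised directly through libvirt on the host (a sketch; the domain name here is inferred from the libvirtd log above):

virsh domfsfreeze vm_cinder_1   # expected to fail if the qemu guest agent is not reachable
virsh domfsthaw vm_cinder_1     # thaw again if the freeze did succeed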

Comment 2 Daniel Erez 2016-01-13 12:25:08 UTC
(In reply to Allon Mureinik from comment #1)
> There's no guest agent:
> 
> """
> periodic/0::WARNING::2016-01-12
> 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> jsonrpc.Executor/4::WARNING::2016-01-12
> 18:09:39,126::vm::2995::virt.vm::(freeze)
> vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> filesystems: Guest agent is not responding: Guest agent not available for now
> """
> 
> Without a guest agent, the snapshot isn't consistent, obviously.
> I suggest closing as NOTABUG.
> 
> Daniel?

Indeed.
@Natalie - is it reproduced with guest agent installed and running?

Comment 3 Aharon Canan 2016-01-13 12:39:14 UTC
(In reply to Allon Mureinik from comment #1)
> There's no guest agent:
> 
> """
> periodic/0::WARNING::2016-01-12
> 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> jsonrpc.Executor/4::WARNING::2016-01-12
> 18:09:39,126::vm::2995::virt.vm::(freeze)
> vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> filesystems: Guest agent is not responding: Guest agent not available for now
> """
> 
> Without a guest agent, the snapshot isn't consistent, obviously.

So in such case we should block the option if guest agent is not installed.

Comment 4 Allon Mureinik 2016-01-13 13:19:01 UTC
(In reply to Aharon Canan from comment #3)
> (In reply to Allon Mureinik from comment #1)
> > There's no guest agent:
> > 
> > """
> > periodic/0::WARNING::2016-01-12
> > 18:09:38,162::periodic::258::virt.periodic.VmDispatcher::(__call__) could
> > not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> > [u'f1eb086b-7b28-47a4-bf68-075ee2f17442']
> > jsonrpc.Executor/4::WARNING::2016-01-12
> > 18:09:39,126::vm::2995::virt.vm::(freeze)
> > vmId=`f1eb086b-7b28-47a4-bf68-075ee2f17442`::Unable to freeze guest
> > filesystems: Guest agent is not responding: Guest agent not available for now
> > """
> > 
> > Without a guest agent, the snapshot isn't consistent, obviously.
> 
> So in such case we should block the option if guest agent is not installed.

We have discussed this multiple times, we aren't going to re-discuss it here.

Comment 5 Daniel Erez 2016-01-13 13:27:23 UTC
Re-adding the needinfo on Natalie (see https://bugzilla.redhat.com/show_bug.cgi?id=1297919#c2)

Comment 6 Natalie Gavrielov 2016-01-13 16:40:40 UTC
Created attachment 1114466 [details]
logs: engine.log, libvirtd.log, qemu, vdsm, cinder

Reproduced with the guest agent installed and running (same output when running "ll", and /var/log/messages shows the same type of errors).

Comment 7 Daniel Erez 2016-01-13 16:53:20 UTC
(In reply to Natalie Gavrielov from comment #6)
> Created attachment 1114466 [details]
> logs: engine.log, libvirtd.log, qemu, vdsm, cinder
> 
> Reproduced with guest agent installed and running (same response when
> running "ll" and /var/log/messages shows the same type of errors.

Some more questions for further investigation:
* Did you try it on a new VM with a fresh OS?
* Which OS did you use?
* Is it reproduced with another OS?

Comment 8 Daniel Erez 2016-01-13 17:28:17 UTC
Another thing, according to the latest logs [1], the guest agent on the VM still isn't responsive.

Can you please check the agent is running by using:
'systemctl status ovirt-guest-agent.service'

BTW, more info on:
* http://www.ovirt.org/Understanding_Guest_Agents_and_Other_Tools
* http://www.ovirt.org/How_to_install_the_guest_agent_in_Fedora

[1] 2016-01-13 15:50:45,629 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID: ecc702e6-720a-482f-8840-0a06a851108e, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest agent is not responding: Guest agent not available for now, code = 19 (Failed with error nonresp and code 19)
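
A sketch for checking both agent services inside the guest (the freeze/thaw path goes through the qemu guest agent, so it is worth checking that one as well):

systemctl status ovirt-guest-agent.service
systemctl status qemu-guest-agent.service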

Comment 9 Natalie Gavrielov 2016-01-13 17:32:27 UTC
(In reply to Daniel Erez from comment #8)
> Another thing, according to the latest logs [1], the guest agent on the VM
> still isn't responsive.
> 
> Can you please check the agent is running by using:
> 'systemctl status ovirt-guest-agent.service'
> 
> BTW, more info on:
> * http://www.ovirt.org/Understanding_Guest_Agents_and_Other_Tools
> * http://www.ovirt.org/How_to_install_the_guest_agent_in_Fedora
> 
> [1] 2016-01-13 15:50:45,629 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID:
> ecc702e6-720a-482f-8840-0a06a851108e, Call Stack:
> org.ovirt.engine.core.common.errors.EngineException: EngineException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest
> agent is not responding: Guest agent not available for now, code = 19
> (Failed with error nonresp and code 19)

I checked that already.

Comment 10 Natalie Gavrielov 2016-01-13 17:51:27 UTC
In case comment #9 wasn't clear enough: *before* performing any of the snapshots (in the second test, comment #6),
I installed the guest agent, started it, and made sure it was active.

Comment 11 Aharon Canan 2016-01-14 07:22:25 UTC
(In reply to Daniel Erez from comment #7)
 
> Some more question for further investigation:
> * Did you try it on a new VM with a fresh OS?
> * Which OS did you use?

RHEL 7, as you can see in several places in the description.

> * Is it reproduced with another OS?

I am not sure why we should reproduce with another OS; did you check the code and find that it could be relevant to this OS only?

Did you try to talk to Natalie instead of this ping-pong?
Did you try to recreate it as well?


(In reply to Allon Mureinik from comment #4)
 
> We have discussed this multiple times, we aren't going to re-discuss it here.

We never talked about this; maybe in general, which I don't remember either.
Anyway, if the engine can't see the guest agent, it should block all operations that the guest agent is needed for.


(In reply to Allon Mureinik from comment #1)
> 
> Without a guest agent, the snapshot isn't consistent, obviously.
> I suggest closing as NOTABUG.
> 

This is a bug! This flow can cause data corruption; we need to prevent the user from hitting it, either by fixing it or by blocking it.
You can mark it as CANTFIX if you can't fix it, or WONTFIX if you don't want to, but it is a bug for sure; at the very least we need to add a release note if we are not going to do anything about it.

Comment 12 Allon Mureinik 2016-01-14 09:02:26 UTC
(In reply to Aharon Canan from comment #11)

> (In reply to Allon Mureinik from comment #1)
> > 
> > Without a guest agent, the snapshot isn't consistent, obviously.
> > I suggest closing as NOTABUG.
> > 
> 
> This is a bug! This flow can cause data corruption, we need to prevent the
> user from hitting it by fixing it or blocking it.
> You can mark it as Can't fix if you can't fix or won't fix if you don't want
> to, but it is a bug for sure, at least we need to add release note if we are
> not going to do anything about it.

We don't nanny the user - that's been the definition from day one.

It's completely valid to take a snapshot without a guest agent installed, if the customer decides to do so.
We could maybe add another warning (based on the engine reporting the guest agent's availability - but it still won't be bulletproof), but I struggle to see how that's a productive use of anyone's time, including the QE that will have to verify it later.

Comment 13 Aharon Canan 2016-01-14 09:18:27 UTC
We have data corruption here, do we agree?
Customers shouldn't hit data corruption, do we agree?
I totally agree we shouldn't block snapshots if no guest agent is installed, but as shown, this has nothing to do with whether the guest agent is installed or not.

Taking a snapshot shouldn't cause data corruption, for sure.

I don't know about the day-one definition, but it should be fixed somehow.

Comment 14 Ori Gofen 2016-01-14 11:28:26 UTC
* clarification *
- This bug causes a kernel panic on a second attempt to launch.
- This bug does not reproduce on "regular" oVirt storage.

Comment 15 Daniel Erez 2016-01-17 11:22:45 UTC
So the issue is a failure in fsFreeze since the guest agent is reported as non-responsive [1]. However, the ovirt-guest-agent service is actually installed and running.

Not sure whether the issue lies in libvirt or the guest agent.


@Eric - could it be related to communication between libvirt and the agent? I've found the following error(?) in the logs that might be related:

2016-01-11 13:03:18.213+0000: 30086: info : qemuMonitorIOProcess:452 : QEMU_MONITOR_IO_PROCESS: mon=0x7f0fb4012590 buf={"return": [{"frontend-open": false, "filename": "spicevmc", "label": "charchannel2"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.org.qemu.guest_agent.0,server", "label": "charchannel1"}, {"frontend-open": false, "filename": "disconnected:unix:/var/lib/libvirt/qemu/channels/fbc59883-7054-498b-806d-5ef8480a07dc.com.redhat.rhevm.vdsm,server", "label": "charchannel0"}, {"frontend-open": true, "filename": "unix:/var/lib/libvirt/qemu/domain-vm_nfs_6/monitor.sock,server", "label": "charmonitor"}], "id": "libvirt-3"}
 len=598


@Vinzenz - what could cause the mentioned error? Perhaps ovirt-guest-agent requires a machine reboot after installation? I've been able to reproduce the issue right after installing the service (i.e., without restarting the VM).



[1] 2016-01-13 15:50:45,629 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-27) [] Correlation ID: 47cfc14a, Job ID: ecc702e6-720a-482f-8840-0a06a851108e, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FreezeVDS, error = Guest agent is not responding: Guest agent not available for now, code = 19 (Failed with error nonresp and code 19)
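
A sketch for probing the libvirt <-> guest agent channel directly from the host (the domain name is taken from the monitor socket path in the log above; adjust as needed):

virsh qemu-agent-command vm_nfs_6 '{"execute":"guest-ping"}'
# returns {"return":{}} when the agent answers; fails if the channel is not connected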

Comment 16 Daniel Erez 2016-01-19 13:52:02 UTC
It seems that the qemu-guest-agent service is not activated after installing and activating ovirt-guest-agent, which explains why it works properly after a reboot.

@Vinzenz - is this behavior by design? I.e. should the user manually start both qemu-guest-agent and ovirt-guest-agent after installation (or reboot the machine)?

Comment 17 Vinzenz Feenstra [evilissimo] 2016-01-19 14:48:08 UTC
Yes, it is - with this we're following the Fedora packaging guidelines [1].

There's one exception, however - the downstream rhevm-guest-agent is started by default, but it only has a package dependency on the qemu guest agent and does not depend on it directly, so the qemu guest agent won't be started even though rhevm-guest-agent is started on installation.

[1]: https://fedoraproject.org/wiki/Packaging:DefaultServices?rd=Starting_services_by_default

Comment 18 Daniel Erez 2016-01-19 17:13:17 UTC
(In reply to Vinzenz Feenstra [evilissimo] from comment #17)
> Yes it is - We're following with this the Fedora packaging guidelines [1]
> 
> There's one exception however - The rhevm-guest-agent downstream is started
> by default, but it does only package depend on the qemu guest agent and does
> not depend on it directly so it won't be started even though the
> rhevm-guest-agent is started on installation.
> 
> [1]: https://fedoraproject.org/wiki/Packaging:DefaultServices?rd=Starting_services_by_default

Thanks Vinzenz!

To sum up, both qemu-guest-agent and rhevm-guest-agent/ovirt-guest-agent should be started manually in order for the freeze/thaw APIs to work. Hence, closing this bug.
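
Inside the guest that means something like the following after installing the packages (a sketch for systemd-based guests such as the RHEL 7 VM used here):

systemctl start qemu-guest-agent ovirt-guest-agent
systemctl enable qemu-guest-agent ovirt-guest-agent   # optional: also start them on boot
# alternatively, reboot the VM (per comment 16, the services come up after a reboot)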

@Andrew - do we have a section in the documentation for installing the guest agents? If we do, it should mention that both agents should be started manually (or the machine rebooted) before using any feature relying on them.

Note that as part of bug 1299513 (RFE for adding a warning when no guest agent is installed) we should probably also detect whether the guest agents are running or not.

Comment 19 Andrew Dahms 2016-01-20 02:52:50 UTC
Hi Daniel,

Thank you for the needinfo request.

I have raised the following bug to track the work on this topic - BZ#1300144.

While we include instructions on starting the ovirt-guest-agent, we do not have instructions on starting the qemu-ga service, so we will add them in there.

Let us know at any time if there are any further details we should add in there as well.

Kind regards,

Andrew

Comment 20 Michal Skrivanek 2016-01-21 13:52:36 UTC
libvirt fails when the snapshot is taken. The current vdsm code just retries, explicitly requesting the agent bypass. There is only a log warning.

