Created attachment 1038890 [details]
Backup script

Description of problem:
Removing snapshots results in non-responsive VMs. The VMs cannot be stopped or shut down from the oVirt GUI. If the VM (qemu process) is killed, the VM cannot be started again; even a restart of libvirtd does not recover it. Only a reboot of the hypervisor host recovers the VM. The problem comes up when removing snapshots after a backup from a Python script, as well as when removing the snapshots manually.

Version-Release number of selected component (if applicable):
ovirt: 3.5.2.1-1
vdsm: 4.16.14-0
gluster: 3.7.0-2
OS for compute and storage: CentOS 7.1 (all patches applied)
OS for hosted engine: CentOS 6.6 (all patches applied)

How reproducible:
In our environment we can always reproduce this using our backup script followed by the snapshot removal script (previously one script, now divided into two); the removal always breaks at some point.

Steps to Reproduce:
1. Run the backup script (in our case against approximately 30 VMs)
2. Run the snapshot removal script
3. At some point one of the VMs will show a "?" and become unresponsive

Actual results:
- VM unresponsive

Expected results:
- VM ok and snapshots removed

Additional info:
The environment is built as follows:
- 4 compute hosts running CentOS 7.1 with 1 bond (2 x 1 Gbit LACP Layer3+4) for ovirtmgmt, 1 bond (2 x 1 Gbit LACP Layer3+4) for the virtual machines, and 1 bond (4 x 1 Gbit LACP Layer3+4) for storage access
- 3 Gluster storage hosts, 1 brick replicated, 1 bond (8 x 1 Gbit LACP Layer3+4)
- 2 compute hosts (same setup as above) for the hosted engine only
- Hosted engine: CentOS 6.6
- 2 x Dell N3048 switches with MLAG for the storage connection (uplinks always 50/50)
- 2 x Dell N3048 switches with MLAG for ovirtmgmt and the VM networks (uplinks always 50/50)
Created attachment 1038891 [details] snapshot removal
Created attachment 1038892 [details] system.log
Created attachment 1038893 [details] vdsm log
IIUC this is a live merge scenario, and it is reproducible when you take a snapshot and remove it while the VM is running. Please confirm whether this is reproducible even with a single VM and one snapshot via webadmin. I suppose the exact libvirt, qemu and OS versions would help.
Hi,

Compute hosts:
operatingsystem => CentOS
operatingsystemmajrelease => 7
operatingsystemrelease => 7.1.1503

libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64

qemu-img-rhev-1.5.3-86.el7.1.x86_64
qemu-kvm-rhev-1.5.3-86.el7.1.x86_64
ipxe-roms-qemu-20130517-6.gitc4bce43.el7.noarch
qemu-kvm-common-rhev-1.5.3-86.el7.1.x86_64
qemu-kvm-tools-rhev-1.5.3-86.el7.1.x86_64

Engine host:
CentOS release 6.6 (Final)
libvirt-client-0.10.2-46.el6_6.6.x86_64

Best regards
René
Adam/Greg, can you take a look please?
Soeren, please answer Michal's needinfo by attempting to reproduce by live merging just a single VM snapshot. We need to determine if this is a scalability or functionality issue.
Hi,

this is reproducible when a virtual machine has multiple snapshots and these snapshots are removed one after the other.

Regards
René
Hello,

here is a newly reproduced error on removing snapshots. The VM had an existing snapshot from our backup script and an additional, manually created snapshot. After 3 hours we removed the manual snapshot and the VM became unresponsive. Added the engine log information.

Regards
René
Created attachment 1044576 [details] VM became unresponsive after snapshot removal
Can you include VDSM's log too please?
Added the attachment.
Created attachment 1044634 [details] VDSM Log of snapshot removal on compute server 4
This looks like a qemu job lock bug in libvirt. I see the following in system.log:

Jun 04 17:08:28 mc-dc3ham-compute-03-live.mc.mcon.net libvirtd[1788]: Cannot start job (modify, none) for domain mc-dc3ham-dpl-01-live-mc-mcon-net; current job is (modify, none) owned by (1814, 0)

and after that, repeatedly:

Jun 04 17:08:28 mc-dc3ham-compute-03-live.mc.mcon.net libvirtd[1788]: Timed out during operation: cannot acquire state change lock

This is causing vdsm's block job monitoring code to throw errors, since the libvirt operations are timing out.
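For anyone triaging similar logs, the starvation pattern above can be spotted mechanically. A minimal sketch (the regex is modeled on the exact log line quoted above; the helper name is invented for illustration):

```python
import re

# Matches libvirtd's job-starvation message, e.g.:
#   Cannot start job (modify, none) for domain X; current job is
#   (modify, none) owned by (1814, 0)
JOB_BLOCKED = re.compile(
    r"Cannot start job \((?P<wanted>\w+), \w+\) for domain (?P<domain>\S+); "
    r"current job is \((?P<held>\w+), \w+\) owned by \((?P<thread>\d+), \d+\)"
)

def find_blocked_jobs(log_lines):
    """Return (domain, wanted_job, held_job, owner_thread) for every
    starved-job message found in the given log lines."""
    hits = []
    for line in log_lines:
        m = JOB_BLOCKED.search(line)
        if m:
            hits.append((m.group("domain"), m.group("wanted"),
                         m.group("held"), m.group("thread")))
    return hits
```

Feeding it the system.log excerpt above yields the affected domain plus the two contending job types, which is enough to see that two modify jobs are colliding.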
Soeren, what do you mean when you say the VM becomes unresponsive? Can you still ssh into the VM (or make use of any services it may be providing)? My guess is that you are simply unable to open a console to the VM from the oVirt UI (since the libvirt ticketing API needed to establish a session is blocked).
Also, can you please tell us what libvirt version you are using?
Hi,

Compute hosts:
operatingsystem => CentOS
operatingsystemmajrelease => 7
operatingsystemrelease => 7.1.1503

libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64

Unresponsive means no SSH connection and no console control. Shutdown through the web UI isn't possible; there is no reaction from the VM.

The issue from comment 9 occurred on Friday, the 26th of June. The snapshot was created at 01:12 PM and we tried to delete it at 06:42 PM. In this particular case the problem was on our compute host 4 and the VM was called zeiss.

Regards
René
After reviewing this bug again, I saw that one piece of information was missing (I had only mentioned it on the mailing list): this error did not occur when we were running the same script on Fedora 20 using the libvirt preview repository with ovirt 3.5.1 and 3.5.0. I do not have the libvirt versions anymore, but we changed to CentOS 7.1 right after the release of ovirt 3.5.2, and before that we were running the latest version of libvirt.

Soeren
Dear all,

now I have information on the libvirt versions. On CentOS 7.1, where we experience the bug, we have:

libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64

On Fedora 20, where we do NOT have this problem, we have:

libvirt-client-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-nwfilter-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-nodedev-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-storage-1.2.9.1-2.fc20.x86_64
libvirt-python-1.2.9-2.fc20.x86_64
libvirt-daemon-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-qemu-1.2.9.1-2.fc20.x86_64
libvirt-daemon-config-nwfilter-1.2.9.1-2.fc20.x86_64
libvirt-lock-sanlock-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-secret-1.2.9.1-2.fc20.x86_64
libvirt-daemon-kvm-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-network-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-interface-1.2.9.1-2.fc20.x86_64

Regards
Soeren
Target release should be set once a package build is known to fix an issue. Since this bug is not in status MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Hi Eric, As reported in Comment 19, libvirt-daemon-1.2.8-16.el7_1.3.x86_64 has a locking issue (unable to acquire state change lock) and the problem is resolved by libvirt-daemon-1.2.9.1-2.fc20.x86_64. Are you aware of any specific fixes that resolved this kind of bug between these two versions? We're trying to determine if a fix could be backported to el7. Thanks!
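A triage of the changes between those two tags can be sketched as a git range query (run inside a libvirt checkout; the tag names are the versions discussed above, while the path filter and grep pattern are assumptions about how one would compile such a list):

```shell
# List candidate commits between the two releases that touch the qemu
# driver and mention jobs; run from a libvirt git checkout.
git log --oneline v1.2.8..v1.2.9.1 -- src/qemu | grep -i job
```

Cross-checking each resulting hash against the distro packages then shows which fixes were backported and which were not.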
I looked through the upstream logs from v1.2.8..v1.2.9.1. On a first pass, I compiled a list of any commit that seems to mention jobs while touching something in src/qemu, then tracked down which of those commits have been backported to RHEL (+) or are only in Fedora (-):

+ 02d2bd7 blockjob: shuffle block rebase code
- cefe0ba blockjob: split out block info driver...
- ced8136 blockjob: split out block info monitor...
- 08cc14f blockjob: hoist bandwidth scaling out...
- fcbeb2e blockjob: add new monitor json...
- db33cc2 blockjob: allow finer bandwidth tuning...
+ b7e7358 blockcopy: allow block device dest...
+ 4be8855 qemu: Rename DEFAULT_JOB_MASK...
+ b3d2a42 qemu: snapshot: Fix job handling...
+ a4065dc qemu: snapshot: Acquire job earlier...
+ d030447 qemu: snapshot: Fix snapshot function...
+ 2990db9 qemu: snapshot: Simplify error paths
+ 6fcddfc Refactor job statistics
+ 4365247 qemu: Avoid incrementing jobs_queued...
+ 3a86881 Add support for fetching statistics...
- e53ee6c blockcopy: tweak how rebase calls...
- e3f5f8e blockcopy: add qemu implementation...
- eef91f9 blockcopy: add qemu implementation...
- 0a6a6b1 blockjob: allow finer bandwidth...
+ fe7ef7b qemu: Always re-detect backing chain
+ 2086a99 qemu: forbid snapshot-delete...
+ 00331bf qemu: better error message...

Then I started reading the details of the commits not backported. Most of them are related to new features and not appropriate for backport, but one stands out as possibly related:

commit cefe0ba3db92b50068f5098691b95c8da6cf4428
Author: Eric Blake <eblake>
Date: Fri Aug 29 13:36:59 2014 -0600

blockjob: split out block info driver handling

The qemu implementation for virDomainGetBlockJobInfo() has a minor bug: it grabs the qemu job with intent to QEMU_JOB_MODIFY, which means it cannot be run in parallel with any other domain-modifying command.
Among others, virDomainBlockJobAbort() is such a modifying command, and it defaults to being synchronous, and can wait as long as several seconds to ensure that the job has actually finished. Due to the job rules, this means a user cannot obtain status about the job during that timeframe, even though we know that some client management code exists which is using a polling loop on status to see when a job finishes.

This bug has been present ever since blockpull support was first introduced (commit b976165, v0.9.4 in Jul 2011), all because we stupidly tried to cram too much multiplexing through a single helper routine, but was made worse in 97c59b9 (v1.2.7) when BlockJobAbort was fixed to wait longer. It's time to disentangle some of the mess in qemuDomainBlockJobImpl, and in the process relax block job query to use QEMU_JOB_QUERY, since it can safely be used in parallel with any long running modify command.

Technically, there is one case where getting block job info can modify domain XML - we do snooping to see if a 2-phase job has transitioned into the second phase, for an optimization in the case of old qemu that lacked an event for the transition. I claim this optimization is safe (the jobs are all about modifying qemu state, not necessarily xml state); but if it proves to be a problem, we could use the difference between the capabilities QEMU_CAPS_BLOCKJOB_{ASYNC,SYNC} to determine whether we even need snooping, and only request a modifying job in the case of older qemu.

* src/qemu/qemu_driver.c (qemuDomainBlockJobImpl): Move info handling...
(qemuDomainGetBlockJobInfo): ...here, and relax job type.
(qemuDomainBlockJobAbort, qemuDomainBlockJobSetSpeed)
(qemuDomainBlockRebase, qemuDomainBlockPull): Adjust callers.
Signed-off-by: Eric Blake <eblake>

I'm not positive that it is the source of the problem, but comment 14 does mention a log message of one attempted modify job being blocked by another, and that could indeed be explained by trying to track whether a block commit operation has completed. I also wouldn't be surprised if we have lock handling bugs in our recovery after an attempted modify command times out (as that is a harder error path to test), which might explain why, once you first trigger the situation of one command being starved, things get worse after that point.
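The interaction the commit describes can be illustrated with a toy model (all class and method names here are invented for illustration; this is not libvirt's actual locking code): a "modify" job holds the per-domain job lock exclusively, so a status poll that also requests a modify job starves behind a long-running abort, while a "query" job is admitted immediately.

```python
class ToyJobLock:
    """Toy model of libvirt's per-domain qemu job lock (illustration
    only): 'modify' jobs are mutually exclusive; 'query' jobs may run
    alongside an active job, which is what the commit above switches
    virDomainGetBlockJobInfo() to."""

    def __init__(self):
        self.active_modify = None  # name of the modify job holding the lock

    def begin(self, name, kind):
        if kind == "modify" and self.active_modify is not None:
            # roughly what libvirtd reports after waiting out its timeout
            raise TimeoutError("cannot acquire state change lock; "
                               "current job is %s" % self.active_modify)
        if kind == "modify":
            self.active_modify = name
        # 'query' jobs are admitted without taking the lock

    def end(self, name):
        if self.active_modify == name:
            self.active_modify = None


lock = ToyJobLock()
lock.begin("blockJobAbort", "modify")   # long-running synchronous abort

# pre-fix behaviour: the info poll also asked for a modify job and starved
try:
    lock.begin("getBlockJobInfo", "modify")
    starved = False
except TimeoutError:
    starved = True

# post-fix behaviour: the poll is a query job and proceeds immediately
lock.begin("getBlockJobInfo", "query")
```

Under this model the vdsm polling loop described in comment 14 stops timing out as soon as the info call is reclassified as a query job.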
So, how do we proceed here? Eric - can we get a RHEL scratch build with this missing patch (or better yet, an official build which includes it)?
(In reply to Allon Mureinik from comment #23) > So, how do we proceed here? > Eric - can we get a RHEL scratch build with this missing patch (or better > yet, an official build which includes it)? I can't do the official z-stream builds, but I'm preparing a scratch build and will post the link once it is ready...
If the libvirt scratch build helps, we need to clone this bug to libvirt to get an official build.
Awesome, thanks Eric. Aharon - can you try this scratch build and see if it resolves the issue? Thanks!
Customer has requested the patch be applied to 1.2.8-16.el7_1.3.x86_64 not 1.2.8-16.el7_1.5.x86_64 for their testing.
(In reply to William from comment #28) > Customer has requested the patch be applied to 1.2.8-16.el7_1.3.x86_64 not > 1.2.8-16.el7_1.5.x86_64 for their testing. Sure, I'll kick off another scratch build.
Actually, I'm having a hard time figuring out how to make a scratch build use an earlier version number. Really, the only difference between 1.2.8-16.el7_1.3 and el7_1.5 is the fixes for bug 1263317 and bug 1255859, neither of which should impact the problem. But I'm still trying to see if I can coax the scratch builder brew to use a different version number to placate the request.
Not to mention that if we actually spawn a libvirt BZ to make the scratch build official, the libvirt build will be numbered 1.2.8-16.el7_1.6, regardless of what shenanigans I pull on the scratch build.
Does this recreate for you on el7.2?
Please reopen if you can provide the needed info, it should be fixed in el7.2.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days