Created attachment 1038890 [details]
Backup script

Description of problem:
Removing snapshots results in non-responsive VMs. The VMs cannot be stopped or shut down from the oVirt GUI. If the VM (qemu process) is killed, the VM cannot be started again; even a restart of libvirtd does not recover it. Only a reboot of the hypervisor host recovers the VM. The problem comes up when removing snapshots after a backup from a Python script, as well as when removing the snapshots manually.

Version-Release number of selected component (if applicable):
ovirt: 3.5.2.1-1
vdsm: 4.16.14-0
gluster: 3.7.0-2
OS for compute and storage: CentOS 7.1 (all patches applied)
OS for hosted engine: CentOS 6.6 (all patches applied)

How reproducible:
In our environment we can always reproduce this using our backup script followed by the snapshot removal script (previously one script, now divided into two); the removal always breaks at some point.

Steps to Reproduce:
1. Run the backup script (in our case against approximately 30 VMs)
2. Run the snapshot removal script
3. At some point one of the VMs will show a "?" and become unresponsive

Actual results:
- VM unresponsive

Expected results:
- VM ok and snapshots removed

Additional info:
The environment is built as follows:
- 4 compute hosts running CentOS 7.1 with 1 bond (2 x 1 Gbit LACP Layer3+4) for ovirtmgmt, 1 bond (2 x 1 Gbit LACP Layer3+4) for the virtual machines, and 1 bond (4 x 1 Gbit LACP Layer3+4) for storage access
- 3 Gluster storage hosts, 1 brick replicated, 1 bond (8 x 1 Gbit LACP Layer3+4)
- 2 compute hosts (same setup as above) for the hosted engine only
- Hosted engine: CentOS 6.6
- 2 x Dell N3048 switches with MLAG for the storage connection (uplinks always 50/50)
- 2 x Dell N3048 switches with MLAG for ovirtmgmt and the VM networks (uplinks always 50/50)
Created attachment 1038891 [details] snapshot removal
Created attachment 1038892 [details] system.log
Created attachment 1038893 [details] vdsm log
IIUC this is a live merge scenario, and it is reproducible when you take a snapshot and remove it while the VM is running. Please confirm whether this is reproducible even with a single VM and one snapshot via webadmin. I suppose the exact libvirt, qemu and OS versions would help.
Hi,

Compute hosts:
operatingsystem => CentOS
operatingsystemmajrelease => 7
operatingsystemrelease => 7.1.1503

libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64

qemu-img-rhev-1.5.3-86.el7.1.x86_64
qemu-kvm-rhev-1.5.3-86.el7.1.x86_64
ipxe-roms-qemu-20130517-6.gitc4bce43.el7.noarch
qemu-kvm-common-rhev-1.5.3-86.el7.1.x86_64
qemu-kvm-tools-rhev-1.5.3-86.el7.1.x86_64

Engine host:
CentOS release 6.6 (Final)
libvirt-client-0.10.2-46.el6_6.6.x86_64

Best regards
René
Adam/Greg, can you take a look please?
Soeren, please answer Michal's needinfo by attempting to reproduce by live merging just a single VM snapshot. We need to determine if this is a scalability or functionality issue.
Hi,

this is reproducible when a virtual machine has multiple snapshots and these snapshots are removed one after the other.

Regards
René
Hello,

here is a newly reproduced error on removing snapshots. The VM had an existing snapshot from our backup script and an additional, manually created snapshot. After 3 hours we removed the manual snapshot and the VM became unresponsive. Added the engine log information.

Regards
René
Created attachment 1044576 [details] VM became unresponsive after snapshot removal
Can you include VDSM's log too please?
Added the attachment.
Created attachment 1044634 [details] VDSM Log of snapshot removal on compute server 4
This looks like a qemu job lock bug in libvirt. I see the following in system.log:

Jun 04 17:08:28 mc-dc3ham-compute-03-live.mc.mcon.net libvirtd[1788]: Cannot start job (modify, none) for domain mc-dc3ham-dpl-01-live-mc-mcon-net; current job is (modify, none) owned by (1814, 0)

and after that, repeatedly:

Jun 04 17:08:28 mc-dc3ham-compute-03-live.mc.mcon.net libvirtd[1788]: Timed out during operation: cannot acquire state change lock

This is causing vdsm's block job monitoring code to throw errors, since the libvirt operations are timing out.
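For anyone triaging similar logs, the starvation pattern above can be spotted mechanically. A minimal sketch (the regex is modeled on the exact log line quoted above; the helper name is invented for illustration):

```python
import re

# Matches libvirtd's job-starvation message, e.g.:
#   Cannot start job (modify, none) for domain X; current job is
#   (modify, none) owned by (1814, 0)
JOB_BLOCKED = re.compile(
    r"Cannot start job \((?P<wanted>\w+), \w+\) for domain (?P<domain>\S+); "
    r"current job is \((?P<held>\w+), \w+\) owned by \((?P<thread>\d+), \d+\)"
)

def find_blocked_jobs(log_lines):
    """Return (domain, wanted_job, held_job, owner_thread) for every
    starved-job message found in the given log lines."""
    hits = []
    for line in log_lines:
        m = JOB_BLOCKED.search(line)
        if m:
            hits.append((m.group("domain"), m.group("wanted"),
                         m.group("held"), m.group("thread")))
    return hits
```

Feeding it the system.log excerpt above yields the affected domain plus the two contending job types, which is enough to see that two modify jobs are colliding.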
Soeren, what do you mean when you say the VM becomes unresponsive? Can you still ssh into the VM (or make use of any services it may be providing)? My guess is that you are simply unable to open a console to the VM from the oVirt UI (since the libvirt ticketing API needed to establish a session is blocked).
Also, can you please tell us what libvirt version you are using?
Hi,

Compute hosts:
operatingsystem => CentOS
operatingsystemmajrelease => 7
operatingsystemrelease => 7.1.1503

libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64

Unresponsive means no SSH connection and no console control. Shutdown through the web UI isn't possible; there is no reaction from the VM.

The issue from comment 9 occurred on Friday, the 26th of June. The snapshot was created at 01:12 PM and we tried to delete it at 06:42 PM. In this particular case the problem was on our compute host 4 and the VM was called zeiss.

Regards
René
After reviewing this bug again, I saw that one piece of information was missing (I had only mentioned it on the mailing list): this error did not occur when we were running the same script on Fedora 20 using the libvirt preview repository with ovirt 3.5.1 and 3.5.0. I do not have the libvirt versions anymore, but we changed to CentOS 7.1 right after the release of ovirt 3.5.2, and before that we were running the latest version of libvirt.

Soeren
Dear all,

now I have information on the libvirt versions. On CentOS 7.1, where we experience the bug, we have:

libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64

On Fedora 20, where we do NOT have this problem, we have:

libvirt-client-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-nwfilter-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-nodedev-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-storage-1.2.9.1-2.fc20.x86_64
libvirt-python-1.2.9-2.fc20.x86_64
libvirt-daemon-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-qemu-1.2.9.1-2.fc20.x86_64
libvirt-daemon-config-nwfilter-1.2.9.1-2.fc20.x86_64
libvirt-lock-sanlock-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-secret-1.2.9.1-2.fc20.x86_64
libvirt-daemon-kvm-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-network-1.2.9.1-2.fc20.x86_64
libvirt-daemon-driver-interface-1.2.9.1-2.fc20.x86_64

Regards
Soeren
Target release should be set once a package build is known to fix an issue. Since this bug is not in status MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Hi Eric, As reported in Comment 19, libvirt-daemon-1.2.8-16.el7_1.3.x86_64 has a locking issue (unable to acquire state change lock) and the problem is resolved by libvirt-daemon-1.2.9.1-2.fc20.x86_64. Are you aware of any specific fixes that resolved this kind of bug between these two versions? We're trying to determine if a fix could be backported to el7. Thanks!
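A triage of the changes between those two tags can be sketched as a git range query (run inside a libvirt checkout; the tag names are the versions discussed above, while the path filter and grep pattern are assumptions about how one would compile such a list):

```shell
# List candidate commits between the two releases that touch the qemu
# driver and mention jobs; run from a libvirt git checkout.
git log --oneline v1.2.8..v1.2.9.1 -- src/qemu | grep -i job
```

Cross-checking each resulting hash against the distro packages then shows which fixes were backported and which were not.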
I looked through the upstream logs from v1.2.8..v1.2.9.1. On a first pass, I compiled a list of any commit that seems to mention jobs while touching something in src/qemu, then tracked down which of those commits have been backported to RHEL (+) or are only in Fedora (-):

+ 02d2bd7 blockjob: shuffle block rebase code
- cefe0ba blockjob: split out block info driver...
- ced8136 blockjob: split out block info monitor...
- 08cc14f blockjob: hoist bandwidth scaling out...
- fcbeb2e blockjob: add new monitor json...
- db33cc2 blockjob: allow finer bandwidth tuning...
+ b7e7358 blockcopy: allow block device dest...
+ 4be8855 qemu: Rename DEFAULT_JOB_MASK...
+ b3d2a42 qemu: snapshot: Fix job handling...
+ a4065dc qemu: snapshot: Acquire job earlier...
+ d030447 qemu: snapshot: Fix snapshot function...
+ 2990db9 qemu: snapshot: Simplify error paths
+ 6fcddfc Refactor job statistics
+ 4365247 qemu: Avoid incrementing jobs_queued...
+ 3a86881 Add support for fetching statistics...
- e53ee6c blockcopy: tweak how rebase calls...
- e3f5f8e blockcopy: add qemu implementation...
- eef91f9 blockcopy: add qemu implementation...
- 0a6a6b1 blockjob: allow finer bandwidth...
+ fe7ef7b qemu: Always re-detect backing chain
+ 2086a99 qemu: forbid snapshot-delete...
+ 00331bf qemu: better error message...

Then I started reading the details of the commits not backported. Most of them are related to new features and not appropriate for backport, but one stands out as possibly related:

commit cefe0ba3db92b50068f5098691b95c8da6cf4428
Author: Eric Blake <eblake>
Date: Fri Aug 29 13:36:59 2014 -0600

blockjob: split out block info driver handling

The qemu implementation for virDomainGetBlockJobInfo() has a minor bug: it grabs the qemu job with intent to QEMU_JOB_MODIFY, which means it cannot be run in parallel with any other domain-modifying command.
Among others, virDomainBlockJobAbort() is such a modifying command, and it defaults to being synchronous, and can wait as long as several seconds to ensure that the job has actually finished. Due to the job rules, this means a user cannot obtain status about the job during that timeframe, even though we know that some client management code exists which is using a polling loop on status to see when a job finishes.

This bug has been present ever since blockpull support was first introduced (commit b976165, v0.9.4 in Jul 2011), all because we stupidly tried to cram too much multiplexing through a single helper routine, but was made worse in 97c59b9 (v1.2.7) when BlockJobAbort was fixed to wait longer. It's time to disentangle some of the mess in qemuDomainBlockJobImpl, and in the process relax block job query to use QEMU_JOB_QUERY, since it can safely be used in parallel with any long running modify command.

Technically, there is one case where getting block job info can modify domain XML - we do snooping to see if a 2-phase job has transitioned into the second phase, for an optimization in the case of old qemu that lacked an event for the transition. I claim this optimization is safe (the jobs are all about modifying qemu state, not necessarily xml state); but if it proves to be a problem, we could use the difference between the capabilities QEMU_CAPS_BLOCKJOB_{ASYNC,SYNC} to determine whether we even need snooping, and only request a modifying job in the case of older qemu.

* src/qemu/qemu_driver.c (qemuDomainBlockJobImpl): Move info handling...
(qemuDomainGetBlockJobInfo): ...here, and relax job type.
(qemuDomainBlockJobAbort, qemuDomainBlockJobSetSpeed)
(qemuDomainBlockRebase, qemuDomainBlockPull): Adjust callers.
Signed-off-by: Eric Blake <eblake>

I'm not positive that it is the source of the problem, but comment 14 does mention a log message of one attempted modify job being blocked by another, and that could indeed be explained by trying to track whether a block commit operation has completed. I also wouldn't be surprised if we have lock handling bugs in our recovery after an attempted modify command times out (as that is a harder error path to test), which might explain why, once you first trigger the situation of one command being starved, things get worse after that point.
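The interaction the commit describes can be illustrated with a toy model (all class and method names here are invented for illustration; this is not libvirt's actual locking code): a "modify" job holds the per-domain job lock exclusively, so a status poll that also requests a modify job starves behind a long-running abort, while a "query" job is admitted immediately.

```python
class ToyJobLock:
    """Toy model of libvirt's per-domain qemu job lock (illustration
    only): 'modify' jobs are mutually exclusive; 'query' jobs may run
    alongside an active job, which is what the commit above switches
    virDomainGetBlockJobInfo() to."""

    def __init__(self):
        self.active_modify = None  # name of the modify job holding the lock

    def begin(self, name, kind):
        if kind == "modify" and self.active_modify is not None:
            # roughly what libvirtd reports after waiting out its timeout
            raise TimeoutError("cannot acquire state change lock; "
                               "current job is %s" % self.active_modify)
        if kind == "modify":
            self.active_modify = name
        # 'query' jobs are admitted without taking the lock

    def end(self, name):
        if self.active_modify == name:
            self.active_modify = None


lock = ToyJobLock()
lock.begin("blockJobAbort", "modify")   # long-running synchronous abort

# pre-fix behaviour: the info poll also asked for a modify job and starved
try:
    lock.begin("getBlockJobInfo", "modify")
    starved = False
except TimeoutError:
    starved = True

# post-fix behaviour: the poll is a query job and proceeds immediately
lock.begin("getBlockJobInfo", "query")
```

Under this model the vdsm polling loop described in comment 14 stops timing out as soon as the info call is reclassified as a query job.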
So, how do we proceed here? Eric - can we get a RHEL scratch build with this missing patch (or better yet, an official build which includes it)?
(In reply to Allon Mureinik from comment #23) > So, how do we proceed here? > Eric - can we get a RHEL scratch build with this missing patch (or better > yet, an official build which includes it)? I can't do the official z-stream builds, but I'm preparing a scratch build and will post the link once it is ready...
If the libvirt scratch build helps, we need to clone this bug to libvirt to get an official build.
Awesome, thanks Eric. Aharon - can you try this scratch build and see if it resolves the issue? Thanks!
Customer has requested the patch be applied to 1.2.8-16.el7_1.3.x86_64 not 1.2.8-16.el7_1.5.x86_64 for their testing.
(In reply to William from comment #28) > Customer has requested the patch be applied to 1.2.8-16.el7_1.3.x86_64 not > 1.2.8-16.el7_1.5.x86_64 for their testing. Sure, I'll kick off another scratch build.
Actually, I'm having a hard time figuring out how to make a scratch build use an earlier version number. Really, the only difference between 1.2.8-16.el7_1.3 and el7_1.5 is the fixes for bug 1263317 and bug 1255859, neither of which should impact the problem. But I'm still trying to see if I can coax the scratch builder brew to use a different version number to placate the request.
Not to mention that if we actually spawn a libvirt BZ to make the scratch build official, the libvirt build will be numbered 1.2.8-16.el7_1.6, regardless of what shenanigans I pull on the scratch build.
Does this recreate for you on el7.2?
Please reopen if you can provide the needed info, it should be fixed in el7.2.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days