Bug 1403691
Summary: Snapshot fails trying to add an existing sanlock lease

Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.3
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Nir Soffer <nsoffer>
Assignee: Peter Krempa <pkrempa>
QA Contact: Han Han <hhan>
CC: bmcclain, dyuan, hhan, jdenemar, jsuchane, michal.skrivanek, mkalinin, pkrempa, rbalakri, snagar, teigland, xuzhang, ylavi
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-3.0.0-1.el7
Doc Type: Bug Fix
Doc Text:
Cause:
libvirt attempted to start CPUs even in cases where they were not paused for the snapshot. The CPU resume code also invokes lock manager APIs, which try to acquire the configured leases.
Consequence:
Because the lease was already locked, a new attempt to lock it would fail. This resulted in an error from the snapshot API.
Fix:
The code now properly tracks whether the VM needs to be resumed after the snapshot is completed, and the code acquiring the locks is not invoked when it is not necessary.
Result:
libvirt does not try to lock the managed leases twice.
Story Points: ---
Cloned As: 1406765 (view as bug list)
Last Closed: 2017-08-01 17:19:14 UTC
Type: Bug
Bug Blocks: 1406765, 1415488
Description (Nir Soffer, 2016-12-12 08:10:25 UTC)
Created attachment 1230728 [details]
Logs and configuration files
This is urgent and I request that it be fixed in 7.3.z. Can you escalate? This fix is critical for the 4.1 release.

Hi, I think this bug is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1191901, which hangs when doing an external snapshot after a sanlock lease is added.

(In reply to Han Han from comment #7)
> Hi, I think this bug is blocked by
> https://bugzilla.redhat.com/show_bug.cgi?id=1191901, which will hang when
> doing external snapshot after sanlock lease.

Han, why do you think this depends on the other bug? Here there is no deadlock; libvirt simply fails to acquire the lock. Also, here we use a very different configuration:

# vim /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 0

It is possible that both bugs are caused by the same root cause, trying to modify an acquired lease, but I don't see any dependency.

Well, you're right. They are just similar testing scenarios, no dependency. I will add the bug to 'See Also'.

Han, can you reproduce this bug with virsh, in the same way you reproduced bug 1191901? I guess it will be useful for testing.

Patch fixing this particular locking problem posted upstream:
https://www.redhat.com/archives/libvir-list/2016-December/msg00619.html

Fixed upstream:

commit 4b951d1e38259ff5d03e9eedb65095eead8099e1
Author: Peter Krempa <pkrempa>
Date: Wed Dec 14 08:01:34 2016 +0100

    qemu: snapshot: Don't attempt to resume cpus if they were not paused

    External disk-only snapshots with recent enough qemu don't require
    libvirt to pause the VM. The logic determining when to resume cpus was
    slightly flawed and attempted to resume them even if they were not
    paused by the snapshot code.

    This normally was not a problem, but with locking enabled the code
    would attempt to acquire the lock twice. The fallout of this bug was
    an error from the API, even though the snapshot was actually created.

    The bug was introduced when adding support for external snapshots with
    memory (checkpoints) in commit f569b87.

Created attachment 1234263 [details]
bug reproduce
I reproduced the bug on RHEL 7.3 with virsh.
Steps (please refer to the script in the attachment for details):
1. Set up the configuration for libvirt sanlock
2. Restart the libvirtd and sanlock services
3. Manually create the locking files
4. Add a <lease>..</lease> element to the VM's XML
5. Start the VM and try to create an external snapshot
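Editorial note: the double-acquire failure that the upstream commit describes can be modeled with a small sketch. This is a toy model with hypothetical names, not libvirt code; it only illustrates why resuming CPUs unconditionally re-enters the lock manager and trips over the lease the VM already holds.

```python
class LeaseHeldError(Exception):
    """Raised on a second acquire of a lease that is already held."""


class LockManager:
    """Toy stand-in for the sanlock lock manager: one lease per resource."""

    def __init__(self):
        self.held = set()

    def acquire(self, lease):
        if lease in self.held:
            # sanlock reports "Failed to acquire lock: File exists" here
            raise LeaseHeldError(lease)
        self.held.add(lease)


def snapshot(lock_mgr, lease, paused_by_snapshot, buggy=False):
    """Sketch of the resume decision after an external snapshot.

    The buggy variant resumes CPUs (and therefore re-acquires the lease)
    unconditionally; the fixed variant resumes only if the snapshot code
    itself paused the CPUs.
    """
    if buggy or paused_by_snapshot:
        lock_mgr.acquire(lease)  # resuming CPUs re-enters the lock manager
    return "snapshot created"


mgr = LockManager()
mgr.acquire("vm1-lease")  # lease taken when the VM started

# Fixed logic: a disk-only snapshot never paused the CPUs, so no re-acquire
print(snapshot(mgr, "vm1-lease", paused_by_snapshot=False))

# Buggy logic: unconditional resume trips over the already-held lease,
# so the API reports an error even though the snapshot itself was created
try:
    snapshot(mgr, "vm1-lease", paused_by_snapshot=False, buggy=True)
except LeaseHeldError:
    print("Failed to acquire lock: File exists")
```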
Created attachment 1234264 [details]
libvirtd log
Hi, Peter. I tested your scratch build from the last comment and found another error:
# virsh snapshot-create-as n1 s1 --disk-only
error: unsupported configuration: Read/write, exclusive access, disks were present, but no leases specified
Please check it.
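Editorial note: the error above comes from a sanity check in the sanlock driver, which the thread later resolves by setting 'require_lease_for_disks = 0'. A toy model of that check (hypothetical names and data shapes, not libvirt code):

```python
def check_leases(disks, leases, require_lease_for_disks=True):
    """Model of the sanlock driver's sanity check: when
    require_lease_for_disks is enabled, every read/write,
    exclusive-access disk must be covered by a configured lease."""
    rw_exclusive = [d for d in disks
                    if not d.get("readonly") and not d.get("shared")]
    if require_lease_for_disks and rw_exclusive and not leases:
        raise ValueError("Read/write, exclusive access, disks were "
                         "present, but no leases specified")


disks = [{"name": "vda", "readonly": False, "shared": False}]

# Default settings reject a lease-less VM with a writable disk
try:
    check_leases(disks, leases=[])
except ValueError as err:
    print(err)

# Disabling the check (require_lease_for_disks = 0) allows it,
# as does configuring a <lease> element for the VM
check_leases(disks, leases=[], require_lease_for_disks=False)
check_leases(disks, leases=["libvirt-sanlock/test-disk-resource-lock"])
```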
I tested the build from comment 22 - works for me. Tested with vdsm from this patch: https://gerrit.ovirt.org/65465/

Each VM was started with one lease device using the VM id as the key, and a sanlock resource on block/file storage.

Positive flows:
- live migration
- live snapshot
- live snapshot with memory
- live storage migration
- live merge
- hot plug disk
- hot unplug disk

Negative flows:
- restarting vdsm while VMs are running: recovery seems OK
- blocking storage on a host running a VM with NFS storage: sanlock terminates the VM
- blocking storage on a host running a VM with block storage: sanlock terminates the VM
- trying to use the same lease in two VMs: one VM starts, the other fails to start with the error:
  VM test-v4-iscsi is down with error. Exit message: resource busy: Failed to acquire lock: error -243.

Everything looks good from the oVirt side. In the libvirt logs there are some unexpected errors like:

2016-12-22 01:15:38.680+0000: 11094: error : virLockManagerSanlockAcquire:998 : Failed to acquire lock: File exists

These errors do not seem to affect vdsm. Attaching libvirt logs for reference.

Created attachment 1234594 [details]
libvirt logs from successful run with the scratch build from comment 22

Peter, I guess that this is still broken on Fedora, right?

We need this fix also on Fedora 24/25. Should I open a Fedora bug for this?

(In reply to Nir Soffer from comment #25)
> Peter, I guess that this is still broken on Fedora, right?
>
> We need this fix also on Fedora 24/25. Should I open a Fedora bug for this?

Yes. For backports to Fedora please file bzs for the appropriate releases.

(In reply to Han Han from comment #20)
> Created attachment 1234264 [details]
> libvirtd log
>
> Hi, Peter.
> I tested your scratch build from the last comment and found another error:
> # virsh snapshot-create-as n1 s1 --disk-only
> error: unsupported configuration: Read/write, exclusive access, disks were
> present, but no leases specified

You need to set 'require_lease_for_disks = 0' in the locking config file for this particular case.

One additional patch fixes a slight problem with qemu remaining paused after a snapshot with the --live option:

commit 2e86c0816fc8ab573745f1a9a650be09bd66e300
Author: Peter Krempa <pkrempa>
Date: Wed Jan 4 13:23:31 2017 +0100

    qemu: snapshot: Resume VM after live snapshot

    Commit 4b951d1e38259ff5d03e9eedb65095eead8099e1 missed the fact that
    the VM needs to be resumed after a live external checkpoint (memory
    snapshot) where the cpus would be paused by the migration rather than
    libvirt.

That's usually how Z-stream works. 1406765 was already released in 7.3.z; the release for 7.4 had not been done yet.

Verified on:
libvirt-3.2.0-3.virtcov.el7.x86_64
sanlock-3.5.0-1.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

1. Set the configuration as follows.

In qemu-sanlock.conf:
user = "sanlock"
group = "sanlock"
require_lease_for_disks = 0
host_id = 1
auto_disk_leases = 0
disk_lease_dir = "/var/lib/libvirt/sanlock"

In qemu.conf:
lock_manager = "sanlock"

2. Create the file for the lockspace and resource:
# DOM=V-lock
# lockspace_name=libvirt-sanlock
# lockspace_resource_path=/var/lib/libvirt/sanlock/libvirt-sanlock
# resource_name=test-disk-resource-lock
# resource_offset=1048576
# truncate -s 2M $lockspace_resource_path
# chown sanlock:sanlock $lockspace_resource_path
# sanlock direct init -s $lockspace_name:0:$lockspace_resource_path:0
# sanlock add_lockspace -s $lockspace_name:1:$lockspace_resource_path:0
# sanlock direct init -r $lockspace_name:$resource_name:$lockspace_resource_path:$resource_offset
# restorecon -R -v /var/lib/libvirt/sanlock

3.
Cold plug the following XML into the VM:
# cat lease.xml
<lease>
  <lockspace>libvirt-sanlock</lockspace>
  <key>test-disk-resource-lock</key>
  <target path='/var/lib/libvirt/sanlock/libvirt-sanlock' offset='1048576'/>
</lease>
# virsh attach-device $DOM ./lease.xml --config

4. Start the VM and create snapshots with different options:
# virsh start $DOM
# virsh snapshot-create-as $DOM s1 --disk-only
# virsh snapshot-create-as $DOM s2 --memspec /var/lib/libvirt/images/$DOM-mem.$i
# virsh snapshot-create-as $DOM $i --memspec /var/lib/libvirt/images/$DOM-mem.$i --live
# virsh snapshot-create-as $DOM $i --memspec /var/lib/libvirt/images/$DOM-mem.$i --diskspec hda,file=/var/lib/libvirt/images/$DOM.$i
# virsh snapshot-create-as $DOM $i --memspec /var/lib/libvirt/images/$DOM-mem.$i --diskspec hda,file=/var/lib/libvirt/images/$DOM.$i --live

The VM started and all snapshots were created successfully. Bug fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1846