Description of problem:
The first snapshot created for a paused guest finishes quickly, but creating a second snapshot for the same guest takes a very long time. If that job is cancelled with Ctrl+C, the guest cannot be resumed afterwards; resume fails with a timeout error.

Version-Release number of selected component (if applicable):
libvirt-0.9.4-23.el6_2.3.x86_64
qemu-kvm-0.12.1.2-2.213.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Suspend a running guest
# virsh suspend test
Domain test suspended

# virsh list
 Id    Name                 State
----------------------------------
 7     test                 paused

2. Create a snapshot for the guest
# time virsh snapshot-create-as test s1
Domain snapshot s1 created

real    0m9.139s
user    0m0.005s
sys     0m0.005s

3. Create a snapshot for the guest again
# time virsh snapshot-create-as test s2
Domain snapshot s2 created

real    4m59.982s
user    0m0.006s
sys     0m0.008s

4. During step 3, if the snapshot-create job is cancelled, the snapshot metadata is created by libvirt but not by qemu, and the guest cannot be resumed.
4.1 While the snapshot is being created, press Ctrl+C to cancel it.
4.2 Check the newly created snapshot:
# virsh snapshot-list test
 Name                 Creation Time             State
------------------------------------------------------------
 s1                   2012-01-04 15:24:49 +0800 paused
 s2                   2012-01-04 15:25:20 +0800 paused

# qemu-img info /var/lib/libvirt/images/test.qcow2
image: /var/lib/libvirt/images/test.qcow2
file format: qcow2
virtual size: 6.0G (6442450944 bytes)
disk size: 2.8G
cluster_size: 65536
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         s1                     349M 2012-01-04 15:36:53   00:04:57.177

Here, s2 was not created by qemu.

4.3 Try to resume the guest:
# virsh resume test
error: Failed to resume domain test
error: Timed out during operation: cannot acquire state change lock

# virsh list
 Id    Name                 State
----------------------------------
 7     test                 paused

Actual results:
As described in steps 3 and 4.

Expected results:
Creating the snapshot the second time should also finish quickly, and if the job is cancelled, the guest should resume successfully.

Additional info:
Tested this bug with libvirt-0.9.9-1.el6.x86_64 and can still reproduce it. Creating the snapshot is always slow, and if Ctrl+C is used to cancel it, resuming the guest also hits the timeout error.
Eric, I'm assuming the steps to reproduce are a valid use-case. Is that right?
Steps 1, 2, and 3 are definitely valid. If step 3 is taking a long time, I suspect that might be a qemu bug rather than a libvirt bug; it would be interesting to reproduce things with raw 'savevm' commands directly to qemu without libvirt in the mix, at which point, we may need to clone this BZ. I'm guessing that doing Ctrl-C during a long-running 'virsh snapshot-create' (step 4) is not valid: the current libvirt code is not wired up to be interruptible, and in turn, it looks like qemu does not support aborting a 'savevm' monitor command. Yes, it would be nice to make this interruptible in the future, but I think it would require qemu cooperation. Meanwhile, I can't help but wonder if the failure to grab the 'state change lock' is a sign of a bug in libvirt not gracefully handling the case where virsh dies in the middle; I'll have to take a closer look into that.
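One possible way to try that, sketched here rather than verified: drive 'savevm' through libvirt's HMP passthrough so the snapshot driver code is out of the picture (this still goes through libvirtd's monitor passthrough, just not through the snapshot code; the domain name 'test' and the snapshot tags are simply the ones from comment 0, and a fresh guest without existing snapshots would give the cleanest comparison):

# virsh suspend test
# time virsh qemu-monitor-command test --hmp 'savevm s1'
# time virsh qemu-monitor-command test --hmp 'savevm s2'

If the second 'savevm' is just as slow this way, that would point squarely at qemu.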
I've definitely reproduced a huge slowdown in step 3 when using libvirt; I'm trying to reproduce it using just qemu, but it's looking pretty much like this is squarely a bug in qemu. Meanwhile, the failure to grab the 'state change lock' is not a bug - it is a side-effect of the fact that qemu's 'savevm' monitor command is not abortable; if qemu takes a long time, there's nothing libvirt can do about it, and if you Ctrl-C to kill virsh, you still didn't abort the actual 'savevm', so any later attempt to issue commands to the same domain while the savevm is still on-going will correctly warn you that a monitor command is still pending. However, rather than reassign this to qemu, I will keep it open on libvirt, because in my testing, I managed to kill libvirtd when doing 'virsh snapshot-delete dom --children first' to delete the first snapshot and all its children; I'm trying to get to a root cause on that crash.
More observations - libvirt's 'virsh snapshot-list' lists the in-memory representation of the snapshot as soon as the 'savevm' monitor command is started, but it isn't saved to disk in /var/lib/libvirt/qemu/snapshot/dom, nor does it show up in 'qemu-img info file', until the 'savevm' monitor command completes. Given that this operation takes a long time, there is a risk that a hardware failure or libvirtd restart could leave libvirt's on-disk state different from the qcow2 state and cause issues; perhaps libvirt should mark the metadata as in-progress when a savevm has been started but not completed, and 'snapshot-list' should not list such in-progress snapshots, so that a restarted libvirtd knows to clean up the partial operation.
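For anyone who wants to see that window, roughly (a sketch only; while the slow 'savevm' from step 3 of comment 0 is still in flight, with the domain name and paths from comment 0, and assuming libvirt's usual one-XML-file-per-snapshot metadata layout):

# virsh snapshot-list test
    (s2 is already listed, from libvirt's in-memory data)
# ls /var/lib/libvirt/qemu/snapshot/test/
    (no s2.xml yet)
# qemu-img info /var/lib/libvirt/images/test.qcow2
    (no s2 in the qcow2 snapshot list yet)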
(In reply to comment #5)
> However, rather than reassign this to qemu, I will keep it open on libvirt,
> because in my testing, I managed to kill libvirtd when doing 'virsh
> snapshot-delete dom --children first' to delete the first snapshot and all its
> children; I'm trying to get to a root cause on that crash.

The crash on snapshot-delete is tracked in bug 790744. I'm still trying to prove whether the slowdown can be reproduced using just qemu.
All libvirt is doing is issuing 'savevm' monitor commands. It appears that the extreme slowness occurs when qemu is expanding the size of a qcow2 file in order to hold the internal snapshot, and there is nothing that libvirt is doing that gets in the way of qemu. I'm going to reassign this to qemu; since the operation is correct (it eventually completes), just slow, it might not be a high priority for the RHEL 6.3 timeframe. It might also be that improvements to 'unlimited' migration speed will reduce the time spent in qemu expanding the size of a qcow2 file during a 'savevm'.
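A crude way to confirm where the time goes, for anyone who wants to check (a sketch only; the image path is the one from comment 0 and the one-second interval is arbitrary): watch the allocated size of the qcow2 file grow while the slow snapshot is in flight.

# watch -n 1 'ls -ls /var/lib/libvirt/images/test.qcow2'

If the file keeps growing for most of the several minutes, the time is being spent by qemu expanding the image to hold the internal snapshot, not in libvirt.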
savevm is an internal snapshot and not supported in RHEL. Was QE trying to test the supported live snapshots?
(In reply to comment #9)
> savevm is an internal snapshot and not supported in RHEL.
> Was QE trying to test the supported live snapshots?

I tested with the new libvirt; live snapshot creation works well:
libvirt-0.9.10-5.el6.x86_64
qemu-kvm-0.12.1.2-2.240.el6rhev.x86_64

1) When the guest is running:
[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806702 created

real    0m0.265s
user    0m0.011s
sys     0m0.005s

[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806703 created

real    0m0.273s
user    0m0.008s
sys     0m0.007s

2) When the guest is paused:
[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806766 created

real    0m0.347s
user    0m0.011s
sys     0m0.005s

[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806767 created

real    0m0.293s
user    0m0.007s
sys     0m0.009s
(In reply to comment #9)
> savevm is an internal snapshot and not supported in RHEL.
> Was QE trying to test the supported live snapshots?

The steps in comment 0 were testing live snapshots, the steps in comment 10 were testing disk snapshots. If RHEL only supports disk snapshots and not live snapshots, then RHEL does not care if live snapshots take a long time, so I agree with the decision to close this bug (although it would still be nice to raise the slow behavior issue to upstream qemu).
(In reply to comment #11)
> (In reply to comment #9)
> > savevm is an internal snapshot and not supported in RHEL.
> > Was QE trying to test the supported live snapshots?
>
> The steps in comment 0 were testing live snapshots, the steps in comment 10
> were testing disk snapshots. If RHEL only supports disk snapshots and not live
> snapshots, then RHEL does not care if live snapshots take a long time, so I
> agree with the decision to close this bug (although it would still be nice to
> raise the slow behavior issue to upstream qemu).

Your terminology is mixed.

RHEL supports live snapshots of the virtual disk using external qcow2 files (without the RAM). That's what we should be focusing on and what should be tested.

--

savevm saves both the disk (as an internal qcow2 snapshot) and the guest RAM. The guest RAM is big, so saving it is a lengthy operation. It is also a synchronous one, so the guest is expected to block. In any case, it is not supported in RHEL.
(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #9)
> > > savevm is an internal snapshot and not supported in RHEL.
> > > Was QE trying to test the supported live snapshots?
> >
> > The steps in comment 0 were testing live snapshots, the steps in comment 10
> > were testing disk snapshots. If RHEL only supports disk snapshots and not
> > live snapshots, then RHEL does not care if live snapshots take a long time,
> > so I agree with the decision to close this bug (although it would still be
> > nice to raise the slow behavior issue to upstream qemu).
>
> Your terminology is mixed.

Fine, I'll rewrite what I meant. The steps in comment 0 were testing "system checkpointing", which is the combination of disk state and RAM state. The steps in comment 10 were testing "disk snapshots", which is disk state only. Both forms of snapshot are "live" in the sense that they operate on a running machine, but "system checkpointing" is not supported by RHEL, in part because it takes a long time and suspends the guest during that time. The fact that "system checkpointing" is slow is not a RHEL problem, but should probably be raised upstream to qemu.

> RHEL supports live snapshots of the virtual disk using external qcow2 files
> (without the RAM).
> That's what we should be focusing on and what should be tested.

Agreed, which is why comment 10 is a better test than comment 0, and which is why this is not a RHEL bug.

> --
>
> savevm saves both the disk (as an internal qcow2 snapshot) and the guest RAM.
> The guest RAM is big, so saving it is a lengthy operation. It is also a
> synchronous one, so the guest is expected to block. In any case, it is not
> supported in RHEL.
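To make the distinction concrete using the commands already shown in this bug (a sketch only; the domain 'test' is from comment 0 and the snapshot names here are arbitrary):

# virsh snapshot-create-as test c1
    (system checkpoint: qemu 'savevm', internal qcow2 snapshot of disk + RAM; not supported in RHEL)

# virsh snapshot-create-as test d1 --disk-only
    (disk snapshot: external qcow2 file, disk state only; the supported form, as tested in comment 10)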
*** Bug 800303 has been marked as a duplicate of this bug. ***