| Summary: | creating a snapshot of a paused guest a second time takes a long time | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | yanbing du <ydu> |
| Component: | qemu-kvm | Assignee: | Virtualization Maintenance <virt-maint> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.3 | CC: | acathrow, bsarathy, dallan, dyuan, eblake, mkenneth, mzhan, rwu, tburke, virt-maint, whuang, zhpeng |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-03-15 10:03:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
yanbing du
2012-01-04 11:37:25 UTC
Tested this bug with libvirt-0.9.9-1.el6.x86_64; I can reproduce it. Creating the snapshot is always slow, and if I use Ctrl+C to cancel it, then resuming the guest also gets the timeout error.

Eric, I'm assuming the steps to reproduce are a valid use-case. Is that right?

Steps 1, 2, and 3 are definitely valid. If step 3 is taking a long time, I suspect that might be a qemu bug rather than a libvirt bug; it would be interesting to reproduce things with raw 'savevm' commands directly to qemu, without libvirt in the mix, at which point we may need to clone this BZ. I'm guessing that doing Ctrl-C during a long-running 'virsh snapshot-create' (step 4) is not valid: the current libvirt code is not wired up to be interruptible, and in turn, it looks like qemu does not support aborting a 'savevm' monitor command. Yes, it would be nice to make this interruptible in the future, but I think it would require qemu cooperation. Meanwhile, I can't help but wonder if the failure to grab the 'state change lock' is a sign of a bug in libvirt not gracefully handling the case where virsh dies in the middle; I'll have to take a closer look into that.

I've definitely reproduced a huge slowdown in step 3 when using libvirt; I'm trying to reproduce it using just qemu, but it's looking pretty much like this is squarely a bug in qemu. Meanwhile, the failure to grab the 'state change lock' is not a bug - it is a side effect of the fact that qemu's 'savevm' monitor command is not abortable. If qemu takes a long time, there is nothing libvirt can do about it, and if you Ctrl-C to kill virsh, you still haven't aborted the actual 'savevm'; all future attempts to connect to the same domain while the savevm is still in progress will correctly warn you that a monitor command is still pending. However, rather than reassign this to qemu, I will keep it open on libvirt, because in my testing, I managed to kill libvirtd when doing 'virsh snapshot-delete dom --children first' to delete the first snapshot and all its children; I'm trying to get to a root cause on that crash.

More observations - libvirt's 'virsh snapshot-list' lists the in-memory representation of the snapshot as soon as the 'savevm' monitor command is started, but it isn't saved to disk in /var/lib/libvirt/qemu/snapshot/dom, nor does it show up in 'qemu-img info file', until the 'savevm' monitor command completes. Given that this operation takes a long time, there is a risk that, in the face of a hardware failure or libvirtd restart, libvirt's on-disk state differing from the qcow2 state could cause issues; perhaps libvirt should mark metadata as in-progress when a savevm has been started but not completed, and 'snapshot-list' should not list such in-progress snapshots, so that a restarting libvirtd knows to clean up the partial operation.

(In reply to comment #5)
> However, rather than reassign this to qemu, I will keep it open on libvirt,
> because in my testing, I managed to kill libvirtd when doing 'virsh
> snapshot-delete dom --children first' to delete the first snapshot and all its
> children; I'm trying to get to a root cause on that crash.

The crash on snapshot-delete is tracked in bug 790744.

I'm still trying to prove whether the slowdown can be reproduced using just qemu. All libvirt is doing is issuing 'savevm' monitor commands. It appears that the extreme slowness occurs when qemu is expanding the size of the qcow2 file in order to hold the internal snapshot, and there is nothing that libvirt is doing that gets in the way of qemu.
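For reference, a minimal sketch of the qemu-only test described above, timing a bare 'savevm' in the HMP monitor with libvirt out of the picture. The image path, memory size, and snapshot names are assumptions for illustration, not values from this bug.

```
# Hypothetical repro sketch: issue 'savevm' directly against qemu, no libvirt.
# Use a scratch qcow2 image (or a copy of the guest image under test).
qemu-img create -f qcow2 /tmp/test.qcow2 10G          # throwaway qcow2 image
qemu-kvm -m 1024 -hda /tmp/test.qcow2 -monitor stdio  # HMP monitor on stdio

# Then, at the (qemu) prompt:
#   stop              # pause the guest, matching the paused-guest scenario
#   savevm snap1      # first internal snapshot - note the wall-clock time
#   savevm snap2      # second internal snapshot - compare how long it takes
#   info snapshots    # list the internal snapshots stored in the qcow2 file
```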
I'm going to reassign this to qemu; since the operation is correct (it eventually completes), just slow, it might not be a high priority for the RHEL 6.3 timeframe. It might also be that improvements to 'unlimited' migration speed will reduce the time qemu spends expanding the size of a qcow2 file during a 'savevm'.

savevm is an internal snapshot and not supported in RHEL.
Was QE trying to test the supported live snapshots?

(In reply to comment #9)
> savevm is an internal snapshot and not supported in RHEL.
> Was QE trying to test the supported live snapshots?

I tested it with the new libvirt; live snapshot creation works well:
libvirt-0.9.10-5.el6.x86_64
qemu-kvm-0.12.1.2-2.240.el6rhev.x86_64

1) When the guest is running:
[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806702 created
real    0m0.265s
user    0m0.011s
sys     0m0.005s
[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806703 created
real    0m0.273s
user    0m0.008s
sys     0m0.007s

2) When the guest is paused:
[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806766 created
real    0m0.347s
user    0m0.011s
sys     0m0.005s
[root@intel-w3520-12-1 images]# time virsh snapshot-create r --disk-only
Domain snapshot 1331806767 created
real    0m0.293s
user    0m0.007s
sys     0m0.009s

(In reply to comment #9)
> savevm is an internal snapshot and not supported in RHEL.
> Was QE trying to test the supported live snapshots?

The steps in comment 0 were testing live snapshots, the steps in comment 10 were testing disk snapshots. If RHEL only supports disk snapshots and not live snapshots, then RHEL does not care if live snapshots take a long time, so I agree with the decision to close this bug (although it would still be nice to raise the slow behavior issue to upstream qemu).

(In reply to comment #11)
> (In reply to comment #9)
> > savevm is an internal snapshot and not supported in RHEL.
> > Was QE trying to test the supported live snapshots?
>
> The steps in comment 0 were testing live snapshots, the steps in comment 10
> were testing disk snapshots. If RHEL only supports disk snapshots and not live
> snapshots, then RHEL does not care if live snapshots take a long time, so I
> agree with the decision to close this bug (although it would still be nice to
> raise the slow behavior issue to upstream qemu).

Your terminology is mixed.

RHEL supports live snapshots of the virtual disk using external qcow2 files (without the RAM). That's what we should be focusing on and what should be tested.

--

savevm saves both the disk, using an internal qcow2 snapshot, and the guest RAM. The guest RAM is big, and saving it is a lengthy operation. It is also a synchronous one, so the guest is expected to block. In any case, it is not supported in RHEL.
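As a concrete illustration of the supported path described above, here is a minimal sketch of an external, disk-only live snapshot and how to check the result. The domain name 'dom', disk target 'vda', and file paths are placeholders, not values from this bug.

```
# Hypothetical sketch of the supported flow: an external, disk-only live
# snapshot (no RAM). Domain name, disk target, and paths are assumed.
virsh snapshot-create-as dom disksnap1 --disk-only \
      --diskspec vda,snapshot=external,file=/var/lib/libvirt/images/dom.disksnap1.qcow2

# The guest keeps running; new writes go to the overlay file, and the
# original image becomes its read-only backing file.
virsh domblklist dom                                        # disk now points at the overlay
qemu-img info /var/lib/libvirt/images/dom.disksnap1.qcow2   # shows 'backing file: ...'
```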
(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #9)
> > > savevm is an internal snapshot and not supported in RHEL.
> > > Was QE trying to test the supported live snapshots?
> >
> > The steps in comment 0 were testing live snapshots, the steps in comment 10
> > were testing disk snapshots. If RHEL only supports disk snapshots and not live
> > snapshots, then RHEL does not care if live snapshots take a long time, so I
> > agree with the decision to close this bug (although it would still be nice to
> > raise the slow behavior issue to upstream qemu).
>
> Your terminology is mixed.

Fine, I'll rewrite what I meant.

The steps in comment 0 were testing "system checkpointing", which is the combination of disk state and RAM state. The steps in comment 10 were testing "disk snapshots", which is disk state only. Both forms of snapshot are "live" in the sense that they operate on a running machine, but "system checkpointing" is not supported by RHEL, in part because it takes a long time and suspends the guest during that time. The fact that "system checkpointing" is slow is not a RHEL problem, but should probably be raised upstream to qemu.

> RHEL supports live snapshots of the virtual disk using external qcow2 files
> (without the RAM). That's what we should be focusing on and what should be
> tested.

Agreed, which is why comment 10 is a better test than comment 0, and which is why this is not a RHEL bug.

> savevm saves both the disk, using an internal qcow2 snapshot, and the guest
> RAM. The guest RAM is big, and saving it is a lengthy operation. It is also a
> synchronous one, so the guest is expected to block. In any case, it is not
> supported in RHEL.

*** Bug 800303 has been marked as a duplicate of this bug. ***
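As a follow-up to the checkpoint-versus-disk-snapshot distinction above, a small sketch of how to tell which kind of snapshot an image holds; the image paths and domain name are assumed placeholders.

```
# Hypothetical inspection sketch; paths and names are placeholders.
# Internal snapshots (created by 'savevm' / system checkpointing) live inside
# the qcow2 file itself and appear in its snapshot table:
qemu-img snapshot -l /var/lib/libvirt/images/guest.qcow2

# External disk snapshots instead create a new overlay file whose metadata
# points back at the original image:
qemu-img info /var/lib/libvirt/images/guest.overlay.qcow2   # look for 'backing file:'

# libvirt's own view of a domain's snapshot metadata:
virsh snapshot-list guest
```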