Bug 894085

Summary: libvirt: vm pauses after live storage migration

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 6 |
| Component | libvirt |
| Version | 6.4 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Dafna Ron <dron> |
| Assignee | Eric Blake <eblake> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | abaron, acathrow, dallan, dyasny, dyuan, eblake, fsimonce, hateya, iheim, mzhan, pbonzini, rwu, tlavigne, weizhan, whuang, ykaul, zhpeng |
| Target Milestone | rc |
| Target Release | --- |
| Keywords | TestBlocker |
| Fixed In Version | libvirt-0.10.2-16.el6 |
| Doc Type | Bug Fix |
| Cloned to | 896225 (view as bug list) |
| Last Closed | 2013-02-21 07:29:41 UTC |
| Type | Bug |
| Bug Depends On | 892791 |
| Bug Blocks | 896225, 902691 |
| Attachments | logs (attachment 676411) |
(In reply to comment #0)
> Created attachment 676411 [details]
> logs
>
> Description of problem:
>
> I live migrated a vm's disk with rhevm and after the migration is done the
> vm pauses.
> looking at libvirt log I could see this:
>
> The command drive-mirror has not been found and The command
> block-job-complete has not been found
>
> Version-Release number of selected component (if applicable):
>
> sf3
> vdsm-4.10.2-3.0.el6ev.x86_64
> libvirt-0.10.2-14.el6.x86_64
> qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64
>
> How reproducible:
>
> 100%
>
> Steps to Reproduce:
> 1. run a vm
> 2. live migrate the disk
>
> Actual results:
>
> vm pauses after the live migration is reported finished.

I'm not seeing this behavior in the logs; right after the guest finished the block job, libvirt resumed the guest:

```
2013-01-10 14:49:42.675+0000: 14883: debug : qemuMonitorIOWrite:461 : QEMU_MONITOR_IO_WRITE: mon=0x7f7e9800d000 buf={"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/rhev/data-center/afcde1c5-6022-4077-ab06-2beed7e5e404/a45fadeb-2db0-4201-9635-0c7ed6c1fd22/images/3906bf7c-8a7c-45fb-9b38-7b33e63c6817/b5f0db2f-386d-4009-8983-4338202e4bd1","format":"qcow2"},"id":"libvirt-87"}
...
2013-01-10 14:49:42.915+0000: 14883: debug : qemuMonitorIOProcess:353 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7e9800d000 buf={"return": {}, "id": "libvirt-87"}
...
2013-01-10 14:49:42.916+0000: 14887: debug : qemuMonitorSend:903 : QEMU_MONITOR_SEND_MSG: mon=0x7f7e9800d000 msg={"execute":"cont","id":"libvirt-88"}
```

> Additional info: logs
>
> 2013-01-10 14:49:02.340+0000: 14883: debug : qemuMonitorIOProcess:353 :
> QEMU_MONITOR_IO_PROCESS: mon=0x7f7e9800d000 buf={"id": "libvirt-50",
> "error": {"class": "CommandNotFound", "desc": "The command drive-mirror has
> not been found", "data": {"name": "drive-mirror"}}}
> 2013-01-10 14:49:42.674+0000: 14883: debug : qemuMonitorIOProcess:353 :
> QEMU_MONITOR_IO_PROCESS: mon=0x7f7e9800d000 buf={"id": "libvirt-86",
> "error": {"class": "CommandNotFound", "desc": "The command
> block-job-complete has not been found", "data": {"name":
> "block-job-complete"}}}

These two are red herrings - they are expected errors. Libvirt always tries the upstream spelling first, then falls back to the downstream RHEL spellings of __com.redhat_drive-mirror and __com.redhat_drive-reopen. They have no bearing on this bug report. (A sketch further down in this comment shows how to list which spellings a given qemu binary exposes.)

I wonder if the real issue is that libvirt is not issuing an event at the right time; further IRC conversation says:

```
# virsh -r list
 Id    Name                           State
----------------------------------------------------
 2     TEST7                          running

# vdsClient -s 0 list table
454d8222-99f4-4251-9cac-d6817c10a02d  23704  TEST7                Paused
```

so vdsm doesn't know that the guest resumed, even though libvirt does. I'm looking at the code now where libvirt pauses around the snapshot, to see if there is a missing event.

If I'm right about it just being a case of libvirt sending a pause event but not a resume event, then you can use the workaround of `virsh "suspend TEST7; resume TEST7"` to force libvirt to emit events that let vdsm get back in sync.
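The same resync can also be driven through the API instead of virsh; a minimal sketch, assuming the libvirt-python bindings and the TEST7 domain from above:

```python
# Force a fresh SUSPENDED/RESUMED event pair so that a listener such as
# VDSM can get back in sync with libvirt's view of the domain state.
# Assumes the libvirt-python bindings; TEST7 is the domain named above.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('TEST7')

dom.suspend()        # libvirt emits a VIR_DOMAIN_EVENT_SUSPENDED event
dom.resume()         # libvirt emits a VIR_DOMAIN_EVENT_RESUMED event

print(dom.state())   # expect [libvirt.VIR_DOMAIN_RUNNING, <reason>]
```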
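As for the red-herring CommandNotFound errors above, one way to confirm which spelling of the block-copy commands a given qemu binary exposes is to ask the monitor through libvirt's QMP passthrough. A minimal sketch, assuming the libvirt-python and libvirt-qemu bindings (the passthrough is unsupported for production use, so this is for debugging only):

```python
# List qemu's QMP commands and check which drive-mirror spelling exists.
# Assumes the libvirt-python bindings; TEST7 is the domain named above.
import json

import libvirt
import libvirt_qemu

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('TEST7')

# flags=0 sends the string as a QMP (JSON) command, not HMP.
reply = libvirt_qemu.qemuMonitorCommand(
    dom, json.dumps({"execute": "query-commands"}), 0)
names = set(cmd["name"] for cmd in json.loads(reply)["return"])

for spelling in ("drive-mirror", "__com.redhat_drive-mirror",
                 "block-job-complete", "__com.redhat_drive-reopen"):
    print("%s: %s" % (spelling,
                      "supported" if spelling in names else "missing"))
```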
Also, there was a recent upstream patch concerned with emitting events for RESUME from qemu; maybe that will help in this case:

```
commit aedfcce33e4c2f266668a39fd655574fe34f1265
Author: Andres Lagar-Cavilla <andres>
Date:   Mon Jan 7 16:25:01 2013 -0500

    Add RESUME event listener to qemu monitor. Perform all the appropriate
    plumbing.

    When qemu/KVM VMs are paused manually through a monitor not-owned by
    libvirt, libvirt will think of them as "paused" even after they are
    resumed and effectively running. With this patch the discrepancy
    goes away.

    This is meant to address bug 892791.

    Signed-off-by: Andres Lagar-Cavilla <andres>
```

I can reproduce this bug with:

libvirt-0.10.2-14.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64
vdsm-4.10.2-1.1.el6.x86_64

After disk live migration finishes, the vm is paused.

Is it _really_ paused, or just marked as such?

Eric, do we need a qemu-kvm BZ for the event of comment 3?

(In reply to comment #6)
> Is it _really_ paused, or just marked as such?
>
> Eric, do we need a qemu-kvm BZ for the event of comment 3?

The guest is running from both qemu's and libvirt's perspective - the only entity that still thinks the guest is paused is VDSM. But the bug is in libvirt for not giving VDSM enough events to track state correctly. I'm currently testing this patch, and will move the bug into POST if it works:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2013-January/msg00197.html

I tested Eric's build on my setup with iscsi storage and the migration succeeded; the vm did not pause after migration.
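For context on what VDSM is waiting for: a management application tracks these transitions by subscribing to libvirt's lifecycle events. A minimal listener sketch, assuming the libvirt-python bindings (illustrative only, not VDSM's actual code); if libvirt never emits the RESUMED event, a listener like this is left believing the domain is still paused:

```python
# Minimal libvirt lifecycle-event listener; prints SUSPENDED/RESUMED
# transitions as they arrive. Assumes the libvirt-python bindings.
import libvirt

EVENT_NAMES = ["DEFINED", "UNDEFINED", "STARTED", "SUSPENDED",
               "RESUMED", "STOPPED", "SHUTDOWN"]

def lifecycle_cb(conn, dom, event, detail, opaque):
    # 'event' indexes the VIR_DOMAIN_EVENT_* lifecycle enum
    name = EVENT_NAMES[event] if event < len(EVENT_NAMES) else str(event)
    print("%s: %s (detail %d)" % (dom.name(), name, detail))

libvirt.virEventRegisterDefaultImpl()   # must run before opening the connection
conn = libvirt.openReadOnly('qemu:///system')
conn.domainEventRegisterAny(
    None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()    # dispatch pending events
```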
Moving to POST based on comment 11, and the following analysis:

At first, I was wondering if we have a bigger problem here. Right now, the only reason that VDSM sees either event is because the event is being emitted by a helper thread prior to returning the lock to the main thread. This feels like a race window, and I wanted to be sure the race is consistently being won by the qemu event handling. The potential race occurs in at least these places: qemuDomainCoreDump (virsh dump --live), qemuDomainBlockPivot (virsh blockjob --pivot), and qemuDomainSnapshotCreateActiveInternal (virsh snapshot-create-as --live --memspec filename).

In all of those functions, there is no explicit event issued for either the call to qemuProcessStopCPUs (suspending the domain) or qemuProcessStartCPUs (resuming the domain), except that when startCPUs fails, an explicit SUSPENDED_API_ERROR event is issued. So, if you see any event at all, it was from this sequence of code:

- main thread calls virDomainObjGetState, sees domain is running
- main thread calls qemuProcessStopCPUs
- main thread drops lock with qemuDomainObjEnterMonitorAsync
- main thread calls qemuMonitorStopCPUs
- helper thread gets response from monitor, and the response includes the event
- helper thread calls qemuProcessHandleStop
- helper thread requests lock with virDomainObjLock
- helper thread calls virDomainObjGetState, sees domain is still running
- helper thread changes libvirt state to paused with virDomainObjSetState
- helper thread raises the SUSPENDED_PAUSED event
- helper thread drops lock with virDomainObjUnlock
- main thread regains lock with qemuDomainObjExitMonitorWithDriver
- main thread changes libvirt state to paused with virDomainObjSetState

(similarly with qemuProcessStartCPUs and qemuProcessHandleResume)

As long as we can guarantee that qemu emits the event _prior_ to the return value acknowledging the successful pause, the event will ALWAYS be emitted by the helper thread. But if qemu could ever complete the qemuMonitorStopCPUs action prior to emitting an event, then it would be possible to have this alternate sequence of events:

- main thread calls virDomainObjGetState, sees domain is running
- main thread calls qemuProcessStopCPUs
- main thread drops lock with qemuDomainObjEnterMonitorAsync
- main thread calls qemuMonitorStopCPUs
- helper thread gets response from monitor; the event has not been seen yet
- main thread regains lock with qemuDomainObjExitMonitorWithDriver
- main thread changes libvirt state to paused with virDomainObjSetState
- helper thread finally sees an event
- helper thread calls qemuProcessHandleStop
- helper thread requests lock with virDomainObjLock
- helper thread calls virDomainObjGetState, sees domain is already paused
- helper thread discards the event
- helper thread drops lock with virDomainObjUnlock

But it looks like we are lucky that the second result does not happen - this is what I see in the logs coming from qemu:

```
2013-01-10 14:49:05.463+0000: 14883: debug : qemuMonitorIOWrite:461 : QEMU_MONITOR_IO_WRITE: mon=0x7f7e9800d000 buf={"execute":"stop","id":"libvirt-54"} len=38 ret=38 errno=11
...
2013-01-10 14:49:05.471+0000: 14883: debug : qemuMonitorIOProcess:353 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7e9800d000 buf={"timestamp": {"seconds": 1357829345, "microseconds": 470765}, "event": "STOP"}
{"return": {}, "id": "libvirt-54"} len=117
```

That is, we always see the event emitted prior to returning control to the main thread, so we don't have to worry about the race in RHEL. It may still be worth worrying about upstream, if qemu is ever changed to emit the event after the return for an individual command, but for now I'm confident that we have solved the problem for RHEL, and am moving this to POST.
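For illustration only, the check-then-discard behavior of the helper thread described above can be modeled in a few lines; this is a hypothetical Python model of the C paths named above, not libvirt source:

```python
# Toy model of the helper-thread STOP handler (qemuProcessHandleStop):
# the SUSPENDED_PAUSED event only reaches clients if the domain is still
# marked running when the helper thread acquires the object lock.
import threading

lock = threading.Lock()          # stands in for virDomainObjLock
domain_state = "running"

def emit(event):
    print("event delivered to clients: %s" % event)

def handle_stop_event():
    global domain_state
    with lock:
        if domain_state == "running":    # virDomainObjGetState analogue
            domain_state = "paused"      # virDomainObjSetState analogue
            emit("SUSPENDED_PAUSED")     # clients such as VDSM see the pause
        # If the main thread flipped the state first (the losing side of
        # the race), the event is silently discarded and clients never
        # learn that the domain was paused.

handle_stop_event()   # state is still "running", so the event is delivered
```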
Verified this bug with:

libvirt-0.10.2-16.el6.x86_64
vdsm-4.10.2-2.0.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.352.el6.x86_64

1) prepare two nfs storage domains, A and B
2) start a guest in A
3) in the Disks tab, "move" the guest's image from A to B
4) after disk live migration the guest is still running, not paused

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html

Original report (comment #0):

Created attachment 676411 [details]
logs

Description of problem:

I live migrated a vm's disk with rhevm and after the migration is done the vm pauses. Looking at the libvirt log I could see this: "The command drive-mirror has not been found" and "The command block-job-complete has not been found".

Version-Release number of selected component (if applicable):

sf3
vdsm-4.10.2-3.0.el6ev.x86_64
libvirt-0.10.2-14.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. run a vm
2. live migrate the disk

Actual results:

vm pauses after the live migration is reported finished.

Expected results:

vm should not pause

Additional info: logs

```
2013-01-10 14:49:02.340+0000: 14883: debug : qemuMonitorIOProcess:353 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7e9800d000 buf={"id": "libvirt-50", "error": {"class": "CommandNotFound", "desc": "The command drive-mirror has not been found", "data": {"name": "drive-mirror"}}}
2013-01-10 14:49:42.674+0000: 14883: debug : qemuMonitorIOProcess:353 : QEMU_MONITOR_IO_PROCESS: mon=0x7f7e9800d000 buf={"id": "libvirt-86", "error": {"class": "CommandNotFound", "desc": "The command block-job-complete has not been found", "data": {"name": "block-job-complete"}}}

tail: /var/log/libvirtd.log: file truncated
```
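For completeness, an approximate standalone equivalent of the "live migrate the disk" step, assuming the libvirt-python bindings; the domain name, disk target, and destination path are placeholders, and rhevm actually drives this flow through VDSM, so this is only a rough sketch (older libvirt also restricted block copy to transient domains):

```python
# Rough standalone block copy of a running guest's disk, approximating
# what VDSM asks libvirt to do during live storage migration. Assumes
# the libvirt-python bindings; 'TEST7', 'vda', and the destination path
# are placeholders, not values from this bug.
import time

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('TEST7')

disk = 'vda'
dest = '/var/lib/libvirt/images/copy.qcow2'

# Start mirroring the disk to the destination image.
dom.blockRebase(disk, dest, 0, libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY)

# Wait until the mirror has caught up with the guest's writes.
while True:
    info = dom.blockJobInfo(disk, 0)
    if info and info['cur'] == info['end']:
        break
    time.sleep(1)

# Pivot the guest onto the copy; this bug bit right here, when the
# guest was supposed to resume after the job completed.
dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

print(dom.state())   # expect [libvirt.VIR_DOMAIN_RUNNING, <reason>]
```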