Bug 654023
| Summary: | [kvm] [DW] guest moves to non-responding on failed migration when source fails to communicate storage server | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Haim <hateya> |
| Component: | kvm | Assignee: | Juan Quintela <quintela> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.5.z | CC: | abaron, bazulay, danken, hateya, iheim, mgoldboi, mkenneth, tburke, virt-maint, yeylon, ykaul |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 665820 (view as bug list) | Environment: | |
| Last Closed: | 2010-11-28 10:13:14 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 580948, 653898, 665820, 711374 | | |
Description
Haim
2010-11-16 16:33:33 UTC
vdsm22-4.5-62.26.el5_5rhev2_2
kernel-2.6.18-194.26.1.el5
kvm-83-164.el5_5.23

What does the "non responsive" state mean for qemu? No monitor? Something else?

Yes, Haim means that qmp is blocking forever.

qmp on rhel5? BTW, what does the customer expect us to do? The migration code flushes the memory state to the block devices (in this case NFS), and then does the migration. From what I see, there are only two solutions:

- stop the source of the migration with an error (it can't continue anyway, no disk)
- migrate the "pending" state to the destination host, and finish it there

No code exists today for doing that, and as far as I know, there is no plan to add it. Haim, what do you expect to happen?

I suspect that it's hanging in qemu_aio_wait while it's waiting for all running AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm this by attaching a gdb to the hanging instance and getting a backtrace?

If so, and if I interpret timeout 7000 correctly as 700s, this may take more than ten minutes. No wonder the migration times out. If this is the reason, there's nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.

(In reply to comment #5)
> I suspect that it's hanging in qemu_aio_wait while it's waiting for all running
> AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm
> this by attaching a gdb to the hanging instance and getting a backtrace?
>
> If so, and I interpret timeout 7000 correctly as 700s, this may take more than
> ten minutes. No wonder that migration times out. If this is the reason, there's
> nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.

Why not define an internal timeout and send an interrupt to cancel the I/O and continue with the migration? Normally people use hard mounts, in which case this would never return (and we are planning on moving to hard mounts ourselves...)
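The backtrace requested above can be collected non-interactively with `gdb -p <qemu_pid> -batch -ex 'thread apply all bt'` (where `<qemu_pid>` is the hung qemu-kvm process; the PID is a placeholder, not from this report). The "timeout 7000 = 700s" reading follows from the NFS `timeo` mount option being expressed in tenths of a second; a minimal sketch of that conversion:

```shell
# The NFS "timeo" mount option is in deciseconds (tenths of a second),
# so timeo=7000 means each RPC attempt waits 700 seconds before retrying.
timeo=7000
echo $((timeo / 10))   # seconds per RPC attempt -> 700
```

With a few retransmissions on top of that 700-second wait, a hung AIO request can easily outlive the migration timeout, which matches the "more than ten minutes" estimate in the comment.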
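The suggestion to "choose a smaller NFS timeout" would translate into mount options roughly like the following; the server name, export path, and mount point are hypothetical, and as the last comment notes, a soft mount conflicts with setups that deliberately use hard mounts:

```shell
# Hypothetical: soft-mount the storage export with a short timeout so stuck
# AIO requests fail quickly instead of blocking migration for minutes.
# timeo=100 -> 10 s per RPC attempt; retrans=3 -> give up after 3 retries.
mount -t nfs -o soft,timeo=100,retrans=3 storage-server:/export /mnt/vmimages
```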