Description of problem: a guest machine is running on a host that fails to communicate with the NFS storage server; with the NFS mount timeout set to 7000 and the migration timeout set to 30, trying to migrate the guest to a valid host leads to a state where the migration fails and the guest moves to non-responding.

Scenario: start with 2 hosts, where one host acts as SPM and the second host runs guest machines. Block communication between the second host and the storage server (on the storage side), then migrate a guest to the valid host; the migration fails and the guest moves to non-responding.

Repro steps:
1) make sure to work with 2 hosts
2) make sure to work with NFS storage
3) make sure the NFS timeout (mount options on vdsm) is set to 7000
4) make sure the migration timeout is set to 30
5) make sure to run a guest machine on the non-SPM host ('host B')
6) make sure to block communication between host B and the storage server
7) make sure to migrate the guest to host A (which has a valid connection to the storage)
(a command-line sketch of steps 3 and 6 follows the log excerpt below)

Note that the migration fails on a timeout:

Thread-5331::ERROR::2010-11-16 18:24:04,269::vm::1824::vds.vmlog.b53faba6-83e7-4f21-ac92-58ee62fa2546::Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 1816, in run
    self._waitForOutgoingMigration()
  File "/usr/share/vdsm/vm.py", line 1729, in _waitForOutgoingMigration
    raise _MigrationError('Migration timeout exceeded at source')
_MigrationError: Migration timeout exceeded at source
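A minimal command-line sketch of steps 3 and 6, for reference only: the server name, export path and mount point below are placeholders, vdsm applies the mount options itself in a real setup, and blocking traffic with iptables on host B is just an equivalent of the storage-side block used in the report. Per nfs(5), timeo is given in deciseconds, so timeo=7000 corresponds to roughly 700 seconds per retransmission attempt, meaning a soft mount takes on the order of ten minutes or more to return an I/O error.

  # step 3 (hypothetical): mount the NFS data domain with a soft mount and timeo=7000
  mount -t nfs -o soft,timeo=7000,retrans=3 storage-server:/export/data \
        /rhev/data-center/mnt/storage-server:_export_data

  # step 6 (host-side equivalent of the storage-side block): drop NFS traffic
  # from host B to the storage server
  iptables -A OUTPUT -p tcp -d storage-server --dport 2049 -j DROP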
vdsm22-4.5-62.26.el5_5rhev2_2 kernel-2.6.18-194.26.1.el5 kvm-83-164.el5_5.23
What does the non-responsive state mean for qemu? No monitor? Something else?
yes, Haim means that qmp is blocking forever.
qmp on rhel5?

BTW, what does the customer expect us to do? The migration code flushes memory state to the block devices (in this case NFS), and then does the migration. From what I see, there are only two solutions:
- stop the source of the migration with an error (it can't continue anyway, no disk)
- migrate the "pending" state to the destination host, and do that there

No code exists today for doing that, and as far as I know, there is no plan to do that.

Haim, what do you expect to happen?
I suspect that it's hanging in qemu_aio_wait while it's waiting for all running AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm this by attaching a gdb to the hanging instance and getting a backtrace?

If so, and I interpret timeout 7000 correctly as 700s, this may take more than ten minutes. No wonder that migration times out. If this is the reason, there's nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.
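A hedged example of how to get the requested backtrace on host B; the pgrep pattern is a placeholder and has to be adapted to the qemu-kvm command line of the affected guest:

  # attach gdb to the hanging qemu-kvm process and dump all thread stacks;
  # if the suspicion above is right, the backtrace should show the thread
  # driving the migration sitting inside qemu_aio_wait()
  pid=$(pgrep -f 'qemu-kvm.*<guest-name>')   # <guest-name> is a placeholder
  gdb -p "$pid" -batch -ex 'thread apply all bt'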
(In reply to comment #5)
> I suspect that it's hanging in qemu_aio_wait while it's waiting for all running
> AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm
> this by attaching a gdb to the hanging instance and getting a backtrace?
>
> If so, and I interpret timeout 7000 correctly as 700s, this may take more than
> ten minutes. No wonder that migration times out. If this is the reason, there's
> nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.

Why not define an internal timeout and send an interrupt to cancel the IO and continue with the migration?

Normally people use hard mounts, in which case this would never return (and we are planning on moving to hard mounts ourselves...)
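For contrast, a minimal sketch of the hard-mount case mentioned above, again with placeholder server name and paths: with hard the NFS client retransmits indefinitely instead of ever returning an I/O error, so the stuck AIO requests would never complete and the flush done before migration would block until the storage connection comes back.

  # hard mount: requests are retried forever and I/O never fails with EIO,
  # so qemu_aio_wait() would block for as long as the storage is unreachable
  mount -t nfs -o hard,timeo=600,retrans=2 storage-server:/export/data \
        /rhev/data-center/mnt/storage-server:_export_data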