Bug 654023 - [kvm] [DW] guest moves to non-responding on failed migration when source fails to communicate storage server
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.5.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
: rc
: ---
Assigned To: Juan Quintela
Virtualization Bugs
Depends On:
Blocks: 665820 Rhel5KvmTier2 653898 711374
 
Reported: 2010-11-16 11:33 EST by Haim
Modified: 2014-01-12 19:47 EST (History)
11 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 665820
Environment:
Last Closed: 2010-11-28 05:13:14 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Haim 2010-11-16 11:33:33 EST
Description of problem:

A guest machine is running on a host that fails to communicate with the NFS storage server. With the NFS mount timeout set to 7000 and the migration timeout set to 30, trying to migrate the guest to a valid host leads to a state where the migration fails and the guest moves to non-operational.


scenario: 

Starting with 2 hosts: one host acts as the SPM, the second host runs guest machines.
Block communication between the host and the storage server (on the storage side), then migrate the guest to the valid host; the migration fails, and the guest moves to non-responding.

repro steps: 

1) make sure to work with 2 hosts
2) make sure to work with NFS storage
3) make sure the NFS timeout (mount options on vdsm) is set to 7000
4) make sure the migration timeout is set to 30
5) make sure to run the guest machine on the non-SPM machine ('host B')
6) make sure to block communication between host B and the storage server
7) make sure to migrate the guest to host A (which has a valid connection to storage)

Note that the migration fails on a timeout (reported at the source):

Thread-5331::ERROR::2010-11-16 18:24:04,269::vm::1824::vds.vmlog.b53faba6-83e7-4f21-ac92-58ee62fa2546::Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 1816, in run
    self._waitForOutgoingMigration()
  File "/usr/share/vdsm/vm.py", line 1729, in _waitForOutgoingMigration
    raise _MigrationError('Migration timeout exceeded at source')
_MigrationError: Migration timeout exceeded at source
Comment 1 Haim 2010-11-16 11:34:56 EST
vdsm22-4.5-62.26.el5_5rhev2_2
kernel-2.6.18-194.26.1.el5
kvm-83-164.el5_5.23
Comment 2 Dor Laor 2010-11-21 16:04:38 EST
What does the non-responsive state mean for qemu? No monitor? Something else?
Comment 3 Dan Kenigsberg 2010-11-22 04:54:59 EST
Yes, Haim means that qmp is blocking forever.
Comment 4 Juan Quintela 2010-11-24 09:16:21 EST
qmp on rhel5?

BTW, what does the customer expect us to do?
The migration code flushes the memory state to the block devices (in this case NFS), and then does the migration.  From what I see, there are only two solutions:
- stop the source of the migration with an error
   (it can't continue anyway, no disk)
- migrate the "pending" state to the destination host, and do that there;
  no code exists today for doing that, and as far as I know, there is no plan
  to do that.

Haim, what do you expect to happen?
Comment 5 Kevin Wolf 2010-11-24 09:21:32 EST
I suspect that it's hanging in qemu_aio_wait while it's waiting for all running AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm this by attaching a gdb to the hanging instance and getting a backtrace?

If so, and I interpret timeout 7000 correctly as 700s, this may take more than ten minutes. No wonder that migration times out. If this is the reason, there's nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.
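For reference, the Linux NFS `timeo` mount option counts tenths of a second, which supports the 700-second reading. A quick sanity check of the numbers against the 30-second migration timeout:

```python
# NFS 'timeo' mount option is expressed in tenths of a second.
nfs_timeo = 7000                  # from the vdsm mount options
nfs_timeout_s = nfs_timeo / 10    # 700 seconds per stuck RPC retry
migration_timeout_s = 30          # vdsm migration timeout

# A single stuck NFS request already outlives the migration timeout
# by a factor of ~23, so qemu is still blocked in qemu_aio_wait when
# vdsm gives up on the migration.
print(nfs_timeout_s)                        # 700.0
print(nfs_timeout_s / migration_timeout_s)  # ~23.3
```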
Comment 6 Ayal Baron 2010-11-24 15:19:25 EST
(In reply to comment #5)
> I suspect that it's hanging in qemu_aio_wait while it's waiting for all running
> AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm
> this by attaching a gdb to the hanging instance and getting a backtrace?
> 
> If so, and I interpret timeout 7000 correctly as 700s, this may take more than
> ten minutes. No wonder that migration times out. If this is the reason, there's
> nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.

Why not define an internal timeout, send an interrupt to cancel the I/O, and continue with the migration?
Normally people use hard mounts, in which case this would never return (and we are planning on moving to hard mounts ourselves...)
