Bug 654023

Summary: [kvm] [DW] guest moves to non-responding on failed migration when source fails to communicate with storage server
Product: Red Hat Enterprise Linux 5 Reporter: Haim <hateya>
Component: kvm Assignee: Juan Quintela <quintela>
Status: CLOSED WONTFIX QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: low    
Version: 5.5.zCC: abaron, bazulay, danken, hateya, iheim, mgoldboi, mkenneth, tburke, virt-maint, yeylon, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 665820 (view as bug list) Environment:
Last Closed: 2010-11-28 10:13:14 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580948, 653898, 665820, 711374    

Description Haim 2010-11-16 16:33:33 UTC
Description of problem:

A guest machine is running on a host that fails to communicate with the NFS storage server. With the NFS mount timeout set to 7000 and the migration timeout set to 30, attempting to migrate the guest to a valid host leads to a state where the migration fails and the guest moves to non-responding.


scenario: 

Starting with 2 hosts: one host acts as SPM, and the second host runs guest machines. 
Block communication between the host running the guest and the storage server (on the storage side), then migrate the guest to the valid host; the migration fails and the guest moves to non-responding. 

repro steps: 

1) make sure to work with 2 hosts 
2) make sure to work with NFS storage 
3) make sure NFS timeout (mount options on vdsm) is set to 7000 
4) make sure migration timeout is set to 30 (see the timeout sketch after these steps)
5) make sure to run the guest machine on the non-SPM machine ('host B')
6) make sure to block communication between host B and the storage 
7) make sure to migrate the guest to host A (which has a valid connection to the storage). 
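
For reference, a minimal sketch of the timeout mismatch in steps 3 and 4, assuming the NFS timeo mount option is in tenths of a second (as comment 5 later interprets it) and the migration timeout is in seconds:

# Sketch only; the unit interpretation of timeo is an assumption.
NFS_TIMEO_DECISECONDS = 7000      # mount option from step 3
MIGRATION_TIMEOUT_SECONDS = 30    # vdsm setting from step 4

nfs_timeout_seconds = NFS_TIMEO_DECISECONDS / 10.0   # 700 seconds
print("NFS I/O may block for up to %.0f s per retry" % nfs_timeout_seconds)
print("Migration is declared failed after %d s" % MIGRATION_TIMEOUT_SECONDS)
# 700 s >> 30 s, so I/O stuck on the dead NFS mount easily outlives the
# migration timeout and the guest is reported as non-responding.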

Note that the migration fails with a timeout (the source-side vdsm log follows):

Thread-5331::ERROR::2010-11-16 18:24:04,269::vm::1824::vds.vmlog.b53faba6-83e7-4f21-ac92-58ee62fa2546::Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 1816, in run
    self._waitForOutgoingMigration()
  File "/usr/share/vdsm/vm.py", line 1729, in _waitForOutgoingMigration
    raise _MigrationError('Migration timeout exceeded at source')
_MigrationError: Migration timeout exceeded at source
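
For illustration only, a minimal sketch of the kind of source-side wait loop that produces this error. This is not the actual vdsm code; the names wait_for_outgoing_migration, monitor_migration_done, and MIGRATION_TIMEOUT are assumptions:

import time

MIGRATION_TIMEOUT = 30  # seconds, hypothetical setting from the repro steps

class MigrationError(Exception):
    pass

def wait_for_outgoing_migration(monitor_migration_done):
    # Poll until the hypervisor reports the outgoing migration as finished,
    # or give up once the timeout elapses (the path seen in the traceback).
    deadline = time.time() + MIGRATION_TIMEOUT
    while not monitor_migration_done():
        if time.time() >= deadline:
            raise MigrationError('Migration timeout exceeded at source')
        time.sleep(1)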

Comment 1 Haim 2010-11-16 16:34:56 UTC
vdsm22-4.5-62.26.el5_5rhev2_2
kernel-2.6.18-194.26.1.el5
kvm-83-164.el5_5.23

Comment 2 Dor Laor 2010-11-21 21:04:38 UTC
What does the non-responsive state mean for qemu? No monitor? Something else?

Comment 3 Dan Kenigsberg 2010-11-22 09:54:59 UTC
yes, Haim means that qmp is blocking forever.

Comment 4 Juan Quintela 2010-11-24 14:16:21 UTC
qmp on rhel5?

BTW, what does the customer expect us to do?
The migration code flushes the memory state to the block devices (in this case NFS), and then does the migration.  From what I see, there are only two solutions:
- stop the source of the migration with an error
   (it can't continue anyway, there is no disk)
- migrate the "pending" state to the destination host, and handle it there;
  no code exists today for doing that, and as far as I know, there is no plan
  to do that.

Haim, what do you expect to happen?

Comment 5 Kevin Wolf 2010-11-24 14:21:32 UTC
I suspect that it's hanging in qemu_aio_wait while it's waiting for all running AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm this by attaching a gdb to the hanging instance and getting a backtrace?

If so, and I interpret timeout 7000 correctly as 700s, this may take more than ten minutes. No wonder that migration times out. If this is the reason, there's nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.
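
A minimal sketch of how such a backtrace could be collected non-interactively, assuming gdb is available on the host and <qemu-pid> is the PID of the hanging qemu process:

import subprocess

def backtrace(pid):
    # Attach gdb in batch mode and dump a backtrace of every thread;
    # useful for confirming whether qemu is stuck in qemu_aio_wait().
    cmd = ["gdb", "-p", str(pid), "-batch",
           "-ex", "thread apply all bt"]
    return subprocess.check_output(cmd)

# Example (hypothetical PID): print(backtrace(12345))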

Comment 6 Ayal Baron 2010-11-24 20:19:25 UTC
(In reply to comment #5)
> I suspect that it's hanging in qemu_aio_wait while it's waiting for all running
> AIO requests to complete (i.e. to fail after the NFS timeout). Can you confirm
> this by attaching a gdb to the hanging instance and getting a backtrace?
> 
> If so, and I interpret timeout 7000 correctly as 700s, this may take more than
> ten minutes. No wonder that migration times out. If this is the reason, there's
> nothing we can do in qemu about it. You'd need to choose a smaller NFS timeout.

Why not define an internal timeout and send an interrupt to cancel IO and continue with migration?
Normally people use hard mounts in which case this would never return (and we are planning on moving to hard mounts ourselves...)
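
A rough sketch of that suggestion, assuming hypothetical flush_pending_io and cancel_pending_io hooks; qemu has no such internal timeout today, so this is purely conceptual:

import threading

IO_FLUSH_TIMEOUT = 60  # seconds, hypothetical internal limit

def flush_with_timeout(flush_pending_io, cancel_pending_io):
    # Run the blocking flush in a worker thread; if it does not finish
    # within the internal timeout, cancel the outstanding I/O so the
    # migration can proceed (or fail cleanly) instead of hanging forever
    # on a hard NFS mount.
    worker = threading.Thread(target=flush_pending_io)
    worker.start()
    worker.join(IO_FLUSH_TIMEOUT)
    if worker.is_alive():
        cancel_pending_io()   # e.g. interrupt the stuck NFS requests
        worker.join()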