Bug 629625

Summary: [vdsm] [libvirt] vdsm loses connection with 'kvm' process during concurrent migration

Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.1
Hardware: All
OS: Linux
Reporter: Haim <hateya>
Assignee: Dan Kenigsberg <danken>
QA Contact: Haim <hateya>
CC: bazulay, hateya, iheim, mgoldboi, Rhev-m-bugs, yeylon, ykaul
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2010-12-21 14:17:32 UTC
Bug Depends On: 659310

Attachments:
vdsm log files of both pele and silver

Description Haim 2010-09-02 14:40:42 UTC
Description of problem:

vdsm loses connection with the kvm process during concurrent migration of at least 2 guests per host (a must for reproduction).
It seems the VMs actually go down in rhevm, but the qemu processes are alive and active (also seen via virsh).
There are some disturbing errors in the log which might have caused this state:

libvirtEventLoop::ERROR::2010-09-02 17:47:59,771::libvirtvm::933::vds::Traceback (most recent call last):
  File "/usr/share/vdsm/libvirtvm.py", line 912, in __eventCallback
    v._onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/libvirtvm.py", line 878, in _onLibvirtLifecycleEvent
    hooks.after_vm_start(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'
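
For illustration only, this is not the actual vdsm fix: the AttributeError above suggests the STARTED lifecycle event can arrive before vdsm has attached the libvirt domain handle (self._dom) to the Vm object, e.g. for an incoming migration that is not fully wired up yet. A guarded handler might look like the sketch below; the handler shape and names other than _dom, conf and hooks.after_vm_start (which come from the traceback) are assumptions.

# Illustrative sketch only -- not the actual vdsm code.
import logging
import libvirt
import hooks  # vdsm's hook module (assumed to be on the vdsm python path)

log = logging.getLogger('vds')

def on_lifecycle_event(vm, event, detail):
    if event == libvirt.VIR_DOMAIN_EVENT_STARTED:
        if vm._dom is None:
            # Domain handle not attached yet; bail out instead of raising
            # AttributeError inside libvirt's event loop.
            log.warning('STARTED event arrived before _dom was set')
            return
        hooks.after_vm_start(vm._dom.XMLDesc(0), vm.conf)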

libvirtEventLoop::ERROR::2010-09-02 17:41:33,692::libvirtvm::933::vds::Traceback (most recent call last):
  File "/usr/share/vdsm/libvirtvm.py", line 912, in __eventCallback
    v._onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/libvirtvm.py", line 872, in _onLibvirtLifecycleEvent
    self._onQemuDeath()
  File "/usr/share/vdsm/vm.py", line 1006, in _onQemuDeath
    "Lost connection with kvm process")
  File "/usr/share/vdsm/vm.py", line 1753, in setDownStatus
    self.saveState()
  File "/usr/share/vdsm/libvirtvm.py", line 772, in saveState
    vm.Vm.saveState(self)
  File "/usr/share/vdsm/vm.py", line 1228, in saveState
    os.rename(tmpFile, self._recoveryFile)
OSError: [Errno 2] No such file or directory
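
For illustration only, not vdsm's implementation: the ENOENT from os.rename() suggests the temp file or the recovery-file directory disappeared while the VM was being torn down. A save routine that keeps the temp file next to the recovery file (so the rename stays atomic on one filesystem) and tolerates the directory vanishing could look like this sketch; function and argument names are placeholders.

# Illustrative sketch only -- not the vdsm saveState() implementation.
import errno
import os
import pickle
import tempfile

def save_state(state, recovery_file):
    recovery_dir = os.path.dirname(recovery_file)
    try:
        fd, tmp_path = tempfile.mkstemp(dir=recovery_dir)
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(state, f)
        # Atomic replacement, since mkstemp(dir=...) keeps the temp file on
        # the same filesystem as the target.
        os.rename(tmp_path, recovery_file)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # The directory (or temp file) vanished under us -- the VM is
            # already going away, so losing the state file is harmless.
            return
        raise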

This bug is consistently reproducible on the latest vdsm version, 4-9.14.

Repro steps:

1) Make sure you have 2 hosts.
2) Make sure you have 4 running VMs, 2 per host.
3) Run concurrent migrations so that the VMs running on host X are migrated to host Y, and the VMs running on host Y are migrated to host X.

Then open the log.
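
A minimal reproduction sketch using the libvirt python bindings directly (rather than the rhevm flow) is below; the host URIs and VM names are placeholders, and pele/silver are just the host names from the attached logs.

# Illustrative reproduction sketch: concurrent cross-migration of 2 VMs
# per host between two hosts.
import threading
import libvirt

def migrate(src_uri, dst_uri, vm_name):
    src = libvirt.open(src_uri)
    dst = libvirt.open(dst_uri)
    dom = src.lookupByName(vm_name)
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

threads = [
    threading.Thread(target=migrate,
                     args=('qemu+tls://pele/system', 'qemu+tls://silver/system', name))
    for name in ('vm1', 'vm2')
] + [
    threading.Thread(target=migrate,
                     args=('qemu+tls://silver/system', 'qemu+tls://pele/system', name))
    for name in ('vm3', 'vm4')
]
for t in threads:
    t.start()
for t in threads:
    t.join()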

Comment 2 Haim 2010-09-02 15:01:29 UTC
Created attachment 442648 [details]
vdsm log files of both pele and silver

Comment 3 Haim 2010-09-02 15:03:11 UTC
Note that during the migration, 3 out of 4 VMs died (from vdsm's perspective).

Comment 5 Barak 2010-11-21 16:17:18 UTC
Haim,

This probably happened due to one of the post-migration hook issues.
It was fixed a long time ago.
Please try to reproduce or close.

Comment 6 Haim 2010-11-29 07:13:56 UTC
(In reply to comment #5)
> Haim,
> 
> This probably happened due to one of the post-migration hook issues.
> It was fixed a long time ago.
> Please try to reproduce or close.

Sorry Barak, due to a libvirt bug that blocks migration (a crash upon migration), I can't reproduce this bug. Once I get a proper build with their fix, I will try to reproduce.

Comment 8 Haim 2010-12-05 15:26:45 UTC
Sorry - can't reproduce due to a deadlock in libvirt on concurrent migration. I have set the bug dependencies accordingly.

Comment 9 Itamar Heim 2010-12-19 14:48:02 UTC
Please try with the latest libvirt again. Thanks.

Comment 10 Haim 2010-12-21 13:30:09 UTC
(In reply to comment #9)
> Please try with the latest libvirt again. Thanks.

No repro on the latest libvirt; I guess it was solved along the way. This can either be moved to ON_QA or closed as CURRENT_RELEASE.

Comment 11 Dan Kenigsberg 2010-12-21 14:17:32 UTC
Why wait? Closing after verification in comment 10.