Bug 676395

Summary: Migration of VMs fails and those VMs become paused
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.1
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Rami Vaknin <rvaknin>
Assignee: Dan Kenigsberg <danken>
QA Contact: Tomas Dosek <tdosek>
CC: abaron, bazulay, danken, dnaori, iheim, ilvovsky, mgoldboi, tdosek, yeylon, ykaul
Fixed In Version: vdsm-4.9-51
Doc Type: Bug Fix
Environment: vdsm-4.9-47, libvirt-0.8.7-5.el6.x86_64, qemu-kvm-0.12.1.2-2.129.el6.x86_64
Last Closed: 2011-08-19 15:14:23 UTC

Attachments: VDSM Logs

Description Rami Vaknin 2011-02-09 17:51:07 UTC
Created attachment 477874 [details]
VDSM Logs

Scenario:
1. Define a cluster of two RHEL6 hosts with a few running VMs (a mix of HA, non-HA, and pinned VMs)
2. Select a few VMs and migrate them

A few of the VMs become paused and their migration fails.

This has been reproduced 4 times so far: 3 times on the same host and the 4th time on a different host; both hosts were defined as HA.

Here are vdsm log snippets from two reproductions, with different calls reaching the same place:

Thread-9451::INFO::2011-02-09 18:15:33,842::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: teardownVolume, args: ( sdUUID=8431f208-1bf3-4f1f-a05e-7def6d6d0239 spUUID=ade9f73e-272a-42bb-8b8c-bb3288e0637e imgUUID=d4bd49bf-1b88-4f7a-9f9f-1845db0b243b volUUID=4802a4d9-7ab8-4d0a-b014-034ed86f502e rw=True)
Thread-9451::DEBUG::2011-02-09 18:15:33,853::task::491::TaskManager.Task::(_debug) Task f7e1bd18-bffb-4511-bd7b-d9abf938ace5: moving from state init -> state preparing
Thread-9451::INFO::2011-02-09 18:15:33,855::sp::1378::Storage.StoragePool::(getDomains) Get storage pool domains: {'8431f208-1bf3-4f1f-a05e-7def6d6d0239': 'Active', '2e46da5a-2ad6-4672-a297-02b6d3713274': 'Active'}
Thread-9451::INFO::2011-02-09 18:15:33,861::blockSD::596::Storage.StorageDomain::(validate) sdUUID=8431f208-1bf3-4f1f-a05e-7def6d6d0239
Thread-9457::DEBUG::2011-02-09 18:15:33,863::guestIF::78::root::(_connect) Attempting connection to /var/lib/libvirt/qemu/channels/host6_rhel6.com.redhat.rhevm.vdsm
Thread-9457::DEBUG::2011-02-09 18:15:33,865::utils::570::Storage.Misc.excCmd::(execCmd) '/usr/bin/sudo -n /usr/share/vdsm/prepare-vmchannel /var/lib/libvirt/qemu/channels/host6_rhel6.com.redhat.rhevm.vdsm' (cwd None)
Thread-9438::DEBUG::2011-02-09 18:15:33,865::utils::570::Storage.Misc.excCmd::(execCmd) '/usr/bin/sudo -n /usr/share/vdsm/get-vm-pid host6_rhel6' (cwd None)
Thread-9451::DEBUG::2011-02-09 18:15:33,867::supervdsm::30::SuperVdsmProxy::(__connect) Trying to conenct to Super Vdsm
libvirtEventLoop::ERROR::2011-02-09 18:15:33,883::libvirtconnection::47::vds::(__eventCallback) Traceback (most recent call last):
  File "/usr/share/vdsm/libvirtconnection.py", line 26, in __eventCallback
    v._onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/libvirtvm.py", line 1020, in _onLibvirtLifecycleEvent
    self._onQemuDeath()
  File "/usr/share/vdsm/vm.py", line 999, in _onQemuDeath
    "Lost connection with kvm process")
  File "/usr/share/vdsm/vm.py", line 1766, in setDownStatus
    self.saveState()
  File "/usr/share/vdsm/libvirtvm.py", line 905, in saveState
    vm.Vm.saveState(self)
  File "/usr/share/vdsm/vm.py", line 1227, in saveState
    os.rename(tmpFile, self._recoveryFile)
OSError: [Errno 2] No such file or directory

Thread-9451::DEBUG::2011-02-09 18:15:33,835::utils::570::Storage.Misc.excCmd::(execCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-9451::DEBUG::2011-02-09 18:15:33,840::libvirtvm::944::vds.vmlog.2c1cbd82-c369-4fad-92bc-132bd40e21b5::(destroy) Total desktops after destroy of 2c1cbd82-c369-4fad-92bc-132bd40e21b5 is 10
Thread-9451::INFO::2011-02-09 18:15:33,842::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: teardownVolume, args: ( sdUUID=8431f208-1bf3-4f1f-a05e-7def6d6d0239 spUUID=ade9f73e-272a-42bb-8b8c-bb3288e0637e imgUUID=d4bd49bf-1b88-4f7a-9f9f-1845db0b243b volUUID=4802a4d9-7ab8-4d0a-b014-034ed86f502e rw=True)
Thread-9451::DEBUG::2011-02-09 18:15:33,853::task::491::TaskManager.Task::(_debug) Task f7e1bd18-bffb-4511-bd7b-d9abf938ace5: moving from state init -> state preparing
Thread-9451::INFO::2011-02-09 18:15:33,855::sp::1378::Storage.StoragePool::(getDomains) Get storage pool domains: {'8431f208-1bf3-4f1f-a05e-7def6d6d0239': 'Active', '2e46da5a-2ad6-4672-a297-02b6d3713274': 'Active'}
Thread-9451::INFO::2011-02-09 18:15:33,861::blockSD::596::Storage.StorageDomain::(validate) sdUUID=8431f208-1bf3-4f1f-a05e-7def6d6d0239
Thread-9457::DEBUG::2011-02-09 18:15:33,863::guestIF::78::root::(_connect) Attempting connection to /var/lib/libvirt/qemu/channels/host6_rhel6.com.redhat.rhevm.vdsm
Thread-9457::DEBUG::2011-02-09 18:15:33,865::utils::570::Storage.Misc.excCmd::(execCmd) '/usr/bin/sudo -n /usr/share/vdsm/prepare-vmchannel /var/lib/libvirt/qemu/channels/host6_rhel6.com.redhat.rhevm.vdsm' (cwd None)
Thread-9438::DEBUG::2011-02-09 18:15:33,865::utils::570::Storage.Misc.excCmd::(execCmd) '/usr/bin/sudo -n /usr/share/vdsm/get-vm-pid host6_rhel6' (cwd None)
Thread-9451::DEBUG::2011-02-09 18:15:33,867::supervdsm::30::SuperVdsmProxy::(__connect) Trying to conenct to Super Vdsm
libvirtEventLoop::ERROR::2011-02-09 18:15:33,883::libvirtconnection::47::vds::(__eventCallback) Traceback (most recent call last):
  File "/usr/share/vdsm/libvirtconnection.py", line 26, in __eventCallback
    v._onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/libvirtvm.py", line 1020, in _onLibvirtLifecycleEvent
    self._onQemuDeath()
  File "/usr/share/vdsm/vm.py", line 999, in _onQemuDeath
    "Lost connection with kvm process")
  File "/usr/share/vdsm/vm.py", line 1766, in setDownStatus
    self.saveState()
  File "/usr/share/vdsm/libvirtvm.py", line 905, in saveState
    vm.Vm.saveState(self)
  File "/usr/share/vdsm/vm.py", line 1227, in saveState
    os.rename(tmpFile, self._recoveryFile)
OSError: [Errno 2] No such file or directory
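
The OSError above comes from os.rename() being called with a source path that no longer exists. A minimal, hypothetical sketch (made-up paths, not vdsm code) that reproduces the same Errno 2, assuming the temporary state file is removed, e.g. by a concurrent destroy, before the rename runs:

import os
import tempfile

# Hypothetical stand-ins for the recovery file and its fixed-name temp copy.
run_dir = tempfile.mkdtemp()
tmp_file = os.path.join(run_dir, 'vm.recovery.tmp')
recovery_file = os.path.join(run_dir, 'vm.recovery')

# Write the VM state to a fixed-name temporary file.
with open(tmp_file, 'w') as f:
    f.write('vm state')

# Simulate another thread (e.g. destroy) removing the file before the rename.
os.remove(tmp_file)

try:
    os.rename(tmp_file, recovery_file)  # source is gone by now
except OSError as e:
    print(e)  # [Errno 2] No such file or directory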

Comment 2 David Naori 2011-02-17 15:29:01 UTC
Verified on vdsm-4.9-49

Comment 3 David Naori 2011-02-24 12:24:03 UTC
Reproduced on vdsm-4.9-50

Worked on the issue with danken and he figured out that the following change fixes the issue:

    def saveState(self):
        # do not recreate the recovery file for a VM that has already been destroyed
        if self.destroyed:
            return
        from copy import deepcopy
        toSave = deepcopy(self.status())
        toSave['startTime'] = self._startTime
        if self.lastStatus != 'Down' and self._vmStats and self.guestAgent:
            toSave['username'] = self.guestAgent.guestInfo['username']
            toSave['guestIPs'] = self.guestAgent.guestInfo['guestIPs']
        else:
            toSave['username'] = ""
            toSave['guestIPs'] = ""
        if 'sysprepInf' in toSave:
            del toSave['sysprepInf']
            if 'floppy' in toSave: del toSave['floppy']
        for drive in toSave.get('drives', []):
            for d in self._drives:
                if drive.get('volumeID') == d.volumeID:
                    drive['truesize'] = str(d.truesize)
                    drive['apparentsize'] = str(d.apparentsize)

        # dump the state into a uniquely named temporary file in the run
        # directory, then rename it over the recovery file
        import tempfile
        with tempfile.NamedTemporaryFile(dir=constants.P_VDSM_RUN, delete=False) as f:
            pickle.dump(toSave, f)
        try:
            os.rename(f.name, self._recoveryFile)
        except:
            # temporary debug aid: log whether the temp and recovery files still exist
            self.log.error("DDDD %s:%s %s:%s", f.name, os.access(f.name, 0), self._recoveryFile, os.access(self._recoveryFile, 0), exc_info=True)
            raise
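
The core of the fix is the write-then-rename pattern: saveState is skipped entirely for a VM that was already destroyed, and the state is pickled into a uniquely named temporary file in the same directory as the recovery file and only then moved over it. A standalone sketch of that pattern (function and file names here are illustrative, not vdsm's):

import os
import pickle
import tempfile

def save_atomically(state, recovery_file):
    # Pickle `state` into a temp file created in the target directory, then
    # rename it over `recovery_file`. On POSIX, rename within one filesystem
    # replaces the target atomically, so readers never see a half-written file.
    target_dir = os.path.dirname(recovery_file)
    with tempfile.NamedTemporaryFile(dir=target_dir, delete=False) as f:
        pickle.dump(state, f)
    os.rename(f.name, recovery_file)

save_atomically({'status': 'Up', 'guestIPs': ''}, '/tmp/example.recovery')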

Comment 4 Tomas Dosek 2011-03-02 09:57:48 UTC
vdsm-4.9-51 - migration behaves rather strangely: if a migration fails, the VMs stay on the source host and remain up (which I assume is OK), but when trying to migrate more than 4 VMs to another host (migration policies are off), all migrations fail, even though the second host carries no load. Log:

2011-03-02 10:36:20,380 ERROR [com.redhat.rhevm.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-97) Rerun vm 13f9f894-5502-4872-86ab-075382dd50a3. Called from vds 10.34.57.206
2011-03-02 10:36:20,435 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-40) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vcpu count failed, its already 0 or null
2011-03-02 10:36:20,435 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-40) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vmem size failed, its already 0 or null
2011-03-02 10:36:20,435 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-39) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vcpu count failed, its already 0 or null
2011-03-02 10:36:20,435 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-39) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vmem size failed, its already 0 or null
2011-03-02 10:36:20,437 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-41) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vcpu count failed, its already 0 or null
2011-03-02 10:36:20,437 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-41) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vmem size failed, its already 0 or null
2011-03-02 10:36:20,569 INFO  [com.redhat.rhevm.bll.CommandBase] (pool-11-thread-39) CanDoAction of action MigrateVmToServer failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,VDS_INVALID_SERVER_ID
2011-03-02 10:36:20,578 INFO  [com.redhat.rhevm.bll.CommandBase] (pool-11-thread-40) CanDoAction of action MigrateVmToServer failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,VDS_INVALID_SERVER_ID
2011-03-02 10:36:20,582 INFO  [com.redhat.rhevm.bll.CommandBase] (pool-11-thread-41) CanDoAction of action MigrateVmToServer failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,VDS_INVALID_SERVER_ID
2011-03-02 10:36:22,572 ERROR [com.redhat.rhevm.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-79) Rerun vm 7f79b379-306e-4f2c-8413-c7f34ea5f2a4. Called from vds 10.34.57.206
2011-03-02 10:36:22,604 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-40) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vcpu count failed, its already 0 or null
2011-03-02 10:36:22,604 INFO  [com.redhat.rhevm.bll.RunVmCommandBase] (pool-11-thread-40) DecreasePendingVms::Decreasing vds 10.34.57.207 pending vmem size failed, its already 0 or null
2011-03-02 10:36:22,654 INFO  [com.redhat.rhevm.bll.CommandBase] (pool-11-thread-40) CanDoAction of action MigrateVmToServer failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,VDS_INVALID_SERVER_ID
2011-03-02 10:38:27,428 INFO  [com.redhat.rhevm.bll.DbUserCacheManager] (QuartzScheduler_Worker-85) DbUserCacheManager::refreshAllUserData() - entered
2011-03-02 10:38:27,436 INFO  [com.redhat.rhevm.bll.adbroker.AdBrokerCommandBase] (QuartzScheduler_Worker-85) Running AD command: com.redhat.rhevm.bll.adbroker.ADGetAdUserByUserIdListCommand
2011-03-02 10:38:27,438 INFO  [com.redhat.rhevm.bll.adbroker.AdBrokerCommandBase] (QuartzScheduler_Worker-85) Running AD command: com.redhat.rhevm.bll.adbroker.ADSearchUserByQueryCommand
2011-03-02 10:41:22,069 INFO  [com.redhat.rhevm.vdsbroker.VdsManager] (QuartzScheduler_Worker-33) OnVdsDuringFailureTimer of vds 10.34.57.207 entered. Time:3/2/11 10:41 AM. Attemts after2

Comment 5 David Naori 2011-03-06 14:09:01 UTC
Verified on vdsm-4.9-51

Comment 6 David Naori 2011-03-06 14:13:58 UTC
Excuse me Tomas, I didn't see your comment; moving back to ON_QA until danken replies.

Comment 7 Tomas Dosek 2011-03-07 08:21:40 UTC
Verified on vdsm-4.9-51 - the scenario specified above no longer reproduces (it turned out that the trouble verifying this bug was in our environment).