Bug 1274315

Summary: HE-VM can't be migrated between the hosts and host is stuck in "Preparing For Maintenance".
Product: [oVirt] ovirt-engine Reporter: Nikolai Sednev <nsednev>
Component: BLL.Virt Assignee: Roy Golan <rgolan>
Status: CLOSED CURRENTRELEASE QA Contact: Nikolai Sednev <nsednev>
Severity: high Docs Contact:
Priority: high    
Version: 3.6.0.1 CC: bugs, cshao, dfediuck, gklein, huiwa, huzhao, mavital, mgoldboi, nsednev, sbonazzo, tjelinek, ycui
Target Milestone: ovirt-3.6.2 Keywords: Triaged
Target Release: 3.6.2 Flags: rule-engine: ovirt-3.6.z+
rule-engine: blocker+
mgoldboi: planning_ack+
dfediuck: devel_ack+
rule-engine: testing_ack+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The import of the engine VM was constantly failing, and that collided with the attempt to update the migration progress of that VM.
Consequence: The engine VM would appear as still running on the source host, and the host couldn't be moved into maintenance.
Fix: The import of the engine VM is now stable and shouldn't fail, which lets the migration proceed.
Result: The engine VM is handed over to the destination host, and the source host is freed and can then move to maintenance.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-18 11:00:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1269768    
Bug Blocks:    
Attachments:
Description                                                          Flags
logs from the engine and the host from which migration was started  none
Screenshot from 2015-10-26 14:40:44.png                              none
engine's logs                                                        none
alma03 logs (source)                                                 none
alma04 logs (destination)                                            none
engine's log                                                         none
alma03                                                               none
alma04                                                               none

Description Nikolai Sednev 2015-10-22 13:20:56 UTC
Description of problem:
I have two hosts in my HE environment, and I tried to migrate all VMs, including the HE-VM, from one host to the other by setting the host into maintenance. All guest VMs migrated successfully except the HE-VM, which was shown for some time as migrating but then failed to migrate, and the host remained in the "Preparing For Maintenance" state.
I also saw this error message received by the engine:
"VDSM <host's FQDN> command failed: Virtual machine does not exist"

Version-Release number of selected component (if applicable):
Host:
ovirt-hosted-engine-setup-1.3.1-0.0.master.20151020145724.git565c3f9.el7.centos.noarch
ovirt-hosted-engine-ha-1.3.1-1.20151016090950.git5ea5093.el7.noarch                    
mom-0.5.1-2.el7.noarch
vdsm-4.17.9-15.git1a7d1d3.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-12.el7.x86_64
sanlock-3.2.4-1.el7.x86_64

Engine:
ovirt-vmconsole-1.0.0-0.0.6.master.el6ev.noarch
rhevm-3.6.0.1-0.1.el6.noarch
rhevm-guest-agent-common-1.0.11-2.el6ev.noarch
qemu-guest-agent-0.12.1.2-2.479.el6_7.1.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Deploy RHEVM 3.6 (16) on 2 RHEL 7.2 hosts over iSCSI.
2. Create 2 VMs and start them all on the same host.
3. Set the host that hosts the 3 VMs to maintenance via the WEBUI.

Actual results:
All guest VMs except HE-VM migrated to second host and host itself is stuck in "Preparing For Maintenance".

Expected results:
HE-VM should be migrated to second host.

Additional info:

Comment 1 Nikolai Sednev 2015-10-22 13:32:20 UTC
Created attachment 1085520 [details]
logs from the engine and the host from which migration was started

Comment 2 Oved Ourfali 2015-10-22 14:43:45 UTC
Host will remain in preparing for maintenance until all VMs are migrated. 
Putting virt (as you haven't put any whiteboard), but sounds like notabug, unless the migration is problematic for some reason.

Comment 3 Michal Skrivanek 2015-10-23 17:56:40 UTC
The vdsm log doesn't cover the right period. Please add correct logs for both source and destination (the time seems to be ~2015-10-22 14:18:00).
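
One way to check whether a log file covers the period in question is to pull the earliest and latest timestamps out of it. The following is a minimal sketch, assuming each relevant line contains a timestamp of the form `YYYY-MM-DD HH:MM:SS`; the function names and the sample lines are made up for illustration and are not taken from the attached logs.

```python
import re
from datetime import datetime

# Assumption: log lines contain a timestamp like "2015-10-22 14:18:00".
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def log_span(lines):
    """Return (earliest, latest) timestamps found in the lines, or None if there are none."""
    stamps = []
    for line in lines:
        m = TS_RE.search(line)
        if m:
            stamps.append(datetime.strptime(m.group(0), "%Y-%m-%d %H:%M:%S"))
    if not stamps:
        return None
    return min(stamps), max(stamps)

def covers(lines, when):
    """True if `when` falls between the earliest and latest timestamp in the log."""
    span = log_span(lines)
    return span is not None and span[0] <= when <= span[1]

# Hypothetical sample lines, not real vdsm output.
sample = [
    "2015-10-26 15:34:43 migration started",
    "2015-10-26 15:36:53 migration finished",
]
incident = datetime(2015, 10, 22, 14, 18, 0)
print(covers(sample, incident))  # False: the sample starts days after the incident
```

With a real file, the lines could be passed in as `open(path, errors="replace")`.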

Comment 4 Michal Skrivanek 2015-10-23 17:57:30 UTC
I don't see what's "urgent" here

Comment 5 Nikolai Sednev 2015-10-26 12:41:34 UTC
Created attachment 1086500 [details]
Screenshot from 2015-10-26 14:40:44.png

Comment 6 Nikolai Sednev 2015-10-26 12:56:24 UTC
(In reply to Michal Skrivanek from comment #4)
> I don't see what's "urgent" here

If you can't migrate HE, then the whole idea of backing up the engine is irrelevant, hence it's urgent IMHO.

Please see the attachments.

Comment 7 Nikolai Sednev 2015-10-26 12:57:00 UTC
Created attachment 1086514 [details]
engine's logs

Comment 8 Nikolai Sednev 2015-10-26 12:57:47 UTC
Created attachment 1086515 [details]
alma03 logs (source)

Comment 9 Nikolai Sednev 2015-10-26 12:58:30 UTC
Created attachment 1086516 [details]
alma04 logs (destination)

Comment 10 Michal Skrivanek 2015-11-04 08:25:07 UTC
Nikolai,
- your new logs do not cover the original interval, they are 4 days later
- logs include one migration of HE at 2015-10-26 15:34:43 which took about 2 minutes and ended successfully at 15:36:53
- the attached picture in comment #5 shows a successful move to maintenance
- engine.log from comment #7 ends at 2015-10-26 14:51:27 showing perhaps some previous attempts/states?


please describe what is the problem, what did you observe, time frame of the issue and entities involved and relevant logs
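
For reference, the two figures above (a migration of "about 2 minutes" and logs that are "4 days later") can be checked with plain timestamp arithmetic. This is only a sketch; the 14:18:00 start of the original incident is the approximate time mentioned in comment #3.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Migration start and end as reported for the HE migration in the new logs.
start = datetime.strptime("2015-10-26 15:34:43", FMT)
end = datetime.strptime("2015-10-26 15:36:53", FMT)
migration_seconds = (end - start).total_seconds()

# Approximate time of the original incident (from comment #3).
incident = datetime.strptime("2015-10-22 14:18:00", FMT)
gap_days = (start - incident).days

print(migration_seconds)  # 130.0, i.e. 2 minutes 10 seconds
print(gap_days)           # 4
```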

Comment 11 Nikolai Sednev 2015-11-04 12:33:25 UTC
(In reply to Michal Skrivanek from comment #10)
> Nikolai,
> - your new logs do not cover the original interval, they are 4 days later
> - logs include one migration of HE at 2015-10-26 15:34:43 which took about 2
> minutes and ended successfully at 15:36:53
> - the attached picture in comment #5 shows a successful move to maintenance
> - engine.log from comment #7 ends at 2015-10-26 14:51:27 showing perhaps
> some previous attempts/states?
> 
> 
> please describe what is the problem, what did you observe, time frame of the
> issue and entities involved and relevant logs

The problem is that the engine itself cannot be migrated between the two hosts; all the VMs are migrated with no problem, except the HE-VM.

Comment 12 Michal Skrivanek 2015-11-04 13:18:52 UTC
Sorry, but I can't do anything with the bug without getting answers/logs.

Comment 13 Sandro Bonazzola 2015-11-05 08:16:18 UTC
oVirt 3.6.0 has been released on November 4th, re-targeting to 3.6.1 since this bug has been marked as high severity

Comment 14 Nikolai Sednev 2015-11-05 09:27:05 UTC
(In reply to Michal Skrivanek from comment #12)
> Sorry, but I can't do anything with the bug without getting answers/logs.

Hi Michal,
Providing the full data as I could gather it.
1. Time frame is: "Nov 5, 2015 11:07:06 AM Host alma03.qa.lab.tlv.redhat.com was switched to Maintenance mode by admin@internal (Reason: Not Specified)."

2. I'm migrating 3 guest VMs from alma03 to alma04. Result: PASS.
3. I'm trying to set alma03 into maintenance via the WEBUI while the HE-VM is running on alma03, so during the maintenance the HE-VM should be migrated to alma04. alma03 is stuck in "Preparing For Maintenance", as the HE-VM is not migrated to alma04. Result: FAIL.
4. Sosreports from the engine, alma03 (source), and alma04 (target) are attached.
 
Engine components:
rhevm-3.6.0.2-0.1.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.0.2-0.1.el6.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.0.2-0.1.el6.noarch
ovirt-host-deploy-1.4.0-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.1-1.el6ev.noarch
ovirt-host-deploy-java-1.4.0-1.el6ev.noarch

Host's components on alma03:
ovirt-vmconsole-1.0.0-0.0.6.master.el7ev.noarch
ovirt-release36-001-0.5.beta.noarch
mom-0.5.1-2.el7.noarch
ovirt-hosted-engine-setup-1.3.1-0.0.master.20151020145724.git565c3f9.el7.centos.noarch
ovirt-setup-lib-1.0.0-1.20150922141000.git147e275.el7.centos.noarch
ovirt-host-deploy-1.5.0-0.0.master.20151015221110.gitc2abfed.el7.noarch
ovirt-release36-snapshot-001-0.5.beta.noarch
vdsm-4.17.10-13.gite438b03.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-12.el7.x86_64
ovirt-hosted-engine-ha-1.3.1-1.20151016090950.git5ea5093.el7.noarch
sanlock-3.2.4-1.el7.x86_64
ovirt-engine-sdk-python-3.6.0.4-0.1.20151014.git117764a.el7.centos.noarch
ovirt-vmconsole-host-1.0.0-0.0.6.master.el7ev.noarch

Host's components on alma04:
mom-0.5.1-2.el7.noarch
ovirt-hosted-engine-setup-1.3.1-0.0.master.20151020145724.git565c3f9.el7.centos.noarch
ovirt-setup-lib-1.0.0-1.20150922141000.git147e275.el7.centos.noarch
ovirt-vmconsole-1.0.0-0.0.6.master.el7ev.noarch
ovirt-release36-snapshot-001-0.5.beta.noarch
ovirt-host-deploy-1.5.0-0.0.master.20151015221110.gitc2abfed.el7.noarch
ovirt-release36-001-0.5.beta.noarch
ovirt-engine-sdk-python-3.6.0.4-0.1.20151014.git117764a.el7.centos.noarch
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-hosted-engine-ha-1.3.1-1.20151016090950.git5ea5093.el7.noarch
sanlock-3.2.4-1.el7.x86_64
ovirt-vmconsole-host-1.0.0-0.0.6.master.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-12.el7.x86_64

Comment 15 Nikolai Sednev 2015-11-05 11:01:28 UTC
Created attachment 1090019 [details]
engine's log

Comment 16 Nikolai Sednev 2015-11-05 11:08:05 UTC
Created attachment 1090021 [details]
alma03

Comment 17 Nikolai Sednev 2015-11-05 11:17:35 UTC
Created attachment 1090023 [details]
alma04

Comment 18 Tomas Jelinek 2015-11-05 13:09:38 UTC
Looking into logs I see that:
- source VDSM: on line 1311 the migration of the HE VM is initiated
- source VDSM: on line 11643 the migration finished successfully
- engine: on line 605 of the engine log the "migration succeeded" notification is received
- engine: on line 630 the ERROR [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-7-thread-22) [18437bc6] Transaction rolled-back for command 'org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand'.

Moving to SLA for more investigation.

May be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1269768
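
To locate such rollbacks in an engine log without reading it line by line, something like the following could be used. It is a minimal sketch: the pattern is built only from the message quoted above, the rest of the line layout is an assumption, and `find_rollbacks` is a made-up helper name.

```python
import re

# Pattern derived from the quoted ERROR message; anything else about the
# line layout (timestamp prefix, spacing) is assumed, so search() is used.
ROLLBACK_RE = re.compile(
    r"ERROR \[(?P<logger>[\w.]+)\].*"
    r"Transaction rolled-back for command '(?P<command>[\w.]+)'"
)

def find_rollbacks(lines):
    """Yield (line_number, command) for each rolled-back command, counting from 1."""
    for number, line in enumerate(lines, start=1):
        m = ROLLBACK_RE.search(line)
        if m:
            yield number, m.group("command")

sample = [
    "INFO  [some.other.Logger] unrelated line (hypothetical)",
    "ERROR [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] "
    "(org.ovirt.thread.pool-7-thread-22) [18437bc6] Transaction rolled-back for command "
    "'org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand'.",
]
hits = list(find_rollbacks(sample))
print(hits)
```

Running it over the real engine.log would show whether the import rollback recurs around each migration attempt.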

Comment 19 Roy Golan 2015-12-03 20:58:54 UTC
This should be solved by Bug 1269768  (auto import of HE VM and Storage domain)

The VMs monitoring intercepts the engine VM and tries to import it, but fails. It will then probably skip the part that does the migration hand-over (where we set the destination host of the VM as the one this VM is running on).
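
The suspected failure mode can be illustrated with a toy model. This is not the engine's code: every class, method, and the VM name "HostedEngine" below is hypothetical. It only shows how an import step that fails before the hand-over step can leave the VM recorded on the source host.

```python
class ImportFailed(Exception):
    """Raised by the toy import step when it fails."""

class ToyMonitor:
    """Toy stand-in for one VM monitoring cycle; not the ovirt-engine implementation."""

    def __init__(self, import_works):
        self.import_works = import_works
        self.run_on_host = {}  # VM name -> host the engine believes it runs on

    def try_import(self, vm):
        if not self.import_works:
            raise ImportFailed(vm)

    def process(self, vm, reported_host):
        # Assumed ordering: the import attempt runs first...
        self.try_import(vm)
        # ...so if it raises, this hand-over step is never reached.
        self.run_on_host[vm] = reported_host

def simulate(import_works):
    monitor = ToyMonitor(import_works)
    monitor.run_on_host["HostedEngine"] = "alma03"  # source host
    try:
        monitor.process("HostedEngine", "alma04")   # VM now reported on destination
    except ImportFailed:
        pass  # the cycle is abandoned; the recorded host stays stale
    return monitor.run_on_host["HostedEngine"]

print(simulate(import_works=False))  # alma03: the source host still looks busy
print(simulate(import_works=True))   # alma04: hand-over recorded
```

In the failing case the source host still appears to run a VM, which matches a host that never leaves "Preparing For Maintenance".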

Comment 20 Nikolai Sednev 2016-01-21 12:18:11 UTC
Successfully got HE-SD & HE-VM auto-imported on a cleanly installed NFS deployment after the NFS data SD was added. The engine was installed using PXE. Then, via the WEBUI, I set the host with the running VMs into maintenance mode, and all VMs were migrated to the second host.
Works for me on these components:
Host:
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
mom-0.5.1-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.6.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
vdsm-4.17.18-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
Linux version 3.10.0-327.8.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jan 11 05:03:18 EST 2016

Engine:
ovirt-vmconsole-1.0.0-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-setup-lib-1.0.1-1.el6ev.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
rhevm-3.6.2.6-0.1.el6.noarch
rhevm-dwh-setup-3.6.2-1.el6ev.noarch
rhevm-dwh-3.6.2-1.el6ev.noarch
rhevm-reports-setup-3.6.2.4-1.el6ev.noarch
rhevm-reports-3.6.2.4-1.el6ev.noarch
rhevm-guest-agent-common-1.0.11-2.el6ev.noarch
Linux version 2.6.32-573.8.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Fri Sep 25 19:24:22 EDT 2015