Bug 1570598 - The HE-VM didn't migrate to the additional host while putting this host to local maintenance via cockpit
Summary: The HE-VM didn't migrate to the additional host while putting this host to local maintenance via cockpit
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: General
Version: 2.2.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.3
Target Release: ---
Assignee: Ryan Barry
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-23 10:11 UTC by Yihui Zhao
Modified: 2018-04-26 10:16 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-26 10:16:27 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2?
cshao: testing_ack?


Attachments
local_maintenance_first_host (61.00 KB, image/png), 2018-04-23 10:11 UTC, Yihui Zhao, no flags
local_maintenance_terminal (140.77 KB, image/png), 2018-04-23 10:14 UTC, Yihui Zhao, no flags
from_engine (64.54 KB, image/png), 2018-04-23 10:15 UTC, Yihui Zhao, no flags
after_manual_migrate (62.90 KB, image/png), 2018-04-23 10:15 UTC, Yihui Zhao, no flags
logs (924.17 KB, application/x-bzip), 2018-04-23 10:21 UTC, Yihui Zhao, no flags
vdsm_log (13.30 MB, text/plain), 2018-04-24 09:53 UTC, Yihui Zhao, no flags

Description Yihui Zhao 2018-04-23 10:11:48 UTC
Created attachment 1425630 [details]
local_maintenance_first_host

Description of problem:
The HE-VM didn't migrate to the additional host while putting this host to local maintenance.

Version-Release number of selected component (if applicable):
rhvh-4.2.2.1-0.20180420.0+1
cockpit-ovirt-dashboard-0.11.22-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.10-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.18-1.el7ev.noarch
rhvm-appliance-4.2-20180420.0.el7.noarch

How reproducible:
100%


Steps to Reproduce:
1. Install the latest RHVH4.1.11
2. Deploy HE on the first host via cockpit
3. Add another host to the cluster
4. Put the first host into local maintenance
5. Check the HE-VM status on the engine and cockpit

Actual results:
After step 5, the HE-VM didn't migrate to the additional host after putting the first host into local maintenance.

Expected results:
After step 5, the HE-VM should migrate to the additional host, and its status should be Up on the additional host.

Additional info:
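For reference, steps 4 and 5 can also be driven from the host shell; this is only a sketch of the equivalent CLI calls (the reproduction itself used the cockpit UI):

# On the first host: enter local maintenance (step 4)
hosted-engine --set-maintenance --mode=local

# Check where the HE-VM is running and each host's state (step 5)
hosted-engine --vm-status

# Leave local maintenance again afterwards
hosted-engine --set-maintenance --mode=none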

Comment 1 Yihui Zhao 2018-04-23 10:12:51 UTC
(In reply to Yihui Zhao from comment #0)
> Created attachment 1425630 [details]
> local_maintenance_first_host
> 
> Description of problem:
> The HE-VM didn't migrate to the additional host while putting this host to
> local maintenance.
> 
> Version-Release number of selected component (if applicable):
> rhvh-4.2.2.1-0.20180420.0+1
> cockpit-ovirt-dashboard-0.11.22-1.el7ev.noarch
> ovirt-hosted-engine-ha-2.2.10-1.el7ev.noarch
> ovirt-hosted-engine-setup-2.2.18-1.el7ev.noarch
> rhvm-appliance-4.2-20180420.0.el7.noarch
> 
> How reproducible:
> 100%
> 
> 
> Steps to Reproduce:
> 1. Install the latest RHVH4.1.11
Should be "Install the latest RHVH 4.2.2"
> 2. Deploy HE on the first host via cockpit
> 3. Add the another host into the cluster
> 4. Put the first host into local maintenance
> 5. Check the HE-VM status on the engine and cockpit
> 
> Actual results:
> After step5, The HE-VM didn't migrate to the additional host while putting
> this host to local maintenance.
> 
> Expected results:
> After step5, the HE-VM should migrate to the additional host, and it's
> status is up on the additional host
> 
> Additional info:

Comment 2 Yihui Zhao 2018-04-23 10:14:28 UTC
Created attachment 1425631 [details]
local_maintenance_terminal

Comment 3 Ryan Barry 2018-04-23 10:14:42 UTC
Please ensure it is also in local maintenance on the host (though it should be, because we pull this from "hosted-engine --vm-status --json").
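A minimal way to double-check that from the host shell (a sketch; the exact field and line names in the status output may vary between ha-agent versions):

hosted-engine --vm-status --json                   # raw JSON that the dashboard consumes
hosted-engine --vm-status | grep -i maintenance    # quick human-readable check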

Comment 4 Yihui Zhao 2018-04-23 10:15:19 UTC
Created attachment 1425632 [details]
from_engine

Comment 5 Yihui Zhao 2018-04-23 10:15:55 UTC
Created attachment 1425634 [details]
after_manual_migrate

Comment 6 Yihui Zhao 2018-04-23 10:21:37 UTC
Created attachment 1425635 [details]
logs

Comment 7 Yihui Zhao 2018-04-23 10:23:07 UTC
(In reply to Ryan Barry from comment #3)
> Please ensure is also in local maintenance on the host (though it should be,
> because we pull this from "hosted-engine --vm-status --json"

See https://bugzilla.redhat.com/attachment.cgi?id=1425631

Comment 8 Martin Sivák 2018-04-24 07:45:59 UTC
I see an almost normal migration attempt in the log:

2018-04-23 13:26:00,498 Local maintenance detected
2018-04-23 13:26:00,526 EngineUp-LocalMaintenanceMigrateVm
2018-04-23 13:26:00,766 Score is 0 due to local maintenance mode
2018-04-23 13:26:00,832 The VM is running locally or we have no data, keeping the domain monitor.
2018-04-23 13:26:10,865 Continuing to monitor migration
...
2018-04-23 13:26:50,954 Global maintenance detected
2018-04-23 13:26:50,984 EngineMigratingAway-GlobalMaintenance
2018-04-23 13:26:51,231 Current state GlobalMaintenance (score: 3400)
...
2018-04-23 13:29:10,312 GlobalMaintenance-ReinitializeFSM
2018-04-23 13:29:20,576 ReinitializeFSM-EngineDown
2018-04-23 13:29:30,848 Engine vm is running on host 10.73.73.106 (id 2)

Comment 9 Martin Sivák 2018-04-24 07:55:12 UTC
Now this is interesting:

2018-04-23 17:46:55,061 EngineUp-LocalMaintenanceMigrateVm
2018-04-23 17:46:55,309 LocalMaintenanceMigrateVm-ReinitializeFSM
2018-04-23 17:46:55,309 The VM is running locally or we have no data, keeping the domain monitor.
2018-04-23 17:47:05,321 Local maintenance detected
2018-04-23 17:47:05,340 ReinitializeFSM-LocalMaintenance


This sequence generally means the migration failed. We do have logging there in 4.2, but we never backported the big change that contained the logging cleanups to 4.1.

Can we get the VDSM log?
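For reference, a sketch of how to pull the relevant lines, assuming the default log locations on the source host:

# HA agent state transitions (EngineUp, LocalMaintenanceMigrateVm, ...)
grep -E 'EngineUp|LocalMaintenance|ReinitializeFSM|Migrat' /var/log/ovirt-hosted-engine-ha/agent.log

# VDSM's view of the migration attempt
grep -i migration /var/log/vdsm/vdsm.log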

Comment 10 Yihui Zhao 2018-04-24 08:29:34 UTC
(In reply to Martin Sivák from comment #9)
> Now this is interesting:
> 
> 2018-04-23 17:46:55,061 EngineUp-LocalMaintenanceMigrateVm
> 2018-04-23 17:46:55,309 LocalMaintenanceMigrateVm-ReinitializeFSM
> 2018-04-23 17:46:55,309 The VM is running locally or we have no data,
> keeping the domain monitor.
> 2018-04-23 17:47:05,321 Local maintenance detected
> 2018-04-23 17:47:05,340ReinitializeFSM-LocalMaintenance
> 
> 
> This sequence generally means the migration failed. We do have logging there
> in 4.2, but we never backported the big change that contained the logging
> clenups to 4.1.
> 
> Can we get the VDSM log?

The VDSM log is also here: https://bugzilla.redhat.com/attachment.cgi?id=1425635

Comment 11 Martin Sivák 2018-04-24 09:33:23 UTC
Too bad I can't correlate the vdsm and hosted-engine logs... which host is the vdsm.log from?

Comment 12 Yihui Zhao 2018-04-24 09:53:46 UTC
Created attachment 1425891 [details]
vdsm_log

Comment 13 Yihui Zhao 2018-04-26 10:16:27 UTC
Update:

Tested with the following versions; it works for me.

cockpit-ovirt-dashboard-0.11.23-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
rhvm-appliance-4.2-20180420.0.el7.noarch

Tested steps:
1. Put one host into local maintenance (the first host is the one where HE was deployed, and the second host is the additional host)


#1. Put the first host into local maintenance:
The VM migration is OK; the completion message appears in the first host's vdsm.log:
"""
2018-04-26 16:04:06,215+0800 INFO  (migmon/d14af27b) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') Migration Progress: 80 seconds elapsed, 99% of data processed, total data: 16444MB, processed data: 4039MB, remaining data: 70MB, transfer speed 52MBps, zero pages: 3336262MB, compressed: 0MB, dirty rate: 2697, memory iteration: 3 (migration:867)
2018-04-26 16:04:08,107+0800 INFO  (libvirt/events) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') CPU stopped: onSuspend (vm:6104)
2018-04-26 16:04:09,216+0800 INFO  (migsrc/d14af27b) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') migration took 83 seconds to complete (migration:514)
2018-04-26 16:04:09,216+0800 INFO  (migsrc/d14af27b) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') Changed state to Down: Migration succeeded (code=4) (vm:1683)
"""


#2. Remove the first host from maintenance, then put the second host into local maintenance:
The VM migration is also OK. The completion message also appears in the second host's vdsm.log:

"""
migsrc/d14af27b) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') migration took 120 seconds to complete (migration:514)
2018-04-26 18:04:57,986+0800 INFO  (migsrc/d14af27b) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') Changed state to Down: Migration succeeded (code=4) (vm:1683)
2018-04-26 18:04:58,037+0800 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-04-26 18:04:58,041+0800 INFO  (jsonrpc/7) [api.virt] START getMigrationStatus() from=::1,47502, vmId=d14af27b-9859-4197-ac79-50ec9693bc1b (api:46)
2018-04-26 18:04:58,041+0800 INFO  (jsonrpc/7) [virt.vm] (vmId='d14af27b-9859-4197-ac79-50ec9693bc1b') new computed progress 98 < than old value 100, discarded (migration:200)
2018-04-26 18:04:58,041+0800 INFO  (jsonrpc/7) [api.virt] FINISH getMigrationStatus return={'status': {'message': 'Done', 'code': 0}, 'migrationStats': {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 100, 'downtime': 193L}} from=::1,47502, vmId=d14af27b-9859-4197-ac79-50ec9693bc1b (api:52)
2018-04-26 18:04:58,041+0800 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.getMigrationStatus succeeded in 0.00 seconds (__init__:573)

"""

So, closing it as working for me.

