Bug 1691244 - [scale] HE - unable to migrate HE vm by putting the host to maintenance
Summary: [scale] HE - unable to migrate HE vm by putting the host to maintenance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: ---
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.3.5
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: David Vaanunu
URL:
Whiteboard:
Duplicates: 1691508 (view as bug list)
Depends On:
Blocks: 1724792
TreeView+ depends on / blocked
 
Reported: 2019-03-21 08:58 UTC by Ilan Zuckerman
Modified: 2020-02-25 09:20 UTC (History)
13 users (show)

Fixed In Version: ovirt-hosted-engine-setup-2.3.11
Doc Type: Bug Fix
Doc Text:
Previously, when the host running the engine VM was set into maintenance mode from the engine, the engine VM was migrated by ovirt-ha-agent as an indirect action caused by the maintenance mode. This process was not fully controlled by the engine and could potentially lead to side effects in specific cases.
Clone Of:
Environment:
Last Closed: 2019-07-30 14:08:33 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.3?
pm-rhel: blocker?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
sbonazzo: testing_ack?


Attachments (Terms of Use)
Tasks tab (57.05 KB, image/png)
2019-03-21 08:58 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 99404 0 'None' MERGED Remove the support to start a migration 2021-01-25 10:44:42 UTC
oVirt gerrit 99462 0 'None' MERGED Avoid entering local mainteance mode if the engine VM is there 2021-01-25 10:44:42 UTC
oVirt gerrit 99523 0 'None' MERGED he: avoid skipping HE VM migration on maintenance 2021-01-25 10:44:42 UTC
oVirt gerrit 100299 0 'None' MERGED he: avoid skipping HE VM migration on maintenance 2021-01-25 10:44:40 UTC
oVirt gerrit 100300 0 'None' MERGED Avoid entering local mainteance mode if the engine VM is there 2021-01-25 10:44:41 UTC
oVirt gerrit 100301 0 'None' MERGED Remove the support to start a migration 2021-01-25 10:44:41 UTC

Description Ilan Zuckerman 2019-03-21 08:58:40 UTC
Created attachment 1546377 [details]
Tasks tab

Description of problem:

When putting the host on which the HE VM resides into maintenance, the VM fails to migrate to the next available host.
The HE VM gets stuck in the "migrating" state, while the host it is trying to migrate from gets stuck in the "PreparingForMaintenance" state.

At some point, when you look at the "Tasks" tab in the webAdmin UI, you see that the "Moving Host host_mixed_1 to Maintenance" task gets duplicated over and over again (see attached image).
The engine log loops through the same events over and over again:
https://paste.fedoraproject.org/paste/MmTxXa4V1zR3gLFOSji6yQ

Version-Release number of selected component (if applicable):

rpm -qa | grep rhv-release
rhv-release-4.3.2-1-001.noarch

Software Version:4.3.2-0.1.el7

Lab Topology using Hosted Engine:
9 DC and 9 Clusters
536 hosts, of which 35 are real servers and 501 are nested hosts

11  Storage Domains composed of:
7 FC Domains
2 iSCSI Domains
2 NFS Domains (Export + Data)

4091 VMS are up

Hosted Engine running on dedicated host:  26vcpu with 32GB ram

How reproducible:
Always


Steps to Reproduce:
1. Assume you have two hosts in the same cluster that are capable of hosting the HE VM
2. Put the host on which the HE VM resides into maintenance

Actual results:
The HE VM gets stuck forever in the "migrating" status, while the host is stuck in the "PreparingForMaintenance" state


Expected results:
The VM should migrate to the next available host

Additional info:
Environment Credentials are provided in private message.

Comment 3 Ilan Zuckerman 2019-03-24 08:49:50 UTC
One more addition to this BZ:
When you trigger migration of the HE VM MANUALLY from the WebAdmin, the migration works fine, meaning that the HE VM migrates successfully to the second available host.

Comment 4 Sandro Bonazzola 2019-03-25 13:39:56 UTC
Looks like a virt team issue. Ryan, can you please have a look?

Comment 6 Michal Skrivanek 2019-03-26 08:28:19 UTC
Migration is invoked by the HE broker (most likely) because local maintenance is being set -
2019-03-21 07:15:29,281Z INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] VM 'a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21'(HostedEngine) moved from 'Up' --> 'MigratingFrom'

in vdsm:

2019-03-21 07:15:29,273+0000 INFO  (jsonrpc/2) [api.virt] START migrate(params={u'tunneled': False, u'autoConverge': True, u'dstqemu': u'b01-h03-r620.rhev.openstack.engineering.redhat.com', u'src': u'localhost', u'enableGuestEvents': True, u'dst': u'b01-h03-r620.rhev.openstack.engineering.redhat.com', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21', u'abortOnError': True, u'compressed': False, u'method': u'online'}) from=::1,34144, vmId=a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21 (api:48)


and then progresses indefinitely, e.g.:
2019-03-21 07:27:13,832+0000 WARN  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Migration stalling: remaining (12253MiB) > lowmark (357MiB). (migration:849)
2019-03-21 07:27:13,832+0000 INFO  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Migration Progress: 700.357 seconds elapsed, 63% of data processed, total data: 32864MB, processed data: 36313MB, remaining data: 12253MB, transfer speed 52MBps, zero pages: 263841MB, compressed: 0MB, dirty rate: 20630, memory iteration: 2 (migration:879)

Comment 7 Michal Skrivanek 2019-03-26 08:41:02 UTC
the last migration monitor update is
2019-03-21 09:10:52,176+0000 INFO  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Migration Progress: 2500.89 seconds elapsed, 99% of data processed, total data: 32864MB, processed data: 273298MB, remaining data: 558MB, transfer speed 52MBps, zero pages: 757443MB, compressed: 0MB, dirty rate: 14328, memory iteration: 19 (migration:879)

then it's aborted at
2019-03-21 09:11:02,184+0000 WARN  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Aborting migration (migration:914)

Comment 8 Michal Skrivanek 2019-03-26 09:36:12 UTC
the main difference between the manual attempt and the he-broker-invoked migration is that the cluster is using the Legacy profile with a 625Mbps bandwidth limit, vs the Suspend policy (with 5s downtime) capped at 52Mbps, migrating indefinitely

It seems the migration is actually aborted due to miscalculated iterations, which Milan is fixing in https://gerrit.ovirt.org/#/c/98792/. But that's unrelated to this problem
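The convergence schedule visible in the vdsm log in comment 6 can be read as a small decision table: each "stalling" entry tolerates up to `limit` stalled iterations before its action applies, with `limit: -1` acting as the last resort. A simplified, self-contained Python sketch of that interpretation follows (the function name is illustrative; this is not vdsm's actual implementation, which advances through the schedule step by step and eventually aborts):

```python
# Simplified reading of a vdsm-style migration convergence schedule.
# Each "stalling" step allows up to `limit` stalled iterations before
# its action applies; limit == -1 marks a catch-all last-resort step.

def next_action(schedule, stalled_iterations):
    """Return (action_name, params) for the current stall count."""
    for step in schedule:
        limit = step["limit"]
        if limit == -1 or stalled_iterations <= limit:
            return step["action"]["name"], step["action"]["params"]
    return "abort", []

# The "stalling" part of the schedule from the vdsm log above.
STALLING = [
    {"action": {"name": "setDowntime", "params": ["150"]}, "limit": 1},
    {"action": {"name": "setDowntime", "params": ["200"]}, "limit": 2},
    {"action": {"name": "setDowntime", "params": ["300"]}, "limit": 3},
    {"action": {"name": "setDowntime", "params": ["400"]}, "limit": 4},
    {"action": {"name": "setDowntime", "params": ["500"]}, "limit": 6},
    {"action": {"name": "setDowntime", "params": ["5000"]}, "limit": -1},
]
```

Under this reading, one stalled iteration raises the allowed downtime to 150 ms, and once the stall count exceeds every finite limit the 5000 ms catch-all applies; in vdsm the schedule also ends with an `abort` step, which is what fires in comment 7.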

Comment 9 Michal Skrivanek 2019-03-26 09:52:43 UTC
moving to Integration for a possible improvement in the HE Broker to either
1) use the engine API to invoke the migration (which would use the same parameters as a manual migration in the UI),
2) or send a better convergence schedule (use more bandwidth than the default of 52Mbps when nothing is set; the engine tries to guess the link speed),
3) or let the engine migrate itself: do not initiate the migration from the broker when the engine is responsive.

Comment 10 RHEL Program Management 2019-03-27 08:14:24 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 11 Sandro Bonazzola 2019-04-03 07:58:04 UTC
Moving to HA broker: the plan is to remove the functionality from the HA broker and let the engine do the migration.

Comment 12 Martin Perina 2019-04-23 13:43:19 UTC
*** Bug 1691508 has been marked as a duplicate of this bug. ***

Comment 14 David Vaanunu 2019-07-15 12:10:54 UTC
rpm -qa | grep rhv-release
rhv-release-4.3.5-6-001.noarch

Software Version:4.3.5.4-0.1.el7


Test case verified twice successfully.
1. Migrate HE manually from Host1 to Host2

2019-07-15 12:58:15,128+03 INFO  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-230) [89932e62-aecd-4f68-800c-9dede34bb692] Running command: MigrateVmToServerCommand internal: false. Entities affected :  ID: 81415768-ae4e-4b6b-9888-379656c5c9dd Type: VMAction group MIGRATE_VM with role type USER

2019-07-15 12:58:15,293+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-230) [89932e62-aecd-4f68-800c-9dede34bb692] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_2, User: admin@internal-authz).

2019-07-15 12:58:39,789+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-44) [] VM '81415768-ae4e-4b6b-9888-379656c5c9dd' is migrating to VDS '724fe6fb-2a51-44d8-a07b-484c9c9f6cc1'(host_mixed_2) ignoring it in the refresh until migration is done

2019-07-15 12:58:55,301+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_2, Duration: 28 seconds, Total: 40 seconds, Actual downtime: 244ms)




2. Host2 change to maintenance, HE migrates automatically to Host1


2019-07-15 14:04:09,832+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-248) [6a0c4466] EVENT_ID: VM_MIGRATION_START_SYSTEM_INITIATED(67), Migration initiated by system (VM: HostedEngine, Source: host_mixed_2, Destination: host_mixed_1, Reason: ).

2019-07-15 14:04:09,840+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-248) [6a0c4466] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host host_mixed_2 was switched to Maintenance mode by admin@internal-authz.

2019-07-15 14:04:10,398+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [17d16dc4-32bb-4db9-a674-f05b49ef8ec1] Command 'MaintenanceNumberOfVdss' id: 'f27fc39e-d916-4cef-8c47-9de559ce9774' child commands '[]' executions were completed, status 'SUCCEEDED'

2019-07-15 14:04:13,886+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] VM '81415768-ae4e-4b6b-9888-379656c5c9dd' is migrating to VDS 'af584899-03a4-40d6-9cf8-60a74c66085e'(host_mixed_1) ignoring it in the refresh until migration is done

2019-07-15 14:04:49,837+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: HostedEngine, Source: host_mixed_2, Destination: host_mixed_1, Duration: 23 seconds, Total: 40 seconds, Actual downtime: 339ms)

Comment 15 Nikolai Sednev 2019-07-15 12:23:28 UTC
This bug seems like a flavor of https://bugzilla.redhat.com/show_bug.cgi?id=1665934, which is fixed.

Comment 16 Sandro Bonazzola 2019-07-30 14:08:33 UTC
This bugzilla is included in oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

