Bug 1691244

Summary: [scale] HE - unable to migrate HE vm by putting the host to maintenance
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Ilan Zuckerman <izuckerm>
Component: Broker
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: David Vaanunu <dvaanunu>
Severity: high
Priority: high
CC: adevolder, ascerra, bugs, dvaanunu, jroberts, lrotenbe, michal.skrivanek, mkalinin, mlehrer, mperina, nsednev, rbarry, sabose
Target Milestone: ovirt-4.3.5
Keywords: Performance, Regression
Flags: sbonazzo: ovirt-4.3?, pm-rhel: blocker?, sbonazzo: planning_ack?, sbonazzo: devel_ack+, sbonazzo: testing_ack?
Hardware: x86_64
OS: Linux
Fixed In Version: ovirt-hosted-engine-setup-2.3.11
Doc Type: Bug Fix
Doc Text:
Previously, when the host running the engine VM was set into maintenance mode from the engine, the engine VM was migrated by ovirt-ha-agent as an indirect action caused by the maintenance mode. This process was not fully controlled by the engine and could potentially lead to side effects in specific cases. Now the engine itself drives the migration of the engine VM when its host is set into maintenance.
Last Closed: 2019-07-30 14:08:33 UTC
Type: Bug
oVirt Team: Integration
Bug Blocks: 1724792
Attachments: Tasks tab

Description Ilan Zuckerman 2019-03-21 08:58:40 UTC
Created attachment 1546377 [details]
Tasks tab

Description of problem:

When the host on which the HE VM resides is put into maintenance, the VM fails to migrate to the next available host.
The HE VM gets stuck in the "Migrating" state, while the host it is trying to migrate from gets stuck in the "PreparingForMaintenance" state.

At some point, the "Tasks" view inside the webAdmin UI shows the task "Moving Host host_mixed_1 to Maintenance" duplicated over and over again (see the attached image).
The engine log loops through the same events repeatedly:
https://paste.fedoraproject.org/paste/MmTxXa4V1zR3gLFOSji6yQ

Version-Release number of selected component (if applicable):

rpm -qa | grep rhv-release
rhv-release-4.3.2-1-001.noarch

Software Version:4.3.2-0.1.el7

Lab Topology using Hosted Engine:
9 DCs and 9 clusters
536 hosts, of which 35 are real servers and 501 are nested hosts

11 Storage Domains composed of:
7 FC Domains
2 iSCSI Domains
2 NFS Domains (Export + Data)

4091 VMs are up

Hosted Engine running on a dedicated host: 26 vCPUs with 32 GB RAM

How reproducible:
Always


Steps to Reproduce:
1. Assume you have two hosts in the same cluster that are capable of hosting the HE VM
2. Put the host on which the HE VM resides into maintenance

Actual results:
The HE VM gets stuck forever in the "Migrating" status, while the host is stuck in the "PreparingForMaintenance" state


Expected results:
The VM should migrate to the next available host

Additional info:
Environment Credentials are provided in private message.

Comment 3 Ilan Zuckerman 2019-03-24 08:49:50 UTC
One more addition to this BZ:
When you trigger migration of the HE VM MANUALLY from the WebAdmin, the migration works fine, meaning the HE VM is migrated successfully to the second available host.

Comment 4 Sandro Bonazzola 2019-03-25 13:39:56 UTC
Looks like a virt team issue. Ryan, can you please have a look?

Comment 6 Michal Skrivanek 2019-03-26 08:28:19 UTC
Migration is (most likely) invoked by the HE broker because local maintenance is being set:
2019-03-21 07:15:29,281Z INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] VM 'a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21'(HostedEngine) moved from 'Up' --> 'MigratingFrom'

in vdsm:

2019-03-21 07:15:29,273+0000 INFO  (jsonrpc/2) [api.virt] START migrate(params={u'tunneled': False, u'autoConverge': True, u'dstqemu': u'b01-h03-r620.rhev.openstack.engineering.redhat.com', u'src': u'localhost', u'enableGuestEvents': True, u'dst': u'b01-h03-r620.rhev.openstack.engineering.redhat.com', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21', u'abortOnError': True, u'compressed': False, u'method': u'online'}) from=::1,34144, vmId=a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21 (api:48)
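The convergenceSchedule in the migrate call above encodes how the migration monitor reacts to stalling: each stalling step raises the allowed downtime, and the final step aborts. A minimal sketch (not oVirt code, just a walk over the structure quoted from the log) showing the downtime escalation:

```python
# The convergence schedule as passed in the vdsm migrate() call above,
# re-typed as a plain Python dict for illustration.
schedule = {
    "init": [{"name": "setDowntime", "params": ["100"]}],
    "stalling": [
        {"limit": 1,  "action": {"name": "setDowntime", "params": ["150"]}},
        {"limit": 2,  "action": {"name": "setDowntime", "params": ["200"]}},
        {"limit": 3,  "action": {"name": "setDowntime", "params": ["300"]}},
        {"limit": 4,  "action": {"name": "setDowntime", "params": ["400"]}},
        {"limit": 6,  "action": {"name": "setDowntime", "params": ["500"]}},
        {"limit": -1, "action": {"name": "setDowntime", "params": ["5000"]}},
        {"limit": -1, "action": {"name": "abort", "params": []}},
    ],
}

def downtime_steps(schedule):
    """Return the downtime values (ms) the monitor would apply, in order."""
    steps = [int(a["params"][0]) for a in schedule["init"]
             if a["name"] == "setDowntime"]
    for step in schedule["stalling"]:
        action = step["action"]
        if action["name"] == "setDowntime":
            steps.append(int(action["params"][0]))
    return steps

print(downtime_steps(schedule))
# [100, 150, 200, 300, 400, 500, 5000]
```

So the schedule allows up to 5 seconds of downtime as the last resort before giving up, which matches the "Aborting migration" line quoted in comment 7.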


and then progresses indefinitely, e.g.:
2019-03-21 07:27:13,832+0000 WARN  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Migration stalling: remaining (12253MiB) > lowmark (357MiB). (migration:849)
2019-03-21 07:27:13,832+0000 INFO  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Migration Progress: 700.357 seconds elapsed, 63% of data processed, total data: 32864MB, processed data: 36313MB, remaining data: 12253MB, transfer speed 52MBps, zero pages: 263841MB, compressed: 0MB, dirty rate: 20630, memory iteration: 2 (migration:879)
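The dirty rate and transfer speed in the log line above already explain the stall: a live migration can only converge if pages are copied faster than the guest dirties them. A back-of-the-envelope check, assuming 4 KiB guest pages (an assumption; the page size is not stated in the log):

```python
PAGE_SIZE_KIB = 4                 # assumed guest page size
dirty_rate_pages = 20630          # pages/s, from the vdsm log above
transfer_speed_mb_s = 52          # MB/s, from the vdsm log above

# Rate at which the guest dirties memory, in MB/s.
dirty_rate_mb_s = dirty_rate_pages * PAGE_SIZE_KIB / 1024  # ~80.6 MB/s
converges = transfer_speed_mb_s > dirty_rate_mb_s

print(round(dirty_rate_mb_s, 1), converges)  # 80.6 False
```

Under these assumptions the guest dirties roughly 80 MB/s while the migration copies only 52 MB/s, so the remaining-data figure can never reach zero without raising the bandwidth cap or the allowed downtime.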

Comment 7 Michal Skrivanek 2019-03-26 08:41:02 UTC
The last migration monitor update is:
2019-03-21 09:10:52,176+0000 INFO  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Migration Progress: 2500.89 seconds elapsed, 99% of data processed, total data: 32864MB, processed data: 273298MB, remaining data: 558MB, transfer speed 52MBps, zero pages: 757443MB, compressed: 0MB, dirty rate: 14328, memory iteration: 19 (migration:879)

Then it is aborted at:
2019-03-21 09:11:02,184+0000 WARN  (migmon/a9a6eaa8) [virt.vm] (vmId='a9a6eaa8-cd51-4d86-89b0-3f8b64dd9f21') Aborting migration (migration:914)
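The progress line above also shows the non-convergence numerically: far more data has been processed than the guest's total RAM, meaning the same dirtied pages were re-copied across many iterations. A quick sanity check on the quoted figures:

```python
total_mb = 32864        # total data, from the progress line above
processed_mb = 273298   # processed data, from the progress line above

# How many full passes over guest memory the migration has effectively made.
passes = processed_mb / total_mb
print(round(passes, 1))  # 8.3
```

Over eight full passes (the log reports "memory iteration: 19", counting partial iterations) with 558 MB still remaining is consistent with a migration that stalls just short of convergence until it is aborted.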

Comment 8 Michal Skrivanek 2019-03-26 09:36:12 UTC
The main difference between the manual attempt and the HE-broker-invoked migration is the policy: the cluster uses the Legacy profile with a 625 Mbps bandwidth limit, while the broker-invoked migration uses the Suspend policy (with 5 s downtime) capped at 52 MBps, migrating indefinitely.

It seems the migration is actually aborted due to miscalculated iterations, which Milan is fixing in https://gerrit.ovirt.org/#/c/98792/. But that is unrelated to this problem.

Comment 9 Michal Skrivanek 2019-03-26 09:52:43 UTC
Moving to Integration for a possible improvement in the HE Broker, to either:
1) use the engine API to invoke the migration (which would use the same parameters as a manual migration in the UI),
2) send a better convergence schedule (use more bandwidth than the default of 52 MBps when nothing is set; the engine tries to guess the link speed),
3) or let the engine migrate itself, i.e. do not initiate the migration from the broker when the engine is responsive.

Comment 10 RHEL Program Management 2019-03-27 08:14:24 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 11 Sandro Bonazzola 2019-04-03 07:58:04 UTC
Moving to HA broker: plan is to remove the functionality from HA broker and let the engine do the migration.

Comment 12 Martin Perina 2019-04-23 13:43:19 UTC
*** Bug 1691508 has been marked as a duplicate of this bug. ***

Comment 14 David Vaanunu 2019-07-15 12:10:54 UTC
rpm -qa | grep rhv-release
rhv-release-4.3.5-6-001.noarch

Software Version:4.3.5.4-0.1.el7


Test case verified twice successfully.
1. Migrate HE manually from Host1 to Host2

2019-07-15 12:58:15,128+03 INFO  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-230) [89932e62-aecd-4f68-800c-9dede34bb692] Running command: MigrateVmToServerCommand internal: false. Entities affected :  ID: 81415768-ae4e-4b6b-9888-379656c5c9dd Type: VMAction group MIGRATE_VM with role type USER

2019-07-15 12:58:15,293+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-230) [89932e62-aecd-4f68-800c-9dede34bb692] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_2, User: admin@internal-authz).

2019-07-15 12:58:39,789+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-44) [] VM '81415768-ae4e-4b6b-9888-379656c5c9dd' is migrating to VDS '724fe6fb-2a51-44d8-a07b-484c9c9f6cc1'(host_mixed_2) ignoring it in the refresh until migration is done

2019-07-15 12:58:55,301+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_2, Duration: 28 seconds, Total: 40 seconds, Actual downtime: 244ms)




2. Host2 change to maintenance, HE migrates automatically to Host1


2019-07-15 14:04:09,832+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-248) [6a0c4466] EVENT_ID: VM_MIGRATION_START_SYSTEM_INITIATED(67), Migration initiated by system (VM: HostedEngine, Source: host_mixed_2, Destination: host_mixed_1, Reason: ).

2019-07-15 14:04:09,840+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-248) [6a0c4466] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host host_mixed_2 was switched to Maintenance mode by admin@internal-authz.

2019-07-15 14:04:10,398+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [17d16dc4-32bb-4db9-a674-f05b49ef8ec1] Command 'MaintenanceNumberOfVdss' id: 'f27fc39e-d916-4cef-8c47-9de559ce9774' child commands '[]' executions were completed, status 'SUCCEEDED'

2019-07-15 14:04:13,886+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] VM '81415768-ae4e-4b6b-9888-379656c5c9dd' is migrating to VDS 'af584899-03a4-40d6-9cf8-60a74c66085e'(host_mixed_1) ignoring it in the refresh until migration is done

2019-07-15 14:04:49,837+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: HostedEngine, Source: host_mixed_2, Destination: host_mixed_1, Duration: 23 seconds, Total: 40 seconds, Actual downtime: 339ms)

Comment 15 Nikolai Sednev 2019-07-15 12:23:28 UTC
This bug seems like a flavor of https://bugzilla.redhat.com/show_bug.cgi?id=1665934, which is fixed.

Comment 16 Sandro Bonazzola 2019-07-30 14:08:33 UTC
This bugzilla is included in oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.