Bug 1399766 - Host can be stuck in preparing for maintenance, because of the current maintenance state transitions
Summary: Host can be stuck in preparing for maintenance, because of the current maintenance state transitions
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.0.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ovirt-4.1.7
Target Release: ---
Assignee: Denis Chaplygin
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1362618
Depends On: 1419326 1479768
Blocks:
 
Reported: 2016-11-29 16:57 UTC by Martin Sivák
Modified: 2017-11-09 11:21 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
If no self-hosted engine host is available other than the host that is currently running the self-hosted engine, you will not be able to move the host that runs the Manager virtual machine to maintenance mode. Furthermore, even after another host moves to a status of "Up", it could take a few minutes for it to receive a score that will enable it to run the Manager virtual machine. The host will remain in a "preparing for maintenance" state until the Manager virtual machine can migrate to another host.
Clone Of:
Environment:
Last Closed: 2017-09-27 11:06:58 UTC
oVirt Team: SLA
Embargoed:
eheftman: needinfo-
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
dfediuck: devel_ack+
mavital: testing_ack+


Attachments
sosreport-nsednev-he-1.qa.lab.tlv.redhat.com-20170312103030.tar.xz (9.44 MB, application/x-xz)
2017-03-12 08:35 UTC, Nikolai Sednev
delay being screened (9.60 MB, application/octet-stream)
2017-08-23 16:16 UTC, Nikolai Sednev


Links
Red Hat Bugzilla 1362618 (high, CLOSED): [HE] sometimes move HE host to maintenance stuck in "Preparing For Maintenance" status. Last updated 2021-02-22 00:41:40 UTC
oVirt gerrit 67540 (MERGED): Improve logging of migration failure transition. Last updated 2021-02-02 06:54:07 UTC
oVirt gerrit 68597 (MERGED): Improve logging of migration failure transition. Last updated 2021-02-02 06:54:07 UTC
oVirt gerrit 75684 (MERGED): he: Implemented dynamic delay of monitoring loop. Last updated 2021-02-02 06:54:07 UTC
oVirt gerrit 75685 (MERGED): he: Implemented monitoring loop steps management. Last updated 2021-02-02 06:54:52 UTC

Internal Links: 1362618

Description Martin Sivák 2016-11-29 16:57:08 UTC
Description of problem:

It is possible to get stuck in Preparing for maintenance when the following is performed using only 2 hosted engine hosts (A, B).

Let's assume the HE VM runs on host B.
Put A into maintenance.
Activate A.
As soon as A is Up (be fast, the HE score must still be 0), put host B into maintenance mode.

The other scenario, which can be used once this has been hit for the first time, is simply:

The VM is running on B, but B is already in the LocalMaintenance state (and probably Preparing for maintenance in the engine).
Activate B.
Put B back into maintenance again.


Version-Release number of selected component (if applicable):

all components from current 4.0.6 snapshots as of 29th of Nov 2016

How reproducible:

Always, but you need to be somewhat fast.


Actual results:

Stuck in preparing for maintenance

Expected results:

No maintenance mode should be attempted and the user should be informed.

Additional info:

This is caused by two transition sequences in the hosted engine agent.

The first flow:

The state machine went from EngineMigratingAway to ReinitializeFSM, meaning something weird happened (13:42:02). VDSM finished the migration at 13:43:44. It needs to be checked what was in the new_data.migration_result variable when the EngineMigratingAway state failed.

The second case is:

Host B activated

16:55:45,307 LocalMaintenance-ReinitializeFSM
16:56:10,111 ReinitializeFSM-EngineStarting

The agent needs a couple of seconds to realize the engine is properly up, but it won't make it, because maintenance mode is set again.

16:56:36,855 EngineStarting-LocalMaintenance

This can be prevented by an engine patch: https://gerrit.ovirt.org/#/c/67300/
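
For reference, a minimal diagnostic sketch (assuming the suspicious transitions show up in agent.log as "OldState-NewState" strings, as in the excerpts above, and that the log sits at the default path):

import re

# Assumption: a transition is logged somewhere in the line as "Old-New".
TRANSITION = re.compile(r'([A-Za-z]+)-([A-Za-z]+)')
SUSPECT = {('EngineMigratingAway', 'ReinitializeFSM'),
           ('EngineStarting', 'LocalMaintenance')}

with open('/var/log/ovirt-hosted-engine-ha/agent.log') as log:
    for line in log:
        for old, new in TRANSITION.findall(line):
            if (old, new) in SUSPECT:
                print(line.rstrip())
                break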

Comment 1 Artyom 2016-11-30 13:22:15 UTC
Hi Martin, I believe it would be better to check the ha-agent status from the engine and accept the maintenance operation only when the host is in the EngineUp state, and fail the maintenance operation in all other cases with an appropriate message from the engine side (of course, I am talking only about cases where the HE VM runs on the host).
What do you think?

Comment 2 Martin Sivák 2016-11-30 14:04:28 UTC
That is a bad idea :) The state names and possible transitions are internal knowledge of the agent and can (and in fact do) change. We do not export the state name in a machine-readable format for a good reason.

The score paired with the local maintenance flag is enough to filter most (if not all) of the issues out.

Comment 3 Nikolai Sednev 2016-11-30 14:12:16 UTC
(In reply to Martin Sivák from comment #2)
> That is a bad idea :) The state names and possible transitions are internal
> knowledge of the agent and can (and in fact do) change. We do not export the
> state name in a machine-readable format for a good reason.
> 
> The score paired with the local maintenance flag is enough to filter most
> (if not all) of the issues out.

IMHO we should not move any hosted-engine host with the HE-VM running on it into maintenance unless there is at least one other hosted-engine host with a positive HA score available; otherwise customers will get stuck with hosts in "preparing for maintenance" and we'll get tons of questions about these corner cases.

Comment 4 Martin Sivák 2016-11-30 14:16:36 UTC
> IMHO we should not move any hosted-engine host with the HE-VM running on it
> into maintenance unless there is at least one other hosted-engine host with
> a positive HA score available

And that is exactly what https://gerrit.ovirt.org/#/c/67300/ is doing.

Comment 5 Artyom 2016-12-01 07:25:09 UTC
What do you mean by machine-readable format?

We have the hosted engine client API that gives you the possibility to get all the information you need (I believe the host score is passed to the engine via VDSM in the same way):
print he_client.get_all_stats()
{0: {'maintenance': False}, 1: {'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1162589 (Thu Dec  1 09:12:05 2016)\nhost-id=1\nscore=0\nmaintenance=True\nstate=LocalMaintenance\nstopped=False\n', 'hostname': 'puma23.scl.lab.tlv.redhat.com', 'host-id': 1, 'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}', 'score': 0, 'stopped': False, 'maintenance': True, 'crc32': 'b8623ef2', 'host-ts': 1162589}, 2: {'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=590695 (Thu Dec  1 09:11:50 2016)\nhost-id=2\nscore=3400\nmaintenance=False\nstate=EngineUp\nstopped=False\n', 'hostname': 'puma26.scl.lab.tlv.redhat.com', 'host-id': 2, 'engine-status': '{"health": "good", "vm": "up", "detail": "up"}', 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '90aab4bd', 'host-ts': 590695}, 3: {'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1162571 (Thu Dec  1 09:12:05 2016)\nhost-id=3\nscore=0\nmaintenance=True\nstate=LocalMaintenance\nstopped=False\n', 'hostname': 'puma27.scl.lab.tlv.redhat.com', 'host-id': 3, 'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}', 'score': 0, 'stopped': False, 'maintenance': True, 'crc32': 'ca6ca357', 'host-ts': 1162571}}

Also, when a user has a host stuck in the "preparing for maintenance" state, how does he know that he needs to check the ovirt-ha-agent log, given that the engine log lacks the information?
To me it does not look user friendly at all.

Comment 6 Martin Sivák 2016-12-01 08:21:46 UTC
Artyom: The dict is exactly what I am talking about. The state machine's current state does not have its own key in there. It is only reported in string form via the extra key, together with other data. We _only_ put it there for debugging purposes (internally it is not even guaranteed to be 100% correct).

> 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\n
> timestamp=1162571 (Thu Dec  1 09:12:05 2016)\nhost-id=3\n
> score=0\nmaintenance=True\nstate=LocalMaintenance\nstopped=False\n'

We won't ever use the state name in the engine. Period. We would have to maintain the logic in two places if we did, and we really, really do not want that.
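
For illustration, a minimal sketch of the kind of check described above, built only on the score and maintenance keys that get_all_stats() already exposes (see comment 5), not on the state string inside 'extra'; the client import path is an assumption based on how the hosted-engine tools consume the same data:

from ovirt_hosted_engine_ha.client import client

def other_host_can_run_he_vm(current_host_id):
    """Return True if some other HA host reports a positive score and is
    not in local maintenance, i.e. it could receive the HE VM."""
    stats = client.HAClient().get_all_stats()
    for host_id, data in stats.items():
        if host_id in (0, current_host_id):  # key 0 carries global metadata
            continue
        if data.get('score', 0) > 0 and not data.get('maintenance', False):
            return True
    return False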

Comment 7 Kobi Hakimi 2016-12-14 10:17:21 UTC
*** Bug 1362618 has been marked as a duplicate of this bug. ***

Comment 8 Martin Sivák 2016-12-16 11:42:35 UTC
Multiple patches have been merged that should prevent getting into situations like this, both in the engine and in the hosted engine tools.

Comment 9 Sandro Bonazzola 2017-01-25 07:54:57 UTC
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

Comment 10 Sandro Bonazzola 2017-02-01 16:01:22 UTC
oVirt 4.1.0 GA has been released, re-targeting to 4.1.1.
Please check if this issue is correctly targeted or already included in 4.1.0.

Comment 11 Denis Chaplygin 2017-03-02 06:03:57 UTC
As far as I understand, the patches have already been merged to the required branches.

Comment 12 Nikolai Sednev 2017-03-05 17:11:01 UTC
Still reproducible.
1) Have at least two hosts and put the one with the HE-VM into maintenance.
2) The HE-VM migrates to another host.
3) Activate the host back from maintenance.
4) Once the host becomes active, set the host with the HE-VM into maintenance; you will have to be fast enough.
The host with the HE-VM will get stuck in preparing for maintenance.
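
For completeness, the same steps expressed as a rough sketch against the v3 Python SDK (ovirt-engine-sdk-python, listed below); the engine URL, credentials and host names are placeholders, and the waits between steps are elided:

from ovirtsdk.api import API

api = API(url='https://engine.example.com/ovirt-engine/api',
          username='admin@internal', password='secret', insecure=True)

host_a = api.hosts.get(name='alma03')  # host currently running the HE-VM
host_b = api.hosts.get(name='alma04')

host_a.deactivate()   # step 1: maintenance; the HE-VM migrates to host_b
# ... wait for the migration and for host_a to reach Maintenance ...
host_a.activate()     # step 3: activate host_a again
# step 4: as soon as host_a is reported Up, request maintenance for the host
# now running the HE-VM -- this is the window where it gets stuck
host_b.deactivate()
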
Components on hosts:
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
rhevm-appliance-20160721.0-2.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.19.6-1.el7ev.x86_64
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.3-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.1.2-0.1.el7.noarch
Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017
Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 13 Red Hat Bugzilla Rules Engine 2017-03-05 17:11:08 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 14 Yaniv Kaul 2017-03-09 09:17:42 UTC
Denis - the bug was re-opened - are you looking at it?

Comment 15 Martin Sivák 2017-03-09 15:33:48 UTC
Nikolai, attach logs please.

Comment 16 Nikolai Sednev 2017-03-12 08:28:39 UTC
I think this issue might be related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1411319 && https://bugzilla.redhat.com/show_bug.cgi?id=1419326

Reproduced based on:
"Let's assume the HE VM runs on host B (alma03).
Put A (alma04) into maintenance.
Activate A (alma04).
As soon as A (alma04) is Up (be fast, the HE score must still be 0), put host B (alma03) into maintenance mode.

The other scenario, which can be used once this has been hit for the first time, is simply:

The VM is running on B (alma03), but B (alma03) is already in the LocalMaintenance state (and probably Preparing for maintenance in the engine).
Activate B (alma03).
Put B (alma03) back into maintenance again."


For me the second reproduction scenario worked.
Please see the attached screen cast.

Comment 17 Nikolai Sednev 2017-03-12 08:35:44 UTC
Created attachment 1262193 [details]
sosreport-nsednev-he-1.qa.lab.tlv.redhat.com-20170312103030.tar.xz

Comment 19 Nikolai Sednev 2017-03-12 12:29:27 UTC
Screen cast:
https://drive.google.com/open?id=0B85BEaDBcF88UzZJazZKUU52U2s

Comment 20 Denis Chaplygin 2017-04-19 13:04:57 UTC
I'm not able to reproduce it. While the HE score is 0 or host A is still in local maintenance (from the HE point of view), the engine does not allow me to put host B into maintenance mode.

Could you please check whether it is still reproducible on your side?

Comment 21 Nikolai Sednev 2017-04-19 14:35:51 UTC
(In reply to Denis Chaplygin from comment #20)
> I'm not able to reproduce it. While the HE score is 0 or host A is still in
> local maintenance (from the HE point of view), the engine does not allow me
> to put host B into maintenance mode.
> 
> Could you please check whether it is still reproducible on your side?

Further to our conversation, I've reproduced the scenario on your environment, which was upstream by the way, while my original environment was downstream.

Comment 22 Denis Chaplygin 2017-04-20 10:07:56 UTC
I investigated the issue and realized that there is not much that can be done. We are trying to synchronize two different applications, each living on its own schedule, and unfortunately those schedules are quite different.

At the moment the hosted engine agent updates its status with about a 30 second delay, and status change propagation from HE to the engine takes about 15 seconds.

So, in the worst case, the engine will see the correct state of the HE agent after 75 seconds (30 seconds to report the transition from the LocalMaintenance state, 30 seconds to report the new score, and a 15 second delay on the engine side).

Therefore, when you try to immediately use a host that has just returned from maintenance, the engine is not able to make a correct decision and tries to operate the hosted engine while it is still in an incorrect state.

There is no way to fix that behaviour, because our system is not a realtime system and we have a lot of delays in it (as I mentioned above, waiting for more than 75 seconds between actions should be safe).


The only thing we can do is to try to decrease the hosted engine monitoring cycle time and, therefore, decrease the total time required to synchronize the engine and HE state.
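
Spelled out with the figures above (these are the quoted estimates, not values read from configuration):

AGENT_LOOP_S = 30    # one agent loop to report leaving LocalMaintenance
SCORE_LOOP_S = 30    # a second loop to publish the new score
ENGINE_POLL_S = 15   # propagation delay on the engine side

worst_case_s = AGENT_LOOP_S + SCORE_LOOP_S + ENGINE_POLL_S  # 75 seconds
print("wait at least %d seconds between maintenance actions" % worst_case_s)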

Comment 23 Nikolai Sednev 2017-04-20 11:03:40 UTC
Please provide your input regarding comment #22.

Comment 24 Yaniv Lavi 2017-04-23 16:03:04 UTC
It sounds like an edge case, therefore the solution sounds OK to me.

Comment 28 Kobi Hakimi 2017-04-25 14:47:08 UTC
Hi,
Since we are doing automation and this scenario of changing host status repeats many times in our regression runs, we hit this side effect in almost every run.
Maybe the proper way to solve it is to change the status only after checking in the background that the current status is the correct one.
What do you think?

Comment 30 Elad 2017-04-25 16:17:07 UTC
This issue is more than just annoying.
Besides completely blocking some of our automated test cases, it can cause trouble during a basic HE hosts update.
For example, in order to deactivate and activate the HE hosts for an update, the common scenario is to put all the hosts except the one running the HE VM into maintenance and re-activate them, and then maintenance and activate the last host. In this case, the issue occurs every time (unless we wait for the other hosts' score).

Comment 34 Marina Kalinin 2017-08-14 20:51:41 UTC
Denis, so far all the changes are in hosted-engine. Maybe I missed some discussion above, but how about changing the engine to prevent it from getting into this state in the first place, if the delay cannot be removed? For instance, maybe add a check in CanDoAction: if it is an HE cluster and the rest of the hosts are in maintenance, do not allow maintenance for this host. This would ensure that there is always an available host to run the HE VM and would prevent this bug as well.

Comment 37 Martin Sivák 2017-08-23 10:33:15 UTC
We reduced the time delay to the designed 10 seconds thanks to another fix. Can you please retest with 4.1.5? The issue should be much less visible now, and there is not much we can do about the rest. You should wait until the score stabilizes in your test cases.
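
For the automated test cases, a waiting loop along these lines (a sketch only; it reuses the client API shown in comment 5, and the score threshold and timeouts are arbitrary) should avoid the race by not requesting the next maintenance action until another host has actually published a usable score:

import time
from ovirt_hosted_engine_ha.client import client

def wait_for_standby_host(current_host_id, min_score=3000, timeout=300):
    """Block until some other HA host reports score >= min_score and is not
    in local maintenance, or raise after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        stats = client.HAClient().get_all_stats()
        for host_id, data in stats.items():
            if host_id in (0, current_host_id):  # key 0 is global metadata
                continue
            if (data.get('score', 0) >= min_score
                    and not data.get('maintenance', False)):
                return host_id, data
        time.sleep(10)
    raise RuntimeError('no other hosted-engine host became ready in time')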

Comment 38 Nikolai Sednev 2017-08-23 16:16:46 UTC
Created attachment 1317149 [details]
delay being screened

Comment 39 Nikolai Sednev 2017-08-23 16:17:32 UTC
(In reply to Martin Sivák from comment #37)
> We reduced the time delay to the designed 10 seconds thanks to another fix.
> Can you please retest with 4.1.5? The issue should be much less visible now,
> and there is not much we can do about the rest. You should wait until the
> score stabilizes in your test cases.

The delay is roughly 1 minute and 22 seconds.

The following reproduction steps were performed:
1) I set the second host into maintenance using the UI.
2) I activated the second host again and started a stopwatch.
3) Once the score returned to 3400 in the CLI and the UI, I stopped the stopwatch.

Please see the screencast in the attachment.

Comment 40 Nikolai Sednev 2017-09-06 16:40:32 UTC
Reproduced on ovirt-engine-setup-4.1.6.1-0.1.el7.noarch and ovirt-hosted-engine-setup-2.1.3.7-1.el7ev.noarch:
1. Set host A into maintenance.
2. Wait until the HE-VM has migrated to host B.
3. Activate host A.
4. Wait until host A becomes active in the UI.
5. Set host B into maintenance.

I still see the host getting stuck in preparing for maintenance.
Please see my reproduction here: https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88Ykpod0VVZi1qWlU/view?usp=sharing.
Setting back to assigned, as the reproduction was successful.

Comment 41 Red Hat Bugzilla Rules Engine 2017-09-06 16:40:41 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 42 Martin Sivák 2017-09-27 11:06:58 UTC
We improved what we could and there is nothing more we can do at the moment. Please wait a bit longer before putting the other host into maintenance.

