Description of problem:

It is possible to get stuck in "Preparing for maintenance" when the following is performed using only 2 hosted engine hosts (A, B). Let's assume the HE VM runs on host B:

1. Put A into maintenance
2. Activate A
3. As soon as A is Up (be fast, the HE score must still be 0), put host B into maintenance mode

The other scenario that can be used after this is hit for the first time is just:

1. VM is running on B, but B is already in LocalMaintenance state (and probably Preparing for maintenance in the engine)
2. Activate B
3. Put B back into maintenance again

Version-Release number of selected component (if applicable):
All components from current 4.0.6 snapshots as of 29th of Nov 2016

How reproducible:
Always, but you need to be somewhat fast.

Actual results:
Stuck in Preparing for maintenance.

Expected results:
No maintenance mode attempted and the user informed.

Additional info:
This is caused by two transition sequences in the hosted engine agent.

The first flow: the state machine went from EngineMigratingAway to ReinitializeFSM, meaning something weird happened (13:42:02). VDSM finished the migration at 13:43:44. It needs to be checked what was in the new_data.migration_result variable when EngineMigratingAway failed.

The second case is:

Host B activated
16:55:45,307 LocalMaintenance-ReinitializeFSM
16:56:10,111 ReinitializeFSM-EngineStarting

The agent needs a couple of seconds to realize the engine is properly Up. But it won't make it, because the maintenance mode is set again:

16:56:36,855 EngineStarting-LocalMaintenance

This can be prevented by an engine patch: https://gerrit.ovirt.org/#/c/67300/
Hi Martin,

I believe it would be better to check the ha-agent status from the engine and accept the maintenance operation only when the host is in the EngineUp state, and fail the maintenance operation in all other cases with an appropriate message from the engine side (of course, I am talking only about cases when the HE VM runs on the host).

What do you think?
That it is a bad idea :) The state names and possible transitions are internal knowledge of the agent and can (and in fact do) change. We do not export the state name in machine-readable format for a good reason.

The score paired with the local maintenance flag is enough to filter most (if not all) of these issues out.
(In reply to Martin Sivák from comment #2)
> That it is a bad idea :) The state names and possible transitions are
> internal knowledge of the agent and can (and in fact do) change. We do not
> export the state name in machine readable format for a good reason.
>
> The score paired with local maintenance flag is enough to filter most (if
> not all) of the issues out.

IMHO we should not move any hosted engine host with the HE VM running on it into maintenance unless there is at least one hosted engine host with a positive HA score available; otherwise customers will get stuck with hosts in "Preparing for maintenance" and we will get tons of questions regarding these corner cases.
> IMHO we should consider not moving any hosted-engine-hosts in to the
> maintenance with running HE-VM on them, unless there is at least one
> hosted-engine-host with positive ha score available

And that is exactly what https://gerrit.ovirt.org/#/c/67300/ is doing.
What do you mean by machine-readable format? We have the hosted engine client API that gives you the possibility to get all the information you need (I believe in the same way you pass the host score to the engine via VDSM):

print he_client.get_all_stats()
{0: {'maintenance': False},
 1: {'live-data': True,
     'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1162589 (Thu Dec 1 09:12:05 2016)\nhost-id=1\nscore=0\nmaintenance=True\nstate=LocalMaintenance\nstopped=False\n',
     'hostname': 'puma23.scl.lab.tlv.redhat.com', 'host-id': 1,
     'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
     'score': 0, 'stopped': False, 'maintenance': True,
     'crc32': 'b8623ef2', 'host-ts': 1162589},
 2: {'live-data': True,
     'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=590695 (Thu Dec 1 09:11:50 2016)\nhost-id=2\nscore=3400\nmaintenance=False\nstate=EngineUp\nstopped=False\n',
     'hostname': 'puma26.scl.lab.tlv.redhat.com', 'host-id': 2,
     'engine-status': '{"health": "good", "vm": "up", "detail": "up"}',
     'score': 3400, 'stopped': False, 'maintenance': False,
     'crc32': '90aab4bd', 'host-ts': 590695},
 3: {'live-data': True,
     'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1162571 (Thu Dec 1 09:12:05 2016)\nhost-id=3\nscore=0\nmaintenance=True\nstate=LocalMaintenance\nstopped=False\n',
     'hostname': 'puma27.scl.lab.tlv.redhat.com', 'host-id': 3,
     'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
     'score': 0, 'stopped': False, 'maintenance': True,
     'crc32': 'ca6ca357', 'host-ts': 1162571}}

Also, when a user has a host stuck in the "Preparing for maintenance" state, how is he supposed to know that he needs to check the ovirt-ha-agent log, given that the engine log lacks this information? To me it does not look user friendly at all.
Artyom: the dict is exactly what I am talking about. The state machine's current state does not have its own key in there. It is only reported in string form inside the 'extra' key together with other data. We _only_ put it there for debugging purposes (internally it is not even guaranteed to be 100% correct).

> 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\n
> timestamp=1162571 (Thu Dec 1 09:12:05 2016)\nhost-id=3\n
> score=0\nmaintenance=True\nstate=LocalMaintenance\nstopped=False\n'

We won't ever use the state name in the engine. Period. We would have to maintain the logic in two places if we did, and we really, really do not want that.
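For illustration only, the "score paired with the local maintenance flag" filtering suggested above could be sketched against get_all_stats()-style data like this (a minimal sketch, not engine code; the helper name and the trimmed dict shape are assumptions for the example):

```python
# Sketch: decide from get_all_stats()-shaped data whether some other host
# could take over the HE VM, using only the score and maintenance fields
# and never the internal state name. other_host_can_take_he_vm is a
# hypothetical helper name, not an existing API.

def other_host_can_take_he_vm(stats, local_host_id):
    """True if at least one other host has a positive score and is not
    in local maintenance."""
    for host_id, data in stats.items():
        if host_id == 0 or host_id == local_host_id:
            # key 0 carries only global metadata; skip the local host too
            continue
        if data.get('maintenance'):
            continue  # host is explicitly in local maintenance
        if data.get('score', 0) > 0:
            return True
    return False

# Example shaped like the dump above, trimmed to the relevant keys:
stats = {
    0: {'maintenance': False},
    1: {'score': 0, 'maintenance': True},     # LocalMaintenance
    2: {'score': 3400, 'maintenance': False}, # EngineUp
    3: {'score': 0, 'maintenance': True},     # LocalMaintenance
}
```

With this data, maintenance would be allowed for host 1 or 3 (host 2 can take the VM) but refused for host 2 (no other host has a positive score).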
*** Bug 1362618 has been marked as a duplicate of this bug. ***
There are multiple patches merged that should prevent getting into situations like this both in the engine and in the hosted engine tools.
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
oVirt 4.1.0 GA has been released, re-targeting to 4.1.1. Please check if this issue is correctly targeted or already included in 4.1.0.
As far as I understand, the patches are already merged to the required branches.
Still being reproduced.

1) Have at least two hosts and put the one with the HE VM into maintenance.
2) The HE VM migrates to another host.
3) Activate the host back from maintenance.
4) Once the host becomes active, set the host with the HE VM to maintenance (you will have to be fast enough).

The host with the HE VM will get stuck in Preparing for maintenance.

Components on hosts:
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
rhevm-appliance-20160721.0-2.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.19.6-1.el7ev.x86_64
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.3-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.1.2-0.1.el7.noarch
Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017
Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Target release should be set once a package build is known to fix an issue. Since this bug is not in the MODIFIED state, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Denis - the bug was re-opened - are you looking at it?
Nikolai, attach logs please.
I think this issue might be related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1411319
https://bugzilla.redhat.com/show_bug.cgi?id=1419326

Reproduced based on:

"Lets assume the HE VM runs on host B (alma03)
Put A (alma04) into maintenance
Activate A (alma04)
As soon as A (alma04) is Up (be fast, the HE score must still be 0) put host B (alma03) to maintenance mode

The other scenario that can be used after this is hit for the first time is just:
VM is running on B (alma03), but B (alma03) is already in LocalMaintenance state (and probably Preparing for maintenance in the engine)
Activate B (alma03)
Put B (alma03) back to maintenance again"

For me the second reproduction scenario worked. Please see the attached screencast.
Created attachment 1262193 [details] sosreport-nsednev-he-1.qa.lab.tlv.redhat.com-20170312103030.tar.xz
alma04: https://drive.google.com/open?id=0B85BEaDBcF88SklvOTI1QjJ6aGc alma03: https://drive.google.com/open?id=0B85BEaDBcF88a25DbzZ2M0lsZDg
Screen cast: https://drive.google.com/open?id=0B85BEaDBcF88UzZJazZKUU52U2s
I'm not able to reproduce it. While the HE score is 0 or host A is still in local maintenance (from the HE point of view), the engine does not allow me to put host B into maintenance mode.

Could you please check if it is still reproducible on your side?
(In reply to Denis Chaplygin from comment #20)
> I'm not able to reproduce it. While HE score is 0 or host A it still in
> local maintenance (from HE point of view), engine does not allows me to put
> host B into maintenance mode.
>
> Could you please check if it is still reproducible on your side?

Further to our conversation, I've reproduced the scenario on your environment, which was an upstream one by the way, while my original environment was a downstream one.
I investigated the issue and realized that there is not much that can be done. We are trying to synchronize two different applications, each living on its own schedule, and unfortunately those schedules are quite different.

At the moment the hosted engine updates its status with about a 30-second delay, and status change propagation from HE to the engine takes about 15 seconds. So, in the worst case, the engine will see the correct state of the HE agent after 75 seconds (30 seconds to report the transition out of the LocalMaintenance state, 30 seconds to report the new score, and a 15-second delay on the engine side).

Therefore, when you try to immediately use a host that has just returned from maintenance, the engine is not able to make a correct decision and tries to operate the hosted engine while it is still in an incorrect state.

There is no way to fix that behaviour, because our system is not a realtime system and we have a lot of delays in it (as mentioned above, waiting for more than 75 seconds between actions should be safe). The only thing we can do is try to decrease the hosted engine monitoring cycle time and, therefore, decrease the total time required to synchronize the engine and HE state.
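The worst-case figure above is just the sum of the three delays described in the comment (the constant names below are labels invented for this sketch; the numbers come from the comment itself):

```python
# Worst-case state-propagation delay: the engine only sees the agent's
# real state after both agent reporting cycles plus its own refresh
# delay have elapsed.
MAINTENANCE_EXIT_REPORT = 30  # agent reports leaving LocalMaintenance
NEW_SCORE_REPORT = 30         # agent reports the recomputed score
ENGINE_REFRESH = 15           # engine-side propagation delay

def worst_case_delay():
    """Seconds before the engine is guaranteed to see the agent's state."""
    return MAINTENANCE_EXIT_REPORT + NEW_SCORE_REPORT + ENGINE_REFRESH
```

Any action taken sooner than worst_case_delay() seconds after reactivating a host may race against stale state on the engine side.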
Please provide your input further to comment #22.
It sounds like an edge case, so the solution sounds OK to me.
Hi,

Since we are doing automation and this scenario of changing host status repeats many times in our regression runs, we hit this side effect in almost every run. Maybe the proper way to solve it is to change the status only after checking in the background that the current status is the correct one. What do you think?
This issue is more than just annoying. Besides completely blocking some of our automation test cases, it can cause trouble during a basic HE host update. For example, in order to deactivate and activate the HE hosts for an update, the common scenario would be to put all the hosts except the one running the HE VM into maintenance and re-activate them, then put the last host into maintenance and activate it. In this case the issue occurs every time (unless we wait for the other hosts' score).
Denis, so far all the changes are in hosted-engine. Maybe I missed some discussion above, but how about changing the engine to prevent it from getting into this state in the first place, if it is impossible to remove the delay? For instance, add a check in CanDoAction: if it is an HE cluster and the rest of the hosts are in maintenance, do not allow maintenance for this host. This would ensure that there is always a host available to run the HE VM and prevent this bug as well.
We reduced the time delay to the designed 10 seconds thanks to another fix. Can you please retest with 4.1.5? It should be much less visible now, and there is not much we can do about the rest. You should wait until the score stabilizes in your test cases.
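For automated tests, "wait until the score stabilizes" could be implemented as a simple poll along these lines (a sketch; get_score stands in for whatever your test framework uses to read the host's HE score, e.g. parsed out of get_all_stats(), and 3400 is the full score seen in the dumps above):

```python
import time

def wait_for_stable_score(get_score, expected=3400, timeout=120, interval=5):
    """Poll until the host's HE score reaches `expected`.

    get_score: zero-argument callable returning the current score.
    Returns True once the score is reached, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_score() >= expected:
            return True
        time.sleep(interval)
    return False
```

Calling this between "activate host A" and "put host B into maintenance" would avoid racing the propagation delay discussed above.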
Created attachment 1317149 [details] delay being screened
(In reply to Martin Sivák from comment #37)
> We reduced the time delay to the designed 10 seconds thanks to other fix.
> Can you please retest with 4.1.5? It should be much less visible now and
> there is not much we can do about the rest. You should wait before the score
> stabilizes in your test cases.

The delay is roughly 1 minute and 22 seconds. The following steps were performed:
1) I set the second host into maintenance using the UI.
2) I activated the second host again and started a stopwatch.
3) Once the score returned to 3400 in the CLI and UI, I stopped the stopwatch.

Please see the screencast in the attachment.
Reproduced on ovirt-engine-setup-4.1.6.1-0.1.el7.noarch and ovirt-hosted-engine-setup-2.1.3.7-1.el7ev.noarch:
1. Set host A into maintenance.
2. Wait until the HE VM has migrated to host B.
3. Activate host A.
4. Wait until host A becomes active in the UI.
5. Set host B into maintenance.

I still see the host getting stuck in Preparing for maintenance. Please observe my reproduction in the attachment: https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88Ykpod0VVZi1qWlU/view?usp=sharing

Setting back to ASSIGNED, as the reproduction was successful.
We improved what we could and there is nothing more we can do at the moment. Please wait a bit longer before putting the other host into maintenance.