Bug 1411319
Summary: After a successful migration of the HE VM from host A to host B, immediately migrating back to host A fails with "Operation Canceled" from the RHEV Manager portal.

Product: Red Hat Enterprise Virtualization Manager
Reporter: Ameya Charekar <achareka>
Component: ovirt-hosted-engine-ha
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED ERRATA
QA Contact: Nikolai Sednev <nsednev>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.0.5
CC: achareka, akrejcir, bmcclain, dchaplyg, dfediuck, lsurette, mavital, melewis, mkalinin, msivak, nsednev, ykaul
Target Milestone: ovirt-4.1.2
Keywords: Triaged
Target Release: ---
Flags: nsednev: testing_plan_complete+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the results of a successful self-hosted engine migration were lost and the self-hosted engine agent reported that the migration had failed. This was due to a race condition in the interface between VDSM and the self-hosted engine agent. Now, the absence of the virtual machine is treated as a sign of a successful migration, so the self-hosted engine agent correctly reports 'EngineDown' on migration completion.
Story Points: ---
Clone Of:
: 1436002 (view as bug list)
Environment:
Last Closed: 2017-05-25 06:46:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1264085, 1419326
Bug Blocks: 1436002
Attachments:
Description: Ameya Charekar, 2017-01-09 13:06:57 UTC
Please attach the output of hosted-engine --vm-status and check what the host information in webadmin says (the hosted engine score, for example).

OK, I see something weird in the logs: the state goes from EngineMigratingAway directly to ReinitializeFSM. That means the migration returned something unexpected, and it goes downhill from there (EngineStarting, then EngineUnexpectedlyDown when the VM finally migrates away). Can you please apply https://gerrit.ovirt.org/#/c/67540/ (it adds better logging) and reproduce the issue? It is possible something changed slightly in the VDSM protocol for migrations.

Works for me on these components on host:
rhvm-appliance-4.1.20170119.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-doc-4.1.0-1.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.0.1-0.1.el7.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-branding-rhev-4.1.0-0.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

During migrations from A to B and then B to A, host A drops its HA score to zero without any reason. After several migration attempts, I got an error message about a failed migration:

"Operation Canceled
Error while executing action: HostedEngine: Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:
The host puma19.scl.lab.tlv.redhat.com did not satisfy internal filter HA because it is not a Hosted Engine host.
The host puma18.scl.lab.tlv.redhat.com did not satisfy internal filter Memory because its available memory is too low (15167.000000 MB) to run the VM."

Components on hosts:
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
rhevm-appliance-20160721.0-2.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.19.6-1.el7ev.x86_64
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.3-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.1.2-0.1.el7.noarch
Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017
Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

(In reply to Nikolai Sednev from comment #17)
> During migrations, from a to b, and then b to a, host a drops its HA score to zero, without any reason.

There must be a reason; will you attach the logs, please?

Mar 6, 2017 5:44:13 PM: Migration failed (VM: HostedEngine, Source: alma04.qa.lab.tlv.redhat.com). Logs are being attached.

Created attachment 1260445 [details]
sosreport-alma03.qa.lab.tlv.redhat.com-20170306174542.tar.xz
Created attachment 1260447 [details]
sosreport-alma04.qa.lab.tlv.redhat.com-20170306174536.tar.xz
Created attachment 1260448 [details]
sosreport-nsednev-he-1.qa.lab.tlv.redhat.com-20170306174523.tar.xz
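The Doc Text field above describes the fix: the agent now treats the absence of the VM on the source host as a sign of a successful migration instead of an unexpected error. A minimal, hypothetical Python sketch of that decision follows; the state names come from the logs in this bug, but the function and the status dictionary layout are invented for illustration and are not the actual ovirt-hosted-engine-ha code.

```python
# Hypothetical sketch of the fixed migration-monitor decision: when the local
# VDSM no longer reports the HostedEngine VM, treat that as migration success.

def next_state(vm_status):
    """Return the agent's next FSM state given VDSM's view of the local VM.

    vm_status is None when the VM is no longer known to the local VDSM,
    which is the race that previously pushed the FSM into ReinitializeFSM.
    """
    if vm_status is None:
        # VM vanished locally: assume the migration completed on the peer.
        return "EngineDown"
    if vm_status.get("status") == "Migration Source":
        return "EngineMigratingAway"   # still migrating, keep monitoring
    if vm_status.get("status") == "Up":
        return "EngineUp"              # migration aborted or never started
    return "ReinitializeFSM"           # genuinely unexpected answer

# The buggy behavior mapped the None case to ReinitializeFSM as well, which
# cascaded into EngineStarting/EngineUnexpectedlyDown and a score of 0.
```

With this change, the successful-migration path matches the log line "VM not found, assuming that migration is complete" followed by the EngineMigratingAway-EngineDown transition.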
Your FailedQA report talks about pumaX hosts, but the logs are from almaX hosts. Anyway:

alma03 was the source and seems to be OK:

MainThread::INFO::2017-03-06 17:44:21,964 HostedEngine::(_monitor_migration) VM not found, assuming that migration is complete
MainThread::INFO::2017-03-06 17:44:21,966 HostedEngine::(consume) Migration to None complete, no longer monitoring vm
MainThread::INFO::2017-03-06 17:44:21,970 BrokerLink::(notify) Trying: notify time=1488815061.97 type=state_transition detail=EngineMigratingAway-EngineDown hostname='alma03.qa.lab.tlv.redhat.com'
MainThread::INFO::2017-03-06 17:44:22,073 BrokerLink::(notify) Success, was notification of state_transition (EngineMigratingAway-EngineDown) sent? sent
MainThread::INFO::2017-03-06 17:44:39,423 HostedEngine::(start_monitoring) Current state EngineDown (score: 3400)
MainThread::INFO::2017-03-06 17:44:39,423 HostedEngine::(start_monitoring) Best remote host alma04.qa.lab.tlv.redhat.com (id: 2, score: 3400)

alma04 received the VM and is running happily too:

MainThread::INFO::2017-03-06 17:44:05,071 HostedEngine::(consume) Engine vm is running on host alma03.qa.lab.tlv.redhat.com (id 1)
MainThread::INFO::2017-03-06 17:44:35,227 HostedEngine::(start_monitoring) Current state EngineStarting (score: 3400)
MainThread::INFO::2017-03-06 17:45:05,331 HostedEngine::(start_monitoring) Current state EngineUp (score: 3400)

The hosted engine logs do not correspond to what you are describing. I see the failure in the engine, but it would mean the time is not in sync (the engine is about 20 seconds behind the nodes) and you were trying another migration from alma04 back to alma03. It did not allow you to do that because the monitoring still had not registered alma03 as being ready (with a not-enough-memory reason):

2017-03-06 17:44:13,164+02 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (org.ovirt.thread.pool-6-thread-14) [4ebcedba-fa0f-47c5-9519-b4b48a5e78fb] Candidate host 'alma03.qa.lab.tlv.redhat.com' ('8c70eb98-b950-4d73-9782-38b1bdbc9afc') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'Memory' (correlation id: 4ebcedba-fa0f-47c5-9519-b4b48a5e78fb)
2017-03-06 17:44:13,209+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-14) [4ebcedba-fa0f-47c5-9519-b4b48a5e78fb] EVENT_ID: VM_MIGRATION_NO_VDS_TO_MIGRATE_TO(166), Correlation ID: 4ebcedba-fa0f-47c5-9519-b4b48a5e78fb, Job ID: dced6cbd-e253-4889-8be7-75b9bedc66ad, Call Stack: null, Custom Event ID: -1, Message: No available host was found to migrate VM HostedEngine to.

That is definitely by design and not a bug in this case. But please check where the logs are coming from, because you used different hostnames in comment #17.

(In reply to Martin Sivák from comment #23)

The hosts are different now, as I've reproduced on another pair of hosts; the almas are my new clean deployment, and the pumas are not available now. The HE VM should have a memory reservation; we verified the RFE for it here: https://bugzilla.redhat.com/show_bug.cgi?id=1264085. I've reproduced the exact steps from https://bugzilla.redhat.com/show_bug.cgi?id=1411319#c0, and the expected result is to be able to migrate the HE VM from A to B and then, without any delay, back to A without any issues.

Hosted engine memory reservation needs to be enabled (HeSparesCount needs to be set to something other than 0). But it won't help in this case, because the engine only refreshes stats once per 15 seconds, and the old-source/new-destination host looks occupied until the used memory is refreshed. Since I do not see any score reduction to 0 in your reproducer, I consider this bug to be still VERIFIED. The root cause of the original report (see Additional info from #c0 and the Doc Text field) is therefore still fixed. What you want (immediate update of available memory after VM migration/shutdown) is something different and can be tracked in a new bug (it will also belong to the virt team).
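The 15-second stats refresh discussed in this thread can be illustrated with a small sketch. All names and the dictionary layout below are invented for illustration and are not the engine's real classes; only the roughly 15-second polling interval and the 15167 MB figure come from this report.

```python
# Illustrative sketch of why an immediate migrate-back fails: the scheduler's
# Memory filter works from cached host statistics that are only refreshed
# roughly every 15 seconds, not at the instant a VM leaves a host.

STATS_REFRESH_SECONDS = 15  # assumed polling interval, per the comments above

def memory_filter(host, vm_mem_mb):
    """Keep the host only if its last-reported free memory fits the VM."""
    # host["free_mb"] is whatever the last stats poll reported; it is not
    # updated immediately when a VM migrates away from the host.
    return host["free_mb"] >= vm_mem_mb

# Right after the HE VM left host A, A's cached stats still count the VM's
# memory as used, so A is filtered out and the engine reports that no host
# satisfies the current scheduling constraints.
host_a = {"name": "host-a", "free_mb": 15167}  # stale value from the last poll
assert not memory_filter(host_a, vm_mem_mb=16384)

# After the next stats poll (up to ~15 s later), the same filter accepts A.
host_a["free_mb"] = 31000
assert memory_filter(host_a, vm_mem_mb=16384)
```

This is why retrying the migration after the next stats refresh succeeds even though nothing about the host actually changed in between.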
Host alma04 dropped its score to zero when I migrated the HE VM, as appears below:

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma04
Host ID                : 1
Engine status          : {"health": "good", "vm": "up", "detail": "up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 9f3ccded
local_conf_timestamp   : 19051
Host timestamp         : 19035
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=19035 (Tue Mar 7 18:12:34 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=19051 (Tue Mar 7 18:12:49 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma03
Host ID                : 2
Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 8b0ccf9f
local_conf_timestamp   : 3721
Host timestamp         : 3706
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3706 (Tue Mar 7 18:12:27 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=3721 (Tue Mar 7 18:12:43 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

You have new mail in /var/spool/mail/root
[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma04
Host ID                : 1
Engine status          : {"health": "good", "vm": "up", "detail": "migration source"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 597bd555
local_conf_timestamp   : 19204
Host timestamp         : 19187
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=19187 (Tue Mar 7 18:15:06 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=19204 (Tue Mar 7 18:15:23 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineMigratingAway
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma03
Host ID                : 2
Engine status          : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "migration destination"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : bfb734b4
local_conf_timestamp   : 3877
Host timestamp         : 3858
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3858 (Tue Mar 7 18:14:59 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=3877 (Tue Mar 7 18:15:18 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma04
Host ID                : 1
Engine status          : {"health": "good", "vm": "up", "detail": "migration source"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 597bd555
local_conf_timestamp   : 19204
Host timestamp         : 19187
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=19187 (Tue Mar 7 18:15:06 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=19204 (Tue Mar 7 18:15:23 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineMigratingAway
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma03
Host ID                : 2
Engine status          : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "migration destination"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 7baa7fec
local_conf_timestamp   : 3907
Host timestamp         : 3892
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3892 (Tue Mar 7 18:15:33 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=3907 (Tue Mar 7 18:15:49 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma04
Host ID                : 1
Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : bc5c26aa
local_conf_timestamp   : 19240
Host timestamp         : 19219
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=19219 (Tue Mar 7 18:15:38 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=19240 (Tue Mar 7 18:15:58 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma03
Host ID                : 2
Engine status          : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "migration destination"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 7baa7fec
local_conf_timestamp   : 3907
Host timestamp         : 3892
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3892 (Tue Mar 7 18:15:33 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=3907 (Tue Mar 7 18:15:49 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma04
Host ID                : 1
Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : bc5c26aa
local_conf_timestamp   : 19240
Host timestamp         : 19219
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=19219 (Tue Mar 7 18:15:38 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=19240 (Tue Mar 7 18:15:58 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma03
Host ID                : 2
Engine status          : {"health": "good", "vm": "up", "detail": "up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 189f5936
local_conf_timestamp   : 3938
Host timestamp         : 3922
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3922 (Tue Mar 7 18:16:04 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=3938 (Tue Mar 7 18:16:19 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma04
Host ID                : 1
Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                  : 0
stopped                : False
Local maintenance      : False
crc32                  : 80efaf72
local_conf_timestamp   : 19695
Host timestamp         : 19680
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=19680 (Tue Mar 7 18:23:18 2017)
    host-id=1
    score=0
    vm_conf_refresh_time=19695 (Tue Mar 7 18:23:33 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Thu Jan 1 07:32:26 1970

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : alma03
Host ID                : 2
Engine status          : {"health": "good", "vm": "up", "detail": "up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : a0469f44
local_conf_timestamp   : 4331
Host timestamp         : 4316
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=4316 (Tue Mar 7 18:22:37 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=4331 (Tue Mar 7 18:22:53 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

You have new mail in /var/spool/mail/root
[root@alma03 ~]#

Mar 7, 2017 6:15:44 PM: Migration failed (VM: HostedEngine, Source: alma03.qa.lab.tlv.redhat.com).

Attaching sosreports from both hosts:
alma03: https://drive.google.com/open?id=0B85BEaDBcF88bXEyTDhiM1VRcXc
alma04: https://drive.google.com/open?id=0B85BEaDBcF88UklpZS05elZaRFE

The target host got its score zeroed and the migration failed.

That is still something different: the original report says the source host reduced its score to 0 without any reason (the VM was still up).
Your test failed because the VM really crashed on the destination (vdsm.log):

2017-03-07 18:16:50,110+0200 ERROR (vm/a1abbc04) [virt.vm] (vmId='a1abbc04-00b1-4184-af5e-dc6ad3f184d1') The vm start process failed (vm:617)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 553, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2001, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3777, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: resource busy: Failed to acquire lock: error -243
2017-03-07 18:16:50,116+0200 INFO (vm/a1abbc04) [virt.vm] (vmId='a1abbc04-00b1-4184-af5e-dc6ad3f184d1') Changed state to Down: resource busy: Failed to acquire lock: error -243 (code=1) (vm:1202)

And hosted engine did exactly what it was supposed to do: it blacklisted the host. This is again exactly what should have happened in this case (a real VM failure).

Moreover, your setup is not behaving properly. Did you notice all the sanlock and vmstat errors in the log? Is your storage OK?

2017-03-07 17:01:07,490+0200 ERROR (tasks/7) [storage.TaskManager.Task] (Task='77b6247b-f5a3-434a-93c5-b5a77f426fcc') Unexpected error AcquireHostIdFailure: Cannot acquire host id: (u'5313e5df-d775-4d04-b153-6bc4e0b15e07', SanlockException(115, 'Sanlock lockspace add failure', 'Operation now in progress'))
2017-03-07 18:15:33,643+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in 0.00 seconds (__init__:515)

If we're after "source host reduced its score to 0 without any reason", then it is here: https://bugzilla.redhat.com/show_bug.cgi?id=1419326, and then this bug is a duplicate of it.
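The blacklisting described here (a host's score dropping to 0 after a real VM failure, with a recovery timeout, as seen in the EngineUnexpectedlyDown status output) can be sketched as follows. The penalty constant, function names, and dictionary layout are hypothetical; only the EngineUnexpectedlyDown state name, the score value of 3400, and the score-to-zero behavior come from this bug.

```python
# Hedged, simplified sketch of the score penalty: when the VM dies unexpectedly
# on a host, the agent enters EngineUnexpectedlyDown and zeroes its score for a
# penalty window, so the scheduler avoids that host until the timeout expires.

PENALTY_SECONDS = 600  # illustrative; the real timeout is agent configuration

def on_vm_unexpectedly_down(host, now):
    """Record the failure and start the penalty window."""
    host["state"] = "EngineUnexpectedlyDown"
    host["score"] = 0
    host["penalty_until"] = now + PENALTY_SECONDS
    return host

def effective_score(host, now):
    # Score stays 0 until the penalty expires; then normal scoring resumes.
    return 0 if now < host.get("penalty_until", 0) else 3400

host = {"name": "alma04", "score": 3400}
on_vm_unexpectedly_down(host, now=100)
assert effective_score(host, now=200) == 0                    # still blacklisted
assert effective_score(host, now=100 + PENALTY_SECONDS) == 3400  # penalty expired
```

This is why a host that genuinely lost the VM shows "Score : 0" and a timeout field in hosted-engine --vm-status, while a host that merely completed a migration keeps its full score.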
This very bug is about: 1. Migrate HE vm from host a to host b. 2. After successful migration again immediately migrate to host a. Actual results: immediate migration fails to host a. Expected results: Migration should be successful as we are manually migrating HE vm and there is no issue with HE host a. I've followed the exact reproduction steps and failed, e.g. I've migrated from a to b and then immediately back to host a and failed. What you're talking about is "Additional info" "After migration Host a's Score is 0 and hence engine does not allow migration with error message: The host a did not satisfy internal filter HA because it is not a Hosted Engine host.." In my flow also host's "a" (alma04) score is zero as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1411319#c26 e.g. [root@alma03 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04 Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : 80efaf72 local_conf_timestamp : 19695 Host timestamp : 19680 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=19680 (Tue Mar 7 18:23:18 2017) host-id=1 score=0 vm_conf_refresh_time=19695 (Tue Mar 7 18:23:33 2017) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Thu Jan 1 07:32:26 1970 --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03 Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : a0469f44 local_conf_timestamp : 4331 Host timestamp : 4316 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=4316 (Tue Mar 7 18:22:37 2017) host-id=2 score=3400 vm_conf_refresh_time=4331 (Tue Mar 7 18:22:53 
2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False You have new mail in /var/spool/mail/root [root@alma03 ~]# Sanlock logs irrelevant as these were fixed and all hosts were fine with storage. This environment initially got restored from backup and met with https://bugzilla.redhat.com/show_bug.cgi?id=1235200#c21 and also with https://bugzilla.redhat.com/show_bug.cgi?id=1417518#c9 So at the time I've tried to verify this bug, all sanlock issues were already fixed. No, the original report is about source host (host A or alma03 in your case from #26) > Migration failed (VM: HostedEngine, Source: alma03.qa.lab.tlv.redhat.com). reducing its score to 0, because it didn't properly detect the finished migration (the VM disappeared, but was happily running on the destination). In your case from comment #26 the VM really crashed (was no longer running anywhere) on the destination (host _B_) and alma04 reduced its score to 0 as it was supposed to. That is something completely different. Also your system was reporting weird errors. Just because something looks similar does not mean it is the same bug. Reproduce the exact situation or move this back to VERIFIED (and possibly open a new bug). This is original: " 1. Migrate HE vm from host a to host b. 2. After successful migration again immediately migrate to host a." First source is "a", then HE-VM was migrated successfully to its destination "b". Second source is "b" and destination "a". This is my case: First source was alma04, destination was alma03 and HE-VM successfully migrated it's first migration. Second immediate migration was from alma03 and it's destination was alma04 and migration had met with failure and alma04 had zero score. My case is equal to original in HE-VM migration end result - failure. My environment was restored from backup as I already explained and also that issue was fixed, so I don't see how errors in logs influence migration flow. 
The final result is the same: the HE VM can't be immediately migrated between a pair of hosted-engine hosts right after the first migration. Further to your input, I'll try to reproduce this bug a fourth time and will open another bug to fix the new root cause.

Created attachment 1261471 [details]
screencast-2017-03-09_10.58.20.mkv
Created attachment 1261473 [details]
sosreport-puma18.scl.lab.tlv.redhat.com-20170309110039.tar.xz
Created attachment 1261474 [details]
sosreport-puma19.scl.lab.tlv.redhat.com-20170309110051.tar.xz
Created attachment 1261476 [details]
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170309110107.tar.xz
Fully reproduced on a clean and fresh environment with these components on the hosts:
rhvm-appliance-4.1.20170221.0-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-host-deploy-1.6.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.4-1.el7ev.noarch
vdsm-4.19.7-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-4.1.1.3-0.1.el7.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch

Please see the attached logs from both hosts and the engine, and the screencast from the reproduction. I'm also reopening https://bugzilla.redhat.com/show_bug.cgi?id=1419326; please consider which of these two bugs should be closed.

> Second immediate migration was from alma03 and it's destination was alma04
> and migration had met with failure and alma04 had zero score.

Exactly, the destination failed in that case. Not the source host.
That means it was a different behaviour, and a correct one from the HE perspective (there might be an issue with sanlock or vdsm, but the score of 0 was correct in this case).

> Further to your input I'll try to reproduce this bug for a fourth time and
> will open another bug to fix the new root cause.

The last reproducer finally seems to be the proper one (the _source_ host reducing its score to 0). We can't solve 100% of it until we have event support in vdscli, but I guess we can improve the workarounds. This only happens when the migration + destroy is faster than the current HE refresh period (because we can't detect the migration properly).

> I'm also reopening https://bugzilla.redhat.com/show_bug.cgi?id=1419326; please
> consider which of these two bugs should be closed.

The product is different on those: this one is RHV, the other one is oVirt.

Andrej improved the detection and added a workaround for the case when the migration is really fast (less than 10 seconds).

Moving this bug to verified, as there is no more zero-score issue and HA is working better now, on these host components:
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.10-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
ovirt-host-deploy-1.6.3-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.8.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-4.1.1.6-0.1.el7.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

I've tested this bug with the HE deployed on an NFS storage domain and a pair of NFS data storage domains. Works for me on rhevm-4.1.1.6-0.1.el7.noarch.
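For readers following the thread: the race condition discussed above (the source host's HA agent missing a fast migration because VDSM destroys the source VM before the agent's next status poll) and the fix can be sketched roughly as follows. This is an illustrative model only, not the actual ovirt-hosted-engine-ha code; the function names and signatures are invented here, while the state names (EngineMigratingAway, ReinitializeFSM, EngineDown, EngineUp) are the ones visible in the agent logs.

```python
# Illustrative sketch of the behaviour change, NOT real ovirt-hosted-engine-ha
# code. Each function models one agent refresh cycle deciding the next FSM
# state from what VDSM reports about the HE VM on the source host.

def next_state_old(vm_found, migration_in_progress):
    """Old behaviour: if the VM vanished while the agent believed it was
    migrating away, the agent saw an unexpected result and reinitialized
    its state machine (eventually EngineUnexpectedlyDown and score 0)."""
    if migration_in_progress:
        return "EngineMigratingAway"
    if not vm_found:
        # Fast migration: VDSM destroyed the source VM between two agent
        # polls, so the "migration finished" status was never observed.
        return "ReinitializeFSM"
    return "EngineUp"


def next_state_fixed(vm_found, migration_in_progress):
    """Fixed behaviour: the absence of the VM while migrating away is
    treated as a sign of a successful migration, so the agent moves to
    EngineDown and keeps its score."""
    if migration_in_progress:
        return "EngineMigratingAway"
    if not vm_found:
        return "EngineDown"
    return "EngineUp"
```

With this change, a migration plus destroy that completes within a single refresh period (under roughly 10 seconds, per the comment above) no longer drops the source host's score to 0, so the immediate migration back to it is not filtered out.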