Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1448699

Summary: Migration of the HE VM via engine will drop destination host to the status 'EngineUnexpectedlyDown'.
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Nikolai Sednev <nsednev>
Component: Agent
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: high
Docs Contact:
Priority: high
Version: 2.1.0.6
CC: akrejcir, alukiano, bugs, dfediuck, mavital, mgoldboi, msivak, mzamazal, nsednev, stirabos
Target Milestone: ovirt-4.1.3
Keywords: Regression, Triaged
Target Release: ---
Flags: rule-engine: ovirt-4.1+
       rule-engine: blocker+
       mgoldboi: planning_ack+
       msivak: devel_ack+
       mavital: testing_ack+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-06 13:31:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1455605    
Attachments:
Description Flags
screencast-2017-05-07_14.28.31.mkv
none
screencast-2017-05-07_14.43.18.mkv
none
sosreport-alma03-20170507144128.tar.xz
none
sosreport-alma04-20170507144043.tar.xz
none
sosreport-engine-20170507074037.tar.xz
none
ha not failing to converge
none
engine logs
none
engine shown as down, although running on alma03
none

Description Nikolai Sednev 2017-05-07 11:38:25 UTC
Description of problem:
Migration of the HE VM via engine will drop destination host to the status 'EngineUnexpectedlyDown'.

Version-Release number of selected component (if applicable):

Hosts:
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
libvirt-client-2.0.0-10.el7_3.5.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
vdsm-4.19.12-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch
ovirt-host-deploy-1.6.4-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.6-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
Linux version 3.10.0-514.21.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Apr 22 02:41:35 EDT 2017
Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Sat Apr 22 02:41:35 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.2-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-4.1.2-0.1.el7.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)


How reproducible:
100%

Steps to Reproduce:
1. Deploy a clean HE over an NFS storage domain on a couple of hosts and add at least one NFS data storage domain.
2. Start migrating the HE VM using the Migrate button in the UI.
3. After several migrations you will see via the CLI that the destination host gets a score of zero and the status 'EngineUnexpectedlyDown'.
4. After some time both hosts are shown in the UI as "Unavailable due to HA score" and the HE VM is shown as "Down".

Actual results:
The migration of the HE VM gets stuck and the environment stops working normally.

Expected results:
Migration should work normally and the destination host's score should not drop to zero.

Additional info:
sosreports and screencasts are attached.

Comment 1 Nikolai Sednev 2017-05-07 11:41:34 UTC
alma04 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 3f09400a
local_conf_timestamp               : 17345
Host timestamp                     : 17330
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=17330 (Sun May  7 14:27:37 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=17345 (Sun May  7 14:27:52 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "migration source"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 918fd52d
local_conf_timestamp               : 17339
Host timestamp                     : 17318
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=17318 (Sun May  7 14:27:36 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=17339 (Sun May  7 14:27:56 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

Comment 2 Nikolai Sednev 2017-05-07 11:42:17 UTC
Created attachment 1276843 [details]
screencast-2017-05-07_14.28.31.mkv

Comment 3 Nikolai Sednev 2017-05-07 11:44:03 UTC
Created attachment 1276844 [details]
screencast-2017-05-07_14.43.18.mkv

Comment 4 Nikolai Sednev 2017-05-07 11:48:44 UTC
Created attachment 1276845 [details]
sosreport-alma03-20170507144128.tar.xz

Comment 5 Nikolai Sednev 2017-05-07 11:49:42 UTC
Created attachment 1276846 [details]
sosreport-alma04-20170507144043.tar.xz

Comment 6 Nikolai Sednev 2017-05-07 11:50:46 UTC
Created attachment 1276847 [details]
sosreport-engine-20170507074037.tar.xz

Comment 7 Nikolai Sednev 2017-05-07 11:53:54 UTC
Engine was deployed using rhvm-appliance-4.1.20170425.0-1.el7.noarch.

Comment 8 Artyom 2017-05-07 12:57:09 UTC
I can see that in the VDSM log on alma03 we have a traceback:
2017-05-07 14:23:24,187+0300 ERROR (vm/536f534d) [virt.vm] (vmId='536f534d-ab11-4021-a980-11ac558c2991') The vm start process failed (vm:632)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 563, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2021, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3777, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: resource busy: Failed to acquire lock: error -243

Can it be that, because of the fsm.NOWAIT we added in the patch https://gerrit.ovirt.org/#/c/74870/1/ovirt_hosted_engine_ha/agent/states.py, the ha-agent starts the HE VM before it is destroyed on the source host?

Comment 9 Martin Sivák 2017-05-10 11:12:27 UTC
The host that lost the race for starting the VM (lock was already taken by someone else) should not be penalized. The detection of this case probably failed or is no longer accurate.
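The comments above can be illustrated with a minimal sketch (hypothetical names, not the actual ovirt-hosted-engine-ha code): classify a failed VM start whose cause is the sanlock lock-contention error seen in the traceback ("Failed to acquire lock: error -243") as a lost race, so that the losing host is not penalized with a zero score.

```python
# Hypothetical sketch of the classification Martin describes: a start failure
# caused only by sanlock lease contention means another host already owns the
# VM, so no score penalty is due. The function names and the penalty value
# are illustrative assumptions, not taken from the real agent code.

SANLOCK_LEASE_BUSY = -243  # the sanlock error code seen in the traceback above


def is_lost_start_race(error_message):
    """Return True when the start failure is only lock contention."""
    return ("Failed to acquire lock" in error_message
            and str(SANLOCK_LEASE_BUSY) in error_message)


def score_penalty_for_start_failure(error_message, full_penalty=1600):
    # Lost race: the VM is running elsewhere, which is the desired outcome.
    return 0 if is_lost_start_race(error_message) else full_penalty
```

Under this sketch, the libvirtError text from comment 8 would yield a zero penalty, while a genuine start failure would still be penalized.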

Comment 10 Yaniv Kaul 2017-05-11 12:16:00 UTC
Based on the specific comment:
"3.After several migrations ..."

I'm reducing the severity, removing the blocker flag and moving to 4.1.3

Comment 11 Red Hat Bugzilla Rules Engine 2017-05-11 12:16:06 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 12 Nikolai Sednev 2017-05-11 12:20:28 UTC
(In reply to Yaniv Kaul from comment #10)
> Based on the specific comment:
> "3.After several migrations ..."
> 
> I'm reducing the severity, removing the blocker flag and moving to 4.1.3
Just to clear things up a bit: by writing "several", I mean that there were at least 2-4 migrations.

Comment 13 Nikolai Sednev 2017-06-12 11:00:09 UTC
This was not properly fixed.
I still see that, although the destination host no longer gets a score of 0, the migration still fails and the HE VM is shown as down, although it is actually up, and the hosts' HA agents are failing to converge.
All relevant logs and screencasts are attached.
I've tested on these components on hosts:
mom-0.5.9-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.9.x86_64
ovirt-hosted-engine-ha-2.1.1-1.el7ev.noarch
ovirt-host-deploy-1.6.5-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64
ovirt-hosted-engine-setup-2.1.2-2.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-setup-lib-1.1.1-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.17-1.el7ev.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
Linux version 3.10.0-514.21.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Apr 22 02:41:35 EDT 2017
Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Sat Apr 22 02:41:35 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-4.1.3.1-0.1.el7.noarch
rhevm-setup-plugins-4.1.2-1.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-doc-4.1.3-1.el7ev.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 14 Red Hat Bugzilla Rules Engine 2017-06-12 11:00:16 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 15 Nikolai Sednev 2017-06-12 11:02:03 UTC
Created attachment 1287026 [details]
ha not failing to converge

Comment 16 Nikolai Sednev 2017-06-12 11:03:12 UTC
Created attachment 1287028 [details]
engine logs

Comment 17 Nikolai Sednev 2017-06-12 11:08:19 UTC
Another screencast of what happened during migrations of the engine from alma03 to alma04 and then back to alma03.
https://drive.google.com/open?id=0B85BEaDBcF88b09UUVdNb21pQkE

Sosreport from alma03
https://drive.google.com/open?id=0B85BEaDBcF88TWdaWmtucC1LVW8

Sosreport from alma04
https://drive.google.com/open?id=0B85BEaDBcF88VHMtcGEwYVl4Wlk

Comment 18 Nikolai Sednev 2017-06-12 11:13:30 UTC
alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 3206b7d4
local_conf_timestamp               : 98438
Host timestamp                     : 98421
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=98421 (Mon Jun 12 13:55:59 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=98438 (Mon Jun 12 13:56:15 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Fri Jan  2 05:24:27 1970


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : c2b51f9b
local_conf_timestamp               : 68794
Host timestamp                     : 68778
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=68778 (Mon Jun 12 13:56:03 2017)
        host-id=2
        score=0
        vm_conf_refresh_time=68794 (Mon Jun 12 13:56:19 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 21:14:41 1970
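The 1970 dates on the `timeout=` lines above suggest those values are host-uptime seconds formatted as if they were Unix epoch times, while the `timestamp=` lines pair an uptime value with a wall-clock reading. Under that assumption (an interpretation of the output, not something stated in hosted-engine documentation), a small helper can map such a monotonic value back to wall-clock time:

```python
# Sketch under an assumption: metadata timestamps are seconds since host
# boot, and one (uptime, wall-clock) pair is known from the output above,
# e.g. timestamp=98421 (Mon Jun 12 13:55:59 2017) for host 1.
from datetime import datetime, timedelta


def uptime_to_wallclock(uptime_seconds, known_uptime, known_wallclock):
    """Map a seconds-since-boot value to wall-clock time using one known pair."""
    return known_wallclock + timedelta(seconds=uptime_seconds - known_uptime)
```

For host 1 above, the timeout "Fri Jan  2 05:24:27 1970" corresponds to 105867 uptime seconds, which this helper would place at about 16:00:05 local time on the same day.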

Comment 19 Martin Sivák 2017-06-12 14:21:01 UTC
So where is the VM running if both hosted-engine and the webadmin (!) report it as down? Please realize that the webadmin is provided by a service on the VM that is down according to all sources of information we have here.

There is something weird going on; we definitely need to see what libvirt is doing and why VDSM does not see and report the VM that is apparently up.

Can you please do the following on both nodes?

virsh -r list
vdsClient -s 0 getAllVmStats

Comment 20 Nikolai Sednev 2017-06-12 14:33:40 UTC
alma03 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 3     HostedEngine                   running


[root@alma03 ~]# vdsClient -s 0 getAllVmStats
[root@alma03 ~]# 

alma04 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------

You have new mail in /var/spool/mail/root
[root@alma04 ~]# vdsClient -s 0 getAllVmStats
[root@alma04 ~]#
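The mismatch in this comment (libvirt reports HostedEngine running on alma03 while `vdsClient -s 0 getAllVmStats` returns nothing on either host) could be flagged automatically. A hypothetical helper, assuming only the `virsh -r list` table layout shown above:

```python
# Illustrative sketch: parse `virsh -r list` output and report domains that
# libvirt knows about but VDSM does not. The function names are assumptions;
# the parsing relies on the column layout visible in the output above.

def parse_virsh_list(output):
    """Return {name: state} parsed from `virsh -r list` output."""
    vms = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with a numeric domain id, e.g. " 3  HostedEngine  running"
        if len(parts) >= 3 and parts[0].isdigit():
            vms[parts[1]] = " ".join(parts[2:])
    return vms


def unknown_to_vdsm(virsh_output, vdsm_vm_names):
    """List VM names libvirt tracks that are absent from VDSM's VM list."""
    return sorted(set(parse_virsh_list(virsh_output)) - set(vdsm_vm_names))
```

Fed the alma03 output above and VDSM's empty stats, this would report HostedEngine as a domain VDSM has lost track of.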

Comment 21 Nikolai Sednev 2017-06-12 14:34:47 UTC
Created attachment 1287062 [details]
engine shown as down, although running on alma03

Comment 22 Martin Sivák 2017-06-12 15:48:02 UTC
Ok, so there is the issue. It has nothing to do with hosted engine directly here. VDSM does not see the VM for some reason. Hmm, who do we ask about this...

Comment 23 Yaniv Kaul 2017-06-15 10:25:40 UTC
(In reply to Martin Sivák from comment #22)
> Ok, so there is the issue. It has nothing to do with hosted engine directly
> here. VDSM does not see the VM for some reason. Hmm, who do we ask about
> this...

Any news?

Comment 24 Martin Sivák 2017-06-19 13:45:09 UTC
Hi Milan, can you please take a look at the VDSM log to see what might have happened there?

Comment 25 Milan Zamazal 2017-06-20 13:51:13 UTC
I can see there that:

- Migration was started from alma04 to alma03.

- The VM was started on the destination:

  2017-05-07 14:26:47,489+0300 INFO  (jsonrpc/2) [vdsm.api] START __init__

- Then a mysterious Destroy call came on the destination (alma04):

  2017-05-07 14:27:00,802+0300 INFO  (jsonrpc/0) [vdsm.api] START destroy(gracefulAttempts=1) (api:39)

  I can't see where the call came from, I can't see it in Engine log (the one reported here) nor in the source host log.

- The VM was reported as successfully "destroyed" by Vdsm, while it was still running.

- The migration happily continued and eventually finished successfully, without the destination Vdsm being aware that the VM was still there.

So there are two strange events here:

1. The Destroy call of unknown origin on the destination.

2. The migrating VM survives the Destroy call.

As for 1., I can't say.  Regarding 2., I think the problem is that Vm.releaseVm doesn't actually call destroy when the VM is not yet "connected" (which is the case on the migration destination until the migration finishes). It's correct that the VM is not destroyed in such a case, but it's apparently not correct that no error is reported and that the VM is removed from the VM list (I guess it might be related to canceling a migration, without considering the wrong case here).

While fixing 2. would probably prevent the problem from arising, the Destroy call shouldn't happen and someone should try to find out where it came from and why.
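The failure mode described in point 2 can be sketched in simplified form (hypothetical names; this is not the real Vdsm code): before a fix, releasing a VM that is not yet "connected" skipped the libvirt destroy but still reported success and forgot the VM, leaving a running domain that VDSM no longer tracks.

```python
# Illustrative sketch of the releaseVm behavior Milan describes, with a flag
# to contrast the buggy path (silent success, VM forgotten) against the
# corrected path (error reported, VM kept in the list). All names are
# assumptions made for illustration.

class FakeVm:
    def __init__(self, name, connected):
        self.name = name
        self.connected = connected  # False on a migration destination mid-migration


def release_vm(vm, vm_list, report_error_when_skipped=True):
    """Return (success, remaining_vm_list) for a destroy request."""
    if not vm.connected:
        if report_error_when_skipped:
            # Corrected behavior: refuse the destroy and keep tracking the VM.
            return False, vm_list
        # Buggy behavior: claim success and drop the still-running VM.
        return True, [v for v in vm_list if v is not vm]
    # Normal path: the VM is connected and a real destroy would happen here.
    return True, [v for v in vm_list if v is not vm]
```

Under the buggy path, the mysterious Destroy call from comment 25 succeeds and the VM vanishes from `getAllVmStats`, matching what comment 20 shows.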

Comment 26 Milan Zamazal 2017-06-20 14:18:54 UTC
It seems the same happens in the later logs, after this destroy call:

2017-06-11 19:02:01,318+0300 INFO  (jsonrpc/7) [vdsm.api] START destroy(gracefulAttempts=1) from=::1,38406 (api:46)

(From localhost?)

Comment 29 Doron Fediuck 2017-06-22 13:31:34 UTC
As explained before this issue is not a blocker. Removing the relevant flag.
What we're seeing from comment 13, as explained in comment 22, is something different that is unrelated to hosted engine and needs to be handled accordingly.

Please check this bz for the original issue which should be solved already.
As for the new issue, we need a proper reproducer for it in a different bz and all the relevant information as described by Milan in comment 25:
"
I can't see where the call came from, I can't see it in Engine log (the one reported here) nor in the source host log.
"

Comment 30 Red Hat Bugzilla Rules Engine 2017-06-22 13:31:41 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 31 Nikolai Sednev 2017-06-28 21:39:53 UTC
Moving to verified, as the original scenario is no longer reproduced and the destination host's score is not being zeroed. A new bug will be opened further to comments #13, 17, 20 and 29.

Comment 32 Nikolai Sednev 2017-07-03 10:25:25 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1467063 further to my latest comment #31.