Bug 1448699
| Summary: | Migration of the HE VM via engine will drop destination host to the status 'EngineUnexpectedlyDown'. | ||
|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Nikolai Sednev <nsednev> |
| Component: | Agent | Assignee: | Andrej Krejcir <akrejcir> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.1.0.6 | CC: | akrejcir, alukiano, bugs, dfediuck, mavital, mgoldboi, msivak, mzamazal, nsednev, stirabos |
| Target Milestone: | ovirt-4.1.3 | Keywords: | Regression, Triaged |
| Target Release: | --- | Flags: | rule-engine: ovirt-4.1+, rule-engine: blocker+, mgoldboi: planning_ack+, msivak: devel_ack+, mavital: testing_ack+ |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-06 13:31:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1455605 | ||
| Attachments: | |||
Description
Nikolai Sednev
2017-05-07 11:38:25 UTC
alma04 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma03
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : 3f09400a
local_conf_timestamp : 17345
Host timestamp : 17330
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=17330 (Sun May 7 14:27:37 2017)
host-id=1
score=0
vm_conf_refresh_time=17345 (Sun May 7 14:27:52 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma04
Host ID : 2
Engine status : {"health": "good", "vm": "up", "detail": "migration source"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 918fd52d
local_conf_timestamp : 17339
Host timestamp : 17318
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=17318 (Sun May 7 14:27:36 2017)
host-id=2
score=3400
vm_conf_refresh_time=17339 (Sun May 7 14:27:56 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
Created attachment 1276843 [details]
screencast-2017-05-07_14.28.31.mkv
Created attachment 1276844 [details]
screencast-2017-05-07_14.43.18.mkv
Created attachment 1276845 [details]
sosreport-alma03-20170507144128.tar.xz
Created attachment 1276846 [details]
sosreport-alma04-20170507144043.tar.xz
Created attachment 1276847 [details]
sosreport-engine-20170507074037.tar.xz
Engine was deployed using rhvm-appliance-4.1.20170425.0-1.el7.noarch. I can see that in the VDSM log on alma03 we have a traceback:
2017-05-07 14:23:24,187+0300 ERROR (vm/536f534d) [virt.vm] (vmId='536f534d-ab11-4021-a980-11ac558c2991') The vm start process failed (vm:632)
Traceback (most recent call last):
File "/usr/share/vdsm/virt/vm.py", line 563, in _startUnderlyingVm
self._run()
File "/usr/share/vdsm/virt/vm.py", line 2021, in _run
self._connection.createXML(domxml, flags),
File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3777, in createXML
if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: resource busy: Failed to acquire lock: error -243
Can it be that, because of the fsm.NOWAIT we added in the patch https://gerrit.ovirt.org/#/c/74870/1/ovirt_hosted_engine_ha/agent/states.py, the ha-agent starts the HE VM before it is destroyed on the source host?
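The suspected race can be sketched with a toy model. All names below are illustrative stand-ins (this is not sanlock's or the agent's real API); the only grounded detail is that the libvirt traceback above reports the lease acquisition failing with error -243 while the source still holds it:

```python
import threading

# Error code seen in the VDSM traceback above; returned when the VM
# lease cannot be acquired because it is still held elsewhere
# (hedged reading of the log, not a sanlock API reference).
ACQUIRE_BUSY = -243

class LeaseSim:
    """Toy stand-in for the sanlock VM lease (hypothetical class)."""
    def __init__(self):
        self._holder = None
        self._mutex = threading.Lock()

    def acquire(self, host):
        with self._mutex:
            if self._holder is not None and self._holder != host:
                return ACQUIRE_BUSY  # lease still held by the other host
            self._holder = host
            return 0

    def release(self, host):
        with self._mutex:
            if self._holder == host:
                self._holder = None

def start_he_vm(lease, host):
    """NOWAIT-style start attempt: try immediately instead of waiting
    for the source host to release the lease first."""
    return lease.acquire(host) == 0
```

Under this model, a NOWAIT transition lets the destination attempt the start while the source-side VM (and its lease) still exists, which is exactly the failure mode the traceback shows.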
The host that lost the race for starting the VM (the lock was already taken by someone else) should not be penalized. The detection of this case probably failed or is no longer accurate.

Based on the specific comment, "3. After several migrations ...", I'm reducing the severity, removing the blocker flag and moving to 4.1.3.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

(In reply to Yaniv Kaul from comment #10)
> Based on the specific comment:
> "3. After several migrations ..."
>
> I'm reducing the severity, removing the blocker flag and moving to 4.1.3

Just to clear things up a bit: by writing "several", I mean that there were at least 2-4 migrations.

This was not properly fixed. I still see that, although the destination host no longer gets a score of 0, the migration still fails and the HE VM is shown as down even though it is actually up, and the hosts' HA agents fail to converge. All relevant logs and screencasts are attached.
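The no-penalty rule described at the start of this exchange (losing the start race must not zero the host's score) can be sketched as a scoring decision. The function name, the error-string match, and the mapping below are illustrative, not the agent's actual state machine; only the 3400/0 scores and the state names come from the status output in this report:

```python
BASE_SCORE = 3400  # the normal score a healthy HA host reports

def state_after_start_failure(error_message):
    """Map a VM start failure to (state, score). Losing the lease race
    (another host already holds the lock) is benign and must not zero
    the score; only a genuine start failure should penalize the host.
    Hypothetical sketch, not ovirt-hosted-engine-ha's real code."""
    if error_message and "Failed to acquire lock" in error_message:
        return ("EngineDown", BASE_SCORE)    # lost the race: no penalty
    return ("EngineUnexpectedlyDown", 0)     # real failure: penalize
```

The bug reported here is, in effect, the first branch being missed: a lost race is classified like a genuine failure, dropping the destination to EngineUnexpectedlyDown with score 0.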
I've tested with these components on the hosts:
mom-0.5.9-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.9.x86_64
ovirt-hosted-engine-ha-2.1.1-1.el7ev.noarch
ovirt-host-deploy-1.6.5-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64
ovirt-hosted-engine-setup-2.1.2-2.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-setup-lib-1.1.1-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.17-1.el7ev.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
Linux version 3.10.0-514.21.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Apr 22 02:41:35 EDT 2017
Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Sat Apr 22 02:41:35 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-4.1.3.1-0.1.el7.noarch
rhevm-setup-plugins-4.1.2-1.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-doc-4.1.3-1.el7ev.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Created attachment 1287026 [details]
ha not failing to converge
Created attachment 1287028 [details]
engine logs
Another screencast of what happened during migrations of the engine from alma03 to alma04 and then back to alma03:
https://drive.google.com/open?id=0B85BEaDBcF88b09UUVdNb21pQkE

Sosreport from alma03:
https://drive.google.com/open?id=0B85BEaDBcF88TWdaWmtucC1LVW8

Sosreport from alma04:
https://drive.google.com/open?id=0B85BEaDBcF88VHMtcGEwYVl4Wlk

alma03 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma03
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : 3206b7d4
local_conf_timestamp : 98438
Host timestamp : 98421
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=98421 (Mon Jun 12 13:55:59 2017)
host-id=1
score=0
vm_conf_refresh_time=98438 (Mon Jun 12 13:56:15 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Fri Jan 2 05:24:27 1970
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma04
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : c2b51f9b
local_conf_timestamp : 68794
Host timestamp : 68778
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=68778 (Mon Jun 12 13:56:03 2017)
host-id=2
score=0
vm_conf_refresh_time=68794 (Mon Jun 12 13:56:19 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Thu Jan 1 21:14:41 1970
So where is the VM running if both hosted engine and webadmin (!) report it as down? Please realize that the webadmin is provided by a service from the VM that is down according to all sources of information we have here.

There is something weird going on; we definitely need to see what libvirt is doing and why VDSM does not see and report the VM that is apparently up. Can you please run the following on both nodes?

virsh -r list
vdsClient -s 0 getAllVmStats

alma03 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 3     HostedEngine                   running

[root@alma03 ~]# vdsClient -s 0 getAllVmStats
[root@alma03 ~]#

alma04 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------

You have new mail in /var/spool/mail/root
[root@alma04 ~]# vdsClient -s 0 getAllVmStats
[root@alma04 ~]#

Created attachment 1287062 [details]
engine shown as down, although running on alma03
Ok, so there is the issue. It has nothing to do with hosted engine directly here. VDSM does not see the VM for some reason. Hmm, who do we ask about this...

(In reply to Martin Sivák from comment #22)
> Ok, so there is the issue. It has nothing to do with hosted engine directly
> here. VDSM does not see the VM for some reason. Hmm who do we ask about
> this...

Any news?

Hi Milan, can you please take a look at the VDSM log to see what might have happened there?

I can see there that:
- Migration was started from alma04 to alma03.
- The VM was started on the destination:
  2017-05-07 14:26:47,489+0300 INFO (jsonrpc/2) [vdsm.api] START __init__
- Then a mysterious Destroy call came on the destination (alma04):
  2017-05-07 14:27:00,802+0300 INFO (jsonrpc/0) [vdsm.api] START destroy(gracefulAttempts=1) (api:39)
  I can't see where the call came from; I can't see it in the Engine log (the one reported here) nor in the source host log.
- The VM was reported as successfully "destroyed" by Vdsm, while it was still running.
- The migration happily continued and eventually finished successfully, without the destination Vdsm being aware that the VM was still there.

So there are two strange events here:
1. The Destroy call of unknown origin on the destination.
2. The migrating VM survives the Destroy call.

As for 1., I can't say. Regarding 2., I think the problem is that Vm.releaseVm doesn't actually call destroy when the VM is not yet "connected" (which is the case on the migration destination until the migration finishes). It's correct that the VM is not destroyed in such a case, but it's apparently not correct that no error is reported and the VM is removed from the VM list (I guess it might be related to canceling a migration, without considering the wrong case here).

While fixing 2. would probably prevent the problem from arising, the Destroy call shouldn't happen, and someone should try to find out where it came from and why.
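The hypothesis about point 2 can be illustrated with a minimal sketch. The names below (`FakeVm`, the `connected` flag, the release functions) are simplified stand-ins for this illustration, not VDSM's actual Vm.releaseVm implementation:

```python
class FakeVm:
    """Minimal stand-in for a tracked VM; 'connected' is False on a
    migration destination until the migration finishes."""
    def __init__(self, vm_id, connected):
        self.id = vm_id
        self.connected = connected

def release_vm_buggy(vms, vm_id):
    # Sketch of the suspected flaw: the VM is dropped from the tracking
    # dict and "done" is returned even though destroy was skipped because
    # the VM was not yet connected -- so the daemon loses sight of a
    # domain the hypervisor is still running.
    vm = vms.pop(vm_id, None)
    if vm is not None and vm.connected:
        pass  # only here would the underlying domain actually be destroyed
    return {"status": "done"}

def release_vm_checked(vms, vm_id):
    # Sketch of the behavior argued for above: report an error and keep
    # tracking a VM that could not actually be destroyed yet.
    vm = vms.get(vm_id)
    if vm is None:
        return {"status": "no such VM"}
    if not vm.connected:
        return {"status": "error: VM not connected, destroy skipped"}
    del vms[vm_id]  # destroyed for real; safe to stop tracking it
    return {"status": "done"}
```

With the buggy variant, the mysterious Destroy call "succeeds", the VM vanishes from the VM list, and the still-running domain becomes invisible to the management stack, matching the empty getAllVmStats output seen above.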
It seems the same happens in the later logs, after this destroy call:

2017-06-11 19:02:01,318+0300 INFO (jsonrpc/7) [vdsm.api] START destroy(gracefulAttempts=1) from=::1,38406 (api:46)

(From localhost?)

As explained before, this issue is not a blocker. Removing the relevant flag. What we're seeing from comment 13, as explained in comment 22, is something different which is unrelated to hosted engine and needs to be handled accordingly. Please check this bz for the original issue, which should be solved already. As for the new issue, we need a proper reproducer for it in a different bz and all the relevant information, as described by Milan in comment 25: "I can't see where the call came from, I can't see it in Engine log (the one reported here) nor in the source host log."

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Moving to verified, as the original scenario is no longer reproduced and the destination host's score is not being zeroed. A new bug will be opened further to comments #13, 17, 20 and 29.

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1467063 further to my latest comment #31.