Bug 1458745
| Field | Value |
|---|---|
| Summary | [DR] - Hosted engine VM migration fails with 10ms latency between hosted engine hosts |
| Product | [oVirt] ovirt-hosted-engine-ha |
| Component | General |
| Version | 2.1.0.5 |
| Hardware | x86_64 |
| OS | Unspecified |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Reporter | Elad <ebenahar> |
| Assignee | bugs <bugs> |
| QA Contact | Artyom <alukiano> |
| CC | bugs, dfediuck, ebenahar, jcall, mavital, msivak, tnisan, ylavi |
| Keywords | TestOnly, Triaged |
| Flags | rule-engine: ovirt-4.2+ |
| Target Milestone | ovirt-4.2.0 |
| oVirt Team | SLA |
| Type | Bug |
| Last Closed | 2017-12-20 10:57:55 UTC |
| Bug Depends On | 1337914, 1478848, 1498327 |
| Bug Blocks | 1284364, 1534978 |
Description
Elad, 2017-06-05 11:30:19 UTC
Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
vdsm-4.19.11-1.el7ev.x86_64
libvirt-daemon-2.0.0-10.el7_3.8.x86_64
qemu-img-rhev-2.6.0-28.el7_3.9.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.8.x86_64
qemu-kvm-common-rhev-2.6.0-28.el7_3.9.x86_64
qemu-kvm-tools-rhev-2.6.0-28.el7_3.9.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
sanlock-3.4.0-1.el7.x86_64
selinux-policy-3.13.1-102.el7_3.16.noarch
device-mapper-multipath-0.4.9-99.el7_3.3.x86_64
rhevm-4.1.2-0.1.el7.noarch
rhvm-appliance-4.1.20170425.0-1.el7.noarch

Created attachment 1285083 [details]: a migration failure of a "normal" VM

Before I answer the comments, I have two questions about the test scenario:
- Do we support remote storage at all, or a cluster whose hosts are not all in a single location? I am fairly sure we tell people it is a bad idea (especially when using NFS).
- Was the latency only between the hosts, or between the hosts and the storage as well?

The reason for all the issues you describe: the test setup was not able to keep the two separated groups of hosts synchronized with this latency. The hosts in each group did not see the hosts in the other group as reliable. This caused the host that was moving to maintenance to avoid any migration, because there was no "best backup destination". The engine then got stuck in Preparing for maintenance (recoverable by activating the host again).

> How reproducible:
> Most of HE VM migration attempts

Manual migration, or the maintenance flow? Hosted engine uses migration in just those two cases. I saw manual migration finish fine, and maintenance fail because there was no good enough destination.

> Sometimes HE VM migration fails with the following error from Sanlock:
> libvirtError: resource busy: Failed to acquire lock: error -243

This is not a migration error. You should indicate which host reported it, because it was reported by a free and active host from the second group, and that host did the right thing: it could not find the engine, so it tried to start a new instance of it. Sanlock correctly prevented that, because a perfectly healthy engine was still up on another host.

> MainThread::INFO::2017-05-29
> 10:44:51,984::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown
> (score: 0)

This is a separate bug; we have an issue in detecting the locking failure. VDSM events will solve this (we just merged the support into the vdsm master branch).

> if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed',
> dom=self)
> libvirtError: Failed to acquire lock: No space left on device

This is either a storage space issue or a latency issue when talking to the remote storage. Not hosted engine related.

> Expected results:
> Need to decide if HE VM migration should succeed (considering the fact that
> non HE VMs migration works properly with the same conditions)

It definitely does not work fine either; I see that multiple attempts were needed to actually migrate the VM. We do not take this risk with the HE VM: we perform only one migration attempt and then let the admin fix the situation.

Note: the engine could do a better job of informing the admin about this; it already has all the data.

Please note that severe storage connection latency or packet loss can cause sanlock to lose synchronization, and that can result in an involuntary host reboot via the watchdog.
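For quick diagnosis of the "no best backup destination" situation described above, the per-host scores and engine state published by the HA agents can be inspected from any hosted-engine host. A minimal sketch; the output format varies slightly between ovirt-hosted-engine-ha versions, and the log path assumes a default installation:

# hosted-engine --vm-status
# grep -E 'score|EngineUnexpectedlyDown' /var/log/ovirt-hosted-engine-ha/agent.log

Each host section of --vm-status shows an "Engine status" line and a score; a host reporting score 0 (for example while in EngineUnexpectedlyDown) is not picked as a migration destination, which is why the maintenance flow described above can get stuck in Preparing for maintenance.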
Elad, can you clarify the topology? The latency is on which network?
- To the storage?
- On the migration network?
- The mgmt network?
- Are some of the above essentially on the same interface / the same network? (We need to test each independently.)

(In reply to Yaniv Kaul from comment #3)
> Elad, can you clarify the topology? The latency is on which network?

1 DC, 1 cluster, 4 hosts in it, deployed as hosted engine. 2 of the hosts have latency simulated by tc on their NIC for all inbound and outbound traffic. The NIC is used for all host traffic, with no separation (mgmt, storage, VM migration).

(In reply to Elad from comment #4)
> 2 of the hosts
> with latency simulated by tc on their NIC for all in and out traffic. The
> NIC is used for all host traffic, no separation (mgmt, storage, VMs
> migration)

In this case this is unrelated to hosted engine. We could possibly add a warning, as Martin mentioned, but it would be relevant to all storage-related activities.

https://bugzilla.redhat.com/show_bug.cgi?id=1448699 might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1458745#c2:

"> MainThread::INFO::2017-05-29
> 10:44:51,984::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown
> (score: 0)

This is a separate bug; we have an issue in detecting the locking failure. VDSM events will solve this (we just merged the support into the vdsm master branch)."

We are working on active-active testing for oVirt 4.2. We want to make sure it works well, and this is part of that work.

The storage must be reliable enough; the connections apparently break too often in the simulated environment. Was the delay of every packet independent?

A ---- sent
B ---- sent

Or was it sequential?

A ---- sent
B ---- sent

The first case is something we might want to handle to some degree. The second case is something you won't encounter on a real network. Hosted engine has nothing to do with either; it just asks the storage for data. But it only considers other hosts up if their last report arrived less than a minute ago, and NFS probably takes more than a single packet to perform any disk operation.

Martin, all I can say for sure is that the latency on the host's NIC was 10ms during the migration.

And how was the latency inserted? Have you tried downloading a plain file to see the effect of it?

independent latency: the download starts slower, but reaches full speed
sequential latency: it will be sloooow

(In reply to Martin Sivák from comment #10)
> And how was the latency inserted? Have you tried downloading a plain file to
> see the effect of it?

Using tc:

# tc qdisc add dev eth0 root netem delay 10ms

The effect was visible in ping: the latency was higher than 10ms for each ICMP packet.

> independent latency: the download starts slower, but reaches full speed
> sequential latency: it will be sloooow

We enabled migration profiles for the hosted engine VM, and that might fix the migration issues as well. Please see #1478848.

As for storage and synchronization issues when latency is introduced, I do not see anything we can do here from the hosted engine side. The grace periods are already pretty long (1 minute for the host liveliness check, 5 minutes for the engine-up check).

I just realized we fixed a couple of bugs where we interrogated the storage too often. That, combined with the latency, might also have contributed to the issue.

Can you please retest using the latest 4.2 Hosted Engine (or even 4.1, which contains the most important fixes as well)?
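For reference, the latency injection discussed above can be added, checked, and removed with a few commands; a minimal sketch, assuming the NIC is eth0 as in the comment, with <other-host> standing in for the address of another hosted-engine host:

# tc qdisc add dev eth0 root netem delay 10ms
# tc qdisc show dev eth0
# ping -c 5 <other-host>
# tc qdisc del dev eth0 root netem

Because netem on the root qdisc delays only egress traffic on that host, the ping round-trip time rises by at least the configured 10ms, and by roughly twice that if the peer host delays its traffic too, which matches the observation above that every ICMP packet showed more than 10ms of latency.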
HE environment with 3 hosts on NFS storage. The HE VM runs on host_3.

Verification steps:
1) Configure a network delay on host_1 and host_2:

# tc qdisc show dev enp26s0f0
qdisc netem 8001: root refcnt 65 limit 1000 delay 10.0ms

2) Put the host running the HE VM into maintenance.
3) The HE VM migrated to host_1 without any trouble.

I tried it a number of times and it looks fine; no exceptions in the agent or vdsm logs. I checked it on both 4.1 and 4.2.

4.2
=================================
[root@alma07 ~]# rpm -qa | grep vdsm
vdsm-python-4.20.8-1.el7ev.noarch
vdsm-4.20.8-1.el7ev.x86_64
vdsm-hook-vfio-mdev-4.20.8-1.el7ev.noarch
vdsm-yajsonrpc-4.20.8-1.el7ev.noarch
vdsm-jsonrpc-4.20.8-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.8-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.8-1.el7ev.noarch
vdsm-http-4.20.8-1.el7ev.noarch
vdsm-client-4.20.8-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.8-1.el7ev.noarch
vdsm-api-4.20.8-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.8-1.el7ev.noarch
vdsm-hook-fcoe-4.20.8-1.el7ev.noarch
[root@alma07 ~]# rpm -qa | grep hoste
ovirt-hosted-engine-ha-2.2.0-0.2.master.gitcbe3c76.el7ev.noarch
ovirt-hosted-engine-setup-2.2.0-0.4.master.git0b67c21.el7ev.noarch

4.1
=================================
[root@alma05 ~]# rpm -qa | grep vdsm
vdsm-hook-vmfex-dev-4.19.40-1.el7ev.noarch
vdsm-xmlrpc-4.19.40-1.el7ev.noarch
vdsm-cli-4.19.40-1.el7ev.noarch
vdsm-yajsonrpc-4.19.40-1.el7ev.noarch
vdsm-api-4.19.40-1.el7ev.noarch
vdsm-jsonrpc-4.19.40-1.el7ev.noarch
vdsm-python-4.19.40-1.el7ev.noarch
vdsm-client-4.19.40-1.el7ev.noarch
vdsm-4.19.40-1.el7ev.x86_64
[root@alma05 ~]# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.1.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.8-1.el7ev.noarch

I opened another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1519289, which I encountered during the verification process, but it does not really relate to this bug.

This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
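For anyone re-running the verification above entirely from the command line, a minimal sketch follows. It reuses the interface name enp26s0f0 and the host_1/host_2/host_3 layout from the verification comment; putting the host into maintenance can also be done from the engine UI instead of the hosted-engine client.

1) Add and confirm the 10ms delay on host_1 and host_2:
# tc qdisc add dev enp26s0f0 root netem delay 10ms
# tc qdisc show dev enp26s0f0

2) On the host currently running the HE VM (host_3 here), request local maintenance so the HA agents migrate the engine VM away:
# hosted-engine --set-maintenance --mode=local

3) Watch the engine VM come up on another host:
# hosted-engine --vm-status

4) Re-activate the host and remove the delay once the test is done:
# hosted-engine --set-maintenance --mode=none
# tc qdisc del dev enp26s0f0 root netem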