Bug 1436002 - After successful migration of HE vm from host a to host b, immediately migrating back to host a and then immediately migrating back to host b, fails with "Operation Canceled" from rhev manager portal.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Martin Sivák
QA Contact: meital avital
URL:
Whiteboard:
Depends On: 1411319 1419326 1427748 1436613
Blocks:
Reported: 2017-03-26 15:51 UTC by Nikolai Sednev
Modified: 2020-07-16 09:20 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1411319
Environment:
Last Closed: 2017-03-28 10:52:22 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:
nsednev: testing_plan_complete+


Attachments
sosreport-puma18.scl.lab.tlv.redhat.com-20170326184455.tar.xz (8.61 MB, application/x-xz)
2017-03-26 16:01 UTC, Nikolai Sednev
sosreport-puma19.scl.lab.tlv.redhat.com-20170326184449.tar.xz (9.28 MB, application/x-xz)
2017-03-26 16:02 UTC, Nikolai Sednev
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170326184441.tar.xz (8.61 MB, application/x-xz)
2017-03-26 16:03 UTC, Nikolai Sednev
Configuring HE spares count (37.70 KB, image/png)
2017-03-28 09:34 UTC, Martin Sivák
screencast-2017-03-28_13.22.40.mkv (2.71 MB, application/octet-stream)
2017-03-28 10:24 UTC, Nikolai Sednev

Comment 2 Nikolai Sednev 2017-03-26 15:54:46 UTC
After a successful migration of the HE VM from host A to host B, immediately migrating it back to host A and then immediately back to host B fails with "Operation Canceled" in the RHEV Manager portal.

Memory should always be reserved on hosted-engine hosts so that the HE VM can be migrated swiftly.

Host components:
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.10-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
ovirt-host-deploy-1.6.3-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.8.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-4.1.1.6-0.1.el7.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 3 Nikolai Sednev 2017-03-26 16:01:41 UTC
Created attachment 1266512 [details]
sosreport-puma18.scl.lab.tlv.redhat.com-20170326184455.tar.xz

Comment 4 Nikolai Sednev 2017-03-26 16:02:55 UTC
Created attachment 1266513 [details]
sosreport-puma19.scl.lab.tlv.redhat.com-20170326184449.tar.xz

Comment 5 Nikolai Sednev 2017-03-26 16:03:58 UTC
Created attachment 1266514 [details]
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170326184441.tar.xz

Comment 6 Nikolai Sednev 2017-03-26 16:04:59 UTC
Screen-cast, take a closer look at 6:59:
https://drive.google.com/open?id=0B85BEaDBcF88LS1TaERjZDdTWHM

Comment 7 Martin Sivák 2017-03-28 08:33:42 UTC
We already support this, but you have to enable it first.

*** This bug has been marked as a duplicate of bug 1264085 ***

Comment 8 Martin Sivák 2017-03-28 09:34:27 UTC
Created attachment 1266892 [details]
Configuring HE spares count

This is where you specify how many hosts need to be reserved for hosted engine.

Comment 9 Nikolai Sednev 2017-03-28 10:24:22 UTC
Created attachment 1266901 [details]
screencast-2017-03-28_13.22.40.mkv

Comment 10 Nikolai Sednev 2017-03-28 10:25:46 UTC
This is not working for me even with HeSparesCount set to 2.

Comment 11 Martin Sivák 2017-03-28 10:52:22 UTC
OK, after talking to Nikolai, I have to explain a couple of things:

- HeSparesCount does not magically free any space. It serves as a safeguard to make sure that already-free space stays free for an incoming hosted engine VM (and that it stays free on the right number of hosted engine hosts).

- ovirt-engine refreshes host stats only a few times per minute (about once every 15 seconds, or when an async event arrives). An immediately triggered Migrate command might therefore fail with insufficient memory because the free memory has not yet been recomputed. This is an inherent engine limitation.

To sum up: we can reserve memory, but the engine is not a hard real-time system, and there are small delays in data collection. The sysadmin needs to be aware of this limitation. We can't fix this with the current architecture.
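
The timing issue described in this comment can be illustrated with a toy model. This is only a sketch and not engine code: the class `HostStatsCache`, its fields, the memory sizes, and the function `simulate` are all hypothetical names and values chosen for illustration. The only figure taken from this bug is the refresh period of roughly 15 seconds, and the model ignores the async-event refresh path entirely.

```python
from dataclasses import dataclass
from typing import Tuple

# Approximate refresh period quoted in this comment (assumption: exactly 15 s).
REFRESH_INTERVAL_S = 15.0


@dataclass
class HostStatsCache:
    """Hypothetical stand-in for the engine's cached view of one host."""

    cached_free_mb: int
    last_refresh_s: float = 0.0

    def refresh_if_due(self, now_s: float, actual_free_mb: int) -> None:
        # The cache only catches up with reality once the interval has elapsed.
        if now_s - self.last_refresh_s >= REFRESH_INTERVAL_S:
            self.cached_free_mb = actual_free_mb
            self.last_refresh_s = now_s

    def can_accept(self, vm_mem_mb: int) -> bool:
        # The decision is made against the cached value, not the real one.
        return self.cached_free_mb >= vm_mem_mb


def simulate() -> Tuple[bool, bool]:
    vm_mem_mb = 4096  # made-up HE VM size
    # At the last refresh (t=0) host A still ran the HE VM, so little was free.
    host_a = HostStatsCache(cached_free_mb=1024, last_refresh_s=0.0)
    # Right after t=0 the VM migrated away, so the memory is really free now.
    actual_free_mb = 1024 + vm_mem_mb

    # Migrating back 2 s later: no refresh is due, the cached value is stale.
    host_a.refresh_if_due(now_s=2.0, actual_free_mb=actual_free_mb)
    immediate = host_a.can_accept(vm_mem_mb)

    # Migrating back 20 s later: the refresh has happened.
    host_a.refresh_if_due(now_s=20.0, actual_free_mb=actual_free_mb)
    later = host_a.can_accept(vm_mem_mb)
    return immediate, later
```

In this model the immediate attempt is rejected and the later one is accepted, even though the host's real free memory is identical in both cases.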

Comment 12 Nikolai Sednev 2017-08-08 16:11:34 UTC
(In reply to Martin Sivák from comment #11)
> Ok after talking to Nikolai I have to explain couple of things:
> 
> - HeSparesCount does not magically free any space, it serves as a safeguard
> to make sure the already free space stays free for the purpose of the
> incoming hosted engine VM (and that it stays free on the right amount of
> hosted engine hosts)
> 
> - ovirt-engine does stats refresh only couple of times per minute (about
> once per 15 seconds or when an async event arrives). This might cause the
> immediately triggered Migrate command to fail with insufficient memory as
> the free memory was not yet recomputed. This is an inherent engine
> limitation.
> 
> So to sum it up, we can reserve memory, but the engine is not hard realtime
> system and there are small delays in data collection. The sysadmin needs to
> be aware of this limitation. We can't fix this using the current
> architecture.
Do we have this documented somewhere? If we expect admins to be aware of this delay, we should properly document this limitation.

Comment 13 Nikolai Sednev 2017-08-08 16:12:30 UTC
Can you please point me to the relevant documentation coverage for https://bugzilla.redhat.com/show_bug.cgi?id=1436002#c12 ?

Comment 14 Yaniv Kaul 2017-08-08 17:42:27 UTC
(In reply to Nikolai Sednev from comment #12)
> (In reply to Martin Sivák from comment #11)
> > Ok after talking to Nikolai I have to explain couple of things:
> > 
> > - HeSparesCount does not magically free any space, it serves as a safeguard
> > to make sure the already free space stays free for the purpose of the
> > incoming hosted engine VM (and that it stays free on the right amount of
> > hosted engine hosts)
> > 
> > - ovirt-engine does stats refresh only couple of times per minute (about
> > once per 15 seconds or when an async event arrives). This might cause the
> > immediately triggered Migrate command to fail with insufficient memory as
> > the free memory was not yet recomputed. This is an inherent engine
> > limitation.
> > 
> > So to sum it up, we can reserve memory, but the engine is not hard realtime
> > system and there are small delays in data collection. The sysadmin needs to
> > be aware of this limitation. We can't fix this using the current
> > architecture.
> Do we have this documented somewhere? I think if we expect from admin to be
> aware of this delay, we should properly document this limitation.

Why? Does it auto-fix itself after a minute? If so, it's good enough. Why would you want to ping-pong the HE VM?
Let's not add useless content to the documentation; there are more items missing there with higher priority.

Setting needinfo to learn from the reporter whether it still happens after a while.

Comment 15 Nikolai Sednev 2017-08-09 06:17:46 UTC
(In reply to Yaniv Kaul from comment #14)
> (In reply to Nikolai Sednev from comment #12)
> > (In reply to Martin Sivák from comment #11)
> > > Ok after talking to Nikolai I have to explain couple of things:
> > > 
> > > - HeSparesCount does not magically free any space, it serves as a safeguard
> > > to make sure the already free space stays free for the purpose of the
> > > incoming hosted engine VM (and that it stays free on the right amount of
> > > hosted engine hosts)
> > > 
> > > - ovirt-engine does stats refresh only couple of times per minute (about
> > > once per 15 seconds or when an async event arrives). This might cause the
> > > immediately triggered Migrate command to fail with insufficient memory as
> > > the free memory was not yet recomputed. This is an inherent engine
> > > limitation.
> > > 
> > > So to sum it up, we can reserve memory, but the engine is not hard realtime
> > > system and there are small delays in data collection. The sysadmin needs to
> > > be aware of this limitation. We can't fix this using the current
> > > architecture.
> > Do we have this documented somewhere? I think if we expect from admin to be
> > aware of this delay, we should properly document this limitation.
> 
> Why? Does it auto-fix itself after a minute? If so, it's good enough.
The issue was not fixed; it is still happening.

 
> Why would you want to ping-pong the HE VM? 

It's not a negative test but basic functionality, and automation testing crashed several times because of this during VM migration testing. A customer may also want to migrate the VM back immediately if the initial migration was made by mistake.

> Let's not add useless content to documentation, there are more items missing
> there with higher priority. 
If the product has limitations that affect basic functionality, and admins are expected to know about them, as was written above, then proper documentation should exist.
> 
> Setting needinfo to understand from the reporter if it happens after awhile.

Comment 16 Yaniv Kaul 2017-08-09 08:25:36 UTC
Per my discussion with Meital, please adjust the tests to wait 1 minute between migrations.
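
One way a test suite could apply this is sketched below. It is a minimal illustration, not the actual automation code: `migrate_with_gap` and its parameters are hypothetical, and `migrate` is a placeholder for whatever call the test framework uses to trigger a migration (no real SDK call is assumed here). The one-minute default comes from this comment.

```python
import time
from typing import Callable, Sequence

# Gap requested in this comment: one minute between migrations.
MIN_GAP_S = 60.0


def migrate_with_gap(
    migrate: Callable[[str], None],
    targets: Sequence[str],
    gap_s: float = MIN_GAP_S,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call migrate(target) for each target, pausing between consecutive calls."""
    for index, target in enumerate(targets):
        if index > 0:
            # Give the engine time to refresh host stats before the next move.
            sleep(gap_s)
        migrate(target)
```

Passing `sleep` as a parameter lets a unit test substitute a fake and verify the ordering without actually waiting, for example `migrate_with_gap(my_migrate, ["host_b", "host_a", "host_b"])`, where `my_migrate` is the caller's own function.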

Comment 17 Lucy Bopf 2017-09-12 00:45:02 UTC
Nikolai, Yaniv,

Just re-reading this bug, and I'm unclear, based on the discussion, whether there is still documentation impact. Can you confirm?

If so, can you (Nikolai) summarize the impact to a customer: in what flow would they encounter this limitation, how does it impact them, and what do we recommend that they do to avoid or minimize the impact?

Comment 18 Yaniv Kaul 2017-09-12 06:50:18 UTC
(In reply to Lucy Bopf from comment #17)
> Nikolai, Yaniv,
> 
> Just re-reading this bug, and I'm unclear, based on the discussion, whether
> there is still documentation impact. Can you confirm?

There is no need to document this issue.

> 
> If so, can you (Nikolai) summarize the impact to a customer: in what flow
> would they encounter this limitation, how does it impact them, and what do
> we recommend that they do to avoid or minimize the impact?

