After a successful migration of the HE VM from host A to host B, immediately migrating it back to host A and then immediately migrating it back to host B fails with "Operation Canceled" in the RHEV Manager portal. Memory should always be reserved on hosted-engine hosts so that HE VM migrations can proceed without delay.

Host components:
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.10-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
ovirt-host-deploy-1.6.3-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.8.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-4.1.1.6-0.1.el7.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Created attachment 1266512 [details] sosreport-puma18.scl.lab.tlv.redhat.com-20170326184455.tar.xz
Created attachment 1266513 [details] sosreport-puma19.scl.lab.tlv.redhat.com-20170326184449.tar.xz
Created attachment 1266514 [details] sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170326184441.tar.xz
Screen-cast, take a closer look at 6:59: https://drive.google.com/open?id=0B85BEaDBcF88LS1TaERjZDdTWHM
We already support this, but you have to enable it first.

*** This bug has been marked as a duplicate of bug 1264085 ***
Created attachment 1266892 [details]
Configuring HE spares count

This is where you specify how many hosts need to be reserved for hosted engine.
Created attachment 1266901 [details] screencast-2017-03-28_13.22.40.mkv
This is still not working for me, even with HeSparesCount set to 2.
OK, after talking to Nikolai, I have to explain a couple of things:

- HeSparesCount does not magically free any space. It serves as a safeguard to make sure that already-free space stays free for the incoming hosted engine VM (and that it stays free on the right number of hosted engine hosts).

- ovirt-engine refreshes stats only a few times per minute (about once every 15 seconds, or when an async event arrives). An immediately triggered Migrate command may therefore fail with insufficient memory, because the free memory has not yet been recomputed. This is an inherent engine limitation.

So to sum it up: we can reserve memory, but the engine is not a hard real-time system, and there are small delays in data collection. The sysadmin needs to be aware of this limitation. We can't fix this using the current architecture.
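For anyone scripting around this delay, here is a minimal sketch of retrying a migration after roughly one stats refresh interval. It is only an illustration: `migrate_with_retry` and its `migrate` callable are hypothetical names, not part of ovirt-engine or its SDK, and the 15-second delay is just the approximate refresh interval mentioned above. The caller is assumed to supply a function that triggers the migration and returns True on success.

```python
import time
from typing import Callable

# Approximate stats refresh interval quoted in this comment (an assumption,
# not a value read from the engine).
STATS_REFRESH_SECONDS = 15


def migrate_with_retry(
    migrate: Callable[[], bool],
    attempts: int = 4,
    delay: float = STATS_REFRESH_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call the hypothetical migrate() until it succeeds or attempts run out.

    Between failed attempts we wait `delay` seconds so the engine has a
    chance to recompute free memory on the destination host.
    """
    for attempt in range(attempts):
        if migrate():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `sleep` parameter exists only so the helper can be exercised without real waiting; in normal use the default `time.sleep` applies.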
(In reply to Martin Sivák from comment #11)
> Ok after talking to Nikolai I have to explain couple of things:
>
> - HeSparesCount does not magically free any space, it serves as a safeguard
> to make sure the already free space stays free for the purpose of the
> incoming hosted engine VM (and that it stays free on the right amount of
> hosted engine hosts)
>
> - ovirt-engine does stats refresh only couple of times per minute (about
> once per 15 seconds or when an async event arrives). This might cause the
> immediately triggered Migrate command to fail with insufficient memory as
> the free memory was not yet recomputed. This is an inherent engine
> limitation.
>
> So to sum it up, we can reserve memory, but the engine is not hard realtime
> system and there are small delays in data collection. The sysadmin needs to
> be aware of this limitation. We can't fix this using the current
> architecture.

Do we have this documented somewhere? If we expect the admin to be aware of this delay, we should properly document this limitation.
Can you please point me to the relevant documentation covering https://bugzilla.redhat.com/show_bug.cgi?id=1436002#c12 ?
(In reply to Nikolai Sednev from comment #12)
> (In reply to Martin Sivák from comment #11)
> > Ok after talking to Nikolai I have to explain couple of things:
> >
> > - HeSparesCount does not magically free any space, it serves as a safeguard
> > to make sure the already free space stays free for the purpose of the
> > incoming hosted engine VM (and that it stays free on the right amount of
> > hosted engine hosts)
> >
> > - ovirt-engine does stats refresh only couple of times per minute (about
> > once per 15 seconds or when an async event arrives). This might cause the
> > immediately triggered Migrate command to fail with insufficient memory as
> > the free memory was not yet recomputed. This is an inherent engine
> > limitation.
> >
> > So to sum it up, we can reserve memory, but the engine is not hard realtime
> > system and there are small delays in data collection. The sysadmin needs to
> > be aware of this limitation. We can't fix this using the current
> > architecture.
> Do we have this documented somewhere? I think if we expect from admin to be
> aware of this delay, we should properly document this limitation.

Why? Does it auto-fix itself after a minute? If so, it's good enough.
Why would you want to ping-pong the HE VM?
Let's not add useless content to documentation, there are more items missing there with higher priority.

Setting needinfo to understand from the reporter if it happens after awhile.
(In reply to Yaniv Kaul from comment #14)
> (In reply to Nikolai Sednev from comment #12)
> > (In reply to Martin Sivák from comment #11)
> > > Ok after talking to Nikolai I have to explain couple of things:
> > >
> > > - HeSparesCount does not magically free any space, it serves as a safeguard
> > > to make sure the already free space stays free for the purpose of the
> > > incoming hosted engine VM (and that it stays free on the right amount of
> > > hosted engine hosts)
> > >
> > > - ovirt-engine does stats refresh only couple of times per minute (about
> > > once per 15 seconds or when an async event arrives). This might cause the
> > > immediately triggered Migrate command to fail with insufficient memory as
> > > the free memory was not yet recomputed. This is an inherent engine
> > > limitation.
> > >
> > > So to sum it up, we can reserve memory, but the engine is not hard realtime
> > > system and there are small delays in data collection. The sysadmin needs to
> > > be aware of this limitation. We can't fix this using the current
> > > architecture.
> > Do we have this documented somewhere? I think if we expect from admin to be
> > aware of this delay, we should properly document this limitation.
>
> Why? Does it auto-fix itself after a minute? If so, it's good enough.

The issue was not fixed; it is still happening.

> Why would you want to ping-pong the HE VM?

It's not a negative test but basic functionality, and automation testing crashed several times because of this during VM migration testing. A customer may also want to migrate the VM back immediately if the initial migration was made by mistake.

> Let's not add useless content to documentation, there are more items missing
> there with higher priority.

If the product has limitations that affect basic functionality, and the admin is expected to know about them as was written, then proper documentation should exist.

>
> Setting needinfo to understand from the reporter if it happens after awhile.
Per my discussion with Meital, please adjust the tests to wait 1m between migrations.
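One way the tests could enforce that spacing is a small pacing guard called before every migration step. This is only a sketch under assumptions: `MigrationPacer` is a hypothetical helper, not something from the existing automation, and the 60-second default simply mirrors the 1m figure above.

```python
import time


class MigrationPacer:
    """Ensure at least `min_interval` seconds pass between migrations."""

    def __init__(self, min_interval=60.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time the previous migration was allowed to start

    def wait_turn(self):
        """Block until the next migration may start; return seconds waited."""
        waited = 0.0
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

A test would create one pacer per VM and call `wait_turn()` right before each migration request; the first call returns immediately, later calls sleep only for whatever part of the minute has not yet elapsed.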
Nikolai, Yaniv, Just re-reading this bug, and I'm unclear, based on the discussion, whether there is still documentation impact. Can you confirm? If so, can you (Nikolai) summarize the impact to a customer: in what flow would they encounter this limitation, how does it impact them, and what do we recommend that they do to avoid or minimize the impact?
(In reply to Lucy Bopf from comment #17)
> Nikolai, Yaniv,
>
> Just re-reading this bug, and I'm unclear, based on the discussion, whether
> there is still documentation impact. Can you confirm?

There is no need to document this issue.

>
> If so, can you (Nikolai) summarize the impact to a customer: in what flow
> would they encounter this limitation, how does it impact them, and what do
> we recommend that they do to avoid or minimize the impact?