Description of problem:
Guarantee there is always memory free for hosted-engine start on a host that is part of a self-hosted env.

Why? By some miracle our hosted-engine failed and could not be started on another host in the self-hosted env because there was no free memory. The solution was to kill some VMs to free memory for hosted-engine. Once memory was available, the HA agent started the hosted-engine VM successfully.

IMO there's a fault in the design: there should always be a guarantee that at least one other host in the self-hosted environment has enough free memory, so that if the hosted-engine VM crashes on host 'foo' it can be successfully started on host 'bar', while the remaining hosts may have no free memory. Smells like a chicken/egg problem, please brainstorm.

Version-Release number of selected component (if applicable):
3.5.4

How reproducible:
100%

Steps to Reproduce:
1. install self-hosted engine and have 2 hosts for this self-hosted engine env
2. have hosted-engine running on host 1
3. add another host into the same Data Center (thus host 3)
4. create load (start VMs) on host 2 and host 3 so there's no free memory
5. kill hosted-engine on host 1 and immediately create load there so there's no free memory (thus it needs to be started somewhere else - host 2)
6. see how the HA agent tries to start hosted-engine

Actual results:
qemu-kvm returns 'cannot allocate memory'

Expected results:
there should be some kind of guarantee that in the case above host 2 doesn't allow the start of VMs which would "eat" memory reserved for hosted-engine start

Additional info:
submitting this BZ so this issue is not forgotten (meanwhile I'll stick to notifier for passive check)
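For discussion, a minimal sketch of the guarantee requested above, written as a pre-start check. It is not how the engine scheduler is implemented; the `Host` type, the field names, and both functions are hypothetical, and the host list is assumed to contain only hosts able to run hosted-engine.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical model of a hosted-engine capable host.
    name: str
    free_mb: int               # memory currently free for new VMs, in MB
    runs_engine: bool = False  # True on the host currently running the engine VM


def engine_can_fail_over(hosts, engine_mb):
    """True if some host other than the current engine host could start the engine VM."""
    return any(not h.runs_engine and h.free_mb >= engine_mb for h in hosts)


def may_start_vm(hosts, target, vm_mb, engine_mb):
    """Allow starting a VM on `target` only if engine failover stays possible afterwards."""
    after = []
    for h in hosts:
        free = h.free_mb - vm_mb if h.name == target else h.free_mb
        if free < 0:
            return False  # the target host cannot fit the VM at all
        after.append(Host(h.name, free, h.runs_engine))
    return engine_can_fail_over(after, engine_mb)
```

With hosted-engine on 'foo' and a 4096 MB engine VM, a VM start on 'bar' would be refused as soon as it pushed the free memory on 'bar' below 4096 MB, unless another hosted-engine host still had room.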
(I wanted to go with quota, but if the memory on the two involved hosts differs, I can't define a max memory in the quota for the VMs other than hosted-engine.)
Our HostedEngine is down again and could not be started because there was not enough free RAM. As a workaround we are going to waste memory with a process which will "eat" as much RAM as HostedEngine needs, and we will kill it via hooks when needed.
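A rough sketch of such a placeholder process, assuming it is stopped with SIGTERM from the hook. The function names and the 4096-byte page size are assumptions, and nothing here pins the memory, so the kernel may still swap it out.

```python
import signal
import sys
import time

PAGE_SIZE = 4096  # assumed page size in bytes


def allocate_ballast(size_bytes):
    """Allocate size_bytes and write to every page so the memory is really backed by RAM."""
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, PAGE_SIZE):
        buf[offset] = 1
    return buf


def hold_ballast(size_mb):
    """Hold size_mb of memory until SIGTERM arrives, then exit and release it."""
    ballast = allocate_ballast(size_mb * 1024 * 1024)  # keep a reference alive
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    while True:
        time.sleep(60)
```

Calling `hold_ballast(16384)` would hold roughly the amount of RAM a 16 GB engine VM needs; the hook then only has to send SIGTERM to the process before hosted-engine is started.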
Martin, is this already implemented?
Yes, this is already in.
The fix for this issue should be included in oVirt 4.1.0 beta 1 released on December 1st. If not included please move back to modified.
I've deployed 2 hosts with 32 GB of RAM each and HE with 16384 MB (16 GB) of RAM.
I created some VMs with 1 GB of guaranteed memory each and got 43 guest VMs plus the HE-VM running overall.
43*1024=44032 MB of RAM in total, with 29 VMs running on puma18 and 14 guest VMs plus the HE-VM on puma19 (15 VMs including the HE-VM on puma19).

When I tried to start an additional VM, I received this error:
"
Operation Canceled
Error while executing action:
Memory_load-41:
Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:
The host puma18.scl.lab.tlv.redhat.com did not satisfy internal filter Memory because its available memory is too low (100.000000 MB) to run the VM.
The host puma19.scl.lab.tlv.redhat.com did not satisfy internal filter Memory because its available memory is too low (0.000000 MB) to run the VM.
"

RAM reported from WEBUI on puma18 (29 guest VMs running on the host):
Physical Memory: 32067 MB total, 6413 MB used, 25654 MB free
Swap Size: 2047 MB total, 0 MB used, 2047 MB free
Shared Memory: 27%
Max free Memory for scheduling new VMs: 100 MB

RAM reported from WEBUI on puma19 (the host that was running the HE-VM with 16 GB RAM and 14 guest VMs):
Physical Memory: 32067 MB total, 18278 MB used, 13789 MB free
Swap Size: 2047 MB total, 0 MB used, 2047 MB free
Shared Memory: 7%
Max free Memory for scheduling new VMs: 0 MB

Now I've manually stopped the HE-VM on puma19 to see if it gets started on puma18. Killing the HE-VM manually zeroes the HA score of puma19, so HA should start the HE-VM on another host with the best positive score, which in my case should be puma18, as I have only 2 hosted-engine hosts in my environment.
puma19 ~]# hosted-engine --vm-poweroff
puma19 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma18.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 24c902b4
local_conf_timestamp               : 12304
Host timestamp                     : 518905
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=518905 (Mon Jan 2 14:52:08 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=12304 (Tue Dec 27 18:08:37 2016)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma19.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 0b49f26a
local_conf_timestamp               : 10819
Host timestamp                     : 517458
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=517458 (Mon Jan 2 14:52:34 2017)
    host-id=2
    score=0
    vm_conf_refresh_time=10819 (Tue Dec 27 18:08:23 2016)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False

puma18 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma18.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f5d68fe8
local_conf_timestamp               : 12304
Host timestamp                     : 519035
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=519035 (Mon Jan 2 14:54:19 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=12304 (Tue Dec 27 18:08:37 2016)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma19.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 613a9a8b
local_conf_timestamp               : 10819
Host timestamp                     : 517581
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=517581 (Mon Jan 2 14:54:37 2017)
    host-id=2
    score=0
    vm_conf_refresh_time=10819 (Tue Dec 27 18:08:23 2016)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Wed Jan 7 01:54:53 1970

puma18 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma18.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 29dfc563
local_conf_timestamp               : 12304
Host timestamp                     : 521268
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=521268 (Mon Jan 2 15:31:32 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=12304 (Tue Dec 27 18:08:37 2016)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma19.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : fb1a3aa2
local_conf_timestamp               : 10819
Host timestamp                     : 519787
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=519787 (Mon Jan 2 15:31:23 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=10819 (Tue Dec 27 18:08:23 2016)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

Finally the HE-VM was started on puma18, and the host was running 30 VMs overall: 29 guest VMs and the HE-VM.

puma18:
Physical Memory: 32067 MB total, 10582 MB used, 21485 MB free
Swap Size: 2047 MB total, 0 MB used, 2047 MB free
Shared Memory: 30%
Max free Memory for scheduling new VMs:

Works for me on these components on hosts:
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.20161221071755.git46cacd3.el7.centos.noarch
ovirt-setup-lib-1.1.0-1.el7.centos.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
ovirt-release41-pre-4.1.0-0.6.beta2.20161221025826.gitc487776.el7.centos.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
ovirt-hosted-engine-ha-2.1.0-0.0.master.20161221070856.20161221070854.git387fa53.el7.centos.noarch
rhevm-appliance-20161116.0-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-host-deploy-1.6.0-0.0.master.20161215101008.gitb76ad50.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
vdsm-4.18.999-1218.gitd36143e.el7.centos.x86_64
ovirt-imageio-daemon-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Nov 16 13:15:13 EST 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
ovirt-engine-setup-plugin-ovirt-engine-common-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-imageio-proxy-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-iso-uploader-4.1.0-0.0.master.20160909154152.git14502bd.el7.centos.noarch
ovirt-engine-userportal-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-dbscripts-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-host-deploy-1.6.0-0.0.master.20161215101008.gitb76ad50.el7.centos.noarch
python-ovirt-engine-sdk4-4.1.0-0.1.a0.20161215git77fce51.el7.centos.x86_64
ovirt-host-deploy-java-1.6.0-0.0.master.20161215101008.gitb76ad50.el7.centos.noarch
ovirt-release41-pre-4.1.0-0.6.beta2.20161221025826.gitc487776.el7.centos.noarch
ovirt-setup-lib-1.1.0-1.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.2-1.el7.noarch
ovirt-engine-dwh-setup-4.1.0-0.0.master.20161129154019.el7.centos.noarch
ovirt-imageio-proxy-setup-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-engine-tools-backup-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-websocket-proxy-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-setup-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-backend-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-tools-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-webadmin-portal-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-restapi-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-wildfly-overlay-10.0.0-1.el7.noarch
ovirt-engine-cli-3.6.9.2-1.el7.centos.noarch
ovirt-web-ui-0.1.1-2.el7.centos.x86_64
ovirt-engine-setup-base-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-vmconsole-1.0.4-1.el7.centos.noarch
ovirt-engine-dwh-4.1.0-0.0.master.20161129154019.el7.centos.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-hosts-ansible-inventory-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-dashboard-1.1.0-0.4.20161128git5ed6f96.el7.centos.noarch
ovirt-engine-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-guest-agent-common-1.0.13-1.20161220085008.git165fff1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.centos.noarch
ovirt-engine-wildfly-10.1.0-1.el7.x86_64
ovirt-engine-lib-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (builder.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Dec 6 23:06:41 UTC 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core)
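
For anyone repeating this verification, a small sketch that pulls the per-host score out of saved `hosted-engine --vm-status` text and picks the host with the best positive score, as described in the comment above. It assumes the output has one field per line, as the tool prints it; `parse_scores` and `best_host` are hypothetical helper names, not part of ovirt-hosted-engine-ha.

```python
import re


def parse_scores(status_text):
    """Return {hostname: score} from `hosted-engine --vm-status` text."""
    scores = {}
    hostname = None
    for line in status_text.splitlines():
        match = re.match(r"\s*Hostname\s*:\s*(\S+)", line)
        if match:
            hostname = match.group(1)
            continue
        # Matches the "Score : N" field only, not the "score=N" metadata entry.
        match = re.match(r"\s*Score\s*:\s*(\d+)", line)
        if match and hostname is not None:
            scores[hostname] = int(match.group(1))
    return scores


def best_host(scores):
    """Return the host with the highest positive score, or None if there is none."""
    positive = {host: score for host, score in scores.items() if score > 0}
    if not positive:
        return None
    return max(positive, key=positive.get)
```

For the first status dump above (puma18 at 3400, puma19 at 0) this selects puma18, which is where the HE-VM ended up.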
*** Bug 1436002 has been marked as a duplicate of this bug. ***