Bug 1264085 - [RFE] Reserve enough free memory for hosted-engine to start on a host being part of self-hosted env
Summary: [RFE] Reserve enough free memory for hosted-engine to start on a host being part of self-hosted env
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.0-alpha
Assignee: Martin Sivák
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1403956 1406001
Blocks: 1411319 1427748 1436613
 
Reported: 2015-09-17 13:17 UTC by Jiri Belka
Modified: 2017-03-28 10:31 UTC
CC List: 14 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-02-15 14:53:21 UTC
oVirt Team: SLA
Embargoed:
dfediuck: ovirt-4.1?
gklein: testing_plan_complete+
rule-engine: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack+




Links:
oVirt gerrit 46111 - last updated 2016-10-26 12:43:37 UTC

Description Jiri Belka 2015-09-17 13:17:14 UTC
Description of problem:
Guarantee that there is always enough free memory for the hosted-engine to start on a host that is part of the self-hosted environment.

Why? Our hosted-engine failed and could not be started on another host in the self-hosted environment because there was no free memory there.

The workaround was to kill some VMs to free memory for the hosted-engine. Once memory was available, the HA agent started the hosted-engine VM successfully.

IMO there's a fault in the design: there should always be a guarantee that enough free memory remains on at least one host in the self-hosted environment, so that if the hosted-engine VM crashes on host 'foo' it can be successfully started on host 'bar', even when the other hosts have no free memory.

Smells like a chicken-and-egg problem; please brainstorm.

Version-Release number of selected component (if applicable):
3.5.4

How reproducible:
100%

Steps to Reproduce:
1. install a self-hosted engine and have 2 hosts in this self-hosted-engine env
2. have the hosted-engine running on host 1
3. add another host into the same Data Center (thus host 3)
4. create load (start VMs) on host 2 and host 3 so there's no free memory
5. kill the hosted-engine on host 1 and immediately create load there so there's
   no free memory (thus it needs to be started somewhere else - host 2)
6. watch how the HA agent tries to start the hosted-engine (a shell sketch of steps 4-6 follows this list)
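
A minimal shell sketch of steps 4-6, assuming stress-ng is available on the hosts as a stand-in for real guest load; the tool, sizes and host names are my own placeholders, not the commands originally used:

# hosts 2 and 3: consume (almost) all free memory
host2 ~]# stress-ng --vm 1 --vm-bytes 90% --vm-keep &
host3 ~]# stress-ng --vm 1 --vm-bytes 90% --vm-keep &
# host 1: stop the HE-VM and immediately create load there too
host1 ~]# hosted-engine --vm-poweroff
host1 ~]# stress-ng --vm 1 --vm-bytes 90% --vm-keep &
# host 2: watch the HA agent trying to restart the HE-VM
host2 ~]# hosted-engine --vm-status
host2 ~]# journalctl -u ovirt-ha-agent -f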

Actual results:
qemu-kvm returns 'cannot allocate memory'

Expected results:
there should be some kind of guarantee that, in the case above, host 2 does not allow starting VMs that would "eat" the memory reserved for a hosted-engine start

Additional info:
submitting this BZ so this issue is not forgotten
(meanwhile I'll stick to the notifier for a passive check; a rough sketch of such a check follows)
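
A minimal sketch of what such a passive check could look like (my own assumption, not an existing oVirt notifier; the 16384 MB hosted-engine size and the mail address are placeholders), to be run from cron on every hosted-engine host:

#!/bin/bash
# Warn when this host no longer has enough free memory left
# to start the hosted-engine VM.
HE_MEM_MB=16384
AVAIL_MB=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
if [ "$AVAIL_MB" -lt "$HE_MEM_MB" ]; then
    echo "only ${AVAIL_MB} MB available, hosted-engine needs ${HE_MEM_MB} MB" \
        | mail -s "hosted-engine memory reserve gone on $(hostname)" admin@example.com
fi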

Comment 1 Jiri Belka 2015-09-17 13:19:01 UTC
(I wanted to go with quota, but if the memory on the two involved hosts differs, I can't define a max memory quota for all VMs other than the hosted-engine.)

Comment 2 Jiri Belka 2015-11-12 09:37:28 UTC
Our HostedEngine is down again and could not be started because there was not enough free RAM. As a workaround we are going to waste memory with a process that will "eat" as much RAM as the HostedEngine, and we will kill it via hooks when needed (a rough sketch of such a placeholder process is below).
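
A minimal sketch of that memory placeholder (hypothetical, assuming stress-ng is installed; 16384 MB is only a placeholder size for the HostedEngine VM):

#!/bin/bash
# reserve-he-memory.sh - pin roughly as much RAM as the HostedEngine VM needs
# so other VMs cannot allocate it. Kill this process (for example from a vdsm
# before_vm_start hook, or manually) to release the memory right before the
# HA agent has to start the HostedEngine VM.
HE_MEM_MB=16384
exec stress-ng --vm 1 --vm-bytes "${HE_MEM_MB}m" --vm-keep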

Comment 3 Doron Fediuck 2016-07-17 08:24:14 UTC
Martin,
is this already implemented?

Comment 4 Martin Sivák 2016-10-26 12:43:37 UTC
Yes, this is already in.

Comment 5 Sandro Bonazzola 2016-12-12 13:56:37 UTC
The fix for this issue should be included in oVirt 4.1.0 beta 1, released on December 1st. If it is not included, please move the bug back to MODIFIED.

Comment 6 Nikolai Sednev 2017-01-02 13:38:08 UTC
I've deployed 2 hosts with 32 GiB of RAM each and an HE-VM with 16384 MB (16 GiB) of RAM.
I've created VMs with 1 GiB (1024 MB) of guaranteed memory each and got 43 active guest VMs plus the HE-VM running overall.
43*1024 = 44032 MB of guest RAM in total; puma18 was running 29 guest VMs and puma19 was running 14 guest VMs plus the HE-VM (15 VMs including the HE-VM on puma19).

When I tried to start an additional VM, I received this error:
"
Operation Canceled
Error while executing action:

Memory_load-41:

    Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:
    The host puma18.scl.lab.tlv.redhat.com did not satisfy internal filter Memory because its available memory is too low (100.000000 MB) to run the VM.
    The host puma19.scl.lab.tlv.redhat.com did not satisfy internal filter Memory because its available memory is too low (0.000000 MB) to run the VM.
"

RAM reported from the WebUI on puma18 (29 guest VMs running on the host):
Physical Memory:
32067 MB total, 6413 MB used, 25654 MB free 
Swap Size:
2047 MB total, 0 MB used, 2047 MB free
Shared Memory:
27%
Max free Memory for scheduling new VMs:
100 MB

RAM reported from the WebUI on puma19 (the host running the HE-VM with 16 GiB of RAM and 14 guest VMs):
Physical Memory:
32067 MB total, 18278 MB used, 13789 MB free
Swap Size:
2047 MB total, 0 MB used, 2047 MB free
Shared Memory:
7%
Max free Memory for scheduling new VMs:
0 MB

Now I've manually stopped the HE-VM on puma19 to see whether it gets started on puma18 (because of the manual shutdown on puma19, that host's HA score is zeroed and HA should start the HE-VM on the other host with the best positive score, which in my case is puma18, as I have only 2 hosted-engine hosts in my environment).

puma19 ~]# hosted-engine --vm-poweroff
puma19 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma18.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 24c902b4
local_conf_timestamp               : 12304
Host timestamp                     : 518905
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=518905 (Mon Jan  2 14:52:08 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=12304 (Tue Dec 27 18:08:37 2016)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma19.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 0b49f26a
local_conf_timestamp               : 10819
Host timestamp                     : 517458
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=517458 (Mon Jan  2 14:52:34 2017)
        host-id=2
        score=0
        vm_conf_refresh_time=10819 (Tue Dec 27 18:08:23 2016)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False

puma18 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma18.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f5d68fe8
local_conf_timestamp               : 12304
Host timestamp                     : 519035
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=519035 (Mon Jan  2 14:54:19 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=12304 (Tue Dec 27 18:08:37 2016)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma19.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 613a9a8b
local_conf_timestamp               : 10819
Host timestamp                     : 517581
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=517581 (Mon Jan  2 14:54:37 2017)
        host-id=2
        score=0
        vm_conf_refresh_time=10819 (Tue Dec 27 18:08:23 2016)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Wed Jan  7 01:54:53 1970

puma18 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma18.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 29dfc563
local_conf_timestamp               : 12304
Host timestamp                     : 521268
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=521268 (Mon Jan  2 15:31:32 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=12304 (Tue Dec 27 18:08:37 2016)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : puma19.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : fb1a3aa2
local_conf_timestamp               : 10819
Host timestamp                     : 519787
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=519787 (Mon Jan  2 15:31:23 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=10819 (Tue Dec 27 18:08:23 2016)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


Finally the HE-VM was started on puma18 and the host was running 30 VMs overall: 29 guest VMs plus the HE-VM.

puma18:
Physical Memory:
32067 MB total, 10582 MB used, 21485 MB free
Swap Size:
2047 MB total, 0 MB used, 2047 MB free
Shared Memory:
30%
Max free Memory for scheduling new VMs:

Works for me with these components on the hosts:
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.20161221071755.git46cacd3.el7.centos.noarch
ovirt-setup-lib-1.1.0-1.el7.centos.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
ovirt-release41-pre-4.1.0-0.6.beta2.20161221025826.gitc487776.el7.centos.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
ovirt-hosted-engine-ha-2.1.0-0.0.master.20161221070856.20161221070854.git387fa53.el7.centos.noarch
rhevm-appliance-20161116.0-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-host-deploy-1.6.0-0.0.master.20161215101008.gitb76ad50.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
vdsm-4.18.999-1218.gitd36143e.el7.centos.x86_64
ovirt-imageio-daemon-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Nov 16 13:15:13 EST 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
ovirt-engine-setup-plugin-ovirt-engine-common-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-imageio-proxy-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-iso-uploader-4.1.0-0.0.master.20160909154152.git14502bd.el7.centos.noarch
ovirt-engine-userportal-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-dbscripts-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-host-deploy-1.6.0-0.0.master.20161215101008.gitb76ad50.el7.centos.noarch
python-ovirt-engine-sdk4-4.1.0-0.1.a0.20161215git77fce51.el7.centos.x86_64
ovirt-host-deploy-java-1.6.0-0.0.master.20161215101008.gitb76ad50.el7.centos.noarch
ovirt-release41-pre-4.1.0-0.6.beta2.20161221025826.gitc487776.el7.centos.noarch
ovirt-setup-lib-1.1.0-1.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.2-1.el7.noarch
ovirt-engine-dwh-setup-4.1.0-0.0.master.20161129154019.el7.centos.noarch
ovirt-imageio-proxy-setup-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-engine-tools-backup-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-websocket-proxy-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-setup-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-backend-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-tools-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-webadmin-portal-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-restapi-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-wildfly-overlay-10.0.0-1.el7.noarch
ovirt-engine-cli-3.6.9.2-1.el7.centos.noarch
ovirt-web-ui-0.1.1-2.el7.centos.x86_64
ovirt-engine-setup-base-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-vmconsole-1.0.4-1.el7.centos.noarch
ovirt-engine-dwh-4.1.0-0.0.master.20161129154019.el7.centos.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-hosts-ansible-inventory-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-engine-dashboard-1.1.0-0.4.20161128git5ed6f96.el7.centos.noarch
ovirt-engine-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-guest-agent-common-1.0.13-1.20161220085008.git165fff1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.centos.noarch
ovirt-engine-wildfly-10.1.0-1.el7.x86_64
ovirt-engine-lib-4.1.0-0.3.beta2.20161221085908.el7.centos.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (builder.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Dec 6 23:06:41 UTC 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core)

Comment 7 Martin Sivák 2017-03-28 08:33:42 UTC
*** Bug 1436002 has been marked as a duplicate of this bug. ***

