Bug 1820140 - Hosted Engine VM can get memory hotplug to more than the physically available RAM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.4.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ovirt-4.4.1
Target Release: ---
Assignee: Steven Rosenberg
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-02 10:59 UTC by Nikolai Sednev
Modified: 2020-08-17 08:34 UTC (History)

Fixed In Version: rhv-4.4.1-10
Doc Type: Bug Fix
Doc Text:
Previously, with RHV Manager running as a self-hosted engine, the user could hotplug memory on the self-hosted engine virtual machine and exceed the physical memory of the host. In that case, restarting the virtual machine failed due to insufficient memory. The current release fixes this issue. It prevents the user from setting the self-hosted engine virtual machine's memory to exceed the active host's physical memory. You can only save configurations where the self-hosted engine virtual machine's memory is less than the active host's physical memory.
Clone Of:
Environment:
Last Closed: 2020-07-08 08:25:32 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+


Attachments
sosreport from alma03 (13.88 MB, application/x-xz)
2020-04-02 10:59 UTC, Nikolai Sednev
sosreport from alma04 (12.05 MB, application/x-xz)
2020-04-02 11:00 UTC, Nikolai Sednev


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 109698 0 master MERGED core: Memory Size must be less than Physical Size 2020-08-17 08:31:59 UTC

Description Nikolai Sednev 2020-04-02 10:59:44 UTC
Created attachment 1675677 [details]
sosreport from alma03

Description of problem:
The HE VM never starts in the environment and stays stuck in the monitoring loop forever.

On a pair of hosts with 32G of RAM each, deploy HE and hotplug memory to the engine VM until it has 32G of RAM. Power off the engine VM and check that no HA host can start it: the agents stay in the monitoring loop because neither host has enough RAM, and the score of both hosts drops to 0.
MainThread::INFO::2020-04-02 12:49:42,037::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2020-04-02 12:49:52,178::states::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Thu Apr  2 12:47:12 2020
MainThread::INFO::2020-04-02 12:49:52,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2020-04-02 12:50:02,324::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
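
For triage, the agent log lines above can be reduced to a timestamp, the reporting function, and the score. The following is a minimal sketch, assuming every line keeps the "::"-separated layout shown above; the names AgentLogLine and parse_agent_line are hypothetical and are not part of ovirt-hosted-engine-ha.

```python
import re
from typing import NamedTuple, Optional


class AgentLogLine(NamedTuple):
    timestamp: str
    function: str
    message: str
    score: Optional[int]


# Matches both "(score: 0)" and "Score is 0 ..." as seen in the quoted log.
SCORE_RE = re.compile(r"\(score: (\d+)\)|Score is (\d+)")


def parse_agent_line(line: str) -> AgentLogLine:
    """Split one agent log line into its fields (hypothetical helper)."""
    # Layout assumed from the lines above:
    # thread::level::timestamp::module::lineno::logger::(function) message
    parts = line.strip().split("::", 6)
    if len(parts) != 7:
        raise ValueError("unexpected log line layout: %r" % line)
    _thread, _level, timestamp, _module, _lineno, _logger, rest = parts
    function, _, message = rest.partition(") ")
    function = function.lstrip("(")
    match = SCORE_RE.search(message)
    score = int(match.group(1) or match.group(2)) if match else None
    return AgentLogLine(timestamp, function, message, score)


line = (
    "MainThread::INFO::2020-04-02 12:49:42,037::hosted_engine::517::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
    "(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)"
)
parsed = parse_agent_line(line)
print(parsed.timestamp, parsed.function, parsed.score)
```

Splitting on "::" with a maximum of six splits keeps the free-form message intact even if it contains colons of its own.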

alma04 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

alma03 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

alma03 ~]# hosted-engine --vm-status


--== Host alma04.qa.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 235570
Score                              : 0
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : alma04.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : b8e9aa31
conf_on_shared_storage             : True
local_conf_timestamp               : 235571
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=235570 (Thu Apr  2 12:55:29 2020)
        host-id=1
        score=0
        vm_conf_refresh_time=235571 (Thu Apr  2 12:55:29 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Sat Jan  3 19:27:20 1970


--== Host alma03.qa.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 2229
Score                              : 0
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : alma03.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : c286953e
conf_on_shared_storage             : True
local_conf_timestamp               : 2229
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2229 (Thu Apr  2 12:55:33 2020)
        host-id=2
        score=0
        vm_conf_refresh_time=2229 (Thu Apr  2 12:55:34 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 02:38:07 1970
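
The status output above can be checked mechanically to confirm that every host reports a score of 0. This is a rough sketch that parses the plain-text layout exactly as printed above; scores_by_host is a hypothetical helper, and SAMPLE is a trimmed excerpt of the output from this report, not the full text.

```python
import re
from typing import Dict

HEADER_RE = re.compile(r"^--== Host (?P<host>\S+) \(id: (?P<id>\d+)\) status ==--$")

# Trimmed excerpt of the `hosted-engine --vm-status` output quoted above.
SAMPLE = """\
--== Host alma04.qa.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Score                              : 0
Hostname                           : alma04.qa.lab.tlv.redhat.com

--== Host alma03.qa.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Score                              : 0
Hostname                           : alma03.qa.lab.tlv.redhat.com
"""


def scores_by_host(status_text: str) -> Dict[str, int]:
    """Map each host name to the value of its 'Score' field."""
    scores: Dict[str, int] = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = header.group("host")
            continue
        key, sep, value = line.partition(":")
        # Only the top-level "Score : N" field is read; the "score=N"
        # entries under "Extra metadata" use "=" and are ignored.
        if current and sep and key.strip() == "Score":
            scores[current] = int(value.strip())
    return scores


scores = scores_by_host(SAMPLE)
print(scores)
print("all hosts at score 0:", all(s == 0 for s in scores.values()))
```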



alma03 ~]# lsmem
RANGE                                  SIZE  STATE REMOVABLE   BLOCK
0x0000000000000000-0x0000000007ffffff  128M online        no       0
0x0000000008000000-0x0000000067ffffff  1.5G online       yes    1-12
0x0000000068000000-0x000000007fffffff  384M online        no   13-15
0x0000000100000000-0x0000000107ffffff  128M online        no      32
0x0000000108000000-0x000000068fffffff 22.1G online       yes  33-209
0x0000000690000000-0x00000006dfffffff  1.3G online        no 210-219
0x00000006e0000000-0x0000000737ffffff  1.4G online       yes 220-230
0x0000000738000000-0x000000083fffffff  4.1G online        no 231-263
0x0000000840000000-0x0000000857ffffff  384M online       yes 264-266
0x0000000858000000-0x000000087fffffff  640M online        no 267-271

Memory block size:       128M
Total online memory:      32G
Total offline memory:      0B

alma04 ~]# lsmem
RANGE                                  SIZE  STATE REMOVABLE   BLOCK
0x0000000000000000-0x0000000007ffffff  128M online        no       0
0x0000000008000000-0x000000002fffffff  640M online       yes     1-5
0x0000000030000000-0x0000000037ffffff  128M online        no       6
0x0000000038000000-0x0000000047ffffff  256M online       yes     7-8
0x0000000048000000-0x000000004fffffff  128M online        no       9
0x0000000050000000-0x0000000067ffffff  384M online       yes   10-12
0x0000000068000000-0x000000007fffffff  384M online        no   13-15
0x0000000100000000-0x0000000107ffffff  128M online        no      32
0x0000000108000000-0x0000000147ffffff    1G online       yes   33-40
0x0000000148000000-0x0000000157ffffff  256M online        no   41-42
0x0000000158000000-0x000000015fffffff  128M online       yes      43
0x0000000160000000-0x000000016fffffff  256M online        no   44-45
0x0000000170000000-0x0000000177ffffff  128M online       yes      46
0x0000000178000000-0x0000000187ffffff  256M online        no   47-48
0x0000000188000000-0x00000001a7ffffff  512M online       yes   49-52
0x00000001a8000000-0x00000001afffffff  128M online        no      53
0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
0x00000001b8000000-0x00000001bfffffff  128M online        no      55
0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
0x00000001c8000000-0x00000001cfffffff  128M online        no      57
0x00000001d0000000-0x00000001e7ffffff  384M online       yes   58-60
0x00000001e8000000-0x00000001efffffff  128M online        no      61
0x00000001f0000000-0x000000021fffffff  768M online       yes   62-67
0x0000000220000000-0x0000000227ffffff  128M online        no      68
0x0000000228000000-0x000000022fffffff  128M online       yes      69
0x0000000230000000-0x0000000237ffffff  128M online        no      70
0x0000000238000000-0x000000023fffffff  128M online       yes      71
0x0000000240000000-0x0000000247ffffff  128M online        no      72
0x0000000248000000-0x000000027fffffff  896M online       yes   73-79
0x0000000280000000-0x0000000287ffffff  128M online        no      80
0x0000000288000000-0x00000002bfffffff  896M online       yes   81-87
0x00000002c0000000-0x00000002ffffffff    1G online        no   88-95
0x0000000300000000-0x000000030fffffff  256M online       yes   96-97
0x0000000310000000-0x0000000327ffffff  384M online        no  98-100
0x0000000328000000-0x000000032fffffff  128M online       yes     101
0x0000000330000000-0x000000033fffffff  256M online        no 102-103
0x0000000340000000-0x0000000377ffffff  896M online       yes 104-110
0x0000000378000000-0x000000037fffffff  128M online        no     111
0x0000000380000000-0x00000003a7ffffff  640M online       yes 112-116
0x00000003a8000000-0x00000003afffffff  128M online        no     117
0x00000003b0000000-0x00000003d7ffffff  640M online       yes 118-122
0x00000003d8000000-0x00000003f7ffffff  512M online        no 123-126
0x00000003f8000000-0x0000000417ffffff  512M online       yes 127-130
0x0000000418000000-0x000000041fffffff  128M online        no     131
0x0000000420000000-0x0000000427ffffff  128M online       yes     132
0x0000000428000000-0x0000000457ffffff  768M online        no 133-138
0x0000000458000000-0x000000045fffffff  128M online       yes     139
0x0000000460000000-0x0000000497ffffff  896M online        no 140-146
0x0000000498000000-0x000000049fffffff  128M online       yes     147
0x00000004a0000000-0x000000050fffffff  1.8G online        no 148-161
0x0000000510000000-0x0000000517ffffff  128M online       yes     162
0x0000000518000000-0x000000054fffffff  896M online        no 163-169
0x0000000550000000-0x0000000557ffffff  128M online       yes     170
0x0000000558000000-0x0000000597ffffff    1G online        no 171-178
0x0000000598000000-0x000000059fffffff  128M online       yes     179
0x00000005a0000000-0x00000005d7ffffff  896M online        no 180-186
0x00000005d8000000-0x00000005dfffffff  128M online       yes     187
0x00000005e0000000-0x000000060fffffff  768M online        no 188-193
0x0000000610000000-0x0000000617ffffff  128M online       yes     194
0x0000000618000000-0x000000067fffffff  1.6G online        no 195-207
0x0000000680000000-0x0000000687ffffff  128M online       yes     208
0x0000000688000000-0x00000006dfffffff  1.4G online        no 209-219
0x00000006e0000000-0x00000006ffffffff  512M online       yes 220-223
0x0000000700000000-0x00000007ffffffff    4G online        no 224-255
0x0000000800000000-0x000000080fffffff  256M online       yes 256-257
0x0000000810000000-0x0000000817ffffff  128M online        no     258
0x0000000818000000-0x0000000857ffffff    1G online       yes 259-266
0x0000000858000000-0x000000087fffffff  640M online        no 267-271

Memory block size:       128M
Total online memory:      32G
Total offline memory:      0B
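
As a cross-check of the "Total online memory: 32G" figure, the address ranges listed for alma03 can be summed directly. This is only an arithmetic sketch over the ranges printed above; the helper names are hypothetical.

```python
# Address ranges copied from the alma03 lsmem output above (inclusive bounds).
ALMA03_RANGES = [
    (0x0000000000000000, 0x0000000007FFFFFF),
    (0x0000000008000000, 0x0000000067FFFFFF),
    (0x0000000068000000, 0x000000007FFFFFFF),
    (0x0000000100000000, 0x0000000107FFFFFF),
    (0x0000000108000000, 0x000000068FFFFFFF),
    (0x0000000690000000, 0x00000006DFFFFFFF),
    (0x00000006E0000000, 0x0000000737FFFFFF),
    (0x0000000738000000, 0x000000083FFFFFFF),
    (0x0000000840000000, 0x0000000857FFFFFF),
    (0x0000000858000000, 0x000000087FFFFFFF),
]

MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB  # "Memory block size: 128M" in the output above


def total_online_mib(ranges):
    """Sum the sizes of all inclusive address ranges, in MiB."""
    return sum(end - start + 1 for start, end in ranges) // MIB


def block_count(ranges, block_size=BLOCK_SIZE):
    """Count how many memory blocks the ranges cover."""
    return sum((end - start + 1) // block_size for start, end in ranges)


print(block_count(ALMA03_RANGES), "blocks")
print(total_online_mib(ALMA03_RANGES), "MiB")
```

The ranges cover blocks 0-15 and 32-271, so the count is 16 + 240 = 256 blocks of 128M each, which matches the 32G total reported by lsmem.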

Version-Release number of selected component (if applicable):
Tested on host with these components:
rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Engine:
ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
openvswitch2.11-2.11.0-48.el8fdp.x86_64
Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a pair of hosts with 32GB of RAM each.
2. Hotplug memory to the HE VM so that it gets 32GB of RAM.
3. Power off the engine in global maintenance mode.
4. Disable global maintenance mode and check hosted-engine --vm-status and virsh -r list --all on both hosts.

Actual results:
Nothing prevents the customer from setting the HE VM's RAM size equal to the host's maximum available RAM. The HE VM then gets into the monitoring loop, where no HA host can start it. The HE VM never starts in the environment and stays stuck in the monitoring loop forever.

Expected results:
The customer should be warned that the HE VM cannot consume the maximum available RAM, because that is all the RAM available on the host. Memory hotplug should check the maximum RAM on the host and limit what is added to the HE VM so that the host can still start the HE VM.

Additional info:
Logs from both hosts.

Comment 1 Nikolai Sednev 2020-04-02 11:00:59 UTC
Created attachment 1675678 [details]
sosreport from alma04

Comment 2 Michal Skrivanek 2020-04-03 12:35:09 UTC
2020-04-02T10:06:46.235081Z qemu-kvm: cannot set up guest memory 'pc.ram': Cannot allocate memory

In general doesn't sound too interesting. Don't use more memory than you physically have....

Comment 3 Sandro Bonazzola 2020-04-03 13:42:29 UTC
moving to virt, I expect this to be the case also for any other non hosted engine VM.
To me, this can be closed as won't fix. Nobody should allocate more memory than available for VMs.
The only difference here is that once you change the hosted engine VM memory to be too much and shut it down, it takes manual action to go into the OVF and fix it, because there's no engine around to do that through the UI.

Comment 4 Nikolai Sednev 2020-04-05 05:56:56 UTC
(In reply to Michal Skrivanek from comment #2)
> 2020-04-02T10:06:46.235081Z qemu-kvm: cannot set up guest memory 'pc.ram':
> Cannot allocate memory
> 
> In general doesn't sound too interesting. Don't use more memory than you
> physically have....

The system doesn't warn the customer about this; it accepts the change and continues working. But if the engine is then powered off for some reason, that's it: you can't start it. IMHO the only ways to revert this are to alter the OVF manually or to follow the restore procedure, which might not be available.

Comment 5 Arik 2020-06-11 16:02:53 UTC
(In reply to Sandro Bonazzola from comment #3)
> moving to virt, I expect this to be the case also for any other non hosted
> engine VM.

Well, I don't think it would make sense to introduce such a validation for all VMs -
let's say I have a cluster with 100 hosts with 32G each and one host with 64G.
When editing a VM, should the engine know that it can set its memory up to 64G? What if that host disconnects a second later?...

Specifically for HE VM and considering the implication of setting such incorrect configuration (explained in comment 3 and comment 4),
we can limit the memory to the max memory of all (active?) hosts (in the cluster? in the data center? :) ) that are ha-hosts

Comment 6 Nikolai Sednev 2020-07-06 14:08:05 UTC
Host's RAM is Total online memory: 32G
I added more RAM to the engine's 16384MB, bringing it to 18432MB without any issue.
I then tried to go up to the maximum value of 32768MB and received an error:

"
Operation Canceled
Error while executing action: 

HostedEngine:
Cannot edit VM. Memory size (32768MB) cannot exceed the minimal memory size of Hosted Engine hosts (31985MB)."

Works for me on the latest Software Version: 4.4.1.7-0.3.el8ev.
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

Reported issue no longer exists.

Comment 7 Nikolai Sednev 2020-07-06 14:14:51 UTC
Fix: Prevent the user from setting a Hosted Engine Virtual Machine's memory to be larger than the physical memory of the active Hosted Engine Host.

In the error message I saw "Cannot edit VM. Memory size (32768MB) cannot exceed the minimal memory size of Hosted Engine hosts (31985MB).".
"cannot exceed the minimal memory size" should probably be changed to "cannot exceed the maximal memory size".

Comment 8 Michal Skrivanek 2020-07-07 08:18:40 UTC
why? it's the minimum of all the hosted engine hosts memory sizes, as the HE VM needs to be able to run on any of them.
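
The rule described here can be sketched as follows. This is an illustration only, not the engine's actual code: validate_he_vm_memory is a hypothetical name, the 65536MB host is a made-up second host added to show why the minimum is used, and the ">" comparison is an assumption taken from the "cannot exceed" wording of the error message in comment 6.

```python
from typing import Iterable


def validate_he_vm_memory(requested_mb: int, he_host_memory_mb: Iterable[int]) -> None:
    """Raise ValueError if the requested size cannot fit on every HE host.

    Hypothetical sketch: the limit is the smallest physical memory among
    the hosted engine hosts, because the HE VM must be able to start on
    any of them.
    """
    sizes = list(he_host_memory_mb)
    if not sizes:
        raise ValueError("no hosted engine hosts given")
    limit = min(sizes)
    if requested_mb > limit:
        # Message text copied from the error quoted in comment 6.
        raise ValueError(
            "Cannot edit VM. Memory size (%dMB) cannot exceed the minimal "
            "memory size of Hosted Engine hosts (%dMB)." % (requested_mb, limit)
        )


# 31985MB is the value from comment 6; 65536MB is a hypothetical larger host.
hosts_mb = [31985, 65536]

validate_he_vm_memory(18432, hosts_mb)  # accepted, as in comment 6

try:
    validate_he_vm_memory(32768, hosts_mb)
except ValueError as exc:
    print(exc)
```

With a maximum-based limit, 32768MB would be accepted because of the larger host, and the VM could then fail to start whenever only the smaller host is available.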

Comment 9 Nikolai Sednev 2020-07-07 08:36:04 UTC
(In reply to Michal Skrivanek from comment #8)
> why? it's the minimum of all the hosted engine hosts memory sizes, as the HE
> VM needs to be able to run on any of them.

And it's unclear from the message to customers.

Comment 10 Arik 2020-07-07 08:38:41 UTC
(In reply to Nikolai Sednev from comment #9)
> And it's unclear from the message to customers.

How would you suggest improving it?

Comment 11 Arik 2020-07-07 08:44:52 UTC
We can discuss it in bz 1854164

Comment 12 Sandro Bonazzola 2020-07-08 08:25:32 UTC
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

