Created attachment 1675677 [details]
sosreport from alma03

Description of problem:
The HE-VM never starts on the environment and stays stuck in the monitoring loop forever.
On a pair of hosts with 32G of RAM each, deploy HE and use memory hotplug to raise the engine VM's memory to 32G. Power off the engine VM and check that no ha-host can start it again: because of insufficient RAM on both hosts, they stay in the monitoring loop and their score becomes 0.

MainThread::INFO::2020-04-02 12:49:42,037::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2020-04-02 12:49:52,178::states::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Thu Apr 2 12:47:12 2020
MainThread::INFO::2020-04-02 12:49:52,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2020-04-02 12:50:02,324::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)

alma04 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

alma03 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

alma03 ~]# hosted-engine --vm-status

--== Host alma04.qa.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 235570
Score                              : 0
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : alma04.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : b8e9aa31
conf_on_shared_storage             : True
local_conf_timestamp               : 235571
Status up-to-date                  : True
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=235570 (Thu Apr 2 12:55:29 2020)
    host-id=1
    score=0
    vm_conf_refresh_time=235571 (Thu Apr 2 12:55:29 2020)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Sat Jan 3 19:27:20 1970

--== Host alma03.qa.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 2229
Score                              : 0
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : alma03.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : c286953e
conf_on_shared_storage             : True
local_conf_timestamp               : 2229
Status up-to-date                  : True
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2229 (Thu Apr 2 12:55:33 2020)
    host-id=2
    score=0
    vm_conf_refresh_time=2229 (Thu Apr 2 12:55:34 2020)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Thu Jan 1 02:38:07 1970

alma03 ~]# lsmem
RANGE                                   SIZE STATE  REMOVABLE BLOCK
0x0000000000000000-0x0000000007ffffff   128M online no        0
0x0000000008000000-0x0000000067ffffff   1.5G online yes       1-12
0x0000000068000000-0x000000007fffffff   384M online no        13-15
0x0000000100000000-0x0000000107ffffff   128M online no        32
0x0000000108000000-0x000000068fffffff  22.1G online yes       33-209
0x0000000690000000-0x00000006dfffffff   1.3G online no        210-219
0x00000006e0000000-0x0000000737ffffff   1.4G online yes       220-230
0x0000000738000000-0x000000083fffffff   4.1G online no        231-263
0x0000000840000000-0x0000000857ffffff   384M online yes       264-266
0x0000000858000000-0x000000087fffffff   640M online no        267-271

Memory block size: 128M
Total online memory: 32G
Total offline memory: 0B

alma04 ~]# lsmem
RANGE                                   SIZE STATE  REMOVABLE BLOCK
0x0000000000000000-0x0000000007ffffff   128M online no        0
0x0000000008000000-0x000000002fffffff   640M online yes       1-5
0x0000000030000000-0x0000000037ffffff   128M online no        6
0x0000000038000000-0x0000000047ffffff   256M online yes       7-8
0x0000000048000000-0x000000004fffffff   128M online no        9
0x0000000050000000-0x0000000067ffffff   384M online yes       10-12
0x0000000068000000-0x000000007fffffff   384M online no        13-15
0x0000000100000000-0x0000000107ffffff   128M online no        32
0x0000000108000000-0x0000000147ffffff     1G online yes       33-40
0x0000000148000000-0x0000000157ffffff   256M online no        41-42
0x0000000158000000-0x000000015fffffff   128M online yes       43
0x0000000160000000-0x000000016fffffff   256M online no        44-45
0x0000000170000000-0x0000000177ffffff   128M online yes       46
0x0000000178000000-0x0000000187ffffff   256M online no        47-48
0x0000000188000000-0x00000001a7ffffff   512M online yes       49-52
0x00000001a8000000-0x00000001afffffff   128M online no        53
0x00000001b0000000-0x00000001b7ffffff   128M online yes       54
0x00000001b8000000-0x00000001bfffffff   128M online no        55
0x00000001c0000000-0x00000001c7ffffff   128M online yes       56
0x00000001c8000000-0x00000001cfffffff   128M online no        57
0x00000001d0000000-0x00000001e7ffffff   384M online yes       58-60
0x00000001e8000000-0x00000001efffffff   128M online no        61
0x00000001f0000000-0x000000021fffffff   768M online yes       62-67
0x0000000220000000-0x0000000227ffffff   128M online no        68
0x0000000228000000-0x000000022fffffff   128M online yes       69
0x0000000230000000-0x0000000237ffffff   128M online no        70
0x0000000238000000-0x000000023fffffff   128M online yes       71
0x0000000240000000-0x0000000247ffffff   128M online no        72
0x0000000248000000-0x000000027fffffff   896M online yes       73-79
0x0000000280000000-0x0000000287ffffff   128M online no        80
0x0000000288000000-0x00000002bfffffff   896M online yes       81-87
0x00000002c0000000-0x00000002ffffffff     1G online no        88-95
0x0000000300000000-0x000000030fffffff   256M online yes       96-97
0x0000000310000000-0x0000000327ffffff   384M online no        98-100
0x0000000328000000-0x000000032fffffff   128M online yes       101
0x0000000330000000-0x000000033fffffff   256M online no        102-103
0x0000000340000000-0x0000000377ffffff   896M online yes       104-110
0x0000000378000000-0x000000037fffffff   128M online no        111
0x0000000380000000-0x00000003a7ffffff   640M online yes       112-116
0x00000003a8000000-0x00000003afffffff   128M online no        117
0x00000003b0000000-0x00000003d7ffffff   640M online yes       118-122
0x00000003d8000000-0x00000003f7ffffff   512M online no        123-126
0x00000003f8000000-0x0000000417ffffff   512M online yes       127-130
0x0000000418000000-0x000000041fffffff   128M online no        131
0x0000000420000000-0x0000000427ffffff   128M online yes       132
0x0000000428000000-0x0000000457ffffff   768M online no        133-138
0x0000000458000000-0x000000045fffffff   128M online yes       139
0x0000000460000000-0x0000000497ffffff   896M online no        140-146
0x0000000498000000-0x000000049fffffff   128M online yes       147
0x00000004a0000000-0x000000050fffffff   1.8G online no        148-161
0x0000000510000000-0x0000000517ffffff   128M online yes       162
0x0000000518000000-0x000000054fffffff   896M online no        163-169
0x0000000550000000-0x0000000557ffffff   128M online yes       170
0x0000000558000000-0x0000000597ffffff     1G online no        171-178
0x0000000598000000-0x000000059fffffff   128M online yes       179
0x00000005a0000000-0x00000005d7ffffff   896M online no        180-186
0x00000005d8000000-0x00000005dfffffff   128M online yes       187
0x00000005e0000000-0x000000060fffffff   768M online no        188-193
0x0000000610000000-0x0000000617ffffff   128M online yes       194
0x0000000618000000-0x000000067fffffff   1.6G online no        195-207
0x0000000680000000-0x0000000687ffffff   128M online yes       208
0x0000000688000000-0x00000006dfffffff   1.4G online no        209-219
0x00000006e0000000-0x00000006ffffffff   512M online yes       220-223
0x0000000700000000-0x00000007ffffffff     4G online no        224-255
0x0000000800000000-0x000000080fffffff   256M online yes       256-257
0x0000000810000000-0x0000000817ffffff   128M online no        258
0x0000000818000000-0x0000000857ffffff     1G online yes       259-266
0x0000000858000000-0x000000087fffffff   640M online no        267-271

Memory block size: 128M
Total online memory: 32G
Total offline memory: 0B

Version-Release number of selected component (if applicable):
Tested on host with these components:
rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Engine:
ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
openvswitch2.11-2.11.0-48.el8fdp.x86_64
Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a pair of hosts with 32GB RAM.
2. Use memory hotplug to add memory to the HE VM so that it gets 32GB RAM.
3. Power off the engine in global maintenance mode.
4. Disable global maintenance mode and check hosted-engine --vm-status and virsh -r list --all on both hosts.

Actual results:
Nothing prevents the customer from setting the HE-VM's RAM size equal to the host's maximum available RAM. The HE-VM gets into the monitoring loop, where no ha-host can start it. The HE-VM never starts on the environment and stays stuck in the monitoring loop forever.

Expected results:
The customer should be warned that the HE-VM cannot consume all of the RAM available on the host. Memory hotplug should check the host's maximum RAM and limit what can be added to the HE-VM so that the host is still able to start the HE-VM.

Additional info:
Logs from both hosts.
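For anyone triaging a similar report, the agent log lines quoted in the description can be checked mechanically. The following is only a rough sketch: the regular expression is derived from the "(_monitoring_loop) Current state ... (score: N)" lines pasted above, and the helper names (monitoring_states, looks_stuck) and the threshold of three repeats are made up for illustration; they are not part of ovirt-hosted-engine-ha.

```python
import re
from typing import Iterable, List, Tuple

# Matches the "(_monitoring_loop) Current state <State> (score: <N>)" lines
# quoted in this report. The pattern is inferred from those samples only.
STATE_RE = re.compile(r"\(_monitoring_loop\) Current state (\w+) \(score: (\d+)\)")


def monitoring_states(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (state, score) for every monitoring-loop line, in order."""
    states = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            states.append((match.group(1), int(match.group(2))))
    return states


def looks_stuck(lines: Iterable[str], min_repeats: int = 3) -> bool:
    """Hypothetical heuristic: the last `min_repeats` monitoring-loop entries
    are all EngineUnexpectedlyDown with score 0."""
    states = monitoring_states(lines)
    if len(states) < min_repeats:
        return False
    return all(s == ("EngineUnexpectedlyDown", 0) for s in states[-min_repeats:])


# Log lines copied from the description above.
SAMPLE_LOG = [
    "MainThread::INFO::2020-04-02 12:49:42,037::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)",
    "MainThread::INFO::2020-04-02 12:49:52,178::states::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Thu Apr 2 12:47:12 2020",
    "MainThread::INFO::2020-04-02 12:49:52,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)",
    "MainThread::INFO::2020-04-02 12:50:02,324::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)",
]

if __name__ == "__main__":
    print(looks_stuck(SAMPLE_LOG))
```

The heuristic only says that the agent keeps reporting the same state; it does not identify the cause, which here was found separately in the qemu-kvm "Cannot allocate memory" message.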
Created attachment 1675678 [details] sosreport from alma04
2020-04-02T10:06:46.235081Z qemu-kvm: cannot set up guest memory 'pc.ram': Cannot allocate memory

In general doesn't sound too interesting. Don't use more memory than you physically have....
Moving to virt; I expect this to be the case for any non-hosted-engine VM as well. To me, this can be closed as won't fix. Nobody should allocate more memory to VMs than is available. The only difference here is that once you set the hosted engine VM's memory too high and shut it down, it takes manual action to go into the OVF and fix it, because there's no engine around to do that through the UI.
(In reply to Michal Skrivanek from comment #2)
> 2020-04-02T10:06:46.235081Z qemu-kvm: cannot set up guest memory 'pc.ram':
> Cannot allocate memory
>
> In general doesn't sound too interesting. Don't use more memory than you
> physically have....

The system doesn't warn the customer about that; it accepts the change and continues working. But if the engine is then powered off for some reason, that's it: you can't start it. The only way, IMHO, to revert this is to manually alter the OVF or to follow the restore procedure, which might not be available.
(In reply to Sandro Bonazzola from comment #3)
> moving to virt, I expect this to be the case also for any other non hosted
> engine VM.

Well, I don't think it would make sense to introduce such a validation for all VMs. Let's say I have a cluster with 100 hosts with 32G and one host with 64G: when editing a VM, should the engine know that it can set its memory up to 64G? What if that host disconnects a second later?...

Specifically for the HE VM, and considering the implications of setting such an incorrect configuration (explained in comment 3 and comment 4), we can limit the memory to the max memory of all (active?) hosts (in the cluster? in the data center? :) ) that are ha-hosts.
Host's RAM is:
Total online memory: 32G

I added more RAM to the engine's 16384MB, bringing it to 18432MB without any issue.
I then tried to raise it to the maximum value of 32768MB and received an error:
"Operation Canceled
Error while executing action:
HostedEngine:
Cannot edit VM. Memory size (32768MB) cannot exceed the minimal memory size of Hosted Engine hosts (31985MB)."

Works for me on latest Software Version:4.4.1.7-0.3.el8ev.
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

Reported issue no longer exists.
Fix: Prevent the user from setting a Hosted Engine Virtual Machine's memory to be larger than the physical memory of the active Hosted Engine Host. I saw in the error message "Cannot edit VM. Memory size (32768MB) cannot exceed the minimal memory size of Hosted Engine hosts (31985MB).". "cannot exceed the minimal memory size" probably should be changed to "cannot exceed the maximal memory size".
why? it's the minimum of all the hosted engine hosts memory sizes, as the HE VM needs to be able to run on any of them.
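To make the "minimum of all the hosted engine hosts" rule concrete, here is a minimal sketch of such a check. It is not the engine's actual code: the function names are hypothetical, the message text is copied from the error quoted earlier in this bug, 31985MB is the limit reported there, and the second host size (65536MB) is an invented value used only to show why the smallest host decides.

```python
from typing import Iterable, Optional


def he_memory_limit_mb(he_host_memory_mb: Iterable[int]) -> int:
    """Largest HE VM memory that still fits on every hosted engine host.

    The HE VM must be startable on any HA host, so the smallest host
    sets the limit. (Illustrative helper, not an oVirt API.)
    """
    sizes = list(he_host_memory_mb)
    if not sizes:
        raise ValueError("no hosted engine hosts to validate against")
    return min(sizes)


def validate_he_memory(requested_mb: int,
                       he_host_memory_mb: Iterable[int]) -> Optional[str]:
    """Return an error message if the request exceeds the limit, else None."""
    limit = he_memory_limit_mb(he_host_memory_mb)
    if requested_mb > limit:
        return (
            f"Cannot edit VM. Memory size ({requested_mb}MB) cannot exceed "
            f"the minimal memory size of Hosted Engine hosts ({limit}MB)."
        )
    return None


if __name__ == "__main__":
    # 31985 is the limit reported in this bug; 65536 is a made-up larger host.
    hosts = [31985, 65536]
    print(validate_he_memory(18432, hosts))
    print(validate_he_memory(32768, hosts))
```

With a cluster-wide maximum instead of the minimum, the 32768MB request in this example would be accepted even though only the larger host could ever start the VM.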
(In reply to Michal Skrivanek from comment #8)
> why? it's the minimum of all the hosted engine hosts memory sizes, as the HE
> VM needs to be able to run on any of them.

And that's unclear to customers from the message.
(In reply to Nikolai Sednev from comment #9)
> And it's unclear from the message to customers.

How would you suggest improving it?
We can discuss it in bz 1854164
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.