Bug 1427301 - ovirt-engine killed by oom-kill on a 4GB ram Hosted Engine with 16 VMs managed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-appliance
Classification: oVirt
Component: General
Version: 4.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.1
Assignee: Yuval Turgeman
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-2
 
Reported: 2017-02-27 20:35 UTC by Bryan Gurney
Modified: 2017-04-21 09:30 UTC (History)

Fixed In Version: rhvm-appliance-4.1-20170403.0
Doc Type: Bug Fix
Doc Text:
Cause: The appliance is shipped with 4 GB of RAM.
Consequence: ovirt-engine gets killed (out of memory).
Fix: Set the memory and CPUs to their recommended values (16 GB of RAM and 4 CPUs).
Result: The appliance is shipped with the engine's recommended RAM/CPU values.
Clone Of:
Environment:
Last Closed: 2017-04-21 09:30:30 UTC
oVirt Team: Integration
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
/var/log/messages output from the time of oom-killer (16.94 KB, text/plain)
2017-02-27 20:38 UTC, Bryan Gurney
Output from "ps aux" (sorted by RSS descending) from the hosted-engine VM 2 days after reboot (15.11 KB, text/plain)
2017-02-27 20:40 UTC, Bryan Gurney
/var/log/messages-20170226 file (2017-02-20 14:29:28 EST to 2017-02-26 03:31:01 EST) (805.77 KB, text/plain)
2017-03-01 13:38 UTC, Bryan Gurney
/var/log/ovirt-engine/server.log from the hosted-engine VM (858.13 KB, text/plain)
2017-03-01 14:30 UTC, Bryan Gurney
/var/log/ovirt-engine/engine.log-20170223.gz from the hosted-engine VM (1.52 MB, application/x-gzip)
2017-03-01 14:33 UTC, Bryan Gurney


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 73335 master ABANDONED build: set appliance mem to 8G and CPUs to 2 2017-03-02 13:18:49 UTC

Description Bryan Gurney 2017-02-27 20:35:22 UTC
Description of problem:


Version-Release number of selected component (if applicable): ovirt-hosted-engine-setup-2.1.0.1 (hosts are running RHVH-4.1-20170209.0)


How reproducible: Uncertain; this was encountered one time after the first time completing the RHV environment setup process, approximately 5 days prior to the occurrence of the issue.


Steps to Reproduce:
1. Install RHV on a 3-node cluster using the steps outlined in https://access.redhat.com/articles/2578391 as a guide.

2. Run ovirt-hosted-engine-setup to set up a hosted engine on one of the nodes.  Do not use an answer file; instead, enter the options when requested.  Use defaults for all of the options that have defaults, and enter information for all of the options that do not have a default.  (In this state, the "OVEHOSTED_VM/applianceMem" should have a value of "4096", indicating 4 GiB of memory for the virtual machine.)

3. Start using the cluster.  (We created 15 VMs with 2 GiB of memory and 2 CPU cores, an IDE disk device with 180 GiB virtual size, and RHEL 7 installed as the operating system.)
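To confirm which memory value a deployment used (step 2 above), the answer file written by the setup tool can be inspected. The following is a minimal sketch, assuming the answer file uses `KEY=type:value` lines under an `[environment:default]` section header; that layout and the sample content are assumptions for illustration, and only the key name "OVEHOSTED_VM/applianceMem" comes from this report.

```python
def parse_answer_file(text):
    """Parse 'KEY=type:value' lines into a dict.

    Section headers ('[...]'), blank lines and '#' comments are skipped.
    The 'type:value' layout is an assumption about the answer file format.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, _, typed = line.partition("=")
        kind, _, value = typed.partition(":")
        # Convert only values explicitly typed as int; keep the rest as text.
        values[key.strip()] = int(value) if kind == "int" else value
    return values


def appliance_mem_mib(text, key="OVEHOSTED_VM/applianceMem"):
    """Return the configured appliance memory in MiB, or None if absent."""
    return parse_answer_file(text).get(key)


# Illustrative content only; not copied from a real deployment.
SAMPLE = """\
[environment:default]
OVEHOSTED_VM/applianceMem=int:4096
"""

if __name__ == "__main__":
    print(appliance_mem_mib(SAMPLE))
```

A value of 4096 here would correspond to the 4 GiB default described in step 2.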

Actual results:
The java process that was running the ovirt-engine server instance was terminated by oom-killer.

Expected results:
The java process that runs the ovirt-engine server instance is left alone.


Additional info:
After reporting this to our contact at Red Hat, it was recommended that we increase the memory on the hosted engine VM to 8 GiB.  2 days and 20 hours after this change was made, running "free -m" on the system shows 2670 MiB used, 3440 MiB free, 1710 MiB buff/cache.
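As a rough check of the "free -m" figures above, the three values can be summed and turned into percentages. This is only a sketch, and it assumes that used, free and buff/cache add up to the total reported by "free" (the function name is hypothetical).

```python
def summarize_free(used_mib, free_mib, buff_cache_mib):
    """Summarize 'free -m' figures, assuming they add up to the total."""
    total = used_mib + free_mib + buff_cache_mib
    return {
        "total_mib": total,
        # Share of memory actively used by processes.
        "used_pct": round(100.0 * used_mib / total, 1),
        # Free memory plus buff/cache, most of which the kernel can reclaim.
        "free_or_cache_pct": round(100.0 * (free_mib + buff_cache_mib) / total, 1),
    }


if __name__ == "__main__":
    # Figures reported after the VM was resized to 8 GiB.
    print(summarize_free(2670, 3440, 1710))
```

Under that assumption the total comes to 7820 MiB, which is in line with an 8 GiB VM, with roughly a third of it in active use.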

Comment 1 Bryan Gurney 2017-02-27 20:38:33 UTC
Created attachment 1258190 [details]
/var/log/messages output from the time of oom-killer

Comment 2 Bryan Gurney 2017-02-27 20:40:41 UTC
Created attachment 1258191 [details]
Output from "ps aux" (sorted by RSS descending) from the hosted-engine VM 2 days after reboot
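For anyone repeating this analysis, a small script can total the RSS column of "ps aux" output per user and list the largest processes. This is a sketch only: it assumes the standard 11-column "ps aux" layout with RSS in KiB, and the sample rows are made up for illustration, not taken from the attachment.

```python
from collections import defaultdict


def _rows(ps_output):
    """Yield (user, rss_kib, command) from 'ps aux' text, skipping the header."""
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split(None, 10)  # COMMAND may contain spaces
        if len(fields) < 11:
            continue
        yield fields[0], int(fields[5]), fields[10]


def rss_by_user(ps_output):
    """Total RSS in KiB per user."""
    totals = defaultdict(int)
    for user, rss_kib, _ in _rows(ps_output):
        totals[user] += rss_kib
    return dict(totals)


def top_rss(ps_output, n=5):
    """Return the n largest processes as (rss_mib, command), largest first."""
    rows = [(rss_kib / 1024.0, cmd) for _, rss_kib, cmd in _rows(ps_output)]
    return sorted(rows, reverse=True)[:n]


# Made-up sample rows for illustration.
SAMPLE_PS = """\
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START   TIME COMMAND
ovirt     1001  5.0 25.0 4000000 1048576 ?     Sl   Feb25  10:00 java -server
postgres  1002  1.0  5.0  400000  204800 ?     Ss   Feb25   1:00 postgres: engine
postgres  1003  0.5  2.5  400000  102400 ?     Ss   Feb25   0:30 postgres: dwh
"""

if __name__ == "__main__":
    print(rss_by_user(SAMPLE_PS))
    print(top_rss(SAMPLE_PS, n=2))
```

Note that summing RSS overstates real usage when processes share pages, so the totals are an upper bound rather than an exact figure.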

Comment 3 Sandro Bonazzola 2017-02-28 08:09:24 UTC
This doesn't look like a hosted-engine-specific issue. If the engine were running on a 4 GB bare metal server, it would have been killed as well.
Moving to ovirt-engine and infra team for further investigation.

Comment 4 Doron Fediuck 2017-02-28 11:24:39 UTC
Adding some info from our mail thread:

On 02/24/2017 03:32 PM, Bryan Gurney wrote:
> Yes, the hosted engine VM has 4 GB of RAM configured; that was the default
> during the ovirt-hosted-engine-setup process.  Should the default for the
> hosted engine's VM be set higher than 4 GB?
>

Yes, I think you need to set it to a higher value. The engine is
configured by default to use at most 1 GiB of heap. But then it also
consumes off-heap space (stacks, native buffers, etc), so it can consume
up to 2 GiB of RSS (assuming there are no off-heap leaks). Then you have
also the DWH server, and the database. For production environments with
all the components in one machine we usually recommend 16 GiB. That is
way too much, in my opinion. I think that something like 8 GiB can be
healthy for you. Note that as this is a VM, the memory won't be wasted,
the hypervisor will dedicate the unused memory to other VMs.

Comment 5 Sandro Bonazzola 2017-03-01 09:45:50 UTC
Let's increase the appliance default memory to 8 GB.

Comment 6 Martin Perina 2017-03-01 10:51:47 UTC
(In reply to Sandro Bonazzola from comment #5)
> Let's increase the appliance default memory to 8 GB.

Please don't do that yet; we need to understand the root cause of the issue. According to QA, increasing the default memory for the appliance will have a significant impact on QA automation.

Roy, could you please investigate this issue?

Comment 7 Martin Perina 2017-03-01 10:52:42 UTC
Could you please share the complete engine logs from before the OOM kill, so we can investigate engine utilization?

Comment 8 Yaniv Lavi 2017-03-01 13:17:06 UTC
We want to align the spec with the minimum requirement for a bare metal engine and allow the user to choose less. In the Grafton use case they would want to decide on the amount allocated. Sahina, please track this bug and decide whether to pass a lower memory parameter to HE setup with an answer file.

Comment 9 Sandro Bonazzola 2017-03-01 13:22:30 UTC
(In reply to Yaniv Dary from comment #8)
> We want to align the spec with the minimum requirement for a bare metal engine
> and allow the user to choose less.

Note that according to [1], the minimum for bare metal is 4 GB and the recommended value is 16 GB.
In the phone call discussion it was suggested that we go with the recommended values.

Ok to set the appliance to recommended?
- 16 GB RAM
- 4 cores
- 50 GB disk

[1] http://www.ovirt.org/documentation/install-guide/chap-System_Requirements/
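The difference between the minimum and recommended figures above can be made explicit with a small check. This is a sketch using only the numbers quoted in this comment (4 GB minimum RAM; 16 GB RAM, 4 cores and 50 GB disk recommended); the function names are hypothetical and not part of any oVirt tool.

```python
MINIMUM_RAM_GIB = 4
RECOMMENDED = {"ram_gib": 16, "cores": 4, "disk_gib": 50}


def classify_ram(ram_gib):
    """Classify a RAM size against the minimum and recommended values."""
    if ram_gib < MINIMUM_RAM_GIB:
        return "below minimum"
    if ram_gib < RECOMMENDED["ram_gib"]:
        return "minimum met, below recommended"
    return "recommended met"


def shortfalls(spec):
    """Return how far each resource in spec falls short of the recommendation."""
    return {
        name: target - spec.get(name, 0)
        for name, target in RECOMMENDED.items()
        if spec.get(name, 0) < target
    }


if __name__ == "__main__":
    # The original appliance default (4 GB) and the 8 GB / 2 CPU proposal.
    print(classify_ram(4))
    print(shortfalls({"ram_gib": 8, "cores": 2, "disk_gib": 50}))
```

Both the original 4 GB default and the 8 GB proposal land in the "minimum met, below recommended" band, which is why the choice between them was a policy question rather than a hard requirement.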

> In the Grafton use case they would want to
> decide on the amount allocated. Sahina, please track this bug and decide whether
> to pass a lower memory parameter to HE setup with an answer file.

Comment 10 Bryan Gurney 2017-03-01 13:38:14 UTC
Created attachment 1258682 [details]
/var/log/messages-20170226 file (2017-02-20 14:29:28 EST to 2017-02-26 03:31:01 EST)

I've attached the file from /var/log/messages-20170226 on the hosted-engine VM, which covers the time of the oom-killer event (Feb 22 11:38:48 EST).
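To locate oom-killer events in a file like this, a short scan for the kernel's messages is usually enough. The sketch below assumes the "invoked oom-killer" and "Killed process PID (name) ... anon-rss:NkB" message forms that kernels of this generation log; the sample lines and the hostname "he-vm" are made up for illustration and are not taken from the attachment.

```python
import re

INVOKED = re.compile(r"(\S+) invoked oom-killer")
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\).*?anon-rss:(\d+)kB")


def find_oom_events(log_text):
    """Return (triggers, kills) found in syslog-style text.

    triggers: names of tasks whose allocation invoked the oom-killer.
    kills: dicts with the pid, name and anonymous RSS (MiB) of killed tasks.
    """
    triggers, kills = [], []
    for line in log_text.splitlines():
        invoked = INVOKED.search(line)
        if invoked:
            triggers.append(invoked.group(1))
        killed = KILLED.search(line)
        if killed:
            kills.append({
                "pid": int(killed.group(1)),
                "name": killed.group(2),
                "anon_rss_mib": int(killed.group(3)) // 1024,
            })
    return triggers, kills


# Made-up sample lines for illustration.
SAMPLE_LOG = """\
Feb 22 11:38:48 he-vm kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 22 11:38:48 he-vm kernel: Out of memory: Kill process 1234 (java) score 300 or sacrifice child
Feb 22 11:38:48 he-vm kernel: Killed process 1234 (java) total-vm:4000000kB, anon-rss:1536000kB, file-rss:0kB
"""

if __name__ == "__main__":
    print(find_oom_events(SAMPLE_LOG))
```

The task that invokes the oom-killer is not necessarily the one that gets killed, so the two lists are kept separate.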

Comment 11 Martin Perina 2017-03-01 14:17:59 UTC
Bryan, could you please also provide engine.log and server.log from that time, as I mentioned in Comment 7? We are not able to see the engine's internal processes from /var/log/messages ...

Comment 13 Bryan Gurney 2017-03-01 14:30:25 UTC
Created attachment 1258690 [details]
/var/log/ovirt-engine/server.log from the hosted-engine VM

Comment 14 Bryan Gurney 2017-03-01 14:33:35 UTC
Created attachment 1258691 [details]
/var/log/ovirt-engine/engine.log-20170223.gz from the hosted-engine VM

I've attached the server.log and the engine.log file that covers the time of the oom-killer event.  (The engine.log-20170223.gz file covers from 2017-02-22 03:16:06,312-05 to 2017-02-23 02:40:22,316-05.)

Comment 18 Yaniv Kaul 2017-03-02 15:42:28 UTC
What will happen on upgrade? 
Do we need docs bug or release note on it?

Comment 19 Yuval Turgeman 2017-03-02 16:28:47 UTC
I am not that familiar with the upgrade process yet, I can check.

Comment 21 Ryan Barry 2017-04-04 18:26:59 UTC
(In reply to Yaniv Kaul from comment #18)
> What will happen on upgrade? 
> Do we need docs bug or release note on it?

We'll need a release note, probably.

Even if the appliance is 'yum updated', this just ships a new OVA. I'm not sure whether ovirt-hosted-engine-ha|agent|setup automatically pick this up and use the new OVA (and associated VM definition), but I doubt it...

Simone?

Comment 23 Nikolai Sednev 2017-04-06 04:58:32 UTC
The required parameters are received from the appliance during deployment; until we have an appliance that provides them, we can't proceed with the verification.
Returning to assigned.

Comment 24 Red Hat Bugzilla Rules Engine 2017-04-06 04:58:43 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 25 Ryan Barry 2017-04-06 05:03:16 UTC
(In reply to Nikolai Sednev from comment #23)
> The required parameters are received from the appliance during deployment;
> until we have an appliance that provides them, we can't proceed with the verification.
> Returning to assigned.

A new build was delivered today. You should have been on the smoke test...

Comment 28 Simone Tiraboschi 2017-04-06 07:19:51 UTC
(In reply to Ryan Barry from comment #21)
> Even if the appliance is 'yum updated', this just ships a new OVA. I'm not
> sure whether ovirt-hosted-engine-ha|agent|setup automatically pick this up
> and use the new OVA (and associated VM definition), but I doubt it...
> 
> Simone?

No, it will just update the OVA source but nothing will automatically happen on the running engine VM.
The user has to manually edit the VM definition in the engine, allocating more RAM, and then restart the VM for the change to take effect.

Comment 29 Nikolai Sednev 2017-04-09 07:34:32 UTC
rhvm-appliance-4.1.20170403.0-1.el7.noarch now brings these new default values:

Please specify the memory size of the VM in MB (Defaults to appliance OVF value): [16384]: 
          The following CPU types are supported by this host:
                 - model_Westmere: Intel Westmere Family
                 - model_Nehalem: Intel Nehalem Family
                 - model_Penryn: Intel Penryn Family
                 - model_Conroe: Intel Conroe Family
          Please specify the CPU type to be used by the VM [model_Westmere]: 
          Please specify the number of virtual CPUs for the VM (Defaults to appliance OVF value): [4]: 

Moving to verified.

Components on hosts:
rhvm-appliance-4.1.20170403.0-1.el7.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
vdsm-4.19.10.1-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild@x86-039.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

