Bug 1026100

Summary: ovirt-engine is killed by oom-killer in is21
Product: Red Hat Enterprise Virtualization Manager
Reporter: Katarzyna Jachim <kjachim>
Component: ovirt-engine
Assignee: Juan Hernández <juan.hernandez>
Status: CLOSED CURRENTRELEASE
QA Contact: Katarzyna Jachim <kjachim>
Severity: high
Priority: urgent
Version: 3.3.0
CC: aberezin, acathrow, amureini, bazulay, eedri, gickowic, iheim, juan.hernandez, kjachim, lpeer, lustalov, ncredi, pprakash, pstehlik, Rhev-m-bugs, srevivo, yeylon
Keywords: AutomationBlocker, TestBlocker
Target Release: 3.2.6
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: is23.1
Doc Type: Bug Fix
Type: Bug
oVirt Team: Infra
Last Closed: 2014-01-21 22:16:21 UTC
Bug Blocks: 1032811
Attachments (flags: none on all):
- /var/log/messages from machines with ovirt engines from linked failed tests
- test logs (vdsm, engine, server etc.)
- test logs (vdsm, engine, server etc.) from network test
- /var/log/messages from machine with 4GB RAM & is22

Comment 1 Katarzyna Jachim 2013-11-03 17:20:55 UTC
Created attachment 818797 [details]
/var/log/messages from machines with ovirt engines from linked failed tests

Comment 2 Katarzyna Jachim 2013-11-03 17:22:23 UTC
Created attachment 818798 [details]
test logs (vdsm, engine, server etc.)

Comment 3 Katarzyna Jachim 2013-11-03 17:23:56 UTC
Created attachment 818799 [details]
test logs (vdsm, engine, server etc.) from network test

Comment 8 Juan Hernández 2013-11-06 14:59:21 UTC
If the kernel OOM killer killed the process, there will be no heap dump: it is the Java virtual machine that writes the dump, and it is already dead. Dump generation only works when the Java virtual machine itself detects an OutOfMemoryError.
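
For reference, the JVM-side dump described above is typically enabled with the HotSpot HeapDumpOnOutOfMemoryError flag. A minimal sketch, assuming the engine picks up extra JVM options through a JAVA_OPTS-style variable (the variable name and dump path here are illustrative, not the engine's actual configuration):

  # The JVM writes a heap dump only when it throws OutOfMemoryError itself;
  # the kernel OOM killer sends SIGKILL, so no dump is possible in that case.
  JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError \
      -XX:HeapDumpPath=/var/log/ovirt-engine/dumps"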

Comment 9 Juan Hernández 2013-11-06 15:56:42 UTC
Looking at the output from the OOM killer, I see that the Java virtual machine was killed because it was the single process consuming the most RAM. But it was consuming only 440 MiB of actual RAM and 2.69 GiB of virtual address space. This is normal.

What is really consuming RAM on that machine is PostgreSQL. It has 81 processes, each consuming an average of 26 MiB, for a total of 2.1 GiB. But the OOM killer doesn't take these processes into account because the parent process protects itself from the OOM killer. From the PostgreSQL start script:

  PG_OOM_ADJ=-17
  ...
  test x"$PG_OOM_ADJ" != x && echo "$PG_OOM_ADJ" > /proc/self/oom_adj

As the real memory hog here is PostgreSQL, I would suggest reducing its memory usage by reducing the number of connections to the database (PostgreSQL creates one subprocess per connection). In /etc/ovirt-engine/engine.conf add the following:

  ENGINE_DB_MIN_CONNECTIONS=1
  ENGINE_DB_MAX_CONNECTIONS=50 # The default is 100

This may be unfeasible, as it will probably introduce other problems. All in all, the only viable solution may be to increase the memory available in the machine (apparently they have only 3 GiB) or to move the database to an external machine.

Comment 10 Barak 2013-11-07 11:30:21 UTC
Katarzyna, Gadi,

A few questions:
- is this a clear issue that appears only starting at is21?
- Is there something new in is21 in the test env:
  * a new RHEL 6.5 version?
  * a new EAP version?
- Can you please rerun those tests with 4.5 GB and see if they pass?

Comment 11 Katarzyna Jachim 2013-11-07 12:25:54 UTC
(In reply to Barak from comment #10)

> - is this a clear issue that appears only starting at is21?

Yes, it has never happened before in nightly runs. It did happen on my (and only my) environment - see https://bugzilla.redhat.com/show_bug.cgi?id=910779

Comment 12 Gadi Ickowicz 2013-11-07 13:01:18 UTC
(In reply to Barak from comment #10)
> Katarzyna, Gadi,
> 
> A few questions:
> - is this a clear issue that appears only starting at is21?
I have never had this happen on my engine before is21.
> - Is there something new in is21 in the test env:
>   * a new RHEL 6.5 version?
>   * a new EAP version?
In my environment (and it should be identical to our testing env for nightly runs) I have been using JBEAP-6.2.0.ER7 since is21.
> - Can you please rerun those tests with 4.5 GB and see if they pass?

Comment 13 Juan Hernández 2013-11-07 18:23:03 UTC
Have we always tested this with 3 GiB of RAM? That seems strange to me, because we officially require a minimum of 4 GiB.

Comment 14 Barak 2013-11-07 18:32:18 UTC
Juan - per comment #12 it looks like there is a correlation between JBEAP-6.2.0.ER7 and this failure.
Can you please take a look into it?

Comment 15 Barak 2013-11-07 18:36:55 UTC
Katarzyna,

To understand the urgency - can you please rerun those tests with 4.5 GB and see if they pass?


In addition, in comment #11 (Bug 910779) you mentioned that it had also happened on is18. Did you by any chance start using JBEAP-6.2.0.ER5 in the test env on is18?

Comment 16 Gadi Ickowicz 2013-11-10 06:58:48 UTC
(In reply to Juan Hernández from comment #13)
> Have we always tested this with 3 GiB of RAM? That seems strange to me,
> because we officially require a minimum of 4 GiB.
I am using 4 GiB of RAM on my engine.

Comment 17 Barak 2013-11-11 09:31:11 UTC
Katarzyna, please provide the info on comment #15

Comment 18 Barak 2013-11-11 09:32:50 UTC
juan, can you please take a look into comment #14

Comment 19 Juan Hernández 2013-11-11 09:56:36 UTC
Gadi, regarding comment #16, according to the attached logs the machines where the failure was detected had only 3 GiB of RAM:

Nov  3 08:40:15 jenkins-vm-02 kernel: 786428 pages RAM
Nov  3 03:28:19 jenkins-vm-09 kernel: 786428 pages RAM

(786428 pages * 4 KiB/page = 3145712 KiB ≈ 3 GiB)

If we are seeing this on machines with 4 GiB of RAM as well, I would like to get the logs (the /var/log/messages files) from those.

Barak, there could be a correlation with the version of EAP, but I don't really expect one, as EAP isn't consuming the amount of memory that causes the problem; PostgreSQL is. This can be caused by a change in the version of PostgreSQL or by a change in the behavior of the application: it may be using more connections than before.

As it isn't clear how much RAM we are using to run the tests (3 or 4 GiB), I would suggest verifying how much RAM we used before is21 and how much we use now.

In addition, I don't have an environment where I can reproduce this, so I would appreciate it if QE could prepare one so that I can connect and check what is happening when the issue appears. Katarzyna, Gadi, is that feasible?

Comment 20 Katarzyna Jachim 2013-11-12 08:20:43 UTC
Juan, we have increased the memory on our machines to 4 GB, but it hasn't helped; check the attached log (it is already is22 running there). I can also give you access to my setup: there is still is21 on it, and it dies constantly after a few hours of doing nothing (just create a VM and leave it) - just ping me on IRC. You can also start the test which always fails because of this problem (3.3-storage_export_import_*) in our Jenkins and log into that machine - if you don't have access to it (you probably don't), again, ping me on IRC and I can start it and send you information about where it is running - it should fail after ca. 1.5 hours.

Comment 21 Katarzyna Jachim 2013-11-12 08:21:31 UTC
Created attachment 822778 [details]
/var/log/messages from machine with 4GB RAM & is22

Comment 22 Juan Hernández 2013-11-12 12:25:32 UTC
I think that this has been happening since is21 because we recently merged a patch [1] that changes the default minimum number of connections from 1 to 75.

This means that the engine will open at least 75 database connections regardless of what is actually needed. Each connection corresponds to one PostgreSQL process, and those processes are consuming the RAM and triggering the OOM killer.
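
For anyone reproducing this, the two numbers can be correlated by asking PostgreSQL itself how many backends it has open; a sketch, assuming the engine database is named "engine":

  # One row per open connection/backend; with the new default this should
  # hover around 75 even when the engine is idle.
  su - postgres -c "psql -d engine -c 'SELECT count(*) FROM pg_stat_activity;'"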

According to the statistics from the application server, the maximum number of connections actually used during the tests is 5, so the remaining 70 are pure overhead. These statistics can be obtained as follows:

  # /usr/share/jbossas/bin/jboss-cli.sh --connect --controller=localhost:8706
  [standalone@localhost:8706 /] ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool
  ActiveCount=75         DestroyedCount=0       MaxWaitTime=1          
  AvailableCount=99      InUseCount=1           TimedOut=0             
  AverageBlockingTime=1  MaxCreationTime=18     TotalBlockingTime=12   
  AverageCreationTime=6  MaxUsedCount=5         TotalCreationTime=469  
  CreatedCount=75        MaxWaitCount=0         

The relevant output value is "MaxUsedCount".
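
If only that single value is needed, a read-attribute operation should also work; a sketch, assuming the same controller address and datasource name as above:

  [standalone@localhost:8706 /] /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool:read-attribute(name=MaxUsedCount)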

To double check I would suggest to add the following to /etc/ovirt-engine/engine.conf:

  ENGINE_DB_MIN_CONNECTIONS=1 # This used to be the default before is21

Then repeat the test. If it works, then we should revert the part of the patch that changed the default from 1 to 75, or use a smaller default.

It would also be nice to get the value of "MaxUsedCount" after running the QA tests, as this will give us an indication of the actual number of connections required to run those loads.

[1] http://gerrit.ovirt.org/19735

Comment 23 Juan Hernández 2013-11-12 16:05:45 UTC
The Jenkins job where this issue was detected has been executed again with the following inside /etc/ovirt-engine/engine.conf.d/99-my.conf:

  ENGINE_DB_MIN_CONNECTIONS=1

It finished successfully, so it confirms that the problem is in the default database pool configuration.

Note also that the machine where this test was executed had 4 GiB of RAM and during the test it always had at least 100 MiB of RAM free and 1 GiB of RAM free+buffers+cache, so 4 GiB are enough to run the load created by the job.

I just submitted a patch to set the default min size to 1 again:

http://gerrit.ovirt.org/21188

Comment 26 Barak 2013-11-12 19:14:23 UTC
This brings a different issue to the surface: since we require only 4 GB on the engine host, it looks like such a configuration will not support the maximum number of connections.

Do we need to change the minimum required memory in engine-setup? (This will require a different bug.)

Comment 27 Arthur Berezin 2013-11-13 21:00:08 UTC
Unfortunately we cannot increase the minimum required memory, especially when moving to hosted engine.

How many hosts can the engine support with 70 connections (my guess would be 70, but I want to make sure)?

Can we decrease each connection's size?
What optimizations can we do at the database level?

Arthur

Comment 28 Juan Hernández 2013-11-14 09:33:23 UTC
There isn't a direct relationship between the number of hosts and the number of required database connections.

I believe that we should be able to work with far fewer database connections, but this needs to be analyzed in the context of scale tests, as running simple loads won't push the system to use many connections.

For example, in the scenario where this error was discovered (the import-export tests) the engine never used more than 5 connections simultaneously, as explained in comment #22 - at least that was my observation.

It would be very helpful if we could get statistics about this from the jobs that we run routinely. It is just a matter of executing the following command after the job finishes but before stopping the engine:

  # /usr/share/jbossas/bin/jboss-cli.sh --connect --controller=localhost:8706
  [standalone@localhost:8706 /] ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool

If we can attach the result of that to each job then we can start to build statistics of how many connections are needed for each kind of load. This should also be included in the scale tests.
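
For unattended jobs this can be scripted instead of run interactively; a hypothetical job step, assuming the same controller address as above and using the CLI's non-interactive --command mode:

  # Capture the pool statistics to a file the job can archive as an artifact.
  /usr/share/jbossas/bin/jboss-cli.sh --connect --controller=localhost:8706 \
      --command="ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool" \
      > pool-stats.txt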

Anyhow, I definitely suggest reducing the min size of the pool to 1, as in the proposed patch (already merged upstream), as soon as possible.

Barak, should I dedicate time to study how to optimize at the database level?

Comment 30 Barak 2013-11-14 13:28:57 UTC
(In reply to Juan Hernández from comment #28)

> Barak, should I dedicate time to study how to optimize at the database level?

There is currently an effort by QE to scale-test 3.3; this is a good opportunity to extract such information.

Larisa, can you please use comment #28 as a reference for additional information to be gathered by your team (even manually)?

And please add it to this bug.

Juan - when we have the information I would appreciate you getting involved to analyze the feedback.

Keep in mind that we are looking for the configuration that best fits most use cases; it is clear that in some scaled environments we'll have to change the configuration.

Comment 31 Katarzyna Jachim 2013-11-20 14:29:59 UTC
Verified on is23.1; it works OK (i.e. ovirt-engine is not killed by the oom-killer in our long tests, even without the workaround from comment #23).

Comment 33 Juan Hernández 2013-12-12 13:59:16 UTC
Not sure I understand the change in the target release from 3.3.0 to 3.2.6. The fix has already been merged and verified in 3.3. Does this mean that we want to use this same bug to backport the change to 3.2.6? Shouldn't it be a cloned bug?

Comment 34 Itamar Heim 2013-12-13 17:24:09 UTC
(In reply to Juan Hernández from comment #33)
> Not sure I understand the change in the target release from 3.3.0 to 3.2.6.
> The fix has already been merged and verified in 3.3. Does this mean that we
> want to use this same bug to backport the change to 3.2.6? Shouldn't it be a
> cloned bug?

before cloning it, it needs the 3.2.z flag + a 3.2.x target release.
then it needs to actually be cloned.

Comment 35 Itamar Heim 2014-01-21 22:16:21 UTC
Closing - RHEV 3.3 Released
