Bug 1026100
Summary: | ovirt-engine is killed by oom-killer in is21 | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Katarzyna Jachim <kjachim>
Component: | ovirt-engine | Assignee: | Juan Hernández <juan.hernandez>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Katarzyna Jachim <kjachim>
Severity: | high | Docs Contact: |
Priority: | urgent | |
Version: | 3.3.0 | CC: | aberezin, acathrow, amureini, bazulay, eedri, gickowic, iheim, juan.hernandez, kjachim, lpeer, lustalov, ncredi, pprakash, pstehlik, Rhev-m-bugs, srevivo, yeylon
Target Milestone: | --- | Keywords: | AutomationBlocker, TestBlocker
Target Release: | 3.2.6 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | infra | |
Fixed In Version: | is23.1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-01-21 22:16:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1032811 | |
Attachments: | | |
Created attachment 818798 [details]
test logs (vdsm, engine, server etc.)
Created attachment 818799 [details]
test logs (vdsm, engine, server etc.) from network test
If the kernel OOM killer killed the process there will be no heap dump: the Java virtual machine is what writes the dump, and it is already dead. Dump generation only works when the Java virtual machine itself detects an OutOfMemoryError.

Looking at the output from the OOM killer, I see that the Java virtual machine was killed because it was the single process consuming the most RAM. But it was consuming only 440 MiB of actual RAM and 2.69 GiB of virtual address space, which is normal. What is really consuming RAM on that machine is PostgreSQL: it has 81 processes consuming an average of 26 MiB each, for a total of 2.1 GiB. But the OOM killer doesn't take these processes into account, because the parent process protects itself from the OOM killer. From the PostgreSQL start script:

    PG_OOM_ADJ=-17
    ...
    test x"$PG_OOM_ADJ" != x && echo "$PG_OOM_ADJ" > /proc/self/oom_adj

As the real memory hog here is PostgreSQL, I would suggest reducing its memory usage by reducing the number of connections to the database (PostgreSQL creates one subprocess per connection). In /etc/ovirt-engine/engine.conf add the following:

    ENGINE_DB_MIN_CONNECTIONS=1
    ENGINE_DB_MAX_CONNECTIONS=50  # The default is 100

This is probably unfeasible, though, as it will likely introduce other problems. All in all, the only viable solution may be to increase the memory available in the machine (apparently they have only 3 GiB) or to move the database to an external machine.

Katarzyna, Gadi,

A few questions:
- Is this a clear issue that appears only starting at is21?
- Is there something new in is21 in the test env:
  * a new RHEL 6.5 version?
  * a new EAP version?
- Can you please rerun those tests with 4.5 GB and see if they pass?

(In reply to Barak from comment #10)
> - is this a clear issue appears only starting at is21 ?
Yes, it has never happened before in nightly runs. It did happen on my (and only my) environment, see https://bugzilla.redhat.com/show_bug.cgi?id=910779

(In reply to Barak from comment #10)
> Katarzyna, Gadi,
>
> A few questions:
> - is this a clear issue appears only starting at is21 ?

I have never had this happen on my engine before is21.

> - Is there something new is21 from the test env:
> * new RHEL 6.5 version ?
> * new EAP version ?

In my environment (and it should be identical to our testing env for nightly runs) I have been using JBEAP-6.2.0.ER7 since is21.

> - Can you please rerun those tests with 4.5 GB and see if it passes ?

Have we always tested this with 3 GiB of RAM? That seems strange to me, because we officially require a minimum of 4 GiB.

Juan - per comment #12 it looks like there is a correlation between JBEAP-6.2.0.ER7 and this failure. Can you please take a look into it?

Katarzyna,

To understand the urgency - can you please rerun those tests with 4.5 GB and see if they pass? In addition, in comment #11 (Bug 910779) you specified that it had also happened on is18. Did you by any chance start using JBEAP-6.2.0.ER5 in the test env on is18?

(In reply to Juan Hernández from comment #13)
> Have we always tested this with 3 GiB of RAM? Seems strange to me because we
> officially require a minimum of 4 GiB.

I am using 4 GiB of RAM on my engine.

Katarzyna, please provide the info requested in comment #15.

Juan, can you please take a look into comment #14?

Gadi, regarding comment #16: according to the attached logs, the machines where the failure was detected had only 3 GiB of RAM:

    Nov 3 08:40:15 jenkins-vm-02 kernel: 786428 pages RAM
    Nov 3 03:28:19 jenkins-vm-09 kernel: 786428 pages RAM

(786428 pages * 4 KiB/page ~ 3 GiB)

If we are seeing this on machines with 4 GiB of RAM as well, I would like to get the logs (the /var/log/messages files) from those.
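The two memory estimates quoted in the comments above (81 PostgreSQL backends at roughly 26 MiB each, and the 786428 pages of RAM reported by the kernel) can be reproduced with a couple of small helpers. The function names are our own, for illustration only; they are not part of any oVirt tool:

```shell
# Helper names are our own, for illustration; not part of any oVirt tool.

# Total resident memory for N PostgreSQL backends at a given average RSS,
# in GiB (the comment above estimates 81 backends * 26 MiB ~ 2.1 GiB).
pg_mem_gib() {
    awk -v procs="$1" -v avg_mib="$2" \
        'BEGIN { printf "%.1f\n", procs * avg_mib / 1024 }'
}

# Convert the "pages RAM" figure from the kernel OOM report to GiB,
# assuming the usual 4 KiB page size.
pages_to_gib() {
    awk -v pages="$1" 'BEGIN { printf "%.1f\n", pages * 4 / 1024 / 1024 }'
}

pg_mem_gib 81 26      # prints 2.1
pages_to_gib 786428   # prints 3.0
```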
Barak, there could be a correlation with the version of EAP, but I really don't expect it, as EAP isn't consuming the amount of memory that causes the problem; it is PostgreSQL that consumes it. This could be caused by a change in the version of PostgreSQL or by a change in the behavior of the application: it may be using more connections than before.

As it isn't clear what amount of RAM we are using to run the tests (3 or 4 GiB), I would suggest verifying the amount of RAM we used before is21 and the amount we use now.

In addition, I don't have an environment where I can reproduce this, so I would appreciate it if QE could prepare one so that I can connect and check what is happening when the issue appears. Katarzyna, Gadi, is that feasible?

Juan, we have increased the memory on our machines to 4 GB, but it hasn't helped; check the attached log (it is already is22 running there). I can also give you access to my setup: there is still is21 there, and it is dying constantly after a few hours of doing nothing (just create a VM and leave it). Just ping me on IRC. You can also start the test that always fails because of this problem (3.3-storage_export_import_*) in our Jenkins and log into that machine. If you don't have access to it (you probably don't), again, ping me on IRC; I can start it and send you information about where it is running. It should fail after about 1.5 hours.

Created attachment 822778 [details]
/var/log/messages from machine with 4GB RAM & is22
I think that this is happening from is21 on because we recently merged a patch [1] that changes the default minimum number of database connections from 1 to 75. This means that the engine will open at least 75 database connections regardless of what is actually needed. Each connection corresponds to one PostgreSQL process, and those processes are consuming the RAM and triggering the OOM killer.

According to the statistics from the application server, the maximum number of connections actually used for the tests is 5, so the remaining 70 are pure overhead. This can be obtained as follows:

    # /usr/share/jbossas/bin/jboss-cli.sh --connect --controller=localhost:8706
    [standalone@localhost:8706 /] ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool
    ActiveCount=75
    DestroyedCount=0
    MaxWaitTime=1
    AvailableCount=99
    InUseCount=1
    TimedOut=0
    AverageBlockingTime=1
    MaxCreationTime=18
    TotalBlockingTime=12
    AverageCreationTime=6
    MaxUsedCount=5
    TotalCreationTime=469
    CreatedCount=75
    MaxWaitCount=0

The relevant output value is "MaxUsedCount". To double-check, I would suggest adding the following to /etc/ovirt-engine/engine.conf:

    ENGINE_DB_MIN_CONNECTIONS=1  # This used to be the default before is21

Then repeat the test. If it works, we should revert the part of the patch that changes the default from 1 to 75, or use a smaller default. It would also be nice to get the value of "MaxUsedCount" after running the QA tests, as this will give us an indication of the actual number of connections required to run those loads.

[1] http://gerrit.ovirt.org/19735

The Jenkins job where this issue was detected has been executed again with the following inside /etc/ovirt-engine/engine/99-my.conf:

    ENGINE_DB_MIN_CONNECTIONS=1

It finished successfully, which confirms that the problem is in the default database pool configuration.
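Given the key=value lines printed by the CLI above, the idle overhead described here (75 pooled connections minus a peak usage of 5) can be extracted mechanically. This small filter is our own sketch, not part of jboss-cli:

```shell
# Sketch of a filter over the statistics=pool output shown above
# (the function name is our own). It reports how many pooled
# connections were held open but never actually needed at peak load.
pool_overhead() {
    awk -F= '
        $1 == "ActiveCount"  { active = $2 }
        $1 == "MaxUsedCount" { used = $2 }
        END { print active - used }
    '
}

# Feeding it the two relevant lines from the report above prints 70.
printf 'ActiveCount=75\nMaxUsedCount=5\n' | pool_overhead
```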
Note also that the machine where this test was executed had 4 GiB of RAM, and during the test it always had at least 100 MiB of RAM free and 1 GiB of RAM free+buffers+cache, so 4 GiB is enough to run the load created by the job.

I just submitted a patch to set the default min size to 1 again: http://gerrit.ovirt.org/21188

This brings a different issue to the surface: since we demand only 4 GB on the engine host, it looks like such a configuration will not hold the maximum number of connections. Do we need to change the minimal required memory in engine-setup? (That will require a different bug.)

Unfortunately we cannot increase the minimum required memory, especially when moving to hosted engine.

How many hosts can the engine support with 70 connections (my guess would be 70, but I want to make sure)? Can we decrease each connection's size? What optimizations can we do at the database level?

Arthur

There isn't a direct relationship between the number of hosts and the number of required database connections. I believe that we should be able to work with far fewer database connections, but this needs to be analyzed in the context of scale tests, as running simple loads won't push the system to use many connections. For example, in the scenario where this error was discovered (the import-export tests) the engine never used more than 5 connections simultaneously, as explained in comment #22; at least that was my observation.

It would be very helpful if we could get statistics about this from the jobs that we run routinely. It is just a matter of executing the following command after the job finishes but before stopping the engine:

    # /usr/share/jbossas/bin/jboss-cli.sh --connect --controller=localhost:8706
    [standalone@localhost:8706 /] ls /subsystem=datasources/data-source=ENGINEDataSource/statistics=pool

If we can attach the result of that to each job, then we can start to build statistics of how many connections are needed for each kind of load. This should also be included in the scale tests.
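Once per-job MaxUsedCount figures are being collected, one could derive a pool setting from them. The 50% headroom rule below is purely our assumption, not something proposed in this bug; it only illustrates turning the gathered statistics into a configuration value:

```shell
# Hypothetical sizing rule (our assumption, not from this bug): take the
# peak MaxUsedCount observed across jobs and add roughly 50% headroom,
# with a floor of 1, to suggest a connection pool size.
suggest_pool_size() {
    awk -v peak="$1" 'BEGIN {
        s = int(peak * 1.5 + 0.5)   # round to nearest integer
        if (s < 1) s = 1
        print s
    }'
}

suggest_pool_size 5   # peak observed in the import-export tests; prints 8
```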
Anyhow, I definitely suggest reducing the min size of the pool to 1, as in the proposed patch (already merged upstream), as soon as possible.

Barak, should I dedicate time to studying how to optimize at the database level?

(In reply to Juan Hernández from comment #28)
> Barak, should I dedicate time to study how to optimize at the database level?

There is currently an effort by QE to scale-test 3.3; this is a good opportunity to extract such information.

Larisa, can you please use comment #28 as a reference for additional information to be gathered by your team (even manually)? And please add it to this bug.

Juan - when we have the information, I would appreciate you getting involved to analyze the feedback. Keep in mind that we are looking for the configuration best fitted to most use cases; it is clear that on some scaled environments we'll have to change the configuration.

Verified on is23.1, works OK (i.e. ovirt-engine is not killed by the oom-killer in our long tests, even without the workaround from comment #23).

I am not sure I understand the change in the target release from 3.3.0 to 3.2.6. The fix has already been merged and verified in 3.3. Does this mean that we want to use this same bug to backport the change to 3.2.6? Shouldn't it be a cloned bug?

(In reply to Juan Hernández from comment #33)
> Not sure I understand the change in the target release from 3.3.0 to 3.2.6.
> The fix has already been merged and verified in 3.3. Does this mean that we
> want to use this same bug to backport the change to 3.2.6? Shouldn't it be a
> cloned bug?

Before cloning it, it needs the 3.2.z flag and a 3.2.x target release; then it needs to actually be cloned.

Closing - RHEV 3.3 Released
Created attachment 818797 [details] /var/log/messages from machines with ovirt engines from linked failed tests