Bug 786572
Summary: | elasticsearch can get killed due to out of memory errors. | |
---|---|---|---
Product: | Red Hat Satellite | Reporter: | Corey Welton <cwelton>
Component: | Infrastructure | Assignee: | Brad Buckingham <bbuckingham>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Garik Khachikyan <gkhachik>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 6.0.1 | CC: | bbuckingham, bkearney, ftaylor, gkhachik, jlaska, jorgen.langgat, jsherril, lzap, mkoci, mmccune
Target Milestone: | Unspecified | Keywords: | Triaged
Target Release: | Unused | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | Katello Version: 0.1.207-1.git.2.7881501.el6
Last Closed: | 2012-08-22 18:24:20 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 747354 | |
Description
Corey Welton
2012-02-01 19:47:50 UTC
Taking this one. We should not give the Elasticsearch VM that much heap; 256m must be enough. @Mike - after you set a reasonable value, please update the algorithm (the reserved memory amount) that calculates the number of thin processes for the installation.

I am going to increase it by 100 MB, but maybe that's not enough: puppet - increasing the OS/BE reserve by 100 MB (will push later today).

Adding qa_ack for cloudforms-1.0.0.

Bumping elasticsearch to a 1.5G heap limit vs the current 1.0G.

commit d0ea23ae941f6d02fff7cdcc82d05d26953e1f46
Author: Mike McCune <mmccune>
Date: Tue Feb 7 17:00:09 2012 -0800

    786572 - force in max/min heap sizes to 1.5G vs the current 1G limit

Closing out as a dev task; can be reopened if issues persist.

Using packages which I believe contain the fix from comment#4:

* katello-0.1.235-2.el6.noarch
* katello-configure-0.1.64-3.el6.noarch
* elasticsearch-0.18.4-7.el6.x86_64

... I observed the following elasticsearch OOM kill:

> Out of memory: Kill process 26305 (java) score 234 or sacrifice child
> Killed process 26305, UID 497, (java) total-vm:4057440kB, anon-rss:963088kB, file-rss:124kB

Are the changes in place sufficient? It is interesting to note that in both of these cases, elasticsearch was killed around the same time a content promotion was taking place. Could be a lead - or could be a red herring. Reopening for dev consideration.

The new memory setting makes the occurrence less likely; moving to an RC blocker given the rarity.

But Mike's commit introduced another issue: katello-configure calculates how many thin processes should be deployed by default, and it reserves 600 MB for the OS and backend engines (candlepin, httpd, pulp). I think we should increase this value to 2 GB. The issue is that configure now thinks there is enough room for 3 thin processes on a standard 2 GB machine, which is apparently not true (1.5 G for elasticsearch, one thin process is roughly 250 M, and the backend engines also eat memory). I am pushing the change upstream:

e7b1651 786572 - increasing memory reserve to 2 GB

The practical result is that a typical testing box with only 2 GB will have only one thin process configured. By the way, I think there must be another solution to the original issue - the indexing engine MUST be able to work within 500 MB without any problems while indexing terabytes of data; it is just seeking through some index files.
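For illustration, a rough sketch of the sizing arithmetic being described, using the numbers quoted in this thread; the real formula lives in katello-configure and is not reproduced here, and the per-process footprint is an approximation:

# Hypothetical sketch only - not the katello-configure implementation.
total_mb=2048        # a typical 2 GB test box
reserve_mb=2048      # proposed reserve: OS + candlepin/httpd/pulp + 1.5 GB elasticsearch
thin_mb=250          # rough footprint of one thin (Rails) process

free_mb=$(( total_mb - reserve_mb ))
thin_count=$(( free_mb / thin_mb ))
(( thin_count < 1 )) && thin_count=1   # always configure at least one thin process
echo "thin processes to configure: $thin_count"   # -> 1 on a 2 GB machine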
Based on the original error message reported, it appears that the OS "out of memory killer" is killing the elasticsearch process. This doesn't, however, mean that elasticsearch is leaking memory. The OOM killer kills processes based on rules it has defined, attempting to minimize impact on the system. Increasing the memory available to elasticsearch to 1.5g could actually make the OOM kill more likely to occur, since it leaves less memory for the other processes. I'll be doing some tests soon with a smaller value (e.g. 500m) to observe the behavior.

That said, I've run through the test scenario that James used, which involves syncing and promoting several large products/repos (~60gb in content), templates, etc. I set up my VM to be similar to James' (e.g. reserved 4g RAM, 100gb disk and 2 vcpus; James had 4). I wasn't able to reproduce the OOM condition; however, below is the memory utilization at the end of the run for the processes consuming the most, along with the max each process reached during the run:

delayed_jobs (ruby)  = ~896m (max ~919m)
thin (3 - cpu-1)     = ~187m, 180m & 174m (rails app instances)
elasticsearch (java) = ~1g (gradually grows to a max of 1g)
wsgi:pulp (httpd)    = ~469m (max 659m) - up and down, but mostly increasing
mongod               = ~141m (max ~298m) (up and down)
apache (httpd)       = 3-27m each (~15 processes, ~155m total)
tomcat (java)        = ~150m (max ~337m) (up and down)

From the above, the scenarios executed appear to require nearly all of the 4g of RAM allocated. Also, while there are several areas for us to investigate, the key ones look to be:
- delayed jobs
- elasticsearch
- wsgi:pulp

Yes, I agree the bump to 1.5G does not necessarily solve the issue. Question: every JVM process should be able to consume the -Xmx value plus some stack frames. In our scenario that was 1 GB by default, yet as we can see in the report the JVM was killed while consuming 2.4 GB. Something went wrong. BTW, ES uses the native JNI library "Sigar", where memory leaks can easily occur. There are many "out of memory" threads regarding ElasticSearch, so this could be a bug. Here a community member recommends either using a 32-bit JVM or tuning the default settings a bit. He does a rough calculation: 20 MB * indices (16 in our case) * shards (5, the default) => 1.6 GB. According to the documentation we only need 1 shard for Katello, since our default installation does not permit multi-node installations. That could help (5 times less than the default setting). We definitely need to tune our settings. The best documentation on how to configure things is here:

http://www.elasticsearch.org/guide/reference/index-modules/
https://github.com/elasticsearch/elasticsearch/blob/master/config/elasticsearch.yml

We should consider setting open file limits too:

http://www.elasticsearch.org/tutorials/2011/04/06/too-many-open-files.html

By the way, lots of things in elasticsearch are undocumented; for example, the sample configuration in the RHEL 6 ES version completely misses the "cache" portion of the config, and even the reference above does not list all the information that can be found on the mailing lists (cache_size, buffer_size, warm_cache). I'd recommend keeping the maximum heap size low - it must work with lower memory settings. Katello itself consumes a lot of memory, as we can see above; allocating another 1.5 GB makes things even worse. From my past Apache Lucene experience I am pretty sure it can work with a 500 MB index using only 64 MB of memory without any problems.

Will try some delayed_jobs testing today; I am concerned about that memory consumption too. I can confirm that with the 1.5 GB setting it is even easier to overload the machine and let the OS kill elasticsearch (it chooses the process with the biggest memory consumption). Reproducer: use the CLI to import a RH manifest and sync one repo 3 times; a 2 GB machine is dead after the third run. I can provide you with a simple CLI script.

@Mike - I think we discussed this, but can you explain to me again why you set the ES_MAX_MEM/ES_MIN_MEM settings in elasticsearch.in.sh? I have tested setting them in /etc/sysconfig/elasticsearch and it works fine; I would expect them there.

# rpm -V elasticsearch
S.5....T. /usr/share/java/elasticsearch/bin/elasticsearch.in.sh
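A minimal sketch of moving the heap settings into /etc/sysconfig/elasticsearch, assuming the init script sources that file and elasticsearch.in.sh no longer hardcodes the values (the thread is ambiguous on whether the hardcoded values win, so the grep and restart steps below are just one way to confirm):

# /etc/sysconfig/elasticsearch (assumed to be sourced by the init script on this packaging)
ES_MIN_MEM=512m
ES_MAX_MEM=512m

# Check whether the wrapper script still hardcodes the heap and would override this:
grep -n 'ES_MIN_MEM\|ES_MAX_MEM' /usr/share/java/elasticsearch/bin/elasticsearch.in.sh

# Restart and confirm which -Xms/-Xmx the JVM actually received:
service elasticsearch restart
ps axww | grep elastic | grep -o '\-Xm[sx][0-9]*[mg]'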
(In reply to comment #12)
> BTW ES uses native JNI library "Sigar", memory leaks can easily occur.

Note that ES ships with Sigar by default, but we do NOT. When we were doing the packaging it was discovered that ES runs fine without Sigar (although some system monitoring won't work without it). In addition, there was a conflict with RHEL 6 (6.1 ships a git version of Sigar, without the Java jar, for matahari), so we dropped it completely.

I have the same issue on a 2 GB VM running System Engine. I could no longer get to the web interface, and a katello ping showed that elasticsearch was dead. When I try to restart elasticsearch, I get:

Error occurred during initialization of VM
Could not reserve enough space for object heap

I manually changed ES_MIN_MEM and ES_MAX_MEM to 512m in /usr/share/java/elasticsearch/bin/elasticsearch.in.sh. These are apparently hardcoded, so the variables in /etc/sysconfig/elasticsearch do not work. With 512 MB min/max it will at least start now. NB: I only had 512 MB of swap; I increased swap to 2 GB, and now elasticsearch will start with the original ES_MIN_MEM and ES_MAX_MEM set to 1512m.

@Justin - you're right, we are not using it:

[root@el ~]# rpm -ql sigar
/usr/lib64/libsigar.so
/usr/share/doc/sigar-1.6.5
/usr/share/doc/sigar-1.6.5/AUTHORS
/usr/share/doc/sigar-1.6.5/ChangeLog
/usr/share/doc/sigar-1.6.5/LICENSE
/usr/share/doc/sigar-1.6.5/NOTICE
/usr/share/doc/sigar-1.6.5/README
[root@el ~]# ps ax | grep elastic
4661 ? Sl 0:28 /usr/bin/java -Xms64m -Xmx64m ...
[root@el ~]# lsof -p 4661 | grep sigar

Anyway, we still need to:
1) decrease the JVM heap to a reasonable value (512m), as ES reserves a lot more VIRT memory than the maximum heap size;
2) get the sysconfig variables working (the Fedora/RHEL standard);
3) tune the default settings (the ES defaults are "ready to scale up").

commit 89d215d9a7e4c955318bff99fc26b595f19d6480 - Updated the installer to set the elasticsearch heap at 256m. Tested with several large (8gb-20gb) repos performing syncs, promotions, etc.

commit dc6e3d230441d50e91ed3254808c57500304a04f - Reduced elasticsearch.yml to use 3 shards vs the default (5).

Elasticsearch is configured by default to use 5 shards and 1 replica, which allows the index to scale across up to 10 servers. While that is nice, it comes at the cost of some memory. We could reduce the number of shards to 1, given that our initial deployment configuration is only 1 node; however, that would not facilitate future growth/scalability. To strike a balance between scalability and memory impact, the configuration was reduced to 3 shards.

Note: from some basic testing with 5 vs 3 shards, I observed the following:
- 3 shards : VIRT(~1650m), RES(~395m)
- 5 shards : VIRT(~1720m), RES(~452m)

The scenario run in both configurations was:
- define 2 providers (1 w/ rhel6 & 1 w/ rhel6.1)
- sync both 2 times
- promote both 2 times

So my scenario for verification is going to be:
- sync both the Fedora 16 and CentOS 6 providers 2 times
- promote each 2 times
leaving the config properties for elasticsearch untouched.

# VERIFIED and it works: was able to sync and promote two repositories 2 times without elasticsearch getting killed. Checked on a recent katello git build:
- katello-0.2.14-1.git.0.8140dc7.el6.noarch
- candlepin-0.5.26-1.el6.noarch
- pulp-1.0.0-5.el6.noarch

Getting rid of the 6.0.0 version since that doesn't exist.
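For reference, a rough sketch of what the two installer commits above amount to in configuration terms; the file locations and exact key names are assumptions drawn from this thread, not taken from the katello-configure source:

# Heap pinned to 256m (wherever the installer writes ES_MIN_MEM/ES_MAX_MEM on this packaging):
ES_MIN_MEM=256m
ES_MAX_MEM=256m

# elasticsearch.yml - 3 shards per index instead of the default 5
# (replica count left at the ES default; the commit only mentions shards):
#   index.number_of_shards: 3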