Description of problem:

The ovirt-engine-dwh service partially crashes shortly after launch, throwing java.lang.OutOfMemoryError exceptions. The host (webadmin, userportal, api, ssh etc) is almost unresponsive until "service ovirt-engine-dwh stop" is called. Increasing the heap space [1] has been suggested, but the customer wants to ensure that this is not masking an underlying issue that will still present itself later.

The RHEV-M host itself is a Xen hosted guest with 16GB RAM, 8 sockets 1 core each, 1x20GB RAW disk hosted within a mirrored LV. We unfortunately do not have SAR data to highlight the perf hit while running the service.

The env it controls consists of ~20 DCs, ~20 clusters, ~20 SDs, ~128 hosts and ~800 vms. I use ~ as this is a rapidly growing env expected to double in size within 6 months.

Will increasing the heap space be enough here? Should we advise the customer to provide additional resources to the RHEV-M to also improve the situation? Are there additional things we can change to improve the situation (hosting the service on a separate system in the longer term etc)?

[1] https://access.redhat.com/site/solutions/448153

Example log showing the errors: /var/log/ovirt-engine/ovirt-engine-dwhd.log

~~~
2013-09-19 13:30:29|ETL Service Started
Exception in thread "Thread-300" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7947)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$3.run(AggregationToHourly.java:23119)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Calendar.<init>(Calendar.java:951)
    at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619)
    at java.util.Calendar.createCalendar(Calendar.java:1030)
    at java.util.Calendar.getInstance(Calendar.java:968)
    at routines.RoutineHistoryETL.dateCompare(RoutineHistoryETL.java:163)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7511)
    ... 1 more
Exception in thread "Thread-1" Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-1"
Exception in thread "Thread-302" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_5Process(AggregationToHourly.java:12791)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$5.run(AggregationToHourly.java:23203)
Caused by: java.lang.OutOfMemoryError: Java heap space
Exception in component tJDBCInput_10
org.postgresql.util.PSQLException: Bad value for type short : 306415363
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getShort(AbstractJdbc2ResultSet.java:1974)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_10Process(AggregationToHourly.java:21422)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$9.run(AggregationToHourly.java:23371)
2013-09-19 14:06:25|SZi7hN|K2Zwx2|9PY572|OVIRT_ENGINE_DWH|AggregationToHourly|Default|6|Java Exception|tJDBCInput_10|org.postgresql.util.PSQLException:Bad value for type short : 306415363|1
Exception in component tJDBCInput_13
org.postgresql.util.PSQLException: Bad value for type date : 65819507
    at org.postgresql.jdbc2.TimestampUtils.loadCalendar(TimestampUtils.java:246)
    at org.postgresql.jdbc2.TimestampUtils.toDate(TimestampUtils.java:402)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getDate(AbstractJdbc2ResultSet.java:407)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getDate(AbstractJdbc2ResultSet.java:2299)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_13Process(AggregationToHourly.java:16648)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$7.run(AggregationToHourly.java:23287)
Caused by: java.lang.NumberFormatException: Expected time to be colon-separated, got ''
    at org.postgresql.jdbc2.TimestampUtils.loadCalendar(TimestampUtils.java:155)
    ... 5 more
2013-09-19 14:07:22|SZi7hN|K2Zwx2|9PY572|OVIRT_ENGINE_DWH|AggregationToHourly|Default|6|Java Exception|tJDBCInput_13|org.postgresql.util.PSQLException:Bad value for type date : 65819507|1
Exception in thread "Thread-336" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync.tJDBCInput_18Process(StatisticsSync.java:13916)
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync$7.run(StatisticsSync.java:16263)
Caused by: java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-333" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync.tJDBCInput_8Process(StatisticsSync.java:7442)
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync$4.run(StatisticsSync.java:16137)
Caused by: java.lang.OutOfMemoryError: Java heap space
Exception in component tJDBCOutput_6
Exception in component tJDBCOutput_5
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:283)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:479)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:367)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:321)
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync.tJDBCInput_10Process(StatisticsSync.java:9760)
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync$5.run(StatisticsSync.java:16179)
Caused by: java.io.IOException: Stream closed
    at sun.nio.cs.StreamEncoder.ensureOpen(StreamEncoder.java:45)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:140)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at org.postgresql.core.PGStream.flush(PGStream.java:507)
    at org.postgresql.core.v3.QueryExecutorImpl.sendSync(QueryExecutorImpl.java:1107)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:256)
    ... 5 more
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:283)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:479)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:367)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:321)
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync.tJDBCInput_12Process(StatisticsSync.java:11806)
    at ovirt_engine_dwh.statisticssync_3_2.StatisticsSync$6.run(StatisticsSync.java:16221)
Caused by: java.io.IOException: Unexpected packet type: 55
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1941)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
    ... 5 more
Exception in thread "Thread-334" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-335" Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-335"
Exception in thread "Thread-303" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_11Process(AggregationToHourly.java:14598)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$6.run(AggregationToHourly.java:23245)
Caused by: java.lang.OutOfMemoryError: Java heap space
Exception in component tJDBCOutput_9
org.postgresql.util.PSQLException: A result was returned when none was expected.
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:322)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_9Process(AggregationToHourly.java:19195)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$8.run(AggregationToHourly.java:23329)
2013-09-19 14:12:53|SZi7hN|K2Zwx2|9PY572|OVIRT_ENGINE_DWH|AggregationToHourly|Default|6|Java Exception|tJDBCOutput_9|org.postgresql.util.PSQLException:A result was returned when none was expected.|1
Exception in component tJDBCInput_1
org.postgresql.util.PSQLException: Bad value for type int : 2013-09-19 06:00:00+01
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.toInt(AbstractJdbc2ResultSet.java:2759)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getInt(AbstractJdbc2ResultSet.java:2003)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_1Process(AggregationToHourly.java:2070)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$1.run(AggregationToHourly.java:23035)
2013-09-19 14:15:57|SZi7hN|K2Zwx2|9PY572|OVIRT_ENGINE_DWH|AggregationToHourly|Default|6|Java Exception|tJDBCInput_1|org.postgresql.util.PSQLException:Bad value for type int : 2013-09-19 06:00:00+01|1
Exception in component tJDBCInput_4
org.postgresql.util.PSQLException: No results were returned by the query.
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:253)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_4Process(AggregationToHourly.java:10040)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$4.run(AggregationToHourly.java:23161)
2013-09-19 14:15:57|SZi7hN|K2Zwx2|9PY572|OVIRT_ENGINE_DWH|AggregationToHourly|Default|6|Java Exception|tJDBCInput_4|org.postgresql.util.PSQLException:No results were returned by the query.|1
Exception in component tJDBCInput_2
org.postgresql.util.PSQLException: Bad value for type short : 306444130
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getShort(AbstractJdbc2ResultSet.java:1974)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_2Process(AggregationToHourly.java:5000)
    at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$2.run(AggregationToHourly.java:23077)
2013-09-19 14:15:57|SZi7hN|K2Zwx2|9PY572|OVIRT_ENGINE_DWH|AggregationToHourly|Default|6|Java Exception|tJDBCInput_2|org.postgresql.util.PSQLException:Bad value for type short : 306444130|1
~~~

Version-Release number of selected component (if applicable):
rhevm-dwh-3.2.1-2.el6ev.noarch      Mon 22 Jul 2013 01:41:51 PM BST
rhevm-reports-3.2.1-6.el6ev.noarch  Mon 22 Jul 2013 01:41:52 PM BST

How reproducible:
Always

Steps to Reproduce:
1. Attempt to start the ovirt-engine-dwh service.
2. Service starts, OOM JVM errors logged, host becomes sluggish and almost unresponsive.
3. Stopping the ovirt-engine-dwh service returns the host to normal.

Actual results:
OOM errors logged and the host becomes almost unresponsive.

Expected results:
No OOM errors, host remains responsive and data is logged into the dwh DB.

Additional info:
Please elaborate on the RHEV-M environment. How many hosts/VMs/clusters and so on?

I see two issues in the logs. One, resolved in 3.2, causes failure of collection for a VM with more than 16 IPs; the other is the heap size, and the solution is to increase the default. A memory leak is unlikely.

Yaniv
(In reply to Yaniv Dary from comment #2)
> Please elaborate on the RHEV-M environment. How many hosts/VMs/clusters and
> so on?

Already provided in the description, let me know if you need anything else. Quoting the actual text here:

(In reply to Lee Yarwood from comment #0)
> Description of problem:
>
> The RHEV-M host itself is a Xen hosted guest with 16GB RAM, 8 sockets 1 core
> each, 1x20GB RAW disk hosted within a mirrored LV. We unfortunately do not
> have SAR data to highlight the perf hit while running the service.
>
> The env it controls consists of ~20 DCs, ~20 clusters, ~20 SDs, ~128 hosts
> and ~800 vms. I use ~ as this is a rapidly growing env expected to double in
> size within 6 months.
(In reply to Lee Yarwood from comment #3)
> (In reply to Yaniv Dary from comment #2)
> > Please elaborate on the RHEV-M environment. How many hosts/VMs/clusters and
> > so on?
>
> Already provided in the description, let me know if you need anything else.

OK, I think the best course of action is to increase the heap size and see if that resolves the issue. The large amounts of data are probably the cause. Juan, do you agree?

Yaniv
(In reply to Yaniv Dary from comment #2)
> I see two issues in the logs. One, resolved in 3.2, causes failure of
> collection for a VM with more than 16 IPs [..]

Btw, could you confirm the BZ? The customer is running the latest 3.2.z version after all.
(In reply to Lee Yarwood from comment #5)
> Btw, could you confirm the BZ? The customer is running the latest 3.2.z
> version after all.

Then it must be an old log message, and it should work with a larger heap.
(In reply to Yaniv Dary from comment #6)
> Then it must be an old log message, and it should work with a larger heap.

Also, because this is a VM, you need to check the storage/host stats: the cause of the slowness can be the large amount of data being written to disk while other VMs are also busy.
It is my understanding that the very nature of the DWH application is to load huge amounts of data from the database into memory and then crunch them, so I agree with Yaniv that the only way forward is to increase the size of the heap.

I don't think additional resources are needed: 16 GiB and 8 CPUs should be enough; if they aren't, in my opinion, there is something that we need to fix. In this regard, having a heap dump of the DWH process and a copy of the database could help. Not sure if we want to invest time in this.

I would also suggest restricting the number of CPUs available to the DWH service, so that if it goes out of memory again the garbage collector will not take over all the CPUs of the machine. This can be achieved using cgroups, for example:

# yum -y install libcgroup
# chkconfig cgconfig on
# cat >> /etc/cgconfig.conf <<.
group dwh {
    cpuset {
        cpuset.cpus = "0,1"; # Use only CPUs 0 and 1
        cpuset.mems = "0";   # Assuming that there is only one memory node
    }
}
.
# service cgconfig restart

Then edit the history_service.sh script and replace the "exec" that runs the Java virtual machine with "cgexec -g cpuset:dwh":

cgexec -g cpuset:dwh ${JAVA_HOME}/bin/java ...

Note that this is just a hack to avoid the DWH service overloading the machine, not a solution to the problem. The solution is to increase the heap size.
Brilliant, thanks for the feedback. I'll pass these suggestions on to the customer now.
Itamar, you added current version flags. What is the fix you want here?

Yaniv
IIUC, Juan's suggestions are important and should get their own bug.

But do we know why we OOM'd? We need to solve that.

I flagged for z-stream to consider it. Question is whether this is a one-off or reproducible.
(In reply to Itamar Heim from comment #11)
> IIUC, Juan's suggestions are important and should get their own bug.

Please specify what you mean.

> But do we know why we OOM'd? We need to solve that.

We don't give unlimited heap size. The standard is good enough for most cases. In very large environments an increase of the defaults is needed. I don't think we should change the defaults currently.

> I flagged for z-stream to consider it. Question is whether this is a one-off
> or reproducible.
(In reply to Yaniv Dary from comment #12)
> Please specify what you mean.

The suggestion in comment 8 to cap CPU, maybe.

> We don't give unlimited heap size. The standard is good enough for most
> cases. In very large environments an increase of the defaults is needed. I
> don't think we should change the defaults currently.

So how do we track/monitor/warn users so it won't explode for them?
Lee - is this an external DB?
(In reply to Barak from comment #15)
> Lee - is this an external DB?

Internal.
Please update.

Yaniv
I have created a test environment with 1000 VMs and 200 hosts running on FakeVDSM. The DWHD process started and ran without any issues. The engine is responsive and relatively fast.

Note that I changed the engine process to use 12GB of heap size. Did the customer change the heap size configuration of the engine process?
(In reply to Liran Zelkha from comment #37)
> I have created a test environment with 1000 VMs and 200 hosts running on
> FakeVDSM.

Hey Liran, which version are you testing here? 3.3?

> The DWHD process started and ran without any issues. The engine is
> responsive and relatively fast.
> Note that I changed the engine process to use 12GB of heap size.
> Did the customer change the heap size configuration of the engine process?

Pep? Julio? AFAIK they did change the DWH [1] heap size but I don't know by how much. I have no idea if they changed the engine heap at all [2].

[1] /usr/share/ovirt-engine-dwh/etl/history_service.sh
[2] /etc/ovirt-engine/engine.conf
(In reply to Lee Yarwood from comment #38)
> Pep? Julio? AFAIK they did change the DWH [1] heap size but I don't know by
> how much. I have no idea if they changed the engine heap at all [2].
>
> [1] /usr/share/ovirt-engine-dwh/etl/history_service.sh
> [2] /etc/ovirt-engine/engine.conf

We'll come back to you on [1], so keeping the needinfo.

No changes in [2] /etc/ovirt-engine/engine.conf, so the default values for the engine from /usr/share/ovirt-engine/conf/engine.conf.defaults apply:

ENGINE_HEAP_MIN=1g
ENGINE_HEAP_MAX=1g
ENGINE_PERM_MIN=256m
ENGINE_PERM_MAX=256m
I'm using RHEVM 3.2.

This will never do. The engine heap is too low for so many VMs/hosts. How much memory does the machine have? Increase the memory and the engine will run much faster.
Concerning the engine - my recommendation is to increase to 8GB. For DWH - if it works at 2GB, leave it like that. The engine would work faster and they will probably not feel the DWH.
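For illustration only, a rough sketch of what raising the engine heap could look like, assuming the defaults quoted in the previous comment are simply overridden in /etc/ovirt-engine/engine.conf (the exact override mechanism should be verified for the installed version, and the values sized against the host's RAM):

# Hypothetical override in /etc/ovirt-engine/engine.conf of the values shipped
# in /usr/share/ovirt-engine/conf/engine.conf.defaults:
ENGINE_HEAP_MIN=8g
ENGINE_HEAP_MAX=8g

# Restart the engine so the new JVM heap settings take effect:
# service ovirt-engine restart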
According to the dump, when the OutOfMemoryError occurred there was one object holding a vector with 1465765 elements (one and a half million approx), taking a total of 747972248 bytes (750 MiB approx). This vector was a local variable created by a thread created in class AggregationToHourly, and that thread was running the following query:

SELECT
  history_id, history_datetime, current_user_name, vm_id, minutes_in_status,
  cpu_usage_percent, memory_usage_percent, user_cpu_usage_percent,
  system_cpu_usage_percent, vm_ip, currently_running_on_host,
  vm_configuration_version, current_host_configuration_version
FROM
  vm_samples_history
WHERE
  vm_status = 1
  AND history_datetime >= (
    SELECT var_datetime
    FROM history_configuration
    WHERE var_name = 'lastHourAggr'
  )
ORDER BY
  history_datetime, current_user_name, vm_id

In order to collect the results of the query the PostgreSQL driver creates a vector that holds the rows; that is the vector that is growing. So apparently that query is returning at least 1465765 rows, which is hard to believe. Can you run this query manually in the customer environment and see what the number of results is?

If the query does return this huge number of rows, then we will need to modify the DWH so that it runs queries setting the fetch size to something other than zero. Zero is the default, and it means that the JDBC driver loads all rows in memory before returning the results. For details see here:

http://jdbc.postgresql.org/documentation/92/query.html#fetchsize-example

The code that runs these queries is generated by Talend Studio, so it won't be easy to change. If we need to do this then we will probably have to request it from Talend, or use something like AspectJ to modify the binary code.

If the query doesn't return a large number of rows, then we are probably facing a bug in the JDBC driver, or a bug in the database itself. If the issue can be reproduced then I would suggest generating a dump of the network traffic between the DWH and the database for further analysis.
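For reference, a minimal sketch of how the fetch size is set with the PostgreSQL JDBC driver. This is hypothetical illustration code, not the Talend-generated AggregationToHourly code; the connection details and column list are placeholders. The driver only streams results through a server-side cursor when autocommit is disabled and the result set is forward-only (the default):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class FetchSizeSketch {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details for the history database.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/ovirt_engine_history", "user", "password");

        // Cursor-based fetching requires autocommit to be off; otherwise the
        // driver loads the whole result set into memory before returning.
        conn.setAutoCommit(false);

        Statement stmt = conn.createStatement();
        // Pull rows from the server in batches of 1000 instead of
        // materializing 1.3M+ rows in a single in-memory java.util.Vector.
        stmt.setFetchSize(1000);

        ResultSet rs = stmt.executeQuery(
                "SELECT history_id, history_datetime FROM vm_samples_history WHERE vm_status = 1");
        while (rs.next()) {
            // Process one row at a time; memory use stays bounded by the batch size.
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}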
(In reply to Juan Hernández from comment #58)
> So apparently that query is returning at least 1465765 rows, which is hard
> to believe. Can you run this query manually in the customer environment and
> see what the number of results is?

I've run it on a copy of their DB (file 00911410_ovirt_engine_history_sql.xz in the download location - not the very latest and live environment) and it does suggest that the order of magnitude is correct:

=> select count(*) FROM vm_samples_history WHERE vm_status = 1 AND history_datetime >= (SELECT var_datetime FROM history_configuration WHERE var_name = 'lastHourAggr');
  count
---------
 1297627
(1 row)

=> SELECT var_datetime FROM history_configuration WHERE var_name = 'lastHourAggr';
      var_datetime
------------------------
 2013-08-29 17:00:00+02
(1 row)

Yaniv, how do we proceed from here?
(In reply to Josep 'Pep' Turro Mauri from comment #59)
> => select count(*) FROM vm_samples_history WHERE vm_status = 1 AND
> history_datetime >= (SELECT var_datetime FROM history_configuration WHERE
> var_name = 'lastHourAggr');
>   count
> ---------
>  1297627
> (1 row)
>
> Yaniv, how do we proceed from here?

By this calculation they have 1297627 rows / 60 samples per VM per hour = 21627 VM-hours of data to aggregate / ~800 VMs we know they should have = ~27 hours of non-aggregated data.

First thing will be to change the last aggregation date to now, so this backlog will not be aggregated; then check the log closely and figure out why samples data is not being aggregated (it should happen every hour). The growing select caused by data not being aggregated is what is causing this issue.

Yaniv
I think the best approach will be to do the hourly aggregation with one select per hour instead of a single "select *"; this will increase the scale the DWH can handle, and if a failure happens it will not keep growing the select size, since we'll simply handle later hours. The downside is that once an hour is skipped, the data for that hour will never be aggregated. I would also advise doing the same for the daily aggregation.

This kind of change should first be accepted by PMs & customers (meaning an aggregation failure may lead to a missing hour/day). In addition, such a change as described cannot be delivered in z-stream.

Yaniv, for now and to make sure the customer issue has been addressed, please add a comment on how to update the last aggregated hour variable to NOW().
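For reference, a hedged sketch of what that update could look like in the ovirt_engine_history database, based on the history_configuration rows shown earlier in this bug (hypothetical statement; the ovirt-engine-dwh service should be stopped first and the variable name/schema verified before running it):

=> UPDATE history_configuration SET var_datetime = now() WHERE var_name = 'lastHourAggr';

After restarting the service, the hourly aggregation query quoted above would only see samples newer than the updated timestamp, so the existing backlog would be skipped rather than loaded into memory in one huge select.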
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0177.html