Description of problem:
The ovirt-engine-dwh service appears to partially crash shortly after launch because java.lang.OutOfMemoryError exceptions are thrown. The host (webadmin, userportal, api, ssh, etc.) is almost unresponsive until "service ovirt-engine-dwh stop" is run. Increasing the heap space [1] has been suggested, but the customer wants to ensure that this does not merely mask an underlying issue that will resurface later.

Example log showing the errors:

/var/log/ovirt-engine/ovirt-engine-dwhd.log
~~~
2013-09-19 13:30:29|ETL Service Started
Exception in thread "Thread-300" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
	at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7947)
	at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$3.run(AggregationToHourly.java:23119)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.Calendar.<init>(Calendar.java:951)
	at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619)
	at java.util.Calendar.createCalendar(Calendar.java:1030)
	at java.util.Calendar.getInstance(Calendar.java:968)
	at routines.RoutineHistoryETL.dateCompare(RoutineHistoryETL.java:163)
	at
~~~

How reproducible:
Always

Steps to Reproduce:
1. Attempt to start the ovirt-engine-dwh service.
2. Service starts, OOM JVM errors are logged, and the host becomes sluggish and almost unresponsive.
3. Stopping the ovirt-engine-dwh service returns the host to normal.

Actual results:
OOM errors are logged and the host becomes almost unresponsive.

Expected results:
No OOM errors, the host remains responsive, and data is logged into the dwh DB. This should be done using an hour-by-hour aggregation approach.

Additional info:
According to the heap dump taken when the OutOfMemoryError occurred, one object was holding a vector with 1465765 elements (approximately one and a half million), taking a total of 747972248 bytes (approximately 750 MB).
This vector was a local variable created by a thread started in class AggregationToHourly, and that thread was running the following query:

    SELECT history_id, history_datetime, current_user_name, vm_id,
           minutes_in_status, cpu_usage_percent, memory_usage_percent,
           user_cpu_usage_percent, system_cpu_usage_percent, vm_ip,
           currently_running_on_host, vm_configuration_version,
           current_host_configuration_version
    FROM vm_samples_history
    WHERE vm_status = 1
      AND history_datetime >= (
            SELECT var_datetime
            FROM history_configuration
            WHERE var_name = 'lastHourAggr'
          )
    ORDER BY history_datetime, current_user_name, vm_id

To collect the results of the query, the PostgreSQL driver creates a vector that holds the rows; that is the Vector that is growing.

By this calculation they have 1297627 rows / 60 minutes per hour = 21627 VM-hours of data to aggregate; 21627 / 800 VMs we know they should have = 27 hours of non-aggregated data.

I have no problem stopping the aggregation process when an error occurs, but if a problem is not fixed within the boundaries of hoursToKeepSamples then you lose data anyway.
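
For reference, the backlog estimate above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic quoted in this report: the class and method names are made up for illustration, and it assumes one sample per VM per minute, as the calculation above does. It also derives the average heap retained per vector element from the heap dump figures.

```java
public class BacklogEstimate {
    // Figures quoted in this report.
    static final long ROWS_PENDING = 1_297_627L;     // rows used in the backlog calculation
    static final long SAMPLES_PER_VM_PER_HOUR = 60L; // assumption: one sample per VM per minute
    static final long EXPECTED_VMS = 800L;           // VM count the environment should have
    static final long VECTOR_BYTES = 747_972_248L;   // heap retained by the vector in the dump
    static final long VECTOR_ELEMENTS = 1_465_765L;  // elements in that vector

    // VM-hours of samples waiting to be aggregated (integer division, as in the report).
    static long vmHours(long rows, long samplesPerVmPerHour) {
        return rows / samplesPerVmPerHour;
    }

    // Wall-clock hours of non-aggregated data, given the number of VMs.
    static long backlogHours(long rows, long samplesPerVmPerHour, long vms) {
        return vmHours(rows, samplesPerVmPerHour) / vms;
    }

    // Average heap retained per vector element, in bytes.
    static long bytesPerElement(long bytes, long elements) {
        return bytes / elements;
    }

    public static void main(String[] args) {
        System.out.println("VM-hours pending:  "
                + vmHours(ROWS_PENDING, SAMPLES_PER_VM_PER_HOUR));
        System.out.println("Backlog in hours:  "
                + backlogHours(ROWS_PENDING, SAMPLES_PER_VM_PER_HOUR, EXPECTED_VMS));
        System.out.println("Bytes per element: "
                + bytesPerElement(VECTOR_BYTES, VECTOR_ELEMENTS));
    }
}
```

With the numbers from this report, this gives 21627 VM-hours, a backlog of 27 hours, and roughly 510 bytes retained per element, so the heap needed by the single query grows linearly with the number of hours left unaggregated.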
oVirt 3.5 has been released and should include the fix for this issue.