Description of problem:
The ovirt-engine-dwh service appears to partially crash shortly after launch because java.lang.OutOfMemoryError exceptions are thrown. The host (webadmin, userportal, API, ssh, etc.) is almost unresponsive until 'service ovirt-engine-dwh stop' is run. Increasing the heap space [1] has been suggested, but the customer wants to be sure that this does not merely mask an underlying issue that will resurface later.
Example log showing the errors:
/var/log/ovirt-engine/ovirt-engine-dwhd.log
~~~
2013-09-19 13:30:29|ETL Service Started
Exception in thread "Thread-300" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7947)
at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$3.run(AggregationToHourly.java:23119)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Calendar.<init>(Calendar.java:951)
at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619)
at java.util.Calendar.createCalendar(Calendar.java:1030)
at java.util.Calendar.getInstance(Calendar.java:968)
at routines.RoutineHistoryETL.dateCompare(RoutineHistoryETL.java:163)
at
~~~
How reproducible:
Always
Steps to Reproduce:
1. Attempt to start the ovirt-engine-dwh service.
2. Service starts, OOM JVM errors logged, host becomes sluggish and almost unresponsive.
3. Stopping the ovirt-engine-dwh service returns the host to normal.
Actual results:
OOM errors logged and the host becomes almost unresponsive.
Expected results:
No OOM errors; the host remains responsive and data is logged into the dwh DB. The aggregation should be done hour by hour.
Additional info:
According to the dump taken when the OutOfMemoryError occurred, one object was holding a vector with 1465765 elements (roughly one and a half million), taking a total of 747972248 bytes (about 748 MB, or 713 MiB).
This vector was a local variable of a thread created in class AggregationToHourly, and that thread was running the following query:
SELECT
    history_id,
    history_datetime,
    current_user_name,
    vm_id,
    minutes_in_status,
    cpu_usage_percent,
    memory_usage_percent,
    user_cpu_usage_percent,
    system_cpu_usage_percent,
    vm_ip,
    currently_running_on_host,
    vm_configuration_version,
    current_host_configuration_version
FROM
    vm_samples_history
WHERE
    vm_status = 1 AND
    history_datetime >= (
        SELECT
            var_datetime
        FROM
            history_configuration
        WHERE
            var_name = 'lastHourAggr'
    )
ORDER BY
    history_datetime,
    current_user_name,
    vm_id
To collect the results of the query, the PostgreSQL driver creates a vector that holds the rows; that is the vector that keeps growing.
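One way to keep that vector bounded, in line with the hour-by-hour aggregation requested under "Expected results", would be to split the backlog into one-hour windows and run the query once per window. The following is only a minimal sketch of the window arithmetic, not the actual ETL code: the class HourlyWindows, its split method and the sample timestamps are hypothetical, and it assumes Java 8 or later for java.time.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of ovirt-engine-dwh.
final class HourlyWindows {

    /** Half-open time window [start, end). */
    static final class Window {
        final LocalDateTime start;
        final LocalDateTime end;

        Window(LocalDateTime start, LocalDateTime end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits the range from lastHourAggr up to now into complete one-hour
     * windows. A trailing partial hour is left for a later run.
     */
    static List<Window> split(LocalDateTime lastHourAggr, LocalDateTime now) {
        List<Window> windows = new ArrayList<>();
        LocalDateTime start = lastHourAggr;
        while (!start.plusHours(1).isAfter(now)) {
            LocalDateTime end = start.plusHours(1);
            windows.add(new Window(start, end));
            start = end;
        }
        return windows;
    }

    public static void main(String[] args) {
        // Illustrative values only; the real start would come from
        // history_configuration.lastHourAggr.
        LocalDateTime last = LocalDateTime.of(2013, 9, 18, 10, 0);
        LocalDateTime now = LocalDateTime.of(2013, 9, 19, 13, 30);
        for (Window w : split(last, now)) {
            // Each window would bound one query, for example (assumed form):
            //   history_datetime >= ? AND history_datetime < ?
            System.out.println(w.start + " .. " + w.end);
        }
    }
}
```

With one query per window, the rows held in memory at any time would be limited to a single hour of samples rather than the whole backlog.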
By this calculation they have 1297627 rows / 60 minutes per hour = 21627 VM-hours of data to aggregate; divided by the 800 VMs we know they should have, that is about 27 hours of non-aggregated data.
I have no problem with stopping the aggregation process when an error occurs, but if the problem is not fixed within the boundaries of hoursToKeepSamples, then you lose data anyway.
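The estimate above can be reproduced with integer arithmetic. This is just a sketch of the calculation; the class and method names are hypothetical, and the inputs are the figures quoted in this report.

```java
// Hypothetical helper reproducing the backlog estimate from this report.
final class BacklogEstimate {

    /**
     * rows           number of rows returned by the aggregation query
     * samplesPerHour samples per VM per hour (60, one per minute)
     * vmCount        number of VMs expected in the environment
     */
    static long backlogHours(long rows, int samplesPerHour, int vmCount) {
        long vmHours = rows / samplesPerHour; // 1297627 / 60 = 21627
        return vmHours / vmCount;             // 21627 / 800 = 27
    }

    public static void main(String[] args) {
        System.out.println(backlogHours(1297627L, 60, 800)); // prints 27
    }
}
```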