1091686 – OVIRT35 - [RFE][scale] - prevent OutOfMemoryError after starting the dwh service.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	oVirt
Classification:	Retired
Component:	ovirt-engine-dwh
Sub Component:
Version:	3.5
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Shirly Radco
QA Contact:	movciari
Docs Contact:
URL:
Whiteboard:	infra
Depends On:	1014134
Blocks:

Reported:	2014-04-27 09:31 UTC by Yaniv Lavi
Modified:	2016-02-10 19:31 UTC
CC List:	17 users
Fixed In Version:
Clone Of:	1014134
Environment:
Last Closed:	2014-10-17 12:31:00 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	26952	0	None	None	None	Never

Description Yaniv Lavi 2014-04-27 09:31:25 UTC

Description of problem:

The ovirt-engine-dwh service appears to partially crash shortly after launch because java.lang.OutOfMemoryError exceptions are thrown. The host (webadmin, userportal, API, SSH, etc.) is almost unresponsive until "service ovirt-engine-dwh stop" is run. Increasing the heap space [1] has been suggested, but the customer wants to ensure that this does not merely mask underlying issues that will resurface later.

Example log showing the errors:

/var/log/ovirt-engine/ovirt-engine-dwhd.log

~~~
2013-09-19 13:30:29|ETL Service Started
Exception in thread "Thread-300" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
        at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7947)
        at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$3.run(AggregationToHourly.java:23119)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Calendar.<init>(Calendar.java:951)
        at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619)
        at java.util.Calendar.createCalendar(Calendar.java:1030)
        at java.util.Calendar.getInstance(Calendar.java:968)
        at routines.RoutineHistoryETL.dateCompare(RoutineHistoryETL.java:163)
        at
~~~

How reproducible:
Always

Steps to Reproduce:
1. Attempt to start the ovirt-engine-dwh service.
2. Service starts, OOM JVM errors logged, host becomes sluggish and almost unresponsive.
3. Stopping the ovirt-engine-dwh service returns the host to normal.

Actual results:
OOM errors logged and the host becomes almost unresponsive.

Expected results:
No OOM errors, the host remains responsive, and data is logged into the dwh DB. This should be done via an hour-by-hour aggregation approach.
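
As a rough illustration of the hour-by-hour approach, the backlog since lastHourAggr can be split into one-hour windows so that each aggregation query reads only one hour of samples. This is a minimal sketch, not the actual ETL code: the class HourlyWindows, its Window type, and the timestamps in main are hypothetical, and it assumes each window would be queried with an added upper bound on history_datetime.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the un-aggregated period into one-hour windows.
final class HourlyWindows {

    // Half-open window [start, end) covering exactly one clock hour.
    static final class Window {
        final LocalDateTime start;
        final LocalDateTime end;

        Window(LocalDateTime start, LocalDateTime end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns every complete hour between lastHourAggr and now, oldest first.
    // The current, still incomplete hour is left out.
    static List<Window> split(LocalDateTime lastHourAggr, LocalDateTime now) {
        List<Window> windows = new ArrayList<>();
        LocalDateTime start = lastHourAggr.truncatedTo(ChronoUnit.HOURS);
        LocalDateTime limit = now.truncatedTo(ChronoUnit.HOURS);
        while (start.isBefore(limit)) {
            LocalDateTime end = start.plusHours(1);
            windows.add(new Window(start, end));
            start = end;
        }
        return windows;
    }

    public static void main(String[] args) {
        // Illustrative timestamps only; they are not taken from the customer system.
        LocalDateTime lastHourAggr = LocalDateTime.of(2013, 9, 18, 10, 0);
        LocalDateTime now = LocalDateTime.of(2013, 9, 19, 13, 30, 29);
        for (Window w : split(lastHourAggr, now)) {
            // Each window would drive one bounded query:
            // history_datetime >= w.start AND history_datetime < w.end
            System.out.println(w.start + " .. " + w.end);
        }
    }
}
```

With this shape, the memory needed per query is bounded by one hour of samples rather than by the size of the whole backlog, and lastHourAggr could be advanced after each window completes.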

Additional info:
According to the dump taken when the OutOfMemoryError occurred, a single object was holding a vector with 1465765 elements (roughly 1.5 million) and taking up a total of 747972248 bytes (approximately 748 MB, or 713 MiB).
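
For reference, the figures from the dump can be turned into a per-row cost. This is only a back-of-the-envelope sketch: the class name VectorFootprint is hypothetical, and it assumes the reported byte total covers the vector together with everything its elements reference.

```java
// Hypothetical helper: converts the dump figures into per-element and MiB values.
final class VectorFootprint {

    // Average number of bytes attributed to each element of the vector.
    static double bytesPerElement(long totalBytes, long elements) {
        return (double) totalBytes / elements;
    }

    // Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long totalBytes = 747_972_248L; // total size reported in the dump
        long elements = 1_465_765L;     // number of elements in the vector
        System.out.printf("bytes per element: %.2f%n", bytesPerElement(totalBytes, elements));
        System.out.printf("total size in MiB: %.2f%n", toMiB(totalBytes));
    }
}
```

This works out to roughly 510 bytes per buffered row and about 713 MiB in total.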

This vector was a local variable of a thread started by class AggregationToHourly, and that thread was running the following query:

  SELECT
    history_id,
    history_datetime,
    current_user_name,
    vm_id, 
    minutes_in_status, 
    cpu_usage_percent, 
    memory_usage_percent, 
    user_cpu_usage_percent, 
    system_cpu_usage_percent, 
    vm_ip,
    currently_running_on_host, 
    vm_configuration_version, 
    current_host_configuration_version
  FROM
    vm_samples_history
  WHERE
    vm_status = 1 AND
    history_datetime >= (
      SELECT
        var_datetime
      FROM
        history_configuration
      WHERE
        var_name = 'lastHourAggr'
    )
  ORDER BY
    history_datetime,
    current_user_name,
    vm_id

To collect the results of the query, the PostgreSQL driver creates a Vector that holds the rows; that is the Vector that is growing.

By this calculation they have 1297627 rows / 60 minutes per hour = 21627 VM-hours of data to aggregate; divided by the 800 VMs we know they should have, that is about 27 hours of non-aggregated data.
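
The same estimate, written out as a small sketch. The class and method names are hypothetical; the inputs are the figures quoted above, and the 60 samples per VM per hour assume one sample per minute.

```java
// Hypothetical helper: estimates how many hours of samples are waiting to be aggregated.
final class BacklogEstimate {

    // rows / samplesPerHour gives VM-hours; dividing by the VM count gives wall-clock hours.
    static double backlogHours(long rows, int samplesPerHour, int vmCount) {
        return (double) rows / samplesPerHour / vmCount;
    }

    public static void main(String[] args) {
        long rows = 1_297_627L;  // un-aggregated rows, as quoted in this report
        int samplesPerHour = 60; // assumption: one sample per VM per minute
        int vmCount = 800;       // number of VMs the customer is expected to have
        System.out.printf("VM-hours to aggregate: %.0f%n", (double) rows / samplesPerHour);
        System.out.printf("backlog in hours: %.1f%n", backlogHours(rows, samplesPerHour, vmCount));
    }
}
```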

I have no problem with stopping the aggregation process when an error occurs, but if the problem is not fixed within the boundaries of hoursToKeepSamples, then you lose data anyway.

Comment 1 Sandro Bonazzola 2014-10-17 12:31:00 UTC

oVirt 3.5 has been released and should include the fix for this issue.