Bug 1091686

Summary:	OVIRT35 - [RFE][scale] - prevent OutOfMemoryError after starting the dwh service.
Product:	[Retired] oVirt	Reporter:	Yaniv Lavi <ylavi>
Component:	ovirt-engine-dwh	Assignee:	Shirly Radco <sradco>
Status:	CLOSED CURRENTRELEASE	QA Contact:	movciari
Severity:	high	Docs Contact:
Priority:	high
Version:	3.5	CC:	aberezin, bazulay, gklein, iheim, jentrena, juan.hernandez, lyarwood, nobody, pablo.iranzo, pep, pstehlik, rbalakri, Rhev-m-bugs, sradco, ybronhei, yeylon, ylavi
Target Milestone:	---	Keywords:	FutureFeature
Target Release:	3.5.0
Hardware:	x86_64
OS:	Linux
Whiteboard:	infra
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:	1014134	Environment:
Last Closed:	2014-10-17 12:31:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Infra	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1014134
Bug Blocks:

Description Yaniv Lavi 2014-04-27 09:31:25 UTC

Description of problem:

The ovirt-engine-dwh service appears to partially crash shortly after launch because java.lang.OutOfMemoryError exceptions are thrown. The host (webadmin, userportal, api, ssh, etc.) is almost unresponsive until "service ovirt-engine-dwh stop" is run. Increasing the heap space [1] has been suggested, but the customer wants to ensure that this does not merely mask underlying issues that will resurface later.

Example log showing the errors:

/var/log/ovirt-engine/ovirt-engine-dwhd.log

~~~
2013-09-19 13:30:29|ETL Service Started
Exception in thread "Thread-300" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
        at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7947)
        at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$3.run(AggregationToHourly.java:23119)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Calendar.<init>(Calendar.java:951)
        at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619)
        at java.util.Calendar.createCalendar(Calendar.java:1030)
        at java.util.Calendar.getInstance(Calendar.java:968)
        at routines.RoutineHistoryETL.dateCompare(RoutineHistoryETL.java:163)
        at
~~~

How reproducible:
Always

Steps to Reproduce:
1. Attempt to start the ovirt-engine-dwh service.
2. Service starts, OOM JVM errors logged, host becomes sluggish and almost unresponsive.
3. Stopping the ovirt-engine-dwh service returns the host to normal.

Actual results:
OOM errors logged and the host becomes almost unresponsive.

Expected results:
No OOM errors, the host remains responsive, and data is logged into the dwh DB. This should be done via an hour-by-hour aggregation approach.

Additional info:
According to the heap dump taken when the OutOfMemoryError occurred, one object was holding a vector with 1465765 elements (roughly one and a half million) and taking a total of 747972248 bytes (approximately 748 MB, or 713 MiB).
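
As a rough sanity check on the dump figures above, the sketch below divides the reported byte total by the reported element count. The class and method names are illustrative only; the two constants are the values quoted from the dump. Integer division is used, so the results are rounded down.

```java
class HeapDumpArithmetic {
    // Figures quoted from the heap dump in this report.
    static final long ELEMENTS = 1_465_765L;
    static final long TOTAL_BYTES = 747_972_248L;

    /** Average retained bytes per vector element, rounded down. */
    static long bytesPerElement() {
        return TOTAL_BYTES / ELEMENTS;
    }

    /** Total size in whole MiB (1 MiB = 1,048,576 bytes), rounded down. */
    static long totalMiB() {
        return TOTAL_BYTES / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(bytesPerElement() + " bytes per element"); // 510
        System.out.println(totalMiB() + " MiB total");                // 713
    }
}
```

At roughly 510 bytes per buffered row, the size of the vector scales directly with the number of unaggregated rows the query returns.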

This vector was a local variable created by a thread created in class AggregationToHourly, and that thread was running the following query:

  SELECT
    history_id,
    history_datetime,
    current_user_name,
    vm_id, 
    minutes_in_status, 
    cpu_usage_percent, 
    memory_usage_percent, 
    user_cpu_usage_percent, 
    system_cpu_usage_percent, 
    vm_ip,
    currently_running_on_host, 
    vm_configuration_version, 
    current_host_configuration_version
  FROM
    vm_samples_history
  WHERE
    vm_status = 1 AND
    history_datetime >= (
      SELECT
        var_datetime
      FROM
        history_configuration
      WHERE
        var_name = 'lastHourAggr'
    )
  ORDER BY
    history_datetime,
    current_user_name,
    vm_id

To collect the results of the query, the PostgreSQL JDBC driver creates a vector that holds all of the returned rows. That is the vector that is growing.
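
One general way to avoid buffering a whole result set is cursor-based fetching. The sketch below is not the fix that went into oVirt 3.5 (this report does not say what that was), and the generated AggregationToHourly code may be structured differently. It assumes the PostgreSQL JDBC driver's documented behaviour that rows are fetched in batches only when autocommit is off, the result set is forward-only, and a positive fetch size is set. The names StreamingQuerySketch, RowHandler and streamRows are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class StreamingQuerySketch {
    /** Callback invoked once per row; hypothetical helper type for this sketch. */
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    /**
     * Runs a query and hands rows to the handler one at a time,
     * returning the number of rows processed.
     */
    static long streamRows(Connection conn, String sql, int fetchSize, RowHandler handler)
            throws SQLException {
        // With autocommit on (or a fetch size of 0) the PostgreSQL driver
        // reads the entire result set into memory before returning.
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        long count = 0;
        try (PreparedStatement stmt = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(fetchSize); // rows fetched per round trip
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs);
                    count++;
                }
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            // Restore the caller's autocommit setting.
            conn.setAutoCommit(previousAutoCommit);
        }
        return count;
    }
}
```

With this pattern, memory use is bounded by the fetch size rather than by the total row count, provided the handler does not itself accumulate every row.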

By this calculation they have 1297627 rows / 60 minutes per hour = 21627 VM-hours of data to aggregate; divided by the 800 VMs we know they should have, that is about 27 hours of non-aggregated data.
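
The backlog estimate above, together with the hour-by-hour approach requested under "Expected results", can be sketched as follows. This is an illustration only, not the DWH implementation: it assumes one sample per VM per minute, and the names BacklogSketch, backlogHours and hourlyWindows are hypothetical. Each window could bound one run of the aggregation query, for example history_datetime >= start AND history_datetime < end, so that only about one hour of samples is loaded at a time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class BacklogSketch {
    /** Estimated hours of unaggregated data, assuming one sample per VM per minute. */
    static long backlogHours(long rows, long vmCount) {
        long samplesPerVmHour = 60;
        return rows / samplesPerVmHour / vmCount; // integer division, rounds down
    }

    /** Splits [from, to) into consecutive one-hour windows; the last may be shorter. */
    static List<Instant[]> hourlyWindows(Instant from, Instant to) {
        List<Instant[]> windows = new ArrayList<>();
        Instant start = from;
        while (start.isBefore(to)) {
            Instant end = start.plus(Duration.ofHours(1));
            if (end.isAfter(to)) {
                end = to;
            }
            windows.add(new Instant[] {start, end});
            start = end;
        }
        return windows;
    }

    public static void main(String[] args) {
        // Row and VM counts taken from the calculation in this report.
        System.out.println(backlogHours(1_297_627L, 800L) + " hours of backlog");
        // Example timestamps are made up for illustration.
        Instant from = Instant.parse("2013-09-18T10:00:00Z");
        Instant to = Instant.parse("2013-09-19T13:00:00Z");
        System.out.println(hourlyWindows(from, to).size() + " hourly windows");
    }
}
```

Processing one window at a time keeps the working set near 60 rows per VM regardless of how long the service was down.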

I have no problem with stopping the aggregation process when an error occurs, but if the problem is not fixed within the window set by hoursToKeepSamples, then you lose data anyway.

Comment 1 Sandro Bonazzola 2014-10-17 12:31:00 UTC

oVirt 3.5 has been released and should include the fix for this issue.