Bug 1091686

Summary: OVIRT35 - [RFE][scale] - prevent OutOfMemoryError after starting the dwh service.
Product: [Retired] oVirt Reporter: Yaniv Lavi <ylavi>
Component: ovirt-engine-dwhAssignee: Shirly Radco <sradco>
Status: CLOSED CURRENTRELEASE QA Contact: movciari
Severity: high Docs Contact:
Priority: high    
Version: 3.5CC: aberezin, bazulay, gklein, iheim, jentrena, juan.hernandez, lyarwood, nobody, pablo.iranzo, pep, pstehlik, rbalakri, Rhev-m-bugs, sradco, ybronhei, yeylon, ylavi
Target Milestone: ---Keywords: FutureFeature
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: 1014134 Environment:
Last Closed: 2014-10-17 12:31:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1014134    
Bug Blocks:    

Description Yaniv Lavi 2014-04-27 09:31:25 UTC
Description of problem:

The ovirt-engine-dwh service appears to partially crash shortly after launch due to java.lang.OutOfMemoryError exceptions being thrown. The host (webadmin, userportal, api, ssh etc) is almost unresponsive until a service ovirt-engine-dwh stop is called. Increasing the heap space [1] has been suggested but the customer wants to ensure that this is not avoiding underlying issues that will still present themselves later.

Example log showing the errors :

/var/log/ovirt-engine/ovirt-engine-dwhd.log

~~~
2013-09-19 13:30:29|ETL Service Started
Exception in thread "Thread-300" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
        at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly.tJDBCInput_3Process(AggregationToHourly.java:7947)
        at ovirt_engine_dwh.aggregationtohourly_3_2.AggregationToHourly$3.run(AggregationToHourly.java:23119)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Calendar.<init>(Calendar.java:951)
        at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619)
        at java.util.Calendar.createCalendar(Calendar.java:1030)
        at java.util.Calendar.getInstance(Calendar.java:968)
        at routines.RoutineHistoryETL.dateCompare(RoutineHistoryETL.java:163)
        at
~~~

How reproducible:
Always

Steps to Reproduce:
1. Attempt to start the ovirt-engine-dwh service.
2. Service starts, OOM JVM errors logged, host becomes sluggish and almost unresponsive.
3. Stopping the ovirt-engine-dwh service returns the host to normal.

Actual results:
OOM errors logged and the host becomes almost unresponsive.

Expected results:
No OOM errors, host remains responsive and data is logged into the dwh DB. This should be done via hour by hour aggregation approach.

Additional info:
According to the dump when the OutOfMemory occurred there was one object holding a vector with 1465765 elements (one million and a half approx), and taking a total of 747972248 bytes (750 MiB approx).

This vector was a local variable created by a thread created in class AggregationToHourly, and that thread was running the following query:

  SELECT
    history_id,
    history_datetime,
    current_user_name,
    vm_id, 
    minutes_in_status, 
    cpu_usage_percent, 
    memory_usage_percent, 
    user_cpu_usage_percent, 
    system_cpu_usage_percent, 
    vm_ip,
    currently_running_on_host, 
    vm_configuration_version, 
    current_host_configuration_version
  FROM
    vm_samples_history
  WHERE
    vm_status = 1 AND
    history_datetime >= (
      SELECT
        var_datetime
      FROM
        history_configuration
      WHERE
        var_name = 'lastHourAggr'
    )
  ORDER BY
    history_datetime,
    current_user_name,
    vm_id

In order to collect the results of the query the PostgresSQL driver creates a vector, that holds the rows, that is the Vector that is growing.

by this calculation they have 1297627 rows / 60 minutes pre hour = 21627 vm data to aggregate / 800 vms we know they should have = 27 hours of non aggregated data.

I have no problem stopping the aggregation process when error occurs, but if a problem is not fixed within the boundaries of hoursToKeepSamples than you lose data anyway.

Comment 1 Sandro Bonazzola 2014-10-17 12:31:00 UTC
oVirt 3.5 has been released and should include the fix for this issue.