Bug 1014134
| Field | Value |
|---|---|
| Summary | PRD35 - [RFE][scale] - prevent OutOfMemoryError after starting the dwh service. |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Lee Yarwood <lyarwood> |
| Component | ovirt-engine-dwh |
| Assignee | Shirly Radco <sradco> |
| Status | CLOSED ERRATA |
| QA Contact | movciari |
| Severity | high |
| Docs Contact | |
| Priority | high |
| Version | 3.2.0 |
| CC | aberezin, bazulay, iheim, jentrena, juan.hernandez, juwu, lnovich, lyarwood, nobody, pablo.iranzo, pep, pstehlik, rbalakri, Rhev-m-bugs, sradco, ybronhei, yeylon, ylavi |
| Target Milestone | --- |
| Keywords | FutureFeature |
| Target Release | 3.5.0 |
| Hardware | All |
| OS | Linux |
| Whiteboard | infra |
| Fixed In Version | oVirt 3.5 Alpha 1 |
| Doc Type | Enhancement |
| Doc Text | Previously, the data warehouse service became unresponsive and hit an OutOfMemoryError on service start when the hourly aggregation tried to aggregate around 1.5 million records. With this update, the service aggregates one hour of data at a time before moving on to the next hour, so data aggregation now scales. |
| Story Points | --- |
| Clone Of | |
| | 1091686 (view as bug list) |
| Environment | |
| Last Closed | 2015-02-11 18:14:14 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Infra |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1091686, 1142923, 1156165 |
Description (Lee Yarwood, 2013-10-01 13:02:40 UTC)
Yaniv Dary (comment 2):

Please elaborate on the RHEV-M environment: how many hosts, VMs, clusters, and so on?

I see two issues in the logs. One, resolved in 3.2, causes collection to fail for a VM with more than 16 IPs; the other is the heap size, and the solution there is to increase the default. A memory leak is unlikely.

Yaniv

Lee Yarwood (comment 3):

(In reply to Yaniv Dary from comment #2)
> Please elaborate on the rhevm environment. how many hosts\vms\clusters and so on?

Already provided in the description; let me know if you need anything else. Quoting the actual text here:

(In reply to Lee Yarwood from comment #0)
> Description of problem:
>
> The RHEV-M host itself is a Xen hosted guest with 16GB RAM, 8 sockets 1 core each, 1x20GB RAW disk hosted within a mirrored LV. We unfortunately do not have SAR data to highlight the perf hit while running the service.
>
> The env it controls consists of ~20 DCs, ~20 clusters, ~20 SDs, ~128 hosts and ~800 vms. I use ~ as this is a rapidly growing env expected to double in size within 6 months.

Yaniv Dary:

(In reply to Lee Yarwood from comment #3)
> Already provided in the description, let me know if you need anything else.

OK, I think the best course of action is to increase the heap size and see if that resolves the issue. The large amount of data is probably the cause. Juan, do you agree?

Yaniv

Lee Yarwood (comment 5):

(In reply to Yaniv Dary from comment #2)
> I see two issues in the logs. One that was resolved in 3.2 that causes failure of collection due to vm with more than 16 IPs [..]

By the way, could you confirm the BZ? The customer is running the latest 3.2.z version after all.

Yaniv Dary (comment 6):

(In reply to Lee Yarwood from comment #5)
> Btw, could you confirm the BZ? The customer is running the latest 3.2.z version after all.

Then it must be an old log message, and it should work with a larger heap.

(In reply to Yaniv Dary from comment #6)
> Then it must be a old log message and it should work with a larger heap.

Also, because this is a VM, you need to check the storage and host stats: the cause of the slowness can be the large amount of data being written to disk while other VMs are also busy.

Juan Hernández (comment 8):

It is my understanding that the very nature of the DWH application is to load huge amounts of data from the database into memory and then crunch it, so I agree with Yaniv that the only way forward is to increase the size of the heap.
I don't think additional resources are needed: 16 GiB and 8 CPUs should be enough; if they aren't, then in my opinion there is something that we need to fix. In this regard, having a heap dump of the DWH process and a copy of the database could help. Not sure if we want to invest time in this.

I would also suggest restricting the number of CPUs available to the DWH service, so that if it goes out of memory again the garbage collector will not take over all the CPUs of the machine. This can be achieved using cgroups, for example:

    # yum -y install libcgroup
    # chkconfig cgconfig on
    # cat >> /etc/cgconfig.conf <<.
    group dwh {
        cpuset {
            cpuset.cpus = "0,1";  # Use only CPUs 0 and 1
            cpuset.mems = "0";    # Assuming that there is only one memory node
        }
    }
    .
    # service cgconfig restart

Then edit the history_service.sh script and replace the "exec" that runs the Java virtual machine with "cgexec -g cpuset:dwh":

    cgexec -g cpuset:dwh ${JAVA_HOME}/bin/java ...

Note that this is just a hack to avoid the DWH service overloading the machine, not a solution to the problem. The solution is to increase the heap size.

Lee Yarwood:

Brilliant, thanks for the feedback. I'll pass these suggestions on to the customer now.

Yaniv Dary:

Itamar, you added current version flags. What is the fix you want here?

Yaniv

Itamar Heim (comment 11):

IIUC, Juan's suggestions are important and should get their own bug, but do we know why we OOM'd? We need to solve that. I flagged this for z-stream to consider. The question is whether this is a one-off or reproducible.

Yaniv Dary (comment 12):

(In reply to Itamar Heim from comment #11)
> iiuc, juan suggestions are important and should get their own bug.

Please specify what you mean.

> but do we know why we OOM'd? we need to solve that.

We don't give unlimited heap size. The standard is good enough for most cases. In very large environments an increase of the defaults is needed. I don't think we should change the default currently.

Itamar Heim:

(In reply to Yaniv Dary from comment #12)
> Please specify what you mean.

The suggestion in comment 8 to cap CPU, maybe.

> We don't give unlimited heap size. The standard is good enough for most cases.

So how do we track/monitor/warn users so it won't explode for them?

Barak (comment 15):

Lee, is this an external DB?

Lee Yarwood:

(In reply to Barak from comment #15)
> Lee- is this external DB?

Internal.

Yaniv Dary:

Please update.

Yaniv

Liran Zelkha (comment 37):

I have created a test environment with 1000 VMs and 200 hosts running on FakeVDSM. The DWHD process started and ran without any issues. The engine is responsive and relatively fast. Note that I changed the engine process to use 12GB of heap size. Did the customer change the heap size configuration of the engine process?

Lee Yarwood (comment 38):

(In reply to Liran Zelkha from comment #37)
> I have created a test environment with 1000 VMs and 200 hosts running on FakeVDSM.

Hey Liran, which version are you testing here? 3.3?

> Did the customer change the heap size configuration of the engine process?

Pep? Julio? AFAIK they did change the DWH [1] HEAP size but I don't know by how much. I have no idea if they changed the engine HEAP at all [2].
[1] /usr/share/ovirt-engine-dwh/etl/history_service.sh
[2] /etc/ovirt-engine/engine.conf

(In reply to Lee Yarwood from comment #38)
> Pep? Julio? AFAIK they did change the DWH [1] HEAP size but I don't know by how much. I have no idea if they changed the engine HEAP at all [2].

We'll come back to you on [1], so keeping the needinfo. No changes in [2] /etc/ovirt-engine/engine.conf, so the default values for the engine from /usr/share/ovirt-engine/conf/engine.conf.defaults apply:

    ENGINE_HEAP_MIN=1g
    ENGINE_HEAP_MAX=1g
    ENGINE_PERM_MIN=256m
    ENGINE_PERM_MAX=256m

I'm using RHEVM 3.2.

This will never do. The engine heap is too low for so many VMs/hosts. How much memory does the machine have? Increase the memory and the engine will run much faster.

Concerning the engine, my recommendation is to increase to 8GB. For DWH, if it works at 2GB, leave it like that. The engine would work faster and they will probably not feel the DWH.

Juan Hernández (comment 58):

According to the dump, when the OutOfMemory occurred there was one object holding a vector with 1465765 elements (approximately one and a half million), taking a total of 747972248 bytes (approximately 750 MiB). This vector was a local variable created by a thread created in class AggregationToHourly, and that thread was running the following query:

    SELECT
        history_id, history_datetime, current_user_name, vm_id,
        minutes_in_status, cpu_usage_percent, memory_usage_percent,
        user_cpu_usage_percent, system_cpu_usage_percent, vm_ip,
        currently_running_on_host, vm_configuration_version,
        current_host_configuration_version
    FROM vm_samples_history
    WHERE vm_status = 1
      AND history_datetime >= (
          SELECT var_datetime
          FROM history_configuration
          WHERE var_name = 'lastHourAggr'
      )
    ORDER BY history_datetime, current_user_name, vm_id

In order to collect the results of the query, the PostgreSQL driver creates a vector that holds the rows; that is the Vector that is growing. So apparently that query is returning at least 1465765 rows, which is hard to believe. Can you run this query manually in the customer environment and see what the number of results is?

If the query does return this huge number of rows, then we will need to modify the DWH so that it runs queries with the fetch size set to something other than zero. Zero is the default, and it means that the JDBC driver loads all rows into memory before returning the results. For details see here:

http://jdbc.postgresql.org/documentation/92/query.html#fetchsize-example

The code that runs these queries is generated by Talend Studio, so it won't be easy to change. If we need to do this then we will probably have to request it from Talend, or use something like AspectJ to modify the binary code.

If the query doesn't return a large number of rows, then we are probably facing a bug in the JDBC driver, or a bug in the database itself. If the issue can be reproduced then I would suggest generating a dump of the network traffic between the DWH and the database for further analysis.
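The fetch-size mechanism Juan describes can be illustrated with a small, self-contained JDBC sketch. This is not the DWH code (which is generated by Talend Studio): the class name, connection URL, and credentials below are hypothetical, and the query is a trimmed version of the one quoted above. It only shows the mechanism: with autocommit disabled and a non-zero fetch size, the PostgreSQL JDBC driver streams rows in batches instead of materializing the whole result set in memory.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical sketch of cursor-based fetching; names and credentials are illustrative.
public class HourlyAggregationFetchSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/ovirt_engine_history"; // assumed database name
        try (Connection conn = DriverManager.getConnection(url, "history_user", "secret")) {
            // Cursor-based fetching in the PostgreSQL driver only works outside autocommit.
            conn.setAutoCommit(false);

            String sql = "SELECT history_id, history_datetime, vm_id, minutes_in_status "
                       + "FROM vm_samples_history "
                       + "WHERE vm_status = 1 "
                       + "AND history_datetime >= (SELECT var_datetime FROM history_configuration "
                       + "                         WHERE var_name = 'lastHourAggr') "
                       + "ORDER BY history_datetime, vm_id";

            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                // Fetch size 0 (the default) loads every row into memory before returning;
                // a non-zero value makes the driver pull rows in chunks of this size.
                stmt.setFetchSize(1000);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // Aggregate each row here instead of accumulating all rows in a Vector.
                    }
                }
            }
            conn.commit();
        }
    }
}
```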
Josep 'Pep' Turro Mauri (comment 59):

(In reply to Juan Hernández from comment #58)
> So apparently that query is returning at least 1465765 rows, which is hard to believe. Can you run this query manually in the customer environment and see what the number of results is?

I ran it on a copy of their DB (file 00911410_ovirt_engine_history_sql.xz in the download location, so not the very latest, live environment) and it does suggest that the order of magnitude is correct:

    => SELECT count(*) FROM vm_samples_history
       WHERE vm_status = 1
         AND history_datetime >= (SELECT var_datetime FROM history_configuration
                                  WHERE var_name = 'lastHourAggr');
      count
    ---------
     1297627
    (1 row)

    => SELECT var_datetime FROM history_configuration WHERE var_name = 'lastHourAggr';
          var_datetime
    ------------------------
     2013-08-29 17:00:00+02
    (1 row)

Yaniv, how do we proceed from here?

Yaniv Dary:

(In reply to Josep 'Pep' Turro Mauri from comment #59)
> Yaniv, how do we proceed from here?

By this calculation they have 1297627 rows / 60 minutes per hour = 21627 VM samples to aggregate / 800 VMs we know they should have = 27 hours of non-aggregated data. The first thing will be to change the last aggregation date to now, so this backlog will not be aggregated, and then check the log closely and figure out why samples data is not being aggregated (it should happen every hour). The growing select caused by data not being aggregated is what is causing this issue.

Yaniv

I think the best approach will be to do the hourly aggregation with one select per hour instead of a single select of everything. This will increase the scale that the DWH can handle, and if a failure happens it will not increase the select size; later hours will be handled separately. The downside is that once an hour is skipped, the data will never be aggregated for that hour. I would also advise doing the same for the daily aggregation.

This kind of change would first need to be accepted by PMs and customers (meaning an aggregation failure may lead to a missing hour/day). In addition, such a change cannot be delivered in a z-stream release. Yaniv, for now and to make sure the customer issue has been addressed, please add a comment explaining how to update the last aggregated hour variable to NOW().

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0177.html
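As a postscript to the request above to document how the last aggregated hour variable could be moved to NOW(), here is a minimal SQL sketch built only from the table and column names that appear earlier in this thread (history_configuration, var_name, var_datetime). The exact procedure given to the customer is not recorded in this report, and the suggestion to stop the DWH service first is an assumption, not something stated in the thread.

```sql
-- Illustrative sketch only, not the documented support procedure.
-- Moves the hourly aggregation watermark to the current time so that the next
-- run does not try to aggregate the entire backlog in one huge select.
-- Assumption: run against the ovirt_engine_history database while the DWH
-- service is stopped; any skipped samples would then never be aggregated into
-- hourly data, which is the trade-off discussed above.
UPDATE history_configuration
   SET var_datetime = now()
 WHERE var_name = 'lastHourAggr';
```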