Bug 1322485 - C&U Metrics Processor memory and timeout issues associated with 'perf_rollup' method and vmware host and vm instances
Summary: C&U Metrics Processor memory and timeout issues associated with 'perf_rollup'...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.5.0
Hardware: All
OS: All
Priority: high
Severity: urgent
Target Milestone: GA
Target Release: 5.6.0
Assignee: Keenan Brock
QA Contact: Nandini Chandra
URL:
Whiteboard: c&u
Depends On:
Blocks: 1325405
 
Reported: 2016-03-30 14:45 UTC by Felix Dewaleyne
Modified: 2019-11-14 07:42 UTC (History)
11 users

Fixed In Version: 5.6.0.1
Doc Type: Bug Fix
Doc Text:
In the previous version of CloudForms Management Engine, the metrics processor worker timed out due to the large amount of data to process. This patch updates metric_rollups to load only recent performance state, which resolves the issue.
Clone Of:
: 1325405 (view as bug list)
Environment:
Last Closed: 2016-06-29 15:46:10 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
Patch to reduce metric_rollup query (1.33 KB, patch)
2016-04-14 18:03 UTC, Keenan Brock
kbrock: review+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1348 0 normal SHIPPED_LIVE CFME 5.6.0 bug fixes and enhancement update 2016-06-29 18:50:04 UTC

Description Felix Dewaleyne 2016-03-30 14:45:12 UTC
Description of problem:
The metrics processor worker times out due to an excessively large amount of data to process.

Version-Release number of selected component (if applicable):
5.5.2

How reproducible:
All the time in the customer's Watford environment.

Steps to Reproduce:
1. This triggers automatically with capacity and utilization enabled in an environment with 4 appliances (one DB, one UI, and two workers handling the rest of the tasks).

Actual results:
Very long SQL request times are observed, and the worker is not killed despite massively exceeding the configured timeout.

Expected results:
Better handling of the situation so that the brokers do not die and cause all operations to fail.

Additional info:
This is tied to 30k+ performance rows being processed at the same time.

Comment 3 dmetzger 2016-04-04 19:45:53 UTC
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1244905

Comment 4 Keenan Brock 2016-04-06 17:25:25 UTC
The culprit query is taking 1,473,229.6 ms. SQL queries do not cancel well, so it sticks around for a while.

We've identified the query and are looking into ways of not querying so much data.
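
For anyone reproducing the timing, a rough way to measure the suspect load from a Rails console (hypothetical snippet, not from the BZ; assumes `host` is one of the affected HostEsx records and that it has the vim_performance_states association):

    # Hypothetical timing check; host is an affected HostEsx record
    require "benchmark"
    elapsed = Benchmark.realtime do
      host.vim_performance_states.load # forces the query to run
    end
    puts "#{(elapsed * 1000).round(1)} ms"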

Comment 5 Thomas Hennessy 2016-04-07 03:44:32 UTC
Created attachment 1144536 [details]
Distillation from customer March 25 logs identifying HostEsx Instance Ids where perf_rollup consistently has failed (timeout or process killed)

This distillation of the perf_rollup messages for 11 VMware hosts is intended to focus attention on the 11 Host instances for which perf_rollup processing repeatedly fails, suggesting something common among them that might help identify what is causing the high memory usage and long run times.

Comment 6 Thomas Hennessy 2016-04-07 03:50:34 UTC
Created attachment 1144537 [details]
standard top pdf reports showing Watford zone problems beginning about 9:23 PM on March 19

Comment 7 Thomas Hennessy 2016-04-07 03:51:19 UTC
Created attachment 1144538 [details]
standard top pdf reports showing Watford zone problems beginning about 9:23 PM on March 19

Comment 8 Thomas Hennessy 2016-04-07 03:54:26 UTC
Created attachment 1144552 [details]
standard top pdf reports showing Watford zone problems beginning about 9:23 PM on March 19

Initial problems begin at 9:23 PM on March 19, 2016, when Amazon C&U collections are re-activated.  However, later in the week (after March 24), when Amazon C&U has been de-activated, memory problems persist even after several appliance reboots. This necessitated requesting the customer database, which has been installed on the appliance at 10.10.182.179 and from which the process 19036 logs detailing the problem were harvested with the Rails debug trace active.

Comment 9 Keenan Brock 2016-04-07 15:29:46 UTC
looks like 'watford' and Amazon are public, changing privacy again.

Comment 11 CFME Bot 2016-04-08 18:25:56 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/ee1c533314368d8b90d46071da4960632cf753b1

commit ee1c533314368d8b90d46071da4960632cf753b1
Author:     Keenan Brock <kbrock>
AuthorDate: Thu Apr 7 11:14:37 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Thu Apr 7 12:56:11 2016 -0400

    metric_rollups: only load recent performance state
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1322485

 app/models/metric/rollup.rb | 2 +-
 lib/miq_preloader.rb        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
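
For context, the change stops preloading each resource's full vim_performance_states history and only loads the states for the period being rolled up. A minimal sketch of that idea (hypothetical code, not the shipped diff; assumes the Rails 4.2 Preloader, which takes an optional scope as its third argument, and that `resources` holds the VMs/hosts being processed):

    # Sketch only: bound the preload to the hour being rolled up
    hour_start = Time.now.utc.beginning_of_hour
    recent_states = VimPerformanceState.where("timestamp >= ?", hour_start)
    # The third argument scopes the rows fetched for the association
    ActiveRecord::Associations::Preloader.new.preload(resources, :vim_performance_states, recent_states)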

Comment 12 Keenan Brock 2016-04-08 18:55:18 UTC
merged in master: https://github.com/ManageIQ/manageiq/pull/7783

currently testing in 5.5: https://gitlab.cloudforms.lab.eng.rdu2.redhat.com/cloudforms/cfme/merge_requests/890

You can download a patch: https://gitlab.cloudforms.lab.eng.rdu2.redhat.com/cloudforms/cfme/merge_requests/890.patch

I tend to just remove the header part and then apply with patch, though I stumble a little over whether it is -p0, -p1, or -p2:

     patch -p1 < file.patch


Can you apply the patch and verify that this resolves your problem?

Comment 14 Keenan Brock 2016-04-14 18:03:56 UTC
Created attachment 1147355 [details]
Patch to reduce metric_rollup query

Hello Colin Arnott,

I'm sorry, I thought you would be able to use the instructions I included last week.

I have attached the patch. Please transfer this file to your system and type:

    # change to the vmdb directory
    vmdb
    # apply the supplied patch. (It can live in any directory)
    patch -p1 890.patch

Please let me know if this resolves your issue.

Comment 15 Keenan Brock 2016-04-14 18:27:50 UTC
sorry, the instructions should read:

    vmdb
    patch -p1 < 890.patch

thank you

Comment 16 Colin Arnott 2016-04-14 20:12:55 UTC
No worries, and sorry for the confusion. That will be sufficient for my needs.

I will let you know what my traction with this patch is.

Comment 26 CFME Bot 2016-04-21 16:16:25 UTC
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=aae6f3135eb87701ecb8bc2c533f37c53c32c675

commit aae6f3135eb87701ecb8bc2c533f37c53c32c675
Author:     Keenan Brock <kbrock>
AuthorDate: Thu Apr 7 11:14:37 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Fri Apr 8 13:39:03 2016 -0400

    metric_rollups: only load recent performance state
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1322485

 app/models/metric/rollup.rb | 2 +-
 lib/miq_preloader.rb        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Comment 27 CFME Bot 2016-04-21 16:16:30 UTC
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=32af3c0d406cb08c02e120f644a09efc368a224a

commit 32af3c0d406cb08c02e120f644a09efc368a224a
Merge: ffaa642 aae6f31
Author:     Oleg Barenboim <obarenbo>
AuthorDate: Thu Apr 21 12:15:48 2016 -0400
Commit:     Oleg Barenboim <obarenbo>
CommitDate: Thu Apr 21 12:15:48 2016 -0400

    Merge branch 'perf_cap_u_55' into '5.5.z'
    
    metric_rollups: only load recent performance state
    
    Clean merge of [#7783](https://github.com/ManageIQ/manageiq/pull/7783)
    
    This fixes problem where Cap&U perf rollup hit the db too hard
    
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1322485
    
    See merge request !890

 app/models/metric/rollup.rb | 2 +-
 lib/miq_preloader.rb        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Comment 30 Keenan Brock 2016-04-27 15:15:36 UTC
We no longer try to process all records, so that is fixed.
There is a similar bug where we download too much data from the actual provider; that is not addressed in this BZ.

Comment 31 Keenan Brock 2016-04-29 20:44:31 UTC
The error being thrown is unrelated. Here is a fix:

It looks like some of the VMs did not properly clear the cluster.

Please run the following from a Rails console on an appliance:

# Clear the ems_cluster reference on orphaned VMs
VmOrTemplate.where(:ems_id => nil).where.not(:ems_cluster_id => nil).update_all(:ems_cluster_id => nil)
# Remove queued C&U perf_rollup messages for orphaned VMs
MiqQueue.where(:class_name  => "ManageIQ::Providers::Vmware::InfraManager::Vm",
               :instance_id => VmOrTemplate.where(:ems_id => nil).select(:id),
               :method_name => "perf_rollup").destroy_all.count
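
If you want to see how many rows these will touch before mutating anything, a dry run from the same console (hypothetical check; same models and conditions as above):

    # Dry run: count rows first, mutate nothing
    VmOrTemplate.where(:ems_id => nil).where.not(:ems_cluster_id => nil).count
    MiqQueue.where(:class_name  => "ManageIQ::Providers::Vmware::InfraManager::Vm",
                   :instance_id => VmOrTemplate.where(:ems_id => nil).select(:id),
                   :method_name => "perf_rollup").count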

Comment 32 Nandini Chandra 2016-05-18 03:43:07 UTC
On my appliance, I changed the log levels to debug, but I wasn't able to see this query in the logs at all:

SELECT "vim_performance_states".* FROM "vim_performance_states" ...

Reproducer:
1) Manage a provider and enable C&U collection for the provider.
2) Capture C&U data for a few hours/days.
3) Disable C&U collection for at least 1 day.
4) Re-enable C&U collection.

Before fix:
When C&U collection is re-enabled, CFME fetches all historical performance data.

After fix:
When C&U collection is re-enabled, CFME fetches performance data for the current hour only.
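
Illustratively, the bounded query can be inspected from a Rails console (hypothetical snippet; `vm` stands for any VM with captured metrics):

    # Roughly the shape of the query after the fix: bounded by timestamp
    VimPerformanceState.where(:resource_id => vm.id)
                       .where("timestamp >= ?", Time.now.utc.beginning_of_hour)
                       .to_sql
    # => SELECT "vim_performance_states".* FROM "vim_performance_states"
    #    WHERE "vim_performance_states"."resource_id" = ... AND (timestamp >= '...')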

Verified that CFME fetches performance data for the current hour only by looking at the DB itself. Marking this as VERIFIED.

Verified in 5.6.0.6

Comment 34 errata-xmlrpc 2016-06-29 15:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

