Description of problem:
The metrics processor worker times out due to a far too large amount of data to process.

Version-Release number of selected component (if applicable):
5.5.2

How reproducible:
All the time in the customer's Watford environment.

Steps to Reproduce:
1. This triggers automatically with capacity and utilization enabled in an environment with 4 appliances (one db, one ui, two workers with the rest of the tasks).

Actual results:
Huge times noted on the SQL requests, and the worker is not killed despite massively breaching the set timeout.

Expected results:
Better handling of the situation, so that the brokers do not die and make all operations fail.

Additional info:
This is tied to 30k+ performance rows being processed at the same time.
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1244905
The culprit query is taking 1,473,229.6 ms. SQL queries do not cancel well, so it sticks around for a while. We've identified the query and are looking into ways of not querying so much data.
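For illustration only, a minimal sketch of the scoping idea from a rails console (this is not the actual patch; VimPerformanceState and its timestamp column come from the vim_performance_states table discussed later in this BZ, and the one-hour window is an assumption):

# minimal sketch: load only recent performance state rows
# instead of every historical row for an instance
recent_states = VimPerformanceState.where("timestamp >= ?", 1.hour.ago.utc)
recent_states.count  # should stay far below the 30k+ rows seen in this environment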
Created attachment 1144536 [details]
Distillation from customer March 25 logs identifying HostEsx instance ids where perf_rollup has consistently failed (timeout or process killed)

This distillation of the perf_rollup messages for 11 VMware Hosts is intended to focus attention on these 11 Host instances for which perf_rollup processing seems to repeatedly fail, suggesting that there is something common about them which might help identify what is causing the high memory usage and the long running times.
Created attachment 1144537 [details]
standard top pdf reports showing Watford zone problems beginning about 9:23 PM on March 19
Created attachment 1144538 [details]
standard top pdf reports showing Watford zone problems beginning about 9:23 PM on March 19
Created attachment 1144552 [details]
standard top pdf reports showing Watford zone problems beginning about 9:23 PM on March 19

Initial problems begin at 9:23 PM on March 19, 2016, when Amazon C&U collections are re-activated. However, later in the week (after March 24), when Amazon C&U has been de-activated, memory problems continue to persist even after several appliance reboots. This necessitated the request for the customer database, which has been installed in the appliance at 10.10.182.179 and from which the process 19036 logs were harvested, detailing the problem with the rails debug trace active.
Looks like 'watford' and Amazon are public; changing privacy again.
https://github.com/ManageIQ/manageiq/pull/7783
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/ee1c533314368d8b90d46071da4960632cf753b1

commit ee1c533314368d8b90d46071da4960632cf753b1
Author:     Keenan Brock <kbrock>
AuthorDate: Thu Apr 7 11:14:37 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Thu Apr 7 12:56:11 2016 -0400

    metric_rollups: only load recent performance state

    https://bugzilla.redhat.com/show_bug.cgi?id=1322485

 app/models/metric/rollup.rb | 2 +-
 lib/miq_preloader.rb        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
Merged in master: https://github.com/ManageIQ/manageiq/pull/7783
Currently testing in 5.5: https://gitlab.cloudforms.lab.eng.rdu2.redhat.com/cloudforms/cfme/merge_requests/890

You can download a patch:
https://gitlab.cloudforms.lab.eng.rdu2.redhat.com/cloudforms/cfme/merge_requests/890.patch

I tend to just remove the header part and then apply with patch, though I sometimes stumble over whether it is -p0, -p1, or -p2:

patch -p1 < file.patch

Can you apply the patch and verify that this resolves your problem?
Created attachment 1147355 [details]
Patch to reduce metric_rollup query

Hello Colin Arnott,

I'm sorry, I thought you would be able to use the instructions I included last week. I have attached the patch. Please transfer this file to your system and type:

# change to the vmdb directory
vmdb
# apply the supplied patch (it can live in any directory)
patch -p1 890.patch

Please let me know if this resolves your issue.
Sorry, the instructions should read:

vmdb
patch -p1 < 890.patch

Thank you.
No worries, and sorry for the confusion. That will be sufficient for my needs. I will let you know what my traction with this patch is.
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=aae6f3135eb87701ecb8bc2c533f37c53c32c675

commit aae6f3135eb87701ecb8bc2c533f37c53c32c675
Author:     Keenan Brock <kbrock>
AuthorDate: Thu Apr 7 11:14:37 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Fri Apr 8 13:39:03 2016 -0400

    metric_rollups: only load recent performance state

    https://bugzilla.redhat.com/show_bug.cgi?id=1322485

 app/models/metric/rollup.rb | 2 +-
 lib/miq_preloader.rb        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=32af3c0d406cb08c02e120f644a09efc368a224a

commit 32af3c0d406cb08c02e120f644a09efc368a224a
Merge: ffaa642 aae6f31
Author:     Oleg Barenboim <obarenbo>
AuthorDate: Thu Apr 21 12:15:48 2016 -0400
Commit:     Oleg Barenboim <obarenbo>
CommitDate: Thu Apr 21 12:15:48 2016 -0400

    Merge branch 'perf_cap_u_55' into '5.5.z'

    metric_rollups: only load recent performance state

    Clean merge of [#7783](https://github.com/ManageIQ/manageiq/pull/7783)

    This fixes a problem where Cap&U perf rollup hit the db too hard.

    https://bugzilla.redhat.com/show_bug.cgi?id=1322485

    See merge request !890

 app/models/metric/rollup.rb | 2 +-
 lib/miq_preloader.rb        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
We no longer try to process all records, so that is fixed. There is a similar bug where we download too much data from the actual provider; that is not addressed in this BZ.
The error being thrown is unrelated. Here is a fix.

It looks like some of the Vms did not properly clear the cluster. Please run the following from a rails console on a machine:

# clear ems cluster on orphaned vms
VmOrTemplate.where(:ems_id => nil).where.not(:ems_cluster_id => nil).update_all(:ems_cluster_id => nil)

# remove cap and u queue entries for orphaned vms
MiqQueue.where(:class_name  => "ManageIQ::Providers::Vmware::InfraManager::Vm",
               :instance_id => VmOrTemplate.where(:ems_id => nil).select(:id),
               :method_name => "perf_rollup").destroy_all.count
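For reference, a sanity check built from the same scopes as the cleanup above (a sketch only; run it before and after to confirm both counts drop to zero):

# orphaned vms still referencing a cluster (should be 0 after the update_all)
VmOrTemplate.where(:ems_id => nil).where.not(:ems_cluster_id => nil).count

# queued perf_rollup messages for orphaned vms (should be 0 after the destroy_all)
MiqQueue.where(:class_name  => "ManageIQ::Providers::Vmware::InfraManager::Vm",
               :instance_id => VmOrTemplate.where(:ems_id => nil).select(:id),
               :method_name => "perf_rollup").count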
On my appliance, I changed the log levels to debug, but I wasn't able to see this query at all in the logs:

SELECT "vim_performance_states".* FROM "vim_performance_states" ...

Reproducer:
1) Manage a provider and enable C&U collection for the provider.
2) Capture C&U data for a few hours/days.
3) Disable C&U collection for at least 1 day.
4) Re-enable C&U collection.

Before fix: When C&U collection is re-enabled, CFME fetches all historical performance data.
After fix: When C&U collection is re-enabled, CFME fetches performance data for the current hour only.

Verified that CFME fetches performance data for the current hour only by looking at the DB itself. Marking this as VERIFIED.

Verified in 5.6.0.6
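For completeness, one way such a DB check could be done from a rails console (a sketch under the assumption that VimPerformanceState maps to vim_performance_states and has a timestamp column; the exact query used during verification was not recorded):

# rows captured in the current hour vs. all rows ever captured
VimPerformanceState.where("timestamp >= ?", Time.now.utc.beginning_of_hour).count
VimPerformanceState.count
# after the fix, re-enabling C&U should only add rows in the current hour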
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1348