Bug 1331803 - perf_capture_time method resulting in the failure to schedule C&U capture
Summary: perf_capture_time method resulting in the failure to schedule C&U capture
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.6.0
Assignee: Keenan Brock
QA Contact: Nandini Chandra
URL:
Whiteboard: c&u
Depends On:
Blocks: 1333096
 
Reported: 2016-04-29 14:44 UTC by Colin Arnott
Modified: 2019-10-10 12:01 UTC
CC List: 13 users

Fixed In Version: 5.6.0.6
Doc Type: Bug Fix
Doc Text:
Previously, the perf_capture_time method responsible for generating and capturing capacity and utilization data in the message queue failed frequently. As a result, capacity and utilization capture was not scheduled for most elements in the environment. This occurred when metrics for virtual machines orphaned from an EMS were handled incorrectly. This patch fixes the issue so that when an EMS is orphaned, the contents of ems_cluster are cleared, and metrics collection is only scheduled for virtual machines with an EMS.
Clone Of:
Clones: 1333096
Environment:
Last Closed: 2016-06-29 15:56:14 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
Patch to fix cloud provider orphans breaking cap and u collection (2.05 KB, patch)
2016-05-04 15:20 UTC, Keenan Brock
no flags
Patch to fix cloud provider orphans (v2) (2.04 KB, patch)
2016-05-12 14:33 UTC, Keenan Brock
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1348 0 normal SHIPPED_LIVE CFME 5.6.0 bug fixes and enhancement update 2016-06-29 18:50:04 UTC

Description Colin Arnott 2016-04-29 14:44:05 UTC
Description of problem:
'The "perf_capture_time" method that is responsible for generating the C&U capture messages into the message queue is abending about 20 times per hour, resulting in the failure to schedule C&U capture for most elements in the environment, and is probably the reason that no C&U data capture is [occurring].' --Tom Hennessy


Version-Release number of selected component (if applicable):
5.5.2

How reproducible:
"all the time in [a specific] environment" --Felix Dewaleyne

Steps to Reproduce:
1. "this triggers automatically with capacity and utilization enabled in an environment with 4 appliances (one db, one ui, two workers with the rest of the tasks)" --Felix Dewaleyne

Actual results:
(see problem description)

Expected results:
successful collection of C&U data

Additional info:
"this is tied to 30k+ performance rows being treated at the same time"  --Felix Dewaleyne

this bz was forked off of 1322485

Comment 13 Keenan Brock 2016-04-29 20:49:40 UTC
It looks like some of the Vms did not properly clear the cluster

A quickfix:

Please run the following from a rails console on a machine:

# clear ems cluster on orphaned vms
VmOrTemplate.where(:ems_id => nil).where.not(:ems_cluster_id => nil).update_all(:ems_cluster_id => nil)
# remove cap and u for orphaned vms
MiqQueue.where(:class_name => "ManageIQ::Providers::Vmware::InfraManager::Vm", :instance_id => VmOrTemplate.where(:ems_id => nil).select(:id), :method_name => "perf_rollup").destroy_all.count

I will make a code change so this will not happen in the future.
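The intent of the quickfix above can be illustrated with a plain-Ruby sketch (no database; `VmRow` is a hypothetical stand-in Struct, not the real ActiveRecord model):

```ruby
# Hypothetical stand-in for a VM row; not the real VmOrTemplate model.
VmRow = Struct.new(:id, :ems_id, :ems_cluster_id)

vms = [
  VmRow.new(1, 10,  100),  # managed VM, keeps its cluster
  VmRow.new(2, nil, 200),  # orphaned VM still linked to a cluster -> the bug
  VmRow.new(3, nil, nil),  # orphaned VM already detached, nothing to do
]

# In-memory equivalent of:
#   where(:ems_id => nil).where.not(:ems_cluster_id => nil)
#     .update_all(:ems_cluster_id => nil)
vms.select { |vm| vm.ems_id.nil? && !vm.ems_cluster_id.nil? }
   .each   { |vm| vm.ems_cluster_id = nil }

p vms.map(&:ems_cluster_id)  # => [100, nil, nil]
```

Only the orphaned-but-still-clustered VM (id 2) is touched; managed and fully detached VMs are left alone.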

Comment 14 Keenan Brock 2016-04-29 21:45:04 UTC
Here is the code I used to diagnose the problem:

I ran this in the rails console

AvailabilityZone.where(:ext_management_system => nil).count
EmsCluster.where(:ext_management_system => nil).count
Host.where(:ext_management_system => nil).count
VmOrTemplate.where(:ext_management_system => nil).count
VmOrTemplate.where(:ext_management_system => nil).group(:ems_cluster_id).count

A count of 0 for everything is ideal.
But a VM without an EMS is not that bad; it is a retired or orphaned VM.
The issue to be aware of is whether it is still linked to a cluster.

The sample database that I had showed that there were records in the db that had a cluster and no ems.
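The shape of what the last diagnostic query returns can be sketched with plain Ruby (in-memory rows standing in for `VmOrTemplate` records; the real query runs in the database):

```ruby
# Hypothetical in-memory rows standing in for VmOrTemplate records.
Row = Struct.new(:ems_id, :ems_cluster_id)

rows = [
  Row.new(nil, 5),    # orphaned, still linked to cluster 5
  Row.new(nil, 5),    # orphaned, still linked to cluster 5
  Row.new(nil, nil),  # orphaned and detached (harmless)
  Row.new(1,   5),    # managed VM
]

# In-memory equivalent of:
#   where(:ext_management_system => nil).group(:ems_cluster_id).count
orphans_by_cluster = rows.select { |r| r.ems_id.nil? }
                         .group_by(&:ems_cluster_id)
                         .transform_values(&:count)

p orphans_by_cluster  # => {5=>2, nil=>1}
```

Any non-nil key with a positive count indicates orphaned VMs still pointing at a cluster, which is the condition that broke the C&U scheduler.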

Comment 15 Keenan Brock 2016-04-29 21:45:22 UTC
https://github.com/ManageIQ/manageiq/pull/8363

Comment 17 CFME Bot 2016-05-03 12:51:09 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/15bbd18c6df1234bdeb59ac5602273a6534cb440

commit 15bbd18c6df1234bdeb59ac5602273a6534cb440
Author:     Keenan Brock <kbrock>
AuthorDate: Fri Apr 29 17:06:51 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Apr 30 12:42:58 2016 -0400

    Clear cluster when ems is cleared
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1331803
    
    When an ems is orphaned, clear out the ems_cluster as well

 app/models/ems_cluster.rb          |  2 +-
 app/models/host.rb                 |  1 +
 app/models/vm_or_template.rb       |  1 +
 spec/models/host_spec.rb           | 20 ++++++++++++++++++++
 spec/models/vm_or_template_spec.rb | 21 +++++++++++++++++++++
 5 files changed, 44 insertions(+), 1 deletion(-)
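The idea behind this commit can be sketched with a minimal plain-Ruby class (hypothetical `SketchVm`, not the real model code): when a VM is disconnected from its EMS, the cluster link is cleared in the same step.

```ruby
# Minimal sketch of the fix's idea; SketchVm is hypothetical, not the real model.
class SketchVm
  attr_accessor :ems_id, :ems_cluster_id

  def initialize(ems_id:, ems_cluster_id:)
    @ems_id = ems_id
    @ems_cluster_id = ems_cluster_id
  end

  # Before the fix, only ems_id was cleared, leaving a dangling cluster link.
  def disconnect_ems
    self.ems_id = nil
    self.ems_cluster_id = nil  # the fixed behavior: no dangling cluster link
  end
end

vm = SketchVm.new(ems_id: 7, ems_cluster_id: 42)
vm.disconnect_ems
p [vm.ems_id, vm.ems_cluster_id]  # => [nil, nil]
```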

Comment 19 Keenan Brock 2016-05-04 15:20:16 UTC
Created attachment 1153910 [details]
Patch to fix cloud provider orphans breaking cap and u collection

This can be applied with

    patch -p0 < cap_u_ems_c.patch

Comment 20 Keenan Brock 2016-05-04 15:51:40 UTC
To see if the patch will work, use the following script.
It should show 'BAD' before the patch and 'GOOD' after it.

    vmdb
    bundle exec rails c

    Zone.all.map { |zone| Metric::Targets.capture_cloud_targets(zone).detect { |vm| vm.ems_id.nil? } ? "#{zone.name}: BAD" : "#{zone.name}: GOOD" }

Comment 22 CFME Bot 2016-05-06 03:05:50 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/db0a24629e9442b2ba754bb374c6141b9aba7430

commit db0a24629e9442b2ba754bb374c6141b9aba7430
Author:     Keenan Brock <kbrock>
AuthorDate: Wed May 4 22:43:34 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Wed May 4 22:53:30 2016 -0400

    Metric::Target capture_cloud_targets rewrite
    
    - only bring back vms that are on (when availability_zone.nil?)
    - only bring back vms that have an ems
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1332579
    https://bugzilla.redhat.com/show_bug.cgi?id=1331803

 app/models/metric/targets.rb | 1 +
 1 file changed, 1 insertion(+)
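Reading the commit message, the rewritten target selection appears to apply two filters; a plain-Ruby sketch of that logic (hypothetical `CloudVm` stand-ins, not the real `Metric::Targets` code):

```ruby
# Hypothetical stand-in records for the capture_cloud_targets selection logic.
CloudVm = Struct.new(:name, :ems_id, :availability_zone_id, :power_state)

vms = [
  CloudVm.new("a", 1,   9,   "on"),   # managed, in an AZ: included
  CloudVm.new("b", nil, 9,   "on"),   # orphaned (no EMS): excluded
  CloudVm.new("c", 1,   nil, "off"),  # no AZ and powered off: excluded
  CloudVm.new("d", 1,   nil, "on"),   # no AZ but powered on: included
]

targets = vms.select do |vm|
  next false if vm.ems_id.nil?                       # only VMs that have an EMS
  vm.availability_zone_id || vm.power_state == "on"  # only "on" VMs when AZ is nil
end

p targets.map(&:name)  # => ["a", "d"]
```

Filtering out the EMS-less VMs is what keeps orphans out of the C&U queue; skipping powered-off VMs without an availability zone is the performance win mentioned later in comment 34.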

Comment 29 Keenan Brock 2016-05-12 14:33:35 UTC
Created attachment 1156704 [details]
Patch to fix cloud provider orphans (v2)

Sorry about that.
The previous patch assumed some previous changes to the models.

Here is one that I have tested on 5.5.2.4

    vmdb
    patch -p1 < cap_u_ems_c.v2.patch

Comment 34 Keenan Brock 2016-05-25 18:46:27 UTC
Hi Nandini,

The issue arises when a VM is in a cluster or an availability zone.
When the VM is deleted on EC2, we call that orphaning: locally, we remove the association between the VM and the EMS.

The bug is that while the vm is not linked to the ems, it is still linked to the cluster or availability zone.

We approached this with 2 similar fixes:
1. Ensured that the VMs won't be included in the submitted 'cap&u' job (this BZ)
2. Cleared the cluster and availability zone when orphaning a VM.

I suppose with #2, #1 is not necessary, but #1 also got us some performance improvements.


If fix #2 is interfering with your testing, you can always orphan a VM and then, from the rails console, reassign a cluster to it:

VmOrTemplate.orphaned.first.update_attributes(:ems_cluster => EmsCluster.first)

Comment 36 Nandini Chandra 2016-06-14 18:11:25 UTC
Thanks Keenan.

Steps to reproduce:

1)Terminate an instance from the ec2 console.
2)Refresh ems.

With the fix, run these commands on the rails console:

1. Vm.where(:ems_id => nil).count
Output: VM count is greater than 0.

2. Vm.where(:ems_id => nil).where.not(:availability_zone_id => nil).count
Output: VM count = 0

Verified that C&U continues to work after this step.


3. Vm.where(:ems_id => nil).update_all(:availability_zone_id => AvailabilityZone.first.id)

Verified that C&U continues to work after this step.

Verified in 5.6.0.10

Comment 38 errata-xmlrpc 2016-06-29 15:56:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

