Description of problem:
'The "perf_capture_time" method that is responsible for generating the C&U capture messages into the message queue is abending about 20 times per hour resulting in the failure to schedule C&U capture for most elements in the environment and is probably the reason that no C&U data capture is [occurring].' --Tom Hennessy

Version-Release number of selected component (if applicable):
5.5.2

How reproducible:
"all the time in [a specific] environment" --Felix Dewaleyne

Steps to Reproduce:
1. "this triggers automatically with capacity and utilization enabled in an environment with 4 appliances (one db, one ui, two workers with the rest of the tasks)" --Felix Dewaleyne

Actual results:
(see problem description)

Expected results:
successful collection of C&U data

Additional info:
"this is tied to 30k+ performance rows being treated at the same time" --Felix Dewaleyne

This bz was forked off of 1322485.
It looks like some of the VMs did not properly clear the cluster.

A quick fix: please run the following from a rails console on a machine:

# clear ems cluster on orphaned vms
VmOrTemplate.where(:ems_id => nil).where.not(:ems_cluster_id => nil).update_all(:ems_cluster_id => nil)

# remove cap and u for orphaned vms
MiqQueue.where(:class_name => "ManageIQ::Providers::Vmware::InfraManager::Vm", :instance_id => VmOrTemplate.where(:ems_id => nil).select(:id), :method_name => "perf_rollup").destroy_all.count

I will make a code change so this will not happen in the future.
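For anyone who wants to reason about the two quick-fix statements without a database, here is a minimal sketch on plain Ruby objects. `OrphanVm`, `QueueItem`, `clear_stale_clusters` and `purge_orphan_rollups` are hypothetical names made up for illustration, not ManageIQ code, and the sketch skips the `class_name` condition of the real queue query.

```ruby
# Hypothetical stand-ins for VmOrTemplate and MiqQueue rows (not the real models).
OrphanVm = Struct.new(:id, :ems_id, :ems_cluster_id)
QueueItem = Struct.new(:instance_id, :method_name)

# Step 1: clear the cluster on VMs that have no EMS; returns how many changed.
def clear_stale_clusters(vms)
  stale = vms.select { |vm| vm.ems_id.nil? && !vm.ems_cluster_id.nil? }
  stale.each { |vm| vm.ems_cluster_id = nil }
  stale.size
end

# Step 2: return the queue without perf_rollup items that target VMs lacking an EMS.
def purge_orphan_rollups(queue, vms)
  orphan_ids = vms.select { |vm| vm.ems_id.nil? }.map(&:id)
  queue.reject do |item|
    item.method_name == "perf_rollup" && orphan_ids.include?(item.instance_id)
  end
end

vms = [
  OrphanVm.new(1, 10, 100),  # managed VM, still in a cluster
  OrphanVm.new(2, nil, 100), # orphaned VM with a stale cluster link
  OrphanVm.new(3, nil, nil)  # orphaned VM, already clean
]
queue = [
  QueueItem.new(1, "perf_rollup"),
  QueueItem.new(2, "perf_rollup"),
  QueueItem.new(2, "perf_capture")
]

cleared = clear_stale_clusters(vms)
queue = purge_orphan_rollups(queue, vms)
puts "cleared #{cleared} cluster link(s), #{queue.size} queue item(s) remain"
```

The managed VM keeps its cluster; only rows with no EMS are touched, which is the same scoping the console statements rely on.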
Here is the code I used to diagnose the problem. I ran this in the rails console:

AvailabilityZone.where(:ext_management_system => nil).count
EmsCluster.where(:ext_management_system => nil).count
Host.where(:ext_management_system => nil).count
VmOrTemplate.where(:ext_management_system => nil).count
VmOrTemplate.where(:ext_management_system => nil).group(:ems_cluster_id).count

A count of 0 for everything is ideal. But a VM without an ems is not that bad: it is a retired or orphaned VM. The issue to be aware of is whether it is still linked to a cluster.

The sample database that I had showed that there were records in the db that had a cluster and no ems.
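To show how the grouped count is read, here is a small sketch of the same idea on in-memory rows. `VmRow`, `orphans_by_cluster` and `stale_cluster_links?` are hypothetical names for illustration only; the real check is the `group(:ems_cluster_id).count` query in the comment above.

```ruby
# Hypothetical stand-in for a VmOrTemplate row (not the real model).
VmRow = Struct.new(:name, :ems_id, :ems_cluster_id)

# Count VMs without an EMS, keyed by cluster id.
# A nil key is harmless (retired or orphaned VM); any other key is a stale link.
def orphans_by_cluster(vms)
  vms.select { |vm| vm.ems_id.nil? }
     .group_by(&:ems_cluster_id)
     .transform_values(&:size)
end

# True when at least one EMS-less VM still points at a cluster.
def stale_cluster_links?(vms)
  orphans_by_cluster(vms).keys.any? { |cluster_id| !cluster_id.nil? }
end

rows = [
  VmRow.new("managed", 1, 7),
  VmRow.new("stale", nil, 7),
  VmRow.new("clean-orphan", nil, nil)
]
p orphans_by_cluster(rows)
puts stale_cluster_links?(rows) ? "stale cluster links found" : "no stale links"
```

Any non-nil key in the grouped result corresponds to the "cluster and no ems" records found in the sample database.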
https://github.com/ManageIQ/manageiq/pull/8363
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/15bbd18c6df1234bdeb59ac5602273a6534cb440

commit 15bbd18c6df1234bdeb59ac5602273a6534cb440
Author:     Keenan Brock <kbrock>
AuthorDate: Fri Apr 29 17:06:51 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Apr 30 12:42:58 2016 -0400

    Clear cluster when ems is cleared

    https://bugzilla.redhat.com/show_bug.cgi?id=1331803

    When an ems is orphaned, clear out the ems_cluser as well

 app/models/ems_cluster.rb          |  2 +-
 app/models/host.rb                 |  1 +
 app/models/vm_or_template.rb       |  1 +
 spec/models/host_spec.rb           | 20 ++++++++++++++++++++
 spec/models/vm_or_template_spec.rb | 21 +++++++++++++++++++++
 5 files changed, 44 insertions(+), 1 deletion(-)
https://github.com/ManageIQ/manageiq/pull/8429
Created attachment 1153910
Patch to fix cloud provider orphans breaking cap and u collection

This can be applied with:

patch -p0 < cap_u_ems_c.patch
To see if the patch will work, use the following script. It should show 'BAD' before the patch and 'GOOD' after it.

vmdb
bundle exec rails c
Zone.all.map { |zone| Metric::Targets.capture_cloud_targets(zone).detect { |vm| vm.ems_id.nil? } ? "#{zone.name}: BAD" : "#{zone.name}: GOOD" }
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/db0a24629e9442b2ba754bb374c6141b9aba7430

commit db0a24629e9442b2ba754bb374c6141b9aba7430
Author:     Keenan Brock <kbrock>
AuthorDate: Wed May 4 22:43:34 2016 -0400
Commit:     Keenan Brock <kbrock>
CommitDate: Wed May 4 22:53:30 2016 -0400

    Metric::Target capture_cloud_targets rewrite

    - only bring back vms that are on (when availability_zone.nil?)
    - only bring back vms that have an ems

    https://bugzilla.redhat.com/show_bug.cgi?id=1332579
    https://bugzilla.redhat.com/show_bug.cgi?id=1331803

 app/models/metric/targets.rb | 1 +
 1 file changed, 1 insertion(+)
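The two bullet points in this commit message can be illustrated with a plain-Ruby filter. This is only a sketch of one reading of the commit message, not the code in app/models/metric/targets.rb: `CloudVm`, `capture_candidates`, `zone_verdict`, the `state` field and its "on" value are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a cloud VM row (not the real model).
CloudVm = Struct.new(:name, :ems_id, :availability_zone_id, :state)

# Keep only VMs that still belong to an EMS. VMs without an availability
# zone must additionally be powered on (assumed to be state == "on").
def capture_candidates(vms)
  vms.select do |vm|
    next false if vm.ems_id.nil?
    !vm.availability_zone_id.nil? || vm.state == "on"
  end
end

# Same BAD/GOOD verdict as the console check: BAD if any target lacks an EMS.
def zone_verdict(targets)
  targets.detect { |vm| vm.ems_id.nil? } ? "BAD" : "GOOD"
end

vms = [
  CloudVm.new("in-zone", 1, 5, "off"),
  CloudVm.new("orphan", nil, 5, "on"),
  CloudVm.new("no-zone-off", 1, nil, "off"),
  CloudVm.new("no-zone-on", 1, nil, "on")
]
puts "unfiltered: #{zone_verdict(vms)}"
puts "filtered:   #{zone_verdict(capture_candidates(vms))}"
```

Filtering before the capture messages are queued means an orphaned VM never reaches the queue, so a leftover zone or cluster link on it no longer matters for scheduling.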
Created attachment 1156704
Patch to fix cloud provider orphans (v2)

Sorry about that. The previous patch assumed some previous changes to the models. Here is one that I have tested on 5.5.2.4:

vmdb
patch -p1 < cap_u_ems_c.v2.patch
https://github.com/ManageIQ/manageiq/pull/8473
Hi Nandini,

The issue arises when a VM is in a cluster or an availability zone and is then deleted on ec2; we call that orphaning. Locally, we remove the association between the VM and the ems.

The bug is that while the VM is no longer linked to the ems, it is still linked to the cluster or availability zone.

We approached this with 2 similar fixes:

1. Ensure that the VMs won't be included in the submit 'cap&u' job (this BZ).
2. Clear the cluster and availability zone when orphaning a VM.

I suppose with #2, #1 is not necessary. But #1 also got us some performance improvements.

If fix #2 is interfering with your testing, you could always orphan a VM and then go into the rails console and assign a cluster to the orphaned VM:

VmOrTemplate.orphaned.first.update_attributes(:ems_cluster => EmsCluster.first)
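As a rough illustration of fix #2, the sketch below clears the placement links whenever the EMS link is removed. It is not the code from the pull requests: `SketchVm` and its setter are hypothetical and only show the invariant that an orphaned VM should not keep a cluster or availability zone.

```ruby
# Hypothetical model-like class (not the real VmOrTemplate).
class SketchVm
  attr_reader :ems_id
  attr_accessor :ems_cluster_id, :availability_zone_id

  def initialize(ems_id:, ems_cluster_id: nil, availability_zone_id: nil)
    @ems_id = ems_id
    @ems_cluster_id = ems_cluster_id
    @availability_zone_id = availability_zone_id
  end

  # Assigning a nil EMS orphans the VM, so drop the placement links as well.
  def ems_id=(value)
    @ems_id = value
    return unless value.nil?

    @ems_cluster_id = nil
    @availability_zone_id = nil
  end

  def orphaned?
    ems_id.nil?
  end
end

vm = SketchVm.new(ems_id: 1, ems_cluster_id: 2, availability_zone_id: 3)
vm.ems_id = nil # simulate the refresh that orphans the VM
puts "orphaned=#{vm.orphaned?} cluster=#{vm.ems_cluster_id.inspect} zone=#{vm.availability_zone_id.inspect}"
```

Reassigning a cluster after orphaning, as in the console line above, deliberately recreates the broken state so that fix #1 can be tested on its own.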
Thanks Keenan.

Steps to reproduce:
1) Terminate an instance from the ec2 console.
2) Refresh ems.

With the fix, run these commands on the rails console:

1. Vm.where(:ems_id => nil).count
   Output: VM count is greater than 0.

2. Vm.where(:ems_id => nil).where.not(:availability_zone_id => nil).count
   Output: VM count = 0
   Verified that C&U continues to work after this step.

3. Vm.where(:ems_id => nil).update_all(:availability_zone_id => AvailabilityZone.first.id)
   Verified that C&U continues to work after this step.

Verified in 5.6.0.10
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1348