Bug 1569090

Summary: During metrics collection for a VMWare provider, SOAP exception occurs during queryAvailablePerfMetric for non-existent VM
Product: Red Hat CloudForms Management Engine Reporter: Robb Manes <rmanes>
Component: ProvidersAssignee: Adam Grare <agrare>
Status: CLOSED CURRENTRELEASE QA Contact: Nandini Chandra <nachandr>
Severity: medium Docs Contact:
Priority: high    
Version: 5.8.0CC: cpelland, gblomqui, jfrey, jhardy, nachandr, obarenbo, rmanes, rovalent, simaishi
Target Milestone: GAKeywords: TestOnly, ZStream
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.10.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1595461 1595462 (view as bug list) Environment:
Last Closed: 2019-02-11 14:06:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: VMware Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1595461, 1595462    

Description Robb Manes 2018-04-18 15:16:37 UTC
Description of problem:

On a 5.8.3.4 appliance, during C&U collection, a VMWare provider returns an error because we query a non-existent VM.

[----] E, [2018-04-16T03:28:06.220587 #4288:108513c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::MetricsCapture.avail_metrics_for_entity) EMS: [VMWARE_PROVIDER] The following error occurred: [Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct: entity' }]
[----] E, [2018-04-16T03:28:06.220740 #4288:108513c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::MetricsCapture#perf_collect_metrics) [hourly] for: [ManageIQ::Providers::Vmware::InfraManager::Vm], [4000000001845], [mdhas_TPS] Unhandled exception during metrics data collection: [Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct: entity' }], class: [Handsoap::Fault]
[----] E, [2018-04-16T03:28:06.220857 #4288:108513c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::MetricsCapture#perf_collect_metrics) [hourly] for: [ManageIQ::Providers::Vmware::InfraManager::Vm], [4000000001845], [mdhas_TPS]   Timings at time of error: {:capture_state=>0.002874135971069336, :vim_connect=>0.055640459060668945, :capture_intervals=>0.00017499923706054688, :capture_counters=>0.03223586082458496}
[----] E, [2018-04-16T03:28:06.221215 #4288:108513c] ERROR -- : [Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct: entity' }  Method:[rescue in perf_collect_metrics]
[----] E, [2018-04-16T03:28:06.221307 #4288:108513c] ERROR -- : (druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/handsoap-b1247a733ca3/lib/handsoap/service.rb:217:in `on_fault'
(druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/handsoap-b1247a733ca3/lib/handsoap/service.rb:307:in `dispatch'
(druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/handsoap-b1247a733ca3/lib/handsoap/service.rb:211:in `invoke'
(druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/VMwareWebService/VimService.rb:596:in `queryAvailablePerfMetric'
(druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/VMwareWebService/MiqVimPerfHistory.rb:112:in `availMetricsForEntity'
(druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1624:in `perform_without_block'
(druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1584:in `perform'
(druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1657:in `block (2 levels) in main_loop'
(druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `loop'
(druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `block in main_loop'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:124:in `avail_metrics_for_entity'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:364:in `block in perf_capture_counters'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:363:in `each'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:363:in `perf_capture_counters'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:304:in `block in perf_collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:304:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:180:in `block in perf_capture'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:177:in `perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:125:in `perf_capture_historical'
/var/www/miq/vmdb/app/models/miq_queue.rb:347:in `block in deliver'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:343:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:106:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:135:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:153:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:147:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:147:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:340:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:337:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:337:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:160:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:134:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:21:in `start_worker'

If we throw in this debug, we can get the details of the object we've called through the QueryAvailablePerfMetric API:

==================
--- /root/VimService.rb	2018-04-16 11:53:12.523931499 -0400
+++ /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/VMwareWebService/VimService.rb	2018-04-16 12:04:27.667226436 -0400
@@ -1,4 +1,5 @@
 require "handsoap"
+require 'logger'
 require 'VMwareWebService/VimTypes'
 
 class VimService < Handsoap::Service
@@ -11,6 +12,7 @@
 
     setNameSpace('urn:vim25')
 
+    @@log = Logger.new("log/vim_debug.log")
     @serviceInstanceMor = VimString.new("ServiceInstance", "ServiceInstance")
     @session_cookie     = nil
 
@@ -593,6 +595,7 @@
   end
 
   def queryAvailablePerfMetric(perfManager, entity, beginTime = nil, endTime = nil, intervalId = nil)
+    @@log.info("Calling queryAvailablePerfMetric with #{perfManager}, #{entity}, #{beginTime}, #{endTime}, #{intervalId}")
     response = invoke("n1:QueryAvailablePerfMetric") do |message|
       message.add "n1:_this", perfManager do |i|
         i.set_attr "type", perfManager.vimType
@@ -603,7 +606,9 @@
       message.add "n1:beginTime", beginTime.to_s  if beginTime
       message.add "n1:endTime", endTime.to_s    if endTime
       message.add "n1:intervalId", intervalId   if intervalId
+      @@log.info("In invoke block, message is: #{message}")
     end
+    @@log.info("Received queryAvailablePerfMetric response: #{response}")
     (parse_response(response, 'QueryAvailablePerfMetricResponse')['returnval'])
   end
==================

This returns the following for the exceptions we saw previously:

I, [2018-04-16T12:58:53.432801 #15839]  INFO -- : Calling queryAvailablePerfMetric with PerfMgr, vm-1849, , , 7200
I, [2018-04-16T12:58:53.433000 #15839]  INFO -- : In invoke block, message is: <n1:QueryAvailablePerfMetric xmlns:n1="urn:vim25">
  <n1:_this type="PerformanceManager">PerfMgr</n1:_this>
  <n1:entity type="VirtualMachine">vm-1849</n1:entity>
  <n1:intervalId>7200</n1:intervalId>
</n1:QueryAvailablePerfMetric>

When checking the environment, the VM 'vm-1849' does not exist, nor does it exist in the vms table in the CFME environment:

irb(main):001:0> provider = ManageIQ::Providers::Vmware::InfraManager.find_by(name: "VMWARE_PROVIDER")
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> #<ManageIQ::Providers::Vmware::InfraManager id: 4000000000001, name: "VMWARE_PROVIDER", created_on: "2017-11-20 20:01:44", updated_on: "2018-04-16 18:30:44", guid: "a2059a88-ce2d-11e7-bf3c-005056a941bc", zone_id: 4000000000004, type: "ManageIQ::Providers::Vmware::InfraManager", api_version: "6.0", uid_ems: "726d897d-858b-4405-bf9b-04e891e9214a", host_default_vnc_port_start: nil, host_default_vnc_port_end: nil, provider_region: nil, last_refresh_error: nil, last_refresh_date: "2018-04-16 18:30:44", provider_id: nil, realm: nil, tenant_id: 4000000000001, project: nil, parent_ems_id: nil, subscription: nil, last_metrics_error: nil, last_metrics_update_date: nil, last_metrics_success_date: nil, tenant_mapping_enabled: nil>

irb(main):002:0> vm = provider.vms.find_by(name: "vm-1849")
=> nil

irb(main):003:0> vm.updated_on
NoMethodError: undefined method `updated_on' for nil:NilClass
	from (irb):3
	from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/console.rb:65:in `start'
	from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/console_helper.rb:9:in `start'
	from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/commands_tasks.rb:78:in `console'
	from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
	from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands.rb:18:in `<top (required)>'
	from bin/rails:4:in `require'
	from bin/rails:4:in `<main>'

Rightfully so, according to the VMWare docs insofar as I can determine, the exception is completely valid from VMWare's SOAP environment, and I do not think we should be querying for non-existent VM's during C&U:

PerformanceManager object: http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.wssdk.apiref.doc%2Fvim.PerformanceManager.html

ManagedObjectReference: http://pubs.vmware.com/vsphere-6-5/index.jsp#com.vmware.wssdk.apiref.doc/vim.PerformanceManager.ProviderSummary.html#entity

VirtualMachine object: http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.wssdk.apiref.doc%2Fvim.VirtualMachine.html

The exception and error thrown causes a lot of clutter and false alarms in the customer's monitoring environment, despite there not being an impact we can determine at the moment directly from this.

Questions are:

A) Are we intentionally trying to query a VM that is non-existent for C&U data?  Is it expected that VMWare PerformanceManager objects can be queried/store data about VM's that no longer exist in the infrastructure?
B) If this is an error, can we not query VM's that do not exist to prevent errors?
C) Since the VM name does not exist in the vms table, where are we getting the VM name when we do C&U to create SOAP calls?


Version-Release number of selected component (if applicable):
5.8.3.4

How reproducible:
Only in customer's environment for this moment in time (but with extreme frequency - every C&U update); I do not have an environment to easily test with.


Actual results:
Exceptions occur when we attempt to query a non-existent VM's performance metrics.

Expected results:
Exceptions should not occur, either by not querying the deleted VM or otherwise checking to determine if the VM still exists.

Comment 2 Adam Grare 2018-04-23 15:54:56 UTC
Hey Robb, what most likely happened here is that while the metrics capture request sat in the queue that VM was deleted.  The vm will be archived (have its ems_id nulled out) which will cause it to not be part of the provider's vms collection anymore which is why `vm = provider.vms.find_by(name: "vm-1849")` returned nil.

You can try `VmOrTemplate.where(:ems_ref => 'vm-1849')` which will search archived vms as well as current vms.
Actually, you searched for vms named 'vm-1849' instead of VM's whose ems_ref is 'vm-1846'.  For VMware we use ems_ref to track the ManagedObjectReference.

> Questions are:

> A) Are we intentionally trying to query a VM that is non-existent for C&U data?  Is it expected that VMWare PerformanceManager objects can be queried/store data about VM's that no longer exist in the infrastructure?

No

> B) If this is an error, can we not query VM's that do not exist to prevent errors?

We can add a check to ensure that the VM's target.ext_management_system isn't nil before querying metrics, it will still fail but at least with a better error message.  It should be noted that this will not catch all errors, this will still happen if we try to query metrics after the vm was deleted but before the refresh runs and disconnects the vm.

> C) Since the VM name does not exist in the vms table, where are we getting the VM name when we do C&U to create SOAP calls?

That's actually the VM's ems_ref not the name.  It looks like the name is mdhas_TPS (from the log line you pasted above).  We are getting the ems_ref from the database because the metrics capture args on the queue store the ID of the VM we are capturing metrics for.  In fact the ID is in the log line, try VmOrTemplate.find(4000000001845).

Comment 3 Adam Grare 2018-04-23 19:56:53 UTC
Actually we do exactly this kind of check already, https://github.com/ManageIQ/manageiq-providers-vmware/blob/f89f47c322e3dda61fdac38ebc37ae20cfa77997/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb#L267-L268

Robb can you run that query again and check for the ems_ref instead of the name to see if we think there is still a VM with that ref attached to the provider?

And you said you already checked the MOB and didn't see that VM?

Comment 10 Adam Grare 2018-05-02 17:10:10 UTC
In this case it looks like the summary.config.uuid property is nil, while the config.uuid property is available.  The simple fix is to try to get the uuid from either summary.config.uuid or config.uuid.  Longer term I think we should look at only not disconnecting vms which are invalid, rather than a single invalid vm causing nothing to be disconnected from the whole environment given this has other side-effects (specifically we will try to collect metrics from deleted VMs).

Comment 13 CFME Bot 2018-05-08 15:32:52 UTC
New commit detected on ManageIQ/vmware_web_service/master:

https://github.com/ManageIQ/vmware_web_service/commit/ae3e1c45e163efef14fa119580bd80eed594c247
commit ae3e1c45e163efef14fa119580bd80eed594c247
Author:     Adam Grare <agrare>
AuthorDate: Wed May  2 13:30:04 2018 -0400
Commit:     Adam Grare <agrare>
CommitDate: Wed May  2 13:30:04 2018 -0400

    Add VirtualMachine config.uuid to VimPropMaps

    Collect config.uuid in addition to summary.config.uuid for cases when
    the later is null.

    https://bugzilla.redhat.com/show_bug.cgi?id=1569090

 lib/VMwareWebService/VimPropMaps.rb | 1 +
 1 file changed, 1 insertion(+)

Comment 14 CFME Bot 2018-05-15 14:57:37 UTC
New commit detected on ManageIQ/manageiq-providers-vmware/master:

https://github.com/ManageIQ/manageiq-providers-vmware/commit/f340d716369626e66c89ba68c534c8861748012a
commit f340d716369626e66c89ba68c534c8861748012a
Author:     Adam Grare <agrare>
AuthorDate: Wed May  2 12:57:20 2018 -0400
Commit:     Adam Grare <agrare>
CommitDate: Wed May  2 12:57:20 2018 -0400

    Try to get VM UUID from summary.config or config

    In some instances a VM will have a nil summary.config.uuid but a valid
    config.uuid.  Attempt to get the UUID from config.uuid as a fallback in
    the event that summary.config.uuid is nil.

    https://bugzilla.redhat.com/show_bug.cgi?id=1569090

 .yamllint | 1 +
 app/models/manageiq/providers/vmware/infra_manager/refresh_parser.rb | 5 +-
 app/models/manageiq/providers/vmware/infra_manager/selector_spec.rb | 1 +
 manageiq-providers-vmware.gemspec | 2 +-
 spec/models/manageiq/providers/vmware/infra_manager/refresh_parser_spec.rb | 2 +-
 spec/models/manageiq/providers/vmware/infra_manager/refresher_spec.rb | 6 +
 spec/tools/vim_data/miq_vim_inventory/virtualMachinesByMor.yml | 911 +-
 7 files changed, 520 insertions(+), 408 deletions(-)

Comment 21 Nandini Chandra 2018-07-24 01:34:25 UTC
verified in 5.10.0.4