Description of problem: On a 5.8.3.4 appliance, during C&U collection, a VMWare provider returns an error because we query a non-existent VM. [----] E, [2018-04-16T03:28:06.220587 #4288:108513c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::MetricsCapture.avail_metrics_for_entity) EMS: [VMWARE_PROVIDER] The following error occurred: [Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct: entity' }] [----] E, [2018-04-16T03:28:06.220740 #4288:108513c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::MetricsCapture#perf_collect_metrics) [hourly] for: [ManageIQ::Providers::Vmware::InfraManager::Vm], [4000000001845], [mdhas_TPS] Unhandled exception during metrics data collection: [Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct: entity' }], class: [Handsoap::Fault] [----] E, [2018-04-16T03:28:06.220857 #4288:108513c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::MetricsCapture#perf_collect_metrics) [hourly] for: [ManageIQ::Providers::Vmware::InfraManager::Vm], [4000000001845], [mdhas_TPS] Timings at time of error: {:capture_state=>0.002874135971069336, :vim_connect=>0.055640459060668945, :capture_intervals=>0.00017499923706054688, :capture_counters=>0.03223586082458496} [----] E, [2018-04-16T03:28:06.221215 #4288:108513c] ERROR -- : [Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct: entity' } Method:[rescue in perf_collect_metrics] [----] E, [2018-04-16T03:28:06.221307 #4288:108513c] ERROR -- : (druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/handsoap-b1247a733ca3/lib/handsoap/service.rb:217:in `on_fault' (druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/handsoap-b1247a733ca3/lib/handsoap/service.rb:307:in `dispatch' (druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/handsoap-b1247a733ca3/lib/handsoap/service.rb:211:in `invoke' (druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/VMwareWebService/VimService.rb:596:in `queryAvailablePerfMetric' (druby://127.0.0.1:40500) /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/VMwareWebService/MiqVimPerfHistory.rb:112:in `availMetricsForEntity' (druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1624:in `perform_without_block' (druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1584:in `perform' (druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1657:in `block (2 levels) in main_loop' (druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `loop' (druby://127.0.0.1:40500) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `block in main_loop' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:124:in `avail_metrics_for_entity' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:364:in `block in perf_capture_counters' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:363:in `each' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:363:in `perf_capture_counters' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:304:in `block in perf_collect_metrics' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-vmware-d69f678e3cb3/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb:304:in `perf_collect_metrics' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:180:in `block in perf_capture' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:177:in `perf_capture' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:125:in `perf_capture_historical' /var/www/miq/vmdb/app/models/miq_queue.rb:347:in `block in deliver' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout' /var/www/miq/vmdb/app/models/miq_queue.rb:343:in `deliver' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:106:in `deliver_queue_message' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:135:in `deliver_message' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:153:in `block in do_work' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:147:in `loop' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:147:in `do_work' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:340:in `block in do_work_loop' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:337:in `loop' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:337:in `do_work_loop' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:160:in `run' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:134:in `start' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:21:in `start_worker' If we throw in this debug, we can get the details of the object we've called through the QueryAvailablePerfMetric API: ================== --- /root/VimService.rb 2018-04-16 11:53:12.523931499 -0400 +++ /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-a29aab0471c6/lib/gems/pending/VMwareWebService/VimService.rb 2018-04-16 12:04:27.667226436 -0400 @@ -1,4 +1,5 @@ require "handsoap" +require 'logger' require 'VMwareWebService/VimTypes' class VimService < Handsoap::Service @@ -11,6 +12,7 @@ setNameSpace('urn:vim25') + @@log = Logger.new("log/vim_debug.log") @serviceInstanceMor = VimString.new("ServiceInstance", "ServiceInstance") @session_cookie = nil @@ -593,6 +595,7 @@ end def queryAvailablePerfMetric(perfManager, entity, beginTime = nil, endTime = nil, intervalId = nil) + @@log.info("Calling queryAvailablePerfMetric with #{perfManager}, #{entity}, #{beginTime}, #{endTime}, #{intervalId}") response = invoke("n1:QueryAvailablePerfMetric") do |message| message.add "n1:_this", perfManager do |i| i.set_attr "type", perfManager.vimType @@ -603,7 +606,9 @@ message.add "n1:beginTime", beginTime.to_s if beginTime message.add "n1:endTime", endTime.to_s if endTime message.add "n1:intervalId", intervalId if intervalId + @@log.info("In invoke block, message is: #{message}") end + @@log.info("Received queryAvailablePerfMetric response: #{response}") (parse_response(response, 'QueryAvailablePerfMetricResponse')['returnval']) end ================== This returns the following for the exceptions we saw previously: I, [2018-04-16T12:58:53.432801 #15839] INFO -- : Calling queryAvailablePerfMetric with PerfMgr, vm-1849, , , 7200 I, [2018-04-16T12:58:53.433000 #15839] INFO -- : In invoke block, message is: <n1:QueryAvailablePerfMetric xmlns:n1="urn:vim25"> <n1:_this type="PerformanceManager">PerfMgr</n1:_this> <n1:entity type="VirtualMachine">vm-1849</n1:entity> <n1:intervalId>7200</n1:intervalId> </n1:QueryAvailablePerfMetric> When checking the environment, the VM 'vm-1849' does not exist, nor does it exist in the vms table in the CFME environment: irb(main):001:0> provider = ManageIQ::Providers::Vmware::InfraManager.find_by(name: "VMWARE_PROVIDER") PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0 => #<ManageIQ::Providers::Vmware::InfraManager id: 4000000000001, name: "VMWARE_PROVIDER", created_on: "2017-11-20 20:01:44", updated_on: "2018-04-16 18:30:44", guid: "a2059a88-ce2d-11e7-bf3c-005056a941bc", zone_id: 4000000000004, type: "ManageIQ::Providers::Vmware::InfraManager", api_version: "6.0", uid_ems: "726d897d-858b-4405-bf9b-04e891e9214a", host_default_vnc_port_start: nil, host_default_vnc_port_end: nil, provider_region: nil, last_refresh_error: nil, last_refresh_date: "2018-04-16 18:30:44", provider_id: nil, realm: nil, tenant_id: 4000000000001, project: nil, parent_ems_id: nil, subscription: nil, last_metrics_error: nil, last_metrics_update_date: nil, last_metrics_success_date: nil, tenant_mapping_enabled: nil> irb(main):002:0> vm = provider.vms.find_by(name: "vm-1849") => nil irb(main):003:0> vm.updated_on NoMethodError: undefined method `updated_on' for nil:NilClass from (irb):3 from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/console.rb:65:in `start' from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/console_helper.rb:9:in `start' from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/commands_tasks.rb:78:in `console' from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands/commands_tasks.rb:49:in `run_command!' from /opt/rh/cfme-gemset/gems/railties-5.0.3/lib/rails/commands.rb:18:in `<top (required)>' from bin/rails:4:in `require' from bin/rails:4:in `<main>' Rightfully so, according to the VMWare docs insofar as I can determine, the exception is completely valid from VMWare's SOAP environment, and I do not think we should be querying for non-existent VM's during C&U: PerformanceManager object: http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.wssdk.apiref.doc%2Fvim.PerformanceManager.html ManagedObjectReference: http://pubs.vmware.com/vsphere-6-5/index.jsp#com.vmware.wssdk.apiref.doc/vim.PerformanceManager.ProviderSummary.html#entity VirtualMachine object: http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.wssdk.apiref.doc%2Fvim.VirtualMachine.html The exception and error thrown causes a lot of clutter and false alarms in the customer's monitoring environment, despite there not being an impact we can determine at the moment directly from this. Questions are: A) Are we intentionally trying to query a VM that is non-existent for C&U data? Is it expected that VMWare PerformanceManager objects can be queried/store data about VM's that no longer exist in the infrastructure? B) If this is an error, can we not query VM's that do not exist to prevent errors? C) Since the VM name does not exist in the vms table, where are we getting the VM name when we do C&U to create SOAP calls? Version-Release number of selected component (if applicable): 5.8.3.4 How reproducible: Only in customer's environment for this moment in time (but with extreme frequency - every C&U update); I do not have an environment to easily test with. Actual results: Exceptions occur when we attempt to query a non-existent VM's performance metrics. Expected results: Exceptions should not occur, either by not querying the deleted VM or otherwise checking to determine if the VM still exists.
Hey Robb, what most likely happened here is that while the metrics capture request sat in the queue that VM was deleted. The vm will be archived (have its ems_id nulled out) which will cause it to not be part of the provider's vms collection anymore which is why `vm = provider.vms.find_by(name: "vm-1849")` returned nil. You can try `VmOrTemplate.where(:ems_ref => 'vm-1849')` which will search archived vms as well as current vms. Actually, you searched for vms named 'vm-1849' instead of VM's whose ems_ref is 'vm-1846'. For VMware we use ems_ref to track the ManagedObjectReference. > Questions are: > A) Are we intentionally trying to query a VM that is non-existent for C&U data? Is it expected that VMWare PerformanceManager objects can be queried/store data about VM's that no longer exist in the infrastructure? No > B) If this is an error, can we not query VM's that do not exist to prevent errors? We can add a check to ensure that the VM's target.ext_management_system isn't nil before querying metrics, it will still fail but at least with a better error message. It should be noted that this will not catch all errors, this will still happen if we try to query metrics after the vm was deleted but before the refresh runs and disconnects the vm. > C) Since the VM name does not exist in the vms table, where are we getting the VM name when we do C&U to create SOAP calls? That's actually the VM's ems_ref not the name. It looks like the name is mdhas_TPS (from the log line you pasted above). We are getting the ems_ref from the database because the metrics capture args on the queue store the ID of the VM we are capturing metrics for. In fact the ID is in the log line, try VmOrTemplate.find(4000000001845).
Actually we do exactly this kind of check already, https://github.com/ManageIQ/manageiq-providers-vmware/blob/f89f47c322e3dda61fdac38ebc37ae20cfa77997/app/models/manageiq/providers/vmware/infra_manager/metrics_capture.rb#L267-L268 Robb can you run that query again and check for the ems_ref instead of the name to see if we think there is still a VM with that ref attached to the provider? And you said you already checked the MOB and didn't see that VM?
In this case it looks like the summary.config.uuid property is nil, while the config.uuid property is available. The simple fix is to try to get the uuid from either summary.config.uuid or config.uuid. Longer term I think we should look at only not disconnecting vms which are invalid, rather than a single invalid vm causing nothing to be disconnected from the whole environment given this has other side-effects (specifically we will try to collect metrics from deleted VMs).
https://github.com/ManageIQ/manageiq-providers-vmware/pull/246
https://github.com/ManageIQ/vmware_web_service/pull/39
New commit detected on ManageIQ/vmware_web_service/master: https://github.com/ManageIQ/vmware_web_service/commit/ae3e1c45e163efef14fa119580bd80eed594c247 commit ae3e1c45e163efef14fa119580bd80eed594c247 Author: Adam Grare <agrare> AuthorDate: Wed May 2 13:30:04 2018 -0400 Commit: Adam Grare <agrare> CommitDate: Wed May 2 13:30:04 2018 -0400 Add VirtualMachine config.uuid to VimPropMaps Collect config.uuid in addition to summary.config.uuid for cases when the later is null. https://bugzilla.redhat.com/show_bug.cgi?id=1569090 lib/VMwareWebService/VimPropMaps.rb | 1 + 1 file changed, 1 insertion(+)
New commit detected on ManageIQ/manageiq-providers-vmware/master: https://github.com/ManageIQ/manageiq-providers-vmware/commit/f340d716369626e66c89ba68c534c8861748012a commit f340d716369626e66c89ba68c534c8861748012a Author: Adam Grare <agrare> AuthorDate: Wed May 2 12:57:20 2018 -0400 Commit: Adam Grare <agrare> CommitDate: Wed May 2 12:57:20 2018 -0400 Try to get VM UUID from summary.config or config In some instances a VM will have a nil summary.config.uuid but a valid config.uuid. Attempt to get the UUID from config.uuid as a fallback in the event that summary.config.uuid is nil. https://bugzilla.redhat.com/show_bug.cgi?id=1569090 .yamllint | 1 + app/models/manageiq/providers/vmware/infra_manager/refresh_parser.rb | 5 +- app/models/manageiq/providers/vmware/infra_manager/selector_spec.rb | 1 + manageiq-providers-vmware.gemspec | 2 +- spec/models/manageiq/providers/vmware/infra_manager/refresh_parser_spec.rb | 2 +- spec/models/manageiq/providers/vmware/infra_manager/refresher_spec.rb | 6 + spec/tools/vim_data/miq_vim_inventory/virtualMachinesByMor.yml | 911 +- 7 files changed, 520 insertions(+), 408 deletions(-)
verified in 5.10.0.4