Description of problem: I have a small OCP 3.7 cluster connected to a CFME 5.9.0.17 appliance. In most cases a Metrics collection of any object fails with "Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass" and a stack trace. It does occasionally work however (maybe 1 in 500 times), and Hawkular seems to be correctly installed - the metrics are available in the OpenShift console. In the provider configuration page the validation of the 'hawkular-metrics.cloudapps....' metrics endpoint is successful . Stack trace as follows: [----] E, [2018-01-22T11:32:43.544532 #8023:92d13c] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass [----] I, [2018-01-22T11:32:43.556138 #8023:92d13c] INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.001560211181640625, :collect_data=>0.12748312950134277, :total_time=>0.1774895191192627} [----] E, [2018-01-22T11:32:43.556323 #8023:92d13c] ERROR -- : MIQ(MiqQueue#deliver) Message id: [1000001010449], Error: [undefined method `keys' for nil:NilClass] [----] E, [2018-01-22T11:32:43.556506 #8023:92d13c] ERROR -- : [NoMethodError]: undefined method `keys' for nil:NilClass Method:[block in method_missing] [----] E, [2018-01-22T11:32:43.556611 #8023:92d13c] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:136:in `get_metrics_key' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:212:in `block in insert_metrics' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:209:in `each' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:209:in `insert_metrics' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:244:in `collect_metrics_for_object' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:76:in `collect_container_metrics' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:28:in `collect_metrics' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:97:in `block in perf_collect_metrics' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block' /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:95:in `perf_collect_metrics' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:193:in `block in just_perf_capture' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:189:in `just_perf_capture' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:135:in `perf_capture' /var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:117:in `perf_capture_realtime' /var/www/miq/vmdb/app/models/miq_queue.rb:449:in `block in dispatch_method' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch' /opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout' /var/www/miq/vmdb/app/models/miq_queue.rb:448:in `dispatch_method' /var/www/miq/vmdb/app/models/miq_queue.rb:425:in `block in deliver' /var/www/miq/vmdb/app/models/user.rb:253:in `with_user_group' /var/www/miq/vmdb/app/models/miq_queue.rb:425:in `deliver' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:134:in `deliver_message' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:152:in `block in do_work' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `loop' /var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `do_work' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start' /var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker' /var/www/miq/vmdb/app/models/miq_worker.rb:357:in `block in start_runner_via_fork' /opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork' /opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork' /var/www/miq/vmdb/app/models/miq_worker.rb:355:in `start_runner_via_fork' /var/www/miq/vmdb/app/models/miq_worker.rb:349:in `start_runner' /var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start' /var/www/miq/vmdb/app/models/miq_worker.rb:266:in `start_worker' /var/www/miq/vmdb/app/models/miq_worker.rb:153:in `block in sync_workers' /var/www/miq/vmdb/app/models/miq_worker.rb:153:in `times' /var/www/miq/vmdb/app/models/miq_worker.rb:153:in `sync_workers' /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers' /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each' /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers' /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers' /var/www/miq/vmdb/app/models/miq_server.rb:329:in `block in monitor' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block' /var/www/miq/vmdb/app/models/miq_server.rb:329:in `monitor' /var/www/miq/vmdb/app/models/miq_server.rb:351:in `block (2 levels) in monitor_loop' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store' /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block' /var/www/miq/vmdb/app/models/miq_server.rb:351:in `block in monitor_loop' /var/www/miq/vmdb/app/models/miq_server.rb:350:in `loop' /var/www/miq/vmdb/app/models/miq_server.rb:350:in `monitor_loop' /var/www/miq/vmdb/app/models/miq_server.rb:234:in `start' /var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start' /var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start' /var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>' Version-Release number of selected component (if applicable): CFME 5.9.0.17 OpenShift Master: v3.7.14 How reproducible: Consistently Steps to Reproduce: 1. Add an OCP 3.7 provider to CFME 5.9.0.17, configuring provider with metrics from Hawkular 2. Enable C&U server roles 3. Observe errors in evm.log Actual results: Errors Expected results: No errors Additional info:
Hi, we saw this errors in: https://bugzilla.redhat.com/show_bug.cgi?id=1530627 On #1530627 environment we saw that we actually had no metrics in Hawkular. Here "Hawkular seems to be correctly installed - the metrics are available in the OpenShift console" ? We actually fixed the logs for #1530627 by: https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 But we assume that this errors only happen if no metrics are actually collected. ------------ Peter: a. can you re-check that metrics are available on OpenShift console ? b. can you re-check what metrics are being collected and what are not in CFME ? c. do you have pods crashing in OCP, crashing pods will have no metrics and will trigger this error ? d. can you give us access to the affected environment so we can debug this on a system that show this error ( for #1530627 we could not reproduce this problem ) ?
As you suggest, the errors seem to occur when trying to collect metrics for pods such as 'apiserver-49xjz' which no longer exist. Perhaps we just need to handle this case more gracefully?
> Perhaps we just need to handle this case more gracefully? The PR that fixes that was merged 15 days ago: https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 ------------- p.s. We are trying to figure out how to show in the provider summary page, that metrics where collected but some containers did not have any data ... is it an error, is it a warning ... what message to show users ... adding Einat to this BZ because we already discussed the issue of what to show the user in such cases (when we saw #1530627) p.p.s do we need to backport https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 to G release (target of this BZ is 5.9 )
No issue backporting https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 to Gaprindashvili if the Container Management team thinks it would be helpful to the user.
> > p.s. > We are trying to figure out how to show in the provider summary page, that > metrics where collected but some containers did not have any data ... is it > an error, is it a warning ... what message to show users ... Personally I would consider this a warning rather than an error. There is a valid reason that no metrics were collected, the pod dod not exist for long enough. I guess there could conceivably be naturally short-lived pods (that terminate normally) that show the same characteristics?
merged upstream + G/yes https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204
I don't think it should be a blocker, especially as https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 has been merged already.
Per comment # 11 moving to 5.9.1
Clearing RFI flag
I'm seeing this with OCP 3.7 on both AWS and OSP when using cfme-5.9 (CF 4.6). I've not deployed any pods but the default install and Hawkuler. Metrics are displayed in OCP for the hawkuler pods. But CF just does this: [----] E, [2018-03-06T14:17:39.302748 #23579:1097134] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass [----] E, [2018-03-06T14:17:39.640065 #23579:1097134] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000000014908], Error: [undefined method `keys' for nil:NilClass] [----] E, [2018-03-06T14:17:39.640237 #23579:1097134] ERROR -- : [NoMethodError]: undefined method `keys' for nil:NilClass Method:[block in method_missing] [----] E, [2018-03-06T14:17:39.640345 #23579:1097134] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-c53d253a7730/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:136:in `get_metrics_key' for both OCP installs. I noticed the BZ that the pull request is based on (https://bugzilla.redhat.com/show_bug.cgi?id=1530627) closed NOTABUG but i'm not sure what the fix was and how the pull fixes this. For the moment, I can't work out how to get metrics into CF without this working.
So when I manually add the change to my 5.9 environment the errors change to be like this: [----] E, [2018-03-06T16:06:30.732708 #1515:697134] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: no gauge metrics found for [sti-build] [----] E, [2018-03-06T16:06:30.770902 #1515:697134] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000000086213], Error: [no gauge metrics found for [sti-build]] [----] E, [2018-03-06T16:06:30.771083 #1515:697134] ERROR -- : [ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::CollectionFailure]: no gauge metrics found for [sti-build] Method:[block in method_missing] [----] E, [2018-03-06T16:06:30.771180 #1515:697134] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-c53d253a7730/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:135:in `get_metrics_key' But I still don't see any changes to the "Aggregated Node Utilization" page (providers -> containers -> overview). Should that update? I read with 4.1 it only updates after 30 days. Is that true? Metrics are fine for hawkular pods as seen in OCP.
Yaacov, After switching to legacy hawkular api ('hawkular_force_legacy' to true in advance configuration), cloudforms is able to collect the metrics properly.
Thanks !
Created attachment 1455747 [details] Bug verification
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days