Bug 1537195

Summary: Metrics collection from an OpenShift 3.7 cluster frequently fails with "Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass"
Product: Red Hat CloudForms Management Engine Reporter: Peter McGowan <pmcgowan>
Component: C&U Capacity and UtilizationAssignee: Yaacov Zamir <yzamir>
Status: CLOSED CURRENTRELEASE QA Contact: Shalom Naim <snaim>
Severity: high Docs Contact:
Priority: high    
Version: 5.9.0CC: asimonel, bsorota, cpelland, epacific, jnordell, niroy, obarenbo, pmcgowan, pmukhedk, rmanes, rspagnol, simaishi, smallamp, yzamir
Target Milestone: GAKeywords: TestOnly, ZStream
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.10.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1552314 (view as bug list) Environment:
Last Closed: 2019-02-11 14:09:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Container Management Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1552314    
Attachments:
Description Flags
Bug verification none

Description Peter McGowan 2018-01-22 16:38:15 UTC
Description of problem:
I have a small OCP 3.7 cluster connected to a CFME 5.9.0.17 appliance. In most cases a
Metrics collection of any object fails with "Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass" and a stack trace. It does occasionally work however (maybe 1 in 500 times), and Hawkular seems to be correctly installed - the metrics are available in the OpenShift console. In the provider configuration page the validation of the 'hawkular-metrics.cloudapps....' metrics endpoint is successful .

Stack trace as follows:

[----] E, [2018-01-22T11:32:43.544532 #8023:92d13c] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass
[----] I, [2018-01-22T11:32:43.556138 #8023:92d13c]  INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.001560211181640625, :collect_data=>0.12748312950134277, :total_time=>0.1774895191192627}
[----] E, [2018-01-22T11:32:43.556323 #8023:92d13c] ERROR -- : MIQ(MiqQueue#deliver) Message id: [1000001010449], Error: [undefined method `keys' for nil:NilClass]
[----] E, [2018-01-22T11:32:43.556506 #8023:92d13c] ERROR -- : [NoMethodError]: undefined method `keys' for nil:NilClass  Method:[block in method_missing]
[----] E, [2018-01-22T11:32:43.556611 #8023:92d13c] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:136:in `get_metrics_key'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:212:in `block in insert_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:209:in `each'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:209:in `insert_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:244:in `collect_metrics_for_object'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:76:in `collect_container_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:28:in `collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:97:in `block in perf_collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-9073e19b1081/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:95:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:193:in `block in just_perf_capture'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:189:in `just_perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:135:in `perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:117:in `perf_capture_realtime'
/var/www/miq/vmdb/app/models/miq_queue.rb:449:in `block in dispatch_method'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:448:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:253:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:134:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:152:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:357:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:355:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:349:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:266:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:329:in `block in monitor'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:329:in `monitor'
/var/www/miq/vmdb/app/models/miq_server.rb:351:in `block (2 levels) in monitor_loop'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-9de8568e2726/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:351:in `block in monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:350:in `loop'
/var/www/miq/vmdb/app/models/miq_server.rb:350:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:234:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'

Version-Release number of selected component (if applicable):
CFME 5.9.0.17
OpenShift Master:   v3.7.14 


How reproducible:
Consistently


Steps to Reproduce:
1. Add an OCP 3.7 provider to CFME 5.9.0.17, configuring provider with metrics from Hawkular
2. Enable C&U server roles
3. Observe errors in evm.log

Actual results:
Errors

Expected results:
No errors

Additional info:

Comment 3 Yaacov Zamir 2018-01-23 15:16:49 UTC
Hi, we saw this errors in:
https://bugzilla.redhat.com/show_bug.cgi?id=1530627

On #1530627 environment we saw that we actually had no metrics in Hawkular.

Here "Hawkular seems to be correctly installed - the metrics are available in the OpenShift console" ?

We actually fixed the logs for #1530627 by:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204

But we assume that this errors only happen if no metrics are actually collected.

------------

Peter:
a. can you re-check that metrics are available on OpenShift console ?
b. can you re-check what metrics are being collected and what are not in CFME ?
c. do you have pods crashing in OCP, crashing pods will have no metrics and will trigger this error ?
d. can you give us access to the affected environment so we can debug this on a system that show this error ( for #1530627 we could not reproduce this problem ) ?

Comment 5 Peter McGowan 2018-01-23 16:09:37 UTC
As you suggest, the errors seem to occur when trying to collect metrics for pods such as 'apiserver-49xjz' which no longer exist. Perhaps we just need to handle this case more gracefully?

Comment 6 Yaacov Zamir 2018-01-23 16:18:54 UTC
> Perhaps we just need to handle this case more gracefully?

The PR that fixes that was merged 15 days ago: 
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204

-------------

p.s.
We are trying to figure out how to show in the provider summary page, that metrics where collected but some containers did not have any data ... is it an error, is it a warning ... what message to show users ...

adding Einat to this BZ because we already discussed the issue of what to show the user in such cases (when we saw #1530627)

p.p.s
do we need to backport 
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204
to G release (target of this BZ is 5.9 )

Comment 7 Oleg Barenboim 2018-01-23 16:32:27 UTC
No issue backporting https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 to Gaprindashvili if the Container Management team thinks it would be helpful to the user.

Comment 9 Peter McGowan 2018-01-23 16:37:35 UTC
> 
> p.s.
> We are trying to figure out how to show in the provider summary page, that
> metrics where collected but some containers did not have any data ... is it
> an error, is it a warning ... what message to show users ...

Personally I would consider this a warning rather than an error. There is a valid reason that no metrics were collected, the pod dod not exist for long enough. I guess there could conceivably be naturally short-lived pods (that terminate normally) that show the same characteristics?

Comment 10 Yaacov Zamir 2018-01-23 16:41:43 UTC
merged upstream + G/yes
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204

Comment 11 Peter McGowan 2018-01-23 16:42:07 UTC
I don't think it should be a blocker, especially as https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/204 has been merged already.

Comment 12 Barak 2018-01-24 10:25:08 UTC
Per comment # 11 moving to 5.9.1

Comment 15 Peter McGowan 2018-02-12 13:36:05 UTC
Clearing RFI flag

Comment 16 August Simonelli 2018-03-06 03:54:06 UTC
I'm seeing this with OCP 3.7 on both AWS and OSP when using cfme-5.9 (CF 4.6). I've not deployed any pods but the default install and Hawkuler. Metrics are displayed in OCP for the hawkuler pods. But CF just does this:

[----] E, [2018-03-06T14:17:39.302748 #23579:1097134] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: undefined method `keys' for nil:NilClass
[----] E, [2018-03-06T14:17:39.640065 #23579:1097134] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000000014908], Error: [undefined method `keys' for nil:NilClass]
[----] E, [2018-03-06T14:17:39.640237 #23579:1097134] ERROR -- : [NoMethodError]: undefined method `keys' for nil:NilClass  Method:[block in method_missing]
[----] E, [2018-03-06T14:17:39.640345 #23579:1097134] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-c53d253a7730/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:136:in `get_metrics_key'

for both OCP installs.

I noticed the BZ that the pull request is based on (https://bugzilla.redhat.com/show_bug.cgi?id=1530627) closed NOTABUG but i'm not sure what the fix was and how the pull fixes this.

For the moment, I can't work out how to get metrics into CF without this working.

Comment 17 August Simonelli 2018-03-06 13:22:01 UTC
So when I manually add the change to my 5.9 environment the errors change to be like this:


[----] E, [2018-03-06T16:06:30.732708 #1515:697134] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: no gauge metrics found for [sti-build]

[----] E, [2018-03-06T16:06:30.770902 #1515:697134] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000000086213], Error: [no gauge metrics found for [sti-build]]

[----] E, [2018-03-06T16:06:30.771083 #1515:697134] ERROR -- : [ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::CollectionFailure]: no gauge metrics found for [sti-build]  Method:[block in method_missing]

[----] E, [2018-03-06T16:06:30.771180 #1515:697134] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-c53d253a7730/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:135:in `get_metrics_key'

But I still don't see any changes to the "Aggregated Node Utilization" page (providers -> containers -> overview). Should that update? I read with 4.1 it only updates after 30 days. Is that true?

Metrics are fine for hawkular pods as seen in OCP.

Comment 30 Prasad Mukhedkar 2018-05-14 11:24:22 UTC
Yaacov, After switching to legacy hawkular api ('hawkular_force_legacy' to true in advance configuration), cloudforms is able to collect the metrics properly.

Comment 31 Yaacov Zamir 2018-05-14 12:08:01 UTC
Thanks !

Comment 33 Shalom Naim 2018-07-01 11:35:13 UTC
Created attachment 1455747 [details]
Bug verification

Comment 34 Red Hat Bugzilla 2023-09-15 00:06:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days