Bug 1517064

Summary:	MetricsCollectorWorker workers are started even if metrics collection is disabled for a container provider
Product:	Red Hat CloudForms Management Engine	Reporter:	Prasad Mukhedkar <pmukhedk>
Component:	Providers	Assignee:	Yaacov Zamir <yzamir>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Shalom Naim <snaim>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.9.0	CC:	cpelland, gblomqui, jfrey, jhardy, lavenel, lsmola, obarenbo, pmukhedk, yzamir
Target Milestone:	GA	Keywords:	TestOnly
Target Release:	5.10.0
Hardware:	x86_64
OS:	Linux
Whiteboard:	testathon
Fixed In Version:	5.10.0.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1524628 (view as bug list)		Environment:
Last Closed:	2018-06-21 20:30:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	Container Management	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1524628

Description Prasad Mukhedkar 2017-11-24 06:38:01 UTC

Description of problem:


Metrics collection for the OpenShift provider is set to disabled, but the following workers are still started on the appliances
where the C&U roles are enabled in the zone.

[root@cfwork1 vmdb]# rake evm:status  | grep openshift
 ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker | started | 153 | 1808 | 12745 |         3 | openshift               | 2017-11-24T04:05:00Z | 2017-11-24T05:35:41Z |      214
 ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker | started | 158 | 1816 | 12751 |         3 | openshift               | 2017-11-24T04:05:00Z | 2017-11-24T05:35:28Z |      209
[root@cfwork1 vmdb]# 

and the following error message is reported repeatedly in the evm.log file:


[----] E, [2017-11-24T00:31:43.351235 #1816:b11138] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)
[----] I, [2017-11-24T00:31:43.376426 #1816:b11138]  INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.0041866302490234375, :collect_data=>3.0460519790649414, :total_time=>3.07134747505188}
[----] E, [2017-11-24T00:31:43.376893 #1816:b11138] ERROR -- : MIQ(MiqQueue#deliver) Message id: [12583], Error: [Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)]
[----] E, [2017-11-24T00:31:43.377335 #1816:b11138] ERROR -- : [ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::CollectionFailure]: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)  Method:[block in method_missing]
[----] E, [2017-11-24T00:31:43.377535 #1816:b11138] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:59:in `rescue in fetch_counters_data'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:50:in `fetch_counters_data'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:58:in `fetch_counters_rate'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:24:in `collect_container_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:28:in `collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:76:in `block in perf_collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:74:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:193:in `block in just_perf_capture'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:189:in `just_perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:135:in `perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:117:in `perf_capture_realtime'
/var/www/miq/vmdb/app/models/miq_queue.rb:449:in `block in dispatch_method'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:448:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:253:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:134:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:152:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:357:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:355:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:349:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:266:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:141:in `start'
/var/www/miq/vmdb/app/models/miq_server.rb:233:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'



Version-Release number of selected component (if applicable):

cfme-5.9.0.10-1.el7cf.x86_64
How reproducible:

Steps to Reproduce:
1. Navigate to Compute → Containers → Providers.
2. Click Configuration, then click Add a New Containers Provider (Add Existing Containers Provider).
3. Enter a Name for the provider.
4. From the Type list, select OpenShift Container Platform.
5. Select the appropriate zone.
6. From the Metrics list, select Disabled.
7. Add the provider.

Actual results:
MetricsCollectorWorker workers are started for the container provider.

Expected results:

Ideally, no MetricsCollectorWorker workers should be started when metrics collection is set to disabled for the container provider.
Additional info:

Comment 2 Ladislav Smola 2017-11-24 09:41:50 UTC

Prasad:

This is expected behavior. Whether the worker is started depends on whether the role for metric collection is enabled. It works that way everywhere, so we would need an RFE to change that behaviour.

I think it's recommended to have one provider per zone for this purpose.
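
To illustrate the behaviour described above, here is a minimal Ruby sketch. It is not ManageIQ code: `Provider`, `Server`, and `workers_to_start` are hypothetical names, and the role string is an assumption. The point is only that the decision looks at the server's roles and never at the provider's metrics setting.

```ruby
# Hypothetical model, for illustration only; these are not ManageIQ classes.
Provider = Struct.new(:name, :metrics_enabled)
Server   = Struct.new(:active_roles, :collector_count)

# Mirrors the behaviour described in this comment: the decision depends only
# on whether the metrics collection role is active on the server. The
# provider list is deliberately ignored.
def workers_to_start(server, _providers)
  server.active_roles.include?("ems_metrics_collector") ? server.collector_count : 0
end

provider = Provider.new("openshift", false)          # metrics disabled on the provider
server   = Server.new(["ems_metrics_collector"], 2)  # collector role enabled on the server

puts workers_to_start(server, [provider])  # prints 2
```

Under this model, the only ways to avoid the workers are to disable the role on the server or to move the provider to a zone without it, which is why one provider per zone helps.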


Yaacov:

Maybe we should not see this failure, though. Should we instead display a warning if the provider has no Hawkular/Prometheus endpoint set but the collector is started? Also, I believe that if C&U is disabled, we should not be queuing any targets for collection. (We should have separate BZs for these.)

Comment 3 Yaacov Zamir 2017-11-26 12:09:51 UTC

> Maybe we should not see this failure, though. Should we instead display a warning if the provider has no Hawkular/Prometheus endpoint set but the collector is started?

Agreed. Should we use this bug to fix this, or open a new one, to:

a. make it a Warning instead of an Error
b. make it one line instead of multiple errors.

> Also, I believe that if C&U is disabled, we should not be queuing any targets for collection. (We should have separate BZs for these.)

Also agreed :-) Should we use this bug to fix this, or open a new one?

P.S.
I will start working on a patch to fix the things Ladislav suggested. Please comment here on whether we need to open a new BZ or can use this one.

Comment 4 Yaacov Zamir 2017-11-28 18:07:50 UTC

This patch:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Changes the way we collect metrics. It also:
a. issues a Warning once when no metrics endpoint is available.
b. does not try to get metrics if no endpoint is found.
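
As a rough illustration of points a and b, the following Ruby sketch warns once and then skips collection when no endpoint is configured. It is not the code from the pull request: `MetricsCollectorSketch`, its methods, and the warning wording are hypothetical.

```ruby
require "logger"
require "stringio"

# Hypothetical collector, for illustration only; not the code from the pull request.
class MetricsCollectorSketch
  def initialize(endpoint, logger)
    @endpoint = endpoint # nil means no metrics endpoint is configured
    @logger   = logger
    @warned   = false
  end

  # Returns collected data, or nil when there is no endpoint to query.
  def collect
    if @endpoint.nil?
      # Warn only the first time, then skip quietly on later calls.
      @logger.warn("no metrics endpoint found, skipping metrics collection") unless @warned
      @warned = true
      return nil
    end
    fetch(@endpoint)
  end

  private

  # Placeholder for the real network call.
  def fetch(endpoint)
    { :endpoint => endpoint, :samples => [] }
  end
end

log       = StringIO.new
collector = MetricsCollectorSketch.new(nil, Logger.new(log))
3.times { collector.collect }
puts log.string.lines.count # one warning line, not three
```

Returning early instead of raising keeps the queue message from failing, so the log gets a single warning rather than an error and backtrace on every collection attempt.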

Changing this BZ to ON_DEV, since this patch (for a different BZ) will also fix the above issues.

Comment 6 Barak 2017-12-07 10:46:54 UTC

Yaacov, please add this BZ to the PR's description (add it to the existing bug).

Comment 7 Yaacov Zamir 2017-12-07 13:59:12 UTC

BZ added to:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Comment 8 Yaacov Zamir 2017-12-10 14:24:20 UTC

merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Moving to POST, although this does not fix the original problem. It only:
a. issues a Warning once when no metrics endpoint is available.
b. does not try to get metrics if no endpoint is found.

Please re-open if this is not enough and we need to address the underlying problem of actually starting a worker when C&U is on and no metrics endpoint is defined. [This is a core issue, not a container-only problem, and we will need to re-assign this to the core team.]

Comment 10 Yaacov Zamir 2017-12-21 15:29:48 UTC

Note for QE:

When testing:

Currently we see an Error if C&U is on but no metrics endpoint is set:
"""MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable"""

After fix:
a. We should not see an Error if C&U is on but no metrics endpoint is set.
b. We should see Warnings about it in the log.
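
One possible way to check both points against a log excerpt is sketched below in Ruby. The helper `fix_verified?`, the warning pattern, and the sample lines are assumptions made for illustration. Only the error message text is taken from this bug; the exact wording of the new warning is not stated here.

```ruby
# Hypothetical log check, for illustration only; not part of ManageIQ.
ERROR_TEXT = "Hawkular metrics service unavailable".freeze # message quoted in this bug

# True when no ERROR line carries the Hawkular message and at least one
# WARN line matches warning_pattern (the pattern is an assumption).
def fix_verified?(lines, warning_pattern = /metrics/i)
  has_error   = lines.any? { |l| l.include?("ERROR -- :") && l.include?(ERROR_TEXT) }
  has_warning = lines.any? { |l| l.include?("WARN -- :") && l.match?(warning_pattern) }
  !has_error && has_warning
end

# Made-up sample lines; only the error message text comes from this bug.
before_fix = ["ERROR -- : MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable"]
after_fix  = ["WARN -- : no metrics endpoint found, skipping metrics collection"]

puts fix_verified?(before_fix) # false
puts fix_verified?(after_fix)  # true
```

On an appliance, the lines would come from reading evm.log after the provider has been added with C&U on and no metrics endpoint set.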