Bug 1517064 - MetricsCollectorWorker workers are started even if metrics collection is disabled for a container provider
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.9.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.10.0
Assignee: Yaacov Zamir
QA Contact: Shalom Naim
URL:
Whiteboard: testathon
Depends On:
Blocks: 1524628
 
Reported: 2017-11-24 06:38 UTC by Prasad Mukhedkar
Modified: 2018-06-21 20:30 UTC
CC List: 9 users

Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1524628
Environment:
Last Closed: 2018-06-21 20:30:46 UTC
Category: ---
Cloudforms Team: Container Management
Target Upstream Version:
Embargoed:



Description Prasad Mukhedkar 2017-11-24 06:38:01 UTC
Description of problem:


Metrics collection for the OpenShift provider is set to disabled, but the following workers are still started on appliances where C&U roles are enabled in the zone.

[root@cfwork1 vmdb]# rake evm:status  | grep openshift
 ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker | started | 153 | 1808 | 12745 |         3 | openshift               | 2017-11-24T04:05:00Z | 2017-11-24T05:35:41Z |      214
 ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker | started | 158 | 1816 | 12751 |         3 | openshift               | 2017-11-24T04:05:00Z | 2017-11-24T05:35:28Z |      209
[root@cfwork1 vmdb]# 

The following error message is reported repeatedly in the evm.log file:


[----] E, [2017-11-24T00:31:43.351235 #1816:b11138] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)
[----] I, [2017-11-24T00:31:43.376426 #1816:b11138]  INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.0041866302490234375, :collect_data=>3.0460519790649414, :total_time=>3.07134747505188}
[----] E, [2017-11-24T00:31:43.376893 #1816:b11138] ERROR -- : MIQ(MiqQueue#deliver) Message id: [12583], Error: [Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)]
[----] E, [2017-11-24T00:31:43.377335 #1816:b11138] ERROR -- : [ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::CollectionFailure]: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)  Method:[block in method_missing]
[----] E, [2017-11-24T00:31:43.377535 #1816:b11138] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:59:in `rescue in fetch_counters_data'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:50:in `fetch_counters_data'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:58:in `fetch_counters_rate'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:24:in `collect_container_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:28:in `collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:76:in `block in perf_collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:74:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:193:in `block in just_perf_capture'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:189:in `just_perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:135:in `perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:117:in `perf_capture_realtime'
/var/www/miq/vmdb/app/models/miq_queue.rb:449:in `block in dispatch_method'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:448:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:253:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:134:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:152:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:357:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:355:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:349:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:266:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:141:in `start'
/var/www/miq/vmdb/app/models/miq_server.rb:233:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'



Version-Release number of selected component (if applicable):

cfme-5.9.0.10-1.el7cf.x86_64

How reproducible:

Steps to Reproduce:
1. Navigate to Compute → Containers → Providers.
2. Click Configuration, then click Add Existing Containers Provider.
3. Enter a Name for the provider.
4. From the Type list, select OpenShift Container Platform.
5. Select the appropriate zone.
6. From the Metrics list, select Disabled.
7. Click Add to add the provider.

Actual results:
MetricsCollectorWorker workers are started for the container provider.

Expected results:

Ideally, no MetricsCollectorWorker workers should be started when metrics collection is set to disabled for a container provider.
Additional info:

Comment 2 Ladislav Smola 2017-11-24 09:41:50 UTC
Prasad:

This is expected behavior. Whether the worker is started is based on whether the role for metrics collection is enabled on the server, and it works this way everywhere. So we would need an RFE to change that behaviour.

I think it's recommended to have one provider per zone for this purpose.
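
For reference, a minimal sketch of the gating described above (names are illustrative, not the actual MiqWorker/MiqServer API):

# Illustrative sketch only: worker startup is gated on the server role
# ("ems_metrics_collector"), so the provider-level Metrics setting is
# never consulted.
class MetricsCollectorWorkerSketch
  REQUIRED_ROLE = "ems_metrics_collector".freeze

  def self.should_start?(server, provider)
    # Today only the role matters, so workers start even when the
    # provider was added with Metrics set to Disabled.
    server.active_role_names.include?(REQUIRED_ROLE)
    # The RFE would additionally require something like:
    #   && provider.metrics_endpoint_configured?
  end
end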


Yaacov:

Maybe we should not see this failure, though. Should we rather display a warning if the provider has no Hawkular/Prometheus endpoint set but the collector is started? Also, I believe that if C&U is disabled, we should not be queuing any targets for collection. (We should have other BZs for these.) A sketch of the suggested queuing guard follows.
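
All names below are illustrative, not the actual scheduler code:

# Illustrative: do not queue capture targets at all when the zone has
# no server with the C&U collector role active.
def queue_capture_targets(zone, targets)
  return [] unless zone.role_active?("ems_metrics_collector") # hypothetical check
  targets.each { |t| t.perf_capture_queue("realtime") }
end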

Comment 3 Yaacov Zamir 2017-11-26 12:09:51 UTC
> Maybe we should not see this failure though. Should we rather display a warning if provider has no Hawkular/Prometheus set but collector is started?

Agree. Should we use this bug to fix this, or open a new one, to:

a. make it a Warning instead of an Error
b. make it one line instead of the multiple errors

> Also, I believe that if C&U is disabled, we should not be queuing any targets for collection? (we should have another BZs for these)

Also agree :-) Should we use this bug to fix this, or open a new one?

P.S.
I will start working on a patch to fix the things Ladislav suggested; please comment here on whether we need to open a new BZ or use this one.

Comment 4 Yaacov Zamir 2017-11-28 18:07:50 UTC
This patch:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Changes the way we collect metrics. It also:
a. issues a Warning once when no metrics endpoint is available.
b. does not try to collect metrics if no endpoint is found.
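
A rough sketch of that behavior (helper names are illustrative, not the exact code from the PR):

# Illustrative sketch: warn once and return empty results when no
# Hawkular/Prometheus endpoint is configured, instead of raising a
# connection error for every target.
def perf_collect_metrics(interval_name, start_time = nil, end_time = nil)
  unless metrics_endpoint_configured? # hypothetical helper
    _log.warn("No metrics endpoint defined for #{ems.name}; skipping metrics collection")
    return [{}, {}] # no counters, no values collected
  end
  # ... normal collection against the configured endpoint ...
end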

Changing this BZ to ON_DEV since this patch (for a different BZ) will also fix the above issues.

Comment 6 Barak 2017-12-07 10:46:54 UTC
Yaacov, please add this BZ to the PR's description (add it alongside the existing bug).

Comment 7 Yaacov Zamir 2017-12-07 13:59:12 UTC
BZ added to:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Comment 8 Yaacov Zamir 2017-12-10 14:24:20 UTC
merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Moving to POST, although this does not fix the original problem; it only:
a. issues a Warning once when no metrics endpoint is available.
b. does not try to collect metrics if no endpoint is found.

Please re-open if this is not enough and we need to address the underlying problem of actually starting a worker when C&U is on and no metrics endpoint is defined. [This is a core issue, not a container-only problem, and we would need to re-assign it to the core team.]

Comment 10 Yaacov Zamir 2017-12-21 15:29:48 UTC
Note for QE:

When testing:

Currently we see an Error if C&U is enabled but no metrics endpoint is set:
"""MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable"""

After the fix:
a. We should not see an Error if C&U is on but no metrics endpoint is set.
b. We should see Warnings about it in the log.

