Description of problem:
Metrics collection for the OpenShift provider is set to disabled, but the following workers are still started on the appliances in the zone where the C&U roles are enabled.

[root@cfwork1 vmdb]# rake evm:status | grep openshift
 ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker | started | 153 | 1808 | 12745 | 3 | openshift | 2017-11-24T04:05:00Z | 2017-11-24T05:35:41Z | 214
 ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker | started | 158 | 1816 | 12751 | 3 | openshift | 2017-11-24T04:05:00Z | 2017-11-24T05:35:28Z | 209
[root@cfwork1 vmdb]#

The following error message is reported repeatedly in the evm.log file:

[----] E, [2017-11-24T00:31:43.351235 #1816:b11138] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)
[----] I, [2017-11-24T00:31:43.376426 #1816:b11138] INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.0041866302490234375, :collect_data=>3.0460519790649414, :total_time=>3.07134747505188}
[----] E, [2017-11-24T00:31:43.376893 #1816:b11138] ERROR -- : MIQ(MiqQueue#deliver) Message id: [12583], Error: [Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)]
[----] E, [2017-11-24T00:31:43.377335 #1816:b11138] ERROR -- : [ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::CollectionFailure]: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000) Method:[block in method_missing]
[----] E, [2017-11-24T00:31:43.377535 #1816:b11138] ERROR -- :
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:59:in `rescue in fetch_counters_data'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:50:in `fetch_counters_data'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:58:in `fetch_counters_rate'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb:24:in `collect_container_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb:28:in `collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:76:in `block in perf_collect_metrics'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/opt/rh/cfme-gemset/bundler/gems/manageiq-providers-kubernetes-d2e40c479ac0/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture.rb:74:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:6:in `perf_collect_metrics'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:193:in `block in just_perf_capture'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-76700bd7592a/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:189:in `just_perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:135:in `perf_capture'
/var/www/miq/vmdb/app/models/metric/ci_mixin/capture.rb:117:in `perf_capture_realtime'
/var/www/miq/vmdb/app/models/miq_queue.rb:449:in `block in dispatch_method'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:106:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:448:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:253:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:425:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:134:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:152:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:146:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:357:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.3/lib/nakayoshi_fork.rb:24:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:355:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:349:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:266:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:153:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:141:in `start'
/var/www/miq/vmdb/app/models/miq_server.rb:233:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'

Version-Release number of selected component (if applicable):
cfme-5.9.0.10-1.el7cf.x86_64

How reproducible:
1. Navigate to Compute → Containers → Providers.
2. Click Configuration, then click Add a New Containers Provider (Add Existing Containers Provider).
3. Enter a Name for the provider.
4. From the Type list, select OpenShift Container Platform.
5. Select the appropriate zone.
6. From the Metrics list, select Disabled.
7. Add the provider.

Actual results:
MetricsCollectorWorker workers are started for the container provider.

Expected results:
Ideally, no MetricsCollectorWorker workers should be started when metrics collection is set to disabled for the container provider.

Additional info:
Prasad: this is expected behavior. Whether the worker is started depends on whether the role for metric collection is enabled. It works this way everywhere, so we would need an RFE to change that behaviour. I think it is recommended to have one provider per zone for this purpose.

Yaacov: Maybe we should not see this failure, though. Should we instead display a warning if the provider has no Hawkular/Prometheus endpoint set but the collector is started? Also, I believe that if C&U is disabled, we should not be queuing any targets for collection? (We should have separate BZs for these.)
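To picture the behaviour described in this comment, here is a minimal Ruby sketch of a start decision that looks only at the server's roles and zone. It is not ManageIQ code: `Server`, `Provider`, `COLLECTOR_ROLE` and `collector_workers_needed?` are hypothetical names, and the zone check is an assumption based on the report saying workers start where the C&U roles are enabled in the zone.

```ruby
# Hypothetical model for illustration only; these are not ManageIQ classes.
Server   = Struct.new(:zone, :active_roles)
Provider = Struct.new(:name, :zone, :metrics_endpoint)

# Placeholder label for the C&U collector role (assumed name).
COLLECTOR_ROLE = "metrics_collector".freeze

# The decision looks at the server role and the zone only.
# The provider's metrics setting (metrics_endpoint) is never consulted,
# which is why a provider with metrics disabled still gets workers.
def collector_workers_needed?(server, providers)
  return false unless server.active_roles.include?(COLLECTOR_ROLE)

  providers.any? { |provider| provider.zone == server.zone }
end

server   = Server.new("zone_a", [COLLECTOR_ROLE])
provider = Provider.new("ose", "zone_a", nil) # nil: metrics disabled

puts collector_workers_needed?(server, [provider])
```

In this sketch the result stays true even though `metrics_endpoint` is nil, which matches what the reporter observed; changing that would mean adding a per-provider check to the decision, i.e. the RFE mentioned above.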
> Maybe we should not see this failure though. Should we rather display a warning if provider has no Hawkular/Prometheus set but collector is started?

Agreed. Should we use this bug to fix this, or open a new one, to:
a. make it a Warning instead of an Error
b. make it one line instead of multiple errors

> Also, I believe that if C&U is disabled, we should not be queuing any targets for collection? (we should have another BZs for these)

Also agreed :-) Should we use this bug to fix this, or open a new one?

P.S. I will start working on a patch to fix the things Ladislav suggested. Please comment here on whether we need to open a new BZ or can use this one.
This patch:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

changes the way we collect metrics. It also:
a. issues a Warning once when no metrics endpoint is available.
b. does not try to get metrics if no endpoint is found.

Changing this BZ to ON_DEV, since this patch (written for a different BZ) will also fix the above issues.
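The two behaviours listed for the patch (warn once, then skip collection when no endpoint exists) can be sketched in Ruby as follows. This is an illustration under assumptions, not the code from the PR: `MetricsCaptureSketch`, `collect` and `warn_once` are hypothetical names, the warning text is made up, and only Ruby's standard `Logger` is used.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for a provider's capture logic; not the code from the PR.
class MetricsCaptureSketch
  def initialize(endpoint, logger)
    @endpoint = endpoint # nil means no metrics endpoint is configured
    @logger   = logger
    @warned   = false
  end

  # Returns a result hash, or nil when collection is skipped.
  def collect
    if @endpoint.nil?
      warn_once("no metrics endpoint found, skipping metrics collection")
      return nil
    end

    { endpoint: @endpoint, samples: [] } # placeholder for a real fetch
  end

  private

  # Logs the message at WARN level on the first call only.
  def warn_once(message)
    return if @warned

    @logger.warn(message)
    @warned = true
  end
end

log     = StringIO.new
capture = MetricsCaptureSketch.new(nil, Logger.new(log))
3.times { capture.collect }
puts log.string
```

Calling `collect` three times without an endpoint writes a single WARN line and never attempts a connection, so the repeated ERROR entries and stack traces from the description would not be produced on this path.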
Yaacov, please add this BZ to the PR's description (in addition to the existing bug).
BZ added to: https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159
Merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159

Moving to POST, although this does not fix the original problem. It only:
a. issues a Warning once when no metrics endpoint is available.
b. does not try to get metrics if no endpoint is found.

Please re-open if this is not enough and we need to address the underlying problem of actually starting a worker when C&U is on and no metrics endpoint is defined.
[ This is a core issue, not a container-only problem, and we would need to re-assign it to the core team. ]
Note for QE:

When testing:
Currently we see an Error if C&U is on but no metrics endpoint is set:
"""MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable"""

After the fix:
a. We should not see an Error if C&U is on but no metrics endpoint is set.
b. We should see Warnings about it in the log.
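For the check described in this note, a small Ruby helper along these lines could count matching log lines by severity. It is a sketch only: `severity_counts` is a hypothetical name, the severity is read from the `ERROR -- :` style field seen in the log excerpt in the description, and the exact wording of the post-fix warning is not given in this report, so the pattern is left to the caller.

```ruby
# Counts log lines per severity among the lines that match a pattern.
# Severity is taken from the "ERROR -- :" / "WARN -- :" field of each line.
def severity_counts(lines, pattern)
  counts = Hash.new(0)
  lines.each do |line|
    next unless line =~ pattern

    severity = line[/\b(DEBUG|INFO|WARN|ERROR|FATAL)\s+-- :/, 1]
    counts[severity] += 1 if severity
  end
  counts
end

# Sample lines copied from the description of this bug.
sample = [
  '[----] E, [2017-11-24T00:31:43.351235 #1816:b11138] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Hawkular metrics service unavailable: Hawkular::ConnectionException: Failed to open TCP connection to ose.example.com:5000 (No route to host - connect(2) for "ose.example.com" port 5000)',
  '[----] I, [2017-11-24T00:31:43.376426 #1816:b11138] INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.0041866302490234375, :collect_data=>3.0460519790649414, :total_time=>3.07134747505188}'
]

counts = severity_counts(sample, /Hawkular metrics service unavailable/)
puts "ERROR lines: #{counts['ERROR']}"
```

On an appliance, the same helper could be fed with `File.foreach` over the evm.log file (its location is assumed to be under the vmdb log directory). After the fix, the ERROR count for this pattern should be zero for a provider with no metrics endpoint.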