Description of problem: target nodes which have not had metrics collected for weeks/months result in errors after 600 seconds (message timeout value). case is very similar to bz 1429593 but the collection target type is different. Version-Release number of selected component (if applicable): Version: 5.7.1.3 How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: error (surprisingly not a timeout) after 600 seconds. Expected results: successful collection. Additional info: Sample begin and termination sequcence for message [] follows: ===== [----] I, [2017-03-01T06:51:48.096230 #1880:9b1130] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::MetricsCollectorWorker::Runner#get_message_via_drb) Message id: [10000187834267], MiqWorker id: [10000000951795], Zone: [CTC Core], Role: [ems_metrics_collector], Server: [], Ident: [openshift_enterprise], Target id: [], Instance id: [10000000000058], Task id: [], Command: [ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode.perf_capture_realtime], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [8781.438474909] seconds [----] I, [2017-03-01T06:51:48.098317 #1880:9b1130] INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_capture) [realtime] Capture for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058]... [----] I, [2017-03-01T06:51:48.131013 #1880:9b1130] INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Collecting metrics for ContainerNode(10000000000058) [realtime] [2016-11-17 10:06:40 UTC] [] [----] I, [2017-03-01T06:57:20.946496 #1880:9b1130] INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_capture) [realtime] Capture for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058]...Complete - Timings: {:server_dequeue=>0.005004405975341797, :capture_state=>7.224035978317261, :collect_data=>1237.5566260814667, :total_time=>1710.1304786205292, :process_counter_values=>1.59721040725708, :db_find_prev_perfs=>0.8102364540100098, :process_perfs=>52.30407118797302, :process_perfs_db=>396.53866600990295} [----] W, [2017-03-01T06:57:20.946688 #1880:9b1130] WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_capture) [realtime] For ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058], expected to get data as of [2016-11-17T10:07:00Z], but got data as of [2017-02-15 12:56:20 UTC]. [----] I, [2017-03-01T06:57:20.950930 #1880:9b1130] INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] Processing for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058], for range [2017-02-15 12:56:20 UTC - 2017-03-01 12:51:20 UTC]... ... ... [----] W, [2017-03-01T06:57:21.933694 #1880:9b1130] WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058] Timestamp: [2017-03-01T12:22:20Z], Column [cpu_usage_rate_average]: 'percent value 150.3533666025 is out of range, resetting to 100.0' [----] W, [2017-03-01T06:57:21.934361 #1880:9b1130] WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058] Timestamp: [2017-03-01T12:36:40Z], Column [cpu_usage_rate_average]: 'percent value 111.7550530925 is out of range, resetting to 100.0' [----] I, [2017-03-01T06:57:21.957127 #1880:9b1130] INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] Processing 41273 performance rows... [----] E, [2017-03-01T07:01:48.098588 #1880:9b1130] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000187834267], timed out after 600.00220107 seconds. Timeout threshold [600] =====
Created attachment 1260490 [details] one of several sample processes containing reported problem
As reported in the logs the interval of collection became too large to be processed within 600 seconds: [2017-02-15 12:56:20 UTC - 2017-03-01 12:51:20 UTC].
*** This bug has been marked as a duplicate of bug 1429593 ***