Bug 1429617 - Kubernetes ContainerNode#perf_capture errors after 600 seconds
Keywords:
Status: CLOSED DUPLICATE of bug 1429593
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: C&U Capacity and Utilization
Version: 5.7.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: GA
Target Release: cfme-future
Assignee: Federico Simoncelli
QA Contact: Einat Pacifici
URL:
Whiteboard: container:c&u
Depends On:
Blocks:
 
Reported: 2017-03-06 17:08 UTC by Thomas Hennessy
Modified: 2020-06-11 13:23 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-10 09:23:23 UTC
Category: ---
Cloudforms Team: Container Management
Target Upstream Version:
Embargoed:



Description Thomas Hennessy 2017-03-06 17:08:33 UTC
Description of problem: Target nodes that have gone weeks or months without metrics collection cause the capture message to error out after 600 seconds (the message timeout value).

The case is very similar to bug 1429593, but the collection target type is different.
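
For illustration, here is a minimal Ruby sketch of the failure mode (not the actual ManageIQ implementation; the timestamps are taken from the log excerpt below): realtime capture resumes from the target's last collected sample, so a node that has gone months without collection yields an enormous window that must be collected and processed inside a single queue message.

=====
# Sketch only -- simplified model of the capture-window calculation.
require "time"

# Values from the log below: the node's last sample vs. "now" at dequeue.
last_perf_capture = Time.parse("2016-11-17 10:06:40 UTC")
now               = Time.parse("2017-03-01 12:51:48 UTC")

window = now - last_perf_capture            # backlog, in seconds
puts "capture window: #{(window / 86_400).round(1)} days"
# => capture window: 104.1 days -- far more work than fits in the
#    600-second message timeout
=====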


Version-Release number of selected component (if applicable): 5.7.1.3


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: An error (surprisingly, not a timeout) after 600 seconds.


Expected results: successful collection.



Additional info:

A sample begin and termination sequence for message [] follows:
=====
[----] I, [2017-03-01T06:51:48.096230 #1880:9b1130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::MetricsCollectorWorker::Runner#get_message_via_drb) Message id: [10000187834267], MiqWorker id: [10000000951795], Zone: [CTC Core], Role: [ems_metrics_collector], Server: [], Ident: [openshift_enterprise], Target id: [], Instance id: [10000000000058], Task id: [], Command: [ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode.perf_capture_realtime], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [8781.438474909] seconds
[----] I, [2017-03-01T06:51:48.098317 #1880:9b1130]  INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_capture) [realtime] Capture for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058]...
[----] I, [2017-03-01T06:51:48.131013 #1880:9b1130]  INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Collecting metrics for ContainerNode(10000000000058) [realtime] [2016-11-17 10:06:40 UTC] []
[----] I, [2017-03-01T06:57:20.946496 #1880:9b1130]  INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_capture) [realtime] Capture for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058]...Complete - Timings: {:server_dequeue=>0.005004405975341797, :capture_state=>7.224035978317261, :collect_data=>1237.5566260814667, :total_time=>1710.1304786205292, :process_counter_values=>1.59721040725708, :db_find_prev_perfs=>0.8102364540100098, :process_perfs=>52.30407118797302, :process_perfs_db=>396.53866600990295}
[----] W, [2017-03-01T06:57:20.946688 #1880:9b1130]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_capture) [realtime] For ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058], expected to get data as of [2016-11-17T10:07:00Z], but got data as of [2017-02-15 12:56:20 UTC].
[----] I, [2017-03-01T06:57:20.950930 #1880:9b1130]  INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] Processing for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058], for range [2017-02-15 12:56:20 UTC - 2017-03-01 12:51:20 UTC]...
...
...
[----] W, [2017-03-01T06:57:21.933694 #1880:9b1130]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058] Timestamp: [2017-03-01T12:22:20Z], Column [cpu_usage_rate_average]: 'percent value 150.3533666025 is out of range, resetting to 100.0'
[----] W, [2017-03-01T06:57:21.934361 #1880:9b1130]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [apsrp8673.uhc.com], id: [10000000000058] Timestamp: [2017-03-01T12:36:40Z], Column [cpu_usage_rate_average]: 'percent value 111.7550530925 is out of range, resetting to 100.0'
[----] I, [2017-03-01T06:57:21.957127 #1880:9b1130]  INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode#perf_process) [realtime] Processing 41273 performance rows...
[----] E, [2017-03-01T07:01:48.098588 #1880:9b1130] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000187834267], timed out after 600.00220107 seconds.  Timeout threshold [600]
=====
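
Note that the final ERROR fires exactly 600 seconds after the message was dequeued (06:51:48 -> 07:01:48), matching the Timeout: [600] printed on the message. Below is a minimal sketch of how such a per-message timeout can be enforced, assuming delivery is wrapped in Ruby's Timeout (a hypothetical simplification; the real MiqQueue#deliver may differ):

=====
# Sketch only -- per-message timeout enforcement around delivery.
require "timeout"

MSG_TIMEOUT = 600 # seconds, matching Timeout: [600] in the log

def deliver(message)
  # Abort the delivery if it runs past the message timeout.
  Timeout.timeout(MSG_TIMEOUT) { message.call }
rescue Timeout::Error
  warn "Message timed out after #{MSG_TIMEOUT} seconds"
end

# Hypothetical usage: any capture needing longer than 600 s is aborted.
deliver(-> { sleep MSG_TIMEOUT + 1 })
=====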

Comment 2 Thomas Hennessy 2017-03-06 17:09:56 UTC
Created attachment 1260490 [details]
one of several sample processes containing reported problem

Comment 8 Federico Simoncelli 2017-03-09 17:03:55 UTC
As reported in the logs, the collection interval became too large to be processed within 600 seconds: [2017-02-15 12:56:20 UTC - 2017-03-01 12:51:20 UTC].
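
As a worked check of that interval (a sketch; it assumes the ~20-second realtime sampling step visible in the log timestamps):

=====
# Sketch only -- size of the backlogged collection interval from comment 8.
require "time"

range_start = Time.parse("2017-02-15 12:56:20 UTC")
range_end   = Time.parse("2017-03-01 12:51:20 UTC")

seconds = range_end - range_start   # ~1,209,300 s, i.e. ~14 days
samples = (seconds / 20).to_i       # one realtime sample every 20 s

puts "interval: #{(seconds / 86_400).round(1)} days, ~#{samples} samples"
# => interval: 14.0 days, ~60465 samples; the log's "Processing 41273
#    performance rows" is consistent with a two-week series that has gaps.
=====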

Comment 9 Federico Simoncelli 2017-03-10 09:23:23 UTC

*** This bug has been marked as a duplicate of bug 1429593 ***

