Bug 1650351

Summary: Memory utilization by node is incorrect in Provider Overview page
Product: Red Hat CloudForms Management Engine
Reporter: David Luong <dluong>
Component: Providers
Assignee: Oved Ourfali <oourfali>
Status: CLOSED CURRENTRELEASE
QA Contact: juwatts
Severity: urgent
Priority: urgent
Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
Version: 5.8.3
CC: agrare, dluong, dmetzger, jfrey, jhardy, juwatts, obarenbo, oourfali, simaishi, yzamir
Target Milestone: GA
Keywords: TestOnly, ZStream
Target Release: 5.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 5.10.0.25
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1654463 1663520
Environment:
Last Closed: 2019-02-12 16:52:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Container Management
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1654463, 1663520

Description David Luong 2018-11-15 22:24:30 UTC
Description of problem:
Memory utilization reported in CloudForms does not match the values reported by OpenShift.

Version-Release number of selected component (if applicable):
5.8.3
(Also reproduced on 5.10.0.22)

How reproducible:
Always

Steps to Reproduce:
1.  Add an OpenShift provider with Hawkular metrics
2.  Wait for metrics to populate
3.  Hover over the memory block under node utilization

Actual results:
Shows higher than expected memory utilization

Expected results:
Accurate memory utilization

Additional info:

Comment 17 Yaacov Zamir 2018-11-18 14:55:15 UTC
Submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/305

Comment 19 Yaacov Zamir 2018-11-20 12:01:26 UTC
Merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/305

Note:
The values in ManageIQ are day/month averages and should not be compared as-is with the values from the CLI `top` command or the `oc adm top` command.
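To make the note above concrete, here is a small illustrative Ruby sketch (not ManageIQ code; the sample values are invented) of why a rolled-up average can legitimately differ from an instantaneous `top`-style reading:

```ruby
# Hypothetical per-interval memory samples (MiB) for one node over a
# capture window. A short spike affects the rollup average but may not
# be visible in a single `oc adm top` snapshot taken later.
samples = [512, 530, 498, 2048, 540, 525]

# What a day/month rollup reports: the mean over the whole window.
average = samples.sum / samples.size.to_f

# What an instantaneous tool shows: only the current reading.
snapshot = samples.last

puts format("rollup average:   %.1f MiB", average)
puts format("current snapshot: %d MiB", snapshot)
```

The two numbers answer different questions, so a gap between the dashboard and `oc adm top` is not by itself evidence of this bug.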

Comment 20 CFME Bot 2018-11-20 14:36:51 UTC
New commit detected on ManageIQ/manageiq-providers-kubernetes/hammer:

https://github.com/ManageIQ/manageiq-providers-kubernetes/commit/746ae7d41cb2f634bd09b548373eea965b07f7be
commit 746ae7d41cb2f634bd09b548373eea965b07f7be
Author:     Adam Grare <agrare>
AuthorDate: Tue Nov 20 06:59:20 2018 -0500
Commit:     Adam Grare <agrare>
CommitDate: Tue Nov 20 06:59:20 2018 -0500

    Merge pull request #305 from yaacov/optional-working-set-as-memory-tag

    Use Hawkular memory tag working-set instead of usage

    (cherry picked from commit aeda8479fe7d8caa9e17e325a14be4022d77b213)

    https://bugzilla.redhat.com/show_bug.cgi?id=1650351

 app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb | 16 +-
 app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context.rb | 6 +-
 spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_spec.rb | 20 +-
 spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_spec.rb | 16 +-
 spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture_spec.rb | 4 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_container_metrics.yml | 52 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_container_timespan.yml | 50 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_m_endpoint.yml | 36 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_node_metrics.yml | 44 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_node_timespan.yml | 46 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_pod_metrics.yml | 48 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_pod_timespan.yml | 50 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_refresh.yml | 181 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_status.yml | 44 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_container_metrics.yml | 80 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_container_timespan.yml | 84 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_node_metrics.yml | 160 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_node_timespan.yml | 168 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_pod_metrics.yml | 160 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_pod_timespan.yml | 168 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_refresh.yml | 146 +-
 spec/vcr_cassettes/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context_status.yml | 44 +-
 22 files changed, 902 insertions(+), 721 deletions(-)
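The gist of the merged change, paraphrased as a hedged Ruby sketch (the method name, hash shape, and exact descriptor strings below are illustrative, not the actual `hawkular_capture_context.rb` code): node memory is queried by the working-set tag rather than the usage tag, because usage includes reclaimable page cache and therefore overstates utilization. The legacy behavior remains selectable.

```ruby
# Illustrative only: which Hawkular memory descriptor to request for a node.
# "memory/usage" (old) counts reclaimable cache; "memory/working_set" (new)
# better reflects memory the node actually needs. Exact tag strings in the
# real capture context may differ.
def node_memory_metric_tags(force_legacy: false)
  if force_legacy
    { descriptor_name: "memory/usage", type: "node" }        # legacy behavior
  else
    { descriptor_name: "memory/working_set", type: "node" }  # behavior after PR #305
  end
end
```

With the default (`force_legacy: false`), collected node memory metrics should track the working set and produce lower, more accurate utilization figures.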

Comment 39 juwatts 2018-12-06 14:57:14 UTC
Verified in: 5.10.0.27.20181128170555_43ed8cb

Note: There is no clear way to verify this bug. The steps below are what I believe is sufficient verification.

The provider added for this test has been relatively unused for over two weeks. An unused provider was chosen because the CloudForms Dashboard shows average values over days/months; a provider with low, stable memory utilization makes the verification more accurate.

Verification steps:
1) Set "hawkular_force_legacy" to "false" under the advanced configuration. This is a new key in the advanced configuration; see the documentation for more information.
2) Added an OCP provider and enabled metrics collection permissions.
3) Compared the real-time metrics on the OpenShift nodes ("oc adm top nodes") with the averages displayed on the CFME dashboard and verified they were within 5%, which I believe is a reasonable margin.
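For step 1, the advanced-configuration change might look like the fragment below. Only the `hawkular_force_legacy` key name comes from the verification note; the nesting path is an assumption and may differ by appliance version, so locate the key in your own advanced settings rather than copying this verbatim.

```yaml
# Hedged sketch of the advanced settings (Configuration > Advanced).
# The surrounding nesting here is assumed, not taken from the bug.
ems:
  ems_kubernetes:
    hawkular_force_legacy: false   # key name from the verification steps above
```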