Bug 1943265

Summary: Negative Memory Utilization for Cluster Compute Resources Dashboard
Product: OpenShift Container Platform Reporter: jhusta <jhusta>
Component: Monitoring Assignee: Simon Pasquier <spasquie>
Status: CLOSED CURRENTRELEASE QA Contact: Yadan Pei <yapei>
Severity: low Docs Contact:
Priority: low    
Version: 4.8 CC: alegrand, anpicker, aos-bugs, erooth, jhadvig, jhusta, jokerman, juzhao, krmoser, lcosic, wolfgang.voesch
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-01 14:22:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1934148    
Attachments:
Description Flags
Screen Shots of Dashboard and usage by nodes none

Description jhusta 2021-03-25 17:00:50 UTC
Description of problem:
When using the Kubernetes/Compute Resources/Cluster dashboard,
Memory Utilization shows a negative percentage and incorrect usage.


Version-Release number of selected component (if applicable):
Server Version: 4.8.0-0.nightly-s390x-2021-03-22-155743


How reproducible:
This is a large environment consisting of 3 masters and 10 workers


Looking at the inspect values:
1 - sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(kube_node_status_allocatable_memory_bytes{cluster=""})

sum of node_memory_MemAvailable_bytes = 611205861376
sum of kube_node_status_allocatable_memory_bytes = 387137921024
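
Plugging the two sums above into the dashboard expression shows where the negative percentage comes from. The snippet below is only a worked check of that arithmetic; the variable names are illustrative and not part of any dashboard or Prometheus API.

```python
# Cluster-wide sums reported in this bug, in bytes.
mem_available_sum = 611205861376   # sum of node_memory_MemAvailable_bytes
allocatable_sum = 387137921024     # sum of kube_node_status_allocatable_memory_bytes

# Dashboard expression: 1 - sum(MemAvailable) / sum(allocatable)
ratio = mem_available_sum / allocatable_sum
utilization = 1 - ratio

# The ratio is above 1, so the utilization drops below zero.
print(f"available / allocatable = {ratio:.4f}")
print(f"memory utilization      = {utilization:.2%}")
```

With these inputs the ratio is roughly 1.58, so the expression evaluates to roughly -58%.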

I experimented with some other variables to get to a proper value. I have included my screenshots and trials with and without a memory workload running to see how the current





Steps to Reproduce:
1. Compute mem utilization on an environment to see if you get the correct value and it matches what is displayed in the Dashboard.

Actual results:


Expected results:
correct utilization value


Additional info:

Comment 1 jhusta 2021-03-25 17:17:26 UTC
Created attachment 1766360
Screen Shots of Dashboard and usage by nodes

Comment 2 Andrew Pickering 2021-04-02 13:41:43 UTC
I see that this query has now been changed to `1 - sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(kube_node_status_allocatable{resource="memory",cluster=""})` (changed by https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/534).

Not sure if this change would be expected to resolve this issue. Pawel, could you confirm?

FWIW, I am not seeing negative values with my test cluster.

Comment 3 Pawel Krupa 2021-04-06 07:42:22 UTC
This seems to happen when a large chunk of memory is reserved for other uses. In that scenario, the node's available memory is much higher than what the scheduler is allowed to allocate. The right-hand part of the equation (`sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(kube_node_status_allocatable{resource="memory",cluster=""})`) then exceeds one, which makes the overall value negative.

The PR https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/534 won't fix this as we need a different way to track this, preferably one where we don't subtract metric values from 1.
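
A minimal sketch of the scenario described in this comment, assuming a single hypothetical node where half of the memory is reserved and therefore not allocatable. The sizes are invented for illustration and are not taken from the reporter's cluster.

```python
GIB = 1024 ** 3

def utilization(mem_available: int, denominator: int) -> float:
    """Evaluate 1 - mem_available / denominator, as the dashboard query does."""
    return 1 - mem_available / denominator

# Hypothetical node (assumed values): 64 GiB installed, 32 GiB reserved
# for other uses, so the scheduler may only allocate the remaining 32 GiB.
allocatable = 32 * GIB
mem_available = 48 * GIB  # the kernel still reports most memory as available

# Available memory exceeds allocatable memory, so the fraction is 1.5
# and the computed utilization is negative.
old_value = utilization(mem_available, allocatable)
print(f"utilization against allocatable: {old_value:.0%}")
```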

Comment 5 Junqi Zhao 2021-08-05 08:40:50 UTC
Checked with 4.9.0-0.nightly-2021-08-04-131508. On the Kubernetes/Compute Resources/Cluster dashboard, the "Memory Utilisation" expression is now
1 - sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(node_memory_MemTotal_bytes{cluster=""})
This guarantees that the value is never negative.
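
A rough illustration of why dividing by MemTotal keeps the value in range, assuming that a node's MemAvailable never exceeds its MemTotal. The three nodes below are hypothetical and only serve to exercise the arithmetic.

```python
GIB = 1024 ** 3

# Hypothetical nodes as (mem_total, mem_available) pairs, in bytes.
nodes = [
    (64 * GIB, 48 * GIB),
    (64 * GIB, 16 * GIB),
    (128 * GIB, 100 * GIB),
]

total_sum = sum(total for total, _ in nodes)
available_sum = sum(available for _, available in nodes)

# New expression: 1 - sum(MemAvailable) / sum(MemTotal)
# Each node has available <= total, so the sums keep the same ordering
# and the fraction cannot exceed 1.
cluster_utilization = 1 - available_sum / total_sum
print(f"cluster memory utilization: {cluster_utilization:.1%}")
```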

Comment 10 jhusta 2021-10-01 14:22:23 UTC
@juzhao I was able to validate the fix on 4.9. This defect can be closed. Thank you!