Bug 1669410
Summary: | Memory usage is double counted for `oc adm top pod` command | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | | |
Version: | 4.1.0 | CC: | adeshpan, anpicker, aos-bugs, erooth, fbranczy, jokerman, mloibl, mmccomas, sponnaga, ssadhale, surbania |
Target Milestone: | --- | | |
Target Release: | 4.1.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2019-06-04 10:42:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1664187 | | |
Description Junqi Zhao 2019-01-25 07:36:54 UTC
I am getting a discrepancy, but in the other direction.

$ oc adm top pod
NAME                                                      CPU(cores)   MEMORY(bytes)
etcd-member-ip-10-0-130-219.us-west-1.compute.internal    73m          175Mi
etcd-member-ip-10-0-137-248.us-west-1.compute.internal    51m          222Mi   <----
etcd-member-ip-10-0-152-13.us-west-1.compute.internal     36m          173Mi

Straight out of prometheus:

pod_name:container_memory_usage_bytes:sum{namespace="kube-system",pod_name="etcd-member-ip-10-0-137-248.us-west-1.compute.internal"}   335613952

Sending to Monitoring to take a look at the prometheus adapter that serves up the resource API and figure out why there is such a large delta.

Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1669718

First PR to fix this is out: https://github.com/coreos/prometheus-operator/pull/2528

Fix PR has merged.

Actually that was "just" the upstream change. The downstream change necessary is captured in: https://github.com/openshift/cluster-monitoring-operator/pull/303

The patch that enables this in our downstream landed now as well, so this can indeed be QE'd.

# oc -n openshift-kube-apiserver adm top pod kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal
NAME                                                       CPU(cores)   MEMORY(bytes)
kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal   888m         839Mi

From the prometheus UI, searching

pod_name:container_memory_usage_bytes:sum{pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal',namespace='openshift-kube-apiserver'}

gives this result:

Element                                                                                                                                               Value
pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal"}  970461184

970461184 / 1024 / 1024 = 925.50390625Mi

Issue is fixed; the difference between `oc adm top pod` and the prometheus result is acceptable.

payload: 4.0.0-0.nightly-2019-04-04-030930

@Frederic WDYT?

Could you double check that against the `container_memory_working_set_bytes` metric instead of `pod_name:container_memory_usage_bytes:sum`, as that's what's really used by the adapter.

(In reply to Frederic Branczyk from comment #9)
> Could you double check that against the `container_memory_working_set_bytes`
> metric instead of `pod_name:container_memory_usage_bytes:sum`, as that's
> what's really used by the adapter.

The results are almost the same:

# oc -n openshift-kube-apiserver adm top pod kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal
NAME                                                       CPU(cores)   MEMORY(bytes)
kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal   309m         871Mi

sum(container_memory_working_set_bytes{pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal',namespace='openshift-kube-apiserver'}) / 1024 / 1024 = 871.265625Mi

Wonderful, looks solved to me :)

*** Bug 1669718 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
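For anyone re-verifying this on another cluster, the comparison done in the comments above can be repeated roughly as follows. This is a minimal sketch, not part of the original report: the namespace and pod names are the ones from this bug and are placeholders for your own, and it assumes `oc adm top pod` is backed by the metrics.k8s.io resource API served by the prometheus-adapter, as described in the thread.

# Placeholder names; substitute the pod you want to check.
NS=openshift-kube-apiserver
POD=kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal

# What `oc adm top pod` reports, taken straight from the resource metrics API:
oc -n "${NS}" adm top pod "${POD}"
oc get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/${NS}/pods/${POD}"

# The raw cAdvisor metric the adapter aggregates (container_memory_working_set_bytes,
# per Frederic's comment, not the pod_name:container_memory_usage_bytes:sum recording
# rule); run this in the Prometheus UI and divide the result by 1024*1024 to get MiB:
#   sum(container_memory_working_set_bytes{namespace="openshift-kube-apiserver",pod_name="kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal"})

The two numbers should be close but not identical, since the adapter and the ad-hoc query sample the metric at slightly different times; a small delta like the 871Mi vs 871.27Mi above is expected, while a roughly 2x delta would indicate the double counting this bug describes.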