Bug 1669410 - Memory usage is double counted for `oc adm top pod` command [NEEDINFO]
Summary: Memory usage is double counted for `oc adm top pod` command
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1669718
Depends On:
Blocks: 1664187
 
Reported: 2019-01-25 07:36 UTC by Junqi Zhao
Modified: 2019-06-04 10:42 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:15 UTC
Target Upstream Version:
Flags: adeshpan: needinfo? (fbranczy)




Links:
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:42:22 UTC)

Description Junqi Zhao 2019-01-25 07:36:54 UTC
Description of problem:
Memory usage is double counted for `oc adm top pod` command

$ oc -n openshift-kube-apiserver adm top pod openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal
NAME                                                               CPU(cores)   MEMORY(bytes)   
openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal   213m         912Mi         

Search in the Prometheus UI:
pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal"}

The result is 478658560 bytes, that is 478658560 / 1024 / 1024 = 456.484375Mi:
Element	                                                                                    Value
pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal"}	478658560


`oc adm top pod` reports 912Mi, which is roughly double the value from Prometheus (2 * 456.48Mi ≈ 913Mi).


Version-Release number of selected component (if applicable):
$ oc version
oc v4.0.0-0.125.0

payload: registry.svc.ci.openshift.org/ocp/release@sha256:9185e93b4cf65abe8712b2e489226406c3ea9406da8051c8ae201a9159fa3db8


How reproducible:
Always

Steps to Reproduce:
1. Check the memory usage reported by `oc adm top pod`
2. Check the memory usage for the same pod in the Prometheus UI
3. Compare the two results (example commands below)
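
Example commands (a minimal sketch; the pod name below is a placeholder, and the PromQL query is the one used in the description):

$ oc -n openshift-kube-apiserver adm top pod <kube-apiserver-pod-name>

# In the Prometheus UI, query the same pod and convert bytes to Mi:
pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="<kube-apiserver-pod-name>"} / 1024 / 1024

# With this bug, the value from `oc adm top pod` is roughly twice the Prometheus value.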

Actual results:
Memory usage is double counted for `oc adm top pod` command

Expected results:
The two values should not differ significantly.

Additional info:
Similar issue: https://github.com/openshift/cluster-monitoring-operator/pull/153/files

Comment 1 Seth Jennings 2019-04-01 18:45:42 UTC
I am getting a discrepancy but in the other direction.

$ oc adm top pod
NAME                                                     CPU(cores)   MEMORY(bytes)   
etcd-member-ip-10-0-130-219.us-west-1.compute.internal   73m          175Mi           
etcd-member-ip-10-0-137-248.us-west-1.compute.internal   51m          222Mi <----          
etcd-member-ip-10-0-152-13.us-west-1.compute.internal    36m          173Mi

Straight out of Prometheus:
pod_name:container_memory_usage_bytes:sum{namespace="kube-system",pod_name="etcd-member-ip-10-0-137-248.us-west-1.compute.internal"}   335613952

335613952 / 1024 / 1024 = 320.07Mi, which is higher than the 222Mi reported by `oc adm top pod`.

Sending to Monitoring to take a look at the Prometheus adapter that serves the resource metrics API and figure out why there is such a large delta.
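
(For reference, a rough way to see what the adapter returns is to query the resource metrics API that `oc adm top pod` reads; this assumes the standard metrics.k8s.io endpoint, and the pod name is the one from the output above:)

$ oc get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods/etcd-member-ip-10-0-137-248.us-west-1.compute.internal"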

Comment 2 Andrew Pickering 2019-04-02 05:39:10 UTC
Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1669718

Comment 3 Frederic Branczyk 2019-04-02 13:12:43 UTC
First PR to fix this is out: https://github.com/coreos/prometheus-operator/pull/2528

Comment 4 Andrew Pickering 2019-04-03 00:05:44 UTC
Fix PR has merged.

Comment 5 Frederic Branczyk 2019-04-03 09:19:32 UTC
Actually that was "just" the upstream change. The downstream change necessary is captured in: https://github.com/openshift/cluster-monitoring-operator/pull/303

Comment 7 Frederic Branczyk 2019-04-03 13:51:40 UTC
The patch that enables this in our downstream has now landed as well, so this can indeed be QE'd.

Comment 8 Junqi Zhao 2019-04-04 07:37:52 UTC
# oc -n openshift-kube-apiserver adm top pod kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal
NAME                                                       CPU(cores)   MEMORY(bytes)   
kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal   888m         839Mi           


From the Prometheus UI, search:
pod_name:container_memory_usage_bytes:sum{pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal',namespace='openshift-kube-apiserver'} 

Result:
Element	                                                                                                                                                Value
pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal"}	970461184

970461184 / 1024 / 1024 = 925.50390625Mi

The issue is fixed; the difference between the `oc adm top pod` and Prometheus results is acceptable.
payload: 4.0.0-0.nightly-2019-04-04-030930

@Frederic
WDYT?

Comment 9 Frederic Branczyk 2019-04-04 07:50:17 UTC
Could you double check that against the `container_memory_working_set_bytes` metric instead of `pod_name:container_memory_usage_bytes:sum`, as that's what's really used by the adapter.

Comment 10 Junqi Zhao 2019-04-04 08:31:59 UTC
(In reply to Frederic Branczyk from comment #9)
> Could you double check that against the `container_memory_working_set_bytes`
> metric instead of `pod_name:container_memory_usage_bytes:sum`, as that's
> what's really used by the adapter.

The results are almost the same

# oc -n openshift-kube-apiserver adm top pod kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal
NAME                                                       CPU(cores)   MEMORY(bytes)   
kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal   309m         871Mi    

sum(container_memory_working_set_bytes{pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal',namespace='openshift-kube-apiserver'}) / 1024 / 1024 = 871.265625Mi

Comment 11 Frederic Branczyk 2019-04-04 08:33:25 UTC
Wonderful, looks solved to me :)

Comment 12 Junqi Zhao 2019-04-04 09:12:49 UTC
*** Bug 1669718 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2019-06-04 10:42:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

