Description of problem:
Memory usage is double counted by the `oc adm top pod` command.

$ oc -n openshift-kube-apiserver adm top pod openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal
NAME                                                               CPU(cores)   MEMORY(bytes)
openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal   213m         912Mi

Search in the Prometheus UI:
pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal"}

The result is 478658560 bytes, that is 478658560 / 1024 / 1024 = 456.484375Mi.

Element: pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal"}
Value: 478658560

`oc adm top pod` reports 912Mi, which is roughly double the value from Prometheus.

Version-Release number of selected component (if applicable):
$ oc version
oc v4.0.0-0.125.0
payload: registry.svc.ci.openshift.org/ocp/release@sha256:9185e93b4cf65abe8712b2e489226406c3ea9406da8051c8ae201a9159fa3db8

How reproducible:
Always

Steps to Reproduce:
1. Check the memory usage reported by `oc adm top pod`.
2. Check the memory usage for the same pod in the Prometheus UI.
3. Compare the two results (a comparison sketch follows below).

Actual results:
Memory usage is double counted by the `oc adm top pod` command.

Expected results:
The two values should not differ by a large gap.

Additional info:
Similar issue: https://github.com/openshift/cluster-monitoring-operator/pull/153/files
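For reference, a minimal sketch of the comparison described in the steps above, using the pod, namespace, and recording rule from this report (the Mi conversion is only the arithmetic already shown; the exact columns of the `oc` output will vary by version):

  # Value served by the resource metrics API (what `oc adm top pod` shows):
  $ oc -n openshift-kube-apiserver adm top pod openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal

  # Value reported by Prometheus, converted to Mi (run in the Prometheus UI query box):
  pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="openshift-kube-apiserver-ip-10-0-43-9.us-east-2.compute.internal"} / 1024 / 1024

  # With the double counting present, the first number comes out roughly twice the second.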
I am getting a discrepancy, but in the other direction.

$ oc adm top pod
NAME                                                     CPU(cores)   MEMORY(bytes)
etcd-member-ip-10-0-130-219.us-west-1.compute.internal   73m          175Mi
etcd-member-ip-10-0-137-248.us-west-1.compute.internal   51m          222Mi   <----
etcd-member-ip-10-0-152-13.us-west-1.compute.internal    36m          173Mi

Straight out of Prometheus:
pod_name:container_memory_usage_bytes:sum{namespace="kube-system",pod_name="etcd-member-ip-10-0-137-248.us-west-1.compute.internal"}   335613952

Sending to Monitoring to take a look at the prometheus-adapter that serves up the resource API and figure out why there is such a large delta.
Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1669718
First PR to fix this is out: https://github.com/coreos/prometheus-operator/pull/2528
Fix PR has merged.
Actually that was "just" the upstream change. The downstream change necessary is captured in: https://github.com/openshift/cluster-monitoring-operator/pull/303
The patch that enables this in our downstream landed now as well, so this can indeed be QE'd.
# oc -n openshift-kube-apiserver adm top pod kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal
NAME                                                       CPU(cores)   MEMORY(bytes)
kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal   888m         839Mi

From the Prometheus UI, search:
pod_name:container_memory_usage_bytes:sum{pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal',namespace='openshift-kube-apiserver'}

Result:
Element: pod_name:container_memory_usage_bytes:sum{namespace="openshift-kube-apiserver",pod_name="kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal"}
Value: 970461184

970461184 / 1024 / 1024 = 925.50390625Mi

The issue is fixed; the difference between `oc adm top pod` and the Prometheus result is acceptable.

payload: 4.0.0-0.nightly-2019-04-04-030930

@Frederic WDYT?
Could you double check that against the `container_memory_working_set_bytes` metric instead of `pod_name:container_memory_usage_bytes:sum`, as that's what's really used by the adapter.
(In reply to Frederic Branczyk from comment #9)
> Could you double check that against the `container_memory_working_set_bytes`
> metric instead of `pod_name:container_memory_usage_bytes:sum`, as that's
> what's really used by the adapter.

The results are almost the same.

# oc -n openshift-kube-apiserver adm top pod kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal
NAME                                                       CPU(cores)   MEMORY(bytes)
kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal   309m         871Mi

sum(container_memory_working_set_bytes{pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal',namespace='openshift-kube-apiserver'}) / 1024 / 1024 = 871.265625Mi
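For anyone re-verifying this later, a hedged way to look for the double counting itself, rather than only comparing the totals, is to break the working set metric down per container. The label names here match the 4.0-era cAdvisor metrics used in the queries above; the exact label set and grouping may differ on other versions:

  # One group of series per container is expected; duplicated series for the same
  # container would suggest the kubelet/cAdvisor endpoint is being scraped more than once.
  count by (container_name) (container_memory_working_set_bytes{namespace='openshift-kube-apiserver',pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal'})

  # Per-container working set, in Mi, for comparison against the `oc adm top pod` total:
  sum by (container_name) (container_memory_working_set_bytes{namespace='openshift-kube-apiserver',pod_name='kube-apiserver-ip-10-0-129-66.sa-east-1.compute.internal'}) / 1024 / 1024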
Wonderful, looks solved to me :)
*** Bug 1669718 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758