Bug 1656868
Summary: Large difference between manually calculated Memory related values and Prometheus query output

| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | Shivkumar Ople <sople> |
| Component | Monitoring | Assignee | Frederic Branczyk <fbranczy> |
| Status | CLOSED ERRATA | QA Contact | Junqi Zhao <juzhao> |
| Severity | unspecified | Docs Contact | |
| Priority | unspecified | | |
| Version | 3.11.0 | CC | cvogel, fbranczy, hgomes, sople, ssadhale, surbania, yhe |
| Target Milestone | --- | | |
| Target Release | 4.1.0 | | |
| Hardware | Unspecified | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| : | 1701856 (view as bug list) | | |
| Last Closed | 2019-06-04 10:41:14 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1701856 | | |
Description
Shivkumar Ople
2018-12-06 14:36:29 UTC
Continuing from the "Description of problem" section in the previous comment, the differences between the outputs are:

===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem free value (from the free command) is +95 GB

===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem available value (from the free command) is -66 GB

---

(In reply to Shivkumar Ople from comment #1)
> ===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem
> free value (from the free command) is +95 GB
>
> ===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem
> available value (from the free command) is -66 GB

What is the exact difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem free value (from the free command): 95 GB or -66 GB?

---

The exact difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem free value (from the free command) is 95 GB.

---

Hi there,

I am sorry for not getting back to this earlier. I am having difficulty reproducing it.

> Available or free values from the free command do not match the Prometheus sum(:node_memory_MemFreeCachedBuffers:sum) value

"sum(:node_memory_MemFreeCachedBuffers:sum)" is the sum of the /free/, /cached/ and /buffered/ memory from the free command, so it is not expected to match the /free/ value of the free command. You can find the recording rule definition here [1].

The /available/ value of the free command is not just the sum of /free/ and /cached/; it includes additional considerations. See this excerpt from `man free`:

> available
>        Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by
>        the cache or free fields, this field takes into account page cache and also that not all reclaimable memory slabs will
>        be reclaimed due to items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels
>        2.6.27+, otherwise the same as free)

In order to debug this further, would you mind posting an updated output of `$ free` here, along with the output of the following node_exporter metrics:

- node_memory_MemFree_bytes{job="node-exporter"}
- node_memory_Cached_bytes{job="node-exporter"}
- node_memory_Buffers_bytes{job="node-exporter"}
- node_memory_MemTotal_bytes{job="node-exporter"}

Thanks for the help.

[1] https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L159

---

PR opened to consistently configure the metrics used in 4.0: https://github.com/coreos/prometheus-operator/pull/2438.

Unfortunately this cannot be backported to 3.11. In 4.0 we introduced an entirely new component that uses metrics collected by Prometheus to serve the Kubernetes resource metrics API, instead of the Kubernetes metrics-server. The metrics-server only aggregates the cgroup hierarchy, which causes the inaccuracy reported here, because processes outside the cgroup hierarchy also use CPU, memory, etc.

---

https://github.com/openshift/cluster-monitoring-operator/pull/272 merged, which pulled in the changes from https://github.com/coreos/prometheus-operator/pull/2438. The Grafana dashboards and `kubectl top` now use identical metrics, so there should be no more deviation. Also, the metrics used for `kubectl top node` now come from the node-exporter, as opposed to only being the sum of resources used by all containers.

Moving to modified.
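For reference, a minimal sketch of how to run this comparison against the cluster Prometheus API. The route host and token handling are assumptions (not values from this report); the second query reproduces what the :node_memory_MemFreeCachedBuffers:sum recording rule roughly computes (free + cached + buffered, summed over the cluster) per the rule file linked in [1], so the two results should be close to each other.

```sh
# Assumed Prometheus route and token; adjust for your cluster.
PROM_URL=https://prometheus-k8s-openshift-monitoring.apps.example.com
TOKEN=$(oc whoami -t)

# Value of the recording rule, summed over all nodes.
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=sum(:node_memory_MemFreeCachedBuffers:sum)' \
  "$PROM_URL/api/v1/query"

# What the rule (roughly) computes: free + cached + buffers from node-exporter.
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"})' \
  "$PROM_URL/api/v1/query"
```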
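And a sketch of collecting the per-node `free` output requested above so it can be lined up against the node-exporter metrics; the node names are placeholders, and the column positions assume the procps-ng layout used on RHEL 7.

```sh
# Placeholder node names; free -b prints bytes, matching the *_bytes metrics.
for node in node1.example.com node2.example.com; do
  echo "== $node =="
  # Columns on the Mem: line: total used free shared buff/cache available
  ssh "$node" free -b | awk 'NR==2 {print "free:", $4, "buff/cache:", $6, "available:", $7}'
done
```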
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758