Description of problem:

Manually calculated available and free values from the free command do not match the Prometheus sum(:node_memory_MemFreeCachedBuffers:sum) value.

The sum(:node_memory_MemFreeCachedBuffers:sum) query on Prometheus returns: 228516712448

And the "# free -b | grep Mem" command across the whole cluster (3 master nodes, 3 infra nodes, 6 application nodes) returns:

              total         used         free     shared    buff/cache     available
Mem:    33567514624   4655554560    228970496    5259264   28682989568   27253944320
Mem:    33567469568   3680813056   4468191232    6115328   25418465280   29052235776
Mem:    33567469568   7705444352  14346620928    7659520   11515404288   25235783680
Mem:    33567469568   4605882368  15638114304    6225920   13323472896   28376805376
Mem:    33567469568   9781415936   1770549248  550543360   22015504384   22365310976
Mem:    33567469568  16412160000    393760768   11522048   16761548800   16270479360
Mem:    33567469568   5052407808    407941120    5537792   28107120640   26687836160
Mem:    33567514624   3753984000    874369024    6721536   28939161600   28735127552
Mem:    33567514624   4327591936    792666112    7442432   28447256576   28191858688
Mem:    33567514624   4794445824    910139392    5914624   27862929408   27696361472
Mem:    33567514624   6027689984    679137280    3465216   26860687360   25744715776
Mem:    33567506432   3273830400    487788544    4468736   29805887488   29746855936
SUM:   402809896960  74071220224  40998248448  620875776  287740428288  315357315072

Neither the available nor the free values summed from the free command match the Prometheus sum(:node_memory_MemFreeCachedBuffers:sum) value.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Execute the sum(:node_memory_MemFreeCachedBuffers:sum) query on Prometheus and record the output.
2. Calculate the free and available (i.e. the necessary buff/cache) values from each node (a sketch of this step is included under Additional info below).
3. Compare the outputs from steps 1 and 2.

Actual results:
There is a large difference between the two values.

Expected results:
There should not be a large difference.

Additional info:
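A sketch of one way to gather the per-node numbers for step 2, assuming shell access to each node; the node names below are placeholders for this environment. It sums the MemFree, Buffers and Cached fields of /proc/meminfo (in kB) per node and converts the result to bytes:

# Hypothetical node names; replace with the cluster's actual hostnames.
for node in master-{0..2} infra-{0..2} app-{0..5}; do
  # One number per node: (MemFree + Buffers + Cached) in bytes
  ssh "$node" "awk '/^(MemFree|Buffers|Cached):/ {kb += \$2} END {print kb * 1024}' /proc/meminfo"
done | awk '{total += $1} END {print total}'

The final awk adds up the per-node results so the total can be compared directly with the Prometheus query output.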
Continuing from the "Description of problem" section in the previous comment, following are the differences between the outputs:

===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem free value (from the free command) is +95 GB

===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem available value (from the free command) is -66 GB
(In reply to Shivkumar Ople from comment #1)
> Continuing from the "Description of problem" section in the previous comment,
> following are the differences between the outputs:
>
> ===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the
> Mem free value (from the free command) is +95 GB
>
> ===> Difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the
> Mem available value (from the free command) is -66 GB

What is the exact difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem free value (from the free command), 95 GB or -66 GB?
The exact difference between sum(:node_memory_MemFreeCachedBuffers:sum) and the Mem free value (from the free command) is 95 GB.
Hi there, I am sorry for not getting back to this any earlier. I am having difficulty reproducing this.

> Available or free values from free command do not match prometheus
> sum(:node_memory_MemFreeCachedBuffers:sum) value

"sum(:node_memory_MemFreeCachedBuffers:sum)" is the sum of the /free/, /cached/ and /buffered/ memory from the free command. It therefore does not match just the /free/ value of the free command. You can find the recording rule definition here [1].

The /available/ value of the free command is not just the sum of /free/ and /cached/, but includes additional considerations. See this excerpt of `man free`:

> available
>        Estimation of how much memory is available for starting new applications,
>        without swapping. Unlike the data provided by the cache or free fields,
>        this field takes into account page cache and also that not all reclaimable
>        memory slabs will be reclaimed due to items being in use (MemAvailable in
>        /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+,
>        otherwise the same as free)

To be able to debug this further, would you mind posting an updated output of `$ free` here, along with the output of the following node_exporter metrics:

- node_memory_MemFree_bytes{job="node-exporter"}
- node_memory_Cached_bytes{job="node-exporter"}
- node_memory_Buffers_bytes{job="node-exporter"}
- node_memory_MemTotal_bytes{job="node-exporter"}

Thanks for the help.

[1] https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L159
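For reference, the recording rule roughly expands to the first expression below (the exact definition is what [1] links to); the second query gives the same sum broken down per node, which makes a side-by-side comparison with `free -b` on each host easier:

sum(
    node_memory_MemFree_bytes{job="node-exporter"}
  + node_memory_Cached_bytes{job="node-exporter"}
  + node_memory_Buffers_bytes{job="node-exporter"}
)

sum by (instance) (
    node_memory_MemFree_bytes{job="node-exporter"}
  + node_memory_Cached_bytes{job="node-exporter"}
  + node_memory_Buffers_bytes{job="node-exporter"}
)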
PR opened to consistently configure the metrics used in 4.0: https://github.com/coreos/prometheus-operator/pull/2438. Unfortunately this cannot be backported to 3.11, as in 4.0 we introduced an entirely new component that uses metrics collected by Prometheus to serve the Kubernetes resource metrics API, instead of the Kubernetes metrics-server. The metrics-server only aggregates the cgroup hierarchy, which is what causes the inaccuracy reported here, as there are other processes outside the cgroup hierarchy that also use CPU/memory/etc.
https://github.com/openshift/cluster-monitoring-operator/pull/272 has merged, which pulled in the changes from https://github.com/coreos/prometheus-operator/pull/2438. The Grafana dashboards and "kubectl top" now use identical metrics, so there should be no more deviation. Additionally, the metrics used for `kubectl top node` now come from the node-exporter, as opposed to only being the sum of resources used by all containers. Moving to MODIFIED.
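As a rough way to cross-check this once the change is in place (the exact queries used by the adapter are defined in the linked PRs; the expression below is only an approximation), the per-node memory usage reported by `kubectl top node` should now line up with a node-exporter-based query along the lines of:

sum by (instance) (
    node_memory_MemTotal_bytes{job="node-exporter"}
  - node_memory_MemAvailable_bytes{job="node-exporter"}
)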
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758