Created attachment 1502715 [details] memory consumed

I have observed that v3.11.16 leaks memory. I installed a cluster (HA on bare metal) and then went on the road for 3 days. When I returned, almost all of the 64GB on all three nodes was consumed. I use the 'all-in-one' profile across three bare-metal nodes. I hadn't even logged into this cluster yet. The OS was freshly provisioned with a Satellite 6.4 server via iPXE/kickstart. The systems did not leak memory prior to OpenShift being installed. Screenshots and Ansible hosts file attached.

-Nick
Created attachment 1502716 [details] baseline
Created attachment 1502717 [details] short time later
Created attachment 1502719 [details] hosts file
Created attachment 1502721 [details] kube-system1
Created attachment 1502722 [details] kube-system2
Created attachment 1502819 [details] cluster memory plot
I think the "Memory Usage" widget in the "Cluster Dashboard" page is failing to subtract cached memory. It's healthy for the linux kernel to not release memory but cache it instead (for later use). That being said, we should probably fix the calculation there to subtract cached mem.
(In reply to Nicholas Nachefski from comment #7) > That being said, we should probably fix the calculation there to subtract cached mem. We are subtracting cached mem: ((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100 https://github.com/openshift/console/blob/master/frontend/public/components/cluster-overview.jsx#L114 fbranczy - Does this query look correct to you? Anything we should change?
(In reply to Nicholas Nachefski from comment #7) > I think the "Memory Usage" widget in the "Cluster Dashboard" page is failing > to subtract cached memory. It's healthy for the linux kernel to not release > memory but cache it instead (for later use). That being said, we should > probably fix the calculation there to subtract cached mem. Is this just a guess or were you able to confirm that the console isn't subtracting cached mem? It's part of the query (see comment #8). When I take out the `- sum(node_memory_Cached)` from the Prometheus UI, I get a much different result, so it appears to be working. Best I can tell the console is not included cached mem in the gauge.
Created attachment 1503661 [details] This is the query the console runs
Created attachment 1503662 [details] Different result if I edit the query to include cached mem
Just a guess. I deployed a new cluster about 14 hrs ago and it's already up to 66% in the Status Dashboard.
Created attachment 1504442 [details] after install
Created attachment 1504443 [details] 14 hrs later
Created attachment 1504444 [details] dash 14 hrs later
Created attachment 1507025 [details] after reboot
Transitioning to the monitoring team for evaluation. If the query needs to be changed, the console team needs guidance on what changes to make.
I have access to some rather large 3.11 installs and cannot see anything similar to what you are observing, with any combination of total-free, total-free-cached, total-free-buffered, and total-free-cached-buffered. None of them shows continuous growth like that over the entire retention time.

Can you check whether some container, or maybe even a Go process, keeps taking up increasing amounts of memory? You could check that with these two queries:

any container
```
container_memory_rss
```

Go processes
```
go_memstats_heap_inuse_bytes
```

That way we can at least figure out whether a process is actually leaking.
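If the raw series are too noisy to eyeball, a couple of follow-up queries along the same lines (just a sketch, not something shipped in the dashboards) can narrow things down to the biggest or fastest-growing consumers:

```
# Top 10 containers by current RSS
topk(10, container_memory_rss)

# Containers whose RSS grew the most over the last 24 hours
topk(10, delta(container_memory_rss[24h]))

# Go processes whose in-use heap grew the most over the last 24 hours
topk(10, delta(go_memstats_heap_inuse_bytes[24h]))
```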
Thanks for the tip, I'll dig in a little bit more. It's worth noting that I did not have this problem when running on 9 VMs (HA mode). I only started seeing it when I moved to all bare metal (3 physical nodes, HA mode). I'm using the 'all-in-one' profile on all three nodes.
Closing this BZ as a dup of 1650138. I think the VM vs. bare-metal comparison was just a timing issue (I re-installed my cluster with this version at the same time I switched from KVMs to BM).

-Nick

*** This bug has been marked as a duplicate of bug 1650138 ***