Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1647220

Summary: Openshift is leaking memory
Product: OpenShift Container Platform
Reporter: Nicholas Schuetz <nick>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, chernand, fbranczy, jokerman, minden, mmccomas, nick, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-12 13:48:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (all flags: none):
  memory consumed
  baseline
  short time later
  hosts file
  kube-system1
  kube-system2
  cluster memory plot
  This is the query the console runs
  Different result if I edit the query to include cached mem
  after install
  14 hrs later
  dash 14 hrs later
  after reboot

Description Nicholas Schuetz 2018-11-06 21:37:33 UTC
Created attachment 1502715 [details]
memory consumed

I have observed that v3.11.16 leaks memory.  I installed a cluster (HA on bare metal) and then went on the road for 3 days.  When I returned, almost all of the 64GB on all three nodes was consumed.  I use the 'all-in-one' profile across three bare-metal nodes.  I hadn't even logged into this cluster yet.  The OS was freshly provisioned from a Satellite 6.4 server via iPXE/kickstart.  The systems do not leak memory prior to OpenShift being installed.  Screenshots and Ansible hosts file attached.

-Nick

Comment 1 Nicholas Schuetz 2018-11-06 21:38:57 UTC
Created attachment 1502716 [details]
baseline

Comment 2 Nicholas Schuetz 2018-11-06 21:39:34 UTC
Created attachment 1502717 [details]
short time later

Comment 3 Nicholas Schuetz 2018-11-06 21:43:48 UTC
Created attachment 1502719 [details]
hosts file

Comment 4 Nicholas Schuetz 2018-11-06 22:01:25 UTC
Created attachment 1502721 [details]
kube-system1

Comment 5 Nicholas Schuetz 2018-11-06 22:01:58 UTC
Created attachment 1502722 [details]
kube-system2

Comment 6 Nicholas Schuetz 2018-11-07 04:26:21 UTC
Created attachment 1502819 [details]
cluster memory plot

Comment 7 Nicholas Schuetz 2018-11-07 05:34:32 UTC
I think the "Memory Usage" widget on the "Cluster Dashboard" page is failing to subtract cached memory.  It's healthy for the Linux kernel not to release memory but to cache it instead (for later use).  That being said, we should probably fix the calculation there to subtract cached mem.

Comment 8 Samuel Padgett 2018-11-07 14:42:43 UTC
(In reply to Nicholas Nachefski from comment #7)
> That being said, we should probably fix the calculation there to subtract cached mem.

We are subtracting cached mem:

((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100

https://github.com/openshift/console/blob/master/frontend/public/components/cluster-overview.jsx#L114

fbranczy - Does this query look correct to you? Anything we should change?
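For reference, the arithmetic in that query can be sketched numerically. This is only an illustration of the formula, with made-up node_exporter-style values for a single hypothetical 64GB node:

```python
# Sketch of the console's memory-usage formula, using invented values
# (bytes) standing in for the node_exporter metrics in the query above.
mem_total = 64 * 1024**3      # node_memory_MemTotal
mem_free = 8 * 1024**3        # node_memory_MemFree
buffers = 2 * 1024**3         # node_memory_Buffers
cached = 30 * 1024**3         # node_memory_Cached

# Mirrors: ((MemTotal - MemFree - Buffers - Cached) / MemTotal) * 100
used_pct = (mem_total - mem_free - buffers - cached) / mem_total * 100
print(round(used_pct, 1))  # 37.5
```

With cached memory subtracted, 30GB of page cache does not count as "used", which is why removing the `- sum(node_memory_Cached)` term gives a much higher gauge reading.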

Comment 9 Samuel Padgett 2018-11-09 13:55:27 UTC
(In reply to Nicholas Nachefski from comment #7)
> I think the "Memory Usage" widget in the "Cluster Dashboard" page is failing
> to subtract cached memory.  It's healthy for the linux kernel to not release
> memory but cache it instead (for later use).  That being said, we should
> probably fix the calculation there to subtract cached mem.

Is this just a guess, or were you able to confirm that the console isn't subtracting cached mem? It's part of the query (see comment #8). When I remove the `- sum(node_memory_Cached)` term in the Prometheus UI, I get a much different result, so it appears to be working. Best I can tell, the console is not including cached mem in the gauge.

Comment 10 Samuel Padgett 2018-11-09 13:56:04 UTC
Created attachment 1503661 [details]
This is the query the console runs

Comment 11 Samuel Padgett 2018-11-09 13:57:09 UTC
Created attachment 1503662 [details]
Different result if I edit the query to include cached mem

Comment 12 Nicholas Schuetz 2018-11-11 19:26:03 UTC
Just a guess.  I deployed a new cluster about 14 hrs ago and it's already up to 66% in the Status Dashboard.

Comment 13 Nicholas Schuetz 2018-11-11 19:26:49 UTC
Created attachment 1504442 [details]
after install

Comment 14 Nicholas Schuetz 2018-11-11 19:29:08 UTC
Created attachment 1504443 [details]
14 hrs later

Comment 15 Nicholas Schuetz 2018-11-11 19:29:49 UTC
Created attachment 1504444 [details]
dash 14 hrs later

Comment 16 Nicholas Schuetz 2018-11-18 19:36:37 UTC
Created attachment 1507025 [details]
after reboot

Comment 17 Samuel Padgett 2018-11-19 14:23:52 UTC
Transitioning to the monitoring team for evaluation. If the query needs to be changed, the console team needs guidance on what changes to make.

Comment 18 Frederic Branczyk 2018-11-19 15:34:33 UTC
I have access to some rather large 3.11 installs and cannot see anything similar to what you are observing, with any combination of total-free, total-free-cached, total-free-buffered, total-free-cached-buffered. None of them shows continuous growth like that over the entire retention window. Can you check whether some container keeps taking up increasing amounts of memory, or perhaps a Go process? You can check with these two queries:

Any container:
```
container_memory_rss
```

Go processes:
```
go_memstats_heap_inuse_bytes
```

That way we can at least figure out if a process is actually leaking.
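As a rough illustration of how those queries narrow things down: once you have a few samples per series, ranking them by growth over the window (analogous to `delta(container_memory_rss[...])` in PromQL) points at the leaker. The container names and byte values below are entirely hypothetical:

```python
# Hypothetical sketch: sampled container_memory_rss values (bytes) per
# container over a time window; the leaker is the one that keeps growing.
samples = {
    "prometheus-k8s-0": [1.0e9, 1.1e9, 1.2e9, 1.3e9],  # steady growth
    "apiserver":        [2.0e9, 2.0e9, 2.1e9, 2.0e9],  # flat
    "etcd":             [1.5e9, 1.4e9, 1.5e9, 1.5e9],  # flat
}

# Growth over the window: last sample minus first
growth = {name: vals[-1] - vals[0] for name, vals in samples.items()}
for name, g in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {g / 1e6:.0f} MB")  # leaker sorts first
```

The same ranking idea applies to `go_memstats_heap_inuse_bytes` for Go processes.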

Comment 19 Nicholas Schuetz 2018-11-20 17:12:38 UTC
Thanks for the tip, I'll dig in a little more.  It's worth noting that I did not have this problem when running on 9 VMs (HA mode).  I only started seeing it when I moved to all bare metal (3 physical nodes, HA mode).  I'm using the 'all-in-one' profile on all three nodes.

Comment 20 Nicholas Schuetz 2018-12-12 13:48:23 UTC
Closing this BZ as a duplicate of bug 1650138.  I think the VM vs. bare-metal difference was just a timing issue (I re-installed my cluster with this version at the same time I switched from KVM VMs to bare metal).

-Nick

*** This bug has been marked as a duplicate of bug 1650138 ***