Bug 1647220 - OpenShift is leaking memory
Summary: OpenShift is leaking memory
Keywords:
Status: CLOSED DUPLICATE of bug 1650138
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-06 21:37 UTC by Nicholas Schuetz
Modified: 2018-12-12 13:48 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-12 13:48:23 UTC
Target Upstream Version:
Embargoed:


Attachments
memory consumed (227.11 KB, image/png), 2018-11-06 21:37 UTC, Nicholas Schuetz
baseline (182.35 KB, image/png), 2018-11-06 21:38 UTC, Nicholas Schuetz
short time later (151.62 KB, image/png), 2018-11-06 21:39 UTC, Nicholas Schuetz
hosts file (4.51 KB, text/plain), 2018-11-06 21:43 UTC, Nicholas Schuetz
kube-system1 (266.58 KB, image/png), 2018-11-06 22:01 UTC, Nicholas Schuetz
kube-system2 (266.39 KB, image/png), 2018-11-06 22:01 UTC, Nicholas Schuetz
cluster memory plot (131.12 KB, image/png), 2018-11-07 04:26 UTC, Nicholas Schuetz
This is the query the console runs (196.99 KB, image/png), 2018-11-09 13:56 UTC, Samuel Padgett
Different result if I edit the query to include cached mem (194.30 KB, image/png), 2018-11-09 13:57 UTC, Samuel Padgett
after install (249.96 KB, image/png), 2018-11-11 19:26 UTC, Nicholas Schuetz
14 hrs later (248.62 KB, image/png), 2018-11-11 19:29 UTC, Nicholas Schuetz
dash 14 hrs later (222.00 KB, image/png), 2018-11-11 19:29 UTC, Nicholas Schuetz
after reboot (109.32 KB, image/png), 2018-11-18 19:36 UTC, Nicholas Schuetz

Description Nicholas Schuetz 2018-11-06 21:37:33 UTC
Created attachment 1502715 [details]
memory consumed

I have observed that v3.11.16 leaks memory.  I installed a cluster (HA on bare metal) and then went on the road for 3 days.  When I returned, almost all of the 64 GB on all three nodes was consumed.  I use the 'all-in-one' profile across three bare-metal nodes.  I hadn't even logged into this cluster yet.  The OS was freshly provisioned from a Satellite 6.4 server via iPXE/kickstart.  The systems do not leak memory prior to OpenShift being installed.  Screenshots and Ansible hosts file attached.

-Nick

Comment 1 Nicholas Schuetz 2018-11-06 21:38:57 UTC
Created attachment 1502716 [details]
baseline

Comment 2 Nicholas Schuetz 2018-11-06 21:39:34 UTC
Created attachment 1502717 [details]
short time later

Comment 3 Nicholas Schuetz 2018-11-06 21:43:48 UTC
Created attachment 1502719 [details]
hosts file

Comment 4 Nicholas Schuetz 2018-11-06 22:01:25 UTC
Created attachment 1502721 [details]
kube-system1

Comment 5 Nicholas Schuetz 2018-11-06 22:01:58 UTC
Created attachment 1502722 [details]
kube-system2

Comment 6 Nicholas Schuetz 2018-11-07 04:26:21 UTC
Created attachment 1502819 [details]
cluster memory plot

Comment 7 Nicholas Schuetz 2018-11-07 05:34:32 UTC
I think the "Memory Usage" widget on the "Cluster Dashboard" page is failing to subtract cached memory.  It's normal for the Linux kernel to keep memory as cache (for later use) rather than releasing it.  That being said, we should probably fix the calculation there to subtract cached mem.

Comment 8 Samuel Padgett 2018-11-07 14:42:43 UTC
(In reply to Nicholas Nachefski from comment #7)
> That being said, we should probably fix the calculation there to subtract cached mem.

We are subtracting cached mem:

((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100

https://github.com/openshift/console/blob/master/frontend/public/components/cluster-overview.jsx#L114
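
For comparison, an alternative formulation of the same "used excluding cache/buffers" figure (a sketch, not from this bug; it assumes the node_exporter shipped with 3.11 also exposes node_memory_MemAvailable with the same un-suffixed metric naming used above):

```
# hypothetical alternative: let the kernel's MemAvailable estimate decide what counts as "used"
(1 - sum(node_memory_MemAvailable) / sum(node_memory_MemTotal)) * 100
```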

fbranczy - Does this query look correct to you? Anything we should change?

Comment 9 Samuel Padgett 2018-11-09 13:55:27 UTC
(In reply to Nicholas Nachefski from comment #7)
> I think the "Memory Usage" widget in the "Cluster Dashboard" page is failing
> to subtract cached memory.  It's healthy for the linux kernel to not release
> memory but cache it instead (for later use).  That being said, we should
> probably fix the calculation there to subtract cached mem.

Is this just a guess, or were you able to confirm that the console isn't subtracting cached mem? It's part of the query (see comment #8). When I take out the `- sum(node_memory_Cached)` term in the Prometheus UI, I get a much different result, so it appears to be working. Best I can tell, the console is not including cached mem in the gauge.
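
For reference, the edited query described above would look like this (a sketch: the console query from comment #8 with only the `- sum(node_memory_Cached)` term removed):

```
# cached memory now counted as "used"
((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers)) / sum(node_memory_MemTotal)) * 100
```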

Comment 10 Samuel Padgett 2018-11-09 13:56:04 UTC
Created attachment 1503661 [details]
This is the query the console runs

Comment 11 Samuel Padgett 2018-11-09 13:57:09 UTC
Created attachment 1503662 [details]
Different result if I edit the query to include cached mem

Comment 12 Nicholas Schuetz 2018-11-11 19:26:03 UTC
Just a guess.  I deployed a new cluster about 14 hrs ago and it's already up to 66% in the Status Dashboard.

Comment 13 Nicholas Schuetz 2018-11-11 19:26:49 UTC
Created attachment 1504442 [details]
after install

Comment 14 Nicholas Schuetz 2018-11-11 19:29:08 UTC
Created attachment 1504443 [details]
14 hrs later

Comment 15 Nicholas Schuetz 2018-11-11 19:29:49 UTC
Created attachment 1504444 [details]
dash 14 hrs later

Comment 16 Nicholas Schuetz 2018-11-18 19:36:37 UTC
Created attachment 1507025 [details]
after reboot

Comment 17 Samuel Padgett 2018-11-19 14:23:52 UTC
Transitioning to the monitoring team for evaluation. If the query needs to be changed, the console team needs guidance on what changes to make.

Comment 18 Frederic Branczyk 2018-11-19 15:34:33 UTC
I have access to some rather large 3.11 installs and cannot see anything similar to what you are observing, with any combination of total-free, total-free-cached, total-free-buffered, and total-free-cached-buffered. For none of them do I see continuous growth like that over the entire retention time. Can you see a container that keeps consuming increasing amounts of memory, or maybe even a Go process? You can check that with these two queries:

any container
```
container_memory_rss
```

Go processes
```
go_memstats_heap_inuse_bytes
```

That way we can at least figure out if a process is actually leaking.
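
If those return too many series to eyeball, one possible refinement (a sketch, not part of the original suggestion) is to rank the largest consumers and the fastest growers:

```
# ten largest container resident sets right now
topk(10, container_memory_rss)

# ten fastest-growing Go heaps over the last 6 hours
topk(10, deriv(go_memstats_heap_inuse_bytes[6h]))
```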

Comment 19 Nicholas Schuetz 2018-11-20 17:12:38 UTC
Thanks for the tip, I'll dig in a little more.  It's worth noting that I did not have this problem when running on 9 VMs (HA mode).  I only started seeing it when I moved to all bare metal (3 physical nodes, HA mode).  I'm using the 'all-in-one' profile on all three nodes.

Comment 20 Nicholas Schuetz 2018-12-12 13:48:23 UTC
Closing this BZ as a dup of bug 1650138.  I think the VM vs. bare-metal comparison was just a timing issue (I re-installed my cluster with this version at the same time I switched from KVM VMs to bare metal).

-Nick

*** This bug has been marked as a duplicate of bug 1650138 ***

