Bug 1934021 - High RAM usage on machine api termination node system oom
Summary: High RAM usage on machine api termination node system oom
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 1934017 1934018 1934019 (view as bug list)
Depends On:
Blocks: 1935636
 
Reported: 2021-03-02 09:58 UTC by Alexander Niebuhr
Modified: 2021-07-27 22:49 UTC (History)
CC List: 1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: HTTP responses were not being closed properly when the handler had finished with them. Consequence: Go routines were leaked in the background. Fix: Ensure that the response is always closed properly. Result: The memory usage is now stable.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:48:44 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
verified on 4.8.0-0.nightly-2021-03-10-142839 (130.32 KB, image/png)
2021-03-12 06:08 UTC, sunzhaohua
no flags Details
reproduced on 4.7.0 (81.46 KB, image/png)
2021-03-12 06:09 UTC, sunzhaohua
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 390 0 None open Bug 1934021: Ensure response body is closed when we are finished with the request 2021-03-04 13:43:27 UTC
Github openshift cluster-api-provider-azure pull 205 0 None open Bug 1934021: Ensure response body is closed when we are finished with the request 2021-03-04 13:44:14 UTC
Github openshift cluster-api-provider-gcp pull 150 0 None open Bug 1934021: Ensure response body is closed when we are finished with the request 2021-03-04 13:45:15 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:49:03 UTC

Description Alexander Niebuhr 2021-03-02 09:58:02 UTC
Description of problem:
We are seeing extremely high RAM usage from the machine-api-termination-handler pods (>10GB). This leads to system OOMs on the nodes.

Version-Release number of selected component (if applicable):
4.7

Actual results:
node1 ---
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: E0302 08:47:29.274245    1228 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: E0302 08:47:28.939304    1228 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: I0302 08:47:29.515650    1228 manager.go:1215] Created an OOM event in container "/" at 2021-03-02 08:39:51.798032232 +0000 UTC m=+949569.837181581
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: I0302 08:47:30.114399    1228 event.go:291] "Event occurred" object="ip-10-0-160-241.eu-central-1.compute.internal" kind="Node" apiVersion="" type="Warning" reason="SystemOOM" message="System OOM encountered"

node2---
Mar 02 08:36:35 ip-10-0-159-11 hyperkube[1227]: I0302 08:36:34.703449    1227 manager.go:1215] Created an OOM event in container "/" at 2021-03-02 08:39:10.22251715 +0000 UTC m=+1722875.568383336
Mar 02 08:36:35 ip-10-0-159-11 hyperkube[1227]: I0302 08:36:35.455905    1227 event.go:291] "Event occurred" object="ip-10-0-159-11.eu-central-1.compute.internal" kind="Node" apiVersion="" type="Warning" reason="SystemOOM" message="System OOM encountered, victim process: opm, pid: 1083350"

Comment 1 Alexander Niebuhr 2021-03-02 09:59:19 UTC
The machine-api-termination-handler pod seems to loop through termination and logs something like "not marked for termination" ...

Comment 2 Alexander Niebuhr 2021-03-02 10:00:03 UTC
*** Bug 1934017 has been marked as a duplicate of this bug. ***

Comment 3 Alexander Niebuhr 2021-03-02 10:00:23 UTC
*** Bug 1934018 has been marked as a duplicate of this bug. ***

Comment 4 Alexander Niebuhr 2021-03-02 10:00:36 UTC
*** Bug 1934019 has been marked as a duplicate of this bug. ***

Comment 5 Yu Qi Zhang 2021-03-02 20:44:22 UTC
machine-api-termination-handler is under machine-api/MachineHealthCheck, I believe. Reassigning.

Comment 6 Joel Speed 2021-03-03 10:53:30 UTC
Can we get any must-gather data or anything for a cluster that is in this state?

Any idea how we might be able to reproduce this? Are you able to reproduce this on 4.6?

Which platform are you seeing this on?

Comment 8 Alexander Niebuhr 2021-03-03 18:01:44 UTC
I'm attaching a must-gather generated shortly after a node was killed by a system OOM; that was on 4.6. We are also seeing this on a 4.7 lab system. However, I currently don't have a cluster in this state, as I delete the pods when they exceed RAM so the cluster's usage drops again.

https://drive.google.com/file/d/1SOulE0bTQsqZjI-uBlIblxn5-hQMVPeO/view?usp=sharing

Comment 9 Joel Speed 2021-03-04 12:31:22 UTC
I think the issue here is that we aren't closing the response bodies when we are finished with the request. I've built a debug build with pprof enabled, so I will do a quick test to check the heap profile and make sure this is the case.
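
For illustration only (a minimal sketch, not the actual provider change), the pattern being described is the standard deferred Close on the response body; the poll function and metadata URL below are just stand-ins:

// Hypothetical sketch of the pattern described above, not the actual
// cluster-api-provider code. The key point is the deferred Close: without
// it, the transport's per-connection goroutines never exit.
package main

import (
	"fmt"
	"io"
	"net/http"
)

// pollMetadata is an illustrative stand-in for the termination handler's
// polling of the instance metadata endpoint.
func pollMetadata(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	// Closing the body returns the connection to the transport and lets
	// its read/write goroutines exit.
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// Illustrative endpoint only.
	out, err := pollMetadata("http://169.254.169.254/latest/meta-data/spot/instance-action")
	if err != nil {
		fmt.Println("poll failed:", err)
		return
	}
	fmt.Println(out)
}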

Comment 10 Joel Speed 2021-03-04 13:31:45 UTC
To confirm my suspicion, I've grabbed the goroutine profile from the running process. It looks like we are leaking goroutines in the net/http read and write loops, which is the result of not closing the response body when we're finished with the request. Will get a fix up shortly and verify the routines are being closed properly.

From go tool pprof:

Showing nodes accounting for 674, 99.56% of 677 total
Dropped 61 nodes (cum <= 3)
      flat  flat%   sum%        cum   cum%
       674 99.56% 99.56%        674 99.56%  runtime.gopark
         0     0% 99.56%        330 48.74%  net/http.(*persistConn).readLoop
         0     0% 99.56%        330 48.74%  net/http.(*persistConn).writeLoop
         0     0% 99.56%          4  0.59%  runtime.chanrecv
         0     0% 99.56%        668 98.67%  runtime.selectgo
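
For reference, the same pair of leaked readLoop/writeLoop goroutines can be reproduced in isolation with a small standalone program (a sketch under assumed conditions, not the provider code):

// Standalone sketch reproducing the leak pattern above: every response
// whose body is never closed pins one net/http readLoop and one writeLoop
// goroutine, so the total grows with each request.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime"
)

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	fmt.Println("goroutines before:", runtime.NumGoroutine())
	for i := 0; i < 20; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		// Deliberately NOT calling resp.Body.Close(): the connection is
		// never returned to the pool and its goroutines stay parked.
		_ = resp
	}
	// At least two extra client-side goroutines remain parked per request;
	// adding defer resp.Body.Close() keeps the count roughly flat.
	fmt.Println("goroutines after:", runtime.NumGoroutine())
}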

Comment 12 Alexander Niebuhr 2021-03-09 06:20:39 UTC
Is there any way to work around this until the fix is included, or to disable it? It affects our cluster all the time. For now, I terminate the pods by hand...

Comment 13 Joel Speed 2021-03-09 10:56:02 UTC
You may be able to set a memory limit on the daemonset without the operator overwriting it, but I haven't tried this.

Will see if I can try this today and report back.

Comment 14 sunzhaohua 2021-03-10 02:28:06 UTC
Joel Speed, can you give me some guidance on how to verify this bug?

Comment 15 Joel Speed 2021-03-10 10:17:01 UTC
I'll explain how I manually verified this myself; it's a bit involved.

1. Add the pprof HTTP listener to the code so that we can inspect memory allocation (https://golang.org/pkg/net/http/pprof/); see the sketch after this list
2. Build two images, quay.io/jspeed/cluster-api-provider-aws:pprof and quay.io/jspeed/cluster-api-provider-aws:http-close, which both include the above; the first does not include the fix, the second does
3. Spin up a cluster and replace the termination handler daemonset image with one of the above
4. Use kubectl port-forward to forward to port 8080 within the container (kubectl port-forward pod/<pod-name> 8080:8080)
5. Use curl to get the goroutine dump from the pod (curl http://localhost:8080/debug/pprof/goroutine > goroutines.out)
6. Use the Go tooling to inspect the total number of goroutines (go tool pprof -top goroutine-with-close-1.out | grep total)
7. Fetch the goroutine dump again after another 30 seconds or so and compare the totals
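
As a sketch of step 1 (assumed wiring, not the exact debug build): importing net/http/pprof registers the /debug/pprof/ handlers on the default mux, and serving it on :8080 is what makes the port-forward in step 4 work.

// Minimal sketch of step 1 above (assumed wiring, not the actual debug
// build). The blank import registers the /debug/pprof/ handlers on
// http.DefaultServeMux; the listener exposes them on :8080 for
// kubectl port-forward.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ endpoints
)

func main() {
	go func() {
		log.Println(http.ListenAndServe(":8080", nil))
	}()

	// ... the termination handler's normal work would run here ...
	select {}
}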

Before the fix, I was seeing that the number of goroutines increased by 2 for every 5 seconds the pod was running.
After the fix, it was stable at around 18.

E.g. before the fix, after running for a little while:

Showing nodes accounting for 641, 99.69% of 643 total

After the fix, after running for a similar amount of time:

Showing nodes accounting for 18, 100% of 18 total

---

In terms of an easier way to verify this: if you can leave the termination pod running for an extended period (I don't know how long this will take), you may be able to see `kubectl top` report a memory usage that has increased (it should be stable). So start one up, run kubectl top, note the memory usage (<20Mi I think), leave it for x hours, and check again: if the usage has increased, the bug is still present; if it's stable, the fix is working.

Comment 16 sunzhaohua 2021-03-12 06:08:17 UTC
Created attachment 1762895 [details]
verified on 4.8.0-0.nightly-2021-03-10-142839

Comment 17 sunzhaohua 2021-03-12 06:09:01 UTC
Created attachment 1762896 [details]
reproduced on 4.7.0

Comment 18 sunzhaohua 2021-03-12 06:10:29 UTC
Verified on 4.8.0-0.nightly-2021-03-10-142839; memory usage is stable.
$ kubectl top pod machine-api-termination-handler-2pd9z
NAME                                    CPU(cores)   MEMORY(bytes)   
machine-api-termination-handler-2pd9z   0m           17Mi  

Reproduced on 4.7.0: memory increased from 17Mi to 67Mi in 3 hours.
$ kubectl top pod machine-api-termination-handler-cwmhf
NAME                                    CPU(cores)   MEMORY(bytes)   
machine-api-termination-handler-cwmhf   0m           67Mi

Comment 21 errata-xmlrpc 2021-07-27 22:48:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

