Description of problem:

We are seeing extremely high RAM usage from machine-api-termination-handler pods (>10GB). This led to system OOMs on the nodes.

Version-Release number of selected component (if applicable): 4.7

Actual results:

node1
---
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: E0302 08:47:29.274245 1228 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: E0302 08:47:28.939304 1228 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: I0302 08:47:29.515650 1228 manager.go:1215] Created an OOM event in container "/" at 2021-03-02 08:39:51.798032232 +0000 UTC m=+949569.837181581
Mar 02 08:47:31 ip-10-0-160-241 hyperkube[1228]: I0302 08:47:30.114399 1228 event.go:291] "Event occurred" object="ip-10-0-160-241.eu-central-1.compute.internal" kind="Node" apiVersion="" type="Warning" reason="SystemOOM" message="System OOM encountered"

node2
---
Mar 02 08:36:35 ip-10-0-159-11 hyperkube[1227]: I0302 08:36:34.703449 1227 manager.go:1215] Created an OOM event in container "/" at 2021-03-02 08:39:10.22251715 +0000 UTC m=+1722875.568383336
Mar 02 08:36:35 ip-10-0-159-11 hyperkube[1227]: I0302 08:36:35.455905 1227 event.go:291] "Event occurred" object="ip-10-0-159-11.eu-central-1.compute.internal" kind="Node" apiVersion="" type="Warning" reason="SystemOOM" message="System OOM encountered, victim process: opm, pid: 1083350"
The machine-api-termination-handler pod seems to loop through its termination checks and repeatedly logs messages containing "not marked for termination" ...
*** Bug 1934017 has been marked as a duplicate of this bug. ***
*** Bug 1934018 has been marked as a duplicate of this bug. ***
*** Bug 1934019 has been marked as a duplicate of this bug. ***
machine-api-termination-handler is under machine-api/MachineHealthCheck I believe. Reassigning.
Can we get any must gather data or anything for a cluster that is in this state? Any idea how we might be able to reproduce this? Are you able to reproduce this on 4.6? Which platform are you seeing this on?
I'm attaching a must-gather generated shortly after a node was killed by a system OOM; that was on 4.6. We are also seeing this on a 4.7 lab system. However, I currently do not have a cluster in this state, as I delete the pods when they exceed RAM so the cluster's usage drops again. https://drive.google.com/file/d/1SOulE0bTQsqZjI-uBlIblxn5-hQMVPeO/view?usp=sharing
I think the issue here is that we aren't closing the response bodies when we are finished with each request. I've built a debug build with pprof enabled, so I will do a quick test against the heap profile to make sure this is the case.
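For reference, the leak pattern in Go looks roughly like the sketch below. This is an illustrative example only, not the actual termination handler code; the endpoint URL and loop shape are assumptions. The point is that if resp.Body is never drained and closed, the underlying net/http persistConn read/write goroutines (and their buffers) are never released.

// Illustrative sketch only (not the actual machine-api code): polling an
// HTTP endpoint in a loop. Without draining and closing resp.Body, each
// iteration leaks the connection's read/write goroutines and buffers.
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

func pollOnce(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	// The fix: always drain and close the body so the connection
	// (and its goroutines) can be reused or torn down.
	defer func() {
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for {
		// Hypothetical metadata endpoint used here purely for illustration.
		terminating, err := pollOnce(client, "http://169.254.169.254/latest/meta-data/spot/instance-action")
		if err != nil {
			fmt.Println("poll error:", err)
		} else if terminating {
			fmt.Println("instance marked for termination")
			return
		}
		time.Sleep(5 * time.Second)
	}
}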
To confirm my suspicion, I've grabbed the goroutine profile from the running process. It looks like we are leaking goroutines in the net/http read and write loops, which is the result of not closing the body of the request. Will get a fix up shortly and verify the goroutines are being cleaned up properly.

From go tool pprof:

Showing nodes accounting for 674, 99.56% of 677 total
Dropped 61 nodes (cum <= 3)
      flat  flat%   sum%        cum   cum%
       674 99.56% 99.56%        674 99.56%  runtime.gopark
         0     0% 99.56%        330 48.74%  net/http.(*persistConn).readLoop
         0     0% 99.56%        330 48.74%  net/http.(*persistConn).writeLoop
         0     0% 99.56%          4  0.59%  runtime.chanrecv
         0     0% 99.56%        668 98.67%  runtime.selectgo
Is there any way to work around this until the fix is included, or to disable it? It affects our cluster all the time. For now, I terminate the pods by hand...
You may be able to set a memory limit on the daemonset without the operator overwriting it, but I haven't tried this. Will see if I can try this today and report back.
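If it helps as a starting point, something like the following should set a limit (untested, and assuming the daemonset lives in the openshift-machine-api namespace; the machine-api-operator may still revert the change):

kubectl set resources daemonset/machine-api-termination-handler -n openshift-machine-api --limits=memory=50Mi

With a limit in place, the expectation would be that only the container gets OOM-killed and restarted rather than the node hitting a system-wide OOM, but again, this is unverified.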
Joel Speed, can you give me some guidance on how to verify this bug?
I'll explain how I manually verified this myself; it's a bit involved.

1. Added the pprof HTTP listener to the code so that we can inspect memory allocation (https://golang.org/pkg/net/http/pprof/), as sketched below
2. Built two images, quay.io/jspeed/cluster-api-provider-aws:pprof and quay.io/jspeed/cluster-api-provider-aws:http-close, which both include the above; the first does not include the fix, the second does
3. Spin up a cluster and replace the termination handler daemonset image with one of the above
4. Use kubectl port-forward to forward to port 8080 within the container (kubectl port-forward <pod-name> 8080:8080)
5. Use curl to get the goroutine dump from the pod (curl http://localhost:8080/debug/pprof/goroutine > goroutines.out)
6. Use the go tooling to inspect the total number of goroutines (go tool pprof -top goroutines.out | grep total)
7. Fetch the goroutine dump again after another 30 seconds or so and inspect the total again

Before the fix, I was seeing the number of goroutines increase by 2 for every 5 seconds the pod was running. After the fix it was stable at around 18. E.g.:

Before the fix, after running for a little while:
Showing nodes accounting for 641, 99.69% of 643 total

After the fix, after running for a similar amount of time:
Showing nodes accounting for 18, 100% of 18 total

---

In terms of an easier way to verify this: if you can leave the termination pod running for an extended period (I don't know how long this will take), you may be able to see `kubectl top` report a memory usage that has increased (it should be stable). So start one up, run kubectl top, note the memory usage (<20Mi I think), leave it for a few hours, then check again: if the usage has increased, the bug is still present; if it's stable, it's fixed.
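For reference, wiring in the pprof listener from step 1 looks roughly like the snippet below. This is a generic sketch of the standard net/http/pprof mechanism, not the exact patch used in the debug images above.

// Generic sketch of enabling pprof endpoints in a Go binary (not the exact
// patch used in the debug images). The blank import of net/http/pprof
// registers handlers under /debug/pprof/ on the default mux; serving that
// mux on :8080 exposes them to kubectl port-forward and curl as above.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers
)

func main() {
	go func() {
		log.Println(http.ListenAndServe(":8080", nil))
	}()

	// ... the rest of the termination handler would run here ...
	select {}
}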
Created attachment 1762895 [details] verified on 4.8.0-0.nightly-2021-03-10-142839
Created attachment 1762896 [details] reproduced on 4.7.0
Verified on 4.8.0-0.nightly-2021-03-10-142839; memory usage is stable.

$ kubectl top pod machine-api-termination-handler-2pd9z
NAME                                    CPU(cores)   MEMORY(bytes)
machine-api-termination-handler-2pd9z   0m           17Mi

Reproduced on 4.7.0: memory increased from 17Mi to 67Mi in 3 hours.

$ kubectl top pod machine-api-termination-handler-cwmhf
NAME                                    CPU(cores)   MEMORY(bytes)
machine-api-termination-handler-cwmhf   0m           67Mi
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438