https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1372002364097040384

The machine-controller logs are quite brief. Comparing the current and previous logs of the machine-controller, it's clear the container restarted at least twice. This is not a disruptive test, so there shouldn't be any restarts of our component.

Looking at the pod details, we have indeed restarted 5 times, and at least one restart was due to an OOM kill:

    "containerID": "cri-o://ab294ff8fff75e9114af6c07079dee1de688ae4c8e7bf2536183b08ebf405f46",
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:089124a4c3e71c8e32516195ac5e50a0906affc483d547f7aa3a81571bb5b784",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:089124a4c3e71c8e32516195ac5e50a0906affc483d547f7aa3a81571bb5b784",
    "lastState": {
      "terminated": {
        "containerID": "cri-o://49e7e128fbd2aa610d425dd9cb0f9de19979c58cd8edd5e63ce472da39b9bd13",
        "exitCode": 137,
        "finishedAt": "2021-03-17T03:13:29Z",
        "reason": "OOMKilled",
        "startedAt": "2021-03-17T02:59:12Z"
      }
    },
    "name": "machine-controller",
    "ready": true,
    "restartCount": 5,
    "started": true,
    "state": {
      "running": {
        "startedAt": "2021-03-17T03:13:30Z"
      }
    }
  },

The other containers (MHC, MachineSet, NodeRef) errored out at ~02:18. Everything lost leader election around the same time:

    I0317 02:17:52.625155       1 leaderelection.go:278] failed to renew lease openshift-machine-api/cluster-api-provider-machineset-leader: timed out waiting for the condition
    2021/03/17 02:17:52 leader election lost
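For context on why all of the controller containers went down together: losing leader election is fatal by design. Below is a minimal sketch of the standard client-go leader election pattern (not the actual machine-api code; the durations and identity handling are illustrative). The lease name/namespace match the log line above; when renewal times out, OnStoppedLeading fires and the process exits, so the containers restart around the same time rather than risking two active leaders.

// Minimal sketch of client-go leader election; illustrative only.
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease namespace/name taken from the log message above; the rest of
	// the values are illustrative defaults, not what the operator ships.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-machine-api",
		"cluster-api-provider-machineset-leader",
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 90 * time.Second,
		RenewDeadline: 60 * time.Second,
		RetryPeriod:   15 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the controllers here
			},
			OnStoppedLeading: func() {
				// Renewal timed out ("failed to renew lease ... leader
				// election lost"); exiting is the conventional response,
				// which shows up as a container restart.
				klog.Fatal("leader election lost")
			},
		},
	})
}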
This is why we shouldn't be adding limits to pods! This has already been fixed.

*** This bug has been marked as a duplicate of bug 1938493 ***
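(For anyone following along: "limits" here means a hard memory limit on the container spec. A rough sketch in Go types, with a made-up 256Mi figure, just to show the mechanism: once a hard limit is set, the kernel OOM-kills the process the moment usage crosses it, which is exactly the exitCode 137 / OOMKilled state in the pod status above, whether the spike is a leak or a legitimate burst.)

// Illustrative sketch of a container memory limit; values are not the
// real machine-api deployment's.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	c := corev1.Container{
		Name: "machine-controller",
		Resources: corev1.ResourceRequirements{
			// Requests only inform scheduling and do not trigger OOM kills.
			Requests: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("128Mi"),
			},
			// A hard limit makes the kernel OOM-kill the container when
			// usage exceeds it (exitCode 137, reason OOMKilled).
			Limits: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
		},
	}
	fmt.Printf("%+v\n", c.Resources)
}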
I'm reopening this bug. I want to see what the memory of this controller is doing before we close it. It took over 30 minutes to go OOM, and I want to ensure we're not leaking memory.
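One way to watch the controller's memory over time while checking for a leak would be something like the sketch below: expose the standard net/http/pprof endpoints and log runtime heap stats periodically. This is not how the machine-controller is instrumented today; the port and interval are arbitrary choices for illustration.

// Sketch: periodic heap stats plus pprof endpoints for leak hunting.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers
	"runtime"
	"time"
)

func main() {
	// Heap profiles can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// and compared across samples to see whether allocations keep growing.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_alloc=%d MiB heap_objects=%d num_gc=%d",
			m.HeapAlloc/1024/1024, m.HeapObjects, m.NumGC)
		time.Sleep(30 * time.Second)
	}
}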
I'm not going to get to this any time soon.