Description of problem:

During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed as part of the `openshift-machine-api` namespace, would panic when an OpenShift Machine was still in the "Provisioning" state but the corresponding AWS instance was already "Terminated".

```
I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
panic: assignment to entry in nil map

goroutine 125 [running]:
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```

What is the business impact? Please also provide timeframe information.
We failed to recover from a major outage due to this bug.

Where are you experiencing the behavior? What environment?
Production and all other environments.

When does the behavior occur? Frequency? Repeatedly? At certain times?
It has appeared only once so far, but it can reappear in larger scaling scenarios.

Version-Release number of selected component (if applicable):
4.8.39

Actual results:
With the machine-controller panicking, no new instances could be provisioned, leaving the cluster unable to scale. The workaround was to manually delete the offending Machines.

Expected results:
The machine-controller should handle this state gracefully so the cluster remains scalable without manual Machine deletion.

Additional info:
The issue is probably here:

- https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L416-L426 (the fields referenced there are written to at the location below; those lines should probably be duplicated, or moved, into the delete path)
- https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L165
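For context, this panic is standard Go behavior: reading from a nil map returns the zero value, but assigning to one panics. Below is a minimal sketch of the suspected fix pattern, assuming the delete path writes an instance-state annotation into a Machine whose annotations map was never initialized; the `Machine` type, the `recordInstanceState` helper, and the annotation key are simplified stand-ins for illustration, not the provider's actual code.

```go
package main

import "fmt"

// Machine is a stripped-down stand-in for the machine-api Machine object;
// only the annotations map matters for this bug.
type Machine struct {
	Annotations map[string]string
}

// recordInstanceState mimics the delete path that panicked: it writes the
// last observed instance state into the machine's annotations. Without the
// nil guard, a Machine whose annotations were never set (e.g. one still in
// "Provisioning") triggers "panic: assignment to entry in nil map".
func recordInstanceState(m *Machine, state string) {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations["machine.openshift.io/instance-state"] = state
}

func main() {
	m := &Machine{} // Annotations is nil here, as on a freshly created object
	recordInstanceState(m, "terminated")
	fmt.Println(m.Annotations["machine.openshift.io/instance-state"])
}
```

If this reading is right, the guard mirrors the initialization that already exists elsewhere in reconciler.go, which is why duplicating or hoisting those lines into the delete path would fix the panic.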
Since the issue is in machine-api, moving it to the correct team.
I am working on a fix for this.
Tried replacing worker nodes several times on 4.12.0-0.nightly-2022-07-17-215842; there is no panic. Moving this to Verified.

```
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws412-945hh-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   99m
huliu-aws412-945hh-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   99m
huliu-aws412-945hh-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   99m
huliu-aws412-945hh-worker-us-east-2a-4gwxb   Running   m6i.xlarge   us-east-2   us-east-2a   6m9s
huliu-aws412-945hh-worker-us-east-2a-cndjb   Running   m6i.xlarge   us-east-2   us-east-2a   6m22s
huliu-aws412-945hh-worker-us-east-2b-t2rvp   Running   m6i.xlarge   us-east-2   us-east-2b   5m55s
huliu-aws412-945hh-worker-us-east-2c-t98h4   Running   m6i.xlarge   us-east-2   us-east-2c   5m39s
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-77d49d497d-rpwlx   2/2     Running   0          99m
cluster-baremetal-operator-8b7bfdf74-2r6g6     2/2     Running   0          99m
machine-api-controllers-6f89cc4dcf-vn24l       7/7     Running   0          96m
machine-api-operator-675494c444-9l4mn          2/2     Running   0          99m
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-6f89cc4dcf-vn24l -c machine-controller | grep panic
liuhuali@Lius-MacBook-Pro huali-test %
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399