Description of problem:

During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed as part of the `openshift-machine-api` namespace, would panic when an OpenShift Machine was still in the "Provisioning" state but the corresponding AWS instance was already "Terminated".

```
I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
panic: assignment to entry in nil map

goroutine 125 [running]:
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```

What is the business impact? Please also provide timeframe information.
We failed to recover from a major outage due to this bug.

Where are you experiencing the behavior? What environment?
Production and all other environments.

When does the behavior occur? Frequency? Repeatedly? At certain times?
It has appeared only once so far, but it can reappear in larger scaling scenarios.

Version-Release number of selected component (if applicable):
4.8.39

Actual results:
With the machine-controller panicking, no new instances could be provisioned, leaving the cluster unable to scale. The workaround was to manually delete the offending Machines.

Expected results:
The machine-controller should handle this state gracefully so the cluster remains scalable without manual Machine deletion.

Additional info:
The issue is probably here:

- https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L416-L426 (the fields referenced there are written to at the location below; those lines should probably be duplicated, or moved, into the delete path)
- https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L165
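For context, this panic is standard Go behavior: reading from a nil map returns the zero value, but assigning to one panics. Below is a minimal sketch of the suspected fix pattern, assuming the delete path writes an instance-state annotation into a Machine whose annotations map was never initialized; the `Machine` type, the `recordInstanceState` helper, and the annotation key are simplified stand-ins for illustration, not the provider's actual code.

```go
package main

import "fmt"

// Machine is a stripped-down stand-in for the machine-api Machine object;
// only the annotations map matters for this bug.
type Machine struct {
	Annotations map[string]string
}

// recordInstanceState mimics the delete path that panicked: it writes the
// last observed instance state into the machine's annotations. Without the
// nil guard, a Machine whose annotations were never set (e.g. one still in
// "Provisioning") triggers "panic: assignment to entry in nil map".
func recordInstanceState(m *Machine, state string) {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations["machine.openshift.io/instance-state"] = state
}

func main() {
	m := &Machine{} // Annotations is nil here, as on a freshly created object
	recordInstanceState(m, "terminated")
	fmt.Println(m.Annotations["machine.openshift.io/instance-state"])
}
```

If this reading is right, the guard mirrors the initialization that already exists elsewhere in reconciler.go, which is why duplicating or hoisting those lines into the delete path would fix the panic.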
Since the issue is in machine-api, moving it to the correct team.
I am working on a fix for this.
Tried replacing worker nodes several times on 4.12.0-0.nightly-2022-07-17-215842; there is no panic. Moving this to Verified.

```
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws412-945hh-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   99m
huliu-aws412-945hh-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   99m
huliu-aws412-945hh-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   99m
huliu-aws412-945hh-worker-us-east-2a-4gwxb   Running   m6i.xlarge   us-east-2   us-east-2a   6m9s
huliu-aws412-945hh-worker-us-east-2a-cndjb   Running   m6i.xlarge   us-east-2   us-east-2a   6m22s
huliu-aws412-945hh-worker-us-east-2b-t2rvp   Running   m6i.xlarge   us-east-2   us-east-2b   5m55s
huliu-aws412-945hh-worker-us-east-2c-t98h4   Running   m6i.xlarge   us-east-2   us-east-2c   5m39s
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-77d49d497d-rpwlx   2/2     Running   0          99m
cluster-baremetal-operator-8b7bfdf74-2r6g6     2/2     Running   0          99m
machine-api-controllers-6f89cc4dcf-vn24l       7/7     Running   0          96m
machine-api-operator-675494c444-9l4mn          2/2     Running   0          99m
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-6f89cc4dcf-vn24l -c machine-controller | grep panic
liuhuali@Lius-MacBook-Pro huali-test %
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399