Bug 1929579 - [WinC] panic: runtime error: invalid memory address or nil pointer dereference after windowsmachine-controller reconciling
Summary: [WinC] panic: runtime error: invalid memory address or nil pointer dereferenc...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ---
: 4.8.0
Assignee: Sebastian Soto
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks: 1929794
TreeView+ depends on / blocked
 
Reported: 2021-02-17 08:27 UTC by Ronnie Rasouli
Modified: 2021-11-03 12:42 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: User gives an invalid private key Consequence: Operator panics Fix: Code change Result: Operator throws an error instead of panicking
Clone Of:
: 1929794 (view as bug list)
Environment:
Last Closed: 2021-11-03 12:42:16 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
WMCO deployment log (9.77 KB, text/plain)
2021-02-17 08:27 UTC, Ronnie Rasouli
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift windows-machine-config-operator pull 312 0 None open Bug 1929579: [wmco] Fix issue with dereferencing nil signer 2021-02-17 16:37:09 UTC

Description Ronnie Rasouli 2021-02-17 08:27:29 UTC
Created attachment 1757514 [details]
WMCO deployment log

Description of problem:
Observed a panic on WMCO after manipulating the secret and private key. 

The machine isn't properly reconciled due to panic.

2021-02-16T19:02:58.837Z	DEBUG	windowsmachine-controller	reconciling	{"namespace": "openshift-machine-api", "name": "rrasouli-0216-9kwxv-windows-worker-us-east-2a-r2g8v"}
E0216 19:02:58.855764       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1721 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1730340, 0x25949b0)
	/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1730340, 0x25949b0)
	/usr/local/go/src/runtime/panic.go:969 +0x175
github.com/openshift/windows-machine-config-operator/pkg/controller/windowsmachine.(*ReconcileWindowsMachine).Reconcile(0xc0006e7680, 0xc0009df460, 0x15, 0xc0000d6140, 0x33, 0x3efaaf528, 0xc0006b6c60, 0xc000284008, 0xc000284000)
	/build/windows-machine-config-operator/pkg/controller/windowsmachine/windowsmachine_controller.go:310 +0x1b48
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000204900, 0x1798a60, 0xc000b5a040, 0x0)
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000204900, 0x203000)
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000204900)
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000b71580)
	/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000b71580, 0x1b057c0, 0xc00074e330, 0x1, 0xc0000954a0)
	/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b71580, 0x3b9aca00, 0x0, 0xc000398301, 0xc0000954a0)
	/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc000b71580, 0x3b9aca00, 0xc0000954a0)
	/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x15553c8]


Version-Release number of selected component (if applicable):
WMCO 2.0

How reproducible:
100%

Steps to Reproduce:
1. delete the secret 
# oc delete secret cloud-private-key
2. create similar secret with a non valid private key, that does not match the public key on the Win node.
@ oc create secret generic cloud-private-key --from-file=private-key.pem=/root/.ssh/some-key.pem
3. watch the deployment logs:
# oc logs deployments/windows-machine-config-operator -f

Actual results:
Panic!
The machine reconcile is stopped due to the panic on WMCO

Expected results:
Successful reconciliation node should be marked for deletion.
No panic on WMCO pod

Additional info:

Comment 1 gaoshang 2021-02-19 14:07:03 UTC
This bug has been verified and passed on 4.7.0-0.nightly-2021-02-18-110409, thanks.

Version:
OCP version: 4.7.0-0.nightly-2021-02-18-110409
WMCO commit: f1f40153af071e9778d3ada5ec6dc93e9adfaa9d

Steps:
1, WMCO installed and Windows nodes has been bootstrapped
# oc get nodes -l kubernetes.io/os=windows
NAME            STATUS   ROLES    AGE   VERSION
windows-k9t5g   Ready    worker   98m   v1.20.0-1030+cac2421340a449

2, Delete secret cloud-private-key and create it again with an invalid key

# oc delete secret cloud-private-key -n openshift-windows-machine-config-operator
secret "cloud-private-key" deleted

# echo "invalid" > /root/.ssh/invalid.pem

# oc create secret generic cloud-private-key --from-file=private-key.pem=/root/.ssh/invalid.pem
secret/cloud-private-key created

3, Check WMCO log, no panic now and get seasonable error message

2021-02-19T13:48:49.358Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "secret_controller", "request": "openshift-windows-machine-config-operator/cloud-private-key", "error": "error generating windows-user-data secret: unable to parse private key: ssh: no key found" ....

4, Delete above invalid secret and create with another valid key

# oc delete secret cloud-private-key
secret "cloud-private-key" deleted

# oc create secret generic cloud-private-key --from-file=private-key.pem=/root/.ssh/openshift-dev.pem 
secret/cloud-private-key created

5, Check WMCO log, Windows node provisioned with old key is marked for deletion and new Windows node is reconciled.

# oc get nodes -l kubernetes.io/os=windows
NAME                                   STATUS                     ROLES    AGE    VERSION
windows-k9t5g                          Ready,SchedulingDisabled   worker   111m   v1.20.0-1030+cac2421340a449


# oc get nodes -l kubernetes.io/os=windows
NAME            STATUS   ROLES    AGE     VERSION
windows-tzgf5   Ready    worker   2m43s   v1.20.0-1030+cac2421340a449

# oc logs -f deployment.apps/windows-machine-config-operator
...
2021-02-19T13:54:16.752Z	DEBUG	windowsmachine-controller	reconciling	{"namespace": "openshift-machine-api", "name": "windows-k9t5g"}
2021-02-19T13:54:16.765Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "secret_controller", "request": "openshift-windows-machine-config-operator/cloud-private-key"}
2021-02-19T13:54:16.772Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "secret_controller", "request": "openshift-windows-machine-config-operator/cloud-private-key"}
2021-02-19T13:54:16.773Z	INFO	windowsmachine-controller	deleting machine	{"name": "windows-k9t5g"}
2021-02-19T13:54:16.803Z	INFO	windowsmachine-controller	machine has been remediated by deletion	{"name": "windows-k9t5g"}
2021-02-19T13:54:16.803Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "windowsmachine-controller", "request": "openshift-machine-api/windows-k9t5g"}
2021-02-19T13:54:16.803Z	DEBUG	windowsmachine-controller	reconciling	{"namespace": "openshift-machine-api", "name": "windows-k9t5g"}
2021-02-19T13:54:16.803Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Machine","namespace":"openshift-machine-api","name":"windows-k9t5g","uid":"63a6d9aa-715e-4fd2-9baf-2c9a0cb2beb9","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"185934"}, "reason": "MachineDeleted", "message": "Machine windows-k9t5g has been remediated by deleting the Machine object"}
2021-02-19T13:54:16.816Z	INFO	windowsmachine-controller	deleting machine	{"name": "windows-k9t5g"}
2021-02-19T13:54:16.828Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "windowsmachine-controller", "request": "openshift-machine-api/windows-k9t5g"}
2021-02-19T13:54:16.828Z	DEBUG	windowsmachine-controller	reconciling	{"namespace": "openshift-machine-api", "name": "windows-k9t5g"}
2021-02-19T13:54:16.837Z	DEBUG	windowsmachine-controller	machine not provisioned	{"phase": "Deleting"}
2021-02-19T13:54:16.855Z	INFO	metrics	Prometheus configured	{"endpoints": "windows-machine-config-operator-metrics", "port": 9182, "name": "metrics"}
2021-02-19T13:54:16.856Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "windowsmachine-controller", "request": "openshift-machine-api/windows-k9t5g"}
2021-02-19T13:54:16.856Z	DEBUG	windowsmachine-controller	reconciling	{"namespace": "openshift-machine-api", "name": "windows-tzgf5"}
...


Note You need to log in before you can comment on or make changes to this bug.