Created attachment 1757514 [details] WMCO deployment log Description of problem: Observed a panic on WMCO after manipulating the secret and private key. The machine isn't properly reconciled due to panic. 2021-02-16T19:02:58.837Z DEBUG windowsmachine-controller reconciling {"namespace": "openshift-machine-api", "name": "rrasouli-0216-9kwxv-windows-worker-us-east-2a-r2g8v"} E0216 19:02:58.855764 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 1721 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1730340, 0x25949b0) /build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95 k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0) /build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89 panic(0x1730340, 0x25949b0) /usr/local/go/src/runtime/panic.go:969 +0x175 github.com/openshift/windows-machine-config-operator/pkg/controller/windowsmachine.(*ReconcileWindowsMachine).Reconcile(0xc0006e7680, 0xc0009df460, 0x15, 0xc0000d6140, 0x33, 0x3efaaf528, 0xc0006b6c60, 0xc000284008, 0xc000284000) /build/windows-machine-config-operator/pkg/controller/windowsmachine/windowsmachine_controller.go:310 +0x1b48 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000204900, 0x1798a60, 0xc000b5a040, 0x0) /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000204900, 0x203000) /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000204900) /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000b71580) /build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000b71580, 0x1b057c0, 0xc00074e330, 0x1, 0xc0000954a0) /build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b71580, 0x3b9aca00, 0x0, 0xc000398301, 0xc0000954a0) /build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98 k8s.io/apimachinery/pkg/util/wait.Until(0xc000b71580, 0x3b9aca00, 0xc0000954a0) /build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1 /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x15553c8] Version-Release number of selected component (if applicable): WMCO 2.0 How reproducible: 100% Steps to Reproduce: 1. delete the secret # oc delete secret cloud-private-key 2. create similar secret with a non valid private key, that does not match the public key on the Win node. @ oc create secret generic cloud-private-key --from-file=private-key.pem=/root/.ssh/some-key.pem 3. watch the deployment logs: # oc logs deployments/windows-machine-config-operator -f Actual results: Panic! The machine reconcile is stopped due to the panic on WMCO Expected results: Successful reconciliation node should be marked for deletion. No panic on WMCO pod Additional info:
This bug has been verified and passed on 4.7.0-0.nightly-2021-02-18-110409, thanks. Version: OCP version: 4.7.0-0.nightly-2021-02-18-110409 WMCO commit: f1f40153af071e9778d3ada5ec6dc93e9adfaa9d Steps: 1, WMCO installed and Windows nodes has been bootstrapped # oc get nodes -l kubernetes.io/os=windows NAME STATUS ROLES AGE VERSION windows-k9t5g Ready worker 98m v1.20.0-1030+cac2421340a449 2, Delete secret cloud-private-key and create it again with an invalid key # oc delete secret cloud-private-key -n openshift-windows-machine-config-operator secret "cloud-private-key" deleted # echo "invalid" > /root/.ssh/invalid.pem # oc create secret generic cloud-private-key --from-file=private-key.pem=/root/.ssh/invalid.pem secret/cloud-private-key created 3, Check WMCO log, no panic now and get seasonable error message 2021-02-19T13:48:49.358Z ERROR controller-runtime.controller Reconciler error {"controller": "secret_controller", "request": "openshift-windows-machine-config-operator/cloud-private-key", "error": "error generating windows-user-data secret: unable to parse private key: ssh: no key found" .... 4, Delete above invalid secret and create with another valid key # oc delete secret cloud-private-key secret "cloud-private-key" deleted # oc create secret generic cloud-private-key --from-file=private-key.pem=/root/.ssh/openshift-dev.pem secret/cloud-private-key created 5, Check WMCO log, Windows node provisioned with old key is marked for deletion and new Windows node is reconciled. # oc get nodes -l kubernetes.io/os=windows NAME STATUS ROLES AGE VERSION windows-k9t5g Ready,SchedulingDisabled worker 111m v1.20.0-1030+cac2421340a449 # oc get nodes -l kubernetes.io/os=windows NAME STATUS ROLES AGE VERSION windows-tzgf5 Ready worker 2m43s v1.20.0-1030+cac2421340a449 # oc logs -f deployment.apps/windows-machine-config-operator ... 2021-02-19T13:54:16.752Z DEBUG windowsmachine-controller reconciling {"namespace": "openshift-machine-api", "name": "windows-k9t5g"} 2021-02-19T13:54:16.765Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "secret_controller", "request": "openshift-windows-machine-config-operator/cloud-private-key"} 2021-02-19T13:54:16.772Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "secret_controller", "request": "openshift-windows-machine-config-operator/cloud-private-key"} 2021-02-19T13:54:16.773Z INFO windowsmachine-controller deleting machine {"name": "windows-k9t5g"} 2021-02-19T13:54:16.803Z INFO windowsmachine-controller machine has been remediated by deletion {"name": "windows-k9t5g"} 2021-02-19T13:54:16.803Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "windowsmachine-controller", "request": "openshift-machine-api/windows-k9t5g"} 2021-02-19T13:54:16.803Z DEBUG windowsmachine-controller reconciling {"namespace": "openshift-machine-api", "name": "windows-k9t5g"} 2021-02-19T13:54:16.803Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Machine","namespace":"openshift-machine-api","name":"windows-k9t5g","uid":"63a6d9aa-715e-4fd2-9baf-2c9a0cb2beb9","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"185934"}, "reason": "MachineDeleted", "message": "Machine windows-k9t5g has been remediated by deleting the Machine object"} 2021-02-19T13:54:16.816Z INFO windowsmachine-controller deleting machine {"name": "windows-k9t5g"} 2021-02-19T13:54:16.828Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "windowsmachine-controller", "request": "openshift-machine-api/windows-k9t5g"} 2021-02-19T13:54:16.828Z DEBUG windowsmachine-controller reconciling {"namespace": "openshift-machine-api", "name": "windows-k9t5g"} 2021-02-19T13:54:16.837Z DEBUG windowsmachine-controller machine not provisioned {"phase": "Deleting"} 2021-02-19T13:54:16.855Z INFO metrics Prometheus configured {"endpoints": "windows-machine-config-operator-metrics", "port": 9182, "name": "metrics"} 2021-02-19T13:54:16.856Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "windowsmachine-controller", "request": "openshift-machine-api/windows-k9t5g"} 2021-02-19T13:54:16.856Z DEBUG windowsmachine-controller reconciling {"namespace": "openshift-machine-api", "name": "windows-tzgf5"} ...