Description of problem:

When a cluster attempts a 4.9 -> 4.10 upgrade and has any failing webhooks, the kube-apiserver cluster operator goes Degraded, which in turn causes the upgrade to stall.

Version-Release number of selected component (if applicable):
4.10.0-rc1

How reproducible:
Observed on several clusters.

Steps to Reproduce:
1. Install a validatingwebhookconfiguration or mutatingwebhookconfiguration that has problems (e.g. it points at a missing service, or the service's serving certificate is not trusted).
2. Attempt an upgrade to 4.10.0-rc1.

Actual results:
After the kube-apiserver clusteroperator upgrades to 4.10.0-rc1, the upgrade stalls out due to issues with the webhook.

Expected results:
A failing webhook should not result in an upgrade stalling part-way through. Either the upgrade should not commence, or the situation should be handled gracefully, without requiring intervention part-way through the upgrade to unblock it.
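For anyone triaging a cluster in this state, a quick way to see which webhook the operator is blaming (plain oc against the standard ClusterOperator API, nothing specific to this bug) is to read the Degraded condition message directly:

$ oc get clusteroperator kube-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'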
[1] is the new-in-4.10 controller. The situation is:

1. Running 4.9 with a broken webhook. Whatever the webhook is supposed to do doesn't work, but kube-apiserver is otherwise oblivious to the issue.
2. Cluster updates towards 4.10.
3. As the 4.10 kube-apiserver operator comes in, the new controller [1] takes a look around, sees the broken webhooks, and sets Degraded=True.
4. Update wedges, because the cluster-version operator won't move past the kube-apiserver ClusterOperator manifest while it's Degraded [2].

We plan on softening the CVO behavior so it doesn't block on Degraded [3], but we aren't there yet. In the meantime, options include:

a. Moving from Degraded conditions to alerts, so we can complain without blocking updates. We did this for the vSphere problem detector in bug 1943719.
b. Getting some kind of early warning system into 4.9 so folks hear about these issues and have time to mitigate before updating to 4.10 (a sketch of such a check follows below).
c. Adding a lump of inertia to the 4.10 Degraded condition, so most folks are likely to complete the update before Degraded goes True and locks up further updating.

[1]: https://github.com/openshift/cluster-kube-apiserver-operator/blob/98cea10c60a7e4da61f51d0cf388cfda47af6841/pkg/operator/webhooksupportabilitycontroller/
[2]: https://github.com/openshift/enhancements/blame/27846285be01a2aebf8d3a04ebb8ed7f877e4959/dev-guide/cluster-version-operator/user/reconciliation.md#L160
[3]: https://issues.redhat.com/browse/OTA-540
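As a rough illustration of option (b), a pre-update check could enumerate every service-backed admission webhook and flag services with no ready endpoints. This is only a sketch against the stock admissionregistration.k8s.io/v1 API, not shipped tooling; it skips URL-based webhooks and won't catch TLS trust problems:

$ for ref in $(oc get validatingwebhookconfigurations,mutatingwebhookconfigurations \
      -o jsonpath='{range .items[*].webhooks[*]}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\n"}{end}' \
      | sort -u); do
    ns=${ref%%/*}; name=${ref#*/}
    # URL-based webhooks have no service reference; skip the empty entries they leave behind
    if [ -n "$ns" ] && [ -n "$name" ]; then
      oc get endpoints -n "$ns" "$name" -o jsonpath='{.subsets[*].addresses}' 2>/dev/null | grep -q ip \
        || echo "WARNING: webhook service $ref has no ready endpoints"
    fi
  done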
Pre-merge verified the bug as below.

1. Install a recent 4.9.z.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        False         12m     Cluster version is 4.9.21

2. Install a broken webhook.

$ cat webhook-deploy.yaml
# The targetPort deliberately does not match the containerPort, which makes the webhook fail.
apiVersion: v1
kind: Namespace
metadata:
  name: validationwebhook
---
apiVersion: v1
kind: Service
metadata:
  name: validationwebhook
  namespace: validationwebhook
spec:
  selector:
    app: validationwebhook
  ports:
  - protocol: TCP
    port: 443
    targetPort: 8444
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: validationwebhook
  name: validationwebhook
  namespace: validationwebhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: validationwebhook
  template:
    metadata:
      labels:
        app: validationwebhook
    spec:
      containers:
      - name: test1
        image: quay.io/wangke19/test1:v1
        imagePullPolicy: Always
        ports:
        - containerPort: 8443

$ cat webhook-registration.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validationwebhook.validationwebhook.svc
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
webhooks:
- name: validationwebhook.validationwebhook.svc
  failurePolicy: Fail
  rules:
  - apiGroups: ["*"]
    apiVersions: ["v1"]
    operations: ["UPDATE"]
    resources: ["nodes"]
  clientConfig:
    service:
      namespace: validationwebhook
      name: validationwebhook
      path: "/"
  admissionReviewVersions: ["v1"]
  sideEffects: None

$ oc apply -f webhook-deploy.yaml
namespace/validationwebhook created
service/validationwebhook created
deployment.apps/validationwebhook created

$ oc apply -f webhook-registration.yaml
validatingwebhookconfiguration.admissionregistration.k8s.io/validationwebhook.validationwebhook.svc created

The webhook pod runs into an error:

$ oc get pod -n validationwebhook
NAME                                READY   STATUS             RESTARTS      AGE
validationwebhook-7478c99bd-r9n5h   0/1     CrashLoopBackOff   3 (11s ago)   81s

After a while, check the kube-apiserver; the 4.9 operator takes no notice of the webhook status:

$ oc get co/kube-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        False         False      35m
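Since the registration above intercepts UPDATE on nodes with failurePolicy: Fail, you can also confirm the webhook is really broken before kicking off the update (a side check, not part of the original verification steps): any node update, such as adding a throwaway label, should be rejected with an admission error mentioning 'failed calling webhook "validationwebhook.validationwebhook.svc"'.

$ oc label node <any-node> webhook-probe=true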
3. Ask the cluster to update to a 4.10 payload built by cluster-bot with PR 1312.

$ oc adm upgrade --to-image=registry.build01.ci.openshift.org/ci-ln-cri889k/release:latest --force=true --allow-explicit-upgrade=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.build01.ci.openshift.org/ci-ln-cri889k/release:latest

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          18s     Working towards 4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest: 9 of 836 done (1% complete)

$ oc get co/kube-apiserver; oc get clusterversion
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        True          False      41m     NodeInstallerProgressing: 2 nodes are at revision 6; 1 nodes are at revision 7
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          5m41s   Working towards 4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest: 95 of 771 done (12% complete)

$ oc get co/kube-apiserver; oc get clusterversion
NAME             VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest   True        False         False      52m
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          16m     Working towards 4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest: 203 of 771 done (26% complete)

Based on the above: as the 4.10 kube-apiserver operator comes in, the new controller takes a look around, sees the broken webhooks, and (with this fix) stays Degraded=False. kube-apiserver was updated to the new revision, the behavior is as expected, and the PR fix works fine.
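To watch the new controller's verdict directly while the update runs, the condition types and statuses on the ClusterOperator can be dumped with a generic jsonpath query (again standard oc, not bug-specific); Degraded should stay False throughout:

$ oc get co kube-apiserver \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'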
Retest the upgrade from 4.9.21 to the latest 4.10 nightly including the PR fix; steps below.

After creating the broken webhook, upgrade the cluster to the 4.10 nightly:

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        True          False      34m     NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          2m27s   Working towards 4.10.0-0.nightly-2022-02-11-082848: 94 of 770 done (12% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-11-082848   True        False         False      110m

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          77m     Working towards 4.10.0-0.nightly-2022-02-11-082848: 648 of 770 done (84% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-11-082848   True        False         False      128m

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          95m     Unable to apply 4.10.0-0.nightly-2022-02-11-082848: wait has exceeded 40 minutes for these operators: machine-config

As the 4.10 kube-apiserver operator including the PR fix comes in, the update proceeds smoothly past kube-apiserver. The upgrade finally got stuck in the machine-config update, which looks like bug 2000937; I've seen it before. Anyway, the broken webhooks are no longer a problem for the upgrade, so moving the bug to VERIFIED.
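A quick generic filter for spotting which operators are still holding an update back (assuming the default 'oc get co' column order NAME VERSION AVAILABLE PROGRESSING DEGRADED) is shown below; in the run above it would single out machine-config:

$ oc get clusteroperators --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'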
Try to upgrade from 4.10.0-rc.1 to the latest 4.10 nightly including the PR fix; steps below.

Created the broken webhook, then upgraded the cluster to the 4.10 nightly:

$ oc get co/kube-apiserver
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       13m     ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.180.174:443: connect: no route to host

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       14m     ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.180.174:443: connect: no route to host

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          11s     Working towards 4.10.0-0.nightly-2022-02-11-082848: 20 of 770 done (2% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        True          False      20m     NodeInstallerProgressing: 1 nodes are at revision 8; 2 nodes are at revision 9

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          6m40s   Working towards 4.10.0-0.nightly-2022-02-11-082848: 95 of 770 done (12% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-11-082848   True        False         False      67m

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          53m     Working towards 4.10.0-0.nightly-2022-02-11-082848: 648 of 770 done (84% complete), waiting on machine-config

From the above, we can see the broken webhooks are no longer a problem for the upgrade; the kube-apiserver operator cleaned the old Degraded conditions out.
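To double-check that the stale ValidatingAdmissionWebhookConfigurationDegraded message was really cleared, rather than just truncated out of the table view, the aggregated Degraded condition can be read directly; after the update it should report status False with no webhook message:

$ oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'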
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056