Description of problem:

When a cluster attempts a 4.9 -> 4.10 upgrade and has any failing webhooks, the kube-apiserver cluster operator goes Degraded, which in turn causes the upgrade to stall.

Version-Release number of selected component (if applicable):
4.10.0-rc1

How reproducible:
Observed on several clusters.

Steps to Reproduce:
1. Install a validatingwebhookconfiguration or mutatingwebhookconfiguration that has problems (e.g. it points at a missing service, or the service's serving certificate is not trusted).
2. Attempt an upgrade to 4.10.0-rc1.

Actual results:
After the kube-apiserver clusteroperator upgrades to 4.10.0-rc1, the upgrade stalls out due to issues with the webhook.

Expected results:
A failing webhook should not result in an upgrade stalling part-way through. Either the upgrade should not commence, or the situation should be handled gracefully, without requiring intervention part-way through the upgrade to unblock it.
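For anyone triaging a cluster in this state, a quick way to see which webhook the operator is blaming (plain oc against the standard ClusterOperator API, nothing specific to this bug) is to read the Degraded condition message directly:

$ oc get clusteroperator kube-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'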
[1] is the new-in-4.10 controller. The situation is:

1. Running 4.9 with a broken webhook. Whatever the webhook is supposed to do doesn't work, but kube-apiserver is otherwise oblivious to the issue.
2. Cluster updates towards 4.10.
3. As the 4.10 kube-apiserver operator comes in, the new controller [1] takes a look around, sees the broken webhooks, and sets Degraded=True.
4. Update wedges, because the cluster-version operator won't move past the kube-apiserver ClusterOperator manifest while it's Degraded [2].

We plan on softening the CVO behavior so it doesn't block on Degraded [3], but we aren't there yet. In the meantime, options include:

a. Moving from Degraded conditions to alerts, so we can complain without blocking updates. We did this for the vSphere problem detector in bug 1943719.
b. Getting some kind of early warning system into 4.9 so folks hear about these issues and have time to mitigate before updating to 4.10 (a sketch of such a check follows below).
c. Adding a lump of inertia to the 4.10 Degraded condition, so most folks are likely to complete the update before Degraded goes True and locks up further updating.

[1]: https://github.com/openshift/cluster-kube-apiserver-operator/blob/98cea10c60a7e4da61f51d0cf388cfda47af6841/pkg/operator/webhooksupportabilitycontroller/
[2]: https://github.com/openshift/enhancements/blame/27846285be01a2aebf8d3a04ebb8ed7f877e4959/dev-guide/cluster-version-operator/user/reconciliation.md#L160
[3]: https://issues.redhat.com/browse/OTA-540
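As a rough illustration of option (b), a pre-update check could enumerate every service-backed admission webhook and flag services with no ready endpoints. This is only a sketch against the stock admissionregistration.k8s.io/v1 API, not shipped tooling; it skips URL-based webhooks and won't catch TLS trust problems:

$ for ref in $(oc get validatingwebhookconfigurations,mutatingwebhookconfigurations \
      -o jsonpath='{range .items[*].webhooks[*]}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\n"}{end}' \
      | sort -u); do
    ns=${ref%%/*}; name=${ref#*/}
    # URL-based webhooks have no service reference; skip the empty entries they leave behind
    if [ -n "$ns" ] && [ -n "$name" ]; then
      oc get endpoints -n "$ns" "$name" -o jsonpath='{.subsets[*].addresses}' 2>/dev/null | grep -q ip \
        || echo "WARNING: webhook service $ref has no ready endpoints"
    fi
  done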
Pre-merge verified the bug as below.

1. Install a recent 4.9.z.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        False         12m     Cluster version is 4.9.21

2. Install a broken webhook.

$ cat webhook-deploy.yaml
# The targetPort deliberately does not match the containerPort, which makes the webhook fail.
apiVersion: v1
kind: Namespace
metadata:
  name: validationwebhook
---
apiVersion: v1
kind: Service
metadata:
  name: validationwebhook
  namespace: validationwebhook
spec:
  selector:
    app: validationwebhook
  ports:
  - protocol: TCP
    port: 443
    targetPort: 8444
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: validationwebhook
  name: validationwebhook
  namespace: validationwebhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: validationwebhook
  template:
    metadata:
      labels:
        app: validationwebhook
    spec:
      containers:
      - name: test1
        image: quay.io/wangke19/test1:v1
        imagePullPolicy: Always
        ports:
        - containerPort: 8443

$ cat webhook-registration.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validationwebhook.validationwebhook.svc
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
webhooks:
- name: validationwebhook.validationwebhook.svc
  failurePolicy: Fail
  rules:
  - apiGroups: ["*"]
    apiVersions: ["v1"]
    operations: ["UPDATE"]
    resources: ["nodes"]
  clientConfig:
    service:
      namespace: validationwebhook
      name: validationwebhook
      path: "/"
  admissionReviewVersions: ["v1"]
  sideEffects: None

$ oc apply -f webhook-deploy.yaml
namespace/validationwebhook created
service/validationwebhook created
deployment.apps/validationwebhook created

$ oc apply -f webhook-registration.yaml
validatingwebhookconfiguration.admissionregistration.k8s.io/validationwebhook.validationwebhook.svc created

The webhook pod runs into an error:

$ oc get pod -n validationwebhook
NAME                                READY   STATUS             RESTARTS      AGE
validationwebhook-7478c99bd-r9n5h   0/1     CrashLoopBackOff   3 (11s ago)   81s

After a while, check the kube-apiserver; the 4.9 operator takes no notice of the webhook status:

$ oc get co/kube-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        False         False      35m
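Since the registration above intercepts UPDATE on nodes with failurePolicy: Fail, you can also confirm the webhook is really broken before kicking off the update (a side check, not part of the original verification steps): any node update, such as adding a throwaway label, should be rejected with an admission error mentioning 'failed calling webhook "validationwebhook.validationwebhook.svc"'.

$ oc label node <any-node> webhook-probe=true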
3. Ask the cluster to update to a 4.10 payload built by cluster-bot with PR 1312.

$ oc adm upgrade --to-image=registry.build01.ci.openshift.org/ci-ln-cri889k/release:latest --force=true --allow-explicit-upgrade=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.build01.ci.openshift.org/ci-ln-cri889k/release:latest

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          18s     Working towards 4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest: 9 of 836 done (1% complete)

$ oc get co/kube-apiserver; oc get clusterversion
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        True          False      41m     NodeInstallerProgressing: 2 nodes are at revision 6; 1 nodes are at revision 7
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          5m41s   Working towards 4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest: 95 of 771 done (12% complete)

$ oc get co/kube-apiserver; oc get clusterversion
NAME             VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest   True        False         False      52m
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          16m     Working towards 4.10.0-0.ci.test-2022-02-10-152048-ci-ln-cri889k-latest: 203 of 771 done (26% complete)

Based on the above: as the 4.10 kube-apiserver operator comes in, the new controller takes a look around, sees the broken webhooks, and (with this fix) stays Degraded=False. kube-apiserver was updated to the new revision, the behavior is as expected, and the PR fix works fine.
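To watch the new controller's verdict directly while the update runs, the condition types and statuses on the ClusterOperator can be dumped with a generic jsonpath query (again standard oc, not bug-specific); Degraded should stay False throughout:

$ oc get co kube-apiserver \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'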
Retest the upgrade from 4.9.21 to the latest 4.10 nightly including the PR fix; steps below.

After creating the broken webhook, upgrade the cluster to the 4.10 nightly:

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        True          False      34m     NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          2m27s   Working towards 4.10.0-0.nightly-2022-02-11-082848: 94 of 770 done (12% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-11-082848   True        False         False      110m

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          77m     Working towards 4.10.0-0.nightly-2022-02-11-082848: 648 of 770 done (84% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-11-082848   True        False         False      128m

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          95m     Unable to apply 4.10.0-0.nightly-2022-02-11-082848: wait has exceeded 40 minutes for these operators: machine-config

As the 4.10 kube-apiserver operator including the PR fix comes in, the update proceeds smoothly past kube-apiserver. The upgrade finally got stuck in the machine-config update, which looks like bug 2000937; I've seen it before. Anyway, the broken webhooks are no longer a problem for the upgrade, so moving the bug to VERIFIED.
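A quick generic filter for spotting which operators are still holding an update back (assuming the default 'oc get co' column order NAME VERSION AVAILABLE PROGRESSING DEGRADED) is shown below; in the run above it would single out machine-config:

$ oc get clusteroperators --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'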
Try to upgrade from 4.10.0-rc.1 to the latest 4.10 nightly including the PR fix; steps below.

Created the broken webhook, then upgraded the cluster to the 4.10 nightly:

$ oc get co/kube-apiserver
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       13m     ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.180.174:443: connect: no route to host

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-11-082848

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       14m     ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.180.174:443: connect: no route to host

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          11s     Working towards 4.10.0-0.nightly-2022-02-11-082848: 20 of 770 done (2% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        True          False      20m     NodeInstallerProgressing: 1 nodes are at revision 8; 2 nodes are at revision 9

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          6m40s   Working towards 4.10.0-0.nightly-2022-02-11-082848: 95 of 770 done (12% complete)

$ oc get co/kube-apiserver; echo; oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-11-082848   True        False         False      67m

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          53m     Working towards 4.10.0-0.nightly-2022-02-11-082848: 648 of 770 done (84% complete), waiting on machine-config

From the above, we can see the broken webhooks are no longer a problem for the upgrade; the kube-apiserver operator cleaned the old Degraded conditions out.
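To double-check that the stale ValidatingAdmissionWebhookConfigurationDegraded message was really cleared, rather than just truncated out of the table view, the aggregated Degraded condition can be read directly; after the update it should report status False with no webhook message:

$ oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'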
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056