Bug 2052513 - Failing webhooks will block an upgrade to 4.10 mid-way through the upgrade.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Luis Sanchez
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 2052339
Reported: 2022-02-09 13:21 UTC by Abu Kashem
Modified: 2022-08-10 10:49 UTC
CC: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2052339
Environment:
Last Closed: 2022-08-10 10:48:33 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-kube-apiserver-operator pull 1309 (Merged): Bug 2052513: disable webhook supportability during upgrade (2022-02-09 16:06:10 UTC)
Github openshift cluster-kube-apiserver-operator pull 1313 (Merged): Bug 2052513: degraded webhook conditions to errors (2022-02-10 19:54:26 UTC)
Red Hat Product Errata RHSA-2022:5069 (2022-08-10 10:49:09 UTC)

Comment 1 W. Trevor King 2022-02-09 16:19:10 UTC
Given the way the fix was scoped [1], verification for 4.11 is going to be tricky.  Perhaps:

1. Install a recent 4.9.z.
2. Install a broken webhook.  Whatever the webhook is supposed to do doesn't work, but kube-apiserver is otherwise oblivious to the issue.
3. Ask the cluster to update directly to a 4.11 nightly.  Direct 4.9 -> 4.11 updates are a terrible idea for anyone who cares about their cluster, but the kube-apiserver is early in the update, and maybe we'll get that far before things blow up.
4. Cluster updates towards 4.10.
5. As the 4.11 kube-apiserver operator comes in, the new controller [1] takes a look around, sees the broken webhooks, and (before this fix) sets Degraded=True or (with this fix) stays Degraded=False.

If the CVO gets past the kube-apiserver ClusterOperator and starts asking later components to update, we can confirm that the 4.11 fix is working as expected.

[1]: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1309#discussion_r802636750
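
One way to watch for the condition in step 5 during such a run could be a pair of jsonpath queries against the standard ClusterOperator/ClusterVersion status fields (a sketch only, not part of the original plan):

$ oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{": "}{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'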

Comment 4 Ke Wang 2022-02-10 12:45:02 UTC
Referring to Comment 1, the verification is as below:

1. Install a recent 4.9.z.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        False         12m     Cluster version is 4.9.21

2. Install a broken webhook. 
$ cat webhook-deploy.yaml # Make the targetPort inconsistent with the containerPort; this will cause the webhook to fail.
apiVersion: v1
kind: Namespace
metadata:
  name: validationwebhook

---

apiVersion: v1
kind: Service
metadata:
  name: validationwebhook
  namespace: validationwebhook
spec:
  selector:
    app: validationwebhook
  ports:
  - protocol: TCP
    port: 443
    targetPort: 8444

---

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: validationwebhook
  name: validationwebhook
  namespace: validationwebhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: validationwebhook
  template:
    metadata:
      labels:
        app: validationwebhook
    spec:
      containers:
      - name: test1
        image: quay.io/wangke19/test1:v1
        imagePullPolicy: Always
        ports:
        - containerPort: 8443

-------
$ cat webhook-registration.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validationwebhook.validationwebhook.svc
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
webhooks:
- name: validationwebhook.validationwebhook.svc
  failurePolicy: Fail
  rules:
  - apiGroups:   ["*"]
    apiVersions: ["v1"]
    operations:  ["UPDATE"]
    resources:   ["nodes"]
  clientConfig:
    service:
      namespace: validationwebhook
      name: validationwebhook
      path: "/"
  admissionReviewVersions: ["v1"]
  sideEffects: None

Thu Feb 10 18:46:34 [kewang@kewang-fedora]$ oc apply -f webhook-deploy.yaml
namespace/validationwebhook created
service/validationwebhook created
deployment.apps/validationwebhook created

Thu Feb 10 18:46:54 [kewang@kewang-fedora]$ oc apply -f webhook-registration.yaml
validatingwebhookconfiguration.admissionregistration.k8s.io/validationwebhook.validationwebhook.svc created

The webhook ran into an error:
Thu Feb 10 18:55:01 [kewang@kewang-fedora]$ oc get pod -n validationwebhook
NAME                                READY   STATUS             RESTARTS        AGE
validationwebhook-7478c99bd-78g75   0/1     CrashLoopBackOff   6 (2m11s ago)   8m29s
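
To see why the pod is failing, standard triage commands could be used (a sketch; the exact pod name above will differ per run):

$ oc -n validationwebhook describe pod -l app=validationwebhook
$ oc -n validationwebhook logs deployment/validationwebhook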

After a while, check the kube-apiserver; it does not care about the webhook status:
Thu Feb 10 18:55:33 [kewang@kewang-fedora]$ oc get co/kube-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        False         False      32m     
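
As an extra sanity check that the registration itself is broken (not part of the original steps; <node-name> is a placeholder), any UPDATE on a node should now be rejected, because the webhook service is unreachable and failurePolicy is Fail:

$ oc annotate node <node-name> webhook-check=true
# Expected to be rejected with an error mentioning: failed calling webhook "validationwebhook.validationwebhook.svc"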

3. Ask the cluster to update directly to a 4.11 nightly, 4.11.0-0.nightly-2022-02-10-031822 (which already includes the PR fix).
Thu Feb 10 18:56:17 [kewang@kewang-fedora]$  oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-10-031822 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-10-031822
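
As the first warning notes, a by-digest pull spec is the safer form; a hedged equivalent would look like the following (the sha256 digest is a placeholder, not the real digest of this nightly):

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:<digest> --allow-explicit-upgrade --force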

Thu Feb 10 18:58:50 [kewang@kewang-fedora]$  oc get co/kube-apiserver;oc get clusterversion
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        True          False      35m     NodeInstallerProgressing: 3 nodes are at revision 10; 0 nodes have achieved new revision 12
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          2m18s   Working towards 4.11.0-0.nightly-2022-02-10-031822: 94 of 770 done (12% complete)

Thu Feb 10 18:59:10 [kewang@kewang-fedora]$  oc get co/kube-apiserver;oc get clusterversion
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.21    True        True          False      41m     NodeInstallerProgressing: 1 nodes are at revision 10; 2 nodes are at revision 12
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          8m19s   Working towards 4.11.0-0.nightly-2022-02-10-031822: 95 of 770 done (12% complete)

...

Thu Feb 10 20:15:32 [kewang@kewang-fedora]$ oc get co/kube-apiserver;oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.11.0-0.nightly-2022-02-10-031822   True        False         False      112m    
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        True          79m     Unable to apply 4.11.0-0.nightly-2022-02-10-031822: wait has exceeded 40 minutes for these operators: machine-config

Thu Feb 10 20:15:56 [kewang@kewang-fedora]$ oc get co/machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.9.21    True        True          True       114m    Unable to apply 4.11.0-0.nightly-2022-02-10-031822: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-b41efcb481e1920fbd445f1b6daa2729 expected 9629f56063b846202dea4ac24a722b963296a46f has ce39533a3346da509f49379c537133eea9bb6c06: all 3 nodes are at latest configuration rendered-master-b41efcb481e1920fbd445f1b6daa2729, retrying

Finally, the upgrade got stuck on the machine-config update, which does not matter for this bug verification. I tested the same case with the upgrade paths 4.10 -> 4.11 nightly and 4.11 nightly A -> 4.11 nightly B; both got stuck on the kube-apiserver update as follows:

Thu Feb 10 12:42:56 [kewang@kewang-fedora]$ oc get co/kube-apiserver -w
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       120m    ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.84.193:443: connect: no route to host
kube-apiserver   4.10.0-rc.1   True        False         True       120m    ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.84.193:443: i/o timeout

Thu Feb 10 12:43:27 [kewang@kewang-fedora]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          3m50s   Working towards 4.11.0-0.nightly-2022-02-09-185722: 94 of 770 done (12% complete)

Based on the above: as the 4.11 kube-apiserver operator comes in, the new controller takes a look around, sees the broken webhooks, and (before this fix) sets Degraded=True or (with this fix) stays at Degraded=False. The behavior is as expected, so moving the bug to VERIFIED.

Comment 5 Abu Kashem 2022-02-10 15:32:55 UTC
Moving it back to ASSIGNED since we have decided to rename the condition (remove the Degraded suffix) so that it does not block the upgrade.
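
After that rename, the webhook findings should no longer surface under any *Degraded condition type on the kube-apiserver ClusterOperator; a quick hedged way to list all condition types and statuses for confirmation:

$ oc get co kube-apiserver -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'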

Comment 7 Ke Wang 2022-02-11 11:02:03 UTC
Retested the upgrade from 4.10.0-rc.1 to the latest 4.11 nightly, which includes the PR fix; steps are below.

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         12m     Cluster version is 4.10.0-rc.1

$ oc get co/kube-apiserver  # rc.1 does not include the PR fix, so kube-apiserver stays DEGRADED
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       18m     ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.201.34:443: i/o timeout

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-11-014337 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-11-014337

$ oc get co/kube-apiserver;echo;oc get clusterversion
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        False         True       19m     ValidatingAdmissionWebhookConfigurationDegraded: validationwebhook.validationwebhook.svc: dial tcp 172.30.201.34:443: i/o timeout

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          19s     Working towards 4.11.0-0.nightly-2022-02-11-014337: 6 of 770 done (0% complete)

$ oc get co/kube-apiserver;echo;oc get clusterversion  # As the 4.11 kube-apiserver operator comes in, the upgrade proceeds smoothly without the ValidatingAdmissionWebhookConfigurationDegraded condition shown above
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-rc.1   True        True          False      26m     NodeInstallerProgressing: 2 nodes are at revision 9; 1 nodes are at revision 10

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          7m12s   Working towards 4.11.0-0.nightly-2022-02-11-014337: 95 of 770 done (12% complete)
...
$ oc get co/kube-apiserver;echo;oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.11.0-0.nightly-2022-02-11-014337   True        False         False      42m     

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          23m     Working towards 4.11.0-0.nightly-2022-02-11-014337: 383 of 770 done (49% complete)

$ oc get co/kube-apiserver;echo;oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.11.0-0.nightly-2022-02-11-014337   True        False         False      107m    

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          88m     Working towards 4.11.0-0.nightly-2022-02-11-014337: 648 of 770 done (84% complete)

$ oc get co/kube-apiserver;echo;oc get clusterversion
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.11.0-0.nightly-2022-02-11-014337   True        False         False      124m    

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        True          105m    Unable to apply 4.11.0-0.nightly-2022-02-11-014337: wait has exceeded 40 minutes for these operators: machine-config
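
For completeness, a hedged way to inspect the stalled machine-config side (not part of this bug's verification):

$ oc get machineconfigpools
$ oc get co machine-config -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'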

Finally, the upgrade still got stuck on the machine-config update; it seems to have hit bug 2000937, which I have seen before. Anyway, the broken webhooks are no longer a problem for the upgrade, so moving the bug to VERIFIED.

Comment 9 errata-xmlrpc 2022-08-10 10:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

