Please provide a must-gather.
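For reference, the usual collection steps look roughly like the following (the destination directory and archive name are just examples):

```
# Collect diagnostic data from the cluster and archive it for attachment to this bug.
oc adm must-gather --dest-dir=./must-gather
tar -czf must-gather.tar.gz ./must-gather
```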
@Standa The web console depends on the auth operator, which is why the first visible impact of this issue is the console (and authentication) becoming degraded. I added those cases to this BZ because I am involved in all of them and can say with confidence that they share this same root cause. I also found a workaround, described below:

```
When this issue happens, delete all ingresscontrollers other than the default one; after a few seconds the authentication operator becomes healthy again (because only the default router is left to validate the oauth URL, which no longer generates the error). Once the authentication operator is healthy again, recreate the other ingresscontrollers.
```

The clusters in all 3 of these cases have been restored using this workaround, but it is only a workaround; the underlying issue needs to be fixed. It does seem related to those secrets, but to me this is clearly a bug: the secrets are generated automatically by the operator, and it appears to generate them incorrectly when multiple ingresscontrollers are in place.

Regards,
Giovanni Fontana
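A minimal sketch of that workaround with `oc`, assuming the extra ingresscontroller is named `sharded` and that its manifest was saved before deleting it:

```
# Show all ingresscontrollers; everything except "default" gets removed temporarily.
oc get ingresscontroller -n openshift-ingress-operator

# Delete the extra ingresscontroller ("sharded" is just an example name).
oc delete ingresscontroller sharded -n openshift-ingress-operator

# Watch the authentication clusteroperator until DEGRADED goes back to False.
oc get clusteroperator authentication -w

# Recreate the extra ingresscontroller from the previously saved manifest
# (hypothetical file name).
oc apply -f sharded-ingresscontroller.yaml
```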
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
Dear all, I managed to reproduce this issue in a lab environment with OCP 4.6.6. It is not difficult to reproduce; the steps are below:

1. Start with a functional OCP IPI cluster over VMware or OpenStack (but I suppose it happens on all on-premises clusters - not cloud).

2. Create a new machineset for a new machine that will host a new ingresscontroller, e.g.:

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <cluster-id>
  name: <cluster-id>-sharding
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <cluster-id>
      machine.openshift.io/cluster-api-machineset: <cluster-id>-sharding
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <cluster-id>
        machine.openshift.io/cluster-api-machine-role: sharding
        machine.openshift.io/cluster-api-machine-type: sharding
        machine.openshift.io/cluster-api-machineset: <cluster-id>-sharding
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/sharding: ""
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 2
          numCoresPerSocket: 1
          snapshot: ""
          template: ocp-bsrnx-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: <vmware-dc>
            datastore: <vmware-ds>
            folder: <vmware-folder>
            resourcePool: <vmware-pool>
            server: <vcenter>
```

3. Wait for the new node to come up:

```
[root@bastion ~]# oc get nodes
NAME                       STATUS   ROLES             AGE   VERSION
ocp-bsrnx-master-0         Ready    master            22h   v1.19.0+43983cd
ocp-bsrnx-master-1         Ready    master            22h   v1.19.0+43983cd
ocp-bsrnx-master-2         Ready    master            22h   v1.19.0+43983cd
ocp-bsrnx-sharding-5w6mm   Ready    sharding,worker   22m   v1.19.0+43983cd
ocp-bsrnx-worker-t77rv     Ready    worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-xd5dc     Ready    worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-zqr9z     Ready    worker            22h   v1.19.0+43983cd
```

4. Create a new ingress controller using another domain, e.g.:

```
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: sharded
    namespace: openshift-ingress-operator
  spec:
    domain: hm.ocp.rhbr-consulting.com
    replicas: 1
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/sharding: ""
    routeSelector:
      matchLabels:
        type: sharded
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

5. Wait for the new ingress controller to be ready:

```
[root@bastion ~]# oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
router-default-6c4db9d57b-jvpbf   1/1     Running   0          22h   10.168.255.18   ocp-bsrnx-worker-zqr9z     <none>           <none>
router-default-6c4db9d57b-ph4rk   1/1     Running   0          21h   10.168.255.30   ocp-bsrnx-worker-xd5dc     <none>           <none>
router-sharded-5d57694db4-65p4d   1/1     Running   0          32m   10.168.255.39   ocp-bsrnx-sharding-5w6mm   <none>           <none>
```

6. Now, SHUTDOWN GRACEFULLY ALL THE WORKER NODES and wait for them to become NotReady:

```
[root@bastion ~]# oc get nodes
NAME                       STATUS     ROLES             AGE   VERSION
ocp-bsrnx-master-0         Ready      master            22h   v1.19.0+43983cd
ocp-bsrnx-master-1         Ready      master            22h   v1.19.0+43983cd
ocp-bsrnx-master-2         Ready      master            22h   v1.19.0+43983cd
ocp-bsrnx-sharding-5w6mm   NotReady   sharding,worker   37m   v1.19.0+43983cd
ocp-bsrnx-worker-t77rv     NotReady   worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-xd5dc     NotReady   worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-zqr9z     NotReady   worker            22h   v1.19.0+43983cd
```
7. Start the workers again and wait for them to be Ready again:

```
[root@bastion ~]# oc get nodes
NAME                       STATUS   ROLES             AGE   VERSION
ocp-bsrnx-master-0         Ready    master            23h   v1.19.0+43983cd
ocp-bsrnx-master-1         Ready    master            23h   v1.19.0+43983cd
ocp-bsrnx-master-2         Ready    master            23h   v1.19.0+43983cd
ocp-bsrnx-sharding-5w6mm   Ready    sharding,worker   93m   v1.19.0+43983cd
ocp-bsrnx-worker-t77rv     Ready    worker            22h   v1.19.0+43983cd
ocp-bsrnx-worker-xd5dc     Ready    worker            22h   v1.19.0+43983cd
ocp-bsrnx-worker-zqr9z     Ready    worker            23h   v1.19.0+43983cd
```

8. Restarting all the workers forces the authentication operator to reconcile again (now with 2 ingress controllers in place), and you will see the error (x509: certificate is valid for *.hm.ocp.rhbr-consulting.com, not oauth-openshift.apps.ocp.rhbr-consulting.com), as shown below:

```
[root@bastion ~]# oc get co
NAME                      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication            4.6.6     False       True          True       58m
cloud-credential          4.6.6     True        False         False      23h
cluster-autoscaler        4.6.6     True        False         False      23h
config-operator           4.6.6     True        False         False      23h
console                   4.6.6     True        False         True       23h
csi-snapshot-controller   4.6.6     True        False         False      23h
dns                       4.6.6     True        False         False      23h
(...)

[root@bastion ~]# oc describe co authentication
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-12-02T19:33:53Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-12-02T19:33:53Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         authentication-operator
    Operation:       Update
    Time:            2020-12-03T18:17:37Z
  Resource Version:  416326
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:               c2bb4606-bfaa-4331-b26b-2742ffdc63c3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-12-03T18:15:47Z
    Message:               OAuthRouteCheckEndpointAccessibleControllerDegraded: "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" returned "503 Service Unavailable"
    Reason:                OAuthRouteCheckEndpointAccessibleController_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-03T18:14:27Z
    Message:               OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" not successfull yet
    Reason:                OAuthVersionRoute_WaitingForRoute
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-12-03T18:13:47Z
    Message:               OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" failed: x509: certificate is valid for *.hm.ocp.rhbr-consulting.com, not oauth-openshift.apps.ocp.rhbr-consulting.com
                           OAuthRouteCheckEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" returned "503 Service Unavailable"
    Reason:                OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-12-02T19:38:41Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
(...)
```
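One way to confirm the certificate mismatch from step 8 directly (a hedged sketch, assuming openssl is available on the bastion and the oauth route hostname resolves to the ingress VIP):

```
# Ask the endpoint which certificate it presents for the oauth route hostname.
OAUTH_HOST=oauth-openshift.apps.ocp.rhbr-consulting.com
echo | openssl s_client -connect "${OAUTH_HOST}:443" -servername "${OAUTH_HOST}" 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# When the bug is present, the SAN shows *.hm.ocp.rhbr-consulting.com
# instead of *.apps.ocp.rhbr-consulting.com.
```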
> The + signs were the tip. For some reason the authentication operator generates a certificate for every route shard of the cluster and it modifies the configmap to use all the certificates. I don't know if this is the expected behavior or a problem.

That's expected: the CAO considers all the domains present in the router-certs secret for its SNI configuration. It's up to the Router folks to find out why the required domain does not appear in the secret.
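To see which domains are actually published, one can dump the keys of the router-certs secret the CAO reads (a sketch assuming the secret lives in openshift-config-managed and is keyed by ingress domain):

```
# List the domain keys present in the router-certs secret.
oc get secret router-certs -n openshift-config-managed \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
# The default *.apps domain should be listed; if only the sharded domain appears,
# the oauth route will be served with the wrong certificate.
```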
As there is no fix that the net-edge team can supply here, I'm moving this to the installer component initially, as that is where (or possibly in machine-config?) we would need an option allowing a node to opt out of running keepalived. (Assuming that's feasible from the installer's perspective.) I'm also dropping the severity from urgent to high, as we have a confirmed workaround (see comment #30).
The installer team has not been involved in the keepalived unit; it seems to have been created and maintained by the KNI team.
*** Bug 1936675 has been marked as a duplicate of this bug. ***
To avoid the workaround breaking MCO, it needs to be applied with a machine-config. The one below should prevent keepalived from running on the specified nodes (make sure to change the role label appropriately). Note that applying this will trigger a reboot, like most machine-configs.

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-disable-keepalived
  labels:
    # Replace this with a selector that will apply only to the other ingress nodes
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:,
          verification: {}
        filesystem: root
        mode: 420
        path: /etc/kubernetes/manifests/keepalived.yaml
```
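A minimal sketch of applying and verifying the MachineConfig above (the file name and node name are examples); as noted, nodes in the targeted pool will reboot:

```
# Apply the MachineConfig and watch the pool roll it out.
oc apply -f 50-disable-keepalived.yaml
oc get machineconfigpool worker -w

# After the pool reports UPDATED=True, confirm the keepalived static pod manifest
# was replaced with an empty file on an affected node.
oc debug node/<node-name> -- chroot /host cat /etc/kubernetes/manifests/keepalived.yaml
```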
Verified on 4.8.0-rc.0.

3 masters + 3 workers cluster deployed.

# extra ingress controller applied:
```
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: hm.ocp.rhbr-consulting.com
  replicas: 1
```

# check that the new ingress controller is started on the 3rd worker (default has 2 replicas):
```
[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP                   NODE                                              NOMINATED NODE   READINESS GATES
router-default-5b864b9594-5v8rd   1/1     Running   0          13h   fd2e:6f44:5dd8::6c   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
router-default-5b864b9594-9lx82   1/1     Running   0          13h   fd2e:6f44:5dd8::96   worker-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
router-sharded-7c6f7b55d7-dss4l   1/1     Running   0          6s    fd2e:6f44:5dd8::63   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>

[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep authentication
authentication   4.8.0-rc.0   True   False   False   12h
```

# find ingressVIP:
```
[kni@provisionhost-0-0 ~]$ cat install-config.yaml | grep ingressVIP
ingressVIP: fd2e:6f44:5dd8::a
```

# find the worker that holds the ingressVIP:
```
[core@worker-0-2 ~]$ ip a | grep fd2e:6f44:5dd8::a
    inet6 fd2e:6f44:5dd8::a/128 scope global nodad deprecated noprefixroute
```

# restart the worker with the ingressVIP and check where it moves:
```
[core@worker-0-0 ~]$ ip a | grep fd2e:6f44:5dd8::a
    inet6 fd2e:6f44:5dd8::a/128 scope global nodad deprecated noprefixroute
```

The ingressVIP moves to a worker that hosts the default ingress controller, not the extra one. Tested a few times, and the authentication operator does not become degraded:

```
[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep authentication
authentication   4.8.0-rc.0   True   False   False   13h
```
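For repeated checks of which worker currently holds the ingressVIP, a small helper loop like the following can be used (a sketch assuming SSH access to the worker nodes as the core user and the VIP value from install-config.yaml):

```
# Print the worker that currently announces the ingressVIP.
VIP="fd2e:6f44:5dd8::a"
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name | cut -d/ -f2); do
  if ssh core@"${node}" ip -6 addr show | grep -q "${VIP}"; then
    echo "ingressVIP is held by ${node}"
  fi
done
```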
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438