Bug 1886572
| Summary: | auth: error contacting auth provider when extra ingress (not default) goes down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Santiago Maudet <smaudet> |
| Component: | Networking | Assignee: | Yossi Boaron <yboaron> |
| Networking sub component: | runtime-cfg | QA Contact: | Victor Voronkov <vvoronko> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | amcdermo, aos-bugs, beth.white, bnemec, brad, cdepaula, copan, dsilvaju, ecavalca, gfontana, gscheffe, hpokorny, jcoscia, mapandey, mbooth, mfojtik, mnunes, openshift-bugs-escalate, rcunha, rkshirsa, rpittau, sdasu, sgreene, slaznick, wlewis, yboaron, zmaciel |
| Version: | 4.5 | Keywords: | Triaged |
| Target Milestone: | --- | Flags: | rpittau: needinfo-, rcunha: needinfo-, rpittau: needinfo-, rpittau: needinfo- |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1988102 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:33:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
Standa Laznicka
2020-10-09 13:21:10 UTC
@Standa The web console depends on the auth operator, so the first impact seen from this issue is the console becoming degraded (and authentication as well). I added those cases to this BZ because I am involved in all of them and can say with confidence that they are related to the same root cause. I also found a workaround, described below:

```
When this issue happens, delete all ingresscontrollers other than the default; after a few seconds the authentication operator becomes healthy again (only the default router is left to validate the OAuth URL, which no longer generates the error). Once the authentication operator is healthy, recreate the other ingresscontrollers.
```

The clusters in all three cases were restored using this workaround. However, it is only a workaround; the underlying issue needs to be fixed. I also suspect it is related to those secrets, but to me it is clearly a bug: the secrets are generated automatically by the operator, and it appears to generate them incorrectly when multiple ingresscontrollers are in place.

Regards,
Giovanni Fontana

Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

Dear,
I managed to reproduce this issue in a lab environment with OCP 4.6.6. It is not difficult to reproduce; the steps are below:
1. Start with a functional OCP IPI cluster on VMware or OpenStack (it presumably happens on any on-premises cluster, not cloud).
2. Create a new machineset for a new machine that will host the new ingresscontroller, e.g.:
```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <cluster-id>
  name: <cluster-id>-sharding
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <cluster-id>
      machine.openshift.io/cluster-api-machineset: <cluster-id>-sharding
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <cluster-id>
        machine.openshift.io/cluster-api-machine-role: sharding
        machine.openshift.io/cluster-api-machine-type: sharding
        machine.openshift.io/cluster-api-machineset: <cluster-id>-sharding
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/sharding: ""
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 2
          numCoresPerSocket: 1
          snapshot: ""
          template: ocp-bsrnx-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: <vmware-dc>
            datastore: <vmware-ds>
            folder: <vmware-folder>
            resourcePool: <vmware-pool>
            server: <vcenter>
```
3. Wait for the new node to come up:
```
[root@bastion ~]# oc get nodes
NAME                       STATUS   ROLES             AGE   VERSION
ocp-bsrnx-master-0         Ready    master            22h   v1.19.0+43983cd
ocp-bsrnx-master-1         Ready    master            22h   v1.19.0+43983cd
ocp-bsrnx-master-2         Ready    master            22h   v1.19.0+43983cd
ocp-bsrnx-sharding-5w6mm   Ready    sharding,worker   22m   v1.19.0+43983cd
ocp-bsrnx-worker-t77rv     Ready    worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-xd5dc     Ready    worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-zqr9z     Ready    worker            22h   v1.19.0+43983cd
```
4. Create a new ingress controller using another domain, e.g.:
```yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: sharded
    namespace: openshift-ingress-operator
  spec:
    domain: hm.ocp.rhbr-consulting.com
    replicas: 1
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/sharding: ""
    routeSelector:
      matchLabels:
        type: sharded
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
5. Wait for the new ingress controller to be ready:
```
[root@bastion ~]# oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
router-default-6c4db9d57b-jvpbf   1/1     Running   0          22h   10.168.255.18   ocp-bsrnx-worker-zqr9z     <none>           <none>
router-default-6c4db9d57b-ph4rk   1/1     Running   0          21h   10.168.255.30   ocp-bsrnx-worker-xd5dc     <none>           <none>
router-sharded-5d57694db4-65p4d   1/1     Running   0          32m   10.168.255.39   ocp-bsrnx-sharding-5w6mm   <none>           <none>
```
6. Now, GRACEFULLY SHUT DOWN ALL THE WORKER NODES and wait for them to become NotReady:
```
[root@bastion ~]# oc get nodes
NAME                       STATUS     ROLES             AGE   VERSION
ocp-bsrnx-master-0         Ready      master            22h   v1.19.0+43983cd
ocp-bsrnx-master-1         Ready      master            22h   v1.19.0+43983cd
ocp-bsrnx-master-2         Ready      master            22h   v1.19.0+43983cd
ocp-bsrnx-sharding-5w6mm   NotReady   sharding,worker   37m   v1.19.0+43983cd
ocp-bsrnx-worker-t77rv     NotReady   worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-xd5dc     NotReady   worker            21h   v1.19.0+43983cd
ocp-bsrnx-worker-zqr9z     NotReady   worker            22h   v1.19.0+43983cd
```
7. Start the workers again and wait for them to be Ready again:
```
[root@bastion ~]# oc get nodes
NAME                       STATUS   ROLES             AGE   VERSION
ocp-bsrnx-master-0         Ready    master            23h   v1.19.0+43983cd
ocp-bsrnx-master-1         Ready    master            23h   v1.19.0+43983cd
ocp-bsrnx-master-2         Ready    master            23h   v1.19.0+43983cd
ocp-bsrnx-sharding-5w6mm   Ready    sharding,worker   93m   v1.19.0+43983cd
ocp-bsrnx-worker-t77rv     Ready    worker            22h   v1.19.0+43983cd
ocp-bsrnx-worker-xd5dc     Ready    worker            22h   v1.19.0+43983cd
ocp-bsrnx-worker-zqr9z     Ready    worker            23h   v1.19.0+43983cd
```
8. Restarting all the workers forces the authentication operator to reconcile again (now with two ingress controllers in place), and the error appears (x509: certificate is valid for *.hm.ocp.rhbr-consulting.com, not oauth-openshift.apps.ocp.rhbr-consulting.com), as you can see below:
```
[root@bastion ~]# oc get co
NAME                      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication            4.6.6     False       True          True       58m
cloud-credential          4.6.6     True        False         False      23h
cluster-autoscaler        4.6.6     True        False         False      23h
config-operator           4.6.6     True        False         False      23h
console                   4.6.6     True        False         True       23h
csi-snapshot-controller   4.6.6     True        False         False      23h
dns                       4.6.6     True        False         False      23h
(...)
```
```
[root@bastion ~]# oc describe co authentication
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-12-02T19:33:53Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-12-02T19:33:53Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         authentication-operator
    Operation:       Update
    Time:            2020-12-03T18:17:37Z
  Resource Version:  416326
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:               c2bb4606-bfaa-4331-b26b-2742ffdc63c3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-12-03T18:15:47Z
    Message:               OAuthRouteCheckEndpointAccessibleControllerDegraded: "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" returned "503 Service Unavailable"
    Reason:                OAuthRouteCheckEndpointAccessibleController_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-03T18:14:27Z
    Message:               OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" not successfull yet
    Reason:                OAuthVersionRoute_WaitingForRoute
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-12-03T18:13:47Z
    Message:               OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" failed: x509: certificate is valid for *.hm.ocp.rhbr-consulting.com, not oauth-openshift.apps.ocp.rhbr-consulting.com
                           OAuthRouteCheckEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.ocp.rhbr-consulting.com/healthz" returned "503 Service Unavailable"
    Reason:                OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-12-02T19:38:41Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
(...)
```
> The + signals were the tip. For some reason the authentication operator generates a certificate for every route shard of the cluster and modifies the configmap to use all the certificates; I don't know if this is the expected behavior or a problem.
That's expected: the CAO considers all the domains present in the router-certs secret for its SNI configuration. It's up to the Router folks to find out why the required domain does not appear in the secret.
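The x509 error above follows from standard TLS wildcard rules: a certificate for `*.hm.ocp.rhbr-consulting.com` covers exactly one additional leftmost DNS label under that suffix, so it can never serve `oauth-openshift.apps.ocp.rhbr-consulting.com`. A minimal sketch of that matching rule (the `wildcard_matches` helper is hypothetical, for illustration only):

```shell
#!/usr/bin/env bash
# Sketch of TLS wildcard-certificate name matching (RFC 6125 style):
# "*.example.com" matches exactly one additional leftmost label.
# wildcard_matches is a hypothetical helper, not part of OpenShift.
wildcard_matches() {
  local pattern="$1" host="$2"
  [[ "$pattern" == \*.* ]] || return 1   # only handle leading-wildcard patterns
  local suffix="${pattern#\*.}"          # e.g. "apps.ocp.rhbr-consulting.com"
  local rest="${host#*.}"                # host minus its leftmost label
  [[ "$host" != "$rest" && "$rest" == "$suffix" ]]
}

# The default router's wildcard cert covers the OAuth route...
wildcard_matches '*.apps.ocp.rhbr-consulting.com' 'oauth-openshift.apps.ocp.rhbr-consulting.com' \
  && echo "default cert matches the oauth route"

# ...but the sharded router's cert cannot, hence the x509 error when the
# health check lands on the wrong router.
wildcard_matches '*.hm.ocp.rhbr-consulting.com' 'oauth-openshift.apps.ocp.rhbr-consulting.com' \
  || echo "sharded cert does not match the oauth route"
```

This is why the failure only shows up when the ingressVIP lands on a node running a non-default router: the serving certificate presented there simply cannot cover the `oauth-openshift.apps.*` hostname.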
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

As there is no fix that the net-edge team can supply here, I am moving this initially to the installer component, as it is there (or as part of the machine-config?) that we would need an option allowing a node to opt out of running keepalived. (Assuming that's feasible from the installer's perspective.) I'm also dropping the severity from urgent to high, as we have a confirmed workaround (see comment #30).

The installer team has not been involved in the keepalived unit. It seems to have been created and maintained by the kni team.

*** Bug 1936675 has been marked as a duplicate of this bug. ***

To avoid the workaround breaking the MCO, it needs to be applied with a machine-config. The one below should prevent keepalived from running on the specified nodes (make sure to change the role label appropriately). Note that applying this will trigger a reboot, like most machine-configs.
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-disable-keepalived
  labels:
    # Replace this with a selector that will apply only to the other ingress nodes
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:,
          verification: {}
        filesystem: root
        mode: 420
        path: /etc/kubernetes/manifests/keepalived.yaml
```
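The `source: data:,` line is what does the work: Ignition file contents are data URLs, and `data:,` encodes an empty body, so the machine-config overwrites `/etc/kubernetes/manifests/keepalived.yaml` with an empty static-pod manifest and kubelet has nothing to start. A small sketch of that decoding (the `decode_data_url` helper is hypothetical, for illustration only):

```shell
#!/usr/bin/env bash
# Sketch: Ignition file contents are data URLs (RFC 2397). "data:," carries an
# empty body, which is how the machine-config above blanks the keepalived
# static-pod manifest. decode_data_url is a hypothetical helper.
decode_data_url() {
  local url="${1#data:}"     # strip the scheme
  local body="${url#*,}"     # everything after the first comma
  if [[ "${url%%,*}" == *";base64" ]]; then
    printf '%s' "$body" | base64 -d
  else
    printf '%s' "$body"      # plain text (this sketch skips %XX decoding)
  fi
}

# "data:," decodes to nothing at all: the manifest file ends up empty.
[[ -z "$(decode_data_url 'data:,')" ]] && echo "keepalived manifest becomes empty"

# For contrast, a base64 data URL with an actual body:
decode_data_url 'data:;base64,aGVsbG8='
```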
Verified on 4.8.0-rc.0 with a cluster of 3 masters + 3 workers deployed.
# extra ingress controller applied:
```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: hm.ocp.rhbr-consulting.com
  replicas: 1
```
# check that the new ingress controller is started on the 3rd worker (default has 2 replicas)
```
[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP                   NODE                                              NOMINATED NODE   READINESS GATES
router-default-5b864b9594-5v8rd   1/1     Running   0          13h   fd2e:6f44:5dd8::6c   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
router-default-5b864b9594-9lx82   1/1     Running   0          13h   fd2e:6f44:5dd8::96   worker-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
router-sharded-7c6f7b55d7-dss4l   1/1     Running   0          6s    fd2e:6f44:5dd8::63   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep authentication
authentication   4.8.0-rc.0   True   False   False   12h
```
# find the ingressVIP
```
[kni@provisionhost-0-0 ~]$ cat install-config.yaml | grep ingressVIP
ingressVIP: fd2e:6f44:5dd8::a
```
# find the worker that holds the ingressVIP:
```
[core@worker-0-2 ~]$ ip a | grep fd2e:6f44:5dd8::a
    inet6 fd2e:6f44:5dd8::a/128 scope global nodad deprecated noprefixroute
```
# restart the worker with the ingressVIP and check where it moves:
```
[core@worker-0-0 ~]$ ip a | grep fd2e:6f44:5dd8::a
    inet6 fd2e:6f44:5dd8::a/128 scope global nodad deprecated noprefixroute
```
The ingressVIP moves to a worker that runs the default ingress controller, not the extra one. Tested a few times; the authentication operator does not become degraded:
```
[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep authentication
authentication   4.8.0-rc.0   True   False   False   13h
```
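For clusters still on affected versions, the workaround from comment 2 (delete the non-default ingresscontrollers, wait for the authentication operator to recover, then recreate them) can be sketched as below. The `sharded` name and the backup file name are examples, and the `oc()` wrapper only echoes each command so the sketch is safe to read and run outside a cluster; remove the wrapper to execute it for real:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Sketch of the ingresscontroller workaround. "sharded" is an example name;
# substitute every non-default ingresscontroller in the cluster. The oc()
# wrapper below only prints each command -- remove it to run against a cluster.
oc() { echo "oc $*"; }

# 1. Back up the extra ingresscontroller so it can be recreated later.
oc get ingresscontroller sharded -n openshift-ingress-operator -o yaml > sharded-backup.yaml

# 2. Delete every ingresscontroller other than "default".
oc delete ingresscontroller sharded -n openshift-ingress-operator

# 3. Wait until the authentication operator reports Available again.
oc wait clusteroperator/authentication --for=condition=Available=True --timeout=300s

# 4. Recreate the extra ingresscontroller from the backup.
oc apply -f sharded-backup.yaml
```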
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438