Bug 1881155
Summary: | operator install authentication: Authentication requires functional ingress which requires at least one schedulable and ready node | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Petr Muller <pmuller>
Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters>
Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | unspecified | CC: | aiyengar, alukiano, amcdermo, aos-bugs, hongli, mfojtik, sttts, wking
Version: | 4.6 | |
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | operator install authentication
Last Closed: | 2021-02-24 15:18:56 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1893879 | |
Description
Petr Muller
2020-09-21 16:25:38 UTC
Plan to use this bugzilla to improve status reporting for network edge. There's a related jira ticket: https://issues.redhat.com/browse/NE-392.

Target set to next release version while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

Taking a stab at non-QE verification:

1. Install a 4.7 cluster via cluster-bot:

  launch 4.7

2. Confirm the installed version:

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.7.0-0.nightly-2020-10-27-051128

3. Look at our nodes:

  $ oc get nodes
  NAME                                       STATUS   ROLES    AGE   VERSION
  ci-ln-bxhlmm2-f76d1-btg2b-master-0         Ready    master   34m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-master-1         Ready    master   34m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-master-2         Ready    master   34m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-worker-b-qx9t6   Ready    worker   26m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-worker-c-rvqs5   Ready    worker   29m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-worker-d-kgvvz   Ready    worker   26m   v1.19.0+e67f5dc

4. Confirm that the two relevant operators are happy:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")' | sort
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:22:13Z ingress Available True AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:22:13Z ingress Progressing False AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:24:26Z ingress Degraded False NoIngressControllersDegraded -
  2020-11-02T19:41:53Z authentication Progressing False AsExpected All is well
  2020-11-02T19:48:29Z authentication Available True AsExpected OAuthServerDeploymentAvailable: availableReplicas==2
  2020-11-02T19:48:29Z authentication Degraded False AsExpected All is well

5. Make all the compute nodes 'infra' [1]:

  $ oc get -o json nodes | jq -r '.items[].metadata.name' | grep worker | while read NODE; do oc label node "${NODE}" node-role.kubernetes.io/infra=; oc label node "${NODE}" node-role.kubernetes.io/worker-; done
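  As an optional sanity check (not part of the original steps), the relabelled nodes can be listed by selecting on the role label that the command above applies:

  # list only the nodes that now carry the infra role label
  $ oc get nodes -l node-role.kubernetes.io/infra=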
6. Check the pods:

  $ oc -n openshift-ingress get pods
  NAME                             READY   STATUS    RESTARTS   AGE
  router-default-d6668cf74-phpg9   1/1     Running   0          34m
  router-default-d6668cf74-rfbc9   1/1     Running   0          34m

  The fact that new pods aren't schedulable yet is ok, as long as the pods stay running.

7. See that auth got mad (bug 1893386), despite the router still being happy:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")' | sort
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:22:13Z ingress Available True AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:22:13Z ingress Progressing False AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:24:26Z ingress Degraded False NoIngressControllersDegraded -
  2020-11-02T19:41:53Z authentication Progressing False AsExpected All is well
  2020-11-02T19:48:29Z authentication Degraded False AsExpected All is well
  2020-11-02T19:51:34Z authentication Available False ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).

8. Kill the router pods. The node role changes should keep them from being rescheduled:

  $ oc -n openshift-ingress get -o json pods | jq -r '.items[].metadata.name' | while read POD; do oc -n openshift-ingress delete pod "${POD}"; done

9. Check the pods:

  $ oc -n openshift-ingress get pods
  NAME                             READY   STATUS    RESTARTS   AGE
  router-default-d6668cf74-9hxld   0/1     Pending   0          2m36s
  router-default-d6668cf74-jk85x   0/1     Pending   0          4m2s

10. Wait 10 minutes [2]:

  $ sleep 600

11. See that both auth and ingress are mad:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
  2020-11-02T19:57:15Z authentication Degraded True OAuthRouteCheckEndpointAccessibleController_SyncError OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz": dial tcp 35.231.72.50:443: connect: connection refused
  2020-11-02T19:55:15Z authentication Progressing True OAuthVersionRoute_WaitingForRoute OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz" not successfull yet
  2020-11-02T19:51:34Z authentication Available False OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed::ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods). OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz" failed: dial tcp 35.231.72.50:443: connect: connection refused OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz": dial tcp 35.231.72.50:443: connect: connection refused
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:54:48Z ingress Available False IngressUnavailable Not all ingress controllers are available.
  2020-11-02T19:54:48Z ingress Progressing True Reconciling Not all ingress controllers are available.
  2020-11-02T20:05:18Z ingress Degraded True IngressControllersDegraded Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-d6668cf74-jk85x" cannot be scheduled: 0/6 nodes are available: 6 node(s) didn't match node selector. Pod "router-default-d6668cf74-9hxld" cannot be scheduled: 0/6 nodes are available: 6 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)

  You can see the "0/6 nodes are available: 6 node(s) didn't match node selector" in the ingress Degraded=True message.
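  If you only want that Degraded message without the full jq pipeline, a jsonpath filter on the ingress clusteroperator is a minimal alternative (not part of the original steps):

  # print just the message of the ingress clusteroperator's Degraded condition
  $ oc get clusteroperator ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'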
12. Configure the IngressController [3] NodePlacement [4] to allow scheduling on infra nodes:

  $ oc -n openshift-ingress-operator patch ingresscontroller default --type json -p '[{"op": "add", "path": "/spec/nodePlacement", "value": {"nodeSelector": {"matchLabels": {"node-role.kubernetes.io/infra": ""}}}}]'
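  To confirm the patch landed before re-checking the pods, the new node selector can be read back from the IngressController spec (a simple read-back, assuming the patch above succeeded):

  # show the nodePlacement stanza that the patch added
  $ oc -n openshift-ingress-operator get ingresscontroller default -o jsonpath='{.spec.nodePlacement}{"\n"}'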
13. See that the pods are happy again:

  $ oc -n openshift-ingress get pods
  NAME                              READY   STATUS    RESTARTS   AGE
  router-default-65b7fc9b4f-59bw5   1/1     Running   0          23s
  router-default-65b7fc9b4f-pml2d   1/1     Running   0          23s

14. See that auth is still sad (bug 1893386) but that ingress is correctly happy again:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")' | sort
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:51:34Z authentication Available False ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).
  2020-11-02T20:10:33Z ingress Available True AsExpected desired and current number of IngressControllers are equal
  2020-11-02T20:10:33Z ingress Degraded False NoIngressControllersDegraded -
  2020-11-02T20:10:33Z ingress Progressing False AsExpected desired and current number of IngressControllers are equal
  2020-11-02T20:10:46Z authentication Degraded False AsExpected All is well
  2020-11-02T20:10:48Z authentication Progressing False AsExpected All is well

[1]: https://github.com/openshift/machine-config-operator/blob/0170e082a8b8228373bd841d17555fff2cfb51b7/docs/custom-pools.md#creating-a-custom-pool
[2]: https://github.com/openshift/cluster-ingress-operator/pull/465/files#diff-56b131774a926e7a0e30a9be7dac7bf5c5cec11ff709aa6604cecc9ef117ede2R360
[3]: https://docs.openshift.com/container-platform/4.6/networking/ingress-operator.html#configuring-ingress-controller
[4]: https://github.com/openshift/api/blob/9252afb032e11093b53406ae80e0acb3410603b2/operator/v1/types_ingress.go#L131-L137

In reference to the attached PR for the ingress status-reporting enhancement: with "4.6.0-0.nightly-2020-11-07-035509" it is noted that clearer deployment-degraded reasoning and the "PodsScheduled" status field are now displayed:

-----
2020-11-09T09:13:25Z ingress Available False IngressUnavailable Not all ingress controllers are available.
2020-11-09T09:13:25Z ingress Progressing True Reconciling Not all ingress controllers are available.
2020-11-09T09:23:26Z ingress Degraded True IngressControllersDegraded Some ingresscontrollers are degraded: ingresscontroller "internalapps" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-internalapps-66bc4c5dc-hjqqn" cannot be scheduled: 0/6 nodes are available: 6 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/1 of replicas are available, max unavailable is 0)
-----

Hence marking as "verified".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633