Bug 1881155
Summary: | operator install authentication: Authentication requires functional ingress which requires at least one schedulable and ready node | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Petr Muller <pmuller>
Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters>
Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | unspecified | CC: | aiyengar, alukiano, amcdermo, aos-bugs, hongli, mfojtik, sttts, wking
Version: | 4.6 | |
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | operator install authentication
Last Closed: | 2021-02-24 15:18:56 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1893879 | |
Description
Petr Muller
2020-09-21 16:25:38 UTC
Plan to use this bugzilla to improve status reporting for network edge. There's a related jira ticket: https://issues.redhat.com/browse/NE-392.

Target set to next release version while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

Taking a stab at non-QE verification:

1. Install a 4.7 cluster via cluster-bot:

  launch 4.7

2. Confirm the installed version:

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.7.0-0.nightly-2020-10-27-051128

3. Look at our nodes:

  $ oc get nodes
  NAME                                       STATUS   ROLES    AGE   VERSION
  ci-ln-bxhlmm2-f76d1-btg2b-master-0         Ready    master   34m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-master-1         Ready    master   34m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-master-2         Ready    master   34m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-worker-b-qx9t6   Ready    worker   26m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-worker-c-rvqs5   Ready    worker   29m   v1.19.0+e67f5dc
  ci-ln-bxhlmm2-f76d1-btg2b-worker-d-kgvvz   Ready    worker   26m   v1.19.0+e67f5dc

4. Confirm that the two relevant operators are happy:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")' | sort
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:22:13Z ingress Available True AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:22:13Z ingress Progressing False AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:24:26Z ingress Degraded False NoIngressControllersDegraded -
  2020-11-02T19:41:53Z authentication Progressing False AsExpected All is well
  2020-11-02T19:48:29Z authentication Available True AsExpected OAuthServerDeploymentAvailable: availableReplicas==2
  2020-11-02T19:48:29Z authentication Degraded False AsExpected All is well

5. Make all the compute nodes 'infra' [1]:

  $ oc get -o json nodes | jq -r '.items[].metadata.name' | grep worker | while read NODE; do oc label node "${NODE}" node-role.kubernetes.io/infra=; oc label node "${NODE}" node-role.kubernetes.io/worker-; done
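  As an optional sanity check (not part of the original steps), the relabelled nodes can be listed by selecting on the role label that the command above applies:

  # list only the nodes that now carry the infra role label
  $ oc get nodes -l node-role.kubernetes.io/infra=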
6. Check the pods:

  $ oc -n openshift-ingress get pods
  NAME                             READY   STATUS    RESTARTS   AGE
  router-default-d6668cf74-phpg9   1/1     Running   0          34m
  router-default-d6668cf74-rfbc9   1/1     Running   0          34m

  The fact that new pods aren't schedulable yet is ok, as long as the pods stay running.

7. See that auth got mad (bug 1893386), despite the router still being happy:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")' | sort
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:22:13Z ingress Available True AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:22:13Z ingress Progressing False AsExpected desired and current number of IngressControllers are equal
  2020-11-02T19:24:26Z ingress Degraded False NoIngressControllersDegraded -
  2020-11-02T19:41:53Z authentication Progressing False AsExpected All is well
  2020-11-02T19:48:29Z authentication Degraded False AsExpected All is well
  2020-11-02T19:51:34Z authentication Available False ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).

8. Kill the router pods. The node role changes should keep them from being rescheduled:

  $ oc -n openshift-ingress get -o json pods | jq -r '.items[].metadata.name' | while read POD; do oc -n openshift-ingress delete pod "${POD}"; done

9. Check the pods:

  $ oc -n openshift-ingress get pods
  NAME                             READY   STATUS    RESTARTS   AGE
  router-default-d6668cf74-9hxld   0/1     Pending   0          2m36s
  router-default-d6668cf74-jk85x   0/1     Pending   0          4m2s

10. Wait 10 minutes [2]:

  $ sleep 600

11. See that both auth and ingress are mad:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
  2020-11-02T19:57:15Z authentication Degraded True OAuthRouteCheckEndpointAccessibleController_SyncError OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz": dial tcp 35.231.72.50:443: connect: connection refused
  2020-11-02T19:55:15Z authentication Progressing True OAuthVersionRoute_WaitingForRoute OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz" not successfull yet
  2020-11-02T19:51:34Z authentication Available False OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed::ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods). OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz" failed: dial tcp 35.231.72.50:443: connect: connection refused OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-ln-bxhlmm2-f76d1.origin-ci-int-gce.dev.openshift.com/healthz": dial tcp 35.231.72.50:443: connect: connection refused
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:54:48Z ingress Available False IngressUnavailable Not all ingress controllers are available.
  2020-11-02T19:54:48Z ingress Progressing True Reconciling Not all ingress controllers are available.
  2020-11-02T20:05:18Z ingress Degraded True IngressControllersDegraded Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-d6668cf74-jk85x" cannot be scheduled: 0/6 nodes are available: 6 node(s) didn't match node selector. Pod "router-default-d6668cf74-9hxld" cannot be scheduled: 0/6 nodes are available: 6 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)

  You can see the "0/6 nodes are available: 6 node(s) didn't match node selector" in the ingress Degraded=True message.
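  If you only want that Degraded message without the full jq pipeline, a jsonpath filter on the ingress clusteroperator is a minimal alternative (not part of the original steps):

  # print just the message of the ingress clusteroperator's Degraded condition
  $ oc get clusteroperator ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'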
12. Configure the IngressController [3] NodePlacement [4] to allow scheduling on infra nodes:

  $ oc -n openshift-ingress-operator patch ingresscontroller default --type json -p '[{"op": "add", "path": "/spec/nodePlacement", "value": {"nodeSelector": {"matchLabels": {"node-role.kubernetes.io/infra": ""}}}}]'
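  To confirm the patch landed before re-checking the pods, the new node selector can be read back from the IngressController spec (a simple read-back, assuming the patch above succeeded):

  # show the nodePlacement stanza that the patch added
  $ oc -n openshift-ingress-operator get ingresscontroller default -o jsonpath='{.spec.nodePlacement}{"\n"}'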
13. See that the pods are happy again:

  $ oc -n openshift-ingress get pods
  NAME                              READY   STATUS    RESTARTS   AGE
  router-default-65b7fc9b4f-59bw5   1/1     Running   0          23s
  router-default-65b7fc9b4f-pml2d   1/1     Running   0          23s

14. See that auth is still sad (bug 1893386) but that ingress is correctly happy again:

  $ oc get -o json clusteroperators | jq -r '.items[] | .metadata.name as $name | select($name == "authentication" or $name == "ingress").status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")' | sort
  2020-11-02T19:17:15Z authentication Upgradeable True AsExpected All is well
  2020-11-02T19:51:34Z authentication Available False ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).
  2020-11-02T20:10:33Z ingress Available True AsExpected desired and current number of IngressControllers are equal
  2020-11-02T20:10:33Z ingress Degraded False NoIngressControllersDegraded -
  2020-11-02T20:10:33Z ingress Progressing False AsExpected desired and current number of IngressControllers are equal
  2020-11-02T20:10:46Z authentication Degraded False AsExpected All is well
  2020-11-02T20:10:48Z authentication Progressing False AsExpected All is well

[1]: https://github.com/openshift/machine-config-operator/blob/0170e082a8b8228373bd841d17555fff2cfb51b7/docs/custom-pools.md#creating-a-custom-pool
[2]: https://github.com/openshift/cluster-ingress-operator/pull/465/files#diff-56b131774a926e7a0e30a9be7dac7bf5c5cec11ff709aa6604cecc9ef117ede2R360
[3]: https://docs.openshift.com/container-platform/4.6/networking/ingress-operator.html#configuring-ingress-controller
[4]: https://github.com/openshift/api/blob/9252afb032e11093b53406ae80e0acb3410603b2/operator/v1/types_ingress.go#L131-L137

In reference to the attached PR for the ingress status-reporting enhancement: with "4.6.0-0.nightly-2020-11-07-035509" it is noted that clearer deployment-degraded reasoning and the "PodsScheduled" status field are now displayed:

-----
2020-11-09T09:13:25Z ingress Available False IngressUnavailable Not all ingress controllers are available.
2020-11-09T09:13:25Z ingress Progressing True Reconciling Not all ingress controllers are available.
2020-11-09T09:23:26Z ingress Degraded True IngressControllersDegraded Some ingresscontrollers are degraded: ingresscontroller "internalapps" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-internalapps-66bc4c5dc-hjqqn" cannot be scheduled: 0/6 nodes are available: 6 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/1 of replicas are available, max unavailable is 0)
-----

Hence marking as "verified".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633