Bug 1827152
| Summary: | e2e-gcp-serial 4.5 setup fails with IngressStateEndpointsDegraded: No endpoints found for oauth-server | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomas Smetana <tsmetana> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | CC: | aconstan, aos-bugs, bbennett, mmasters |
| Priority: | unspecified | Version: | 4.5 |
| Target Milestone: | --- | Target Release: | 4.5.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | SDN-CI-IMPACT | Fixed In Version: | |
| Doc Type: | If docs needed, set a value | Doc Text: | |
| Story Points: | --- | Clone Of: | |
| : | 1829584 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-20 15:52:41 UTC | Type: | Bug |
| Regression: | --- | Embargoed: | |
Description
Tomas Smetana
2020-04-23 11:10:58 UTC
The ingress clusteroperator from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625/artifacts/e2e-gcp-serial/clusteroperators.json has the following status conditions:

    "conditions": [
      {
        "lastTransitionTime": "2020-04-23T08:23:40Z",
        "message": "Not all ingress controllers are available.",
        "reason": "IngressUnavailable",
        "status": "False",
        "type": "Available"
      },
      {
        "lastTransitionTime": "2020-04-23T08:23:40Z",
        "message": "Not all ingress controllers are available.",
        "reason": "Reconciling",
        "status": "True",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2020-04-23T08:24:18Z",
        "message": "Some ingresscontrollers are degraded: default",
        "reason": "IngressControllersDegraded",
        "status": "True",
        "type": "Degraded"
      }
    ],

However, the ingress clusteroperator from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625/artifacts/e2e-gcp-serial/must-gather.tar has the following status conditions:

    conditions:
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      message: desired and current number of IngressControllers are equal
      status: "True"
      type: Available
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      message: desired and current number of IngressControllers are equal
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      reason: NoIngressControllersDegraded
      status: "False"
      type: Degraded

From events in the openshift-ingress-operator namespace, it appears the ingress operator starts around 8:23:

    LAST SEEN              MESSAGE
    2020-04-23T08:23:15Z   Created pod: ingress-operator-5ff5b94496-x6pzf
    ...
    2020-04-23T08:23:35Z   Created container ingress-operator

The operator creates the ingress controller in the openshift-ingress namespace less than a minute later, but the events in the openshift-ingress namespace show that it takes a while for the worker nodes to become ready and for networking to become ready:

    LAST SEEN              MESSAGE
    2020-04-23T08:25:06Z   0/3 nodes are available: 3 node(s) didn't match node selector.
    ...
    2020-04-23T08:27:02Z   Successfully assigned openshift-ingress/router-default-698dd6776c-gpjmr to ci-op-6grzl-w-c-w5wnp.c.openshift-gce-devel-ci.internal
    ...
    2020-04-23T08:27:19Z   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(eed02b40643951983fb47c3b3d08d550e99d446d51af6f9c7e7c68e03866e7f8): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    ...
    2020-04-23T08:29:56Z   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(03409525bd0e4c6d41c7cb7064ae80f20705bdb56be87ec77beed87a25137134): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    ...
    2020-04-23T08:47:19Z   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(f563f80aff0cf9498ea9ecb629efbeb59e6920b5e396fb4a5b32e35003b0118c): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    2020-04-23T08:47:33Z   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-7t8wc_openshift-ingress_80a3910e-93fb-4834-91b1-e98dcfb843a2_0(9af524fa2882d14245ef36c35697150c0f6e8ce920b8c1f8dd124354774510b2): Multus: [openshift-ingress/router-default-698dd6776c-7t8wc]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    2020-04-23T08:50:28Z   Add eth0 [10.131.0.88/23]
    2020-04-23T08:50:37Z   Add eth0 [10.128.2.77/23]
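The CNI request fails with a 400 because no NetNamespace exists yet for openshift-ingress: OpenShift SDN keys each namespace's VNID off a NetNamespace object that sdn-controller is expected to create, and pod sandbox creation in that namespace cannot proceed until it does. As a rough illustration of the dependency (a minimal sketch using the OpenShift network clientset, not the actual openshift-sdn code; lookupNetID is a hypothetical helper):

    package main

    import (
    	"context"
    	"fmt"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/tools/clientcmd"

    	networkclient "github.com/openshift/client-go/network/clientset/versioned"
    )

    // lookupNetID resolves a namespace's VNID by reading the NetNamespace
    // object that sdn-controller is expected to have created for it. If the
    // object is missing, there is no netid to hand to the CNI plugin.
    func lookupNetID(client networkclient.Interface, namespace string) (uint32, error) {
    	netns, err := client.NetworkV1().NetNamespaces().Get(context.TODO(), namespace, metav1.GetOptions{})
    	if err != nil {
    		return 0, fmt.Errorf("failed to find netid for namespace %s: %w", namespace, err)
    	}
    	return netns.NetID, nil
    }

    func main() {
    	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client := networkclient.NewForConfigOrDie(config)

    	// In this run the equivalent lookup kept returning NotFound from
    	// roughly 08:23 until 08:50, which is what the CNI requests surface
    	// as the repeated status 400 errors above.
    	netid, err := lookupNetID(client, "openshift-ingress")
    	fmt.Println(netid, err)
    }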
One thing that confuses me is that the ingress operator has logs for the period between 8:29 and 8:50. Some events are combined, so we cannot be sure, but I suspect that the Multus error was recurring during that period even though the ingress controller pod was somehow up and running; perhaps Multus was causing problems for the running ingress controller pod, which would explain the Degraded state.

The ingress operator logs show the following:

    2020-04-23T08:50:21.106Z  INFO   operator.ingress_controller  ingress/controller.go:142  reconciling  {"request": "openshift-ingress-operator/default"}
    2020-04-23T08:50:21.161Z  INFO   operator.ingress_controller  ingress/deployment.go:805  updated router deployment  {"namespace": "openshift-ingress", "name": "router-default"}
    2020-04-23T08:50:21.220Z  ERROR  operator.ingress_controller  ingress/controller.go:209  got retryable error; requeueing  {"after": "1m0s", "error": "IngressController is degraded"}
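The "got retryable error; requeueing" line is the operator's normal back-off behaviour: while the IngressController still reports Degraded, reconciliation is retried roughly once a minute rather than failing permanently. A rough sketch of that requeue pattern with controller-runtime (illustrative only, not the ingress operator's actual reconciler; ingressControllerDegraded is a hypothetical helper):

    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	"k8s.io/apimachinery/pkg/types"
    	"sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // reconciler is a hypothetical stand-in for the ingress operator's
    // IngressController reconciler.
    type reconciler struct{}

    // ingressControllerDegraded is a placeholder; the real operator derives
    // this from the router deployment and operand status.
    func ingressControllerDegraded(name string) bool { return true }

    func (r *reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
    	// ...sync the router deployment, service, wildcard DNS record, etc...

    	if ingressControllerDegraded(req.Name) {
    		// Treat "still degraded" as retryable: requeue and check again in
    		// a minute. Repeating this is what produces the recurring
    		// `got retryable error; requeueing {"after": "1m0s"}` log lines
    		// until the operand finally becomes available (here, at 08:50).
    		return reconcile.Result{RequeueAfter: time.Minute}, nil
    	}
    	return reconcile.Result{}, nil
    }

    func main() {
    	r := &reconciler{}
    	res, err := r.Reconcile(context.Background(), reconcile.Request{
    		NamespacedName: types.NamespacedName{Namespace: "openshift-ingress-operator", Name: "default"},
    	})
    	fmt.Println(res, err)
    }

So the minute-by-minute requeueing itself is expected; the open question is why the operand stayed degraded for so long.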
We could benefit from better visibility into why it took the ingress controller 20 minutes to become ready; I'll spin off a Bugzilla report for that.

Eventually the ingress controller becomes ready, and the IngressController's status shows that it is fine:

    - lastTransitionTime: "2020-04-23T08:23:40Z"
      reason: Valid
      status: "True"
      type: Admitted
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      message: The deployment has Available status condition set to True
      reason: DeploymentAvailable
      status: "False"
      type: DeploymentDegraded
    - lastTransitionTime: "2020-04-23T08:23:44Z"
      message: The endpoint publishing strategy supports a managed load balancer
      reason: WantedByEndpointPublishingStrategy
      status: "True"
      type: LoadBalancerManaged
    - lastTransitionTime: "2020-04-23T08:24:29Z"
      message: The LoadBalancer service is provisioned
      reason: LoadBalancerProvisioned
      status: "True"
      type: LoadBalancerReady
    - lastTransitionTime: "2020-04-23T08:23:44Z"
      message: DNS management is supported and zones are specified in the cluster DNS config.
      reason: Normal
      status: "True"
      type: DNSManaged
    - lastTransitionTime: "2020-04-23T08:24:30Z"
      message: The record is provisioned in all reported zones.
      reason: NoFailedZones
      status: "True"
      type: DNSReady
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      status: "False"
      type: Degraded

So it is not obvious whether the ingress operator is causing the problem or whether it is a casualty of an underlying network issue. The Multus errors warrant investigation, but that is beyond my ken, so I am assigning this report to openshift-sdn. Can someone from the SDN team help out in diagnosing this issue?

Uh. This is really weird. The sdn-controller, which allocates NetNamespaces on a 1:1 basis with Namespaces, just... went out to lunch?

    I0423 08:50:24.489782       1 vnids.go:115] Allocated netid 8127529 for namespace "openshift-ingress"

Keep in mind, the Namespace was created at 08:23:40.

Huh. This strikes me as a very weird one-off. The informer for sdn-controller just... ignored the namespace. Given that this has been a calm, reliable piece of code for a long time, and I just pushed some improvements to connections to the apiserver on GCP, I think we might be able to close this.
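For context, the mechanism being described is a watch-driven allocation: sdn-controller observes Namespace add events through an informer and creates a matching NetNamespace with a newly allocated netid. A simplified sketch of that watch-and-allocate shape (not the actual vnids.go code; allocateNetID and createNetNamespace are hypothetical stand-ins) shows why a missed or delayed informer event leaves a namespace without a netid until something else, such as a resync or a controller restart, causes it to be processed:

    package main

    import (
    	"fmt"
    	"time"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/client-go/informers"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/cache"
    	"k8s.io/client-go/tools/clientcmd"
    )

    // allocateNetID and createNetNamespace are hypothetical stand-ins for the
    // VNID allocator and the NetNamespace writer in sdn-controller.
    func allocateNetID(namespace string) uint32 { return 8127529 }
    func createNetNamespace(namespace string, netid uint32) {
    	fmt.Printf("Allocated netid %d for namespace %q\n", netid, namespace)
    }

    func main() {
    	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(config)

    	// Namespace -> NetNamespace reconciliation is event-driven: if the add
    	// event for a namespace is dropped or ignored, nothing creates its
    	// NetNamespace until the informer's periodic resync (30 minutes in this
    	// sketch) or a restart picks it up again.
    	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
    	nsInformer := factory.Core().V1().Namespaces().Informer()
    	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    		AddFunc: func(obj interface{}) {
    			ns := obj.(*corev1.Namespace)
    			createNetNamespace(ns.Name, allocateNetID(ns.Name))
    		},
    	})

    	stop := make(chan struct{})
    	factory.Start(stop)
    	factory.WaitForCacheSync(stop)
    	select {} // run until killed
    }

That event-driven shape is consistent with what was observed here: the Namespace existed from 08:23, but the netid only appeared at 08:50, once the controller finally processed it.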