Bug 1827152
| Summary: | e2e-gcp-serial 4.5 setup fails with IngressStateEndpointsDegraded: No endpoints found for oauth-server | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomas Smetana <tsmetana> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | CC: | aconstan, aos-bugs, bbennett, mmasters |
| Priority: | unspecified | Version: | 4.5 |
| Target Milestone: | --- | Target Release: | 4.5.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | SDN-CI-IMPACT | Fixed In Version: | |
| Doc Type: | If docs needed, set a value | Doc Text: | |
| Story Points: | --- | Clone Of: | |
| : | 1829584 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-20 15:52:41 UTC | Type: | Bug |
| Regression: | --- | Embargoed: | |
Description
Tomas Smetana
2020-04-23 11:10:58 UTC
The ingress clusteroperator from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625/artifacts/e2e-gcp-serial/clusteroperators.json has the following status conditions:

    "conditions": [
      {
        "lastTransitionTime": "2020-04-23T08:23:40Z",
        "message": "Not all ingress controllers are available.",
        "reason": "IngressUnavailable",
        "status": "False",
        "type": "Available"
      },
      {
        "lastTransitionTime": "2020-04-23T08:23:40Z",
        "message": "Not all ingress controllers are available.",
        "reason": "Reconciling",
        "status": "True",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2020-04-23T08:24:18Z",
        "message": "Some ingresscontrollers are degraded: default",
        "reason": "IngressControllersDegraded",
        "status": "True",
        "type": "Degraded"
      }
    ],

However, the ingress clusteroperator from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625/artifacts/e2e-gcp-serial/must-gather.tar has the following status conditions:

    conditions:
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      message: desired and current number of IngressControllers are equal
      status: "True"
      type: Available
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      message: desired and current number of IngressControllers are equal
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      reason: NoIngressControllersDegraded
      status: "False"
      type: Degraded

From events in the openshift-ingress-operator namespace, it appears the ingress operator starts around 8:23:

    LAST SEEN              MESSAGE
    2020-04-23T08:23:15Z   Created pod: ingress-operator-5ff5b94496-x6pzf
    ...
    2020-04-23T08:23:35Z   Created container ingress-operator

The operator creates the ingress controller in the openshift-ingress namespace less than a minute later, but the events in the openshift-ingress namespace show that it takes a while for the worker nodes to become ready and for networking to become ready:

    LAST SEEN              MESSAGE
    2020-04-23T08:25:06Z   0/3 nodes are available: 3 node(s) didn't match node selector.
    ...
    2020-04-23T08:27:02Z   Successfully assigned openshift-ingress/router-default-698dd6776c-gpjmr to ci-op-6grzl-w-c-w5wnp.c.openshift-gce-devel-ci.internal
    ...
    2020-04-23T08:27:19Z   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(eed02b40643951983fb47c3b3d08d550e99d446d51af6f9c7e7c68e03866e7f8): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    ...
    2020-04-23T08:29:56Z   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(03409525bd0e4c6d41c7cb7064ae80f20705bdb56be87ec77beed87a25137134): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    ...
    2020-04-23T08:47:19Z   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(f563f80aff0cf9498ea9ecb629efbeb59e6920b5e396fb4a5b32e35003b0118c): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    2020-04-23T08:47:33Z   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-7t8wc_openshift-ingress_80a3910e-93fb-4834-91b1-e98dcfb843a2_0(9af524fa2882d14245ef36c35697150c0f6e8ce920b8c1f8dd124354774510b2): Multus: [openshift-ingress/router-default-698dd6776c-7t8wc]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
    2020-04-23T08:50:28Z   Add eth0 [10.131.0.88/23]
    2020-04-23T08:50:37Z   Add eth0 [10.128.2.77/23]
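The CNI request fails with a 400 because no NetNamespace exists yet for openshift-ingress: OpenShift SDN keys each namespace's VNID off a NetNamespace object that sdn-controller is expected to create, and pod sandbox creation in that namespace cannot proceed until it does. As a rough illustration of the dependency (a minimal sketch using the OpenShift network clientset, not the actual openshift-sdn code; lookupNetID is a hypothetical helper):

    package main

    import (
    	"context"
    	"fmt"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/tools/clientcmd"

    	networkclient "github.com/openshift/client-go/network/clientset/versioned"
    )

    // lookupNetID resolves a namespace's VNID by reading the NetNamespace
    // object that sdn-controller is expected to have created for it. If the
    // object is missing, there is no netid to hand to the CNI plugin.
    func lookupNetID(client networkclient.Interface, namespace string) (uint32, error) {
    	netns, err := client.NetworkV1().NetNamespaces().Get(context.TODO(), namespace, metav1.GetOptions{})
    	if err != nil {
    		return 0, fmt.Errorf("failed to find netid for namespace %s: %w", namespace, err)
    	}
    	return netns.NetID, nil
    }

    func main() {
    	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client := networkclient.NewForConfigOrDie(config)

    	// In this run the equivalent lookup kept returning NotFound from
    	// roughly 08:23 until 08:50, which is what the CNI requests surface
    	// as the repeated status 400 errors above.
    	netid, err := lookupNetID(client, "openshift-ingress")
    	fmt.Println(netid, err)
    }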
One thing that confuses me is that the ingress operator has logs for the period between 8:29 and 8:50. Some events are combined, so we cannot be sure, but I suspect that the Multus error was recurring during that period even though the ingress controller pod was somehow up and running; perhaps Multus was causing problems for the running ingress controller pod, which would explain the Degraded state.

The ingress operator logs show the following:

    2020-04-23T08:50:21.106Z  INFO   operator.ingress_controller  ingress/controller.go:142  reconciling  {"request": "openshift-ingress-operator/default"}
    2020-04-23T08:50:21.161Z  INFO   operator.ingress_controller  ingress/deployment.go:805  updated router deployment  {"namespace": "openshift-ingress", "name": "router-default"}
    2020-04-23T08:50:21.220Z  ERROR  operator.ingress_controller  ingress/controller.go:209  got retryable error; requeueing  {"after": "1m0s", "error": "IngressController is degraded"}
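The "got retryable error; requeueing" line is the operator's normal back-off behaviour: while the IngressController still reports Degraded, reconciliation is retried roughly once a minute rather than failing permanently. A rough sketch of that requeue pattern with controller-runtime (illustrative only, not the ingress operator's actual reconciler; ingressControllerDegraded is a hypothetical helper):

    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	"k8s.io/apimachinery/pkg/types"
    	"sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // reconciler is a hypothetical stand-in for the ingress operator's
    // IngressController reconciler.
    type reconciler struct{}

    // ingressControllerDegraded is a placeholder; the real operator derives
    // this from the router deployment and operand status.
    func ingressControllerDegraded(name string) bool { return true }

    func (r *reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
    	// ...sync the router deployment, service, wildcard DNS record, etc...

    	if ingressControllerDegraded(req.Name) {
    		// Treat "still degraded" as retryable: requeue and check again in
    		// a minute. Repeating this is what produces the recurring
    		// `got retryable error; requeueing {"after": "1m0s"}` log lines
    		// until the operand finally becomes available (here, at 08:50).
    		return reconcile.Result{RequeueAfter: time.Minute}, nil
    	}
    	return reconcile.Result{}, nil
    }

    func main() {
    	r := &reconciler{}
    	res, err := r.Reconcile(context.Background(), reconcile.Request{
    		NamespacedName: types.NamespacedName{Namespace: "openshift-ingress-operator", Name: "default"},
    	})
    	fmt.Println(res, err)
    }

So the minute-by-minute requeueing itself is expected; the open question is why the operand stayed degraded for so long.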
We could benefit from better visibility into why it took the ingress controller 20 minutes to become ready; I'll spin off a Bugzilla report for that.

Eventually the ingress controller becomes ready, and the IngressController's status shows that it is fine:

    - lastTransitionTime: "2020-04-23T08:23:40Z"
      reason: Valid
      status: "True"
      type: Admitted
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      message: The deployment has Available status condition set to True
      reason: DeploymentAvailable
      status: "False"
      type: DeploymentDegraded
    - lastTransitionTime: "2020-04-23T08:23:44Z"
      message: The endpoint publishing strategy supports a managed load balancer
      reason: WantedByEndpointPublishingStrategy
      status: "True"
      type: LoadBalancerManaged
    - lastTransitionTime: "2020-04-23T08:24:29Z"
      message: The LoadBalancer service is provisioned
      reason: LoadBalancerProvisioned
      status: "True"
      type: LoadBalancerReady
    - lastTransitionTime: "2020-04-23T08:23:44Z"
      message: DNS management is supported and zones are specified in the cluster DNS config.
      reason: Normal
      status: "True"
      type: DNSManaged
    - lastTransitionTime: "2020-04-23T08:24:30Z"
      message: The record is provisioned in all reported zones.
      reason: NoFailedZones
      status: "True"
      type: DNSReady
    - lastTransitionTime: "2020-04-23T08:50:43Z"
      status: "False"
      type: Degraded

So it is not obvious whether the ingress operator is causing the problem or whether it is a casualty of an underlying network issue. The Multus errors warrant investigation, but that is beyond my ken, so I am assigning this report to openshift-sdn. Can someone from the SDN team help out in diagnosing this issue?

Uh. This is really weird. The sdn-controller, which allocates NetNamespaces on a 1:1 basis with Namespaces, just... went out to lunch?

    I0423 08:50:24.489782       1 vnids.go:115] Allocated netid 8127529 for namespace "openshift-ingress"

Keep in mind, the Namespace was created at 08:23:40.

Huh. This strikes me as a very weird one-off. The informer for sdn-controller just... ignored the namespace. Given that this has been a calm, reliable piece of code for a long time, and I just pushed some improvements to connections to the apiserver on GCP, I think we might be able to close this.
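For context, the mechanism being described is a watch-driven allocation: sdn-controller observes Namespace add events through an informer and creates a matching NetNamespace with a newly allocated netid. A simplified sketch of that watch-and-allocate shape (not the actual vnids.go code; allocateNetID and createNetNamespace are hypothetical stand-ins) shows why a missed or delayed informer event leaves a namespace without a netid until something else, such as a resync or a controller restart, causes it to be processed:

    package main

    import (
    	"fmt"
    	"time"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/client-go/informers"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/cache"
    	"k8s.io/client-go/tools/clientcmd"
    )

    // allocateNetID and createNetNamespace are hypothetical stand-ins for the
    // VNID allocator and the NetNamespace writer in sdn-controller.
    func allocateNetID(namespace string) uint32 { return 8127529 }
    func createNetNamespace(namespace string, netid uint32) {
    	fmt.Printf("Allocated netid %d for namespace %q\n", netid, namespace)
    }

    func main() {
    	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(config)

    	// Namespace -> NetNamespace reconciliation is event-driven: if the add
    	// event for a namespace is dropped or ignored, nothing creates its
    	// NetNamespace until the informer's periodic resync (30 minutes in this
    	// sketch) or a restart picks it up again.
    	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
    	nsInformer := factory.Core().V1().Namespaces().Informer()
    	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    		AddFunc: func(obj interface{}) {
    			ns := obj.(*corev1.Namespace)
    			createNetNamespace(ns.Name, allocateNetID(ns.Name))
    		},
    	})

    	stop := make(chan struct{})
    	factory.Start(stop)
    	factory.WaitForCacheSync(stop)
    	select {} // run until killed
    }

That event-driven shape is consistent with what was observed here: the Namespace existed from 08:23, but the netid only appeared at 08:50, once the controller finally processed it.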