The 4.5 e2e-gcp-serial container setup fails with the following errors:

level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-gz8d3iis-df9a4.origin-ci-int-gce.dev.openshift.com:6443 to initialize..."
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ci-op-gz8d3iis-df9a4.origin-ci-int-gce.dev.openshift.com: []"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with OAuthClientSync_FailedHost: OAuthClientSyncProgressing: waiting on route host"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingUserWorkloadThanosRulerFailed: Failed to rollout the stack. Error: running task Updating User Workload Thanos Ruler failed: failed to retrieve Grafana datasources config: secrets \"grafana-datasources\" not found"
level=info msg="Cluster operator monitoring Available is False with : "
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, ingress, monitoring"

See the details here: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625

It looks like the routing components don't come up during setup, which causes the installation to fail.
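The installer's wait loop is reporting the Available/Progressing/Degraded conditions on each ClusterOperator. For reference, here is a minimal sketch, not the installer's own code, of how those conditions can be read back with a dynamic client; the group/version/resource and operator names come from the log above, while the kubeconfig location and error handling are assumptions.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig at the default location with access to the cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// ClusterOperator is a cluster-scoped config.openshift.io/v1 resource.
	gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
	for _, name := range []string{"authentication", "console", "ingress", "monitoring"} {
		co, err := client.Resource(gvr).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			fmt.Printf("%s: %v\n", name, err)
			continue
		}
		conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
		for _, c := range conditions {
			cond, ok := c.(map[string]interface{})
			if !ok {
				continue
			}
			// Print the same condition/status/reason triple the installer logs above.
			fmt.Printf("%s: %v=%v (%v: %v)\n", name, cond["type"], cond["status"], cond["reason"], cond["message"])
		}
	}
}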
The ingress clusteroperator from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625/artifacts/e2e-gcp-serial/clusteroperators.json has the following status conditions:

"conditions": [
    {
        "lastTransitionTime": "2020-04-23T08:23:40Z",
        "message": "Not all ingress controllers are available.",
        "reason": "IngressUnavailable",
        "status": "False",
        "type": "Available"
    },
    {
        "lastTransitionTime": "2020-04-23T08:23:40Z",
        "message": "Not all ingress controllers are available.",
        "reason": "Reconciling",
        "status": "True",
        "type": "Progressing"
    },
    {
        "lastTransitionTime": "2020-04-23T08:24:18Z",
        "message": "Some ingresscontrollers are degraded: default",
        "reason": "IngressControllersDegraded",
        "status": "True",
        "type": "Degraded"
    }
],

However, the ingress clusteroperator from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1625/artifacts/e2e-gcp-serial/must-gather.tar has the following status conditions:

conditions:
- lastTransitionTime: "2020-04-23T08:50:43Z"
  message: desired and current number of IngressControllers are equal
  status: "True"
  type: Available
- lastTransitionTime: "2020-04-23T08:50:43Z"
  message: desired and current number of IngressControllers are equal
  status: "False"
  type: Progressing
- lastTransitionTime: "2020-04-23T08:50:43Z"
  reason: NoIngressControllersDegraded
  status: "False"
  type: Degraded

From events in the openshift-ingress-operator namespace, it appears the ingress operator starts around 8:23:

LAST SEEN              MESSAGE
2020-04-23T08:23:15Z   Created pod: ingress-operator-5ff5b94496-x6pzf
...
2020-04-23T08:23:35Z   Created container ingress-operator

The operator creates the ingress controller in the openshift-ingress namespace less than a minute later, but the events in the openshift-ingress namespace show that it takes a while for the worker nodes to become ready and for networking to become ready:

LAST SEEN              MESSAGE
2020-04-23T08:25:06Z   0/3 nodes are available: 3 node(s) didn't match node selector.
...
2020-04-23T08:27:02Z   Successfully assigned openshift-ingress/router-default-698dd6776c-gpjmr to ci-op-6grzl-w-c-w5wnp.c.openshift-gce-devel-ci.internal
...
2020-04-23T08:27:19Z   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(eed02b40643951983fb47c3b3d08d550e99d446d51af6f9c7e7c68e03866e7f8): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
...
2020-04-23T08:29:56Z   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(03409525bd0e4c6d41c7cb7064ae80f20705bdb56be87ec77beed87a25137134): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
...
2020-04-23T08:47:19Z   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-gpjmr_openshift-ingress_5e5501b7-4715-4820-a67d-e86e5ec99780_0(f563f80aff0cf9498ea9ecb629efbeb59e6920b5e396fb4a5b32e35003b0118c): Multus: [openshift-ingress/router-default-698dd6776c-gpjmr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
2020-04-23T08:47:33Z   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-698dd6776c-7t8wc_openshift-ingress_80a3910e-93fb-4834-91b1-e98dcfb843a2_0(9af524fa2882d14245ef36c35697150c0f6e8ce920b8c1f8dd124354774510b2): Multus: [openshift-ingress/router-default-698dd6776c-7t8wc]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking confAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: openshift-ingress, netnamespaces.network.openshift.io \"openshift-ingress\" not found\n'
2020-04-23T08:50:28Z   Add eth0 [10.131.0.88/23]
2020-04-23T08:50:37Z   Add eth0 [10.128.2.77/23]

One thing that confuses me is that the ingress operator has logs for the period between 8:29 and 8:50. Some events are combined, so we cannot be sure, but I suspect that the Multus error was recurring during that period even though the ingress controller pod was somehow up and running, so perhaps Multus was causing problems for the running ingress controller pod, which caused it to report as degraded.

The ingress operator logs show the following:

2020-04-23T08:50:21.106Z  INFO   operator.ingress_controller  ingress/controller.go:142  reconciling  {"request": "openshift-ingress-operator/default"}
2020-04-23T08:50:21.161Z  INFO   operator.ingress_controller  ingress/deployment.go:805  updated router deployment  {"namespace": "openshift-ingress", "name": "router-default"}
2020-04-23T08:50:21.220Z  ERROR  operator.ingress_controller  ingress/controller.go:209  got retryable error; requeueing  {"after": "1m0s", "error": "IngressController is degraded"}

We could benefit from better visibility into why it took the ingress controller 20 minutes to become ready; I'll spin off a Bugzilla report for that. Eventually the ingress controller becomes ready, and the IngressController's status shows that it is fine:

- lastTransitionTime: "2020-04-23T08:23:40Z"
  reason: Valid
  status: "True"
  type: Admitted
- lastTransitionTime: "2020-04-23T08:50:43Z"
  status: "True"
  type: Available
- lastTransitionTime: "2020-04-23T08:50:43Z"
  message: The deployment has Available status condition set to True
  reason: DeploymentAvailable
  status: "False"
  type: DeploymentDegraded
- lastTransitionTime: "2020-04-23T08:23:44Z"
  message: The endpoint publishing strategy supports a managed load balancer
  reason: WantedByEndpointPublishingStrategy
  status: "True"
  type: LoadBalancerManaged
- lastTransitionTime: "2020-04-23T08:24:29Z"
  message: The LoadBalancer service is provisioned
  reason: LoadBalancerProvisioned
  status: "True"
  type: LoadBalancerReady
- lastTransitionTime: "2020-04-23T08:23:44Z"
  message: DNS management is supported and zones are specified in the cluster DNS config.
  reason: Normal
  status: "True"
  type: DNSManaged
- lastTransitionTime: "2020-04-23T08:24:30Z"
  message: The record is provisioned in all reported zones.
  reason: NoFailedZones
  status: "True"
  type: DNSReady
- lastTransitionTime: "2020-04-23T08:50:43Z"
  status: "False"
  type: Degraded

So it is not obvious whether the ingress operator is causing the problem or whether it is a casualty of an underlying network issue. The Multus errors warrant investigation, but that is beyond my ken, so I am assigning this report to openshift-sdn. Can someone from the SDN team help out in diagnosing this issue?
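For what it's worth, the CNI failure above is specifically that netnamespaces.network.openshift.io "openshift-ingress" did not exist even though the Namespace did. Below is a small diagnostic sketch, assuming nothing beyond cluster access via the default kubeconfig, that lists every Namespace with no matching NetNamespace; it uses a dynamic client rather than the typed OpenShift clients and only illustrates the check, it does not reproduce openshift-sdn's logic.

package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig at the default location with access to the cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kube := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)
	// Resource name taken from the CNI error: netnamespaces.network.openshift.io.
	netnsGVR := schema.GroupVersionResource{Group: "network.openshift.io", Version: "v1", Resource: "netnamespaces"}

	namespaces, err := kube.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, ns := range namespaces.Items {
		_, err := dyn.Resource(netnsGVR).Get(context.TODO(), ns.Name, metav1.GetOptions{})
		switch {
		case errors.IsNotFound(err):
			// This is the state the router pod was stuck in: a Namespace with no NetNamespace.
			fmt.Printf("namespace %q has no NetNamespace\n", ns.Name)
		case err != nil:
			fmt.Printf("namespace %q: %v\n", ns.Name, err)
		}
	}
}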
Uh. This is really weird. The sdn-controller, which allocates NetNamespaces on a 1:1 basis with Namespaces, just... went out to lunch?

I0423 08:50:24.489782       1 vnids.go:115] Allocated netid 8127529 for namespace "openshift-ingress"

Keep in mind, the Namespace was created at 08:23:40. Huh.
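For context on what "allocates NetNamespaces on a 1:1 basis with Namespaces" means, here is a minimal sketch of the general shape of such a controller: a Namespace informer whose add handler allocates a netid and would create the matching NetNamespace. This is not the actual openshift-sdn vnids.go code; the counter-based allocator, the resync setting, and the client setup are illustrative assumptions.

package main

import (
	"fmt"
	"sync/atomic"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig at the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	var nextNetID uint32
	// Resync disabled here for simplicity; the real controller's settings may differ.
	factory := informers.NewSharedInformerFactory(client, 0)
	nsInformer := factory.Core().V1().Namespaces().Informer()
	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ns := obj.(*corev1.Namespace)
			// Toy allocator: the real controller has its own netid allocation scheme.
			netid := atomic.AddUint32(&nextNetID, 1)
			// The real controller creates a network.openshift.io/v1 NetNamespace here;
			// until that object exists, CNI setup for pods in the namespace fails with
			// the "failed to find netid for namespace" error quoted earlier.
			fmt.Printf("allocated netid %d for namespace %q\n", netid, ns.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, nsInformer.HasSynced)
	select {} // run until killed
}

Given that shape, the 27-minute gap between the Namespace's creation and the "Allocated netid" log line means the add handler simply wasn't invoked for openshift-ingress until 08:50.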
This strikes me as a very weird one-off. The informer for the sdn-controller just... ignored the namespace. Given that this has been a calm, reliable piece of code for a long time, and I just pushed some improvements to connections to the apiserver on GCP, I think we might be able to close this.
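If this resurfaces, one way to poke at the "informer ignored the namespace" theory would be a raw watch on Namespaces that prints every event it receives, to see whether ADDED events arrive from the apiserver promptly. This is only an illustrative diagnostic sketch, not anything from the job artifacts; the kubeconfig handling and the lack of reconnection logic are simplifying assumptions.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig at the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Open a single watch; a real informer would list first and reconnect on failure.
	w, err := client.CoreV1().Namespaces().Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		// Each event carries a type (ADDED, MODIFIED, DELETED, ...) and the object.
		fmt.Printf("%s %T\n", event.Type, event.Object)
	}
}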