Description of problem:

During upgrade, the etcd clusteroperator (and several others) went Degraded=True because a master node reported "Missing CNI default network":

Feb 12 15:41:56.138 E clusteroperator/etcd changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-apiserver changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-controller-manager changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-scheduler changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)

The OpenShift API was one of the operators that degraded, leading to test failures.
API was unreachable during upgrade for at least 33s:

Feb 12 15:39:38.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:39:38.475 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:39:54.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 12 15:39:55.401 - 4s E openshift-apiserver OpenShift API is not responding to GET requests
Feb 12 15:40:00.701 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:42:55.840 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:42:55.919 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:43:18.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:43:18.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:43:39.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:43:39.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:43:58.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded
Feb 12 15:43:58.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:44:18.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:44:18.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:44:38.403 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 12 15:44:38.482 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:44:59.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:44:59.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:45:15.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:45:15.481 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:45:33.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:45:34.401 - 13s E openshift-apiserver OpenShift API is not responding to GET requests
Feb 12 15:45:48.478 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:46:10.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 12 15:46:10.478 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:46:27.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:46:27.480 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:46:56.859 E kube-apiserver Kube API started failing: etcdserver: request timed out
Feb 12 15:46:57.173 I kube-apiserver Kube API started responding to GET requests
Feb 12 15:47:04.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:47:05.401 - 13s E openshift-apiserver OpenShift API is not responding to GET requests
Feb 12 15:47:19.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:47:35.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:47:35.489 I openshift-apiserver OpenShift API started responding to GET requests

Full set of things that went Degraded=True during update:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17262/build-log.txt | grep 'changed Degraded'
Feb 12 15:40:06.047 E clusteroperator/monitoring changed Degraded to True: UpdatingconfigurationsharingFailed: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Alertmanager host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io alertmanager-main)
Feb 12 15:41:47.463 W clusteroperator/monitoring changed Degraded to False
Feb 12 15:41:56.138 E clusteroperator/etcd changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-apiserver changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-controller-manager changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-scheduler changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:58.446 W clusteroperator/etcd changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready
Feb 12 15:41:58.453 W clusteroperator/kube-apiserver changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready
Feb 12 15:41:58.478 W clusteroperator/kube-controller-manager changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-151-34.us-west-1.compute.internal pods/kube-controller-manager-ip-10-0-151-34.us-west-1.compute.internal container="cluster-policy-controller" is not ready\nStaticPodsDegraded: nodes/ip-10-0-151-34.us-west-1.compute.internal pods/kube-controller-manager-ip-10-0-151-34.us-west-1.compute.internal container="kube-controller-manager" is not ready\nNodeControllerDegraded: All master nodes are ready
Feb 12 15:41:58.546 W clusteroperator/kube-scheduler changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: nodes/ip-10-0-151-34.us-west-1.compute.internal pods/openshift-kube-scheduler-ip-10-0-151-34.us-west-1.compute.internal container="scheduler" is not ready
Feb 12 15:41:58.595 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Feb 12 15:41:59.155 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Feb 12 15:41:59.189 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDaemonSet_UnavailablePod: APIServerDaemonSetDegraded: 1 of 3 requested instances are unavailable
Feb 12 15:42:27.203 W clusteroperator/openshift-apiserver changed Degraded to False
Feb 12 15:44:51.240 E clusteroperator/etcd changed Degraded to True: TargetConfigController_SynchronizationError: TargetConfigControllerDegraded: "configmap/kube-apiserver-pod": could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."
Feb 12 15:44:57.654 W clusteroperator/etcd changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is terminated: "Error" - "/bin/sh: line 3: 8 Terminated sleep 24h\n"\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is terminated: "Error" - "/bin/sh: line 3: 7 Terminated sleep 24h\n"\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-136-79.us-west-1.compute.internal" not ready since 2020-02-12 15:44:52 +0000 UTC because KubeletNotReady ([PLEG is not healthy: pleg has yet to be successful, runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network])\nConfigObservationDegraded: error looking up self: could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."
Feb 12 15:46:34.959 E clusteroperator/monitoring changed Degraded to True: UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: reconciling Alertmanager ClusterRoleBinding failed: updating ClusterRoleBinding object failed: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
Feb 12 15:46:57.454 E clusteroperator/etcd changed Degraded to True: ConfigObservation_Error::HostEndpoints_ErrorUpdatingHostEndpoints::NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 14:\nNodeInstallerDegraded: \nStaticPodsDegraded: etcdserver: request timed out\nHostEndpointsDegraded: unable to determine etcd member dns name for node ip-10-0-151-34.us-west-1.compute.internal: could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."\nConfigObservationDegraded: error looking up self: could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."
Feb 12 15:47:04.585 E clusteroperator/console changed Degraded to True: OAuthClientSync_FailedGet: OAuthClientSyncDegraded: oauth client for console does not exist and cannot be created (the server was unable to return a response in the time allotted, but may still be processing the request (get oauthclients.oauth.openshift.io console))
Feb 12 15:47:04.943 E clusteroperator/authentication changed Degraded to True: OAuthClients_Error: OAuthClientsDegraded: the server was unable to return a response in the time allotted, but may still be processing the request (get oauthclients.oauth.openshift.io openshift-challenging-client)
Feb 12 15:47:35.625 W clusteroperator/console changed Degraded to False
Feb 12 15:47:37.021 W clusteroperator/authentication changed Degraded to False
Feb 12 15:47:39.939 W clusteroperator/etcd changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is waiting: "ContainerCreating" - ""\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is waiting: "ContainerCreating" - ""\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-137-157.us-west-1.compute.internal" not ready since 2020-02-12 15:47:24 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
Feb 12 15:49:13.378 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDaemonSet_UnavailablePod: APIServerDaemonSetDegraded: 1 of 3 requested instances are unavailable
Feb 12 15:49:24.332 E clusteroperator/etcd changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-137-157.us-west-1.compute.internal" not ready since 2020-02-12 15:49:04 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:49:24.383 W clusteroperator/etcd changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready
Feb 12 15:49:49.714 W clusteroperator/monitoring changed Degraded to False
Feb 12 15:49:53.204 W clusteroperator/openshift-apiserver changed Degraded to False

Additional info:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17262
Maybe also related: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17289
Still present in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17380.
Bunches of these in *->4.3.3 CI [1]. AWS 4.2.20 -> 4.3.3 [2] has slightly different symptoms:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 19 21:07:31.731: Service was unreachable during upgrade for at least 1m29s:

but lots of the default network error:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/build-log.txt | sort | uniq | grep -c 'Missing CNI default network'
999

And another AWS 4.2.20 -> 4.3.3 run [3] with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 19 21:09:59.119: API was unreachable during upgrade for at least 2m47s:

has:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18083/build-log.txt | sort | uniq | grep -c 'Missing CNI default network'
1464

Similarly for 4.2.20 -> 4.3.3 on GCP [4] and 4.3.2 -> 4.3.3 on Azure [5]. This currently shows up in 45% of our failing update CI runs over the past 24h [6]. But perhaps this is a dup of the pending-for-4.3 bug 1755784? On the other hand, if that fix was pending for 4.3, it would already be in 4.4 and 4.5, and I would have expected that to make this less common in CI failures. Also in this space and potential dupes: bug 1804681 and... maybe bug 1764629? Anyhow, escalating for triage.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/57#issuecomment-589276189
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18083
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/345/build-log.txt
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/61
[6]: https://search.svc.ci.openshift.org/chart?name=upgrade&search=Missing%20CNI%20default%20network
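For anyone comparing other jobs, the same count can be run over several build logs in one pass. A minimal sketch using the two AWS job numbers cited above (adjust the job list and job name for other runs; note uniq only collapses adjacent duplicates, hence the sort):

$ for job in 18082 18083; do echo -n "${job}: "; curl -s "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/${job}/build-log.txt" | sort | uniq | grep -c 'Missing CNI default network'; done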
We need answers to the questions below to properly analyze the impact of the bug in upgrades.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug?
What cluster functionality is degraded while hitting the bug?
Can this bug cause data loss? Data loss = API server data loss or CRD state information loss, etc.
Is it possible to recover the cluster from the bug?
- Is recovery automatic without intervention? I.e. is the buggy condition transient?
- Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
- Is recovery possible only after more extensive cluster-admin intervention?
- Is recovery impossible (bricked cluster)?
What is the observed rate of failure we see in CI?
Is there a manual workaround that exists to recover from the bug? What are the manual steps?
Just looking around in: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17262

"Observed a panic": 2 occurrences in e2e-aws-upgrade/pods/openshift-etcd-operator_etcd-operator-674466b55f-bxsf2_operator.log and 2 occurrences in e2e-aws-upgrade/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-151-34.us-west-1.compute.internal_kube-apiserver_previous.log

In e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-79.us-west-1.compute.internal_etcd-member.log:
error "remote error: tls: bad certificate", ServerName "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com"
E | rafthttp: failed to read 96cb92ee905d00b4 on stream Message (unexpected EOF)
2020-02-12 15:50:26.309575 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:672d7d7a556f1de9 RaftAttributes:{PeerURLs:[https://etcd-2.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:2380]} Attributes:{Name: ClientURLs:[]}}
2020-02-12 15:50:27.727867 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:daffd35da4f09a43 RaftAttributes:{PeerURLs:[https://etcd-2.ci-op-9sbrfg2t-77109.origin-

In e2e-aws-upgrade/pods/openshift-sdn_sdn-cql6c_sdn.log:
E0212 15:42:38.555510 4370 reflector.go:280] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Namespace: Get https://api-int.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=24404&timeout=5m30s&timeoutSeconds=330&watch=true: dial tcp 10.0.128.100:6443: connect: connection refused

So the SDN can't talk to the apiserver (which seems to be dead), and the apiserver may not be able to talk to etcd either. More investigation needed.
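For reference, the counts above can be reproduced with plain grep against the downloaded artifacts; a rough sketch, assuming the e2e-aws-upgrade artifacts directory from the link above has been fetched locally:

$ grep -c 'Observed a panic' e2e-aws-upgrade/pods/openshift-etcd-operator_etcd-operator-674466b55f-bxsf2_operator.log
$ grep -c 'Observed a panic' e2e-aws-upgrade/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-151-34.us-west-1.compute.internal_kube-apiserver_previous.log
$ grep -cE 'bad certificate|not healthy for reconfigure' e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-79.us-west-1.compute.internal_etcd-member.log
$ grep -c 'connection refused' e2e-aws-upgrade/pods/openshift-sdn_sdn-cql6c_sdn.log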
Now, just to make things more interesting, "OOMKilled" appears in several files, including:

build-log.txt:
E ns/openshift-multus pod/multus-admission-controller-n2skf node/ip-10-0-151-34.us-west-1.compute.internal container=multus-admission-controller container exited with code 137 (OOMKilled):
E ns/openshift-multus pod/multus-4spjd node/ip-10-0-142-214.us-west-1.compute.internal container=kube-multus container exited with code 137 (OOMKilled):
E ns/openshift-etcd pod/etcd-staticpod-bz9xl node/ip-10-0-136-79.us-west-1.compute.internal container=etcd-staticpod container exited with code 255 (OOMKilled):

e2e-aws-upgrade/pods.json:
"lastState": {"terminated": {"reason": "OOMKilled",}"name": "etcd-staticpod",

e2e-aws-upgrade/events.json:
"message": "error killing pod: failed to \"KillPodSandbox\" for \"085fe018-0786-437b-a02f-f46701e1d63c\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_multus-admission-controller-pc5lb_openshift-multus_085fe018-0786-437b-a02f-f46701e1d63c_0(ed3a701869d19862753bb5135487ef43e63f361224ef737f745383e46a7f2d24): Missing CNI default network\"",
"message": "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_packageserver-5c87ff4c4b-wzdf5_openshift-operator-lifecycle-manager_094db5a2-187f-41d7-93de-183af0c13ead_0(57c85a26acb908bb0bcbe610264a7ed09f2bf5f6deaf630e243e41912cf5fbf8): Multus: error adding pod to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/87750/ns/net found: failed to Statfs \"/proc/87750/ns/net\": no such file or directory"
"message": "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_packageserver-6bcb677dcb-trf52_openshift-operator-lifecycle-manager_7c805db9-8aa6-4797-8406-4ac9438938d3_0(f987589c06a0503110753ecc73bf22f1c74aec33a11a3ea5d02c667338467e44): Multus: error adding pod to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/29841/ns/net found: failed to Statfs \"/proc/29841/ns/net\": no such file or directory",

e2e-aws-upgrade/nodes/workers-journal:
gistry(6140667f-dcc2-4d8b-bd12-938c795c1170)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network "openshift-sdn": delegateAdd: cannot set "openshift-sdn" interface name to "eth0": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs "/proc/43097/ns/net": no such file or directory
Feb 12 15:26:38 ip-10-0-142-214 hyperkube[2078]: E0212 15:26:38.909596 2078 kuberuntime_manager.go:729] createPodSandbox for pod "image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network "openshift-sdn": delegateAdd: cannot set "openshift-sdn" interface name to "eth0": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs "/proc/43097/ns/net": no such file or directory
Feb 12 15:26:38 ip-10-0-142-214 hyperkube[2078]: E0212 15:26:38.909710 2078 pod_workers.go:191] Error syncing pod 6140667f-dcc2-4d8b-bd12-938c795c1170 ("image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)"), skipping: failed to "CreatePodSandbox" for "image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)" with CreatePodSandboxError: "CreatePodSandbox for pod \"image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs \"/proc/43097/ns/net\": no such file or directory"
Feb 12 15:26:38 ip-10-0-142-214 hyperkube[2078]: I0212 15:26:38.909756 2078 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-image-registry", Name:"image-registry-575dbf7596-99jbb", UID:"6140667f-dcc2-4d8b-bd12-938c795c1170", APIVersion:"v1", ResourceVersion:"27190", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network "openshift-sdn": delegateAdd: cannot set "openshift-sdn" interface name to "eth0": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs "/proc/43097/ns/net": no such file or directory

The OOMKilled issue is being worked on. Clayton thinks it will improve things.
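For reference, a quick way to enumerate the OOMKilled containers from the gathered e2e-aws-upgrade/pods.json (a sketch only, assuming pods.json is the usual PodList-style JSON collected in the CI artifacts and that jq is available locally):

$ jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason? == "OOMKilled") | "\($p.metadata.namespace)/\($p.metadata.name) container=\(.name) exitCode=\(.lastState.terminated.exitCode)"' e2e-aws-upgrade/pods.json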
Lalatendu Mohanty - I can't answer most of your questions. The triage started as a network problem because of the "Missing CNI default network" message. Networking failed to come up because the SDN can't talk to the apiserver. The apiserver log has 2 panics; it apparently can't talk to etcd. etcd has 4 panics. Also, several pods are OOMKilled. There are potentially several bugs here, and I am not sure which, if any, is causing the failure. This cluster did not come up far enough to run any user pods or system pods that require cluster networking. Unfortunately I can't be very helpful with your questions.

-----------

We need answers to the questions below to properly analyze the impact of the bug in upgrades.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- The cluster is down; Telemetry, Insights, etc. are not running.

What kind of clusters are impacted because of the bug?
- The failed test was on AWS.

What cluster functionality is degraded while hitting the bug?
- No cluster networking; host-network pods can run, pods that need cluster networking can't.

Can this bug cause data loss? Data loss = API server data loss or CRD state information loss, etc.
- I don't think so, however I really don't know.

Is it possible to recover the cluster from the bug?
- I don't know. This was a CI test that failed. We don't try to recover; we just report the error and move on.

Is recovery automatic without intervention? I.e. is the buggy condition transient?
- I don't know. I don't think it will recover, however...

Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
- I don't know if the cluster can be fixed.

Is recovery possible only after more extensive cluster-admin intervention?
- I don't know.

Is recovery impossible (bricked cluster)?
- I don't know.

What is the observed rate of failure we see in CI?
- I was assigned this bug; it's the only case I have investigated. There are many failed upgrade tests, and some of them may be this bug.

Is there a manual workaround that exists to recover from the bug? What are the manual steps?
- I don't know.

I am very sorry that I cannot be more helpful. Being on the network team, I don't directly work with upgrades. This doesn't appear to be a network problem.
The problem is that cri-o tries to start containers before the multus daemonset has created the networking configuration. It's a race: networking comes up 54 seconds after the "Missing CNI default network" message is generated.

From Doug Smith (dosmith) regarding the above "missing CNI default network":

Here's some info... Main error message @ 23:34:03.589:

Feb 19 23:34:03.589 [...snip...]: Missing CNI default network (4 times)

However -- there are no Multus entrypoint logs until 2020-02-19T23:34:57 at the earliest. (This is the script that will create the primary CNI configuration.) Here's a collection of Multus entrypoint logs: https://paste.centos.org/view/d67495e9

So, we can say with some authority -- no primary CNI configuration was created until, at the earliest, the same second as the first Multus entrypoint @ 2020-02-19T23:34:57. What I take the "missing CNI default network" to mean is that CRI-O (and/or the kubelet) doesn't see the CNI configuration yet because it hasn't been created -- it isn't created until the Multus daemonset runs the entrypoint script, which creates the primary CNI configuration. Multus then waits until openshift-sdn lays down its CNI configuration, which it uses to make a Multus configuration with openshift-sdn as the default network. In this case, from these logs, it doesn't look like there are huge delays for that to happen.
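For anyone who wants to watch this race on a live node, one rough check is whether the default CNI configuration file exists yet while the node is still NotReady. This is only a sketch: the node name is just the example from this bug, and the CNI config directories below are assumptions (they vary by release), so adjust them to match the cluster:

$ oc debug node/ip-10-0-151-34.us-west-1.compute.internal -- chroot /host ls -l /etc/kubernetes/cni/net.d /etc/cni/net.d
$ oc adm node-logs ip-10-0-151-34.us-west-1.compute.internal -u crio | grep -c 'Missing CNI default network'

If the directory listing is empty while the kubelet is reporting NetworkPluginNotReady, that matches the ordering Doug describes above; once the Multus entrypoint writes its configuration, the message should stop.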
Does the upgrade complete properly with everything working? Is there a bug here, or is this just an unexpected message? Pods that hit "Missing CNI default network" should be retried and should ultimately work.
@slowrie Do you know the answer to Phil's question?
(In reply to Lalatendu Mohanty from comment #9)
> @slowrie Do you know the answer to Phil's question?

No, I don't.
I strongly suspect that this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1805444
Weibin, can you see if this is resolved by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1805444 please? Thanks
Hi Ben,

https://bugzilla.redhat.com/show_bug.cgi?id=1805444 is tested and verified in 4.3.0-0.nightly-2020-03-02-094404.

Weibin
According to comments 12 and 15, QE has verified this bug now; we can reopen it if it happens again in CI upgrade testing.
Looks like this was verified two weeks ago (comment 16), but this 4.4 bug has not been cloned back to 4.3, despite this issue coming up in 4.3 CI tests (comment 3). Have we just not gotten around to backporting it yet?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days