Description of problem:

During upgrade, the etcd clusteroperator (and several others) went Degraded=True because a master node reported "Missing CNI default network":

Feb 12 15:41:56.138 E clusteroperator/etcd changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-apiserver changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-controller-manager changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-scheduler changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)

The OpenShift API was one of the operators that degraded, leading to test failures.
API was unreachable during upgrade for at least 33s:

Feb 12 15:39:38.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:39:38.475 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:39:54.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 12 15:39:55.401 - 4s E openshift-apiserver OpenShift API is not responding to GET requests
Feb 12 15:40:00.701 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:42:55.840 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:42:55.919 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:43:18.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:43:18.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:43:39.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:43:39.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:43:58.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded
Feb 12 15:43:58.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:44:18.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:44:18.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:44:38.403 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 12 15:44:38.482 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:44:59.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:44:59.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:45:15.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:45:15.481 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:45:33.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:45:34.401 - 13s E openshift-apiserver OpenShift API is not responding to GET requests
Feb 12 15:45:48.478 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:46:10.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 12 15:46:10.478 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:46:27.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:46:27.480 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:46:56.859 E kube-apiserver Kube API started failing: etcdserver: request timed out
Feb 12 15:46:57.173 I kube-apiserver Kube API started responding to GET requests
Feb 12 15:47:04.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:47:05.401 - 13s E openshift-apiserver OpenShift API is not responding to GET requests
Feb 12 15:47:19.479 I openshift-apiserver OpenShift API started responding to GET requests
Feb 12 15:47:35.401 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 12 15:47:35.489 I openshift-apiserver OpenShift API started responding to GET requests

Full set of things that went Degraded=True during update:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17262/build-log.txt | grep 'changed Degraded'
Feb 12 15:40:06.047 E clusteroperator/monitoring changed Degraded to True: UpdatingconfigurationsharingFailed: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Alertmanager host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io alertmanager-main)
Feb 12 15:41:47.463 W clusteroperator/monitoring changed Degraded to False
Feb 12 15:41:56.138 E clusteroperator/etcd changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-apiserver changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-controller-manager changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:56.167 E clusteroperator/kube-scheduler changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-151-34.us-west-1.compute.internal" not ready since 2020-02-12 15:41:38 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:41:58.446 W clusteroperator/etcd changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready
Feb 12 15:41:58.453 W clusteroperator/kube-apiserver changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready
Feb 12 15:41:58.478 W clusteroperator/kube-controller-manager changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-151-34.us-west-1.compute.internal pods/kube-controller-manager-ip-10-0-151-34.us-west-1.compute.internal container="cluster-policy-controller" is not ready\nStaticPodsDegraded: nodes/ip-10-0-151-34.us-west-1.compute.internal pods/kube-controller-manager-ip-10-0-151-34.us-west-1.compute.internal container="kube-controller-manager" is not ready\nNodeControllerDegraded: All master nodes are ready
Feb 12 15:41:58.546 W clusteroperator/kube-scheduler changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: nodes/ip-10-0-151-34.us-west-1.compute.internal pods/openshift-kube-scheduler-ip-10-0-151-34.us-west-1.compute.internal container="scheduler" is not ready
Feb 12 15:41:58.595 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Feb 12 15:41:59.155 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Feb 12 15:41:59.189 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDaemonSet_UnavailablePod: APIServerDaemonSetDegraded: 1 of 3 requested instances are unavailable
Feb 12 15:42:27.203 W clusteroperator/openshift-apiserver changed Degraded to False
Feb 12 15:44:51.240 E clusteroperator/etcd changed Degraded to True: TargetConfigController_SynchronizationError: TargetConfigControllerDegraded: "configmap/kube-apiserver-pod": could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."
Feb 12 15:44:57.654 W clusteroperator/etcd changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is terminated: "Error" - "/bin/sh: line 3: 8 Terminated sleep 24h\n"\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is terminated: "Error" - "/bin/sh: line 3: 7 Terminated sleep 24h\n"\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-136-79.us-west-1.compute.internal" not ready since 2020-02-12 15:44:52 +0000 UTC because KubeletNotReady ([PLEG is not healthy: pleg has yet to be successful, runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network])\nConfigObservationDegraded: error looking up self: could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."
Feb 12 15:46:34.959 E clusteroperator/monitoring changed Degraded to True: UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: reconciling Alertmanager ClusterRoleBinding failed: updating ClusterRoleBinding object failed: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
Feb 12 15:46:57.454 E clusteroperator/etcd changed Degraded to True: ConfigObservation_Error::HostEndpoints_ErrorUpdatingHostEndpoints::NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 14:\nNodeInstallerDegraded: \nStaticPodsDegraded: etcdserver: request timed out\nHostEndpointsDegraded: unable to determine etcd member dns name for node ip-10-0-151-34.us-west-1.compute.internal: could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."\nConfigObservationDegraded: error looking up self: could not resolve member "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com."
Feb 12 15:47:04.585 E clusteroperator/console changed Degraded to True: OAuthClientSync_FailedGet: OAuthClientSyncDegraded: oauth client for console does not exist and cannot be created (the server was unable to return a response in the time allotted, but may still be processing the request (get oauthclients.oauth.openshift.io console))
Feb 12 15:47:04.943 E clusteroperator/authentication changed Degraded to True: OAuthClients_Error: OAuthClientsDegraded: the server was unable to return a response in the time allotted, but may still be processing the request (get oauthclients.oauth.openshift.io openshift-challenging-client)
Feb 12 15:47:35.625 W clusteroperator/console changed Degraded to False
Feb 12 15:47:37.021 W clusteroperator/authentication changed Degraded to False
Feb 12 15:47:39.939 W clusteroperator/etcd changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd" is waiting: "ContainerCreating" - ""\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is not ready\nStaticPodsDegraded: nodes/ip-10-0-136-79.us-west-1.compute.internal pods/etcd-ip-10-0-136-79.us-west-1.compute.internal container="etcd-metrics" is waiting: "ContainerCreating" - ""\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-137-157.us-west-1.compute.internal" not ready since 2020-02-12 15:47:24 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
Feb 12 15:49:13.378 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDaemonSet_UnavailablePod: APIServerDaemonSetDegraded: 1 of 3 requested instances are unavailable
Feb 12 15:49:24.332 E clusteroperator/etcd changed Degraded to True: NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-137-157.us-west-1.compute.internal" not ready since 2020-02-12 15:49:04 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)
Feb 12 15:49:24.383 W clusteroperator/etcd changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready
Feb 12 15:49:49.714 W clusteroperator/monitoring changed Degraded to False
Feb 12 15:49:53.204 W clusteroperator/openshift-apiserver changed Degraded to False

Additional info:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17262
Maybe also related: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17289
Still present in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17380.
Bunches of these in *->4.3.3 CI [1]. AWS 4.2.20 -> 4.3.3 [2] has slightly different symptoms:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 19 21:07:31.731: Service was unreachable during upgrade for at least 1m29s:

but lots of the default network error:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/build-log.txt | sort | uniq | grep -c 'Missing CNI default network'
999

And another AWS 4.2.20 -> 4.3.3 run [3] with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 19 21:09:59.119: API was unreachable during upgrade for at least 2m47s:

has:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18083/build-log.txt | sort | uniq | grep -c 'Missing CNI default network'
1464

Similarly for 4.2.20 -> 4.3.3 on GCP [4] and 4.3.2 -> 4.3.3 on Azure [5]. This currently shows up in 45% of our failing update CI runs over the past 24h [6]. But perhaps this is a dup of the pending-for-4.3 bug 1755784? On the other hand, if that fix was pending for 4.3, it would already be in 4.4 and 4.5, and I would have expected that to make this less common in CI failures. Also in this space and potential dupes: bug 1804681 and... maybe bug 1764629? Anyhow, escalating for triage.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/57#issuecomment-589276189
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18083
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/345/build-log.txt
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/61
[6]: https://search.svc.ci.openshift.org/chart?name=upgrade&search=Missing%20CNI%20default%20network
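For anyone comparing other jobs, the same count can be run over several build logs in one pass. A minimal sketch using the two AWS job numbers cited above (adjust the job list and job name for other runs; note uniq only collapses adjacent duplicates, hence the sort):

$ for job in 18082 18083; do echo -n "${job}: "; curl -s "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/${job}/build-log.txt" | sort | uniq | grep -c 'Missing CNI default network'; done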
We need answers to the questions below to properly analyze the impact of the bug in upgrades.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug?
What cluster functionality is degraded while hitting the bug?
Can this bug cause data loss? Data loss = API server data loss or CRD state information loss, etc.
Is it possible to recover the cluster from the bug?
- Is recovery automatic without intervention? I.e. is the buggy condition transient?
- Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
- Is recovery possible only after more extensive cluster-admin intervention?
- Is recovery impossible (bricked cluster)?
What is the observed rate of failure we see in CI?
Is there a manual workaround that exists to recover from the bug? What are the manual steps?
Just looking around in: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17262

"Observed a panic": 2 occurrences in e2e-aws-upgrade/pods/openshift-etcd-operator_etcd-operator-674466b55f-bxsf2_operator.log and 2 occurrences in e2e-aws-upgrade/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-151-34.us-west-1.compute.internal_kube-apiserver_previous.log

In e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-79.us-west-1.compute.internal_etcd-member.log:
error "remote error: tls: bad certificate", ServerName "etcd-0.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com"
E | rafthttp: failed to read 96cb92ee905d00b4 on stream Message (unexpected EOF)
2020-02-12 15:50:26.309575 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:672d7d7a556f1de9 RaftAttributes:{PeerURLs:[https://etcd-2.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:2380]} Attributes:{Name: ClientURLs:[]}}
2020-02-12 15:50:27.727867 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:daffd35da4f09a43 RaftAttributes:{PeerURLs:[https://etcd-2.ci-op-9sbrfg2t-77109.origin-

In e2e-aws-upgrade/pods/openshift-sdn_sdn-cql6c_sdn.log:
E0212 15:42:38.555510 4370 reflector.go:280] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Namespace: Get https://api-int.ci-op-9sbrfg2t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=24404&timeout=5m30s&timeoutSeconds=330&watch=true: dial tcp 10.0.128.100:6443: connect: connection refused

So the SDN can't talk to the apiserver (which seems to be dead), and the apiserver may not be able to talk to etcd either. More investigation needed.
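For reference, the counts above can be reproduced with plain grep against the downloaded artifacts; a rough sketch, assuming the e2e-aws-upgrade artifacts directory from the link above has been fetched locally:

$ grep -c 'Observed a panic' e2e-aws-upgrade/pods/openshift-etcd-operator_etcd-operator-674466b55f-bxsf2_operator.log
$ grep -c 'Observed a panic' e2e-aws-upgrade/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-151-34.us-west-1.compute.internal_kube-apiserver_previous.log
$ grep -cE 'bad certificate|not healthy for reconfigure' e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-79.us-west-1.compute.internal_etcd-member.log
$ grep -c 'connection refused' e2e-aws-upgrade/pods/openshift-sdn_sdn-cql6c_sdn.log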
Now, just to make things more interesting, "OOMKilled" appears in several files, including:

build-log.txt:
E ns/openshift-multus pod/multus-admission-controller-n2skf node/ip-10-0-151-34.us-west-1.compute.internal container=multus-admission-controller container exited with code 137 (OOMKilled):
E ns/openshift-multus pod/multus-4spjd node/ip-10-0-142-214.us-west-1.compute.internal container=kube-multus container exited with code 137 (OOMKilled):
E ns/openshift-etcd pod/etcd-staticpod-bz9xl node/ip-10-0-136-79.us-west-1.compute.internal container=etcd-staticpod container exited with code 255 (OOMKilled):

e2e-aws-upgrade/pods.json:
"lastState": {"terminated": {"reason": "OOMKilled",}"name": "etcd-staticpod",

e2e-aws-upgrade/events.json:
"message": "error killing pod: failed to \"KillPodSandbox\" for \"085fe018-0786-437b-a02f-f46701e1d63c\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_multus-admission-controller-pc5lb_openshift-multus_085fe018-0786-437b-a02f-f46701e1d63c_0(ed3a701869d19862753bb5135487ef43e63f361224ef737f745383e46a7f2d24): Missing CNI default network\"",
"message": "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_packageserver-5c87ff4c4b-wzdf5_openshift-operator-lifecycle-manager_094db5a2-187f-41d7-93de-183af0c13ead_0(57c85a26acb908bb0bcbe610264a7ed09f2bf5f6deaf630e243e41912cf5fbf8): Multus: error adding pod to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/87750/ns/net found: failed to Statfs \"/proc/87750/ns/net\": no such file or directory"
"message": "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_packageserver-6bcb677dcb-trf52_openshift-operator-lifecycle-manager_7c805db9-8aa6-4797-8406-4ac9438938d3_0(f987589c06a0503110753ecc73bf22f1c74aec33a11a3ea5d02c667338467e44): Multus: error adding pod to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/29841/ns/net found: failed to Statfs \"/proc/29841/ns/net\": no such file or directory",

e2e-aws-upgrade/nodes/workers-journal:
gistry(6140667f-dcc2-4d8b-bd12-938c795c1170)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network "openshift-sdn": delegateAdd: cannot set "openshift-sdn" interface name to "eth0": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs "/proc/43097/ns/net": no such file or directory
Feb 12 15:26:38 ip-10-0-142-214 hyperkube[2078]: E0212 15:26:38.909596 2078 kuberuntime_manager.go:729] createPodSandbox for pod "image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network "openshift-sdn": delegateAdd: cannot set "openshift-sdn" interface name to "eth0": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs "/proc/43097/ns/net": no such file or directory
Feb 12 15:26:38 ip-10-0-142-214 hyperkube[2078]: E0212 15:26:38.909710 2078 pod_workers.go:191] Error syncing pod 6140667f-dcc2-4d8b-bd12-938c795c1170 ("image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)"), skipping: failed to "CreatePodSandbox" for "image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)" with CreatePodSandboxError: "CreatePodSandbox for pod \"image-registry-575dbf7596-99jbb_openshift-image-registry(6140667f-dcc2-4d8b-bd12-938c795c1170)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs \"/proc/43097/ns/net\": no such file or directory"
Feb 12 15:26:38 ip-10-0-142-214 hyperkube[2078]: I0212 15:26:38.909756 2078 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-image-registry", Name:"image-registry-575dbf7596-99jbb", UID:"6140667f-dcc2-4d8b-bd12-938c795c1170", APIVersion:"v1", ResourceVersion:"27190", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-registry-575dbf7596-99jbb_openshift-image-registry_6140667f-dcc2-4d8b-bd12-938c795c1170_0(be15354ef70861bfdd0447b3da9171ca4f8207a580201c052ae73acfe45970e3): Multus: error adding pod to network "openshift-sdn": delegateAdd: cannot set "openshift-sdn" interface name to "eth0": validateIfName: no net namespace /proc/43097/ns/net found: failed to Statfs "/proc/43097/ns/net": no such file or directory

The OOMKilled issue is being worked on. Clayton thinks it will improve things.
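For reference, a quick way to enumerate the OOMKilled containers from the gathered e2e-aws-upgrade/pods.json (a sketch only, assuming pods.json is the usual PodList-style JSON collected in the CI artifacts and that jq is available locally):

$ jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason? == "OOMKilled") | "\($p.metadata.namespace)/\($p.metadata.name) container=\(.name) exitCode=\(.lastState.terminated.exitCode)"' e2e-aws-upgrade/pods.json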
Lalatendu Mohanty - I can't answer most of your questions. The triage started as a network problem because of the "Missing CNI default network" message. Networking failed to come up because the SDN can't talk to the apiserver. The apiserver log has 2 panics; it apparently can't talk to etcd. etcd has 4 panics. Also, several pods are OOMKilled. There are potentially several bugs here, and I am not sure which, if any, is causing the failure. This cluster did not come up far enough to run any user pods or system pods that require cluster networking. Unfortunately I can't be very helpful with your questions.

-----------

We need answers to the questions below to properly analyze the impact of the bug in upgrades.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- The cluster is down; Telemetry, Insights, etc. are not running.

What kind of clusters are impacted because of the bug?
- The failed test was on AWS.

What cluster functionality is degraded while hitting the bug?
- No cluster networking; host-network pods can run, pods that need cluster networking can't.

Can this bug cause data loss? Data loss = API server data loss or CRD state information loss, etc.
- I don't think so, however I really don't know.

Is it possible to recover the cluster from the bug?
- I don't know. This was a CI test that failed. We don't try to recover; we just report the error and move on.

Is recovery automatic without intervention? I.e. is the buggy condition transient?
- I don't know. I don't think it will recover, however...

Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
- I don't know if the cluster can be fixed.

Is recovery possible only after more extensive cluster-admin intervention?
- I don't know.

Is recovery impossible (bricked cluster)?
- I don't know.

What is the observed rate of failure we see in CI?
- I was assigned this bug; it's the only case I have investigated. There are many failed upgrade tests, and some of them may be this bug.

Is there a manual workaround that exists to recover from the bug? What are the manual steps?
- I don't know.

I am very sorry that I cannot be more helpful. Being on the network team, I don't directly work with upgrades. This doesn't appear to be a network problem.
The problem is that cri-o tries to start containers before the multus daemonset has created the networking configuration. It's a race: networking comes up 54 seconds after the "Missing CNI default network" message is generated.

From Doug Smith (dosmith) regarding the above "missing CNI default network":

Here's some info... Main error message @ 23:34:03.589:

Feb 19 23:34:03.589 [...snip...]: Missing CNI default network (4 times)

However -- there are no Multus entrypoint logs until 2020-02-19T23:34:57 at the earliest. (This is the script that will create the primary CNI configuration.) Here's a collection of Multus entrypoint logs: https://paste.centos.org/view/d67495e9

So, we can say with some authority -- no primary CNI configuration was created until, at the earliest, the same second as the first Multus entrypoint @ 2020-02-19T23:34:57. What I take the "missing CNI default network" to mean is that CRI-O (and/or the kubelet) doesn't see the CNI configuration yet because it hasn't been created -- it isn't created until the Multus daemonset runs the entrypoint script, which creates the primary CNI configuration. Multus then waits until openshift-sdn lays down its CNI configuration, which it uses to make a Multus configuration with openshift-sdn as the default network. In this case, from these logs, it doesn't look like there are huge delays for that to happen.
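For anyone who wants to watch this race on a live node, one rough check is whether the default CNI configuration file exists yet while the node is still NotReady. This is only a sketch: the node name is just the example from this bug, and the CNI config directories below are assumptions (they vary by release), so adjust them to match the cluster:

$ oc debug node/ip-10-0-151-34.us-west-1.compute.internal -- chroot /host ls -l /etc/kubernetes/cni/net.d /etc/cni/net.d
$ oc adm node-logs ip-10-0-151-34.us-west-1.compute.internal -u crio | grep -c 'Missing CNI default network'

If the directory listing is empty while the kubelet is reporting NetworkPluginNotReady, that matches the ordering Doug describes above; once the Multus entrypoint writes its configuration, the message should stop.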
Does the upgrade complete properly with everything working? Is there a bug here, or is this just an unexpected message? Pods that hit "Missing CNI default network" should be retried and should ultimately work.
@slowrie Do you know the answer to Phil's question?
(In reply to Lalatendu Mohanty from comment #9)
> @slowrie Do you know the answer to Phil's question?

No, I don't.
I strongly suspect that this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1805444
Weibin, can you see if this is resolved by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1805444 please? Thanks
Hi Ben,

https://bugzilla.redhat.com/show_bug.cgi?id=1805444 is tested and verified in 4.3.0-0.nightly-2020-03-02-094404.

Weibin
According to comments 12 and 15, QE has verified this bug now; we can reopen it if it happens again in CI upgrade testing.
Looks like this was verified two weeks ago (comment 16), but this 4.4 bug has not been cloned back to 4.3, despite this issue coming up in 4.3 CI tests (comment 3). Have we just not gotten around to backporting it yet?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days