Bug 1791117

Summary: Network disruption of openshift-apiserver during 4.4-4.4 upgrade
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Networking Assignee: Aniket Bhat <anbhat>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: aconstan, bbennett, dosmith, scuppett
Version: 4.4   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-03-10 13:21:50 UTC Type: Bug

Description Clayton Coleman 2020-01-14 21:58:14 UTC
An e2e test that verifies both the Kube and OpenShift apiservers remain available during an upgrade failed in a 4.4 to 4.4 upgrade run.

A particular snippet that seemed to highlight the reason for the failure was:

Jan 12 20:58:46.230 I ns/openshift-sdn daemonset/sdn-controller Deleted pod: sdn-controller-lkj6v
Jan 12 20:58:46.230 I ns/openshift-sdn pod/sdn-7d6g9 Pulling image "registry.svc.ci.openshift.org/ci-op-jbtg7jjb/stable@sha256:f8de726661ce92ee52c4de8498a9f2868a4569b7ae62e59442d09ccbb78302b5"
Jan 12 20:58:46.364 W ns/openshift-controller-manager pod/controller-manager-g9tkl network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Jan 12 20:58:48.379 W ns/openshift-controller-manager pod/controller-manager-g9tkl network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (2 times)
Jan 12 20:58:48.387 W ns/openshift-machine-api pod/cluster-autoscaler-operator-748f454f48-xlbsk network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Jan 12 20:58:48.701 W ns/openshift-operator-lifecycle-manager pod/catalog-operator-86488444c-v4h5q Readiness probe failed: Get http://10.129.0.46:8080/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (2 times)
Jan 12 20:58:49.097 W ns/openshift-apiserver pod/apiserver-zg25k Readiness probe failed: Get https://10.129.0.43:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (2 times)
Jan 12 20:58:49.374 W ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-5c859c6585-kb6ph network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Jan 12 20:58:49.742 W node/ip-10-0-157-152.ec2.internal condition Ready changed
Jan 12 20:58:49.745 I node/ip-10-0-157-152.ec2.internal Node ip-10-0-157-152.ec2.internal status is now: NodeReady (2 times)
Jan 12 20:58:49.882 I ns/openshift-machine-api machine/ci-op-jbtg7jjb-77109-dx8t6-worker-us-east-1b-tn6s4 Updated machine ci-op-jbtg7jjb-77109-dx8t6-worker-us-east-1b-tn6s4 (3 times)
Jan 12 20:58:50.366 W ns/openshift-ingress-operator pod/ingress-operator-8c8c9579c-hph6g network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Jan 12 20:58:50.373 W ns/openshift-controller-manager pod/controller-manager-g9tkl network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (3 times)
Jan 12 20:58:50.381 W ns/openshift-machine-api pod/cluster-autoscaler-operator-748f454f48-xlbsk network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (2 times)

This looks like the following sequence:

1. openshift/sdn on a node is updated
2. 8-12 seconds later openshift-apiserver (on the pod network) fails readiness checks and is taken out of rotation
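
To check this sequence against a full event log, the delay between the sdn pod update and the readiness probe failures can be measured with a small script. The following is only a sketch: the function names, the regular expression, and the `year` parameter are hypothetical, the line format is inferred from the excerpt above, and events are assumed to appear in chronological order. The sample holds three lines copied from that excerpt.

```python
import re
from datetime import datetime

# Assumed line shape, inferred from the excerpt:
#   "<Mon> <DD> <HH:MM:SS.mmm> <I|W|E> <locator and message>"
EVENT_RE = re.compile(r"^(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) ([IWE]) (.*)$")

# Three lines copied verbatim from the excerpt in this report.
SAMPLE = """\
Jan 12 20:58:46.230 I ns/openshift-sdn pod/sdn-7d6g9 Pulling image "registry.svc.ci.openshift.org/ci-op-jbtg7jjb/stable@sha256:f8de726661ce92ee52c4de8498a9f2868a4569b7ae62e59442d09ccbb78302b5"
Jan 12 20:58:48.701 W ns/openshift-operator-lifecycle-manager pod/catalog-operator-86488444c-v4h5q Readiness probe failed: Get http://10.129.0.46:8080/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (2 times)
Jan 12 20:58:49.097 W ns/openshift-apiserver pod/apiserver-zg25k Readiness probe failed: Get https://10.129.0.43:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (2 times)
"""


def parse_events(text, year=2020):
    """Return (timestamp, level, body) tuples for lines that match EVENT_RE."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        # The log omits the year, so the caller supplies one (an assumption).
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S.%f")
        events.append((stamp, match.group(2), match.group(3)))
    return events


def probe_failures_after_sdn_update(events, namespace="openshift-apiserver"):
    """List (seconds_since_sdn_pull, pod) for readiness failures in `namespace`.

    The clock starts at the first "Pulling image" event of an sdn pod;
    events before that point are ignored.
    """
    sdn_start = None
    delays = []
    for stamp, _level, body in events:
        if sdn_start is None:
            if body.startswith("ns/openshift-sdn pod/sdn-") and "Pulling image" in body:
                sdn_start = stamp
            continue
        if body.startswith(f"ns/{namespace} ") and "Readiness probe failed" in body:
            pod = body.split(" ", 2)[1]  # second token, e.g. "pod/apiserver-zg25k"
            delays.append(((stamp - sdn_start).total_seconds(), pod))
    return delays


if __name__ == "__main__":
    for seconds, pod in probe_failures_after_sdn_update(parse_events(SAMPLE)):
        print(f"{pod}: readiness failure {seconds:.3f}s after sdn image pull")
```

The excerpt is only a window into the job log, and the "(2 times)" suffix marks repeated events, so delays computed from these lines alone need not match the 8-12 second figure; running the script over the full event log from the linked job would be the meaningful check.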

At first glance, this would be a very serious bug if upgrading openshift-sdn caused disruption to pods on the pod network.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14098

Comment 1 Clayton Coleman 2020-01-15 04:05:23 UTC
This may be related to bug 1791162, but I am not sure.

Comment 4 Ben Bennett 2020-03-10 13:21:50 UTC

*** This bug has been marked as a duplicate of bug 1785457 ***

Comment 5 Red Hat Bugzilla 2023-09-14 05:49:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days