Description of problem:
API Services unavailable after upgrade from 4.5.38 to 4.6.27 using UPI on GCP with RHCOS & RHEL7.9 & XPN cluster

Version-Release number of selected component (if applicable):
4.6.27

How reproducible:

Steps to Reproduce:

Status:
  Conditions:
    Last Transition Time:  2021-05-06T13:05:36Z
    Message:               APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (crashlooping container is waiting in apiserver-855458b9cf-vjdzr pod)
    Reason:                APIServerDeployment_UnavailablePod
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-05-06T12:47:22Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-05-06T13:02:27Z
    Message:               APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    Reason:                APIServices_Error
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-05-06T10:48:28Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   openshiftapiservers
    Group:
    Name:       openshift-config
    Resource:   namespaces
    Group:
    Name:       openshift-config-managed
    Resource:   namespaces
    Group:
    Name:       openshift-apiserver-operator
    Resource:   namespaces
    Group:
    Name:       openshift-apiserver
    Resource:   namespaces
    Group:
    Name:       openshift-etcd-operator
    Resource:   namespaces
    Group:
    Name:       host-etcd-2
    Namespace:  openshift-etcd
    Resource:   endpoints
    Group:      controlplane.operator.openshift.io
    Name:
    Namespace:  openshift-apiserver
    Resource:   podnetworkconnectivitychecks
    Group:      apiregistration.k8s.io
    Name:       v1.apps.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.authorization.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.build.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.image.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.oauth.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.project.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.quota.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.route.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.security.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.template.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.user.openshift.io
    Resource:   apiservices
  Versions:
    Name:     operator
    Version:  4.6.27
    Name:     openshift-apiserver
    Version:  4.6.27
Events:  <none>

Actual results:

Expected results:

Additional info:
must-gather http://file.rdu.redhat.com/rpattath/must-gather.local.532156088658246876.tar.gz
Connectivity to the API service IP is broken in apiserver-855458b9cf-vjdzr:

2021-05-06T16:05:50.850545399Z F0506 16:05:50.850455 1 cmd.go:72] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s": dial tcp 172.30.0.1:443: connect: no route to host

Sending to networking.
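For context on where that failure surfaces, here is a minimal, illustrative client-go sketch (not the actual openshift-apiserver code) of the call that is crashing the pod: fetching the extension-apiserver-authentication ConfigMap from kube-system over the service network. When the node's pod networking is broken, the Get fails with exactly the "no route to host" error above, pointing at the SDN rather than the API server itself.

// Illustrative only: an in-cluster fetch of the configmap the apiserver
// pod needs at startup. All wiring here is standard client-go, not the
// openshift-apiserver implementation.
package main

import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // Resolves to the service network IP (https://172.30.0.1:443 on this cluster).
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
        context.TODO(), "extension-apiserver-authentication", metav1.GetOptions{})
    if err != nil {
        // With the node's SDN flows missing, this is where
        // "dial tcp 172.30.0.1:443: connect: no route to host" shows up.
        log.Fatal(err)
    }
    log.Printf("got configmap with %d keys", len(cm.Data))
}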
I understand what the problem is; I now need to think about a solution.

The problem: the SDN pod has a wait loop in its code that waits for OVS to be running before it proceeds to initialize itself. However, I am seeing the 4.6 SDN pod start while the 4.5 OVS pod is still running; it therefore concludes that the OVS replica on the node is OK and proceeds. Seconds later the 4.5 OVS replica is terminated and upgraded to the 4.6 replica. During that time window, everything processed that requires pod-level networking errors out (for example: apiserver-855458b9cf-vjdzr). A sketch of this race follows at the end of this comment.

The 4.6 SDN replica is started and logs:

2021-05-06T13:01:54.04135271Z I0506 13:01:54.041207 135002 cmd.go:121] Reading proxy configuration from /config/kube-proxy-config.yaml
2021-05-06T13:01:54.042924713Z I0506 13:01:54.042871 135002 feature_gate.go:243] feature gates: &{map[]}
2021-05-06T13:01:54.043051804Z I0506 13:01:54.043011 135002 cmd.go:216] Watching config file /config/kube-proxy-config.yaml for changes
2021-05-06T13:01:54.043152862Z I0506 13:01:54.043119 135002 cmd.go:216] Watching config file /config/..2021_05_06_13_01_42.789393572/kube-proxy-config.yaml for changes
2021-05-06T13:01:54.084832668Z I0506 13:01:54.084754 135002 node.go:152] Initializing SDN node "ugd-14008-mrkr2-m-2.c.openshift-qe.internal" (10.0.0.53) of type "redhat/openshift-ovs-networkpolicy"
2021-05-06T13:01:54.085296844Z I0506 13:01:54.085221 135002 cmd.go:159] Starting node networking (v0.0.0-alpha.0-246-g8210fdc1)
2021-05-06T13:01:54.085296844Z I0506 13:01:54.085244 135002 node.go:340] Starting openshift-sdn network plugin
2021-05-06T13:01:54.414206178Z I0506 13:01:54.414116 135002 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
2021-05-06T13:01:54.863833659Z I0506 13:01:54.863769 135002 node.go:389] Starting openshift-sdn pod manager
2021-05-06T13:01:54.864511982Z I0506 13:01:54.864451 135002 node.go:443] Reattaching pod 'openshift-service-ca/service-ca-8d744cd4-kgthq' to SDN
2021-05-06T13:01:54.947997196Z I0506 13:01:54.947912 135002 pod.go:508] CNI_ADD openshift-service-ca/service-ca-8d744cd4-kgthq got IP 10.128.0.65, ofport 3
2021-05-06T13:01:54.948139894Z I0506 13:01:54.948112 135002 node.go:443] Reattaching pod 'openshift-oauth-apiserver/apiserver-69455fcd4c-pvlz9' to SDN
2021-05-06T13:01:55.039529313Z I0506 13:01:55.039409 135002 pod.go:508] CNI_ADD openshift-oauth-apiserver/apiserver-69455fcd4c-pvlz9 got IP 10.128.0.68, ofport 4
2021-05-06T13:01:55.039869779Z I0506 13:01:55.039812 135002 node.go:443] Reattaching pod 'openshift-dns-operator/dns-operator-7f88f58858-gs9qb' to SDN
2021-05-06T13:01:55.131654614Z I0506 13:01:55.131581 135002 pod.go:508] CNI_ADD openshift-dns-operator/dns-operator-7f88f58858-gs9qb got IP 10.128.0.9, ofport 5
2021-05-06T13:01:55.131765423Z I0506 13:01:55.131743 135002 node.go:443] Reattaching pod 'openshift-apiserver/apiserver-855458b9cf-vjdzr' to SDN
2021-05-06T13:01:55.21618871Z I0506 13:01:55.216119 135002 pod.go:508] CNI_ADD openshift-apiserver/apiserver-855458b9cf-vjdzr got IP 10.128.0.53, ofport 6
2021-05-06T13:01:55.216300817Z I0506 13:01:55.216273 135002 node.go:443] Reattaching pod 'openshift-ingress-operator/ingress-operator-789b58d964-59whm' to SDN
2021-05-06T13:01:55.298253796Z I0506 13:01:55.298184 135002 pod.go:508] CNI_ADD openshift-ingress-operator/ingress-operator-789b58d964-59whm got IP 10.128.0.60, ofport 7
2021-05-06T13:01:55.298344822Z I0506 13:01:55.298324 135002 node.go:443] Reattaching pod 'openshift-cluster-storage-operator/csi-snapshot-controller-566bb766f7-dsnxb' to SDN
2021-05-06T13:01:55.376391929Z I0506 13:01:55.376314 135002 pod.go:508] CNI_ADD openshift-cluster-storage-operator/csi-snapshot-controller-566bb766f7-dsnxb got IP 10.128.0.62, ofport 8
2021-05-06T13:01:55.376503858Z I0506 13:01:55.376479 135002 node.go:443] Reattaching pod 'openshift-authentication-operator/authentication-operator-77bc654b6c-s6btj' to SDN
2021-05-06T13:01:55.45834169Z I0506 13:01:55.458271 135002 pod.go:508] CNI_ADD openshift-authentication-operator/authentication-operator-77bc654b6c-s6btj got IP 10.128.0.55, ofport 9
2021-05-06T13:01:55.458458818Z I0506 13:01:55.458428 135002 node.go:443] Reattaching pod 'openshift-console-operator/console-operator-868dd878c-8j7z9' to SDN
2021-05-06T13:01:55.555805515Z I0506 13:01:55.555742 135002 pod.go:508] CNI_ADD openshift-console-operator/console-operator-868dd878c-8j7z9 got IP 10.128.0.59, ofport 10
2021-05-06T13:01:55.555928798Z I0506 13:01:55.555906 135002 node.go:443] Reattaching pod 'openshift-multus/multus-admission-controller-2rrdj' to SDN
2021-05-06T13:01:55.671468988Z I0506 13:01:55.671415 135002 pod.go:508] CNI_ADD openshift-multus/multus-admission-controller-2rrdj got IP 10.128.0.7, ofport 11
2021-05-06T13:01:55.671655722Z I0506 13:01:55.671638 135002 node.go:443] Reattaching pod 'openshift-multus/network-metrics-daemon-57f4w' to SDN
2021-05-06T13:01:55.73830363Z I0506 13:01:55.738226 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: OpenFlow receive failed (End of file)
2021-05-06T13:01:56.087907254Z I0506 13:01:56.087818 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:56.245006373Z I0506 13:01:56.244899 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:56.594228283Z I0506 13:01:56.594160 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:56.876458321Z I0506 13:01:56.876394 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:57.226701933Z I0506 13:01:57.226632 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:57.666125566Z I0506 13:01:57.665981 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:58.014223508Z I0506 13:01:58.014108 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:58.649460518Z I0506 13:01:58.649381 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:58.996941634Z I0506 13:01:58.996887 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:59.877080318Z I0506 13:01:59.877014 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:00.224342925Z I0506 13:02:00.224290 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:01.409368886Z I0506 13:02:01.409292 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:01.756231495Z I0506 13:02:01.756166 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:03.32379078Z I0506 13:02:03.323701 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:03.674365555Z I0506 13:02:03.674314 135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)

From the kubelet logs:

The OVS 4.5 replica is stopped:

May 06 13:01:57.166026 ugd-14008-mrkr2-m-2.c.openshift-qe.internal hyperkube[1534]: I0506 13:01:57.165492 1534 kubelet_pods.go:966] Pod "ovs-ktjdm_openshift-sdn(fc8df565-c061-4e0d-84e4-555e3a03791b)" is terminated, but some volumes have not been cleaned up

The OVS 4.6 replica starts:

May 06 13:02:02.353326 ugd-14008-mrkr2-m-2.c.openshift-qe.internal hyperkube[1534]: I0506 13:02:02.352084 1534 kubelet.go:1926] SyncLoop (ADD, "api"): "ovs-cr4fs_openshift-sdn(2f94696a-b4d0-4342-954f-b589a1c2929c)"

Presumably the fix here is to have the CNO hold the rollout of the SDN pods until the OVS rollout is finished. However, since this problem should only happen on 4.5 -> 4.6 upgrades (and 4.5 is reaching EOL soon), I will need to verify the complexity of such a fix.
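To make the race described at the top of this comment concrete, here is a minimal sketch of the kind of "wait for OVS, then initialize" loop involved. The function names and the socket-based readiness probe are assumptions for illustration, not the actual openshift-sdn code; the point is only that a check like this can pass against the old 4.5 OVS replica moments before the kubelet tears it down.

// Hypothetical sketch of the wait loop: the SDN process polls for a running
// OVS before initializing. The probe and names are assumptions, not the
// real implementation.
package main

import (
    "fmt"
    "net"
    "time"
)

// ovsIsRunning reports whether ovs-vswitchd appears to be up by probing its
// local management socket.
func ovsIsRunning() bool {
    conn, err := net.DialTimeout("unix", "/var/run/openvswitch/db.sock", time.Second)
    if err != nil {
        return false
    }
    conn.Close()
    return true
}

// waitForOVS blocks until OVS looks healthy or the timeout expires. The race
// described above: this can return success while the *old* 4.5 OVS replica
// is still running, seconds before it is replaced by the 4.6 one.
func waitForOVS(timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if ovsIsRunning() {
            return nil
        }
        time.Sleep(time.Second)
    }
    return fmt.Errorf("timed out waiting for OVS after %v", timeout)
}

func main() {
    if err := waitForOVS(2 * time.Minute); err != nil {
        panic(err)
    }
    // SDN node initialization would start here; if OVS is replaced right
    // after this point, flow programming fails until a health check
    // (started later) notices.
    fmt.Println("OVS detected, starting SDN node initialization")
}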
Well, the CNO fix will need to happen on 4.6, and that is not reaching EOL soon.
SDN is supposed to exit and restart if OVS restarts. So we *should* see:

- sdn 4.6 starts
- sdn starts setting up the node against 4.5 ovs
- ovs restarts
- sdn detects ovs restart and exits
- ovs restart completes
- sdn 4.6 starts again
- sdn tears down the half-constructed bridge from its previous run and starts setting up again

But it looks like SDN only starts the OVS health check after it finishes initializing... so we're getting:

- sdn 4.6 starts
- sdn starts setting up the node against 4.5 ovs
- ovs restarts
- some parts of sdn setup fail because ovs is restarting
- ovs restart completes
- sdn sets up ovs health checking against 4.6 ovs

So probably the fix is to start the OVS health check sooner. Or else fix the error checking so that the "some parts of sdn setup fail" becomes a fatal error. (We might have already done that but it's not in 4.6? I remember some error checking fixes at some point...)
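A minimal sketch of the "start the OVS health check sooner" idea, assuming a simple socket probe; the names are hypothetical and this is not the real healthcheck.go. The only change relative to the observed behaviour is that the watcher is started before node setup, so an OVS restart mid-initialization is fatal and the pod starts over cleanly.

// Illustrative only: an OVS liveness watcher started *before* node setup,
// mirroring the "detect ovs restart and exit" step in the intended sequence.
package main

import (
    "log"
    "net"
    "os"
    "time"
)

func ovsAlive() bool {
    conn, err := net.DialTimeout("unix", "/var/run/openvswitch/db.sock", time.Second)
    if err != nil {
        return false
    }
    conn.Close()
    return true
}

// watchOVS exits the process as soon as OVS stops answering, so the pod
// restarts and re-runs setup against the new OVS.
func watchOVS() {
    for range time.Tick(2 * time.Second) {
        if !ovsAlive() {
            log.Println("OVS appears to have restarted; exiting so the SDN pod restarts")
            os.Exit(1)
        }
    }
}

func main() {
    // Started before (not after) node initialization, so a 4.5 -> 4.6 OVS
    // rollout that lands mid-setup is treated as fatal rather than leaving
    // half-applied flows behind.
    go watchOVS()

    log.Println("starting SDN node setup...")
    time.Sleep(10 * time.Minute) // placeholder for the real initialization work
}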
> We might have already done that but it's not in 4.6? I remember some error checking fixes at some point...

I think you're remembering https://github.com/openshift/sdn/pull/222, which we closed because we didn't understand what had happened and didn't want to implement a half-fix for something whose root cause we never uncovered. I re-visited the BZ https://bugzilla.redhat.com/show_bug.cgi?id=1893067 and the must-gather provided there, and the logs in the SDN pods are almost identical to this one. The same issue still applies in this case: I am struggling to work out what is failing and causing the OVS errors we see, yet without causing the SDN pod itself to fail somehow. "Reattaching pod.*" followed by "CNI_ADD.*got IP" means we didn't encounter any errors during pod setup. And if our OVS health check failed (while dumping flows, for example): https://github.com/openshift/sdn/blob/master/pkg/network/node/healthcheck.go#L75 we would end up doing https://github.com/openshift/sdn/blob/master/pkg/network/node/healthcheck.go#L84 and seeing it in the logs... I cannot find what in the code could tolerate failing OVS execs without triggering or logging an error...
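Purely for illustration, this is the kind of retry-and-log exec wrapper that could produce the observed bursts of "Error executing ovs-ofctl" without anything ever becoming fatal; it is a guess at the pattern, not the actual ovs.go code, and the retry budget and helper names are made up.

// Sketch of a wrapper that retries ovs-ofctl, logging every failure. With a
// retry budget long enough to outlast the OVS restart (or a caller that
// ignores the final error), the only visible symptom is the log pattern
// above: repeated errors followed by silence.
package main

import (
    "log"
    "os/exec"
    "time"
)

func execOfctlRetry(maxTries int, args ...string) ([]byte, error) {
    var out []byte
    var err error
    for i := 0; i < maxTries; i++ {
        out, err = exec.Command("ovs-ofctl", args...).CombinedOutput()
        if err == nil {
            return out, nil
        }
        log.Printf("Error executing ovs-ofctl: %v (%s)", err, out)
        time.Sleep(300 * time.Millisecond)
    }
    return out, err
}

func main() {
    // On a node where br0 is being recreated, this would log several
    // failures and then succeed once the new OVS is up, never exiting.
    if _, err := execOfctlRetry(20, "-O", "OpenFlow13", "dump-flows", "br0"); err != nil {
        log.Fatalf("giving up: %v", err)
    }
}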
*** Bug 1960199 has been marked as a duplicate of this bug. ***
*** Bug 1976342 has been marked as a duplicate of this bug. ***
No regression found in the 4.9 nightly build and the upgrade looks good. Moving this bug to verified.
*** Bug 2003445 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759