Bug 1958390 - API Services unavailable after upgrade from 4.5.38 to 4.6.27
Summary: API Services unavailable after upgrade from 4.5.38 to 4.6.27
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1960199 1976342 2003445
Depends On:
Blocks: 1885867 1999946
 
Reported: 2021-05-07 19:37 UTC by Roshni
Modified: 2021-10-18 17:31 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1999946
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/sdn pull 306 (open): Bug 1958390: improve SDN's OVS healthcheck and logging (last updated 2021-05-25 12:54:09 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:31:28 UTC)

Description Roshni 2021-05-07 19:37:13 UTC
Description of problem:
API Services unavailable after upgrade from 4.5.38 to 4.6.27 using UPI on GCP with RHCOS & RHEL7.9 & XPN cluster

Version-Release number of selected component (if applicable):
4.6.27

How reproducible:


Steps to Reproduce:
11:33:14  Status:
11:33:14    Conditions:
11:33:14      Last Transition Time:  2021-05-06T13:05:36Z
11:33:14      Message:               APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (crashlooping container is waiting in apiserver-855458b9cf-vjdzr pod)
11:33:14      Reason:                APIServerDeployment_UnavailablePod
11:33:14      Status:                True
11:33:14      Type:                  Degraded
11:33:14      Last Transition Time:  2021-05-06T12:47:22Z
11:33:14      Reason:                AsExpected
11:33:14      Status:                False
11:33:14      Type:                  Progressing
11:33:14      Last Transition Time:  2021-05-06T13:02:27Z
11:33:14      Message:               APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14  APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
11:33:14      Reason:                APIServices_Error
11:33:14      Status:                False
11:33:14      Type:                  Available
11:33:14      Last Transition Time:  2021-05-06T10:48:28Z
11:33:14      Reason:                AsExpected
11:33:14      Status:                True
11:33:14      Type:                  Upgradeable
11:33:14    Extension:               <nil>
11:33:14    Related Objects:
11:33:14      Group:      operator.openshift.io
11:33:14      Name:       cluster
11:33:14      Resource:   openshiftapiservers
11:33:14      Group:      
11:33:14      Name:       openshift-config
11:33:14      Resource:   namespaces
11:33:14      Group:      
11:33:14      Name:       openshift-config-managed
11:33:14      Resource:   namespaces
11:33:14      Group:      
11:33:14      Name:       openshift-apiserver-operator
11:33:14      Resource:   namespaces
11:33:14      Group:      
11:33:14      Name:       openshift-apiserver
11:33:14      Resource:   namespaces
11:33:14      Group:      
11:33:14      Name:       openshift-etcd-operator
11:33:14      Resource:   namespaces
11:33:14      Group:      
11:33:14      Name:       host-etcd-2
11:33:14      Namespace:  openshift-etcd
11:33:14      Resource:   endpoints
11:33:14      Group:      controlplane.operator.openshift.io
11:33:14      Name:       
11:33:14      Namespace:  openshift-apiserver
11:33:14      Resource:   podnetworkconnectivitychecks
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.apps.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.authorization.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.build.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.image.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.oauth.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.project.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.quota.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.route.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.security.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.template.openshift.io
11:33:14      Resource:   apiservices
11:33:14      Group:      apiregistration.k8s.io
11:33:14      Name:       v1.user.openshift.io
11:33:14      Resource:   apiservices
11:33:14    Versions:
11:33:14      Name:     operator
11:33:14      Version:  4.6.27
11:33:14      Name:     openshift-apiserver
11:33:14      Version:  4.6.27
11:33:14  Events:       <none>
Actual results:


Expected results:


Additional info:
must-gather http://file.rdu.redhat.com/rpattath/must-gather.local.532156088658246876.tar.gz

Comment 1 Stefan Schimanski 2021-05-08 07:34:09 UTC
The API service IP is broken in apiserver-855458b9cf-vjdzr:

2021-05-06T16:05:50.850545399Z F0506 16:05:50.850455       1 cmd.go:72] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s": dial tcp 172.30.0.1:443: connect: no route to host

Sending to networking.

Comment 2 Alexander Constantinescu 2021-05-19 10:59:29 UTC
I understand what the problem is; I now need to think about a solution.

The problem: the SDN pod has a wait loop in its code that waits for OVS to be running before it proceeds to initialize itself. However, I am seeing the 4.6 SDN pod start while the OVS 4.5 pod is still running; it therefore thinks the OVS replica on the node is OK and proceeds. Seconds later the OVS 4.5 replica is terminated and upgraded to the 4.6 replica. During that time window, all events that require pod-level networking error out (for example: apiserver-855458b9cf-vjdzr).
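
In other words, the sequencing problem boils down to something like the following minimal Go sketch (illustrative only; waitForOVS and ovsAlive are hypothetical names, not the actual openshift/sdn code):

package main

import (
	"log"
	"time"
)

// waitForOVS polls until an OVS daemon answers. It cannot distinguish the
// 4.5 replica that is about to be terminated from the upgraded 4.6 one.
func waitForOVS(ovsAlive func() bool) {
	for !ovsAlive() {
		time.Sleep(time.Second)
	}
}

func main() {
	ovsAlive := func() bool { return true } // the 4.5 OVS pod still answers here

	waitForOVS(ovsAlive) // one-shot check passes against the old replica
	// ... SDN initialization proceeds: bridge setup, reattaching pods (CNI_ADD) ...
	// Meanwhile the 4.5 OVS pod is terminated and replaced by the 4.6 one; every
	// ovs-ofctl call in that window fails ("Connection refused"), but nothing
	// treats the failures as fatal, so the SDN pod never restarts.
	log.Println("SDN finished initializing against an OVS replica that is gone")
}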


SDN 4.6 replica is started and logs:

2021-05-06T13:01:54.04135271Z I0506 13:01:54.041207  135002 cmd.go:121] Reading proxy configuration from /config/kube-proxy-config.yaml
2021-05-06T13:01:54.042924713Z I0506 13:01:54.042871  135002 feature_gate.go:243] feature gates: &{map[]}
2021-05-06T13:01:54.043051804Z I0506 13:01:54.043011  135002 cmd.go:216] Watching config file /config/kube-proxy-config.yaml for changes
2021-05-06T13:01:54.043152862Z I0506 13:01:54.043119  135002 cmd.go:216] Watching config file /config/..2021_05_06_13_01_42.789393572/kube-proxy-config.yaml for changes
2021-05-06T13:01:54.084832668Z I0506 13:01:54.084754  135002 node.go:152] Initializing SDN node "ugd-14008-mrkr2-m-2.c.openshift-qe.internal" (10.0.0.53) of type "redhat/openshift-ovs-networkpolicy"
2021-05-06T13:01:54.085296844Z I0506 13:01:54.085221  135002 cmd.go:159] Starting node networking (v0.0.0-alpha.0-246-g8210fdc1)
2021-05-06T13:01:54.085296844Z I0506 13:01:54.085244  135002 node.go:340] Starting openshift-sdn network plugin
2021-05-06T13:01:54.414206178Z I0506 13:01:54.414116  135002 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
2021-05-06T13:01:54.863833659Z I0506 13:01:54.863769  135002 node.go:389] Starting openshift-sdn pod manager
2021-05-06T13:01:54.864511982Z I0506 13:01:54.864451  135002 node.go:443] Reattaching pod 'openshift-service-ca/service-ca-8d744cd4-kgthq' to SDN
2021-05-06T13:01:54.947997196Z I0506 13:01:54.947912  135002 pod.go:508] CNI_ADD openshift-service-ca/service-ca-8d744cd4-kgthq got IP 10.128.0.65, ofport 3
2021-05-06T13:01:54.948139894Z I0506 13:01:54.948112  135002 node.go:443] Reattaching pod 'openshift-oauth-apiserver/apiserver-69455fcd4c-pvlz9' to SDN
2021-05-06T13:01:55.039529313Z I0506 13:01:55.039409  135002 pod.go:508] CNI_ADD openshift-oauth-apiserver/apiserver-69455fcd4c-pvlz9 got IP 10.128.0.68, ofport 4
2021-05-06T13:01:55.039869779Z I0506 13:01:55.039812  135002 node.go:443] Reattaching pod 'openshift-dns-operator/dns-operator-7f88f58858-gs9qb' to SDN
2021-05-06T13:01:55.131654614Z I0506 13:01:55.131581  135002 pod.go:508] CNI_ADD openshift-dns-operator/dns-operator-7f88f58858-gs9qb got IP 10.128.0.9, ofport 5
2021-05-06T13:01:55.131765423Z I0506 13:01:55.131743  135002 node.go:443] Reattaching pod 'openshift-apiserver/apiserver-855458b9cf-vjdzr' to SDN
2021-05-06T13:01:55.21618871Z I0506 13:01:55.216119  135002 pod.go:508] CNI_ADD openshift-apiserver/apiserver-855458b9cf-vjdzr got IP 10.128.0.53, ofport 6
2021-05-06T13:01:55.216300817Z I0506 13:01:55.216273  135002 node.go:443] Reattaching pod 'openshift-ingress-operator/ingress-operator-789b58d964-59whm' to SDN
2021-05-06T13:01:55.298253796Z I0506 13:01:55.298184  135002 pod.go:508] CNI_ADD openshift-ingress-operator/ingress-operator-789b58d964-59whm got IP 10.128.0.60, ofport 7
2021-05-06T13:01:55.298344822Z I0506 13:01:55.298324  135002 node.go:443] Reattaching pod 'openshift-cluster-storage-operator/csi-snapshot-controller-566bb766f7-dsnxb' to SDN
2021-05-06T13:01:55.376391929Z I0506 13:01:55.376314  135002 pod.go:508] CNI_ADD openshift-cluster-storage-operator/csi-snapshot-controller-566bb766f7-dsnxb got IP 10.128.0.62, ofport 8
2021-05-06T13:01:55.376503858Z I0506 13:01:55.376479  135002 node.go:443] Reattaching pod 'openshift-authentication-operator/authentication-operator-77bc654b6c-s6btj' to SDN
2021-05-06T13:01:55.45834169Z I0506 13:01:55.458271  135002 pod.go:508] CNI_ADD openshift-authentication-operator/authentication-operator-77bc654b6c-s6btj got IP 10.128.0.55, ofport 9
2021-05-06T13:01:55.458458818Z I0506 13:01:55.458428  135002 node.go:443] Reattaching pod 'openshift-console-operator/console-operator-868dd878c-8j7z9' to SDN
2021-05-06T13:01:55.555805515Z I0506 13:01:55.555742  135002 pod.go:508] CNI_ADD openshift-console-operator/console-operator-868dd878c-8j7z9 got IP 10.128.0.59, ofport 10
2021-05-06T13:01:55.555928798Z I0506 13:01:55.555906  135002 node.go:443] Reattaching pod 'openshift-multus/multus-admission-controller-2rrdj' to SDN
2021-05-06T13:01:55.671468988Z I0506 13:01:55.671415  135002 pod.go:508] CNI_ADD openshift-multus/multus-admission-controller-2rrdj got IP 10.128.0.7, ofport 11
2021-05-06T13:01:55.671655722Z I0506 13:01:55.671638  135002 node.go:443] Reattaching pod 'openshift-multus/network-metrics-daemon-57f4w' to SDN
2021-05-06T13:01:55.73830363Z I0506 13:01:55.738226  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: OpenFlow receive failed (End of file)
2021-05-06T13:01:56.087907254Z I0506 13:01:56.087818  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:56.245006373Z I0506 13:01:56.244899  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:56.594228283Z I0506 13:01:56.594160  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:56.876458321Z I0506 13:01:56.876394  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:57.226701933Z I0506 13:01:57.226632  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:57.666125566Z I0506 13:01:57.665981  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:58.014223508Z I0506 13:01:58.014108  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:58.649460518Z I0506 13:01:58.649381  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:58.996941634Z I0506 13:01:58.996887  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:01:59.877080318Z I0506 13:01:59.877014  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:00.224342925Z I0506 13:02:00.224290  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:01.409368886Z I0506 13:02:01.409292  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:01.756231495Z I0506 13:02:01.756166  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:03.32379078Z I0506 13:02:03.323701  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
2021-05-06T13:02:03.674365555Z I0506 13:02:03.674314  135002 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)

From the kubelet logs:

OVS 4.5 replica is stopped

May 06 13:01:57.166026 ugd-14008-mrkr2-m-2.c.openshift-qe.internal hyperkube[1534]: I0506 13:01:57.165492    1534 kubelet_pods.go:966] Pod "ovs-ktjdm_openshift-sdn(fc8df565-c061-4e0d-84e4-555e3a03791b)" is terminated, but some volumes have not been cleaned up

OVS 4.6 replica starts

May 06 13:02:02.353326 ugd-14008-mrkr2-m-2.c.openshift-qe.internal hyperkube[1534]: I0506 13:02:02.352084    1534 kubelet.go:1926] SyncLoop (ADD, "api"): "ovs-cr4fs_openshift-sdn(2f94696a-b4d0-4342-954f-b589a1c2929c)"

Presumably the fix here is to have the CNO delay the rollout of the SDN pod until the OVS rollout is finished. However, since this problem should only happen on 4.5 -> 4.6 upgrades (and 4.5 is reaching EOL soon), I will need to verify the complexity of such a fix.

Comment 3 Alexander Constantinescu 2021-05-19 11:01:41 UTC
Well, the CNO fix will need to happen on 4.6, and that is not reaching EOL soon.

Comment 4 Dan Winship 2021-05-19 14:14:44 UTC
SDN is supposed to exit and restart if OVS restarts. So we *should* see:

  - sdn 4.6 starts
  - sdn starts setting up the node against 4.5 ovs
  - ovs restarts
  - sdn detects ovs restart and exits
  - ovs restart completes
  - sdn 4.6 starts again
  - sdn tears down the half-constructed bridge from its previous run and starts setting up again

But it looks like SDN only starts the OVS health check after it finishes initializing... so we're getting:

  - sdn 4.6 starts
  - sdn starts setting up the node against 4.5 ovs
  - ovs restarts
  - some parts of sdn setup fail because ovs is restarting
  - ovs restart completes
  - sdn sets up ovs health checking against 4.6 ovs

So probably the fix is to start the OVS health check sooner. Or else fix the error checking so that the "some parts of sdn setup fail" becomes a fatal error. (We might have already done that but it's not in 4.6? I remember some error checking fixes at some point...)
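
For illustration, the first option could look roughly like the sketch below, assuming a hypothetical runOVSHealthCheck helper started before node setup begins; this is only a sketch of the ordering, not the change that eventually merged via openshift/sdn pull 306:

package main

import (
	"log"
	"os"
	"time"
)

// runOVSHealthCheck probes OVS on a timer and exits the process on failure,
// so the kubelet restarts the SDN pod against the new OVS replica.
func runOVSHealthCheck(probe func() error) {
	go func() {
		for range time.Tick(2 * time.Second) {
			if err := probe(); err != nil {
				log.Printf("OVS health check failed: %v; exiting", err)
				os.Exit(1)
			}
		}
	}()
}

func main() {
	probe := func() error { return nil } // e.g. run "ovs-ofctl dump-flows br0"

	runOVSHealthCheck(probe) // started before node setup, not after it finishes
	// ... node setup: bridge creation, pod reattachment, proxy startup ...
	log.Println("node setup finished with OVS health checking already active")
}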

Comment 5 Alexander Constantinescu 2021-05-20 11:12:24 UTC
> We might have already done that but it's not in 4.6? I remember some error checking fixes at some point...

I think you're remembering https://github.com/openshift/sdn/pull/222, which we closed because we didn't understand what had happened and didn't want to implement a half-fix for something whose root cause we never uncovered.

I re-visited BZ https://bugzilla.redhat.com/show_bug.cgi?id=1893067 and the must-gather provided there; the logs in the SDN pods are almost identical to the ones here.

In this case the same issue still applies: I am struggling to work out what is failing and causing the OVS errors we see, while not causing the SDN pod itself to fail in some way.

"Reattaching pod.*"

followed by

"CNI_ADD.*got IP" 

means we didn't encounter any errors with the pod setup. 

And if our OVS healthcheck fails (while dumping flows, for example): https://github.com/openshift/sdn/blob/master/pkg/network/node/healthcheck.go#L75 we would end up doing https://github.com/openshift/sdn/blob/master/pkg/network/node/healthcheck.go#L84 and would see that in the logs...

I don't understand, and fail to find, what in the code could accept failing OVS execs without triggering or logging an error...
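
For what it's worth, the repeated info-level "Error executing ovs-ofctl ... Connection refused" lines above would be consistent with an exec wrapper that retries and only logs failures, roughly like this hypothetical sketch (not the actual ovs.go code):

package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// execOfctlWithRetry runs ovs-ofctl, logging each failure and retrying with
// backoff. Callers only see an error if every attempt fails, which is why an
// OVS restart window produces log noise but never becomes fatal for the pod.
func execOfctlWithRetry(args ...string) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt < 15; attempt++ {
		out, err := exec.Command("ovs-ofctl", args...).CombinedOutput()
		if err == nil {
			return out, nil
		}
		lastErr = err
		log.Printf("Error executing ovs-ofctl: %s", out)
		time.Sleep(time.Duration(attempt+1) * 300 * time.Millisecond)
	}
	return nil, fmt.Errorf("ovs-ofctl kept failing: %w", lastErr)
}

func main() {
	if _, err := execOfctlWithRetry("dump-flows", "br0"); err != nil {
		log.Printf("giving up: %v", err)
	}
}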

Comment 6 Alexander Constantinescu 2021-06-07 22:03:53 UTC
*** Bug 1960199 has been marked as a duplicate of this bug. ***

Comment 7 Alexander Constantinescu 2021-07-05 11:47:13 UTC
*** Bug 1976342 has been marked as a duplicate of this bug. ***

Comment 10 zhaozhanqi 2021-09-01 06:29:07 UTC
No regression found in the 4.9 nightly build, and the upgrade looks good. Moving this bug to verified.

Comment 12 Alexander Constantinescu 2021-10-12 08:10:38 UTC
*** Bug 2003445 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

