1912577 – 4.1/4.2->4.3->...-> 4.7 upgrade is stuck during 4.6->4.7 with co/openshift-apiserver Degraded, co/network not Available and several other components pods CrashLoopBackOff

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Dan Winship
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-01-04 19:54 UTC by Anurag saxena
Modified:	2021-04-05 17:46 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:49:41 UTC
Target Upstream Version:
Embargoed:

Attachments
sdn and ovs log (430.00 KB, application/x-tar) 2021-01-05 13:01 UTC, zhaozhanqi

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 932	0	None	closed	Bug 1912577: get rid of support for running OVS in a container	2021-02-18 20:07:03 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:50:01 UTC

Comment 1 Anurag saxena 2021-01-04 20:11:40 UTC

The network pods are fine, though:

$ oc get pods -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-95gs9              1/1     Running   1          6h13m
ovs-9pfvw              1/1     Running   1          6h15m
ovs-dqlf4              1/1     Running   1          6h16m
ovs-g5jqp              1/1     Running   1          6h18m
ovs-g92kw              1/1     Running   1          6h14m
ovs-mxdjc              1/1     Running   1          6h17m
sdn-c6tzp              2/2     Running   1          6h17m
sdn-controller-9kpkv   1/1     Running   0          6h18m
sdn-controller-rbqfz   1/1     Running   0          6h18m
sdn-controller-xtxjj   1/1     Running   0          6h18m
sdn-f5wcp              2/2     Running   0          6h16m
sdn-g8v4x              2/2     Running   1          6h17m
sdn-pln9l              2/2     Running   1          6h18m
sdn-rk47n              2/2     Running   1          6h17m
sdn-rqrlg              2/2     Running   1          6h18m

Comment 2 Xingxing Xia 2021-01-05 04:19:08 UTC

Some info: this bug was reproduced in https://issues.redhat.com/browse/OCPQE-3043, in the following test matrix entries:
1. 04_Disconnected UPI on GCP with RHCOS (FIPS off)
  http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8467/console upgraded from 4.2.0-0.nightly-2020-12-21-150827 to target_build: 4.3.0-0.nightly-2020-12-21-145308,4.4.0-0.nightly-2020-12-21-142921,4.5.0-0.nightly-2020-12-21-141644,4.6.0-0.nightly-2020-12-21-163117,4.7.0-0.nightly-2020-12-21-131655 and hit the bug.
  ** Tried UPI on GCP matrix upgrading from fresh 4.6 to 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **

2. 06_UPI on AWS with RHCOS (FIPS off)
  http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8469/console upgraded from 4.1.41-x86_64 to target_build: 4.2.36-x86_64,4.3.40-x86_64,4.4.31-x86_64,4.5.24-x86_64,4.6.9-x86_64,4.7.0-fc.0-x86_64 and hit the bug.

  https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/8515/console rebuilt upgrade_CI/8469 and reproduced.

https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/8515/console env is still alive: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/129431/artifact/workdir/install-dir/auth/kubeconfig .

** Devs can use it for any debugging. Note that it will be pruned once it is 30h old; it is currently 20h old. **

The upgrade is currently still stuck.
Pods in CrashLoopBackOff, by namespace/component (columns: namespace, pod, ready, status, restarts, age, pod IP, node):
openshift-apiserver             apiserver-76bd95fc9b-ssb95                                        0/2   CrashLoopBackOff   298   14h     10.128.0.30  ip-10-0-52-185.us-east-2.compute.internal
openshift-apiserver             apiserver-76bd95fc9b-x978h                                        0/2   CrashLoopBackOff   299   14h     10.130.0.42  ip-10-0-65-114.us-east-2.compute.internal
openshift-console               downloads-7b9986669d-6cx6p                                        0/1   CrashLoopBackOff   283   14h     10.129.2.22  ip-10-0-62-100.us-east-2.compute.internal
openshift-dns                   dns-default-2bnlt                                                 2/3   CrashLoopBackOff   195   15h     10.129.2.2   ip-10-0-62-100.us-east-2.compute.internal
openshift-ingress               router-default-6ddf567c8f-86jxv                                   0/1   CrashLoopBackOff   231   14h     10.129.2.23  ip-10-0-62-100.us-east-2.compute.internal
openshift-marketplace           installed-custom-openshift-ansible-service-broker-6c9dccc6wllml   0/1   CrashLoopBackOff   166   14h     10.129.2.5   ip-10-0-62-100.us-east-2.compute.internal
openshift-network-diagnostics   network-check-target-jj9tf                                        0/1   CrashLoopBackOff   0     14h     10.129.2.26  ip-10-0-62-100.us-east-2.compute.internal
openshift-oauth-apiserver       apiserver-dd8d67465-zk4tn                                         0/1   CrashLoopBackOff   164   14h     10.129.0.55  ip-10-0-48-231.us-east-2.compute.internal
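
As a rough aid for reading a listing like the one above, the following Python sketch tallies CrashLoopBackOff pods per node. It assumes the eight whitespace-separated columns seen in the pasted rows (namespace, pod, ready, status, restarts, age, pod IP, node); other `oc` versions or output options may print different columns. The function name `crashloop_by_node` is hypothetical, and the sample rows are copied from the listing.

```python
from collections import Counter


def crashloop_by_node(listing: str) -> Counter:
    """Count CrashLoopBackOff pods per node in a pasted pod listing.

    Assumes eight whitespace-separated columns per row, as in the rows
    above: namespace, pod, ready, status, restarts, age, pod IP, node.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and rows in any other status.
        if len(fields) < 8 or fields[3] != "CrashLoopBackOff":
            continue
        counts[fields[7]] += 1
    return counts


# Three rows copied from the listing above.
sample = """\
openshift-apiserver  apiserver-76bd95fc9b-ssb95       0/2  CrashLoopBackOff  298  14h  10.128.0.30  ip-10-0-52-185.us-east-2.compute.internal
openshift-dns        dns-default-2bnlt                2/3  CrashLoopBackOff  195  15h  10.129.2.2   ip-10-0-62-100.us-east-2.compute.internal
openshift-ingress    router-default-6ddf567c8f-86jxv  0/1  CrashLoopBackOff  231  14h  10.129.2.23  ip-10-0-62-100.us-east-2.compute.internal
"""

for node, count in crashloop_by_node(sample).most_common():
    print(node, count)
```

Applied to the full listing, five of the eight pods are on ip-10-0-62-100.us-east-2.compute.internal, which may help decide which node's logs to collect first.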

Output of oc describe for some of the pods above:
$ oc describe po dns-default-2bnlt -n openshift-dns
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
...
  Warning  Unhealthy  12m (x2078 over 14h)    kubelet  Readiness probe failed: Get "http://10.129.2.2:8080/health": dial tcp 10.129.2.2:8080: connect: no route to host
  Warning  Unhealthy  2m37s (x992 over 14h)   kubelet  Liveness probe failed: Get "http://10.129.2.2:8080/health": dial tcp 10.129.2.2:8080: connect: no route to host

$ oc describe pod apiserver-76bd95fc9b-x978h -n openshift-apiserver
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4h13m (x106 over 14h)   kubelet  Readiness probe failed: Get "https://10.130.0.42:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

$ oc describe pod downloads-7b9986669d-6cx6p -n openshift-console
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  ...
  Warning  Unhealthy  24m (x830 over 14h)     kubelet  Readiness probe failed: Get "http://10.129.2.22:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

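Clusteroperators that are not at 4.7.0-fc.0 with Available=True, Progressing=False, Degraded=False (columns: name, version, available, progressing, degraded, since):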
authentication                             4.7.0-fc.0   True    False   True    13h
dns                                        4.6.9        True    False   False   15h
ingress                                    4.7.0-fc.0   True    False   True    13h
machine-config                             4.6.9        True    False   False   14h
network                                    4.7.0-fc.0   False   True    True    14h
openshift-apiserver                        4.7.0-fc.0   True    False   True    109m
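
The filter used for the list above can be sketched in Python as follows. This is only an illustration: it assumes the six whitespace-separated columns seen in the pasted rows (name, version, available, progressing, degraded, since), the function name `operators_not_settled` is hypothetical, and the last sample row is a made-up healthy operator added purely so the filter has something to exclude.

```python
def operators_not_settled(listing: str, target: str) -> list[str]:
    """Return operators that are not at `target` with True/False/False.

    Assumes each row has at least five whitespace-separated columns:
    name, version, available, progressing, degraded.
    """
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and a possible header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, version, available, progressing, degraded = fields[:5]
        settled = (version, available, progressing, degraded) == (
            target, "True", "False", "False")
        if not settled:
            flagged.append(name)
    return flagged


# The first three rows are copied from the list above; the last row is
# hypothetical and does not come from the report.
sample = """\
dns                          4.6.9       True   False  False  15h
network                      4.7.0-fc.0  False  True   True   14h
openshift-apiserver          4.7.0-fc.0  True   False  True   109m
example-healthy-operator     4.7.0-fc.0  True   False  False  13h
"""

print(operators_not_settled(sample, "4.7.0-fc.0"))
```

Note that dns and machine-config appear in the list only because they are still at 4.6.9; their Available/Progressing/Degraded values are already True/False/False.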

$ oc get node # all nodes are "Ready" and still v1.19.0+7070803
ip-10-0-48-231.us-east-2.compute.internal   Ready    master   19h   v1.19.0+7070803
...

Comment 3 Xingxing Xia 2021-01-05 04:24:37 UTC

(In reply to Xingxing Xia from comment #2)
> 1. 04_Disconnected UPI on GCP with RHCOS (FIPS off)
> ** Tried UPI on GCP matrix upgrading from fresh 4.6 to 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **
Typo: the "4.6 to" is redundant. I intended to type:
** Tried UPI on GCP matrix upgrading from fresh 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **

Comment 4 Dan Winship 2021-01-05 12:53:27 UTC

Can you get the OVS and SDN logs from at least one node?

Comment 5 zhaozhanqi 2021-01-05 13:01:20 UTC

Created attachment 1744566 [details]
sdn and ovs log

Comment 6 Dan Winship 2021-01-05 13:26:59 UTC

The OVS log shows "openvswitch is running in container". Something is going wrong in the chained-updates case that makes it think it should be running OVS in a container, so we're getting dueling OVSes again.
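
A minimal Python sketch of the check described in this comment, for anyone wanting to scan logs from several nodes at once. The marker string is the one quoted above; everything else is an assumption for illustration: the function name `nodes_with_containerized_ovs` is hypothetical, and the node names and other log lines in the sample are placeholders, not contents of the attached logs.

```python
# Marker string quoted in the comment above.
MARKER = "openvswitch is running in container"


def nodes_with_containerized_ovs(logs: dict[str, str]) -> list[str]:
    """Return the nodes whose OVS log text contains the marker line.

    `logs` maps a node name to the text of that node's OVS log.
    """
    return sorted(node for node, text in logs.items() if MARKER in text)


# Placeholder input: node names and surrounding lines are hypothetical.
sample_logs = {
    "node-a": "placeholder line\nopenvswitch is running in container\n",
    "node-b": "placeholder line\n",
}

print(nodes_with_containerized_ovs(sample_logs))
```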

Comment 11 errata-xmlrpc 2021-02-24 15:49:41 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 12 W. Trevor King 2021-04-05 17:46:57 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
