Bug 1912577 - 4.1/4.2->4.3->...-> 4.7 upgrade is stuck during 4.6->4.7 with co/openshift-apiserver Degraded, co/network not Available and several other components pods CrashLoopBackOff
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dan Winship
QA Contact: zhaozhanqi
Depends On:
Reported: 2021-01-04 19:54 UTC by Anurag saxena
Modified: 2021-04-05 17:46 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2021-02-24 15:49:41 UTC
Target Upstream Version:

Attachments
sdn and ovs log (430.00 KB, application/x-tar)
2021-01-05 13:01 UTC, zhaozhanqi

System: Github; ID: openshift cluster-network-operator pull 932; Private: 0; Priority: None; Status: closed; Summary: Bug 1912577: get rid of support for running OVS in a container; Last Updated: 2021-02-18 20:07:03 UTC
System: Red Hat Product Errata; ID: RHSA-2020:5633; Private: 0; Priority: None; Status: None; Summary: None; Last Updated: 2021-02-24 15:50:01 UTC

Comment 1 Anurag saxena 2021-01-04 20:11:40 UTC
network pods are fine though

$ oc get pods -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-95gs9              1/1     Running   1          6h13m
ovs-9pfvw              1/1     Running   1          6h15m
ovs-dqlf4              1/1     Running   1          6h16m
ovs-g5jqp              1/1     Running   1          6h18m
ovs-g92kw              1/1     Running   1          6h14m
ovs-mxdjc              1/1     Running   1          6h17m
sdn-c6tzp              2/2     Running   1          6h17m
sdn-controller-9kpkv   1/1     Running   0          6h18m
sdn-controller-rbqfz   1/1     Running   0          6h18m
sdn-controller-xtxjj   1/1     Running   0          6h18m
sdn-f5wcp              2/2     Running   0          6h16m
sdn-g8v4x              2/2     Running   1          6h17m
sdn-pln9l              2/2     Running   1          6h18m
sdn-rk47n              2/2     Running   1          6h17m
sdn-rqrlg              2/2     Running   1          6h18m

Comment 2 Xingxing Xia 2021-01-05 04:19:08 UTC
Some info: this bug was reproduced in https://issues.redhat.com/browse/OCPQE-3043, in the following upgrade matrix entries:
1. 04_Disconnected UPI on GCP with RHCOS (FIPS off)
  http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8467/console upgraded from 4.2.0-0.nightly-2020-12-21-150827 to target_build: 4.3.0-0.nightly-2020-12-21-145308,4.4.0-0.nightly-2020-12-21-142921,4.5.0-0.nightly-2020-12-21-141644,4.6.0-0.nightly-2020-12-21-163117,4.7.0-0.nightly-2020-12-21-131655 and hit the bug.
  ** Tried UPI on GCP matrix upgrading from fresh 4.6 to 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **

2. 06_UPI on AWS with RHCOS (FIPS off)
  http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8469/console upgraded from 4.1.41-x86_64 to target_build: 4.2.36-x86_64,4.3.40-x86_64,4.4.31-x86_64,4.5.24-x86_64,4.6.9-x86_64,4.7.0-fc.0-x86_64 and hit the bug.

  https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/8515/console rebuilt upgrade_CI/8469 and reproduced.

https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/8515/console env is still alive: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/129431/artifact/workdir/install-dir/auth/kubeconfig .

** Dev can use it for any debugging. Note that it will be pruned at age 30h; its current age is 20h. **

The upgrade is currently still stuck.
Pods in CrashLoopBackOff, by namespace/component:
openshift-apiserver             apiserver-76bd95fc9b-ssb95                                        0/2   CrashLoopBackOff   298   14h   ip-10-0-52-185.us-east-2.compute.internal
openshift-apiserver             apiserver-76bd95fc9b-x978h                                        0/2   CrashLoopBackOff   299   14h   ip-10-0-65-114.us-east-2.compute.internal
openshift-console               downloads-7b9986669d-6cx6p                                        0/1   CrashLoopBackOff   283   14h   ip-10-0-62-100.us-east-2.compute.internal
openshift-dns                   dns-default-2bnlt                                                 2/3   CrashLoopBackOff   195   15h   ip-10-0-62-100.us-east-2.compute.internal
openshift-ingress               router-default-6ddf567c8f-86jxv                                   0/1   CrashLoopBackOff   231   14h   ip-10-0-62-100.us-east-2.compute.internal
openshift-marketplace           installed-custom-openshift-ansible-service-broker-6c9dccc6wllml   0/1   CrashLoopBackOff   166   14h   ip-10-0-62-100.us-east-2.compute.internal
openshift-network-diagnostics   network-check-target-jj9tf                                        0/1   CrashLoopBackOff   0     14h   ip-10-0-62-100.us-east-2.compute.internal
openshift-oauth-apiserver       apiserver-dd8d67465-zk4tn                                         0/1   CrashLoopBackOff   164   14h   ip-10-0-48-231.us-east-2.compute.internal
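
A listing like the one above can be grouped by node to see whether the failures cluster on particular hosts. The following is only a rough sketch, not part of the original report: the function name `crashloop_pods_by_node` is made up for illustration, and it assumes each line has exactly the seven whitespace-separated columns shown here (namespace, pod, ready, status, restarts, age, node).

```python
from collections import Counter

# Pod listing copied from the comment above (seven columns per line).
SAMPLE = """\
openshift-apiserver apiserver-76bd95fc9b-ssb95 0/2 CrashLoopBackOff 298 14h ip-10-0-52-185.us-east-2.compute.internal
openshift-apiserver apiserver-76bd95fc9b-x978h 0/2 CrashLoopBackOff 299 14h ip-10-0-65-114.us-east-2.compute.internal
openshift-console downloads-7b9986669d-6cx6p 0/1 CrashLoopBackOff 283 14h ip-10-0-62-100.us-east-2.compute.internal
openshift-dns dns-default-2bnlt 2/3 CrashLoopBackOff 195 15h ip-10-0-62-100.us-east-2.compute.internal
openshift-ingress router-default-6ddf567c8f-86jxv 0/1 CrashLoopBackOff 231 14h ip-10-0-62-100.us-east-2.compute.internal
openshift-marketplace installed-custom-openshift-ansible-service-broker-6c9dccc6wllml 0/1 CrashLoopBackOff 166 14h ip-10-0-62-100.us-east-2.compute.internal
openshift-network-diagnostics network-check-target-jj9tf 0/1 CrashLoopBackOff 0 14h ip-10-0-62-100.us-east-2.compute.internal
openshift-oauth-apiserver apiserver-dd8d67465-zk4tn 0/1 CrashLoopBackOff 164 14h ip-10-0-48-231.us-east-2.compute.internal
"""


def crashloop_pods_by_node(listing: str) -> Counter:
    """Count CrashLoopBackOff pods per node (hypothetical helper).

    Assumes seven whitespace-separated columns per line:
    namespace, pod, ready, status, restarts, age, node.
    Lines with any other column count are skipped.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # blank line or a listing with a different layout
        status, node = fields[3], fields[6]
        if status == "CrashLoopBackOff":
            counts[node] += 1
    return counts


if __name__ == "__main__":
    for node, count in crashloop_pods_by_node(SAMPLE).most_common():
        print(count, node)
```

On this data, five of the eight crash-looping pods are on ip-10-0-62-100.us-east-2.compute.internal, and the other three are on three different nodes.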

Output of oc describe for some of the above pods:
$ oc describe po dns-default-2bnlt -n openshift-dns
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  12m (x2078 over 14h)    kubelet  Readiness probe failed: Get "": dial tcp connect: no route to host
  Warning  Unhealthy  2m37s (x992 over 14h)   kubelet  Liveness probe failed: Get "": dial tcp connect: no route to host

$ oc describe pod apiserver-76bd95fc9b-x978h -n openshift-apiserver
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4h13m (x106 over 14h)   kubelet  Readiness probe failed: Get "": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

$ oc describe pod downloads-7b9986669d-6cx6p -n openshift-console
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  24m (x830 over 14h)     kubelet  Readiness probe failed: Get "": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Clusteroperators that are not at version 4.7.0-fc.0 with Available=True, Progressing=False, Degraded=False:
authentication                             4.7.0-fc.0   True    False   True    13h
dns                                        4.6.9        True    False   False   15h
ingress                                    4.7.0-fc.0   True    False   True    13h
machine-config                             4.6.9        True    False   False   14h
network                                    4.7.0-fc.0   False   True    True    14h
openshift-apiserver                        4.7.0-fc.0   True    False   True    109m
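
The filter used to produce the list above can be sketched roughly as follows. This is an illustrative sketch only: the function name `unsettled_operators` is invented, the input is assumed to be plain-text output with columns in the order name, version, Available, Progressing, Degraded, since, and the last sample row is a made-up settled operator added purely for contrast.

```python
from typing import List

# Rows copied from the comment above, plus one HYPOTHETICAL settled row.
SAMPLE = """\
authentication                             4.7.0-fc.0   True    False   True    13h
dns                                        4.6.9        True    False   False   15h
ingress                                    4.7.0-fc.0   True    False   True    13h
machine-config                             4.6.9        True    False   False   14h
network                                    4.7.0-fc.0   False   True    True    14h
openshift-apiserver                        4.7.0-fc.0   True    False   True    109m
hypothetical-settled-operator              4.7.0-fc.0   True    False   False   13h
"""


def unsettled_operators(listing: str, target: str) -> List[str]:
    """Return operators that are not at `target` or not True/False/False.

    Assumed column order: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    """
    unsettled = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip blank lines and a header row, if present
        name, version, available, progressing, degraded = fields[:5]
        settled = version == target and (
            (available, progressing, degraded) == ("True", "False", "False")
        )
        if not settled:
            unsettled.append(name)
    return unsettled


if __name__ == "__main__":
    print(unsettled_operators(SAMPLE, "4.7.0-fc.0"))
```

Note that dns and machine-config are listed only because they are still at 4.6.9; their conditions are otherwise healthy.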

$ oc get node # all nodes are "Ready" and still v1.19.0+7070803
ip-10-0-48-231.us-east-2.compute.internal   Ready    master   19h   v1.19.0+7070803

Comment 3 Xingxing Xia 2021-01-05 04:24:37 UTC
(In reply to Xingxing Xia from comment #2)
> 1. 04_Disconnected UPI on GCP with RHCOS (FIPS off)
> ** Tried UPI on GCP matrix upgrading from fresh 4.6 to 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **
Typo: the "4.6 to" is redundant. I intended to type:
** Tried UPI on GCP matrix upgrading from fresh 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **

Comment 4 Dan Winship 2021-01-05 12:53:27 UTC
Can you get the OVS and SDN logs from at least one node?

Comment 5 zhaozhanqi 2021-01-05 13:01:20 UTC
Created attachment 1744566
sdn and ovs log

Comment 6 Dan Winship 2021-01-05 13:26:59 UTC
The OVS log shows "openvswitch is running in container". Something is going wrong in the chained-updates case that causes it to think it should be running OVS in a container, so we're getting dueling OVSes again.
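
To check which nodes show the same symptom, the collected OVS logs could be scanned for that message. A minimal sketch, assuming the logs have already been gathered into a mapping of node name to log text; the helper name, the node names, and the surrounding sample log lines are all invented for illustration, and only the marker string comes from the log quoted above.

```python
from typing import Dict, List

# Marker string quoted from the OVS log in this bug.
CONTAINER_MARKER = "openvswitch is running in container"


def nodes_running_containerized_ovs(logs_by_node: Dict[str, str]) -> List[str]:
    """Return the sorted node names whose OVS log contains the marker.

    `logs_by_node` maps a node name to the full text of its OVS log
    (a hypothetical structure; how the logs are gathered is out of scope).
    """
    return sorted(
        node for node, text in logs_by_node.items() if CONTAINER_MARKER in text
    )


if __name__ == "__main__":
    # Made-up sample input: node names and all lines except the marker
    # are placeholders, not real log output.
    sample_logs = {
        "node-a": "placeholder line\nopenvswitch is running in container\n",
        "node-b": "placeholder line\nanother placeholder line\n",
    }
    print(nodes_running_containerized_ovs(sample_logs))
```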

Comment 11 errata-xmlrpc 2021-02-24 15:49:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 12 W. Trevor King 2021-04-05 17:46:57 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
