network pods are fine though

$ oc get pods -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-95gs9              1/1     Running   1          6h13m
ovs-9pfvw              1/1     Running   1          6h15m
ovs-dqlf4              1/1     Running   1          6h16m
ovs-g5jqp              1/1     Running   1          6h18m
ovs-g92kw              1/1     Running   1          6h14m
ovs-mxdjc              1/1     Running   1          6h17m
sdn-c6tzp              2/2     Running   1          6h17m
sdn-controller-9kpkv   1/1     Running   0          6h18m
sdn-controller-rbqfz   1/1     Running   0          6h18m
sdn-controller-xtxjj   1/1     Running   0          6h18m
sdn-f5wcp              2/2     Running   0          6h16m
sdn-g8v4x              2/2     Running   1          6h17m
sdn-pln9l              2/2     Running   1          6h18m
sdn-rk47n              2/2     Running   1          6h17m
sdn-rqrlg              2/2     Running   1          6h18m
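Worth noting that most of the ovs-*/sdn-* pods above show RESTARTS 1, so their current logs may not cover the interesting window. A minimal sketch for pulling the pre-restart logs as well (pod/container names taken from the listing above; --previous only works while the old container record is still retained):

$ oc logs -n openshift-sdn ovs-95gs9 --previous            # ovs pod has a single container
$ oc logs -n openshift-sdn sdn-c6tzp -c sdn --previous     # sdn pod has two containers, so pick the sdn one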
Some info: this bug was reproduced in https://issues.redhat.com/browse/OCPQE-3043, where it was observed to reproduce in the matrix below:

1. 04_Disconnected UPI on GCP with RHCOS (FIPS off)
http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8467/console
Upgraded from 4.2.0-0.nightly-2020-12-21-150827 through target_build: 4.3.0-0.nightly-2020-12-21-145308,4.4.0-0.nightly-2020-12-21-142921,4.5.0-0.nightly-2020-12-21-141644,4.6.0-0.nightly-2020-12-21-163117,4.7.0-0.nightly-2020-12-21-131655 and hit the bug.
** Tried UPI on GCP matrix upgrading from fresh 4.6 to 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **

2. 06_UPI on AWS with RHCOS (FIPS off)
http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8469/console
Upgraded from 4.1.41-x86_64 through target_build: 4.2.36-x86_64,4.3.40-x86_64,4.4.31-x86_64,4.5.24-x86_64,4.6.9-x86_64,4.7.0-fc.0-x86_64 and hit the bug.
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/8515/console rebuilt upgrade_CI/8469 and reproduced it. Its env is still alive:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/129431/artifact/workdir/install-dir/auth/kubeconfig
** Devs can use it for any debugging. Note it will be pruned at age 30h; its current age is 20h. **

Currently its status is still stuck. Pods are in CrashLoopBackOff in the namespaces/components below:
openshift-apiserver             apiserver-76bd95fc9b-ssb95                                        0/2   CrashLoopBackOff   298   14h   10.128.0.30   ip-10-0-52-185.us-east-2.compute.internal
openshift-apiserver             apiserver-76bd95fc9b-x978h                                        0/2   CrashLoopBackOff   299   14h   10.130.0.42   ip-10-0-65-114.us-east-2.compute.internal
openshift-console               downloads-7b9986669d-6cx6p                                        0/1   CrashLoopBackOff   283   14h   10.129.2.22   ip-10-0-62-100.us-east-2.compute.internal
openshift-dns                   dns-default-2bnlt                                                 2/3   CrashLoopBackOff   195   15h   10.129.2.2    ip-10-0-62-100.us-east-2.compute.internal
openshift-ingress               router-default-6ddf567c8f-86jxv                                   0/1   CrashLoopBackOff   231   14h   10.129.2.23   ip-10-0-62-100.us-east-2.compute.internal
openshift-marketplace           installed-custom-openshift-ansible-service-broker-6c9dccc6wllml   0/1   CrashLoopBackOff   166   14h   10.129.2.5    ip-10-0-62-100.us-east-2.compute.internal
openshift-network-diagnostics   network-check-target-jj9tf                                        0/1   CrashLoopBackOff   0     14h   10.129.2.26   ip-10-0-62-100.us-east-2.compute.internal
openshift-oauth-apiserver       apiserver-dd8d67465-zk4tn                                         0/1   CrashLoopBackOff   164   14h   10.129.0.55   ip-10-0-48-231.us-east-2.compute.internal

Running oc describe on some of the above pods:

$ oc describe po dns-default-2bnlt -n openshift-dns
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
...
  Warning  Unhealthy  12m (x2078 over 14h)    kubelet  Readiness probe failed: Get "http://10.129.2.2:8080/health": dial tcp 10.129.2.2:8080: connect: no route to host
  Warning  Unhealthy  2m37s (x992 over 14h)   kubelet  Liveness probe failed: Get "http://10.129.2.2:8080/health": dial tcp 10.129.2.2:8080: connect: no route to host

$ oc describe pod apiserver-76bd95fc9b-x978h -n openshift-apiserver
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4h13m (x106 over 14h)   kubelet  Readiness probe failed: Get "https://10.130.0.42:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

$ oc describe pod downloads-7b9986669d-6cx6p -n openshift-console
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
...
  Warning  Unhealthy  24m (x830 over 14h)     kubelet  Readiness probe failed: Get "http://10.129.2.22:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Clusteroperators which are not "4.7.0-fc.0 True False False":
authentication        4.7.0-fc.0   True    False   True    13h
dns                   4.6.9        True    False   False   15h
ingress               4.7.0-fc.0   True    False   True    13h
machine-config        4.6.9        True    False   False   14h
network               4.7.0-fc.0   False   True    True    14h
openshift-apiserver   4.7.0-fc.0   True    False   True    109m

$ oc get node  # all nodes are "Ready" and still v1.19.0+7070803
ip-10-0-48-231.us-east-2.compute.internal   Ready   master   19h   v1.19.0+7070803
...
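For anyone grabbing the env before it is pruned, a rough sketch of first debugging steps (the kubeconfig path is wherever you save the artifact linked above; the node name is taken from the dns-default-2bnlt events, and this assumes curl is available in the oc debug tools image):

$ export KUBECONFIG=~/Downloads/kubeconfig        # saved from the Flexy artifact link above
$ oc get clusterversion                           # confirm which hop of the chained upgrade is stuck
$ oc debug node/ip-10-0-62-100.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# curl -m 5 http://10.129.2.2:8080/health   # replay the failing kubelet probe; "no route to host" here points at node-level SDN breakage
sh-4.4# ip route                                  # check that a route for the cluster network (10.128.0.0/14 by default) is still present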
(In reply to Xingxing Xia from comment #2)
> 1. 04_Disconnected UPI on GCP with RHCOS (FIPS off)
> ** Tried UPI on GCP matrix upgrading from fresh 4.6 to 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **

The "4.6 to" is a redundant typo; intended to type:
** Tried UPI on GCP matrix upgrading from fresh 4.6.9-x86_64 to 4.7.0-fc.0-x86_64, passed without reproducing it **
Can you get the OVS and SDN logs from at least one node?
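A sketch of one way to collect them, assuming you pick the ovs-*/sdn-* pods colocated with one of the broken nodes (the placeholders need to be filled in from the -o wide output):

$ oc get pods -n openshift-sdn -o wide                                # map ovs-*/sdn-* pods to nodes
$ oc logs -n openshift-sdn ovs-<pod-on-that-node> > ovs.log
$ oc logs -n openshift-sdn sdn-<pod-on-that-node> -c sdn > sdn.log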
Created attachment 1744566 [details] sdn and ovs log
The ovs log shows "openvswitch is running in container". Something is going wrong in the chained-updates case that causes the node to think it should still run OVS in a container, so we are getting dueling OVSes again.
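If it helps confirm the dueling-OVS theory, a rough check on one node (a sketch, assuming the intended 4.6+ layout where OVS runs on the host via systemd rather than inside the ovs-* pod):

$ oc debug node/ip-10-0-62-100.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# systemctl is-active ovs-vswitchd ovsdb-server   # host OVS; expected active on 4.6+
sh-4.4# ps -ef | grep [o]vs-vswitchd                    # more than one daemon means the host and container copies are fighting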
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475