Description of problem: OVN cluster: upgrading from 4.5.5 to the latest 4.6 nightly build failed. One ovnkube-node pod is in CrashLoopBackOff and most of the ovnkube-node-metrics pods are stuck in Pending status because "node(s) didn't have free ports for the requested pod ports": the required port 9103 is still occupied by ovnkube.

*How reproducible:*
Found this failure in upgrade CI, then reproduced it in a manual upgrade.

*Version-Release number of selected components (if applicable):*
Base version: 4.5.5
Target version: 4.6.0-0.nightly-2020-08-11-134736

*Steps to Reproduce:*
1. Install a 4.5.5 baremetal OVN cluster.
2. Upgrade to the latest 4.6 nightly build with the command below:

oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-11-134736 --force=true --allow-explicit-upgrade=true

*Actual results:*

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.5     True        True          29m     Unable to apply 4.6.0-0.nightly-2020-08-11-134736: the cluster operator network has not yet successfully rolled out

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
cloud-credential                           4.6.0-0.nightly-2020-08-11-134736   True        False         False      145m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-11-134736   True        False         False      123m
config-operator                            4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
console                                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      119m
dns                                        4.5.5                               True        False         False      128m
etcd                                       4.6.0-0.nightly-2020-08-11-134736   True        False         False      128m
image-registry                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      119m
ingress                                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      25m
insights                                   4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
kube-apiserver                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      127m
kube-controller-manager                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      127m
kube-scheduler                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      127m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-11-134736   True        False         False      119m
machine-api                                4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
machine-approver                           4.6.0-0.nightly-2020-08-11-134736   True        False         False      126m
machine-config                             4.5.5                               True        False         False      29m
marketplace                                4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
monitoring                                 4.6.0-0.nightly-2020-08-11-134736   True        False         False      113m
network                                    4.5.5                               True        True          True       129m
node-tuning                                4.6.0-0.nightly-2020-08-11-134736   True        False         False      24m
openshift-apiserver                        4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
openshift-controller-manager               4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
openshift-samples                          4.6.0-0.nightly-2020-08-11-134736   True        False         False      24m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-11-134736   True        False         False      128m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-11-134736   True        False         False      128m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-11-134736   True        False         False      24m
service-ca                                 4.6.0-0.nightly-2020-08-11-134736   True        False         False      129m
storage                                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      25m

oc get co network -o yaml
  - lastTransitionTime: "2020-08-12T05:43:28Z"
    message: |-
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" update is rolling out (1 out of 6 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node-metrics" is not available (awaiting 5 nodes)
    reason: Deploying
    status: "True"
    type: Progressing

Check the OVN pods in openshift-ovn-kubernetes:

oc get pods -n openshift-ovn-kubernetes
NAME                           READY   STATUS             RESTARTS   AGE
ovnkube-master-khpnv           4/4     Running            0          21m
ovnkube-master-kzhd8           4/4     Running            0          23m
ovnkube-master-metrics-2jqqm   1/1     Running            0          23m
ovnkube-master-metrics-gjwqj   1/1     Running            0          23m
ovnkube-master-metrics-xckrq   1/1     Running            0          23m
ovnkube-master-qqb8d           4/4     Running            0          22m
ovnkube-node-5t95k             2/2     Running            0          132m
ovnkube-node-67drc             2/2     Running            0          123m
ovnkube-node-9smvf             2/2     Running            0          123m
ovnkube-node-fwblq             1/2     CrashLoopBackOff   9          23m
ovnkube-node-metrics-5hq5j     0/1     Pending            0          23m
ovnkube-node-metrics-f9spb     0/1     Pending            0          23m
ovnkube-node-metrics-fwmhb     0/1     Pending            0          23m
ovnkube-node-metrics-ksfjc     0/1     Pending            0          23m
ovnkube-node-metrics-ldvmx     0/1     Pending            0          23m
ovnkube-node-metrics-rxdp6     1/1     Running            0          23m
ovnkube-node-mtj9b             2/2     Running            0          123m
ovnkube-node-r8vn4             2/2     Running            0          132m
ovs-node-525v5                 1/1     Running            0          23m
ovs-node-jnn2h                 1/1     Running            0          23m
ovs-node-ncmt6                 1/1     Running            0          21m
ovs-node-pxvh9                 1/1     Running            0          22m
ovs-node-tr6rl                 1/1     Running            0          21m
ovs-node-xk4xl                 1/1     Running            0          21m

oc logs ovnkube-node-fwblq -c ovn-controller -n openshift-ovn-kubernetes
2020-08-12T06:05:45Z|00176|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:05:45Z|00177|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:05:45Z|00178|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:05:45Z|00179|pinctrl|WARN|Dropped 173 log messages in last 45 seconds (most recently, 0 seconds ago) due to excessive rate
2020-08-12T06:05:45Z|00180|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-12T06:05:45Z|00181|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:06:12Z|00182|patch|ERR|Dropped 16 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2020-08-12T06:06:12Z|00183|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:06:42Z|00184|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:06:45Z|00185|pinctrl|WARN|Dropped 41 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2020-08-12T06:06:45Z|00186|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-12T06:07:12Z|00187|patch|ERR|Dropped 4 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2020-08-12T06:07:12Z|00188|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:07:42Z|00189|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:07:45Z|00190|pinctrl|WARN|Dropped 35 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2020-08-12T06:07:45Z|00191|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address

Check the pending pods:

oc describe pod ovnkube-node-metrics-ldvmx -n openshift-ovn-kubernetes
Containers:
  kube-rbac-proxy:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31114cd64b9dc44e9d61cec370f9c96acdb7dd391f8228552e552e6605aba735
    Port:       9103/TCP
    Host Port:  9103/TCP

  Warning  FailedScheduling  28s (x31 over 25m)  default-scheduler  0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.

Check the nodes for TCP port 9103; it is occupied by ovnkube:

[root@huir-upg1-jp5v8-compute-2 ~]# netstat -ntlp | grep 9103
tcp6       0      0 :::9103                 :::*                    LISTEN      2608/ovnkube
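For reference, a quick way to confirm the host-port conflict is to compare the port specs of the two DaemonSets. The DaemonSet names are taken from the output above, but the jsonpath expressions are only an assumed way to inspect the manifests, not something verified against this cluster:

oc get ds ovnkube-node -n openshift-ovn-kubernetes \
  -o jsonpath='{.spec.template.spec.containers[*].ports}'
oc get ds ovnkube-node-metrics -n openshift-ovn-kubernetes \
  -o jsonpath='{.spec.template.spec.containers[*].ports}'

If both show port 9103 bound on the host, the scheduler cannot place a metrics pod on any node where the old ovnkube-node pod is still holding that port, which matches the FailedScheduling event above.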
I think there are 2 different issues here. For the original bug, the issue is that we are not upgrading ovnkube-node before we launch the metrics pods, so metrics cannot deploy because it needs port 9103. For the other comment, it looks like something is wrong with configuring the br-ex bridge with the ovs-configuration service; however, I cannot launch an oc debug node pod on that cluster, so I'm unable to investigate further. Either way, Weinan, please open a new BZ for the issue you are encountering, as it is a separate bug.
(In reply to Tim Rozet from comment #6)
> I think there are 2 different issues here. For the original bug, the issue
> is that we are not upgrading ovnkube-node before we launch the metrics
> pods, so metrics cannot deploy because it needs port 9103. For the other
> comment, it looks like something is wrong with configuring the br-ex bridge
> with the ovs-configuration service; however, I cannot launch an oc debug
> node pod on that cluster, so I'm unable to investigate further. Either way,
> Weinan, please open a new BZ for the issue you are encountering, as it is a
> separate bug.

Actually, the metrics port problem might not be the real issue. I see ovs-configuration problems in the manually upgraded cluster.
It seems we can't escape the x509 error. Per the auth team: "no. It's reasonably safeish to ignore a cert error for getting logs (so we built that), but it's considerably less safe in cases where users send data to the potentially unsafe endpoint." The other option, I believe, is to leverage a bastion host, which is not working for me at the moment :(
@huiran Could you try to reproduce this issue on Monday on a new setup and share the results? I had a hard time with oc debug and the bastion host on this one. Thanks.
OK, so after further investigation it looks like the problem is that CNO upgrades before MCO, which means MCO never has a chance to start system OVS and run the ovs-configuration service, so OVN fails to start. We have a couple of options here:

1. Add the same detection we use for the ovs-node DS to detect whether or not OVS is running on the host, and use that to determine if we should run in local GW mode. That would allow CNO to "upgrade", and then when MCO runs it would reboot the node and ovn-kube would run the right way after that; see the sketch after this list.
2. Move CNO to run after MCO in the upgrade path.

We need to figure out if #2 is feasible; otherwise we go with #1.
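For illustration, a minimal sketch of what the detection in option #1 could look like. The systemd unit name and the fallback logic are assumptions for the sake of the sketch, not the actual ovnkube-node startup script:

# Assumed detection: if system OVS is active on the host, MCO's
# ovs-configuration has already run and the normal 4.6 startup path is safe;
# otherwise fall back to local GW mode until MCO reboots the node.
if systemctl is-active --quiet ovs-vswitchd; then
  use_local_gw=false   # host OVS is up, proceed normally
else
  use_local_gw=true    # host not yet upgraded by MCO, run in local GW mode
fi
echo "ovnkube-node starting with local GW mode: ${use_local_gw}"

The point of the design is that ovnkube-node never blocks on MCO: it comes up in a working mode immediately, and the post-MCO reboot naturally switches it to the intended configuration.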
*** Bug 1868083 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475