Description of problem:
The CNO can get wedged on its surge rollingUpdate during a cluster upgrade if the replacement operator pod gets scheduled onto the same master as the existing CNO pod: both pods try to bind the same host port (9104), so the new pod crash-loops.
$ oc -n openshift-network-operator get pods -o wide
NAME                                READY   STATUS             RESTARTS   AGE   IP               NODE                NOMINATED NODE   READINESS GATES
network-operator-6c95c58b67-9gsb7   0/1     CrashLoopBackOff   9          23m   18.104.22.168    mcs-master-03.dmz   <none>           <none>
network-operator-f88c9fdf9-mh7hq    1/1     Running            0          13d   22.214.171.124   mcs-master-03.dmz   <none>           <none>
$ oc -n openshift-network-operator logs network-operator-6c95c58b67-9gsb7
W0404 17:11:37.101246 1 cmd.go:204] Using insecure, self-signed certificates
I0404 17:11:37.333403 1 observer_polling.go:159] Starting file observer
I0404 17:11:37.374956 1 builder.go:238] network-operator version 4.8.0-202203102349.p0.g9150952.assembly.stream-9150952-9150952e02594242937e5c7a3c8bd073d9f1ada0
F0404 17:11:37.375302 1 cmd.go:129] failed to create listener: failed to listen on 0.0.0.0:9104: listen tcp 0.0.0.0:9104: bind: address already in use
Version-Release number of selected component (if applicable):
4.8.28 to 4.8.35
Steps to Reproduce:
1. Perform a cluster upgrade
Actual results:
The CNO rollout was stuck; the CrashLoopBackOff pod had to be deleted manually so it would be rescheduled onto a different master.
Expected results:
The replacement pod should not be scheduled onto the same master as the currently running CNO pod.
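One way to express "should not schedule on the same master" declaratively is a required podAntiAffinity rule against the operator's own pods. This is a sketch only, not necessarily the fix shipped in the errata, and the label key/value is an assumption:

```yaml
# Fragment of the network-operator Deployment pod template (illustrative)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          name: network-operator   # assumed pod label
      topologyKey: kubernetes.io/hostname
```

Note the trade-off: required self-anti-affinity can deadlock a surge rollout if no other eligible node exists (e.g. a single schedulable master), so preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.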
Verified this bug on 4.11.0-0.nightly-2022-04-26-181148
$ oc get deployment -n openshift-network-operator network-operator -o yaml | grep ports: -A4
- containerPort: 9104
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.