Description of problem:

In CI we're seeing regular errors like:

I0716 18:10:50.559905 1 request.go:668] Waited for 1.033505889s due to client-side throttling, not priority and fairness, request: GET:https://[fd02::1]:443/apis/operators.coreos.com/v1alpha2?timeout=32s
{"level":"info","ts":1626459052.1155467,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8085"}
{"level":"error","ts":1626459052.1158388,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

This seems to be a port conflict between CAPBM and BMO:

https://github.com/openshift/cluster-api-provider-baremetal/blob/94eff91d0acc5e01a5ce258ef1d755dc1a42ed37/cmd/manager/main.go#L62
https://github.com/openshift/baremetal-operator/blob/master/main.go#L102

Version-Release number of selected component (if applicable):

How reproducible:
The error only occurs if the machine-controller and metal3/BMO pods get started on the same master, so this doesn't fail constantly in CI.

Steps to Reproduce:
Check failing CI runs, or it can probably be reproduced by deleting the metal3 pod until it gets rescheduled onto the same master as the machine-controller pod.

Additional info:
I think we need to pick some non-conflicting ports and update https://github.com/openshift/enhancements/blob/master/enhancements/network/host-port-registry.md to avoid this happening again in future.
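For reference, the "bind: address already in use" message is just what any process gets when it tries to bind a TCP port that another host-networked process already holds. A minimal standalone Go sketch (port 9440 chosen only to mirror the log above, nothing here is taken from the operators' actual code) reproduces the same error text:

package main

import (
	"fmt"
	"net"
)

func main() {
	// First listener grabs :9440, the way the first operator's
	// health-probe endpoint would on a host-networked pod.
	l1, err := net.Listen("tcp", ":9440")
	if err != nil {
		fmt.Println("first listen failed:", err)
		return
	}
	defer l1.Close()

	// Second listener simulates the other operator starting on the
	// same host; it fails with "bind: address already in use".
	_, err = net.Listen("tcp", ":9440")
	fmt.Println("second listen:", err)
}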
BMO runs with host networking, but CAPBM does not, so CAPBM shouldn't be the cause. More likely some new operator has been added that runs with host networking, and neither it nor BMO has registered port 9440 in the registry. I'd assume basically every operator will use port 9440, since it's the default from a code generator.
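For context, the listener that fails in the log is the controller-runtime health-probe endpoint. A hedged sketch of how such a main.go is typically wired (the flag name --health-addr and the default ":9440" here are illustrative, not the actual BMO or CCCMO code) shows why the default collides and why making the bind address configurable, plus registering a distinct port per operator, avoids it:

package main

import (
	"flag"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// Illustrative flag; the real operators have their own flag names.
	// The point is that a scaffolded default like ":9440" collides when
	// two host-networked operators land on the same node unless each
	// picks (and registers) a distinct port.
	healthAddr := flag.String("health-addr", ":9440", "address for /healthz and /readyz")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: *healthAddr,
	})
	if err != nil {
		ctrl.Log.Error(err, "unable to start manager")
		os.Exit(1)
	}

	// These checks are served by the listener that fails to bind in the logs above.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		ctrl.Log.Error(err, "unable to set up health check")
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		ctrl.Log.Error(err, "unable to set up ready check")
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "problem running manager")
		os.Exit(1)
	}
}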
I've seen this also:

{"level":"error","ts":1626796639.1610942,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

[root@master-2 core]# netstat -apn | grep 9440
tcp6       0      0 :::9440                 :::*                    LISTEN      4643/cluster-contro
tcp6       0      0 fd01:0:0:1::2:59748     fd01:0:0:1::15:9440     TIME_WAIT   -

[root@master-2 core]# ps -ef | grep 4632
root        4632       1  0 15:40 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -c e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-cloud-controller-manager-operator_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_7f78554b-e95e-4291-a548-5107b53bb528/cluster-cloud-controller-manager/0.log --log-level info -n k8s_cluster-cloud-controller-manager_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_openshift-cloud-controller-manager-operator_7f78554b-e95e-4291-a548-5107b53bb528_0 -P /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba -s
root        4643    4632  0 15:40 ?        00:00:01 /cluster-controller-manager-operator --leader-elect --images-json=/etc/cloud-controller-manager-config/images.json

[root@master-2 core]# nc -l localhost 9440
Ncat: bind to ::1:9440: Address already in use. QUITTING.
As already pointed out by @zbitter and then identified by @derekh, the conflict happens when BMO and cluster-cloud-controller-manager-operator are deployed on the same master: both use the host network and both allocate the same port (9440) for their health checks. I've been able to replicate the issue consistently by forcing the two operators to land on the same node.
Resolved by https://github.com/openshift/cluster-baremetal-operator/pull/180
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759