Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1983975

Summary: BMO fails to start with port conflict
Product: OpenShift Container Platform
Reporter: Steven Hardy <shardy>
Component: Bare Metal Hardware Provisioning
Sub component: cluster-baremetal-operator
Assignee: Andrea Fasano <afasano>
QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: afasano, aos-bugs, derekh, rbartal, stbenjam, tsedovic, zbitter
Version: 4.9
Keywords: Triaged
Target Milestone: ---
Flags: afasano: needinfo-
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Environment:
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-virtualmedia=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-serial-ipv4=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all
Last Closed: 2021-10-18 17:40:26 UTC
Type: Bug

Description Steven Hardy 2021-07-20 10:34:04 UTC
Description of problem:

In CI we're seeing regular errors like:

I0716 18:10:50.559905       1 request.go:668] Waited for 1.033505889s due to client-side throttling, not priority and fairness, request: GET:https://[fd02::1]:443/apis/operators.coreos.com/v1alpha2?timeout=32s
{"level":"info","ts":1626459052.1155467,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8085"}
{"level":"error","ts":1626459052.1158388,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

This seems to be a port conflict between CAPBM and BMO:

https://github.com/openshift/cluster-api-provider-baremetal/blob/94eff91d0acc5e01a5ce258ef1d755dc1a42ed37/cmd/manager/main.go#L62

https://github.com/openshift/baremetal-operator/blob/master/main.go#L102
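The failure mode can be sketched with a minimal, self-contained Go program (port and roles are illustrative; in the cluster the two listeners are separate host-networked operator processes, not one process):

```go
package main

import (
	"fmt"
	"net"
)

// bindHealthPort tries to claim the health-probe address, much as the
// controller-runtime manager does when it starts its healthz listener.
func bindHealthPort(addr string) (net.Listener, error) {
	return net.Listen("tcp", addr)
}

func main() {
	// First listener stands in for whichever host-networked operator
	// claims the port first.
	first, err := bindHealthPort("127.0.0.1:9440")
	if err != nil {
		fmt.Println("first bind failed:", err)
		return
	}
	defer first.Close()

	// Second listener stands in for BMO starting on the same host network;
	// this reproduces the "bind: address already in use" error in the log above.
	if _, err := bindHealthPort("127.0.0.1:9440"); err != nil {
		fmt.Println("second bind failed:", err)
	}
}
```

Because both operators run with host networking, they share the node's port namespace, so the second bind fails just as it does for two plain listeners in one process.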


Version-Release number of selected component (if applicable):


How reproducible:

The error only occurs when the machine-controller and metal3/BMO pods are scheduled on the same master, so this does not fail constantly in CI.


Steps to Reproduce:

Check failing CI runs; alternatively, the issue can probably be reproduced by deleting the metal3 pod until it is rescheduled onto the same master as the machine-controller pod.


Additional info:

I think we need to pick non-conflicting ports and update https://github.com/openshift/enhancements/blob/master/enhancements/network/host-port-registry.md to avoid this happening again in the future.

Comment 1 Zane Bitter 2021-07-20 14:42:41 UTC
BMO runs with host networking, but CAPBM does not, so CAPBM shouldn't be causing the problem. It's more likely that some new operator running with host networking has been added, and neither it nor BMO has registered port 9440 in the registry. I'd assume that basically every operator will use port 9440, since it is the default from a code generator.

Comment 2 Derek Higgins 2021-07-20 16:01:39 UTC
I've seen this also:

{"level":"error","ts":1626796639.1610942,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}                                                                                                       

[root@master-2 core]# netstat -apn | grep 9440
tcp6       0      0 :::9440                 :::*                    LISTEN      4643/cluster-contro
tcp6       0      0 fd01:0:0:1::2:59748     fd01:0:0:1::15:9440     TIME_WAIT   -

[root@master-2 core]# ps -ef | grep 4632
root        4632       1  0 15:40 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -c e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-cloud-controller-manager-operator_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_7f78554b-e95e-4291-a548-5107b53bb528/cluster-cloud-controller-manager/0.log --log-level info -n k8s_cluster-cloud-controller-manager_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_openshift-cloud-controller-manager-operator_7f78554b-e95e-4291-a548-5107b53bb528_0 -P /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba -s              
root        4643    4632  0 15:40 ?        00:00:01 /cluster-controller-manager-operator --leader-elect --images-json=/etc/cloud-controller-manager-config/images.json             

[root@master-2 core]# nc -l localhost 9440
Ncat: bind to ::1:9440: Address already in use. QUITTING.

Comment 3 Andrea Fasano 2021-07-21 15:55:52 UTC
As already pointed out by @zbitter and then identified by @derekh, the conflict happens when both BMO and cluster-cloud-controller-manager-operator are deployed on the same master, since they both use the host network and allocate the same port (9440) for health checks. I've been able to replicate the issue consistently by forcing the two operators to land on the same node.

Comment 4 Steven Hardy 2021-07-28 14:53:34 UTC
Resolved by https://github.com/openshift/cluster-baremetal-operator/pull/180

Comment 8 errata-xmlrpc 2021-10-18 17:40:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759