Bug 1983975 - BMO fails to start with port conflict
Summary: BMO fails to start with port conflict
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Andrea Fasano
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-20 10:34 UTC by Steven Hardy
Modified: 2021-10-18 17:40 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-virtualmedia=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-serial-ipv4=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all
Last Closed: 2021-10-18 17:40:26 UTC
Target Upstream Version:
Embargoed:
afasano: needinfo-




Links:
- Github openshift cluster-baremetal-operator pull 180 (closed): Customize metal3 health endpoint to avoid port conflicts (last updated 2021-07-28 14:52:31 UTC)
- Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:40:43 UTC)

Description Steven Hardy 2021-07-20 10:34:04 UTC
Description of problem:

In CI we're seeing regular errors like:

I0716 18:10:50.559905       1 request.go:668] Waited for 1.033505889s due to client-side throttling, not priority and fairness, request: GET:https://[fd02::1]:443/apis/operators.coreos.com/v1alpha2?timeout=32s
{"level":"info","ts":1626459052.1155467,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8085"}
{"level":"error","ts":1626459052.1158388,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

This seems to be a port conflict between CAPBM and BMO:

https://github.com/openshift/cluster-api-provider-baremetal/blob/94eff91d0acc5e01a5ce258ef1d755dc1a42ed37/cmd/manager/main.go#L62

https://github.com/openshift/baremetal-operator/blob/master/main.go#L102
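
Both linked binaries are controller-runtime managers, and the health-probe listener is bound while the manager is constructed, which is why the failure surfaces as "unable to start manager". A minimal sketch of the pattern, assuming a -health-addr flag with the :9440 default seen in the log (illustrative only, not the actual BMO main.go):

package main

import (
	"flag"
	"fmt"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	var healthAddr string
	// With host networking this listener competes with every other
	// host-network process on the node, so the port must be unique.
	flag.StringVar(&healthAddr, "health-addr", ":9440",
		"The address the health endpoint binds to.")
	flag.Parse()

	// The health-probe listener is bound inside NewManager; this is
	// where "listen tcp :9440: bind: address already in use" appears
	// if another process already owns the port.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: healthAddr,
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "unable to start manager:", err)
		os.Exit(1)
	}

	// Register the liveness/readiness checks served on healthAddr.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}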


Version-Release number of selected component (if applicable):


How reproducible:

The error occurs only when the machine-controller pod and the metal3/BMO pod are started on the same master, so the failure is not constant in CI.


Steps to Reproduce:

Check failing CI runs; alternatively, the issue can probably be reproduced by repeatedly deleting the metal3 pod until it is rescheduled on the same master as the machine-controller pod.


Additional info:

I think we need to pick non-conflicting ports and update https://github.com/openshift/enhancements/blob/master/enhancements/network/host-port-registry.md to avoid this happening again in the future.

Comment 1 Zane Bitter 2021-07-20 14:42:41 UTC
BMO runs with host networking, but CAPBM does not, so CAPBM shouldn't be causing the problem. It's more likely that some new operator running with host networking has been added, and neither it nor BMO has registered port 9440 in the registry. I'd assume that basically every operator uses port 9440, since it's the default emitted by a code generator.
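
The failure mode itself is trivial to reproduce in isolation; a standalone sketch, where the two listeners stand in for whichever two host-network processes contend for the port:

package main

import (
	"fmt"
	"net"
)

func main() {
	// The first host-network process on the node grabs the port.
	first, err := net.Listen("tcp", ":9440")
	if err != nil {
		fmt.Println("first listener:", err)
		return
	}
	defer first.Close()

	// The second process (e.g. BMO's health endpoint) tries the same
	// port and fails exactly like the log above:
	//   listen tcp :9440: bind: address already in use
	if _, err := net.Listen("tcp", ":9440"); err != nil {
		fmt.Println("second listener:", err)
	}
}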

Comment 2 Derek Higgins 2021-07-20 16:01:39 UTC
I've seen this also:

{"level":"error","ts":1626796639.1610942,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}                                                                                                       

[root@master-2 core]# netstat -apn | grep 9440
tcp6       0      0 :::9440                 :::*                    LISTEN      4643/cluster-contro
tcp6       0      0 fd01:0:0:1::2:59748     fd01:0:0:1::15:9440     TIME_WAIT   -

[root@master-2 core]# ps -ef | grep 4632
root        4632       1  0 15:40 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -c e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-cloud-controller-manager-operator_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_7f78554b-e95e-4291-a548-5107b53bb528/cluster-cloud-controller-manager/0.log --log-level info -n k8s_cluster-cloud-controller-manager_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_openshift-cloud-controller-manager-operator_7f78554b-e95e-4291-a548-5107b53bb528_0 -P /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba -s              
root        4643    4632  0 15:40 ?        00:00:01 /cluster-controller-manager-operator --leader-elect --images-json=/etc/cloud-controller-manager-config/images.json             

[root@master-2 core]# nc -l localhost 9440
Ncat: bind to ::1:9440: Address already in use. QUITTING.

Comment 3 Andrea Fasano 2021-07-21 15:55:52 UTC
As already pointed out by @zbitter and then identified by @derekh, the conflict happens when BMO and cluster-cloud-controller-manager-operator are deployed on the same master, since both use the host network and allocate the same port (9440) for their health checks. I've been able to replicate the issue consistently by forcing the two operators to land on the same node.

Comment 4 Steven Hardy 2021-07-28 14:53:34 UTC
Resolved by https://github.com/openshift/cluster-baremetal-operator/pull/180
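
The fix described by the PR title is to customize the metal3 health endpoint port. An illustrative sketch of that shape, not the code from PR 180: the port number, container name, and image are hypothetical, and the -health-addr flag is assumed to be the one BMO's main.go defines:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// bmoHealthPort is hypothetical; whichever port is actually chosen should
// also be recorded in the OpenShift host-port registry mentioned in the
// bug description.
const bmoHealthPort = 9446

// baremetalOperatorContainer sketches how cluster-baremetal-operator could
// render the BMO container with a non-default health endpoint.
func baremetalOperatorContainer(image string) corev1.Container {
	return corev1.Container{
		Name:  "metal3-baremetal-operator", // illustrative container name
		Image: image,
		Args: []string{
			// Move the health endpoint off the contended :9440 default.
			fmt.Sprintf("-health-addr=:%d", bmoHealthPort),
		},
		Ports: []corev1.ContainerPort{{
			Name:          "health",
			ContainerPort: bmoHealthPort,
			// The metal3 pod uses host networking, so this is also the
			// host port and must be unique on the node.
			HostPort: bmoHealthPort,
		}},
	}
}

func main() {
	// Hypothetical image reference, for demonstration only.
	c := baremetalOperatorContainer("quay.io/example/baremetal-operator:tag")
	fmt.Printf("%s %v\n", c.Name, c.Args)
}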

Comment 8 errata-xmlrpc 2021-10-18 17:40:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

