Description of problem:

In CI we're seeing regular errors like:

I0716 18:10:50.559905 1 request.go:668] Waited for 1.033505889s due to client-side throttling, not priority and fairness, request: GET:https://[fd02::1]:443/apis/operators.coreos.com/v1alpha2?timeout=32s
{"level":"info","ts":1626459052.1155467,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8085"}
{"level":"error","ts":1626459052.1158388,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

This seems to be a port conflict between CAPBM and BMO:

https://github.com/openshift/cluster-api-provider-baremetal/blob/94eff91d0acc5e01a5ce258ef1d755dc1a42ed37/cmd/manager/main.go#L62
https://github.com/openshift/baremetal-operator/blob/master/main.go#L102

Version-Release number of selected component (if applicable):

How reproducible:
The error only occurs if the machine-controller and metal3/BMO pods get started on the same master, so this doesn't fail constantly in CI.

Steps to Reproduce:
Check failing CI runs, or it can probably be reproduced by deleting the metal3 pod until it gets rescheduled onto the same master as the machine-controller pod.

Additional info:
I think we need to pick some non-conflicting ports and update https://github.com/openshift/enhancements/blob/master/enhancements/network/host-port-registry.md to avoid this happening again in future.
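For reference, the "bind: address already in use" message is just what any process gets when it tries to bind a TCP port that another host-networked process already holds. A minimal standalone Go sketch (port 9440 chosen only to mirror the log above, nothing here is taken from the operators' actual code) reproduces the same error text:

package main

import (
	"fmt"
	"net"
)

func main() {
	// First listener grabs :9440, the way the first operator's
	// health-probe endpoint would on a host-networked pod.
	l1, err := net.Listen("tcp", ":9440")
	if err != nil {
		fmt.Println("first listen failed:", err)
		return
	}
	defer l1.Close()

	// Second listener simulates the other operator starting on the
	// same host; it fails with "bind: address already in use".
	_, err = net.Listen("tcp", ":9440")
	fmt.Println("second listen:", err)
}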
BMO runs with host networking, but CAPBM does not, so CAPBM shouldn't be the cause. More likely some new operator has been added that runs with host networking, and neither it nor BMO has registered port 9440 in the registry. I'd assume basically every operator will use port 9440, since it's the default from a code generator.
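For context, the listener that fails in the log is the controller-runtime health-probe endpoint. A hedged sketch of how such a main.go is typically wired (the flag name --health-addr and the default ":9440" here are illustrative, not the actual BMO or CCCMO code) shows why the default collides and why making the bind address configurable, plus registering a distinct port per operator, avoids it:

package main

import (
	"flag"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// Illustrative flag; the real operators have their own flag names.
	// The point is that a scaffolded default like ":9440" collides when
	// two host-networked operators land on the same node unless each
	// picks (and registers) a distinct port.
	healthAddr := flag.String("health-addr", ":9440", "address for /healthz and /readyz")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: *healthAddr,
	})
	if err != nil {
		ctrl.Log.Error(err, "unable to start manager")
		os.Exit(1)
	}

	// These checks are served by the listener that fails to bind in the logs above.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		ctrl.Log.Error(err, "unable to set up health check")
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		ctrl.Log.Error(err, "unable to set up ready check")
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "problem running manager")
		os.Exit(1)
	}
}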
I've seen this also:

{"level":"error","ts":1626796639.1610942,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

[root@master-2 core]# netstat -apn | grep 9440
tcp6       0      0 :::9440                 :::*                    LISTEN      4643/cluster-contro
tcp6       0      0 fd01:0:0:1::2:59748     fd01:0:0:1::15:9440     TIME_WAIT   -

[root@master-2 core]# ps -ef | grep 4632
root        4632       1  0 15:40 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -c e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-cloud-controller-manager-operator_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_7f78554b-e95e-4291-a548-5107b53bb528/cluster-cloud-controller-manager/0.log --log-level info -n k8s_cluster-cloud-controller-manager_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_openshift-cloud-controller-manager-operator_7f78554b-e95e-4291-a548-5107b53bb528_0 -P /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba -s
root        4643    4632  0 15:40 ?        00:00:01 /cluster-controller-manager-operator --leader-elect --images-json=/etc/cloud-controller-manager-config/images.json

[root@master-2 core]# nc -l localhost 9440
Ncat: bind to ::1:9440: Address already in use. QUITTING.
As already pointed out by @zbitter and then identified by @derekh, the conflict happens when BMO and cluster-cloud-controller-manager-operator are deployed on the same master: both use the host network and both allocate the same port (9440) for their health checks. I've been able to replicate the issue consistently by forcing the two operators to land on the same node.
Resolved by https://github.com/openshift/cluster-baremetal-operator/pull/180
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759