Bug 1983975
| Field | Value |
|---|---|
| Summary | BMO fails to start with port conflict |
| Product | OpenShift Container Platform |
| Component | Bare Metal Hardware Provisioning |
| Sub component | cluster-baremetal-operator |
| Version | 4.9 |
| Target Release | 4.9.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Steven Hardy <shardy> |
| Assignee | Andrea Fasano <afasano> |
| QA Contact | Raviv Bar-Tal <rbartal> |
| CC | afasano, aos-bugs, derekh, rbartal, stbenjam, tsedovic, zbitter |
| Keywords | Triaged |
| Flags | afasano: needinfo- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2021-10-18 17:40:26 UTC |

Environment (failing CI jobs):

- job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-virtualmedia=all
- job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-serial-ipv4=all
- job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all
- job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all
Description
Steven Hardy, 2021-07-20 10:34:04 UTC
BMO runs with host networking, but CAPBM does not, so CAPBM shouldn't be causing the problem. It's more likely that some new operator running with host networking has been added, and neither it nor BMO has registered port 9440 in the registry. I'd assume that basically every operator uses port 9440, since it is the default produced by a code generator. I've seen this also:
```
{"level":"error","ts":1626796639.1610942,"logger":"setup","msg":"unable to start manager","error":"error listening on :9440: listen tcp :9440: bind: address already in use","stacktrace":"main.main\n\t/go/src/github.com/metal3-io/baremetal-operator/main.go:134\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}
```
```
[root@master-2 core]# netstat -apn | grep 9440
tcp6       0      0 :::9440                 :::*                    LISTEN      4643/cluster-contro
tcp6       0      0 fd01:0:0:1::2:59748     fd01:0:0:1::15:9440     TIME_WAIT   -
[root@master-2 core]# ps -ef | grep 4632
root        4632       1  0 15:40 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -c e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-cloud-controller-manager-operator_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_7f78554b-e95e-4291-a548-5107b53bb528/cluster-cloud-controller-manager/0.log --log-level info -n k8s_cluster-cloud-controller-manager_cluster-cloud-controller-manager-operator-5c6bd885fd-2m8fh_openshift-cloud-controller-manager-operator_7f78554b-e95e-4291-a548-5107b53bb528_0 -P /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u e8b6641246eb3168e1a0e4277dedf8ca684219aaa40cb61a5d6b1bec0e2c6eba -s
root        4643    4632  0 15:40 ?        00:00:01 /cluster-controller-manager-operator --leader-elect --images-json=/etc/cloud-controller-manager-config/images.json
[root@master-2 core]# nc -l localhost 9440
Ncat: bind to ::1:9440: Address already in use. QUITTING.
```
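The failure mode above is ordinary TCP port contention: with host networking, both operators share the node's port namespace, so whichever binds :9440 first wins and the other fails exactly as in the BMO log. A minimal sketch with Go's standard library (illustrative only — the addresses and helper name here are not the operators' actual configuration, and `:0` is used instead of a hard-coded 9440 so the example doesn't collide with anything on the test machine):

```go
package main

import (
	"fmt"
	"net"
)

// secondBind simulates two host-network operators racing for the same
// health-probe port: it binds a port once, then tries to bind the very
// same address again and returns the error from the second attempt.
func secondBind() error {
	// First "operator" grabs a port. ":0" lets the OS pick a free one;
	// in the bug, both operators hard-code :9440.
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer first.Close()

	// Second "operator" tries the identical address. Because both share
	// the host's port namespace, this fails while the first listener lives.
	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		second.Close()
	}
	return err
}

func main() {
	// On Linux the error wraps EADDRINUSE, the same
	// "bind: address already in use" seen in the BMO log.
	fmt.Println(secondBind())
}
```

The fix direction follows from the same observation: as long as both processes use host networking, one of them has to move its health-check endpoint to a port that is not already claimed on the node.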
As already pointed out by @zbitter and then identified by @derekh, the conflict happens when both BMO and cluster-cloud-controller-manager-operator are deployed on the same master: both use the host network and allocate the same port (9440) for their health checks. I've been able to replicate the issue consistently by forcing the two operators to land on the same node.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759