I have cloned the original BZ because there are two issues at play here. This BZ focuses on the issue with the Alibaba CCM. Normally, a CCM hosts all metrics and health endpoints on port 10258. The Alibaba CCM splits these so that health is served on 10258 and metrics on 8080. 8080 is an unregistered port and should not be used. I would like Alibaba to investigate moving the metrics and health endpoints onto the same listener so that they share port 10258 and conform to the standard set by the rest of the CCMs.
I've had a quick look through to see how this could be done. Controller Runtime allows you to add extra HTTP handlers to the metrics bind. My suggestion would be to check whether the metrics and health port flags you have are the same and, if they are, set the metrics server up in the controller runtime manager and add the health check as an extra handler on that server. If you look at how AddHealthzCheck works, it shows how to construct a Healthz handler, so it should be pretty straightforward.
The metrics exposed by the Alibaba CCM are limited, so we plan not to expose the metrics port by default. Thanks for your suggestion, I will investigate how to move the metrics and health endpoints to the same listener on 10258. It may be completed in the next release.
Validated on:

[miyadav@miyadav alicloud]$ oc get clusterversion --kubeconfig config
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-10-014106   True        False         14m     Cluster version is 4.10.0-0.nightly-2022-01-10-014106

Steps:
1. Port 8080 for metrics is no longer exposed, as seen from the logs (compared them to earlier installations when the fix was not present).

oc logs alibaba-cloud-controller-manager-7dd8499f4b-7rrnm -n openshift-cloud-controller-manager --kubeconfig config
. . . .
I0110 12:21:29.109046 1 main.go:40] "msg"="Version of operator-sdk: v0.19.4"
I0110 12:21:30.160271 1 request.go:665] Waited for 1.040726849s due to client-side throttling, not priority and fairness, request: GET:https://api-int.miyadav-01j.alicloud-qe.devcluster.openshift.com:6443/apis/operators.coreos.com/v2?timeout=32s
I0110 12:21:30.366530 1 clientMgr.go:199] clientMgr "msg"="use ram role mode to get token"
I0110 12:21:30.374531 1 clientMgr.go:176] clientMgr "msg"="wait for Token ready"
I0110 12:21:30.374597 1 main.go:83] "msg"="Registering Components."
I0110 12:21:30.374711 1 main.go:88] "msg"="Loaded controllers: [node route service]"
I0110 12:21:30.374725 1 main.go:92] "msg"="Starting the Cmd."
I0110 12:21:30.374891 1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager/ccm...
I0110 12:21:30.387748 1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager/ccm
I0110 12:21:30.387976 1 controller.go:178] controller/service-controller "msg"="Starting EventSource" "source"="kind source: *v1.Service"
I0110 12:21:30.388070 1 controller.go:178] controller/service-controller "msg"="Starting EventSource" "source"="kind source: *v1.Endpoints"
I0110 12:21:30.388088 1 controller.go:178] controller/service-controller "msg"="Starting EventSource" "source"="kind source: *v1.Node"
I0110 12:21:30.388099 1 controller.go:186] controller/service-controller "msg"="Starting Controller"
I0110 12:21:30.387977 1 controller.go:178] controller/node-controller "msg"="Starting EventSource" "source"="kind source: *v1.Node"
. . .

Additional info: logs without the fix:
. . . .
I0110 10:24:31.992827 1 main.go:36] "msg"="Cloud Controller Manager Version: v1.9.3.376-g5c84e19-aliyun-217-gd3779d52d, git commit: d3779d52d51f5b1937d4ccde7d7440437d9c690a, build date: 2022-01-01T01:16:11+0000"
I0110 10:24:31.992844 1 main.go:38] "msg"="Go Version: go1.17.2"
I0110 10:24:31.992866 1 main.go:39] "msg"="Go OS/Arch: linux/amd64"
I0110 10:24:31.992871 1 main.go:40] "msg"="Version of operator-sdk: v0.19.4"
I0110 10:24:33.043757 1 request.go:665] Waited for 1.03241868s due to client-side throttling, not priority and fairness, request: GET:https://api-int.miyadav-0110.alicloud-qe.devcluster.openshift.com:6443/apis/autoscaling/v1?timeout=32s
I0110 10:24:33.247022 1 deleg.go:130] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I0110 10:24:33.250468 1 clientMgr.go:199] clientMgr "msg"="use ram role mode to get token"
I0110 10:24:33.258126 1 clientMgr.go:176] clientMgr "msg"="wait for Token ready"
I0110 10:24:33.258198 1 main.go:83] "msg"="Registering Components."
. . .

Moving to VERIFIED based on the above.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056