Bug 2037689

Summary: [IPI on Alibabacloud] sometimes operator 'cloud-controller-manager' tells empty VERSION, due to conflicts on listening tcp :8080
Product: OpenShift Container Platform Reporter: Joel Speed <jspeed>
Component: Cloud Compute Assignee: jigu
Cloud Compute sub component: Cloud Controller Manager QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aos-bugs, gpei, jiwei, zhsun
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2037680 Environment:
Last Closed: 2022-03-10 16:37:12 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 1 Joel Speed 2022-01-06 10:46:19 UTC
Have cloned the original BZ as there are two issues at play here.

This BZ is to focus on the issue with the Alibaba CCM.

Normally, a CCM hosts all metrics and health endpoints on port 10258.
The Alibaba CCM splits these so that health is hosted on 10258 and metrics are on 8080.

8080 is an unregistered port and should not be used.

I would like Alibaba to investigate moving both the metrics and health endpoints to the same listener so that they share port 10258 and conform to the standard set by the rest of the CCMs.
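
For illustration only: the bug summary attributes the failure to a conflict on listening tcp :8080. The Go sketch below shows what such a conflict looks like at the socket level, assuming a second listener tries to bind an address that is already in use. It binds an ephemeral loopback port rather than 8080, and the helper name tryDoubleBind is hypothetical, not taken from the CCM.

```go
package main

import (
	"fmt"
	"net"
)

// tryDoubleBind (hypothetical helper) listens on an ephemeral loopback port,
// then attempts a second bind to the same address while the first listener is
// still open. It returns the error from the second attempt, or nil if the
// second bind unexpectedly succeeded.
func tryDoubleBind() error {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return fmt.Errorf("first bind failed unexpectedly: %w", err)
	}
	defer first.Close()

	// Same address and port as the first listener: this is the conflict.
	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		second.Close()
		return nil
	}
	return err
}

func main() {
	fmt.Println("second bind:", tryDoubleBind())
}
```

A component that fails this way cannot start its server on that port, which is why keeping every endpoint on the one agreed port (10258 here) avoids the problem.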

Comment 2 Joel Speed 2022-01-06 10:59:32 UTC
I've had a quick look at how this could be done. Controller Runtime allows you to add extra HTTP handlers to the metrics bind. My suggestion would be to check whether the metrics and health port flags you have are the same and, if they are, set the metrics server up in the controller runtime manager and add the health check as an extra handler to that server. If you look at how AddHealthzCheck works, it shows how to construct a Healthz handler, so it should be pretty straightforward.

Comment 3 jigu 2022-01-07 02:50:20 UTC
The metrics exposed by the Alibaba CCM are limited, so we plan not to expose the metrics port by default. Thanks for your suggestion; I will investigate how to move the metrics and health endpoints to the same listener on 10258. This may be completed in the next release.

Comment 6 Milind Yadav 2022-01-10 13:03:28 UTC
Validated on - 
[miyadav@miyadav alicloud]$ oc get clusterversion --kubeconfig config 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-10-014106   True        False         14m     Cluster version is 4.10.0-0.nightly-2022-01-10-014106


Steps:
1. Port 8080 for metrics is no longer exposed, as seen from the logs (compared with earlier installations where the fix was not present).

oc logs alibaba-cloud-controller-manager-7dd8499f4b-7rrnm -n openshift-cloud-controller-manager --kubeconfig config
.
.
.
.
I0110 12:21:29.109046       1 main.go:40]  "msg"="Version of operator-sdk: v0.19.4"  
I0110 12:21:30.160271       1 request.go:665] Waited for 1.040726849s due to client-side throttling, not priority and fairness, request: GET:https://api-int.miyadav-01j.alicloud-qe.devcluster.openshift.com:6443/apis/operators.coreos.com/v2?timeout=32s
I0110 12:21:30.366530       1 clientMgr.go:199] clientMgr "msg"="use ram role mode to get token"  
I0110 12:21:30.374531       1 clientMgr.go:176] clientMgr "msg"="wait for Token ready"  
I0110 12:21:30.374597       1 main.go:83]  "msg"="Registering Components."  
I0110 12:21:30.374711       1 main.go:88]  "msg"="Loaded controllers: [node route service]"  
I0110 12:21:30.374725       1 main.go:92]  "msg"="Starting the Cmd."  
I0110 12:21:30.374891       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager/ccm...
I0110 12:21:30.387748       1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager/ccm
I0110 12:21:30.387976       1 controller.go:178] controller/service-controller "msg"="Starting EventSource"  "source"="kind source: *v1.Service"
I0110 12:21:30.388070       1 controller.go:178] controller/service-controller "msg"="Starting EventSource"  "source"="kind source: *v1.Endpoints"
I0110 12:21:30.388088       1 controller.go:178] controller/service-controller "msg"="Starting EventSource"  "source"="kind source: *v1.Node"
I0110 12:21:30.388099       1 controller.go:186] controller/service-controller "msg"="Starting Controller"  
I0110 12:21:30.387977       1 controller.go:178] controller/node-controller "msg"="Starting EventSource"  "source"="kind source: *v1.Node"
.
.
.

Additional info:

Logs without the fix:
.
.
.
.
I0110 10:24:31.992827       1 main.go:36]  "msg"="Cloud Controller Manager Version: v1.9.3.376-g5c84e19-aliyun-217-gd3779d52d, git commit: d3779d52d51f5b1937d4ccde7d7440437d9c690a, build date: 2022-01-01T01:16:11+0000"
I0110 10:24:31.992844       1 main.go:38]  "msg"="Go Version: go1.17.2"
I0110 10:24:31.992866       1 main.go:39]  "msg"="Go OS/Arch: linux/amd64"
I0110 10:24:31.992871       1 main.go:40]  "msg"="Version of operator-sdk: v0.19.4"
I0110 10:24:33.043757       1 request.go:665] Waited for 1.03241868s due to client-side throttling, not priority and fairness, request: GET:https://api-int.miyadav-0110.alicloud-qe.devcluster.openshift.com:6443/apis/autoscaling/v1?timeout=32s
I0110 10:24:33.247022       1 deleg.go:130] controller-runtime/metrics "msg"="Metrics server is starting to listen"  "addr"=":8080"
I0110 10:24:33.250468       1 clientMgr.go:199] clientMgr "msg"="use ram role mode to get token"
I0110 10:24:33.258126       1 clientMgr.go:176] clientMgr "msg"="wait for Token ready"
I0110 10:24:33.258198       1 main.go:83]  "msg"="Registering Components."
.
.
.

Moving to VERIFIED based on the above.

Comment 9 errata-xmlrpc 2022-03-10 16:37:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056