2037680 – [IPI on Alibabacloud] sometimes operator 'cloud-controller-manager' tells empty VERSION, due to conflicts on listening tcp :8080



Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	dmoiseev
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2064837

Reported:	2022-01-06 10:14 UTC by Jianli Wei
Modified:	2022-04-11 08:33 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2037689
Environment:
Last Closed:	2022-03-10 16:37:12 UTC
Target Upstream Version:
Embargoed:
Flags:	miyadav: needinfo+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-cloud-controller-manager-operator pull 164	0	None	open	Bug 2037680: Fix CCCMO metric ports configuration	2022-01-06 11:28:44 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:37:22 UTC

Description Jianli Wei 2022-01-06 10:14:13 UTC

Version:
./openshift-install 4.10.0-0.nightly-2022-01-05-052228
built from commit 22d874c8d0751d5645de95121662e32d17d6eada
release image registry.ci.openshift.org/ocp/release@sha256:934dfba08338fbb64926f77950ab69d1fe23d5e1efe3f4ed66aa1740bb181c72
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?
The operator 'cloud-controller-manager' does not report the expected VERSION.

What did you expect to happen?
It should report the expected VERSION, like all other operators.

How to reproduce it (as minimally and precisely as possible)?
Not sure; it happens intermittently. We have hit the issue 3 times so far.

Anything else we need to know?
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-05-052228   True        False         64m     Error while reconciling 4.10.0-0.nightly-2022-01-05-052228: cloud-controller-manager has an unknown error: ClusterOperatorUpdating
$ oc get co | grep -Ev '4.10.0-0.nightly-2022-01-05-052228   True        False         False'
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
>cloud-controller-manager                                                        True        False         False      96m     
etcd                                       4.10.0-0.nightly-2022-01-05-052228   True        True          False      92m     NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 5; 1 nodes are at revision 9
kube-scheduler                             4.10.0-0.nightly-2022-01-05-052228   True        True          False      91m     NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 5; 1 nodes are at revision 7
$ oc get pods -n openshift-cloud-controller-manager-operator -o wide
NAME                                                         READY   STATUS             RESTARTS         AGE   IP           NODE                       NOMINATED NODE   READINESS GATES
cluster-cloud-controller-manager-operator-7bbb479445-fk44b   1/2     CrashLoopBackOff   19 (3m11s ago)   79m   10.0.0.212   jiwei-405-j8w4h-master-0   <none>           <none>
$ oc -n openshift-cloud-controller-manager-operator logs cluster-cloud-controller-manager-operator-7bbb479445-fk44b -c cluster-cloud-controller-manager
I0106 09:49:25.929382       1 request.go:665] Waited for 1.047151096s due to client-side throttling, not priority and fairness, request: GET:https://api-int.jiwei-405.alicloud-qe.devcluster.openshift.com:6443/apis/template.openshift.io/v1?timeout=32s
I0106 09:49:27.082311       1 logr.go:249] CCMOperator/controller-runtime/metrics "msg"="Metrics server is starting to listen"  "addr"=":8080"
E0106 09:49:27.082641       1 logr.go:265] CCMOperator/controller-runtime/metrics "msg"="metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts" "error"="error listening on :8080: listen tcp :8080: bind: address already in use"  
>E0106 09:49:27.082702       1 logr.go:265] CCMOperator/setup "msg"="unable to start manager" "error"="error listening on :8080: listen tcp :8080: bind: address already in use"  
$ oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
jiwei-405-j8w4h-master-0                  Ready    master   95m   v1.22.1+6859754
jiwei-405-j8w4h-master-1                  Ready    master   74m   v1.22.1+6859754
jiwei-405-j8w4h-master-2                  Ready    master   97m   v1.22.1+6859754
jiwei-405-j8w4h-worker-us-east-1a-cvvhj   Ready    worker   85m   v1.22.1+6859754
jiwei-405-j8w4h-worker-us-east-1b-qgngd   Ready    worker   85m   v1.22.1+6859754
$ 

$ oc debug node/jiwei-405-j8w4h-master-0
Starting pod/jiwei-405-j8w4h-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.212
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# netstat -lnpt | grep 8080
tcp6       0      0 :::8080                 :::*                    LISTEN      32690/alibaba-cloud 
sh-4.4# ps -ef | grep 32690
root       32690   32668  0 08:33 ?        00:00:02 /bin/alibaba-cloud-controller-manager --allow-untagged-cloud=true --leader-elect=true --leader-elect-lease-duration=137s --leader-elect-renew-deadline=107s --leader-elect-retry-period=26s --leader-elect-resource-namespace=openshift-cloud-controller-manager --cloud-provider=alicloud --use-service-account-credentials=true --cloud-config=/etc/alibaba/config/cloud-config.conf --feature-gates=ServiceNodeExclusion=true --configure-cloud-routes=false --allocate-node-cidrs=false
root      140671  140444  0 09:53 ?        00:00:00 grep 32690
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
$
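
The log and `netstat` output above point to a plain port collision: `alibaba-cloud-controller-manager` already listens on :8080 on master-0, so the operator's metrics server cannot bind the same port. The following is a minimal Python sketch of that failure mode only. It is not the operator's code; the helper names `try_listen` and `demo_conflict`, the loopback address, and the OS-chosen port are assumptions made so the example can run safely on any machine.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Try to listen on host:port.

    Returns (socket, None) on success or (None, errno_value) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def demo_conflict():
    """Bind one listener, then try to bind a second one to the same port."""
    # Port 0 asks the OS for any free port (stands in for :8080 in the bug).
    first, err = try_listen(0)
    if first is None:
        raise RuntimeError(f"could not open the first listener: errno {err}")
    port = first.getsockname()[1]
    try:
        # The second listener plays the role of the operator's metrics server.
        second, err = try_listen(port)
        if second is not None:
            second.close()
        return port, err
    finally:
        first.close()


if __name__ == "__main__":
    port, err = demo_conflict()
    name = errno.errorcode.get(err, "no error") if err is not None else "no error"
    print(f"second bind on port {port}: {name}")
```

The second bind fails for as long as the first listener holds the port, which would explain why the operator container keeps restarting into CrashLoopBackOff rather than recovering on its own.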

Comment 2 Joel Speed 2022-01-06 10:47:15 UTC

Comment 3 Joel Speed 2022-01-06 10:48:23 UTC

@dmoiseev Could you please add a new port to the port registry https://github.com/openshift/enhancements/blob/master/dev-guide/host-port-registry.md for the config sync controller (I'd suggest 10260), and then make sure that the config sync controller uses the assigned port for its metrics listener?
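
The idea behind a host port registry can be sketched as a uniqueness check over port assignments. The snippet below is illustrative only: the function `find_port_conflicts` and the component labels are hypothetical, 8080 is taken from the log and `netstat` output in the description, and 10260 is the value suggested in this comment, not necessarily the port that was finally assigned.

```python
from collections import defaultdict


def find_port_conflicts(assignments):
    """Return {port: [components]} for every port claimed more than once."""
    by_port = defaultdict(list)
    for component, port in assignments.items():
        by_port[port].append(component)
    return {
        port: sorted(names)
        for port, names in by_port.items()
        if len(names) > 1
    }


# Hypothetical labels; both listeners wanted :8080 on the same host.
before = {
    "alibaba-cloud-controller-manager": 8080,
    "operator-metrics-listener": 8080,
}

# Same components after moving the metrics listener to the suggested port.
after = {
    "alibaba-cloud-controller-manager": 8080,
    "operator-metrics-listener": 10260,
}

if __name__ == "__main__":
    print("before:", find_port_conflicts(before))
    print("after:", find_port_conflicts(after))
```

A check like this only catches collisions between ports that are actually recorded, which is why the request is to register the port first and then configure the listener to match it.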

Comment 6 Milind Yadav 2022-01-07 12:23:08 UTC


Validated on nightly - 4.10.0-0.nightly-2022-01-07-050246


oc get co | grep controller
cloud-controller-manager                   4.10.0-0.nightly-2022-01-07-050246   True        False         False      77m     


oc logs cluster-cloud-controller-manager-operator-b6686989f-cjzb8 -c config-sync-controllers | less
.
.
I0107 10:57:21.717684       1 internal.go:362] CCCMOConfigSyncControllers "msg"="Starting server" "addr"={"IP":"127.0.0.1","Port":9260,"Zone":""} "kind"="health probe" 
I0107 10:57:21.718086       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-config-sync-leader...
I0107 10:57:21.727969       1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager-operator/cluster-cloud-config-sync-leader
I0107 10:57:21.728276       1 controller.go:178] CCCMOConfigSyncControllers/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"="kind source: *v1.ConfigMap"
I0107 10:57:21.728338       1 controller.go:178] CCCMOConfigSyncControllers/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"="kind source: *v1.Infrastructure"
.
.


Additional info:
The cluster was not fully deployed, but does this port change look good?
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          81m     Working towards 4.10.0-0.nightly-2022-01-07-050246: 665 of 766 done (86% complete)

Will add must-gather in a while

Comment 8 Joel Speed 2022-01-10 10:48:40 UTC

I've reviewed the attached must-gather and I think the port changes are OK. I'm confident that this has resolved the issue reported in this bug. Please move to verified.

Comment 9 Milind Yadav 2022-01-10 11:32:05 UTC

Thanks @Joel

Comment 12 errata-xmlrpc 2022-03-10 16:37:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
