Bug 2056387 - [IPI on Alibabacloud][RHEL scaleup] new RHEL workers were not added into the backend of Ingress SLB automatically
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Brian Lu
QA Contact: Jianli Wei
Docs Contact: Jeana Routh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-21 06:41 UTC by Jianli Wei
Modified: 2023-09-18 04:32 UTC
CC: 10 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Unclear status, see https://bugzilla.redhat.com/show_bug.cgi?id=2056387#c41
Proposed doc text:
* When scaling up an Alibaba Cloud cluster with {op-system-base} compute nodes, the new nodes show as `Ready`, but the Ingress pods do not transition to `Running` on these nodes. As a result, the scale-up operation does not succeed. As a workaround, you can perform a scale-up operation with {op-system} compute nodes. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2056387[*BZ#2056387*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:47:08 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/machine-config-operator pull 3338 (open), "Bug 2056387: fix alibaba kubelet node name unit", last updated 2022-09-19 15:13:21 UTC
- Red Hat Product Errata: RHSA-2022:7399, last updated 2023-01-17 19:47:32 UTC

Description Jianli Wei 2022-02-21 06:41:43 UTC
Version:
./openshift-install 4.10.0-0.nightly-2022-02-17-234353
built from commit 5349373764f2957b75448b17005bcc1c1b9a9e8e
release image registry.ci.openshift.org/ocp/release@sha256:5a958e2cea284e33c391dd15383821dd4cfefa747a0fd811f1ea702f1d147870
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI 

What happened?
After scaling up with 2 RHEL compute nodes and setting the ingress replicas to 4, the 2 new pods failed to be scheduled and remained Pending.

What did you expect to happen?
The 2 new router-default pods should be scheduled onto the 2 RHEL compute nodes and become Running, and the nodes should be added into the vserver groups of the ingress SLB.

How to reproduce it (as minimally and precisely as possible)?
Always.

Anything else we need to know?
>FYI the flexy-install job and the rhel-scaleup job:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/78364/
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp4-rhel-scaleup/13679/

>after IPI installation:
$ oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
jiwei-103-9b2qw-master-0                  Ready    master   30m   v1.23.3+5642c2c
jiwei-103-9b2qw-master-1                  Ready    master   32m   v1.23.3+5642c2c
jiwei-103-9b2qw-master-2                  Ready    master   30m   v1.23.3+5642c2c
jiwei-103-9b2qw-worker-us-east-1a-kqjrz   Ready    worker   19m   v1.23.3+5642c2c
jiwei-103-9b2qw-worker-us-east-1b-ntwlp   Ready    worker   20m   v1.23.3+5642c2c
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-17-234353   True        False         9m4s    Cluster version is 4.10.0-0.nightly-2022-02-17-234353
$ oc -n openshift-ingress get service router-default
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
router-default   LoadBalancer   172.30.168.75   47.253.103.23   80:31988/TCP,443:31045/TCP   29m
$ oc -n openshift-ingress get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE                                      NOMINATED NODE   READINESS GATES
router-default-bbb87465-fsn7z   1/1     Running   0          29m   10.128.2.5    jiwei-103-9b2qw-worker-us-east-1a-kqjrz   <none>           <none>
router-default-bbb87465-vr6wf   1/1     Running   0          29m   10.131.0.11   jiwei-103-9b2qw-worker-us-east-1b-ntwlp   <none>           <none>
$ 

>after scaleup with 2 RHEL nodes:
$ oc get nodes
NAME                                      STATUS   ROLES    AGE     VERSION
jiwei-103-9b2qw-master-0                  Ready    master   42m     v1.23.3+5642c2c
jiwei-103-9b2qw-master-1                  Ready    master   43m     v1.23.3+5642c2c
jiwei-103-9b2qw-master-2                  Ready    master   42m     v1.23.3+5642c2c
jiwei-103-9b2qw-rhel-worker-0             Ready    worker   5m      v1.23.3+5642c2c
jiwei-103-9b2qw-rhel-worker-1             Ready    worker   4m59s   v1.23.3+5642c2c
jiwei-103-9b2qw-worker-us-east-1a-kqjrz   Ready    worker   31m     v1.23.3+5642c2c
jiwei-103-9b2qw-worker-us-east-1b-ntwlp   Ready    worker   32m     v1.23.3+5642c2c
$ oc -n openshift-ingress get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE                                      NOMINATED NODE   READINESS GATES
router-default-bbb87465-fsn7z   1/1     Running   0          40m   10.128.2.5    jiwei-103-9b2qw-worker-us-east-1a-kqjrz   <none>           <none>
router-default-bbb87465-vr6wf   1/1     Running   0          40m   10.131.0.11   jiwei-103-9b2qw-worker-us-east-1b-ntwlp   <none>           <none>
$ oc get -o yaml deployment/router-default -n openshift-ingress | grep replicas
  replicas: 2
  replicas: 2
$ 

>after setting ingress replicas to 4, the 2 new pods stay Pending:
$ oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 4}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
$ oc get -o yaml deployment/router-default -n openshift-ingress | grep replicas
  replicas: 4
  replicas: 4
$ oc -n openshift-ingress get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE                                      NOMINATED NODE   READINESS GATES
router-default-bbb87465-fsn7z   1/1     Running   0          45m     10.128.2.5    jiwei-103-9b2qw-worker-us-east-1a-kqjrz   <none>           <none>
router-default-bbb87465-tf6rs   0/1     Pending   0          3m44s   <none>        <none>                                    <none>           <none>
router-default-bbb87465-tg5z8   0/1     Pending   0          3m44s   <none>        <none>                                    <none>           <none>
router-default-bbb87465-vr6wf   1/1     Running   0          45m     10.131.0.11   jiwei-103-9b2qw-worker-us-east-1b-ntwlp   <none>           <none>
$ 
$ oc -n openshift-ingress get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP            NODE                                      NOMINATED NODE   READINESS GATES
router-default-bbb87465-fsn7z   1/1     Running   0          111m   10.128.2.5    jiwei-103-9b2qw-worker-us-east-1a-kqjrz   <none>           <none>
router-default-bbb87465-tf6rs   0/1     Pending   0          69m    <none>        <none>                                    <none>           <none>
router-default-bbb87465-tg5z8   0/1     Pending   0          69m    <none>        <none>                                    <none>           <none>
router-default-bbb87465-vr6wf   1/1     Running   0          111m   10.131.0.11   jiwei-103-9b2qw-worker-us-east-1b-ntwlp   <none>           <none>
$ oc -n openshift-ingress describe pod router-default-bbb87465-tf6rs
......
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  69m                 default-scheduler  0/7 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  12m (x65 over 68m)  default-scheduler  0/7 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
$ 

>FYI when scaling up using a machineset and an RHCOS compute node, the new pod does get scheduled onto the new node as expected:
$ oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
jiwei-102-8zh8g-master-0                  Ready    master   34m   v1.23.3+5642c2c
jiwei-102-8zh8g-master-1                  Ready    master   34m   v1.23.3+5642c2c
jiwei-102-8zh8g-master-2                  Ready    master   33m   v1.23.3+5642c2c
jiwei-102-8zh8g-worker-us-east-1a-lswcd   Ready    worker   23m   v1.23.3+5642c2c
jiwei-102-8zh8g-worker-us-east-1b-qvtml   Ready    worker   20m   v1.23.3+5642c2c
$ oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                      NOMINATED NODE   READINESS GATES
router-default-6bd4b85466-9nqkc   1/1     Running   0          31m   10.131.0.7   jiwei-102-8zh8g-worker-us-east-1a-lswcd   <none>           <none>
router-default-6bd4b85466-vtzfp   1/1     Running   0          31m   10.128.2.6   jiwei-102-8zh8g-worker-us-east-1b-qvtml   <none>           <none>
$ oc scale machineset jiwei-102-8zh8g-worker-us-east-1a --replicas=2 -n openshift-machine-api
machineset.machine.openshift.io/jiwei-102-8zh8g-worker-us-east-1a scaled
$ oc get machineset -n openshift-machine-api
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
jiwei-102-8zh8g-worker-us-east-1a   2         2         2       2           45m
jiwei-102-8zh8g-worker-us-east-1b   1         1         1       1           45m
$ oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
jiwei-102-8zh8g-master-0                  Ready    master   44m   v1.23.3+5642c2c
jiwei-102-8zh8g-master-1                  Ready    master   44m   v1.23.3+5642c2c
jiwei-102-8zh8g-master-2                  Ready    master   43m   v1.23.3+5642c2c
jiwei-102-8zh8g-worker-us-east-1a-2lmvp   Ready    worker   50s   v1.23.3+5642c2c
jiwei-102-8zh8g-worker-us-east-1a-lswcd   Ready    worker   33m   v1.23.3+5642c2c
jiwei-102-8zh8g-worker-us-east-1b-qvtml   Ready    worker   30m   v1.23.3+5642c2c
$ oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
$ oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                      NOMINATED NODE   READINESS GATES
router-default-6bd4b85466-9nqkc   1/1     Running   0          43m   10.131.0.7   jiwei-102-8zh8g-worker-us-east-1a-lswcd   <none>           <none>
router-default-6bd4b85466-vtzfp   1/1     Running   0          43m   10.128.2.6   jiwei-102-8zh8g-worker-us-east-1b-qvtml   <none>           <none>
router-default-6bd4b85466-z9nsn   1/1     Running   0          26s   10.129.2.6   jiwei-102-8zh8g-worker-us-east-1a-2lmvp   <none>           <none>
$

Comment 1 jigu 2022-03-04 10:43:20 UTC
# wrong providerID
$ kubectl get node jiwei-509-bvl6f-rhel-0 -oyaml|grep -i providerid
  providerID: alicloud://

# correct providerID
$ kubectl get node jiwei-509-bvl6f-worker-us-east-1a-q4d4s -oyaml|grep -i providerid
  providerID: alicloud://us-east-1.i-0xi4lwm4mnibodrbga84

The RHEL nodes have a wrong providerID; the expected format is like that of node jiwei-509-bvl6f-worker-us-east-1a-q4d4s.
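
A one-shot way to compare this field across every node, instead of grepping node by node (custom-columns is standard oc/kubectl syntax; the column names are just labels):

$ oc get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID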

Comment 2 Jianli Wei 2022-03-04 10:49:59 UTC
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-03-03-115552   True        False         47m     Error while reconciling 4.10.0-0.nightly-2022-03-03-115552: the cluster operator ingress has not yet successfully rolled out
$ oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
jiwei-509-bvl6f-master-0                  Ready    master   66m   v1.23.3+e419edf
jiwei-509-bvl6f-master-1                  Ready    master   65m   v1.23.3+e419edf
jiwei-509-bvl6f-master-2                  Ready    master   65m   v1.23.3+e419edf
jiwei-509-bvl6f-rhel-0                    Ready    worker   31m   v1.23.3+e419edf
jiwei-509-bvl6f-rhel-1                    Ready    worker   31m   v1.23.3+e419edf
jiwei-509-bvl6f-worker-us-east-1a-q4d4s   Ready    worker   56m   v1.23.3+e419edf
jiwei-509-bvl6f-worker-us-east-1b-lr22m   Ready    worker   53m   v1.23.3+e419edf
$ oc get -o yaml deployment/router-default -n openshift-ingress | grep replicas
  replicas: 4
  replicas: 4
$ oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE                                      NOMINATED NODE   READINESS GATES
router-default-5d5c4466b8-4sc9t   1/1     Running   0          63m   10.128.2.5    jiwei-509-bvl6f-worker-us-east-1b-lr22m   <none>           <none>
router-default-5d5c4466b8-bkskv   0/1     Pending   0          28m   <none>        <none>                                    <none>           <none>
router-default-5d5c4466b8-j8ktf   1/1     Running   0          63m   10.131.0.11   jiwei-509-bvl6f-worker-us-east-1a-q4d4s   <none>           <none>
router-default-5d5c4466b8-n2rmx   0/1     Pending   0          28m   <none>        <none>                                    <none>           <none>
$ oc get node jiwei-509-bvl6f-worker-us-east-1a-q4d4s -oyaml | grep -i providerid
  providerID: alicloud://us-east-1.i-0xi4lwm4mnibodrbga84
$ oc get node jiwei-509-bvl6f-rhel-0 -oyaml | grep -i providerid
  providerID: alicloud://
$ 

FYI the QE flexy jobs:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/81943/
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp4-rhel-scaleup/13874/

Comment 3 Maciej Szulik 2022-03-04 14:41:23 UTC
I'm not quite sure what you expect to happen here. If the problem is that ingress does not bump the replicas after adding the node, then why that happened is a question for the networking team.
If you're asking about the scheduling issues, this warning:

Warning  FailedScheduling  12m (x65 over 68m)  default-scheduler  0/7 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

provides the reason. It looks like the new nodes were not properly initialized.
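
For reference, a quick way to check which taints a given node still carries (node name taken from the report above; the -A1 is just in case a node has more than one taint):

$ oc describe node jiwei-103-9b2qw-rhel-worker-0 | grep -A1 Taints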

Comment 4 Jianli Wei 2022-03-08 08:40:28 UTC
(In reply to Maciej Szulik from comment #3)
> I'm not quite sure what you expect to happen here. If the problem is that
> ingress does not bump the replicas after adding the node, then why that
> happened is a question for the networking team.
> If you're asking about the scheduling issues, this warning:
> 
> Warning  FailedScheduling  12m (x65 over 68m)  default-scheduler  0/7 nodes
> are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had
> taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod
> didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: },
> that the pod didn't tolerate.
> 
> provides the reason. It looks like the new nodes were not properly
> initialized.

The expectation is, the 2 new pods should be scheduled onto the 2 new RHEL compute nodes and turn into Running status within reasonable time.

Comment 5 Maciej Szulik 2022-03-08 10:24:42 UTC
(In reply to Jianli Wei from comment #4)
> The expectation is, the 2 new pods should be scheduled onto the 2 new RHEL
> compute nodes and turn into Running status within reasonable time.

In that case you'd need to figure out why the 2 new RHEL nodes had the node.cloudprovider.kubernetes.io/uninitialized taint set to true, as shown in this message:

Warning  FailedScheduling  12m (x65 over 68m)  default-scheduler  0/7 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

My guess is that it has something to do with cloud provider initialization. I'm moving this over to the cloud team, but they will certainly ask you for logs from the cloud controller.

Comment 6 Joel Speed 2022-03-09 12:42:17 UTC
I have pinged the Alibaba CCM team to take a look at this issue; it seems that the CCM is not working as expected.

Do we have CCM logs that we can share with the Alibaba team?
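
Assuming the external Alibaba CCM runs in the openshift-cloud-controller-manager namespace, as external cloud controller managers do on other platforms (pod names vary per cluster, so list first, then dump):

$ oc -n openshift-cloud-controller-manager get pods
$ oc -n openshift-cloud-controller-manager logs <ccm-pod-name> > ccm.log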

Comment 7 Joel Speed 2022-03-10 12:02:05 UTC
Setting this to blocker+ right now as it appears we can't add workers to clusters after bootstrapping

Comment 8 Michael McCune 2022-04-22 13:28:28 UTC
@jspeed Curious if there is an update here; also, should this be assigned to an Alibaba engineer?

Comment 9 Joel Speed 2022-05-05 13:26:33 UTC
@gausingh Could you please help us to get the right eyes on this bug? Seems it will need some attention from both Alibaba and maybe MCO/Splat

Comment 11 Gaurav Singh 2022-05-26 02:56:54 UTC
@brlu can you please take a look at this bug? I am assigning it to you.

Comment 12 jigu 2022-05-26 09:51:46 UTC
(In reply to jigu from comment #1)
> # wrong providerID
> $ kubectl get node jiwei-509-bvl6f-rhel-0 -oyaml|grep -i providerid
>   providerID: alicloud://
> 
> # correct providerID
> $ kubectl get node jiwei-509-bvl6f-worker-us-east-1a-q4d4s -oyaml|grep -i
> providerid
>   providerID: alicloud://us-east-1.i-0xi4lwm4mnibodrbga84
> 
> The RHEL nodes have a wrong providerID; the expected format is like that of
> node jiwei-509-bvl6f-worker-us-east-1a-q4d4s.

The RHEL nodes have a wrong providerID. The node providerID is set by kubelet.service.
See the PRs:
https://github.com/openshift/machine-config-operator/pull/2777 
https://github.com/openshift/machine-config-operator/pull/2814

It seems that the providerID is not set correctly on the RHEL nodes.
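
For context, a minimal sketch of how a correct provider ID can be assembled on an Alibaba ECS instance, assuming the standard Alibaba instance metadata endpoint at 100.100.100.200 (the actual logic in the kubelet unit from those PRs may differ):

$ REGION=$(curl -s http://100.100.100.200/latest/meta-data/region-id)
$ INSTANCE_ID=$(curl -s http://100.100.100.200/latest/meta-data/instance-id)
$ echo "alicloud://${REGION}.${INSTANCE_ID}"
alicloud://us-east-1.i-0xi4lwm4mnibodrbga84

The bare alicloud:// on the RHEL nodes suggests that these two lookups (or their equivalent) resolved to empty strings when the kubelet's --provider-id flag was rendered.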

Comment 13 Joel Speed 2022-05-30 14:45:53 UTC
@jiwei Could you please help us create a cluster with RHEL nodes so that we can debug? It would be good to reproduce this and either share a kubeconfig or gather a must-gather and SOS reports from the broken nodes, so that we can inspect the state of the instances and check the files on disk.

Comment 14 Jianli Wei 2022-06-16 13:28:50 UTC
(In reply to Joel Speed from comment #13)
> @jiwei Could you please help us create a cluster with RHEL nodes so that
> we can debug? It would be good to reproduce this and either share a
> kubeconfig or gather a must-gather and SOS reports from the broken nodes,
> so that we can inspect the state of the instances and check the files on
> disk.

Sorry for the late reply. 

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/112560/artifact/workdir/install-dir/auth/kubeconfig

$ oc get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
jiwei-0616-12-56g6k-master-0                  Ready    master   89m   v1.24.0+cb71478
jiwei-0616-12-56g6k-master-1                  Ready    master   89m   v1.24.0+cb71478
jiwei-0616-12-56g6k-master-2                  Ready    master   89m   v1.24.0+cb71478
jiwei-0616-12-56g6k-rhel-0                    Ready    worker   25m   v1.24.0+25f9057
jiwei-0616-12-56g6k-worker-us-east-1a-9fq8h   Ready    worker   80m   v1.24.0+cb71478
jiwei-0616-12-56g6k-worker-us-east-1b-8srfw   Ready    worker   78m   v1.24.0+cb71478
$ oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP           NODE                                          NOMINATED NODE   READINESS GATES
router-default-78c788bf44-2tzbt   1/1     Running   0          86m     10.131.0.5   jiwei-0616-12-56g6k-worker-us-east-1a-9fq8h   <none>           <none>
router-default-78c788bf44-vfhjm   0/1     Pending   0          5m10s   <none>       <none>                                        <none>           <none>
router-default-78c788bf44-xbd6l   1/1     Running   0          86m     10.128.2.7   jiwei-0616-12-56g6k-worker-us-east-1b-8srfw   <none>           <none>
$ oc -n openshift-ingress describe pod router-default-78c788bf44-vfhjm | grep Warning
  Warning  FailedScheduling  5m1s  default-scheduler  0/6 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  43s   default-scheduler  0/6 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.
$

Comment 15 Jianli Wei 2022-06-16 13:31:57 UTC
FYI
$ oc get node jiwei-0616-12-56g6k-worker-us-east-1a-9fq8h -oyaml | grep -i providerid
  providerID: alicloud://us-east-1.i-0xiffvfapqsq6b2pknlz
$ oc get node jiwei-0616-12-56g6k-rhel-0 | grep -i providerid
$

Comment 16 Jianli Wei 2022-06-16 13:34:22 UTC
Sorry, please ignore comment #15; see below instead. Thanks.

$ oc get node jiwei-0616-12-56g6k-worker-us-east-1a-9fq8h -oyaml | grep -i providerid
  providerID: alicloud://us-east-1.i-0xiffvfapqsq6b2pknlz
$ oc get node jiwei-0616-12-56g6k-rhel-0 -oyaml | grep -i providerid
  providerID: alicloud://
$

Comment 17 Michael McCune 2022-07-14 16:58:36 UTC
We talked about this issue in our team standup today, and we are curious about the relationship to RHEL and whether we need someone from the node or RHEL team to join this conversation. We aren't quite sure why there would be a difference between RHEL and RHCOS.

Comment 25 Michael McCune 2022-08-29 15:21:15 UTC
@brlu @gausingh, the cloud team is discussing this bug. It seems we have some new data, but we aren't sure whether this requires a change on the Red Hat side or the Alibaba side. Any guidance?

Comment 31 Thomas Wiest 2022-09-15 13:40:53 UTC
Fixed in https://github.com/openshift/machine-config-operator/pull/3338

Comment 35 Jianli Wei 2022-09-30 05:00:55 UTC
Tested with a build containing the PR (see https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-2-modern/1575677383736823808), and it works well.

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-09-30-025100-ci-ln-sqd1vnb-latest   True        False         25m     Cluster version is 4.12.0-0.ci.test-2022-09-30-025100-ci-ln-sqd1vnb-latest
$ 
$ oc get nodes
NAME                                          STATUS   ROLES                  AGE   VERSION
jiwei-0930-01-dck5b-master-0                  Ready    control-plane,master   66m   v1.24.0+8c7c967
jiwei-0930-01-dck5b-master-1                  Ready    control-plane,master   62m   v1.24.0+8c7c967
jiwei-0930-01-dck5b-master-2                  Ready    control-plane,master   65m   v1.24.0+8c7c967
jiwei-0930-01-dck5b-rhel-0                    Ready    worker                 13m   v1.24.0+8c7c967
jiwei-0930-01-dck5b-worker-us-east-1a-82zdp   Ready    worker                 32m   v1.24.0+8c7c967
jiwei-0930-01-dck5b-worker-us-east-1b-fz98j   Ready    worker                 39m   v1.24.0+8c7c967
$ 
$ oc get nodes jiwei-0930-01-dck5b-rhel-0 -oyaml | grep -i providerid
  providerID: alicloud://us-east-1.i-0xi9e42kn4hjz1j3353t
$ 
$ oc get nodes jiwei-0930-01-dck5b-worker-us-east-1a-82zdp -oyaml | grep -i providerid
  providerID: alicloud://us-east-1.i-0xi9e42kn4hjypowet86
$ 
$ oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                          NOMINATED NODE   READINESS GATES
router-default-5fb74f8b7b-5m87k   1/1     Running   0          53m   10.131.0.8   jiwei-0930-01-dck5b-worker-us-east-1b-fz98j   <none>           <none>
router-default-5fb74f8b7b-kwtrk   1/1     Running   0          53m   10.128.2.8   jiwei-0930-01-dck5b-worker-us-east-1a-82zdp   <none>           <none>
$ 
$ oc get -o yaml deployment/router-default -n openshift-ingress | grep replicas
  replicas: 2
  replicas: 2
$ 
$ oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
$ oc get -o yaml deployment/router-default -n openshift-ingress | grep replicas
  replicas: 3
  replicas: 3
$ 
$ oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                          NOMINATED NODE   READINESS GATES
router-default-5fb74f8b7b-4h6cg   1/1     Running   0          20s   10.129.2.6   jiwei-0930-01-dck5b-rhel-0                    <none>           <none>
router-default-5fb74f8b7b-5m87k   1/1     Running   0          54m   10.131.0.8   jiwei-0930-01-dck5b-worker-us-east-1b-fz98j   <none>           <none>
router-default-5fb74f8b7b-kwtrk   1/1     Running   0          54m   10.128.2.8   jiwei-0930-01-dck5b-worker-us-east-1a-82zdp   <none>           <none>
$

Comment 43 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 44 Red Hat Bugzilla 2023-09-18 04:32:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

