Bug 2117439 - Changing controlplanemachineset machineType to another type triggers a RollingUpdate that causes cluster errors
Summary: Changing controlplanemachineset machineType to another type triggers a RollingUpdate...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Joel Speed
QA Contact: Huali Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-08-11 01:48 UTC by Huali Liu
Modified: 2023-01-17 19:55 UTC
CC List: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:54:46 UTC
Target Upstream Version:
Embargoed:
Flags: huliu: needinfo-




Links
System ID Status Summary Last Updated
Github openshift cluster-control-plane-machine-set-operator pull 84 Merged Bug 2117439: Add webhook validation to prevent scenarios which can create known broken clusters 2022-08-25 12:43:27 UTC
Github openshift installer pull 6230 open Bug 2117439: Azure masters should publish on an internal load balancer 2022-08-25 12:43:27 UTC
Github openshift machine-api-provider-azure pull 31 open Bug 2117439: Populate internalLoadBalancer for Control Plane Machines when not set 2022-08-17 13:03:32 UTC
Red Hat Product Errata RHSA-2022:7399 2023-01-17 19:55:07 UTC

Description Huali Liu 2022-08-11 01:48:02 UTC
Description of problem:
Changing the controlplanemachineset machineType to another type triggers a RollingUpdate, which causes cluster errors on GCP.

Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-09-114621

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster on GCP
2. Install the CPMSO from https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/manifests
Then:
oc edit deploy control-plane-machine-set-operator
Change the image to the newly built one.
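For reference, the only change needed in that edit is the operator container image in the Deployment spec; a minimal sketch, with the container name and image reference as placeholders rather than values taken from this cluster:

spec:
  template:
    spec:
      containers:
      - name: control-plane-machine-set-operator   # placeholder container name
        image: <newly-built-cpmso-image>           # placeholder image reference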

liuhuali@Lius-MacBook-Pro huali-test % oc get pod 
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6999ddc49d-qh9bc          2/2     Running   0          65m
cluster-baremetal-operator-6cb4947db6-k2ngv           2/2     Running   0          65m
control-plane-machine-set-operator-794f6548bd-4lc48   1/1     Running   0          5m6s
machine-api-controllers-57cc7c4d5c-s9mnf              7/7     Running   0          64m
machine-api-operator-754fc4b895-xptqx                 2/2     Running   0          66m

3. Create a ControlPlaneMachineSet with the following YAML:
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      metadata:
        labels:
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
          machine.openshift.io/cluster-api-cluster: huliu-gcp12e-6vt6v
      failureDomains:
        platform: GCP
        gcp:
        - zone: us-central1-a
        - zone: us-central1-b
        - zone: us-central1-c
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            canIPForward: false
            credentialsSecret:
              name: gcp-cloud-credentials
            deletionProtection: false
            disks:
            - autoDelete: true
              boot: true
              image: projects/rhcos-cloud/global/images/rhcos-412-86-202207142104-0-gcp-x86-64
              labels: null
              sizeGb: 128
              type: pd-ssd
            kind: GCPMachineProviderSpec
            machineType: n1-standard-4
            metadata:
              creationTimestamp: null
            networkInterfaces:
            - network: huliu-gcp12e-6vt6v-network
              subnetwork: huliu-gcp12e-6vt6v-master-subnet
            projectID: openshift-qe
            region: us-central1
            serviceAccounts:
            - email: huliu-gcp12e-6vt6v-m.gserviceaccount.com
              scopes:
              - https://www.googleapis.com/auth/cloud-platform
            tags:
            - huliu-gcp12e-6vt6v-master
            targetPools:
            - huliu-gcp12e-6vt6v-api
            userDataSecret:
              name: master-user-data

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlpanemachineset-gcp.yaml 
controlplanemachineset.machine.openshift.io/cluster created

The machineType differs from that of the current master machines, so creating the CPMS triggers a RollingUpdate. The issue can also be reproduced by creating a CPMS whose configuration matches the current master machines and then editing the CPMS to change machineType.
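For illustration, the field that differs sits under the embedded provider spec of the CPMS template; a minimal sketch of the relevant fragment (the values are examples, the path matches the YAML above):

spec:
  template:
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            machineType: n1-standard-4   # example: existing masters used a different type, such as n2-standard-4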

4. Check the result.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                PHASE     TYPE            REGION        ZONE            AGE
huliu-gcp12e-6vt6v-master-8pdhm-0   Running   n1-standard-4   us-central1   us-central1-a   39m
huliu-gcp12e-6vt6v-master-9mzfj-2   Running   n1-standard-4   us-central1   us-central1-c   18m
huliu-gcp12e-6vt6v-master-d7rrj-1   Running   n1-standard-4   us-central1   us-central1-b   28m
huliu-gcp12e-6vt6v-worker-a-57vph   Running   n2-standard-4   us-central1   us-central1-a   107m
huliu-gcp12e-6vt6v-worker-b-mkfg6   Running   n2-standard-4   us-central1   us-central1-b   107m
huliu-gcp12e-6vt6v-worker-c-v9kqs   Running   n2-standard-4   us-central1   us-central1-c   107m

The RollingUpdate completed, but it left the cluster in an error state.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-09-114621   True        False         7h45m   Error while reconciling 4.12.0-0.nightly-2022-08-09-114621: an unknown error has occurred: MultipleErrors
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-08-09-114621   True        True          False      7h47m   APIServerDeploymentProgressing: deployment/apiserver.openshift-oauth-apiserver: observed generation is 14, desired generation is 15....
baremetal                                  4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
cloud-controller-manager                   4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
cloud-credential                           4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
cluster-autoscaler                         4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
config-operator                            4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
console                                    4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h52m   
control-plane-machine-set                  0.0.1-snapshot                       True        True          False      7h3m    Observed 3 replica(s) in need of update
csi-snapshot-controller                    4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
dns                                        4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd                                       4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      NodeInstallerProgressing: 1 nodes are at revision 14; 1 nodes are at revision 17; 1 nodes are at revision 18; 0 nodes have achieved new revision 19
image-registry                             4.12.0-0.nightly-2022-08-09-114621   True        True          False      7h55m   Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h55m   
insights                                   4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h57m   
kube-apiserver                             4.12.0-0.nightly-2022-08-09-114621   True        True          True       7h59m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 10 on node: "huliu-gcp12e-6vt6v-master-8pdhm-0.c.openshift-qe.internal" didn't show up, waited: 4m45s
kube-controller-manager                    4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
kube-scheduler                             4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
kube-storage-version-migrator              4.12.0-0.nightly-2022-08-09-114621   True        False         False      6h36m   
machine-api                                4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h56m   
machine-approver                           4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
machine-config                             4.12.0-0.nightly-2022-08-09-114621   False       False         True       6h12m   Cluster not available for [{operator 4.12.0-0.nightly-2022-08-09-114621}]
marketplace                                4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
monitoring                                 4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h53m   
network                                    4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      DaemonSet "/openshift-sdn/sdn" is not available (awaiting 1 nodes)...
node-tuning                                4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
openshift-apiserver                        4.12.0-0.nightly-2022-08-09-114621   True        True          False      7h57m   APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: observed generation is 14, desired generation is 15.
openshift-controller-manager               4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
openshift-samples                          4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h56m   
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h57m   
service-ca                                 4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h      
storage                                    4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6999ddc49d-shl4v          2/2     Running   0          6h26m
cluster-baremetal-operator-6cb4947db6-clkkj           2/2     Running   0          6h26m
control-plane-machine-set-operator-794f6548bd-5q2pc   1/1     Running   0          6h37m
machine-api-controllers-57cc7c4d5c-cdrnm              7/7     Running   0          6h49m
machine-api-operator-754fc4b895-b65n9                 2/2     Running   0          6h26m
liuhuali@Lius-MacBook-Pro huali-test % oc logs control-plane-machine-set-operator-794f6548bd-5q2pc
Error from server (InternalError): Internal error occurred: Authorization error (user=system:kube-apiserver, verb=get, resource=nodes, subresource=proxy)
liuhuali@Lius-MacBook-Pro huali-test %  

The pod logs cannot be retrieved.

5. If the CPMS configuration is now changed to trigger another RollingUpdate, the new machine stays stuck in Provisioned and never reaches Running.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                PHASE         TYPE            REGION        ZONE            AGE
huliu-gcp12e-6vt6v-master-8pdhm-0   Running       n1-standard-4   us-central1   us-central1-a   7h10m
huliu-gcp12e-6vt6v-master-9mzfj-2   Running       n1-standard-4   us-central1   us-central1-c   6h49m
huliu-gcp12e-6vt6v-master-d7rrj-1   Running       n1-standard-4   us-central1   us-central1-b   6h59m
huliu-gcp12e-6vt6v-master-db8nx-0   Provisioned   n2-standard-4   us-central1   us-central1-a   99m
huliu-gcp12e-6vt6v-worker-a-57vph   Running       n2-standard-4   us-central1   us-central1-a   8h
huliu-gcp12e-6vt6v-worker-b-mkfg6   Running       n2-standard-4   us-central1   us-central1-b   8h
huliu-gcp12e-6vt6v-worker-c-v9kqs   Running       n2-standard-4   us-central1   us-central1-c   8h

Actual results:
A CPMS RollingUpdate causes cluster errors.

Expected results:
A CPMS RollingUpdate does not cause cluster errors.

Additional info:
Must-Gather: https://drive.google.com/file/d/1TVGeGH__JOhY2sBZAExaJEtbtICRdKAZ/view?usp=sharing

The must-gather data may be incomplete; it was still collecting after a long time (more than 3 hours), until the cluster expired. This is probably because the cluster errors prevented must-gather from working correctly.

Comment 1 Joel Speed 2022-08-11 11:28:12 UTC
The must-gather didn't gather any logs so we will have to try to reproduce this to work out what's gone wrong there. I think maybe the drain fix hasn't made it to GCP yet so we need to make that update and test again.

Comment 2 Joel Speed 2022-08-15 15:09:44 UTC
I've spent some time today trying to reproduce this and didn't hit the same issues.

Looking at the cluster version you've shown, are you sure the cluster was installed correctly before you started testing? I wouldn't expect it to be showing errors like that in a correctly installed cluster.

Can you try again with a fresh cluster and make sure that the cluster is installed and stable before you install the CPMS? Then we can make sure the symptoms are definitely a result of the CPMS creating a new master machine.

Comment 3 Joel Speed 2022-08-15 15:22:50 UTC
Scratch my previous message, I've noticed the rollout has completed but the cluster is now stuck in the same circumstances. It seems there's an extraneous kube-controller-manager pod that should have been removed but, for some reason, wasn't removed correctly. I suspect this is causing everything to block.

Comment 4 Joel Speed 2022-08-15 15:54:52 UTC
OK, so after some further digging, the root cause here is that when new Machines are created, they are not added to the instance groups that back the internal load balancer for the API. This means the internal API (used by components such as KCM) has no backing endpoints, and therefore no KCM can run to schedule pods and the like.

We need to work out why these instance groups aren't being updated when Machine API creates new control plane instances.

Comment 5 Joel Speed 2022-08-15 16:08:59 UTC
Instance groups aren't currently supported by Machine API on GCP, which is why this isn't working. We have https://issues.redhat.com/browse/OCPCLOUD-672 and https://issues.redhat.com/browse/OCPCLOUD-1562 in Jira tracking the implementation.

Looking at the upstream code, they check whether a Machine is a control plane machine and then look up the instance group for the subnet and ensure the instance is registered. We could do this automatically in a similar way. Will need someone to backport the feature from upstream.

Comment 6 Huali Liu 2022-08-16 05:45:58 UTC
On Azure, a CPMS RollingUpdate also causes cluster errors.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE     TYPE              REGION   ZONE   AGE
huliu-azure12b-w6zxn-master-bvrvl-0         Running   Standard_D4s_v3   eastus   2      155m
huliu-azure12b-w6zxn-master-gq87p-2         Running   Standard_D4s_v3   eastus   1      123m
huliu-azure12b-w6zxn-master-qd8jg-1         Running   Standard_D4s_v3   eastus   3      140m
huliu-azure12b-w6zxn-worker-eastus1-tvxvr   Running   Standard_D4s_v3   eastus   1      4h
huliu-azure12b-w6zxn-worker-eastus2-dtxrd   Running   Standard_D4s_v3   eastus   2      4h
huliu-azure12b-w6zxn-worker-eastus3-j7qjw   Running   Standard_D4s_v3   eastus   3      4h
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-150248   True        False         3h35m   Error while reconciling 4.12.0-0.nightly-2022-08-15-150248: an unknown error has occurred: MultipleErrors
liuhuali@Lius-MacBook-Pro huali-test % oc get co            
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-08-15-150248   False       False         True       105m    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.huliu-azure12b.qe.azure.devcluster.openshift.com/healthz": dial tcp 20.124.44.69:443: connect: connection refused...
baremetal                                  4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h      
cloud-controller-manager                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h3m    
cloud-credential                           4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h13m   
cluster-autoscaler                         4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h      
config-operator                            4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h2m    
console                                    4.12.0-0.nightly-2022-08-15-150248   False       False         False      105m    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.huliu-azure12b.qe.azure.devcluster.openshift.com): Get "https://console-openshift-console.apps.huliu-azure12b.qe.azure.devcluster.openshift.com": dial tcp 20.124.44.69:443: connect: connection refused
control-plane-machine-set                  0.0.1-snapshot                       True        False         False      170m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h1m    
dns                                        4.12.0-0.nightly-2022-08-15-150248   True        True          False      4h      DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h51m   
image-registry                             4.12.0-0.nightly-2022-08-15-150248   False       True          True       105m    Available: The deployment does not have available replicas...
ingress                                    4.12.0-0.nightly-2022-08-15-150248   False       True          True       105m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h55m   
kube-apiserver                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h51m   
kube-controller-manager                    4.12.0-0.nightly-2022-08-15-150248   True        False         True       3h52m   GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.97.149:9091: i/o timeout
kube-scheduler                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h52m   
kube-storage-version-migrator              4.12.0-0.nightly-2022-08-15-150248   True        False         False      102m    
machine-api                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h47m   
machine-approver                           4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h1m    
machine-config                             4.12.0-0.nightly-2022-08-15-150248   False       False         True       95m     Cluster not available for [{operator 4.12.0-0.nightly-2022-08-15-150248}]
marketplace                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h      
monitoring                                 4.12.0-0.nightly-2022-08-15-150248   False       True          True       88m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.12.0-0.nightly-2022-08-15-150248   True        True          True       4h2m    DaemonSet "/openshift-multus/multus" rollout is not making progress - last change 2022-08-16T04:00:18Z...
node-tuning                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h      
openshift-apiserver                        4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h49m   
openshift-controller-manager               4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h52m   
openshift-samples                          4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h49m   
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h1m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h1m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h49m   
service-ca                                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h2m    
storage                                    4.12.0-0.nightly-2022-08-15-150248   True        True          False      4h1m    AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test %

Comment 7 Joel Speed 2022-08-16 08:18:05 UTC
On Azure, the installer does not set the `internalLoadBalancer` field on the Machine provider spec.
This means that Azure then presents the same symptoms as GCP.

However, unlike GCP, we do support this.

The public load balancer field will look like `publicLoadBalancer: <cluster-id>`.
If we copy this and add `-internal` to the end we can configure the `internalLoadBalancer: <cluster-id>-internal` and the rollout can proceed correctly.
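For reference, a minimal sketch of the relevant fragment of an Azure control plane Machine providerSpec with both load balancers populated (the cluster ID is a placeholder):

providerSpec:
  value:
    publicLoadBalancer: <cluster-id>
    internalLoadBalancer: <cluster-id>-internal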

There are a few actions we need to take knowing this:
- We should document that, when Azure users set up a CPMS, they should add the internal load balancer to their existing Machines and to the CPMS spec from there on
- We should fix the installer to add this field by default
- We _could_ prevent a CPMS from being installed if there is no internal load balancer, since we know this is required

Comment 8 Joel Speed 2022-08-16 09:00:24 UTC
I think if we repeat the process on Azure with this PR https://github.com/openshift/installer/pull/6230 to the installer, it should work. This should add the load balancer field that's missing and, as long as we add that to the CPMS spec as well, it should be able to complete a control plane replacement.

Comment 9 Huali Liu 2022-08-17 05:24:38 UTC
Thanks @jspeed. Yes, on Azure, after adding `internalLoadBalancer: huliu-azure12c-6mpmd-internal` to the CPMS spec, the CPMS RollingUpdate proceeded correctly and didn't cause cluster errors.
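For anyone hitting the same symptoms: the field goes in the embedded provider spec of the CPMS template, using the same structure as the example in the description; a sketch with this cluster's value:

spec:
  template:
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            internalLoadBalancer: huliu-azure12c-6mpmd-internal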

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE     TYPE              REGION   ZONE   AGE
huliu-azure12c-6mpmd-master-84wqx-1         Running   Standard_D4s_v3   eastus   3      48m
huliu-azure12c-6mpmd-master-sqc6v-0         Running   Standard_D4s_v3   eastus   2      65m
huliu-azure12c-6mpmd-master-vkgsj-2         Running   Standard_D4s_v3   eastus   1      33m
huliu-azure12c-6mpmd-worker-eastus1-87778   Running   Standard_D4s_v3   eastus   1      3h1m
huliu-azure12c-6mpmd-worker-eastus2-28gjm   Running   Standard_D4s_v3   eastus   2      3h1m
huliu-azure12c-6mpmd-worker-eastus3-xnlrx   Running   Standard_D4s_v3   eastus   3      3h1m
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-150248   True        False         159m    Cluster version is 4.12.0-0.nightly-2022-08-15-150248
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      19m     
baremetal                                  4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
cloud-controller-manager                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h2m    
cloud-credential                           4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h11m   
cluster-autoscaler                         4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
config-operator                            4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m    
console                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      50m     
control-plane-machine-set                  0.0.1-snapshot                       True        False         False      137m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m    
dns                                        4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
etcd                                       4.12.0-0.nightly-2022-08-15-150248   True        False         False      179m    
image-registry                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      170m    
ingress                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      170m    
insights                                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      175m    
kube-apiserver                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      163m    
kube-controller-manager                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      178m    
kube-scheduler                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      178m    
kube-storage-version-migrator              4.12.0-0.nightly-2022-08-15-150248   True        False         False      99m     
machine-api                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      167m    
machine-approver                           4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
machine-config                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      179m    
marketplace                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
monitoring                                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      159m    
network                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h2m    
node-tuning                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
openshift-apiserver                        4.12.0-0.nightly-2022-08-15-150248   True        False         False      50m     
openshift-controller-manager               4.12.0-0.nightly-2022-08-15-150248   True        False         False      171m    
openshift-samples                          4.12.0-0.nightly-2022-08-15-150248   True        False         False      171m    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h      
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-15-150248   True        False         False      174m    
service-ca                                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m    
storage                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m    
liuhuali@Lius-MacBook-Pro huali-test %

Comment 10 Joel Speed 2022-08-17 10:30:49 UTC
The plan forward here is to prevent GCP CPMS from being created until we can fix the load balancing issues in 4.13. Then, for Azure, return an error if the user doesn't have the internal load balancer set; this should prompt the user to configure it.

To prevent awkward rollouts on install, we will set up Machine API Azure to populate the load balancer where possible.

Comment 11 Huali Liu 2022-08-24 07:43:46 UTC
Verified the issue before the PR merge.

On GCP:
1. Create a new release image from the pull requests using Cluster Bot
build openshift/installer#6230,openshift/machine-api-provider-azure#31,openshift/cluster-control-plane-machine-set-operator#84
2. Install a cluster on GCP using the image built in the previous step
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         3h15m   Cluster version is 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-b6c8d658-9rs65            2/2     Running   0          3h36m
cluster-baremetal-operator-7c9cb8d8cb-qfqk2           2/2     Running   0          3h36m
control-plane-machine-set-operator-7fc8897c6b-m4tgh   1/1     Running   0          3h36m
machine-api-controllers-596c49fcd9-x5jzk              7/7     Running   0          3h34m
machine-api-operator-54d9869b57-t6ndz                 2/2     Running   0          3h37m
machine-api-termination-handler-62pm9                 1/1     Running   0          3h27m
machine-api-termination-handler-6pr4c                 1/1     Running   0          3h27m
machine-api-termination-handler-m5bq4                 1/1     Running   0          3h26m
liuhuali@Lius-MacBook-Pro huali-test % 
3. Create a ControlPlaneMachineSet
liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlpanemachineset-gcp.yaml
Error from server (spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value: Forbidden: automatic replacement of control plane machines on GCP is not currently supported): error when creating "controlpanemachineset-gcp.yaml": admission webhook "controlplanemachineset.machine.openshift.io" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value: Forbidden: automatic replacement of control plane machines on GCP is not currently supported
liuhuali@Lius-MacBook-Pro huali-test % 

On Azure:
1. Create a new release image from the pull requests using Cluster Bot
build openshift/installer#6230,openshift/machine-api-provider-azure#31,openshift/cluster-control-plane-machine-set-operator#84
2. Install a cluster on Azure using the image built in the previous step
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         67m     Cluster version is 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS       AGE
cluster-autoscaler-operator-b6c8d658-spdhd            2/2     Running   1 (85m ago)    96m
cluster-baremetal-operator-7c9cb8d8cb-pgcgb           2/2     Running   0              96m
control-plane-machine-set-operator-7fc8897c6b-87jkf   1/1     Running   2 (84m ago)    96m
machine-api-controllers-586b447cdc-d56h4              7/7     Running   11 (84m ago)   89m
machine-api-operator-54d9869b57-gc928                 2/2     Running   1 (85m ago)    96m
machine-api-termination-handler-6mm5m                 1/1     Running   0              79m
machine-api-termination-handler-dbvpl                 1/1     Running   0              72m
machine-api-termination-handler-thmvh                 1/1     Running   0              74m
3. Check that the `internalLoadBalancer` field is set on the master machine provider spec by default
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml |grep internalLoadBalancer
        internalLoadBalancer: huliu-azure412pr-rqhjd-internal
        internalLoadBalancer: huliu-azure412pr-rqhjd-internal
        internalLoadBalancer: huliu-azure412pr-rqhjd-internal
liuhuali@Lius-MacBook-Pro huali-test % 
4. Create a ControlPlaneMachineSet with the same configuration as the current master machines
liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlpanemachineset-azure.yaml 
controlplanemachineset.machine.openshift.io/cluster created
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   AGE
cluster   3         3         3       3                       15s
liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                                 
NAME                                          PHASE     TYPE              REGION   ZONE   AGE
huliu-azure412pr-rqhjd-master-0               Running   Standard_D8s_v3   eastus   2      105m
huliu-azure412pr-rqhjd-master-1               Running   Standard_D8s_v3   eastus   3      105m
huliu-azure412pr-rqhjd-master-2               Running   Standard_D8s_v3   eastus   1      105m
huliu-azure412pr-rqhjd-worker-eastus1-bk56k   Running   Standard_D4s_v3   eastus   1      98m
huliu-azure412pr-rqhjd-worker-eastus2-8dv6h   Running   Standard_D4s_v3   eastus   2      98m
huliu-azure412pr-rqhjd-worker-eastus3-p68w2   Running   Standard_D4s_v3   eastus   3      98m
5. Edit the ControlPlaneMachineSet and change something to trigger a RollingUpdate; the RollingUpdate succeeds and the cluster remains healthy
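From the machine types shown before (Standard_D8s_v3) and after (Standard_D4s_v3), the edit appears to have been the instance size; a minimal sketch of the changed fragment under spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value, assuming the Azure field name is vmSize:

            vmSize: Standard_D4s_v3   # changed from Standard_D8s_v3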
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE              REGION   ZONE   AGE
huliu-azure412pr-rqhjd-master-25txd-0         Running   Standard_D4s_v3   eastus   3      44m
huliu-azure412pr-rqhjd-master-5vtbs-1         Running   Standard_D4s_v3   eastus   3      60m
huliu-azure412pr-rqhjd-master-7bjrk-2         Running   Standard_D4s_v3   eastus   3      25m
huliu-azure412pr-rqhjd-worker-eastus1-bk56k   Running   Standard_D4s_v3   eastus   1      3h52m
huliu-azure412pr-rqhjd-worker-eastus2-8dv6h   Running   Standard_D4s_v3   eastus   2      3h52m
huliu-azure412pr-rqhjd-worker-eastus3-p68w2   Running   Standard_D4s_v3   eastus   3      3h52m
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         3h30m   Cluster version is 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      15m     
baremetal                                  4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
cloud-controller-manager                   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h54m   
cloud-credential                           4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h59m   
cluster-autoscaler                         4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
config-operator                            4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h53m   
console                                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      46m     
control-plane-machine-set                  4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      52m     
csi-snapshot-controller                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m   
dns                                        4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m   
etcd                                       4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h44m   
image-registry                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h38m   
ingress                                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h39m   
insights                                   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h46m   
kube-apiserver                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h41m   
kube-controller-manager                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h44m   
kube-scheduler                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h44m   
kube-storage-version-migrator              4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      107m    
machine-api                                4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h37m   
machine-approver                           4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
machine-config                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h51m   
marketplace                                4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
monitoring                                 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h35m   
network                                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h54m   
node-tuning                                4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m   
openshift-apiserver                        4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      117m    
openshift-controller-manager               4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m   
openshift-samples                          4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h39m   
operator-lifecycle-manager                 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
operator-lifecycle-manager-catalog         4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
operator-lifecycle-manager-packageserver   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m   
service-ca                                 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h53m   
storage                                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m   
liuhuali@Lius-MacBook-Pro huali-test % 
6. Edit the ControlPlaneMachineSet and change `internalLoadBalancer` to an invalid value
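Judging from the error message below, the field was set to a load balancer name that does not exist in the resource group; the changed fragment under the same providerSpec.value path would look like:

            internalLoadBalancer: invalid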
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE              REGION   ZONE   AGE
huliu-azure412pr-rqhjd-master-25txd-0         Running   Standard_D4s_v3   eastus   3      48m
huliu-azure412pr-rqhjd-master-5vtbs-1         Running   Standard_D4s_v3   eastus   3      65m
huliu-azure412pr-rqhjd-master-7bjrk-2         Running   Standard_D4s_v3   eastus   3      30m
huliu-azure412pr-rqhjd-master-zvt22-0         Failed                                      5s
huliu-azure412pr-rqhjd-worker-eastus1-bk56k   Running   Standard_D4s_v3   eastus   1      3h56m
huliu-azure412pr-rqhjd-worker-eastus2-8dv6h   Running   Standard_D4s_v3   eastus   2      3h56m
huliu-azure412pr-rqhjd-worker-eastus3-p68w2   Running   Standard_D4s_v3   eastus   3      3h56m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-azure412pr-rqhjd-master-zvt22-0  -o yaml
...
  errorMessage: 'failed to reconcile machine "huliu-azure412pr-rqhjd-master-zvt22-0":
    network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404
    -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound"
    Message="The Resource ''Microsoft.Network/loadBalancers/invalid'' under resource
    group ''huliu-azure412pr-rqhjd-rg'' was not found. For more details please go
    to https://aka.ms/ARMResourceNotFoundFix"'
  errorReason: InvalidConfiguration
...
7. Edit the ControlPlaneMachineSet and remove the `internalLoadBalancer` field
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
error: controlplanemachinesets.machine.openshift.io "cluster" could not be patched: admission webhook "controlplanemachineset.machine.openshift.io" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.internalLoadBalancer: Required value: internalLoadBalancer is required for control plane machines
You can run `oc replace -f /var/folders/yc/y9zy01jn3f51r9knbpsm_55r0000gn/T/oc-edit-889700000.yaml` to try this update again.
liuhuali@Lius-MacBook-Pro huali-test %

Comment 13 Huali Liu 2022-09-05 01:29:26 UTC
Already verified this before the PR merge (refer to Comment 11), so moving this to Verified.

Comment 16 errata-xmlrpc 2023-01-17 19:54:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

