Description of problem:
Changing the ControlPlaneMachineSet machineType to another type triggers a RollingUpdate that leaves the cluster in an error state on GCP.

Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-09-114621

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster on GCP.
2. Install the CPMSO from https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/manifests, then run `oc edit deploy control-plane-machine-set-operator` and change the image to the newly built one.

liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6999ddc49d-qh9bc          2/2     Running   0          65m
cluster-baremetal-operator-6cb4947db6-k2ngv           2/2     Running   0          65m
control-plane-machine-set-operator-794f6548bd-4lc48   1/1     Running   0          5m6s
machine-api-controllers-57cc7c4d5c-s9mnf              7/7     Running   0          64m
machine-api-operator-754fc4b895-xptqx                 2/2     Running   0          66m

3. Create a ControlPlaneMachineSet with the YAML below:

apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      metadata:
        labels:
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
          machine.openshift.io/cluster-api-cluster: huliu-gcp12e-6vt6v
      failureDomains:
        platform: GCP
        gcp:
        - zone: us-central1-a
        - zone: us-central1-b
        - zone: us-central1-c
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            canIPForward: false
            credentialsSecret:
              name: gcp-cloud-credentials
            deletionProtection: false
            disks:
            - autoDelete: true
              boot: true
              image: projects/rhcos-cloud/global/images/rhcos-412-86-202207142104-0-gcp-x86-64
              labels: null
              sizeGb: 128
              type: pd-ssd
            kind: GCPMachineProviderSpec
            machineType: n1-standard-4
            metadata:
              creationTimestamp: null
            networkInterfaces:
            - network: huliu-gcp12e-6vt6v-network
              subnetwork: huliu-gcp12e-6vt6v-master-subnet
            projectID: openshift-qe
            region: us-central1
            serviceAccounts:
            - email: huliu-gcp12e-6vt6v-m.gserviceaccount.com
              scopes:
              - https://www.googleapis.com/auth/cloud-platform
            tags:
            - huliu-gcp12e-6vt6v-master
            targetPools:
            - huliu-gcp12e-6vt6v-api
            userDataSecret:
              name: master-user-data

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlpanemachineset-gcp.yaml
controlplanemachineset.machine.openshift.io/cluster created

The machineType differs from that of the current master machines, so creating the CPMS triggers a RollingUpdate. The issue also reproduces if you create a CPMS whose configuration matches the current master machines and then edit the CPMS to change the machineType.

4. Check the result.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                PHASE     TYPE            REGION        ZONE            AGE
huliu-gcp12e-6vt6v-master-8pdhm-0   Running   n1-standard-4   us-central1   us-central1-a   39m
huliu-gcp12e-6vt6v-master-9mzfj-2   Running   n1-standard-4   us-central1   us-central1-c   18m
huliu-gcp12e-6vt6v-master-d7rrj-1   Running   n1-standard-4   us-central1   us-central1-b   28m
huliu-gcp12e-6vt6v-worker-a-57vph   Running   n2-standard-4   us-central1   us-central1-a   107m
huliu-gcp12e-6vt6v-worker-b-mkfg6   Running   n2-standard-4   us-central1   us-central1-b   107m
huliu-gcp12e-6vt6v-worker-c-v9kqs   Running   n2-standard-4   us-central1   us-central1-c   107m

The RollingUpdate completed, but it left the cluster in an error state.
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-09-114621   True        False         7h45m   Error while reconciling 4.12.0-0.nightly-2022-08-09-114621: an unknown error has occurred: MultipleErrors

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                            VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                  4.12.0-0.nightly-2022-08-09-114621   True        True          False      7h47m   APIServerDeploymentProgressing: deployment/apiserver.openshift-oauth-apiserver: observed generation is 14, desired generation is 15....
baremetal                       4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
cloud-controller-manager        4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
cloud-credential                4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
cluster-autoscaler              4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
config-operator                 4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
console                         4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h52m
control-plane-machine-set       0.0.1-snapshot                       True        True          False      7h3m    Observed 3 replica(s) in need of update
csi-snapshot-controller         4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
dns                             4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd                            4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      NodeInstallerProgressing: 1 nodes are at revision 14; 1 nodes are at revision 17; 1 nodes are at revision 18; 0 nodes have achieved new revision 19
image-registry                  4.12.0-0.nightly-2022-08-09-114621   True        True          False      7h55m   Progressing: The registry is ready...
ingress                         4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h55m
insights                        4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h57m
kube-apiserver                  4.12.0-0.nightly-2022-08-09-114621   True        True          True       7h59m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 10 on node: "huliu-gcp12e-6vt6v-master-8pdhm-0.c.openshift-qe.internal" didn't show up, waited: 4m45s
kube-controller-manager         4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
kube-scheduler                  4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
kube-storage-version-migrator   4.12.0-0.nightly-2022-08-09-114621   True        False         False      6h36m
machine-api                     4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h56m
machine-approver                4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
machine-config                  4.12.0-0.nightly-2022-08-09-114621   False       False         True       6h12m   Cluster not available for [{operator 4.12.0-0.nightly-2022-08-09-114621}]
marketplace                     4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
monitoring                      4.12.0-0.nightly-2022-08-09-114621   True        False         False      7h53m
network                         4.12.0-0.nightly-2022-08-09-114621   True        True          False      8h      DaemonSet "/openshift-sdn/sdn" is not available (awaiting 1 nodes)...
node-tuning                     4.12.0-0.nightly-2022-08-09-114621   True        False         False      8h
openshift-apiserver             4.12.0-0.nightly-2022-08-09-114621   True        True          False      7h57m   APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: observed generation is 14, desired generation is 15.
openshift-controller-manager               4.12.0-0.nightly-2022-08-09-114621   True    False   False   8h
openshift-samples                          4.12.0-0.nightly-2022-08-09-114621   True    False   False   7h56m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-09-114621   True    False   False   8h
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-09-114621   True    False   False   8h
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-09-114621   True    False   False   7h57m
service-ca                                 4.12.0-0.nightly-2022-08-09-114621   True    False   False   8h
storage                                    4.12.0-0.nightly-2022-08-09-114621   True    True    False   8h      GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6999ddc49d-shl4v          2/2     Running   0          6h26m
cluster-baremetal-operator-6cb4947db6-clkkj           2/2     Running   0          6h26m
control-plane-machine-set-operator-794f6548bd-5q2pc   1/1     Running   0          6h37m
machine-api-controllers-57cc7c4d5c-cdrnm              7/7     Running   0          6h49m
machine-api-operator-754fc4b895-b65n9                 2/2     Running   0          6h26m

liuhuali@Lius-MacBook-Pro huali-test % oc logs control-plane-machine-set-operator-794f6548bd-5q2pc
Error from server (InternalError): Internal error occurred: Authorization error (user=system:kube-apiserver, verb=get, resource=nodes, subresource=proxy)
liuhuali@Lius-MacBook-Pro huali-test %

The pod logs cannot be opened.

5. If you now change the CPMS configuration to trigger another RollingUpdate, the new machine gets stuck in Provisioned and never reaches Running.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                PHASE         TYPE            REGION        ZONE            AGE
huliu-gcp12e-6vt6v-master-8pdhm-0   Running       n1-standard-4   us-central1   us-central1-a   7h10m
huliu-gcp12e-6vt6v-master-9mzfj-2   Running       n1-standard-4   us-central1   us-central1-c   6h49m
huliu-gcp12e-6vt6v-master-d7rrj-1   Running       n1-standard-4   us-central1   us-central1-b   6h59m
huliu-gcp12e-6vt6v-master-db8nx-0   Provisioned   n2-standard-4   us-central1   us-central1-a   99m
huliu-gcp12e-6vt6v-worker-a-57vph   Running       n2-standard-4   us-central1   us-central1-a   8h
huliu-gcp12e-6vt6v-worker-b-mkfg6   Running       n2-standard-4   us-central1   us-central1-b   8h
huliu-gcp12e-6vt6v-worker-c-v9kqs   Running       n2-standard-4   us-central1   us-central1-c   8h

Actual results:
The CPMS RollingUpdate leaves the cluster in an error state.

Expected results:
The CPMS RollingUpdate does not leave the cluster in an error state.

Additional info:
Must-Gather: https://drive.google.com/file/d/1TVGeGH__JOhY2sBZAExaJEtbtICRdKAZ/view?usp=sharing
The must-gather data may be incomplete: it was still collecting data after more than 3 hours, until the cluster expired. I suspect the cluster errors prevented must-gather from working correctly.
The must-gather didn't gather any logs, so we will have to try to reproduce this to work out what's gone wrong. I think the drain fix may not have made it to GCP yet, so we need to make that update and test again.
I've spent some time trying to reproduce this today and didn't hit the same issues. Looking at the cluster version you've shown, are you sure the cluster was installed correctly before you started testing? I wouldn't expect a correctly installed cluster to be showing errors like that. Can you try again with a fresh cluster and make sure the cluster is installed and stable before you install the CPMS? Then we can be sure the symptoms are definitely a result of the CPMS creating a new master machine.
Scratch my previous message. I've noticed the rollout has completed, but the cluster is now stuck in the same circumstances. It seems there's an extraneous kube-controller-manager pod that should have been removed but, for some reason, wasn't removed correctly. I suspect this is causing everything to block.
Ok, so after some further digging, the root cause here is that when new Machines are created, they are not added to the instance groups backing the internal API load balancer. This means that the internal API (used by components such as the KCM) has no backing endpoints, and therefore no KCM can run to schedule pods and the like. We need to work out why these instance groups aren't being updated when Machine API creates new control plane instances.
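A quick way to confirm this symptom on a live cluster is to ask GCP for the health of the backend service behind the internal API load balancer; replacement masters that Machine API created but never registered will simply be absent from the list. A minimal sketch, assuming the installer's usual `<infra-id>-api-internal` backend service naming (the values below are illustrative, not taken from this cluster):

# Illustrative values; the infra ID, region, and backend service name are
# assumptions for this sketch.
INFRA_ID=huliu-gcp12e-6vt6v
REGION=us-central1

# Show the instances backing the internal API load balancer and their
# reported health. A new master missing from this output confirms the
# instance-group registration gap described above.
gcloud compute backend-services get-health \
  "${INFRA_ID}-api-internal" --region "${REGION}"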
Instance groups aren't currently supported by Machine API on GCP, which is why this isn't working. We have https://issues.redhat.com/browse/OCPCLOUD-672 and https://issues.redhat.com/browse/OCPCLOUD-1562 in Jira tracking the implementation. Looking at the upstream code, they check whether a Machine is a control plane machine and then look up the instance group for the subnet and ensure the instance is registered. We could do this automatically in a similar way; see the sketch below for what that registration amounts to at the gcloud level. Will need someone to backport the feature from upstream.
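For anyone needing to unblock a stuck rollout manually in the meantime, the registration the upstream code performs amounts to adding the new instance to the zonal unmanaged instance group that backs the internal load balancer. A hedged sketch with gcloud, assuming an `<infra-id>-master-<zone>` instance group naming convention (all names here are illustrative, not confirmed by this bug):

# Illustrative values; the instance group naming is an assumption here.
INFRA_ID=huliu-gcp12e-6vt6v
ZONE=us-central1-a

# List the control plane instances currently registered in this zone's
# instance group.
gcloud compute instance-groups unmanaged list-instances \
  "${INFRA_ID}-master-${ZONE}" --zone "${ZONE}"

# Register a replacement master that Machine API created but did not add
# to the group (this is what the automatic fix would do for us).
gcloud compute instance-groups unmanaged add-instances \
  "${INFRA_ID}-master-${ZONE}" --zone "${ZONE}" \
  --instances "${INFRA_ID}-master-abcde-0"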
On Azure, a CPMS RollingUpdate also causes a cluster error.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE     TYPE              REGION   ZONE   AGE
huliu-azure12b-w6zxn-master-bvrvl-0         Running   Standard_D4s_v3   eastus   2      155m
huliu-azure12b-w6zxn-master-gq87p-2         Running   Standard_D4s_v3   eastus   1      123m
huliu-azure12b-w6zxn-master-qd8jg-1         Running   Standard_D4s_v3   eastus   3      140m
huliu-azure12b-w6zxn-worker-eastus1-tvxvr   Running   Standard_D4s_v3   eastus   1      4h
huliu-azure12b-w6zxn-worker-eastus2-dtxrd   Running   Standard_D4s_v3   eastus   2      4h
huliu-azure12b-w6zxn-worker-eastus3-j7qjw   Running   Standard_D4s_v3   eastus   3      4h

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-150248   True        False         3h35m   Error while reconciling 4.12.0-0.nightly-2022-08-15-150248: an unknown error has occurred: MultipleErrors

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                        VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication              4.12.0-0.nightly-2022-08-15-150248   False       False         True       105m    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.huliu-azure12b.qe.azure.devcluster.openshift.com/healthz": dial tcp 20.124.44.69:443: connect: connection refused...
baremetal                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h
cloud-controller-manager    4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h3m
cloud-credential            4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h13m
cluster-autoscaler          4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h
config-operator             4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h2m
console                     4.12.0-0.nightly-2022-08-15-150248   False       False         False      105m    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.huliu-azure12b.qe.azure.devcluster.openshift.com): Get "https://console-openshift-console.apps.huliu-azure12b.qe.azure.devcluster.openshift.com": dial tcp 20.124.44.69:443: connect: connection refused
control-plane-machine-set   0.0.1-snapshot                       True        False         False      170m
csi-snapshot-controller     4.12.0-0.nightly-2022-08-15-150248   True        False         False      4h1m
dns                         4.12.0-0.nightly-2022-08-15-150248   True        True          False      4h      DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
etcd                        4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h51m
image-registry              4.12.0-0.nightly-2022-08-15-150248   False       True          True       105m    Available: The deployment does not have available replicas...
ingress                     4.12.0-0.nightly-2022-08-15-150248   False       True          True       105m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h55m
kube-apiserver                             4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h51m
kube-controller-manager                    4.12.0-0.nightly-2022-08-15-150248   True    False   True    3h52m   GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.97.149:9091: i/o timeout
kube-scheduler                             4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h52m
kube-storage-version-migrator              4.12.0-0.nightly-2022-08-15-150248   True    False   False   102m
machine-api                                4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h47m
machine-approver                           4.12.0-0.nightly-2022-08-15-150248   True    False   False   4h1m
machine-config                             4.12.0-0.nightly-2022-08-15-150248   False   False   True    95m     Cluster not available for [{operator 4.12.0-0.nightly-2022-08-15-150248}]
marketplace                                4.12.0-0.nightly-2022-08-15-150248   True    False   False   4h
monitoring                                 4.12.0-0.nightly-2022-08-15-150248   False   True    True    88m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.12.0-0.nightly-2022-08-15-150248   True    True    True    4h2m    DaemonSet "/openshift-multus/multus" rollout is not making progress - last change 2022-08-16T04:00:18Z...
node-tuning                                4.12.0-0.nightly-2022-08-15-150248   True    False   False   4h
openshift-apiserver                        4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h49m
openshift-controller-manager               4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h52m
openshift-samples                          4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h49m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-15-150248   True    False   False   4h1m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-15-150248   True    False   False   4h1m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-15-150248   True    False   False   3h49m
service-ca                                 4.12.0-0.nightly-2022-08-15-150248   True    False   False   4h2m
storage                                    4.12.0-0.nightly-2022-08-15-150248   True    True    False   4h1m    AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test %
On Azure, the installer does not set the `internalLoadBalancer` field on the Machine provider spec, which means Azure presents the same symptoms as GCP. However, unlike GCP, we do support this. The public load balancer field looks like `publicLoadBalancer: <cluster-id>`; if we copy this and append `-internal`, we can configure `internalLoadBalancer: <cluster-id>-internal` and the rollout can proceed correctly (see the sketch after this list). There are a few actions we need to take knowing this:
- We should document, for Azure users setting up a CPMS, that they should add the internal load balancer to their existing machines, and to the CPMS spec from there on
- We should fix the installer to add this field by default
- We _could_ prevent a CPMS from being installed if there is no internal load balancer, since we know this is required
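For reference, a minimal sketch of the relevant excerpt of the CPMS with both load balancer fields set, assuming the cluster ID huliu-azure12b-w6zxn from the output above (all other provider spec fields are elided, not optional):

spec:
  template:
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            kind: AzureMachineProviderSpec
            # Set by the installer; backs the public API endpoint.
            publicLoadBalancer: huliu-azure12b-w6zxn
            # Derived by appending -internal to the public LB name; backs
            # the internal API endpoint used by in-cluster components.
            internalLoadBalancer: huliu-azure12b-w6zxn-internal
            # ...remaining Azure provider spec fields unchanged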
I think if we repeat the process on Azure with this installer PR, https://github.com/openshift/installer/pull/6230, it should work. It adds the load balancer field that's missing, and as long as we add that field to the CPMS spec as well, the control plane replacement should be able to complete.
Thanks @jspeed. Yes, on Azure, after adding `internalLoadBalancer: huliu-azure12c-6mpmd-internal` to the CPMS spec, the CPMS RollingUpdate proceeded correctly and didn't cause a cluster error.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE     TYPE              REGION   ZONE   AGE
huliu-azure12c-6mpmd-master-84wqx-1         Running   Standard_D4s_v3   eastus   3      48m
huliu-azure12c-6mpmd-master-sqc6v-0         Running   Standard_D4s_v3   eastus   2      65m
huliu-azure12c-6mpmd-master-vkgsj-2         Running   Standard_D4s_v3   eastus   1      33m
huliu-azure12c-6mpmd-worker-eastus1-87778   Running   Standard_D4s_v3   eastus   1      3h1m
huliu-azure12c-6mpmd-worker-eastus2-28gjm   Running   Standard_D4s_v3   eastus   2      3h1m
huliu-azure12c-6mpmd-worker-eastus3-xnlrx   Running   Standard_D4s_v3   eastus   3      3h1m

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-150248   True        False         159m    Cluster version is 4.12.0-0.nightly-2022-08-15-150248

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      19m
baremetal                                  4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
cloud-controller-manager                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h2m
cloud-credential                           4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h11m
cluster-autoscaler                         4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
config-operator                            4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m
console                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      50m
control-plane-machine-set                  0.0.1-snapshot                       True        False         False      137m
csi-snapshot-controller                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m
dns                                        4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
etcd                                       4.12.0-0.nightly-2022-08-15-150248   True        False         False      179m
image-registry                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      170m
ingress                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      170m
insights                                   4.12.0-0.nightly-2022-08-15-150248   True        False         False      175m
kube-apiserver                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      163m
kube-controller-manager                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      178m
kube-scheduler                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      178m
kube-storage-version-migrator              4.12.0-0.nightly-2022-08-15-150248   True        False         False      99m
machine-api                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      167m
machine-approver                           4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
machine-config                             4.12.0-0.nightly-2022-08-15-150248   True        False         False      179m
marketplace                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
monitoring                                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      159m
network                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h2m
node-tuning                                4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
openshift-apiserver                        4.12.0-0.nightly-2022-08-15-150248   True        False         False      50m
openshift-controller-manager               4.12.0-0.nightly-2022-08-15-150248   True        False         False      171m
openshift-samples                          4.12.0-0.nightly-2022-08-15-150248   True        False         False      171m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-15-150248   True        False         False      174m
service-ca                                 4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m
storage                                    4.12.0-0.nightly-2022-08-15-150248   True        False         False      3h1m
liuhuali@Lius-MacBook-Pro huali-test %
The plan forward here is to prevent a GCP CPMS from being created until we can fix the load balancing issues in 4.13. For Azure, we will return an error if the user doesn't have the internal load balancer set, which should prompt the user to configure it. To prevent awkward rollouts on install, we will set up Machine API Azure to populate the load balancer where possible.
Verified the issue before the PR merge.

On GCP:
1. Create a new release image from the pull requests using Cluster Bot: build openshift/installer#6230,openshift/machine-api-provider-azure#31,openshift/cluster-control-plane-machine-set-operator#84
2. Install a cluster on GCP using the image built in the previous step.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         3h15m   Cluster version is 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest

liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-b6c8d658-9rs65            2/2     Running   0          3h36m
cluster-baremetal-operator-7c9cb8d8cb-qfqk2           2/2     Running   0          3h36m
control-plane-machine-set-operator-7fc8897c6b-m4tgh   1/1     Running   0          3h36m
machine-api-controllers-596c49fcd9-x5jzk              7/7     Running   0          3h34m
machine-api-operator-54d9869b57-t6ndz                 2/2     Running   0          3h37m
machine-api-termination-handler-62pm9                 1/1     Running   0          3h27m
machine-api-termination-handler-6pr4c                 1/1     Running   0          3h27m
machine-api-termination-handler-m5bq4                 1/1     Running   0          3h26m
liuhuali@Lius-MacBook-Pro huali-test %

3. Create a ControlPlaneMachineSet; the admission webhook now denies it.

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlpanemachineset-gcp.yaml
Error from server (spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value: Forbidden: automatic replacement of control plane machines on GCP is not currently supported): error when creating "controlpanemachineset-gcp.yaml": admission webhook "controlplanemachineset.machine.openshift.io" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value: Forbidden: automatic replacement of control plane machines on GCP is not currently supported
liuhuali@Lius-MacBook-Pro huali-test %

On Azure:
1. Create a new release image from the pull requests using Cluster Bot: build openshift/installer#6230,openshift/machine-api-provider-azure#31,openshift/cluster-control-plane-machine-set-operator#84
2. Install a cluster on Azure using the image built in the previous step.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         67m     Cluster version is 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest

liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS    RESTARTS      AGE
cluster-autoscaler-operator-b6c8d658-spdhd            2/2     Running   1 (85m ago)   96m
cluster-baremetal-operator-7c9cb8d8cb-pgcgb           2/2     Running   0             96m
control-plane-machine-set-operator-7fc8897c6b-87jkf   1/1     Running   2 (84m ago)   96m
machine-api-controllers-586b447cdc-d56h4              7/7     Running   11 (84m ago)  89m
machine-api-operator-54d9869b57-gc928                 2/2     Running   1 (85m ago)   96m
machine-api-termination-handler-6mm5m                 1/1     Running   0             79m
machine-api-termination-handler-dbvpl                 1/1     Running   0             72m
machine-api-termination-handler-thmvh                 1/1     Running   0             74m

3. Check that the `internalLoadBalancer` field is set on the master machine provider spec by default.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml |grep internalLoadBalancer
      internalLoadBalancer: huliu-azure412pr-rqhjd-internal
      internalLoadBalancer: huliu-azure412pr-rqhjd-internal
      internalLoadBalancer: huliu-azure412pr-rqhjd-internal
liuhuali@Lius-MacBook-Pro huali-test %
4. Create a ControlPlaneMachineSet with the same configuration as the current master machines.

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlpanemachineset-azure.yaml
controlplanemachineset.machine.openshift.io/cluster created
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   AGE
cluster   3         3         3       3                       15s
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE              REGION   ZONE   AGE
huliu-azure412pr-rqhjd-master-0               Running   Standard_D8s_v3   eastus   2      105m
huliu-azure412pr-rqhjd-master-1               Running   Standard_D8s_v3   eastus   3      105m
huliu-azure412pr-rqhjd-master-2               Running   Standard_D8s_v3   eastus   1      105m
huliu-azure412pr-rqhjd-worker-eastus1-bk56k   Running   Standard_D4s_v3   eastus   1      98m
huliu-azure412pr-rqhjd-worker-eastus2-8dv6h   Running   Standard_D4s_v3   eastus   2      98m
huliu-azure412pr-rqhjd-worker-eastus3-p68w2   Running   Standard_D4s_v3   eastus   3      98m

5. Edit the ControlPlaneMachineSet and change something to trigger a RollingUpdate. The RollingUpdate succeeds and the cluster stays healthy.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE              REGION   ZONE   AGE
huliu-azure412pr-rqhjd-master-25txd-0         Running   Standard_D4s_v3   eastus   3      44m
huliu-azure412pr-rqhjd-master-5vtbs-1         Running   Standard_D4s_v3   eastus   3      60m
huliu-azure412pr-rqhjd-master-7bjrk-2         Running   Standard_D4s_v3   eastus   3      25m
huliu-azure412pr-rqhjd-worker-eastus1-bk56k   Running   Standard_D4s_v3   eastus   1      3h52m
huliu-azure412pr-rqhjd-worker-eastus2-8dv6h   Running   Standard_D4s_v3   eastus   2      3h52m
huliu-azure412pr-rqhjd-worker-eastus3-p68w2   Running   Standard_D4s_v3   eastus   3      3h52m

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         3h30m   Cluster version is 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                        VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication              4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      15m
baremetal                   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m
cloud-controller-manager    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h54m
cloud-credential            4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h59m
cluster-autoscaler          4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h52m
config-operator             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h53m
console                     4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      46m
control-plane-machine-set   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      52m
csi-snapshot-controller     4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m
dns                         4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h40m
etcd                        4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h44m
image-registry              4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h38m
ingress                     4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h39m
insights                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h46m
kube-apiserver              4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True        False         False      3h41m
kube-controller-manager                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h44m
kube-scheduler                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h44m
kube-storage-version-migrator              4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   107m
machine-api                                4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h37m
machine-approver                           4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h52m
machine-config                             4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h51m
marketplace                                4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h52m
monitoring                                 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h35m
network                                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h54m
node-tuning                                4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h40m
openshift-apiserver                        4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   117m
openshift-controller-manager               4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h40m
openshift-samples                          4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h39m
operator-lifecycle-manager                 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h52m
operator-lifecycle-manager-catalog         4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h52m
operator-lifecycle-manager-packageserver   4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h40m
service-ca                                 4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h53m
storage                                    4.12.0-0.ci.test-2022-08-24-025537-ci-ln-xj7tqhb-latest   True   False   False   3h52m
liuhuali@Lius-MacBook-Pro huali-test %

6. Edit the ControlPlaneMachineSet and change `internalLoadBalancer` to an invalid value; the new machine fails with an InvalidConfiguration error.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE              REGION   ZONE   AGE
huliu-azure412pr-rqhjd-master-25txd-0         Running   Standard_D4s_v3   eastus   3      48m
huliu-azure412pr-rqhjd-master-5vtbs-1         Running   Standard_D4s_v3   eastus   3      65m
huliu-azure412pr-rqhjd-master-7bjrk-2         Running   Standard_D4s_v3   eastus   3      30m
huliu-azure412pr-rqhjd-master-zvt22-0         Failed                                      5s
huliu-azure412pr-rqhjd-worker-eastus1-bk56k   Running   Standard_D4s_v3   eastus   1      3h56m
huliu-azure412pr-rqhjd-worker-eastus2-8dv6h   Running   Standard_D4s_v3   eastus   2      3h56m
huliu-azure412pr-rqhjd-worker-eastus3-p68w2   Running   Standard_D4s_v3   eastus   3      3h56m

liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-azure412pr-rqhjd-master-zvt22-0 -o yaml
...
errorMessage: 'failed to reconcile machine "huliu-azure412pr-rqhjd-master-zvt22-0": network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource ''Microsoft.Network/loadBalancers/invalid'' under resource group ''huliu-azure412pr-rqhjd-rg'' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"'
errorReason: InvalidConfiguration
...
7. Edit the ControlPlaneMachineSet and remove the `internalLoadBalancer` field; the admission webhook denies the change.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
error: controlplanemachinesets.machine.openshift.io "cluster" could not be patched: admission webhook "controlplanemachineset.machine.openshift.io" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.internalLoadBalancer: Required value: internalLoadBalancer is required for control plane machines
You can run `oc replace -f /var/folders/yc/y9zy01jn3f51r9knbpsm_55r0000gn/T/oc-edit-889700000.yaml` to try this update again.
liuhuali@Lius-MacBook-Pro huali-test %
Already verified this before the PR merge; refer to Comment 11. Moving this to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399