With the move to out-of-tree providers in Azure (4.10) and Azure Stack Hub (4.9), the excludeMasterFromStandardLB: true value in the cloud provider config causes a problem: if a master node restarts, the service controller does not add it back to the load balancer. This value should be set to false.
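A minimal sketch of how the setting could be checked, assuming the cloud provider config is a JSON document (on the nodes in this report it lives at /etc/kubernetes/cloud.conf) and the key is spelled excludeMasterFromStandardLB. The function name is hypothetical and this is not part of the fix itself.

```python
import json
from typing import Optional


def exclude_master_setting(cloud_conf_text: str) -> Optional[bool]:
    """Return the excludeMasterFromStandardLB value from a cloud.conf JSON string.

    Returns None when the key is absent, so the caller can decide how to
    treat the provider's default (this sketch makes no claim about it).
    """
    config = json.loads(cloud_conf_text)
    value = config.get("excludeMasterFromStandardLB")
    if value is None:
        return None
    return bool(value)
```

For example, reading the file on a node with `open("/etc/kubernetes/cloud.conf").read()` and passing the text to `exclude_master_setting` should return False on a cluster that has the fix.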
Verified fixed with 4.9 nightly build 4.9.0-0.nightly-2021-08-01-223336: after restarting the master, the service controller added it back to the load balancer. Created a related test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-4317
Updated the test case link: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-43176
Add some more verification steps (per 4.9.0-0.nightly-2021-08-26-040328 build) based on comment 2.

[root@preserve-jialiu-ansible ~]# oc debug node/qeci-26032-h5ngk-master-0
Starting pod/qeci-26032-h5ngk-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.7
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ps -ef|grep kubelet
root 1988 1 19 04:20 ? 00:37:05 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip= --minimum-container-ttl-duration=6m0s --cloud-provider=azure --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-config=/etc/kubernetes/cloud.conf --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:347702e4f91395e1f3d4cbae92248fd164e58f577da8c453a9d0b225f867426b --system-reserved=cpu=500m,memory=1Gi --v=2
sh-4.4# cat /etc/kubernetes/cloud.conf
{
  "cloud": "AzurePublicCloud",
  "tenantId": "6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee",
  "aadClientId": "",
  "aadClientSecret": "",
  "aadClientCertPath": "",
  "aadClientCertPassword": "",
  "useManagedIdentityExtension": true,
  "userAssignedIdentityID": "",
  "subscriptionId": "53b8f551-f0fc-4bea-8cba-6d1fefd54c8a",
  "resourceGroup": "qeci-26032-h5ngk-rg",
  "location": "centralus",
  "vnetName": "qeci-26032-h5ngk-vnet",
  "vnetResourceGroup": "qeci-26032-h5ngk-rg",
  "subnetName": "qeci-26032-h5ngk-worker-subnet",
  "securityGroupName": "qeci-26032-h5ngk-nsg",
  "routeTableName": "qeci-26032-h5ngk-node-routetable",
  "primaryAvailabilitySetName": "",
  "vmType": "",
  "primaryScaleSetName": "",
  "cloudProviderBackoff": true,
  "cloudProviderBackoffRetries": 0,
  "cloudProviderBackoffExponent": 0,
  "cloudProviderBackoffDuration": 6,
  "cloudProviderBackoffJitter": 0,
  "cloudProviderRateLimit": false,
  "cloudProviderRateLimitQPS": 0,
  "cloudProviderRateLimitBucket": 0,
  "cloudProviderRateLimitQPSWrite": 0,
  "cloudProviderRateLimitBucketWrite": 0,
  "useInstanceMetadata": true,
  "loadBalancerSku": "standard",
  "excludeMasterFromStandardLB": false,
  "disableOutboundSNAT": null,
  "maximumLoadBalancerRuleCount": 0
}sh-4.4# cat /etc/kubernetes/cloud.conf|grep excludeMasterFromStandardLB
"excludeMasterFromStandardLB": false,
sh-4.4# exit
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759