Description of problem:
ISSUE: After upgrading to 4.3.28, the customer is seeing an inconsistency between their node labels and the region affinity specified by their cloud storage provider.
Affinity defined in PV created by Azure storageclass:
Term 0: failure-domain.beta.kubernetes.io/region in [canadacentral]
This causes scheduling issues because the node labels no longer match the affinity. As a workaround, we have recommended that the customer change the labels on the node to match the affinity defined in the PV. The customer is asking why they ran into this issue. From what we can tell, either the node labels changed or the region in Azure changed its format. The latter is unlikely but possible.
Referring to https://docs.openshift.com/container-platform/4.3/installing/installing_azure/installing-azure-customizations.html and the Azure docs, all references to platform.<platform>.region for Azure are lowercase only, so I doubt that changed. That leaves the node label changing. The only cause I can think of would be installing the cluster with the config below, although if that were the case I would have expected the pods to _never_ have scheduled in the past.
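To illustrate why the mismatch blocks scheduling: label values are compared as exact, case-sensitive strings, so a node label that differs only in case (such as the `CanadaCentral` value reported later in this bug) does not satisfy a term that requires `canadacentral`. Below is a minimal Python sketch of that comparison. The function name and the node label dictionaries are hypothetical and are not taken from the customer's cluster.

```python
REGION_KEY = "failure-domain.beta.kubernetes.io/region"


def matches_in_term(node_labels, key, allowed_values):
    """Return True if the node's value for `key` is one of `allowed_values`.

    The comparison is an exact, case-sensitive string match.
    """
    value = node_labels.get(key)
    return value is not None and value in allowed_values


# Hypothetical node labels, for illustration only.
node_lower = {REGION_KEY: "canadacentral"}
node_mixed = {REGION_KEY: "CanadaCentral"}

# Term 0 from the PV: region in [canadacentral]
pv_values = ["canadacentral"]

print(matches_in_term(node_lower, REGION_KEY, pv_values))  # True
print(matches_in_term(node_mixed, REGION_KEY, pv_values))  # False
```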
Going forward, we are interested in whether there are any alternative root causes that we have not considered.
Version-Release number of selected component (if applicable):
Only a single env has seen this
Steps to Reproduce:
Actual results:
Affinity definition in PV and node labels do not match.
Expected results:
Affinity definition in PV and node labels match.
The customer just scaled up an additional node and it picked up the node label with CanadaCentral. I am guessing this is a machineconfig or cloudprovider issue now. I still haven't pinned down where exactly this value is being pulled from.
In addition to this, even after manually editing the region to canadacentral, they now see a new error:
Warning FailedScheduling <unknown> default-scheduler 0/9 nodes are available: 1 node(s) had no available volume zone, 3 node(s) had taints that the pod didn't tolerate, 5 node(s) had volume node affinity conflict.
I haven't seen this before, but I believe '1 node(s) had no available volume zone' means the volume is in a different AZ than the node. Can you confirm my understanding of that?
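For reference, my understanding of the upstream scheduler's volume zone check (an assumption, not verified against the 4.3 source) is that it compares both the zone and the region labels on the bound PV against the same labels on the node. If that is right, a region label mismatch could produce this message as well as a zone mismatch. A simplified Python sketch of that logic follows. The function name and the label values are hypothetical, and multi-zone PV label values are not handled.

```python
ZONE_KEY = "failure-domain.beta.kubernetes.io/zone"
REGION_KEY = "failure-domain.beta.kubernetes.io/region"


def volume_zone_conflict(node_labels, pv_labels):
    """Return True if the PV's zone/region labels conflict with the node's."""
    # Assumed behavior: a node with neither label is not constrained.
    if ZONE_KEY not in node_labels and REGION_KEY not in node_labels:
        return False
    for key in (ZONE_KEY, REGION_KEY):
        if key not in pv_labels:
            continue  # the PV does not constrain this key
        if node_labels.get(key) != pv_labels[key]:
            return True  # exact, case-sensitive comparison failed
    return False


# Hypothetical label values, for illustration only.
pv = {ZONE_KEY: "canadacentral-1", REGION_KEY: "canadacentral"}
node_same = {ZONE_KEY: "canadacentral-1", REGION_KEY: "canadacentral"}
node_other_zone = {ZONE_KEY: "canadacentral-2", REGION_KEY: "canadacentral"}
node_mixed_case = {ZONE_KEY: "canadacentral-1", REGION_KEY: "CanadaCentral"}

print(volume_zone_conflict(node_same, pv))        # False
print(volume_zone_conflict(node_other_zone, pv))  # True
print(volume_zone_conflict(node_mixed_case, pv))  # True
```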
Could this be related to https://github.com/kubernetes/kubernetes/issues/93421 ?
*** Bug 1860410 has been marked as a duplicate of this bug. ***
This is already included in >= 4.5 https://github.com/openshift/origin/blob/release-4.5/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_standard.go#L492-L494
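As I read it, the effect of that change is that the region is lowercased before it is used as a label value, so it compares equal to the lowercase region in the PV affinity. A Python sketch of the idea only; the actual fix lives in the Go cloud provider code linked above, and the function name here is hypothetical.

```python
def normalize_region(region):
    """Lowercase a region name so label values compare consistently."""
    return region.lower()


# Region value required by the PV affinity term in this bug.
pv_region = "canadacentral"

print(normalize_region("CanadaCentral"))               # canadacentral
print(normalize_region("CanadaCentral") == pv_region)  # True
```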
Moving to MODIFIED for QA to validate.
1. Create a cluster in Azure Canada Central.
2. Run "oc get configmap cloud-provider-config -n openshift-config" and check that `UseInstanceMetadata=True` is set.
3. Create PVC using dynamic storage class.
$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc1   Bound    pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            managed-premium   29s
$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS      REASON   AGE
pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            Delete           Bound    default/pvc1   managed-premium            29s
$ oc get po
NAME          READY   STATUS    RESTARTS   AGE
task-pv-pod   1/1     Running   0          41s
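The check in step 2 could be scripted roughly as follows. This Python sketch assumes the Azure cloud provider config is stored as a JSON document with a `useInstanceMetadata` key; the sample text, the `cloud` and `location` fields in it, and the function name are illustrative, not output captured from this cluster.

```python
import json


def uses_instance_metadata(config_text):
    """Return True if the JSON cloud config enables useInstanceMetadata."""
    config = json.loads(config_text)
    # Assumed key name; a missing key is treated as disabled.
    return bool(config.get("useInstanceMetadata", False))


# Hypothetical config content, for illustration only.
sample_config = """
{
  "cloud": "AzurePublicCloud",
  "location": "canadacentral",
  "useInstanceMetadata": true
}
"""

print(uses_instance_metadata(sample_config))  # True
```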
Region affinity for volumes was introduced in 4.4 (and apparently also in 4.3), which broke things. To fix it in the meantime, one had to change the cloud config to CamelCase. However, with the fix now enforcing all lowercase, shouldn't we backport it to 4.4 and 4.3 so that this is fixed everywhere and continues to work after upgrading to 4.5? Otherwise we should at least document it in the release notes, so folks can quickly see how to fix the issue if they run into it.
Backport - yes.
No manual documentation please, as we don't have the capacity to deal with that in the managed service.
Experienced this in a live customer cluster on 4.4.10 that had been automatically updated from 4.3.18.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.