Bug 1860128
| Summary: | After upgrading to 4.3.28 customer observed mismatch between node labels (region=CanadaCentral) and PV affinity (region=canadacentral) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | emahoney |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | agarcial, aos-bugs, bmilne, jhunter, jokerman, mharri, mjudeiki, rbost, rdave, shea.stewart, sponnaga, zhsun |
| Version: | 4.3.z | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1860829 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 16:16:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1860829 | | |
Description
emahoney
2020-07-23 18:20:20 UTC
The customer just scaled an additional node up and it picked up the node label with CanadaCentral. I am guessing this is a machineconfig or cloud-provider issue now; I still haven't pinned down where exactly this is being pulled from.

In addition to this, even after manually editing the region to canadacentral, they now see a new error:

~~~~
Warning  FailedScheduling  <unknown>  default-scheduler  0/9 nodes are available: 1 node(s) had no available volume zone, 3 node(s) had taints that the pod didn't tolerate, 5 node(s) had volume node affinity conflict.
~~~~

I haven't seen this before, but I believe '1 node(s) had no available volume zone' means the volume is in a different AZ than the node. Can you confirm my understanding of that?

-mahoney

Could this be related to https://github.com/kubernetes/kubernetes/issues/93421 ?

*** Bug 1860410 has been marked as a duplicate of this bug. ***

This is already included in >= 4.5:
https://github.com/openshift/origin/blob/release-4.5/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_standard.go#L492-L494

Moving to MODIFIED for QA to validate.

Verified.
clusterversion: 4.6.0-0.nightly-2020-07-28-195907

steps:
1. Create a cluster in Azure Canada Central.
2. Check `oc get configmap cloud-provider-config -n openshift-config`; `UseInstanceMetadata=True` is set.
3. Create a PVC using the dynamic storage class.

~~~~
$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc1   Bound    pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            managed-premium   29s

$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS      REASON   AGE
pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            Delete           Bound    default/pvc1   managed-premium            29s

$ oc get po
NAME          READY   STATUS    RESTARTS   AGE
task-pv-pod   1/1     Running   0          41s
~~~~

Given that the region affinity for volumes was introduced in 4.4 (and apparently in 4.3), which broke things, the interim fix was to change the cloud config to CamelCase. However, with the fix now enforcing all lowercase, shouldn't we backport this to 4.4 and 4.3 so it gets fixed everywhere and keeps working after upgrading to 4.5? Otherwise we should at least document it in the release notes, so folks can quickly see how to fix the issue if they start running into it.

Backport - yes. No manual documentation please, as we don't have capacity to deal with those in managed service.

Experienced this in a live customer cluster on 4.4.10 that had been automatically updated from 4.3.18.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
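For anyone hitting the same symptom, a minimal way to confirm the case mismatch is to compare the region label on the nodes with the region value written into the PV's node affinity. This is only a sketch, not part of the original report: it assumes the cluster still uses the beta topology label `failure-domain.beta.kubernetes.io/region` (as clusters of this vintage do), and `<pv-name>` is a placeholder for the PersistentVolume bound to the affected PVC.

~~~~
# Show the region label as the kubelet registered it on each node
# (reported as "CanadaCentral" in this case).
oc get nodes -L failure-domain.beta.kubernetes.io/region

# Show the region value the Azure cloud provider put into the volume's
# node affinity (reported as "canadacentral" in this case).
# <pv-name> is a hypothetical placeholder.
oc get pv <pv-name> \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}{"\n"}'
~~~~

If the two values differ only in case, the scheduler reports the "volume node affinity conflict" shown above; the >= 4.5 cloud-provider change referenced earlier normalizes the value to lowercase, and on older clusters the interim workaround described in this bug was to make the cloud config casing match.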