Bug 1860128
| Summary: | After upgrading to 4.3.28 customer observed mismatch between node labels (region=CanadaCentral) and PV affinity (region=canadacentral) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | emahoney |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | agarcial, aos-bugs, bmilne, jhunter, jokerman, mharri, mjudeiki, rbost, rdave, shea.stewart, sponnaga, zhsun |
| Version: | 4.3.z | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1860829 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 16:16:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1860829 | | |
Description
emahoney
2020-07-23 18:20:20 UTC
The customer just scaled an additional node up and it picked up the node label with CanadaCentral. I am guessing this is a machineconfig or cloud-provider issue now; I still haven't pinned down where exactly this is being pulled from.

In addition to this, even after manually editing the region to canadacentral, they now see a new error:

~~~~
Warning  FailedScheduling  <unknown>  default-scheduler  0/9 nodes are available: 1 node(s) had no available volume zone, 3 node(s) had taints that the pod didn't tolerate, 5 node(s) had volume node affinity conflict.
~~~~

I haven't seen this before, but I believe '1 node(s) had no available volume zone' means the volume is in a different AZ than the node. Can you confirm my understanding of that?

-mahoney

Could this be related to https://github.com/kubernetes/kubernetes/issues/93421 ?

*** Bug 1860410 has been marked as a duplicate of this bug. ***

This is already included in >= 4.5:
https://github.com/openshift/origin/blob/release-4.5/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_standard.go#L492-L494

Moving to MODIFIED for QA to validate.

Verified.
clusterversion: 4.6.0-0.nightly-2020-07-28-195907

steps:
1. Create a cluster in Azure Canada Central.
2. Check `oc get configmap cloud-provider-config -n openshift-config`; `UseInstanceMetadata=True` is set.
3. Create a PVC using the dynamic storage class.

~~~~
$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc1   Bound    pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            managed-premium   29s

$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS      REASON   AGE
pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            Delete           Bound    default/pvc1   managed-premium            29s

$ oc get po
NAME          READY   STATUS    RESTARTS   AGE
task-pv-pod   1/1     Running   0          41s
~~~~

Given that the region affinity for volumes was introduced in 4.4 (and apparently in 4.3), which broke things, the interim fix was to change the cloud config to CamelCase. However, with the fix now enforcing all lowercase, shouldn't we backport this to 4.4 and 4.3 so it gets fixed everywhere and keeps working after upgrading to 4.5? Otherwise we should at least document it in the release notes, so folks can quickly see how to fix the issue if they start running into it.

Backport - yes. No manual documentation please, as we don't have capacity to deal with those in managed service.

Experienced this in a live customer cluster on 4.4.10 that had been automatically updated from 4.3.18.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
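For anyone hitting the same symptom, a minimal way to confirm the case mismatch is to compare the region label on the nodes with the region value written into the PV's node affinity. This is only a sketch, not part of the original report: it assumes the cluster still uses the beta topology label `failure-domain.beta.kubernetes.io/region` (as clusters of this vintage do), and `<pv-name>` is a placeholder for the PersistentVolume bound to the affected PVC.

~~~~
# Show the region label as the kubelet registered it on each node
# (reported as "CanadaCentral" in this case).
oc get nodes -L failure-domain.beta.kubernetes.io/region

# Show the region value the Azure cloud provider put into the volume's
# node affinity (reported as "canadacentral" in this case).
# <pv-name> is a hypothetical placeholder.
oc get pv <pv-name> \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}{"\n"}'
~~~~

If the two values differ only in case, the scheduler reports the "volume node affinity conflict" shown above; the >= 4.5 cloud-provider change referenced earlier normalizes the value to lowercase, and on older clusters the interim workaround described in this bug was to make the cloud config casing match.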