Bug 1860128 - After upgrading to 4.3.28 customer observed mismatch between node labels (region=CanadaCentral) and pv affinity (region=canadacentral)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 1860410
Depends On:
Blocks: 1860829
Reported: 2020-07-23 18:20 UTC by emahoney
Modified: 2020-10-27 16:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1860829 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:16:56 UTC
Target Upstream Version:




Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2020:4196  0        None      None    None     2020-10-27 16:17:14 UTC

Description emahoney 2020-07-23 18:20:20 UTC
Description of problem:
ISSUE: After upgrading to 4.3.28, the customer is seeing an inconsistency between their node labels and the region affinity specified by their cloud storage provider.

~~~~
Affinity defined in PV created by Azure storageclass:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/region in [canadacentral]

Node labels:
failure-domain.beta.kubernetes.io/region=CanadaCentral
failure-domain.beta.kubernetes.io/zone=0
~~~~

This causes scheduling issues: label matching is case-sensitive, so the affinity and the node labels no longer match. As a workaround, we have recommended that the customer change the labels on the node to match the affinity defined in the PV. The customer is asking why they ran into this issue. From what we can tell, either the node labels changed or the region name in Azure changed its format. The latter is unlikely but possible.
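
To illustrate why the two values above do not match, here is a minimal sketch of an exact-match check on the region label. It is not the scheduler's code; `term_matches` is a hypothetical helper, and the label values are copied from the output above.

```python
REGION_KEY = "failure-domain.beta.kubernetes.io/region"


def term_matches(node_labels, key, allowed_values):
    """Hypothetical helper: True if the node's label value is one of the
    allowed values. The comparison is exact, so it is case-sensitive."""
    return node_labels.get(key) in allowed_values


# Values taken from the PV affinity and node labels reported above.
node_labels = {
    REGION_KEY: "CanadaCentral",
    "failure-domain.beta.kubernetes.io/zone": "0",
}
pv_allowed_regions = ["canadacentral"]

# The node as labeled after the upgrade does not satisfy the PV term.
print(term_matches(node_labels, REGION_KEY, pv_allowed_regions))

# After the relabel workaround the same check passes.
relabeled = {**node_labels, REGION_KEY: "canadacentral"}
print(term_matches(relabeled, REGION_KEY, pv_allowed_regions))
```

The first check prints False and the second prints True, which corresponds to the behavior before and after the manual relabel.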

Referring to https://docs.openshift.com/container-platform/4.3/installing/installing_azure/installing-azure-customizations.html and the Azure docs, all references to platform.<platform>.region for Azure are lowercase only, so I doubt that changed. That leaves the node label changing. The only cause I can think of would be installing the cluster with the config below, although in that case I would have expected the pods to _never_ have scheduled in the past.

~~~~
platform:
  azure:
    region: CanadaCentral 

https://docs.openshift.com/container-platform/4.3/installing/installing_azure/installing-azure-private.html#installation-azure-config-yaml_installing-azure-private
~~~~

Going forward, we are interested in whether there are any alternative root causes that we have not considered. 

Version-Release number of selected component (if applicable):
4.3.28

How reproducible:
Only a single env has seen this


Steps to Reproduce:
1. n/a
2.
3.

Actual results:
Affinity definition and node labels mismatch


Expected results:
Affinity definition in PV and node labels match. 


Additional info:

Comment 1 emahoney 2020-07-23 18:35:41 UTC
The customer just scaled up an additional node, and it picked up the node label with CanadaCentral. I am now guessing this is a machineconfig or cloudprovider issue. I still haven't pinned down exactly where this value is being pulled from.
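
Since newly scaled nodes come up with the mixed-case label, the relabel workaround has to be repeated for each of them. A rough sketch of how the relabel commands could be generated from `oc get nodes -o json` output follows. The function name and the node names in the sample are made up for illustration; only the label key and region values come from this report.

```python
import json
import shlex

REGION_KEY = "failure-domain.beta.kubernetes.io/region"


def relabel_commands(nodes_json):
    """Hypothetical helper: return one `oc label` command for every node
    whose region label contains uppercase characters."""
    commands = []
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        region = node["metadata"].get("labels", {}).get(REGION_KEY)
        if region is not None and region != region.lower():
            commands.append(
                f"oc label node {shlex.quote(name)} "
                f"{REGION_KEY}={region.lower()} --overwrite"
            )
    return commands


# Sample input shaped like `oc get nodes -o json`; node names are invented.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "worker-a", "labels": {REGION_KEY: "CanadaCentral"}}},
        {"metadata": {"name": "worker-b", "labels": {REGION_KEY: "canadacentral"}}},
    ]
})

for command in relabel_commands(SAMPLE):
    print(command)
```

Only `worker-a` gets a command here, because `worker-b` already carries the lowercase value. The commands are printed rather than executed so they can be reviewed first.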

In addition to this, even after manually editing the region to canadacentral, they now see a new error:

~~~~
  Warning  FailedScheduling  <unknown>  default-scheduler  0/9 nodes are available: 1 node(s) had no available volume zone, 3 node(s) had taints that the pod didn't tolerate, 5 node(s) had volume node affinity conflict.
~~~~

I haven't seen this before, but I believe '1 node(s) had no available volume zone' means the volume is in a different AZ than the node. Can you confirm my understanding of that?

-mahoney

Comment 7 Robert Bost 2020-07-25 00:43:26 UTC
Could this be related to https://github.com/kubernetes/kubernetes/issues/93421 ?

Comment 13 Alberto 2020-07-27 08:15:00 UTC
*** Bug 1860410 has been marked as a duplicate of this bug. ***

Comment 14 Alberto 2020-07-27 09:10:45 UTC
This is already included in >= 4.5 https://github.com/openshift/origin/blob/release-4.5/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_standard.go#L492-L494

Moving to MODIFIED for QA to validate.
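
For readers who do not want to follow the link, the idea of the fix as discussed in this bug (the region is forced to lowercase) can be sketched roughly as below. This is an illustration only, not the Go code at the linked lines; `normalize_region` and `node_region_label` are hypothetical names.

```python
REGION_KEY = "failure-domain.beta.kubernetes.io/region"


def normalize_region(region):
    """Hypothetical helper: fold the region name to lowercase so that
    'CanadaCentral' and 'canadacentral' produce the same label value."""
    return region.lower()


def node_region_label(cloud_region):
    """Hypothetical helper: build the region label from the normalized name."""
    return {REGION_KEY: normalize_region(cloud_region)}


pv_allowed_regions = ["canadacentral"]

# Either spelling of the region now yields a label the PV affinity accepts.
for reported in ("CanadaCentral", "canadacentral"):
    labels = node_region_label(reported)
    print(reported, "->", labels[REGION_KEY], labels[REGION_KEY] in pv_allowed_regions)
```

With the region normalized at the point where the label is produced, the spelling used in the cloud config no longer affects whether the node label matches the PV affinity.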

Comment 17 sunzhaohua 2020-07-29 10:04:59 UTC
Verified.
clusterversion: 4.6.0-0.nightly-2020-07-28-195907
steps:
1. Create a cluster in Azure Canada Central. 
2. Check that `UseInstanceMetadata=True` is set in the output of "oc get configmap cloud-provider-config -n openshift-config".
3. Create PVC using dynamic storage class.
$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc1   Bound    pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            managed-premium   29s
$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS      REASON   AGE
pvc-a140403f-d276-490b-b413-7c359eb1c7ea   1Gi        RWO            Delete           Bound    default/pvc1   managed-premium            29s
$ oc get po
NAME          READY   STATUS    RESTARTS   AGE
task-pv-pod   1/1     Running   0          41s

Comment 18 Marcel Härri 2020-08-04 09:09:39 UTC
Region affinity for volumes was introduced in 4.4 (and apparently in 4.3), which broke things. To fix it in the meantime, one had to change the cloud config to CamelCase. However, with the fix now enforcing all lowercase, shouldn't we backport this to 4.4 and 4.3, so that it is fixed everywhere and continues to work after upgrading to 4.5? Otherwise we should at least document it in the release notes, so folks can quickly see how to fix the issue if they start running into it.

Comment 19 Mangirdas Judeikis 2020-08-04 11:36:53 UTC
Backport - yes.
No manual documentation please, as we don't have the capacity to deal with those in the managed service.

Comment 20 shea.stewart 2020-08-11 19:30:33 UTC
Experienced this in a live customer cluster on 4.4.10 that had been automatically updated from 4.3.18.

Comment 24 errata-xmlrpc 2020-10-27 16:16:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

