Bug 1860832
Summary: | After upgrading to 4.3.28, customer observed mismatch between node labels (region=CanadaCentral) and pv affinity (region=canadacentral) | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alberto <agarcial> |
Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | aarapov, abudavis, agarcial, ansverma, aos-bugs, bmilne, emahoney, erich, jokerman, jolee, mgugino, mjudeiki, rbolling, rbost, rdave, scuppett, shea.stewart, sponnaga, walters, zhsun |
Version: | 4.3.z | Keywords: | Regression |
Target Milestone: | --- | ||
Target Release: | 4.3.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1860830 | Environment: | |
Last Closed: | 2020-09-09 16:24:42 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1860830 | ||
Bug Blocks: |
Comment 1
Alberto
2020-07-27 09:20:42 UTC
*** Bug 1866312 has been marked as a duplicate of this bug. ***

Could someone please confirm if the bugfix has made it to 4.3.31?

Verification failed.
clusterversion: 4.3.0-0.nightly-2020-08-20-225757
Node label is still upper case: failure-domain.beta.kubernetes.io/region=CanadaCentra
Pods are in Pending status.

I checked that this PR is included in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2020-08-17-103456

$ oc get node --show-labels | grep failure-domain
zhsunazure821-6x7jf-master-0 Ready master 53m v1.16.2+295f6e6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D8s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=CanadaCentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure821-6x7jf-master-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
zhsunazure821-6x7jf-master-1 Ready master 53m v1.16.2+295f6e6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D8s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=CanadaCentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure821-6x7jf-master-1,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
zhsunazure821-6x7jf-master-2 Ready master 53m v1.16.2+295f6e6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D8s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=CanadaCentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure821-6x7jf-master-2,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
zhsunazure821-6x7jf-worker-canadacentral-btjxl Ready worker 41m v1.16.2+295f6e6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D2s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=CanadaCentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure821-6x7jf-worker-canadacentral-btjxl,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
zhsunazure821-6x7jf-worker-canadacentral-jrd5r Ready worker 41m v1.16.2+295f6e6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D2s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=CanadaCentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure821-6x7jf-worker-canadacentral-jrd5r,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS      REASON   AGE
pvc-9e470188-0c0e-4c47-b76c-21a1b33ff5c9   1Gi        RWO            Delete           Bound    default/pvc1   managed-premium            28m

$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc1   Bound    pvc-9e470188-0c0e-4c47-b76c-21a1b33ff5c9   1Gi        RWO            managed-premium   29m

$ oc get po
NAME          READY   STATUS    RESTARTS   AGE
task-pv-pod   0/1     Pending   0          29m

$ oc describe po | tail
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler Failed to bind volumes: pv "pvc-9e470188-0c0e-4c47-b76c-21a1b33ff5c9" node affinity doesn't match node "zhsunazure821-6x7jf-worker-canadacentral-btjxl": No matching NodeSelectorTerms
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.

This is fixed in the latest 4.4. Anyone urgently waiting on the fix can upgrade to 4.4.

I'm not sure why this didn't pass QA. I've looked into the code, and it seems to me that it should have passed. I'm not sure how this code actually makes it into the node manager, or whether the code was actually in the release. Sending to Node team to investigate.

This is possibly due to the node initially registering with the old kubelet in the base RHCOS before the new osimage is deployed. Please ensure that the kubelet in the base RHCOS image contains the fix, since the node can only label itself once at initial registration time.

Verified
clusterversion: 4.3.0-0.nightly-2020-09-01-015751

$ oc get po
NAME          READY   STATUS    RESTARTS   AGE
task-pv-pod   1/1     Running   0          2m29s

$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS      REASON   AGE
pvc-8a1566b4-a1fd-4a44-b498-c19557870d41   1Gi        RWO            Delete           Bound    default/pvc1   managed-premium            2m32s

$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc1   Bound    pvc-8a1566b4-a1fd-4a44-b498-c19557870d41   1Gi        RWO            managed-premium   2m51s

$ oc get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
zhsun93azure-s967t-master-0 Ready master 78m v1.16.2+7279a4a beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D8s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsun93azure-s967t-master-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
zhsun93azure-s967t-master-1 Ready master 78m v1.16.2+7279a4a beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D8s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsun93azure-s967t-master-1,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
zhsun93azure-s967t-master-2 Ready master 78m v1.16.2+7279a4a beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D8s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsun93azure-s967t-master-2,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
zhsun93azure-s967t-worker-canadacentral-ghszk Ready worker 67m v1.16.2+7279a4a beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D2s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsun93azure-s967t-worker-canadacentral-ghszk,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
zhsun93azure-s967t-worker-canadacentral-xhhfk Ready worker 65m v1.16.2+7279a4a beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D2s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=0,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsun93azure-s967t-worker-canadacentral-xhhfk,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.3.35 bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:3457

> This is possibly due to the node initially registering with the old kubelet in the base RHCOS before the new osimage is deployed.

No, kubelet shouldn't start until we've updated.
https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md

That said, there were bugs in 4.3 where, if we encountered an error during that initial upgrade/pivot, we would still stumble on and start kubelet anyway. That has since been fixed. See e.g.
https://github.com/openshift/machine-config-operator/commit/75dbab9c54c6cb3470075af1da1b139ecea02d38

(In reply to Colin Walters from comment #22)
> > This is possibly due to the node initially registering with the old kubelet in the base RHCOS before the new osimage is deployed.
>
> No, kubelet shouldn't start until we've updated.
> https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md
>
> That said there were bugs in 4.3 which if we encountered an error during
> that initial upgrade/pivot we would still stumble on and start kubelet
> anyways which has since been fixed.
> See e.g.
> https://github.com/openshift/machine-config-operator/commit/75dbab9c54c6cb3470075af1da1b139ecea02d38

Okay, looks like that particular fix only landed in 4.5 and newer.

So, for most users, the latest version of 4.3 or newer should be unaffected. However, in some edge cases in releases older than 4.5, for clusters originally installed with an affected version of 4.3 or below, the labels may have to be manually changed on any hosts that are interrupted during first boot. Some clusters may need to update their boot image manually to permanently fix this issue, since automated boot image updates do not exist today. In a future release, we hope to assist with automatically updating boot images for each supported platform. I'm unsure how this might be achieved today, however.

Unless the MCO/RHCOS team can sort out how to make this change on each platform, any current and future clusters affected by this will still need their labels changed manually.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
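
For anyone who ends up changing the labels by hand as described above, here is a minimal Python sketch of the check involved. It is not part of the fix and not taken from any OpenShift component. It assumes you have already collected each node's labels into a dict (for example from `oc get node -o json`) and that you know the region value recorded in the PV's node affinity. The helper names `case_only_mismatches` and `relabel_commands` are hypothetical. The sketch flags nodes whose region label matches the PV value only when case is ignored, which is the CanadaCentral vs canadacentral situation in this bug, and prints the `oc label node ... --overwrite` commands that would align them.

```python
# Sketch only: the helper names and the input layout are assumptions for
# illustration, not an OpenShift or Kubernetes API.

REGION_KEY = "failure-domain.beta.kubernetes.io/region"


def case_only_mismatches(node_labels, pv_region, key=REGION_KEY):
    """Return sorted node names whose `key` label equals `pv_region`
    when case is ignored, but differs as an exact string."""
    mismatched = []
    for name, labels in node_labels.items():
        value = labels.get(key)
        if value is None:
            continue  # node has no region label at all; not this bug
        if value != pv_region and value.lower() == pv_region.lower():
            mismatched.append(name)
    return sorted(mismatched)


def relabel_commands(node_names, pv_region, key=REGION_KEY):
    """Build one `oc label node` command per node to set `key` to `pv_region`."""
    return [
        f"oc label node {name} {key}={pv_region} --overwrite"
        for name in node_names
    ]


if __name__ == "__main__":
    # Label values copied from the two verification runs in this bug.
    nodes = {
        "zhsunazure821-6x7jf-worker-canadacentral-btjxl": {REGION_KEY: "CanadaCentral"},
        "zhsun93azure-s967t-worker-canadacentral-ghszk": {REGION_KEY: "canadacentral"},
    }
    affected = case_only_mismatches(nodes, "canadacentral")
    for command in relabel_commands(affected, "canadacentral"):
        print(command)
```

Review the printed commands before running them. Whether a manually set label survives a later reboot or kubelet re-registration on your version is something to verify on your own cluster.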