Bug 1994111

Summary: Fix missing NoExecute taint on NotReady nodes for Kubernetes
Product: OpenShift Container Platform
Reporter: blpowers <blpowers>
Component: kube-controller-manager
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: high
Docs Contact:
Priority: high
Version: 4.7
CC: aos-bugs, chhudson, jchaloup, maszulik, mfojtik, rhowe, rmarwaha
Target Milestone: ---
Keywords: Automation, Bugfix, EasyFix
Target Release: 4.7.z
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-12 19:51:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2003027
Bug Blocks: 2004836

Description blpowers@redhat.com 2021-08-16 18:46:40 UTC
1. Proposed title of this feature request

Fix missing NoExecute taint on NotReady nodes for Kubernetes

3. What is the nature and description of the request?

Bugfix
 
4. Why would you like this alteration? (List the business requirements here)

Application pods do not fail over when a node becomes NotReady.
This impacts resiliency and high availability.
 
5. How would you like to achieve this? (List the functional requirements here)

The NoExecute taint must be added to NotReady nodes so that pods running on a failed node are evicted and can restart on a healthy node.
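
As a rough illustration of this requirement (not the upstream controller code), the expected relationship between a node's Ready condition and its NoExecute taint can be sketched as below. The taint keys are the well-known Kubernetes ones; the function name `expected_noexecute_taint` is hypothetical and used only for this example.

```python
from typing import Optional

# Well-known Kubernetes taint keys for unhealthy nodes.
TAINT_NOT_READY = "node.kubernetes.io/not-ready"
TAINT_UNREACHABLE = "node.kubernetes.io/unreachable"


def expected_noexecute_taint(ready_status: str) -> Optional[dict]:
    """Return the NoExecute taint expected for a given Ready condition status.

    ready_status is the string found in the node's Ready condition:
    "True", "False", or "Unknown". (Hypothetical helper for illustration.)
    """
    if ready_status == "False":
        # Kubelet reports that the node is not ready.
        return {"key": TAINT_NOT_READY, "effect": "NoExecute"}
    if ready_status == "Unknown":
        # The control plane has lost contact with the node.
        return {"key": TAINT_UNREACHABLE, "effect": "NoExecute"}
    # A Ready node should carry neither taint.
    return None
```

Pods without a matching toleration are evicted from a node that carries a NoExecute taint, which is what allows them to be rescheduled elsewhere.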
 
6. For each functional requirement listed in question 5, specify how Red Hat can test to confirm the requirement is successfully implemented.

The fix is already available in Kubernetes upstream (https://github.com/kubernetes/kubernetes/pull/98168). Red Hat needs to make it available downstream for OCP 4.7.
 
7. Is there already an existing RFE upstream or in Red Hat Bugzilla that you know of?

No
 
8. Do you have any specific timeline for this update?

17th Aug (As soon as possible, since our beta release is scheduled on OCP 4.7)
 
10. List any affected packages or components.

Kubernetes
 
11. Would you be willing and able to assist in testing this functionality if implemented?

yes

Comment 2 Maciej Szulik 2021-08-19 11:59:58 UTC
I will be working on bringing the latest version of k8s 1.20 into OpenShift at the beginning of September.

Comment 3 zhou ying 2021-08-26 08:41:17 UTC
Maciej Szulik:

I've checked OCP 4.7. With a node in the NotReady state, I can't reproduce this issue:

[root@localhost ~]# oc get node
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-48-228.us-east-2.compute.internal   Ready      worker   28m   v1.20.0+4593a24
ip-10-0-49-143.us-east-2.compute.internal   Ready      master   37m   v1.20.0+4593a24
ip-10-0-61-211.us-east-2.compute.internal   Ready      master   37m   v1.20.0+4593a24
ip-10-0-69-178.us-east-2.compute.internal   Ready      master   37m   v1.20.0+4593a24
ip-10-0-75-179.us-east-2.compute.internal   NotReady   worker   27m   v1.20.0+4593a24
[root@localhost ~]# oc describe node/ip-10-0-75-179.us-east-2.compute.internal
Name:               ip-10-0-75-179.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-75-179
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2b
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0a2ea9aeb5ee2e144"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-01dafe43db2faeda0184971ba016d9c0
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-01dafe43db2faeda0184971ba016d9c0
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 26 Aug 2021 16:06:28 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

Does this still need an update?
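
For anyone wanting to script the same check, here is a minimal sketch that inspects a node object in the JSON shape returned by `oc get node <name> -o json` and reports whether it is not Ready yet lacks a NoExecute taint. The helper names are hypothetical, and the sample node is hand-written from the output above; its Ready status of "Unknown" is an assumption, since the Conditions section is not shown there.

```python
def has_noexecute_taint(node: dict) -> bool:
    """True if any taint in spec.taints has the NoExecute effect."""
    taints = node.get("spec", {}).get("taints") or []
    return any(t.get("effect") == "NoExecute" for t in taints)


def is_ready(node: dict) -> bool:
    """True only if the node's Ready condition has status "True"."""
    for cond in node.get("status", {}).get("conditions") or []:
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def missing_noexecute(node: dict) -> bool:
    """True if the node is not Ready but carries no NoExecute taint."""
    return not is_ready(node) and not has_noexecute_taint(node)


# Sample built by hand from the describe output above (Ready status assumed).
sample_node = {
    "spec": {
        "taints": [
            {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"},
            {"key": "node.kubernetes.io/unreachable", "effect": "NoSchedule"},
        ]
    },
    "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
}

print(missing_noexecute(sample_node))
```

A node that reproduces the reported problem would be not Ready and carry only a NoSchedule taint, in which case `missing_noexecute` returns True.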

Comment 5 Maciej Szulik 2021-09-01 10:49:05 UTC
(In reply to zhou ying from comment #3)
> Does this still need an update?

The update to latest k8s patch release is still going to happen.

Comment 6 Maciej Szulik 2021-09-08 12:21:23 UTC
Current PR bringing in kubernetes v1.20.10 is https://github.com/openshift/kubernetes/pull/935

This is bringing in the following changes:

Changelog for v1.20.10: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1209
Changelog for v1.20.9: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1208
Changelog for v1.20.8: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1207
Changelog for v1.20.7: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1206
Changelog for v1.20.6: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1206
Changelog for v1.20.5: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1205
Changelog for v1.20.4: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1204
Changelog for v1.20.3: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1203
Changelog for v1.20.2: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1202
Changelog for v1.20.1: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1201
Changelog for v1.20.0: https://github.com/kubernetes/kubernetes/blob/release-1.20/CHANGELOG/CHANGELOG-1.20.md#changelog-since-v1200

Comment 8 Maciej Szulik 2021-09-30 14:03:20 UTC
This should be fixed by now with https://bugzilla.redhat.com/show_bug.cgi?id=2003027 and specifically https://github.com/openshift/kubernetes/pull/935 which brings in k8s 1.20.10. 
Moving to modified.

Comment 15 errata-xmlrpc 2021-10-12 19:51:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.33 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3686