Bug 1672894 - [OCP4 Beta] Cluster node status shows Ready after powered off a master node or a worker node
Summary: [OCP4 Beta] Cluster node status shows Ready after powered off a master node o...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Seth Jennings
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1664187
Reported: 2019-02-06 07:04 UTC by Selim Jahangir
Modified: 2019-06-04 10:43 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:31 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:43:52 UTC

Comment 1 Maciej Szulik 2019-02-06 16:05:16 UTC
I've done some more testing, the current situation looks like this:

1. T 0m00s: Shut down one of the masters (I ssh-ed into it and invoked shutdown -h now)
2. T 0m15s: AWS web console notices the machine is going down within seconds

-> 5 mins wait <- according to Clayton that's too long!

4. T 5m00s: Node becomes NotReady in oc get nodes
5. T 7m30s: New node becomes ready

Moving to cloud team.

Comment 2 Maciej Szulik 2019-02-06 16:07:46 UTC
Seth can you also have a look at this one?

Comment 3 Seth Jennings 2019-02-06 16:13:37 UTC
Nodes should go NotReady after 40s of not reporting status to the apiserver.

All pods are evicted from the node after 5m of not reporting status to the apiserver.  This is so the pods can be started on other nodes.

The cloud provider code removes Nodes that correspond to Terminated instances in AWS, but not Shutdown.

In terms of what the machine API does in response to Nodes that are not Ready, I have no idea.
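
To make the two thresholds above concrete, here is a minimal sketch of the expected timeline. The function name and the constants are illustrative only; they mirror the 40s and 5m figures quoted in this comment, not values read from a cluster, and the real controller logic is more involved.

```python
from datetime import timedelta

# Figures quoted in this comment; on a real cluster they come from the
# kube-controller-manager configuration and may differ.
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)
POD_EVICTION_AFTER = timedelta(minutes=5)


def expected_node_state(silence: timedelta) -> str:
    """State this comment predicts after `silence` without a status report."""
    if silence < NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if silence < POD_EVICTION_AFTER:
        return "NotReady"
    return "NotReady, pods evicted"


for seconds in (10, 60, 300):
    print(seconds, expected_node_state(timedelta(seconds=seconds)))
```

With these assumed values, a node that stays Ready for a full 5 minutes after power-off does not match the expected timeline.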

Comment 4 Jan Chaloupka 2019-02-06 16:15:12 UTC
> 4. T 5m00s: Node becomes NotReady in oc get nodes

We can't do anything until a node goes Unready. Before that, the node is in Ready state and considered healthy. IINM, node controller switches node status to NotReady if it does not get any new node status in 5 minutes. We might set a special timeout for master nodes (or based on some label) if it makes sense. Otherwise, I don't see a way to fix it on the cluster API side.

Or, we might check node status of master nodes and in case the last update timestamp is older than e.g. 2 minutes, we may trigger some recovery procedure that tries to decide if a master node is compromised.
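
A rough sketch of that second idea, assuming node records have already been fetched into plain dictionaries. The field names (`name`, `role`, `last_update`) and the function `stale_masters` are hypothetical, not part of any existing API; the 2-minute cutoff is the example value from above.

```python
from datetime import datetime, timedelta, timezone


def stale_masters(nodes, now, max_age=timedelta(minutes=2)):
    """Return names of master nodes whose last status update is older than max_age."""
    return [
        node["name"]
        for node in nodes
        if node["role"] == "master" and now - node["last_update"] > max_age
    ]


# Made-up sample data for illustration only.
now = datetime(2019, 2, 6, 16, 15, tzinfo=timezone.utc)
nodes = [
    {"name": "master-0", "role": "master", "last_update": now - timedelta(seconds=30)},
    {"name": "master-1", "role": "master", "last_update": now - timedelta(minutes=3)},
    {"name": "worker-0", "role": "worker", "last_update": now - timedelta(minutes=3)},
]
print(stale_masters(nodes, now))  # ['master-1']
```

Whatever recovery procedure gets triggered for the returned nodes is out of scope for this sketch.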

Comment 5 Jan Chaloupka 2019-02-06 16:17:29 UTC
> The cloud provider code removes Nodes that correspond to Terminated instances in AWS, but not Shutdown.

We don't check whether the corresponding AWS instances are running. The actuator's reconciling loop reacts to machine object changes, so unless a machine object is updated, we will not check the instance state until all machine objects are re-listed, which happens every 10m by default, I think.

Comment 6 Maciej Szulik 2019-02-06 16:21:58 UTC
> We can't do anything until a node goes Unready. Before that, the node is in Ready state and considered healthy. IINM, node controller switches node status to NotReady if it does not get any new node status in 5 minutes.

> Nodes should go NotReady after 40s of not reporting status to the apiserver.

Based on what I've seen the node went Not Ready after 5 mins.

Comment 7 Seth Jennings 2019-02-06 16:30:20 UTC
>> Nodes should go NotReady after 40s of not reporting status to the apiserver.

> Based on what I've seen the node went Not Ready after 5 mins.

This is determined by the node-monitor-grace-period on the kube-controller-manager (default 40s):
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

afaik, we do not change this default.
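
For comparing the documented default with what was observed, a small helper like the one below may be handy. It is only a sketch: `duration_seconds` is a hypothetical name, and it handles just the simple integer `h`/`m`/`s` forms such as `40s` or `5m0s`, not every duration string the flag accepts.

```python
import re

# Matches simple durations such as "40s", "5m", "5m0s", "1h30m".
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def duration_seconds(text: str) -> int:
    """Convert a simple h/m/s duration string into whole seconds."""
    match = _DURATION.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


default = duration_seconds("40s")   # documented default
observed = duration_seconds("5m")   # delay reported earlier in this bug
print(observed - default)           # extra seconds before NotReady: 260
```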

Comment 9 Alberto 2019-03-29 14:37:54 UTC
If you delete a machine, the machine controller will get rid of the backing node.
If a node goes unready (e.g. you delete the cloud instance) for 5 min, the nodelifecycle controller will garbage collect the node object - We set the node-monitor-grace-period to 5 min: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/v3.11.0/kube-controller-manager/defaultconfig.yaml#L29

Comment 10 Alberto 2019-04-04 09:22:39 UTC
This is behaving as expected:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L460-L464

nodeMonitorPeriod (default 5s) defines "how often does Controller check node health signal posted from kubelet. This value should be lower than nodeMonitorGracePeriod."

Then node-monitor-grace-period ("Amount of time which we allow starting Node to be unresponsive before marking it unhealthy.") https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L887 is set to 5 min https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/v3.11.0/kube-controller-manager/defaultconfig.yaml#L29

This, if anything, needs to be configured via the kube-controller-manager flags.
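
As a back-of-the-envelope check of how the two settings interact, the sketch below estimates a rough upper bound on the time until a silent node is marked unhealthy. This is an assumption-laden simplification (grace period plus one monitor interval, ignoring processing delays), and `worst_case_detection_seconds` is a hypothetical helper, not controller code.

```python
def worst_case_detection_seconds(monitor_period_s: int, grace_period_s: int) -> int:
    """Rough upper bound: the grace period must elapse, then the next periodic check fires."""
    if monitor_period_s >= grace_period_s:
        # Mirrors the quoted guidance that the period should be lower than the grace period.
        raise ValueError("nodeMonitorPeriod should be lower than nodeMonitorGracePeriod")
    return grace_period_s + monitor_period_s


print(worst_case_detection_seconds(5, 40))    # upstream default grace period: 45
print(worst_case_detection_seconds(5, 300))   # 5 min grace period as set here: 305
```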

Comment 11 Seth Jennings 2019-04-05 14:09:11 UTC
This is a problem. We have changed the defaults without reason.

Comment 14 Xingxing Xia 2019-04-17 09:43:29 UTC
Verified in an environment with payload 4.0.0-0.nightly-2019-04-10-182914. About 40s (the default) after powering off a master node, the master status is shown as NotReady. About 40s after powering off a worker, the worker is gone from `oc get no`.

Comment 15 Xingxing Xia 2019-04-17 10:03:53 UTC
Found another issue:
Though the powered-off node (ip-172-31-167-112.us-east-2.compute.internal) is shown NotReady:
NAME                                           STATUS     ROLES    AGE   VERSION
ip-172-31-129-159.us-east-2.compute.internal   Ready      worker   94m   v1.12.4+509916ce1
ip-172-31-135-68.us-east-2.compute.internal    Ready      master   99m   v1.12.4+509916ce1
ip-172-31-151-225.us-east-2.compute.internal   Ready      worker   94m   v1.12.4+509916ce1
ip-172-31-154-9.us-east-2.compute.internal     Ready      master   99m   v1.12.4+509916ce1
ip-172-31-167-112.us-east-2.compute.internal   NotReady   master   99m   v1.12.4+509916ce1
ip-172-31-167-90.us-east-2.compute.internal    Ready      master   16m   v1.12.4+509916ce1

The pod on it is still shown as Running long after the node has been powered off, e.g.:
oc get po -o wide -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
apiserver-bmgcw   1/1     Running   0          92m   10.130.0.31   ip-172-31-167-112.us-east-2.compute.internal   <none>           <none>
apiserver-g4f4z   1/1     Running   0          91m   10.128.0.29   ip-172-31-154-9.us-east-2.compute.internal     <none>           <none>
apiserver-hv75d   1/1     Running   0          92m   10.129.0.26   ip-172-31-135-68.us-east-2.compute.internal    <none>           <none>
apiserver-vfn8t   1/1     Running   0          18m   10.129.2.4    ip-172-31-167-90.us-east-2.compute.internal    <none>           <none>

oc rsh -n openshift-apiserver apiserver-bmgcw
Error from server: error dialing backend: dial tcp 172.31.167.112:10250: i/o timeout
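
To spot this condition across a whole namespace, the two listings can be cross-checked. The sketch below parses table output like the above by splitting on whitespace, which is an assumption that breaks if a cell is empty or contains spaces; parsing `-o json` output would be more robust. The function names are hypothetical and the sample tables are trimmed from the output in this comment.

```python
def not_ready_nodes(nodes_table: str) -> set:
    """Names of nodes whose STATUS column does not include Ready."""
    header, *rows = nodes_table.strip().splitlines()
    status = header.split().index("STATUS")
    return {
        row.split()[0]
        for row in rows
        if "Ready" not in row.split()[status].split(",")
    }


def running_pods_on(pods_table: str, nodes: set) -> list:
    """(pod, node) pairs for pods reported Running on any of the given nodes."""
    header, *rows = pods_table.strip().splitlines()
    columns = header.split()
    status, node = columns.index("STATUS"), columns.index("NODE")
    result = []
    for row in rows:
        fields = row.split()
        if fields[status] == "Running" and fields[node] in nodes:
            result.append((fields[0], fields[node]))
    return result


NODES = """
NAME                                           STATUS     ROLES    AGE   VERSION
ip-172-31-154-9.us-east-2.compute.internal     Ready      master   99m   v1.12.4+509916ce1
ip-172-31-167-112.us-east-2.compute.internal   NotReady   master   99m   v1.12.4+509916ce1
"""

PODS = """
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
apiserver-bmgcw   1/1     Running   0          92m   10.130.0.31   ip-172-31-167-112.us-east-2.compute.internal   <none>           <none>
apiserver-g4f4z   1/1     Running   0          91m   10.128.0.29   ip-172-31-154-9.us-east-2.compute.internal     <none>           <none>
"""

print(running_pods_on(PODS, not_ready_nodes(NODES)))
```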

Comment 19 errata-xmlrpc 2019-06-04 10:42:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

