I've done some more testing; the current situation looks like this:
1. T+0:00: Shut down one of the masters (I ssh-ed into it and ran `shutdown -h now`)
2. T+0:15: The AWS web console notices the machine is going down
-> ~5 minute wait <- according to Clayton that's too long!
3. T+5:00: The node becomes NotReady in `oc get nodes`
4. T+7:30: A new node becomes Ready
Moving to cloud team.
Seth, can you also have a look at this one?
Nodes should go NotReady after 40s of not reporting status to the apiserver.
All pods are evicted from the node after 5m of not reporting status to the apiserver. This is so the pods can be started on other nodes.
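The two timeouts above can be sketched roughly like this (a simplified model with hypothetical names, not the actual controller code):

```python
from datetime import datetime, timedelta

# Thresholds matching the defaults described above (hypothetical constants).
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)  # NotReady after 40s of silence
POD_EVICTION_TIMEOUT = timedelta(minutes=5)        # pods evicted after 5m of silence

def node_state(last_heartbeat: datetime, now: datetime) -> str:
    """Classify a node from its last reported status, mimicking the
    nodelifecycle controller's behavior at a very high level."""
    silence = now - last_heartbeat
    if silence < NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if silence < POD_EVICTION_TIMEOUT:
        return "NotReady"
    return "NotReady+evicting"  # pods get rescheduled onto other nodes

start = datetime(2019, 4, 10, 12, 0, 0)
print(node_state(start, start + timedelta(seconds=30)))  # Ready
print(node_state(start, start + timedelta(seconds=90)))  # NotReady
print(node_state(start, start + timedelta(minutes=6)))   # NotReady+evicting
```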
The cloud provider code removes Nodes that correspond to Terminated instances in AWS, but not Shutdown.
I have no idea what the machine API does in response to Nodes that are not Ready.
> 4. T 5.00s: Node becomes NotReady in oc get nodes
We can't do anything until a node goes Unready. Before that, the node is in the Ready state and considered healthy. IINM, the node controller switches a node's status to NotReady if it does not receive any new node status for 5 minutes. We might set a special timeout for master nodes (or based on some label) if that makes sense. Otherwise, I don't see a way to fix this on the cluster API side.
Or, we might check the node status of master nodes, and if the last update timestamp is older than e.g. 2 minutes, trigger some recovery procedure that tries to decide whether the master node is compromised.
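A minimal sketch of that proposed staleness check, assuming the hypothetical 2-minute threshold mentioned above (names are invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical threshold from the proposal above: treat a master as suspect
# if its last status update is older than 2 minutes.
MASTER_STALENESS_THRESHOLD = timedelta(minutes=2)

def master_needs_recovery(last_update: datetime, now: datetime) -> bool:
    """Return True if a master node's status is stale enough that some
    recovery procedure should investigate whether it is compromised."""
    return now - last_update > MASTER_STALENESS_THRESHOLD

now = datetime(2019, 4, 10, 12, 10, 0)
print(master_needs_recovery(now - timedelta(minutes=1), now))  # False
print(master_needs_recovery(now - timedelta(minutes=3), now))  # True
```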
> The cloud provider code removes Nodes that correspond to Terminated instances in AWS, but not Shutdown.
We don't check whether the corresponding AWS instance is still running. The actuator's reconciling loop reacts to machine object changes, so unless a machine object is updated, we will not check the instance state until all machine objects are re-listed, which happens every 10m by default, I think.
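Roughly, that reconcile model looks like this (a simplified Python sketch with hypothetical names; the real actuator is Go code driven by the controller machinery):

```python
import time

# Hypothetical resync interval matching the "10m by default" mentioned above.
RESYNC_PERIOD = 10 * 60  # seconds

class MachineController:
    """Sketch: reconcile runs on machine-object changes, or when the
    periodic resync re-lists every machine object."""

    def __init__(self):
        self.last_relist = time.monotonic()

    def on_machine_event(self, machine):
        # Change-driven path: instance state is only checked here...
        self.reconcile(machine)

    def maybe_relist(self, machines):
        # ...or when the periodic resync re-lists all machine objects.
        if time.monotonic() - self.last_relist >= RESYNC_PERIOD:
            for m in machines:
                self.reconcile(m)
            self.last_relist = time.monotonic()

    def reconcile(self, machine):
        # Placeholder for checking the backing cloud instance's state.
        machine["reconciled"] = True
```

A shut-down instance produces no machine-object change, so nothing triggers `on_machine_event` for it; only the next full relist notices it.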
> We can't do anything until a node goes Unready. Before that, the node is in Ready state and considered healthy. IINM, node controller switches node status to NotReady if it does not get any new node status in 5 minutes.
> Nodes should go NotReady after 40s of not reporting status to the apiserver.
Based on what I've seen, the node went NotReady after 5 minutes.
>> Nodes should go NotReady after 40s of not reporting status to the apiserver.
> Based on what I've seen, the node went NotReady after 5 minutes.
This is determined by the node-monitor-grace-period on the kube-controller-manager (default 40s).
AFAIK, we do not change this default.
If you delete a machine, the machine controller will get rid of the backing node.
If a node goes unready (e.g. you delete the cloud instance) for 5 min, the nodelifecycle controller will garbage collect the node object - we set the node-monitor-grace-period to 5 min: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/v3.11.0/kube-controller-manager/defaultconfig.yaml#L29
This is behaving as expected:
nodeMonitorPeriod (default 5s) defines "how often does Controller check node health signal posted from kubelet. This value should be lower than nodeMonitorGracePeriod."
Then node-monitor-grace-period ("Amount of time which we allow starting Node to be unresponsive before marking it unhealthy.") https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L887 is set to 5 min: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/v3.11.0/kube-controller-manager/defaultconfig.yaml#L29
If anything, this needs to be configured via the kube-controller-manager flags.
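For reference, the linked defaultconfig.yaml sets the grace period via an extended argument; a rough sketch of the relevant stanza (the exact surrounding keys may differ):

```yaml
# Sketch of the relevant part of the linked defaultconfig.yaml
extendedArguments:
  node-monitor-grace-period:
  - "5m"
```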
This is a problem. We have changed the defaults without reason.
Verified in an env with payload 4.0.0-0.nightly-2019-04-10-182914. After powering off a master node for about 40s (the default), the master's status shows NotReady. After powering off a worker for about 40s, the worker is gone from `oc get no`.
Found another issue:
Though the powered-off node (ip-172-31-167-112.us-east-2.compute.internal) is shown NotReady:
NAME                                           STATUS     ROLES    AGE   VERSION
ip-172-31-129-159.us-east-2.compute.internal   Ready      worker   94m   v1.12.4+509916ce1
ip-172-31-135-68.us-east-2.compute.internal    Ready      master   99m   v1.12.4+509916ce1
ip-172-31-151-225.us-east-2.compute.internal   Ready      worker   94m   v1.12.4+509916ce1
ip-172-31-154-9.us-east-2.compute.internal     Ready      master   99m   v1.12.4+509916ce1
ip-172-31-167-112.us-east-2.compute.internal   NotReady   master   99m   v1.12.4+509916ce1
ip-172-31-167-90.us-east-2.compute.internal    Ready      master   16m   v1.12.4+509916ce1
The pod on it is still Running long after the node was powered off, e.g.:
oc get po -o wide -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
apiserver-bmgcw   1/1     Running   0          92m   10.130.0.31   ip-172-31-167-112.us-east-2.compute.internal   <none>           <none>
apiserver-g4f4z   1/1     Running   0          91m   10.128.0.29   ip-172-31-154-9.us-east-2.compute.internal     <none>           <none>
apiserver-hv75d   1/1     Running   0          92m   10.129.0.26   ip-172-31-135-68.us-east-2.compute.internal    <none>           <none>
apiserver-vfn8t   1/1     Running   0          18m   10.129.2.4    ip-172-31-167-90.us-east-2.compute.internal    <none>           <none>
oc rsh -n openshift-apiserver apiserver-bmgcw
Error from server: error dialing backend: dial tcp 172.31.167.112:10250: i/o timeout
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.