Bug 1699470 - After shutting down an OOTB 4.1 AWS cluster overnight, all nodes NotReady after restart next day
Keywords:
Status: CLOSED DUPLICATE of bug 1694079
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-12 19:29 UTC by Mike Fiedler
Modified: 2019-04-23 02:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-18 20:17:27 UTC
Target Upstream Version:


Attachments
journal/kubelet log + kube container logs (891.50 KB, application/gzip)
2019-04-12 19:29 UTC, Mike Fiedler

Description Mike Fiedler 2019-04-12 19:29:13 UTC
Created attachment 1554863 [details]
journal/kubelet log + kube container logs

Description of problem:

1. Install a standard 3-master, 3-worker 4.1 cluster with openshift-install
2. Verify the cluster is healthy (all nodes Ready, all pods Running)
3. Go to the AWS console and stop all the instances
4. Next day, start the instances

All nodes in the cluster are NotReady and are fast looping the following errors in the kubelet journal logs:

Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.308917    1173 kubelet_node_status.go:92] Unable to register node "ip-10-0-171-39.us-east-2.compute.internal" with API server: nodes is forbidden: User "system:anonymous" cannot create resource "nodes" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.310212    1173 kubelet.go:2243] node "ip-10-0-171-39.us-east-2.compute.internal" not found
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.325392    1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:442: Failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.326804    1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.328056    1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-171-39.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope

crictl on a master shows only the static etcd and kube-* containers running; no OpenShift master containers.

The nodes never become Ready.

Version-Release number of selected component (if applicable): 4.0.0-0.nightly-2019-04-10-182914


How reproducible: Always

Attaching the full journal + master pod logs from one of the masters. See the journal for the errors above.

Comment 1 Seth Jennings 2019-04-15 15:03:19 UTC
This happens when the kubelet is offline and unable to rotate its client cert:

        Issuer: OU = openshift, CN = kubelet-signer
        Validity
            Not Before: Apr 15 13:27:00 2019 GMT
            Not After : Apr 16 13:15:02 2019 GMT

The kubelet client cert (/var/lib/kubelet/pki/kubelet-client-current.pem) is valid for 24h. The kubelet attempts to rotate it when only 20% of the validity duration remains. If the kubelet is offline for this entire window, the client cert can expire, leaving the kubelet with no way to get a new one.
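The validity window above can be checked with openssl. A minimal sketch; a throwaway self-signed cert is generated here as a stand-in for /var/lib/kubelet/pki/kubelet-client-current.pem, which requires access to an affected node:

```shell
# Generate a stand-in cert (on a real node, point openssl at
# /var/lib/kubelet/pki/kubelet-client-current.pem instead).
tmpdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$tmpdir/key.pem" -out "$tmpdir/cert.pem" \
    -days 1 -subj "/O=system:nodes/CN=system:node:example" 2>/dev/null

# Show the Not Before / Not After window, as quoted in the comment above:
dates=$(openssl x509 -in "$tmpdir/cert.pem" -noout -dates)
echo "$dates"

# -checkend exits non-zero if the cert expires within N seconds.
# 17280s (86400/5) is the last 20% of a 24h validity, i.e. the window
# in which the kubelet would be attempting rotation.
if openssl x509 -in "$tmpdir/cert.pem" -noout -checkend $((86400 / 5)); then
    status="outside rotation window"
else
    status="inside rotation window"
fi
echo "$status"
rm -rf "$tmpdir"
```

A cert past its notAfter date leaves the kubelet authenticating as system:anonymous, which matches the forbidden errors in the attached journal.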

Comment 2 Seth Jennings 2019-04-18 18:18:18 UTC
Note to self: validity duration for this signer is set here
https://github.com/openshift/installer/blob/master/pkg/asset/tls/kubelet.go#L27

Comment 3 Seth Jennings 2019-04-18 20:17:27 UTC
This is not something we support but something you can get away with on a limited basis.

The first client cert the kubelet gets is good for 24h.

If you wait 20-24h for the kubelet to rotate its client cert for the first time, you should get a cert that is good for 30 days (maybe 60?). If you shut down at this point, you can get away with it for a while.
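The arithmetic behind those windows is simple to sketch (integer hours; the 24h initial validity and 20% rotation threshold are the figures from comment 1):

```shell
# Rough integer arithmetic for the rotation timing described above (hours).
initial_validity=24    # first client cert validity, per the signer above
rotation_fraction=20   # kubelet rotates when <=20% of validity remains

# Rotation attempts start roughly this many hours after issuance:
rotate_after=$(( initial_validity * (100 - rotation_fraction) / 100 ))
echo "rotation attempts start around hour ${rotate_after}"

# A node that goes offline before then and stays off past expiry never
# rotates; the at-risk tail of the validity period is:
window=$(( initial_validity - rotate_after ))
echo "at-risk window: last ~${window}h before expiry"
```

So a cluster shut down during roughly the last fifth of the initial 24h validity, and left off past expiry, comes back with unusable kubelet certs.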

I am going to dup this to https://bugzilla.redhat.com/show_bug.cgi?id=1694079, which is tracking the development of a tool that can re-bootstrap kubelets when their certs expire.

*** This bug has been marked as a duplicate of bug 1694079 ***

Comment 4 Xingxing Xia 2019-04-19 05:57:53 UTC
(In reply to Mike Fiedler from comment #0)
> Description of problem:
> 
> 1. Install a standard 3 master 3 worker 4.1 cluster with openshift-install
> 2. Verify the cluster is healthy (all nodes Ready, all pods Running)
> 3. Go to the AWS console and stop all the instances

FYI, I also hit this bug when I checked bug 1701099. I only powered off _one_ master node (ip-172-31-140-39.ca-central-1.compute.internal), by ssh'ing to the master and executing `sudo shutdown -h now` on it. After the cluster ran for about 23+ hours, the issue was hit. I monitored node status with `while true; do date; oc get no; echo =============;  sleep 10m; done`. Below is the point at which the other nodes began changing to NotReady too.

Fri Apr 19 10:03:06 CST 2019
NAME                                              STATUS     ROLES    AGE   VERSION
ip-172-31-133-165.ca-central-1.compute.internal   Ready      master   14h   v1.13.4+d4ce02c1d  <-- the new master node after I powered off one.
ip-172-31-140-39.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d  <-- the powered-off master node
ip-172-31-141-14.ca-central-1.compute.internal    Ready      master   23h   v1.13.4+d4ce02c1d
ip-172-31-142-155.ca-central-1.compute.internal   Ready      worker   23h   v1.13.4+d4ce02c1d
ip-172-31-148-74.ca-central-1.compute.internal    Ready      master   23h   v1.13.4+d4ce02c1d
ip-172-31-152-232.ca-central-1.compute.internal   Ready      worker   23h   v1.13.4+d4ce02c1d
=============
Fri Apr 19 10:13:09 CST 2019
NAME                                              STATUS     ROLES    AGE   VERSION
ip-172-31-133-165.ca-central-1.compute.internal   NotReady   master   14h   v1.13.4+d4ce02c1d
ip-172-31-140-39.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-141-14.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-142-155.ca-central-1.compute.internal   NotReady   worker   23h   v1.13.4+d4ce02c1d
ip-172-31-148-74.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-152-232.ca-central-1.compute.internal   NotReady   worker   23h   v1.13.4+d4ce02c1d

> 4. Next day, start the instances

Comment 5 Seth Jennings 2019-04-23 02:20:59 UTC
xxia, this bug is closed and the issue you describe is different from the one originally reported.

Please open a new bug and attach the kubelet logs from one of the nodes that was not shut down, along with the output of `oc get csr`.

