Created attachment 1554863 [details]
journal/kubelet log + kube container logs

Description of problem:

1. Install a standard 3 master 3 worker 4.1 cluster with openshift-install
2. Verify the cluster is healthy (all nodes Ready, all pods Running)
3. Go to the AWS console and stop all the instances
4. Next day, start the instances

All nodes in the cluster are NotReady and the kubelet journal logs are fast-looping the following errors:

Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.308917 1173 kubelet_node_status.go:92] Unable to register node "ip-10-0-171-39.us-east-2.compute.internal" with API server: nodes is forbidden: User "system:anonymous" cannot create resource "nodes" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.310212 1173 kubelet.go:2243] node "ip-10-0-171-39.us-east-2.compute.internal" not found
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.325392 1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:442: Failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.326804 1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube[1173]: E0412 18:37:30.328056 1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-171-39.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope

crictl on a master shows only the static etcd and kube-* containers running - no openshift master containers. The nodes never become Ready.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-10-182914

How reproducible:
Always

Attaching the full journal + master pod logs from one of the masters. See the journal for the errors above.
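For anyone reproducing this, the checks behind the observations above can be gathered roughly as follows. This is an illustrative sketch, not a verbatim transcript from this report; it assumes ssh access to a master with the default RHCOS `core` user.

  # From a working client: confirm the NotReady state.
  oc get nodes

  # On a master: list running containers. In this state only the static
  # etcd and kube-* containers show up; no openshift master containers.
  sudo crictl ps

  # Follow the kubelet journal to see the "system:anonymous" errors loop.
  sudo journalctl -u kubelet -f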
This happens when the kubelet is offline and is not able to rotate its client cert.

    Issuer: OU = openshift, CN = kubelet-signer
    Validity
        Not Before: Apr 15 13:27:00 2019 GMT
        Not After : Apr 16 13:15:02 2019 GMT

The kubelet client cert (/var/lib/kubelet/pki/kubelet-client-current.pem) is valid for 24h. The kubelet attempts to rotate it when only 20% of the validity duration remains. If the kubelet is offline during this window, the client cert can expire, leaving the kubelet with no way to get a new client cert.
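The validity window above can be read straight off the on-disk cert with openssl. A minimal check, run as root on the affected node and assuming the default path from this report:

  # Print issuer and validity window of the kubelet client cert.
  sudo openssl x509 \
    -in /var/lib/kubelet/pki/kubelet-client-current.pem \
    -noout -issuer -dates

If "notAfter" is already in the past, the kubelet can no longer authenticate and you will see the "system:anonymous" errors from comment #0.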
Note to self: validity duration for this signer is set here https://github.com/openshift/installer/blob/master/pkg/asset/tls/kubelet.go#L27
This is not something we support, but something you can get away with on a limited basis. The first client cert the kubelet gets is good for 24h. If you wait 20-24h for the kubelet to rotate its client cert for the first time, you should get a cert that is good for 30 days (maybe 60?). If you shut down at this point, you can get away with it for a while.

I am going to dup this to https://bugzilla.redhat.com/show_bug.cgi?id=1694079, which is tracking the development of a tool that can re-bootstrap kubelets when their certs expire.

*** This bug has been marked as a duplicate of bug 1694079 ***
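Until that tool exists, the usual manual path is to look for the kubelet's pending CSRs and approve them. This is only a sketch, and it assumes at least one credential still works against the API server and that the kubelet can still submit a CSR (i.e. its bootstrap credential has not also expired - the gap the tracked tool is meant to close):

  # List certificate signing requests; kubelets needing new client certs
  # show up as Pending CSRs once they can re-bootstrap.
  oc get csr

  # Approve pending CSRs so the kubelet can fetch a fresh client cert.
  # (Bulk approval shown for brevity; approve individual CSRs by name if preferred.)
  oc get csr -o name | xargs oc adm certificate approve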
(In reply to Mike Fiedler from comment #0)
> Description of problem:
>
> 1. Install a standard 3 master 3 worker 4.1 cluster with openshift-install
> 2. Verify the cluster is healthy (all nodes Ready, all pods Running)
> 3. Go to the AWS console and stop all the instances

FYI, I also hit this bug when I checked bug 1701099. I only powered off _one_ master node (ip-172-31-140-39.ca-central-1.compute.internal), by ssh-ing to the master and executing `sudo shutdown -h now` on it. After the cluster had run for about 23+ hours, the issue was hit.

I monitored the node status with `while true; do date; oc get no; echo =============; sleep 10m; done`. Below is the point at which the other nodes began changing to NotReady too.

Fri Apr 19 10:03:06 CST 2019
NAME                                              STATUS     ROLES    AGE   VERSION
ip-172-31-133-165.ca-central-1.compute.internal   Ready      master   14h   v1.13.4+d4ce02c1d   <-- the new master node after I powered off one
ip-172-31-140-39.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d   <-- the powered-off master node
ip-172-31-141-14.ca-central-1.compute.internal    Ready      master   23h   v1.13.4+d4ce02c1d
ip-172-31-142-155.ca-central-1.compute.internal   Ready      worker   23h   v1.13.4+d4ce02c1d
ip-172-31-148-74.ca-central-1.compute.internal    Ready      master   23h   v1.13.4+d4ce02c1d
ip-172-31-152-232.ca-central-1.compute.internal   Ready      worker   23h   v1.13.4+d4ce02c1d
=============
Fri Apr 19 10:13:09 CST 2019
NAME                                              STATUS     ROLES    AGE   VERSION
ip-172-31-133-165.ca-central-1.compute.internal   NotReady   master   14h   v1.13.4+d4ce02c1d
ip-172-31-140-39.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-141-14.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-142-155.ca-central-1.compute.internal   NotReady   worker   23h   v1.13.4+d4ce02c1d
ip-172-31-148-74.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-152-232.ca-central-1.compute.internal   NotReady   worker   23h   v1.13.4+d4ce02c1d

> 4. Next day, start the instances
xxia, this bug is closed and the issue you describe is different from the one originally reported. Please open a new bug and attach the kubelet logs from one of the nodes that was not shut down, along with the output of `oc get csr`.
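For collecting that data, something along these lines works; `<node>` is a placeholder hostname, and the ssh user is assumed to be the default RHCOS `core` user:

  # On a node that was NOT shut down, capture the full kubelet journal.
  ssh core@<node> 'sudo journalctl -u kubelet --no-pager' > kubelet-<node>.log

  # From any host with a working kubeconfig, capture the CSR list.
  oc get csr > csr-list.txt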