Created attachment 1554863 [details]
journal/kubelet log + kube container logs
Description of problem:
1. Install a standard 3 master 3 worker 4.1 cluster with openshift-install
2. Verify the cluster is healthy (all nodes Ready, all pods Running)
3. Go to the AWS console and stop all the instances
4. Next day, start the instances
All nodes in the cluster are NotReady and are rapidly looping the following errors in the kubelet journal logs:
Apr 12 18:37:30 ip-10-0-171-39 hyperkube: E0412 18:37:30.308917 1173 kubelet_node_status.go:92] Unable to register node "ip-10-0-171-39.us-east-2.compute.internal" with API server: nodes is forbidden: User "system:anonymous" cannot create resource "nodes" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube: E0412 18:37:30.310212 1173 kubelet.go:2243] node "ip-10-0-171-39.us-east-2.compute.internal" not found
Apr 12 18:37:30 ip-10-0-171-39 hyperkube: E0412 18:37:30.325392 1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:442: Failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the
Apr 12 18:37:30 ip-10-0-171-39 hyperkube: E0412 18:37:30.326804 1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope
Apr 12 18:37:30 ip-10-0-171-39 hyperkube: E0412 18:37:30.328056 1173 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-171-39.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
crictl on a master shows only the static etcd and kube-* containers running - no openshift master containers
The nodes never become Ready.
Version-Release number of selected component (if applicable): 4.0.0-0.nightly-2019-04-10-182914
How reproducible: Always
Attaching full journal + master pod logs from one of the masters. See the journal for the errors above.
This happens when the kubelet is offline and is not able to rotate its client cert
Issuer: OU = openshift, CN = kubelet-signer
Not Before: Apr 15 13:27:00 2019 GMT
Not After : Apr 16 13:15:02 2019 GMT
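The validity window above can be inspected directly with openssl. A minimal sketch: it builds a throwaway one-day self-signed cert whose subject mirrors the signer shown here (it is not the cluster's real kubelet-signer CA), then runs the same inspection you would run against /var/lib/kubelet/pki/kubelet-client-current.pem on a node:

```shell
# Create a throwaway 1-day self-signed cert mimicking the signer's subject
# (illustration only; not the cluster's real kubelet-signer CA).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/kubelet-demo.key \
  -out /tmp/kubelet-demo.pem -days 1 \
  -subj "/OU=openshift/CN=kubelet-signer" 2>/dev/null

# Same check you would run against the real cert on a node:
#   openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -issuer -dates
openssl x509 -in /tmp/kubelet-demo.pem -noout -issuer -dates
```

The notBefore/notAfter pair printed for the real cert is what shows whether the kubelet shut down past its expiry.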
The kubelet client cert (/var/lib/kubelet/pki/kubelet-client-current.pem) is valid for 24h. The kubelet attempts to rotate it when only 20% of the validity duration remains. If the kubelet is offline during this window, the client cert can expire, leaving the kubelet with no way to get a new client cert.
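The numbers work out as follows (shell arithmetic, assuming the 24h validity and the 20%-remaining rotation threshold described here):

```shell
# With a 24h cert and rotation attempted once 20% of the validity remains,
# the kubelet only starts trying ~19h in, leaving a ~5h window before expiry.
VALIDITY_H=24
ROTATE_AFTER_H=$(( VALIDITY_H * 80 / 100 ))   # rotation attempts begin here
WINDOW_H=$(( VALIDITY_H - ROTATE_AFTER_H ))   # offline longer than this => expired cert
echo "rotation starts after ${ROTATE_AFTER_H}h; expiry window is ${WINDOW_H}h"
```

An overnight shutdown (12+ hours) therefore easily overshoots the window with the initial 24h cert.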
Note to self: validity duration for this signer is set here
This is not something we support, but it is something you can get away with on a limited basis.
The first client cert the kubelet gets is good for 24h.
If you wait 20-24h for the kubelet to rotate its client cert for the first time, you should get a cert that is good for 30 days (maybe 60?). If you shut down at this point, you can get away with it for a while.
I am going to dup this to https://bugzilla.redhat.com/show_bug.cgi?id=1694079, which is tracking the development of a tool that can re-bootstrap kubelets when their certs expire.
*** This bug has been marked as a duplicate of bug 1694079 ***
(In reply to Mike Fiedler from comment #0)
> Description of problem:
> 1. Install a standard 3 master 3 worker 4.1 cluster with openshift-install
> 2. Verify the cluster is healthy (all nodes Ready, all pods Running)
> 3. Go to the AWS console and stop all the instances
FYI, I also hit this bug when I checked bug 1701099. I powered off only _one_ master node (ip-172-31-140-39.ca-central-1.compute.internal), by ssh'ing to the master and executing `sudo shutdown -h now` on it. After the cluster had run for about 23+ hours, the issue was hit. I monitored the node status with `while true; do date; oc get no; echo =============; sleep 10m; done`. Below is the time point at which the other nodes began changing to NotReady too.
Fri Apr 19 10:03:06 CST 2019
NAME                                              STATUS     ROLES    AGE   VERSION
ip-172-31-133-165.ca-central-1.compute.internal   Ready      master   14h   v1.13.4+d4ce02c1d   <-- the new master node after I powered off one
ip-172-31-140-39.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d   <-- the powered-off master node
ip-172-31-141-14.ca-central-1.compute.internal    Ready      master   23h   v1.13.4+d4ce02c1d
ip-172-31-142-155.ca-central-1.compute.internal   Ready      worker   23h   v1.13.4+d4ce02c1d
ip-172-31-148-74.ca-central-1.compute.internal    Ready      master   23h   v1.13.4+d4ce02c1d
ip-172-31-152-232.ca-central-1.compute.internal   Ready      worker   23h   v1.13.4+d4ce02c1d
Fri Apr 19 10:13:09 CST 2019
NAME                                              STATUS     ROLES    AGE   VERSION
ip-172-31-133-165.ca-central-1.compute.internal   NotReady   master   14h   v1.13.4+d4ce02c1d
ip-172-31-140-39.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-141-14.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-142-155.ca-central-1.compute.internal   NotReady   worker   23h   v1.13.4+d4ce02c1d
ip-172-31-148-74.ca-central-1.compute.internal    NotReady   master   23h   v1.13.4+d4ce02c1d
ip-172-31-152-232.ca-central-1.compute.internal   NotReady   worker   23h   v1.13.4+d4ce02c1d
> 4. Next day, start the instances
xxia, this bug is closed, and the issue you describe is different from the one originally reported.
Please open a new bug and attach the kubelet logs from one of the nodes that was not shut down, along with the output of `oc get csr`.
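For anyone landing here with only a partial expiry (e.g. the single powered-off master above, while the rest of the control plane is still healthy), pending CSRs can be listed and approved with the standard `oc` workflow. A sketch, assuming cluster-admin credentials; this is not the re-bootstrap tool tracked in bug 1694079, and it will not help once every node's cert has expired:

```shell
# Show all CSRs and their state (Pending/Approved/Issued).
oc get csr

# Approve every pending CSR so the kubelet can obtain a fresh client cert.
# (Blind bulk approval; review the list first on anything but a test cluster.)
oc get csr -o name | xargs -r oc adm certificate approve
```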