Opening this bug for an issue observed after recovering from expired control plane certificates in https://bugzilla.redhat.com/show_bug.cgi?id=1711910#c26

After recovering from the expired control plane certificates, oc rsh and oc logs fail for all pods, even for pods running on master nodes, such as the console.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         6h54m   Cluster version is 4.1.0-0.nightly-2019-05-24-040103

After manually approving the CSRs, the master nodes change to Ready. I also followed step 11 of [1] to recover the kubelet on the worker nodes and manually approved the workers' CSRs.

$ oc get csr
NAME        AGE     REQUESTOR                                                                    CONDITION
csr-4sz5l   9m43s   system:node:ip-10-0-146-94.us-east-2.compute.internal                        Approved,Issued
csr-64ck7   50s     system:node:ip-10-0-128-139.us-east-2.compute.internal                       Approved,Issued
csr-7cdqs   19s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-82xn2   11m     system:node:ip-10-0-135-155.us-east-2.compute.internal                       Approved,Issued
csr-bsmvq   10m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-npqfc   11m     system:node:ip-10-0-162-135.us-east-2.compute.internal                       Approved,Issued
csr-tkqfb   13m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-v9gf8   94s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-xjwmj   12m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-139.us-east-2.compute.internal   Ready    worker   6h16m   v1.13.4+cb455d664
ip-10-0-135-155.us-east-2.compute.internal   Ready    master   6h20m   v1.13.4+cb455d664
ip-10-0-146-94.us-east-2.compute.internal    Ready    master   6h20m   v1.13.4+cb455d664
ip-10-0-155-72.us-east-2.compute.internal    Ready    worker   6h16m   v1.13.4+cb455d664
ip-10-0-162-135.us-east-2.compute.internal   Ready    master   6h21m   v1.13.4+cb455d664

The control plane recovery itself succeeded. However, rsh and pod logs still fail:

$ oc rsh console-5d888874d4-q92hn
error: unable to upgrade connection: Unauthorized

$ oc logs console-5d888874d4-q92hn
error: You must be logged in to the server (the server has asked for the client to provide credentials ( pods/log console-5d888874d4-q92hn))

I scaled up a new worker node with a machineset and scheduled a pod on it, but I see the same issue there as well:

$ oc get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
hello-node-64c578bdf8-qg6nm   1/1     Running   0          38s   10.128.2.5   ip-10-0-160-111.us-east-2.compute.internal   <none>           <none>

$ oc rsh hello-node-64c578bdf8-qg6nm
error: unable to upgrade connection: Unauthorized

$ oc logs hello-node-64c578bdf8-qg6nm
error: You must be logged in to the server (the server has asked for the client to provide credentials ( pods/log hello-node-64c578bdf8-qg6nm))

[1] http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-3-expired-certs.html
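For reference, the manual CSR approval above can also be done in bulk. A minimal sketch (note this approves every outstanding CSR without inspecting it, which is only reasonable in a disaster-recovery situation like this one):

$ oc get csr -o name | xargs oc adm certificate approve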
I have asked the question about logs on kubelet recovery here: https://docs.google.com/document/d/1-R1uZ_I1ZtA9BXl_Ugl3iH4cSoLnqk18P0TSNpWhRjo/edit?disco=AAAACtldDT8

What I had done before was:

$ oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt
# distribute it to the other masters and nodes
# restart the kubelet on all masters and nodes with: systemctl restart kubelet

That fixed logs for me the last time I tried it. Sending to the Pod team to confirm and update the kubelet recovery docs if they agree.
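To make the "distribute and restart" steps above concrete, here is a sketch using the node hostnames from the oc get nodes output above, assuming SSH access as the core user; adjust hosts and paths for your environment:

# Copy the extracted CA bundle to each node and restart its kubelet
# (hostnames below are this cluster's; substitute your own).
$ for node in ip-10-0-135-155 ip-10-0-146-94 ip-10-0-162-135 \
              ip-10-0-128-139 ip-10-0-155-72; do
    scp /etc/kubernetes/ca.crt core@${node}.us-east-2.compute.internal:/tmp/ca.crt
    ssh core@${node}.us-east-2.compute.internal \
      'sudo cp /tmp/ca.crt /etc/kubernetes/ca.crt && sudo systemctl restart kubelet'
  done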
We use a 10-year root-ca at /etc/kubernetes/ca.crt that does not rotate (a target for future work), but we should not need to be changing that file during DR. We need to figure out why root-ca is not on the signing chain for the cert the apiserver is using.
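One way to inspect the chain (a diagnostic sketch, not from the original report: it assumes the static bundle lives at /etc/kubernetes/ca.crt as stated above and reuses the configmap name from the earlier comment):

# List every certificate in the node's static root-ca bundle:
$ openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/ca.crt | openssl pkcs7 -print_certs -noout

# List the client CA bundle the operator currently publishes:
$ oc get configmap kube-apiserver-to-kubelet-client-ca \
    -n openshift-kube-apiserver-operator \
    --template='{{ index .data "ca-bundle.crt" }}' > /tmp/kubelet-client-ca.crt
$ openssl crl2pkcs7 -nocrl -certfile /tmp/kubelet-client-ca.crt | openssl pkcs7 -print_certs -noout

If the issuer of the apiserver's kubelet-client certificate appears in the published bundle but not in the on-disk one, the kubelet rejects the connection with Unauthorized, which matches the rsh/logs failures above.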
Tomas, sending this back your way to look at the signing chain.
Sorry for the ping pong on this BZ. Seth looked back over this and agreed that Tomas’ suggestion was correct, so moving back to Pod. Doc changes have been posted and are being tested by QE, so also moving to ON_QA. https://github.com/openshift/openshift-docs/pull/15090
We should close this bug: with the updated doc already live at [1], this is no longer an issue.

[1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html