Bug 1715454

Summary: After control plane expired cert recovery, not able to rsh or check pod logs
Product: OpenShift Container Platform
Reporter: Sunil Choudhary <schoudha>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED CURRENTRELEASE
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Priority: high
Version: 4.1.0
CC: ahoffer, aos-bugs, gblomqui, jokerman, mmccomas, sponnaga, tnozicka
Keywords: OSE41z_next
Target Release: 4.1.z
Hardware: All
OS: Linux
Whiteboard: 4.1.2
Doc Type: No Doc Update
Last Closed: 2019-06-06 14:25:24 UTC
Type: Bug
Bug Blocks: 1718956

Description Sunil Choudhary 2019-05-30 12:13:53 UTC
Opening this bug for an issue observed after recovering from expired control plane certificates; see https://bugzilla.redhat.com/show_bug.cgi?id=1711910#c26

After recovering from expired control plane certificates, I cannot rsh into pods or check their logs. This also affects pods running on master nodes, such as console.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         6h54m   Cluster version is 4.1.0-0.nightly-2019-05-24-040103

After I manually approved the CSRs, the master nodes changed to the Ready state. I also followed step 11 of [1] to recover the kubelet on the worker nodes and manually approved the CSRs for the workers.

$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-4sz5l   9m43s   system:node:ip-10-0-146-94.us-east-2.compute.internal                       Approved,Issued
csr-64ck7   50s     system:node:ip-10-0-128-139.us-east-2.compute.internal                      Approved,Issued
csr-7cdqs   19s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-82xn2   11m     system:node:ip-10-0-135-155.us-east-2.compute.internal                      Approved,Issued
csr-bsmvq   10m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-npqfc   11m     system:node:ip-10-0-162-135.us-east-2.compute.internal                      Approved,Issued
csr-tkqfb   13m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-v9gf8   94s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xjwmj   12m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-139.us-east-2.compute.internal   Ready    worker   6h16m   v1.13.4+cb455d664
ip-10-0-135-155.us-east-2.compute.internal   Ready    master   6h20m   v1.13.4+cb455d664
ip-10-0-146-94.us-east-2.compute.internal    Ready    master   6h20m   v1.13.4+cb455d664
ip-10-0-155-72.us-east-2.compute.internal    Ready    worker   6h16m   v1.13.4+cb455d664
ip-10-0-162-135.us-east-2.compute.internal   Ready    master   6h21m   v1.13.4+cb455d664

The cluster recovered successfully from the expired control plane certificates. However, I am still not able to rsh into pods or check pod logs.

$ oc rsh console-5d888874d4-q92hn
error: unable to upgrade connection: Unauthorized

$ oc logs console-5d888874d4-q92hn
error: You must be logged in to the server (the server has asked for the client to provide credentials ( pods/log console-5d888874d4-q92hn))

I scaled up a new worker node with a machineset and scheduled a pod on the newly added node, but I observe the same issue.

$ oc get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
hello-node-64c578bdf8-qg6nm   1/1     Running   0          38s   ip-10-0-160-111.us-east-2.compute.internal   <none>           <none>

$ oc rsh hello-node-64c578bdf8-qg6nm
error: unable to upgrade connection: Unauthorized

$ oc logs hello-node-64c578bdf8-qg6nm
error: You must be logged in to the server (the server has asked for the client to provide credentials ( pods/log hello-node-64c578bdf8-qg6nm))

[1] http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-3-expired-certs.html

Comment 1 Tomáš Nožička 2019-05-30 13:18:01 UTC
I have asked the question about logs on kubelet recovery here: https://docs.google.com/document/d/1-R1uZ_I1ZtA9BXl_Ugl3iH4cSoLnqk18P0TSNpWhRjo/edit?disco=AAAACtldDT8

What I have done before was:

oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt
# distribute it to other masters and nodes
# restart kubelet on all masters and nodes with
systemctl restart kubelet

That fixed logs for me the last time I tried it.

Sending to pod team to confirm and update kubelet recovery docs if they agree.
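
The "distribute it to other masters and nodes" step in comment 1 is left manual. A minimal sketch of how it might be scripted follows. It only prints the commands so they can be reviewed first. The `core` SSH user, the `/tmp/ca.crt` staging path, the use of scp/ssh, and the `plan_ca_refresh` helper name are all assumptions for illustration; none of them come from this bug report.

```shell
#!/bin/sh
# Sketch only. The "core" SSH user, the /tmp staging path, and the use of
# scp/ssh are assumptions; the bug report does not say how the file was copied.

CA_PATH=/etc/kubernetes/ca.crt   # destination path named in comment 1
SSH_USER=core                    # assumed login user on the nodes

# Print, without running them, the commands that would copy a CA bundle to
# each host and restart its kubelet. Usage: plan_ca_refresh BUNDLE HOST...
plan_ca_refresh() {
  bundle=$1
  shift
  for host in "$@"; do
    printf 'scp %s %s@%s:/tmp/ca.crt\n' "$bundle" "$SSH_USER" "$host"
    printf 'ssh %s@%s sudo cp /tmp/ca.crt %s\n' "$SSH_USER" "$host" "$CA_PATH"
    printf 'ssh %s@%s sudo systemctl restart kubelet\n' "$SSH_USER" "$host"
  done
}
```

For example, `plan_ca_refresh ./ca.crt ip-10-0-128-139.us-east-2.compute.internal` prints three commands for that node. Printing rather than executing keeps the sketch safe to run anywhere; the output can be checked and then run by hand once it looks right.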

Comment 2 Seth Jennings 2019-05-30 18:03:44 UTC
We use a 10y root-ca CA at /etc/kubernetes/ca.crt that does not rotate (a target for future work).

But we do not need to be changing that file in DR.  We need to figure out why root-ca is not on the signing chain for the cert the apiserver is using.

Comment 3 Greg Blomquist 2019-05-30 18:27:43 UTC
Tomas, sending this back your way to look at the signing chain.

Comment 4 Greg Blomquist 2019-05-31 01:28:26 UTC
Sorry for the ping pong on this BZ.

Seth looked back over this and agreed that Tomas’ suggestion was correct.

Moving back to Pod.

Also, the doc changes have been posted and are being tested by QE, so moving to ON_QA as well.


Comment 6 Sunil Choudhary 2019-06-06 04:00:34 UTC
I think we should close this bug: with the updated doc already live at [1], this is no longer an issue.

[1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html