Bug 1817997
| Summary: | CSR signer didn't recover | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-controller-manager | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED DUPLICATE | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | aos-bugs, maszulik, mfojtik, rpattath, scuppett, tnozicka, vareti, zyu |
| Target Milestone: | --- | | |
| Target Release: | 4.4.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1818420 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-04 14:51:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1818420, 1821680, 1827990 | | |
| Bug Blocks: | | | |
Description
Xingxing Xia
2020-03-27 11:53:44 UTC
(In reply to Xingxing Xia from comment #0)

> Steps to Reproduce:
> Yesterday at 10 am, launched a 4.4.0-0.nightly-2020-03-25-223508 cluster in AWS and did no other action, then at 7:15 pm stopped all nodes. Here is the screen record; note the shell prompt's timestamps of yesterday and today:

The content between the line above and the line below was entered twice while posting the bug and can be ignored :) Please just see the content after that.

> [xxia@laptop 2020-03-27 19:05:25 CST ~]$ vi tmp.txt4
> [xxia@laptop 2020-03-27 19:07:41 CST ~]$ cat tmp.txt4

The problem is that the CSR signer wasn't rotated. This is caused by these lines:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ef3599ba51489e5e891b1a3bc40e60a801edfd01/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L496-L514

When we have:

    useAfter  Mar 26 02:07:00
    now       Mar 27 10:39:12 (roughly)

we always hit the condition on line 499 and return a negative delay, which in turn leads to a hot loop in
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ef3599ba51489e5e891b1a3bc40e60a801edfd01/pkg/cmd/recoverycontroller/csrcontroller.go#L153-L162

We should:

a) write proper unit tests around the code in targetconfigcontroller.go that outline the cases we want to see happen
b) either add (useAfter.Sub(now) > 0) or check that Not After is before now

Today I prepared several DR envs, shut down for verification the next day. Tested in a 4.4.0-0.nightly-2020-04-13-113747 IPI AWS env:
Today restarted all nodes from the shutdown state after >24 hours (30 hours in fact), and got:
```
[xxia 2020-04-15 16:38:42 CST my]$ scripts/check_secret_expiry_within.sh 31days
Checking validity within 31days
2020-04-14T02:08:36Z 2020-05-14T02:08:37Z openshift-config-managed kube-controller-manager-client-cert-key
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-config-managed kube-scheduler-client-cert-key
2020-04-15T08:23:24Z 2020-05-15T08:23:25Z openshift-kube-apiserver-operator aggregator-client-signer
2020-04-15T08:23:24Z 2020-05-15T08:23:25Z openshift-kube-apiserver aggregator-client
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver external-loadbalancer-serving-certkey
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver internal-loadbalancer-serving-certkey
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver kubelet-client
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver kubelet-client-2
...
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver kubelet-client-6
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver localhost-serving-cert-certkey
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver service-network-serving-certkey
2020-04-15T08:24:45Z 2020-05-15T08:24:46Z openshift-kube-controller-manager-operator csr-signer
2020-04-14T02:08:36Z 2020-05-14T02:08:37Z openshift-kube-controller-manager kube-controller-manager-client-cert-key
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-scheduler kube-scheduler-client-cert-key
```
The certs are all renewed.
```
[xxia 2020-04-15 16:42:40 CST my]$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-2sckq   3m41s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
...
csr-vpvnj   19m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[xxia 2020-04-15 16:44:46 CST my]$ oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-2sckq approved
...
```
After approving the CSRs, the nodes were still NotReady:
```
[xxia 2020-04-15 16:59:05 CST my]$ oc get node -w
NAME                                              STATUS     ROLES    AGE   VERSION
ip-10-0-133-104.ap-northeast-1.compute.internal   NotReady   master   30h   v1.17.1
ip-10-0-142-46.ap-northeast-1.compute.internal    NotReady   worker   30h   v1.17.1
ip-10-0-145-242.ap-northeast-1.compute.internal   NotReady   master   30h   v1.17.1
ip-10-0-157-193.ap-northeast-1.compute.internal   NotReady   worker   30h   v1.17.1
ip-10-0-160-122.ap-northeast-1.compute.internal   NotReady   worker   30h   v1.17.1
ip-10-0-167-85.ap-northeast-1.compute.internal    NotReady   master   30h   v1.17.1
^C[xxia 2020-04-15 17:16:01 CST my]
```
Checked `oc get co` and `oc get po --all-namespaces`; no abnormality found, though.
I am looking into this. The QA cluster got pruned, but I've managed to reproduce it in my cluster.

I've traced this to kube-apiserver's client CA not being live-reloaded, which causes CSRs signed with the KCM csr-signer not to be trusted and results in 401 Unauthorized. The kubelet uses a CSR for its credentials and can't start. While testing it I've noticed KAS's accepted CAs do live-reload from client-ca/ca-bundle.crt, but the trust isn't affected, though. David referenced this pull, https://github.com/kubernetes/kubernetes/pull/83579, which should have made the client CA live-reloaded. I'll continue root-causing it tomorrow.

I've figured out why it is not reloading the certificates on the authentication path; I should have a fix tomorrow. This, along with https://bugzilla.redhat.com/show_bug.cgi?id=1825987, is required for the automatic cert rotation to fully work.

The following items are left:

- https://github.com/openshift/cluster-kube-controller-manager-operator/pull/401 - Sync new client cert-key on recovery for 4.4
- https://github.com/kubernetes/kubernetes/pull/90360 - Fix client-ca dynamic reload in apiserver

For the latter, we need to land it upstream first, backport it to the current origin master, and only then back to 4.4. We'll continue landing all of the above ASAP, but that won't make the 4.4.0 cut. Since the manual steps described in https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html#dr-recovering-expired-certs still work, we have a fallback solution in place for 4.4.0.

I am closing the CSR signer issue as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1821683, where the fix was tracked. The kube-apiserver client CA reloading is a different issue, and we are tracking it in its own BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1831045.

*** This bug has been marked as a duplicate of bug 1821683 ***