Bug 1817997
| Summary: | CSR signer didn't recover | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-controller-manager | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED DUPLICATE | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | aos-bugs, maszulik, mfojtik, rpattath, scuppett, tnozicka, vareti, zyu |
| Target Milestone: | --- | | |
| Target Release: | 4.4.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1818420 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-04 14:51:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1818420, 1821680, 1827990 | | |
| Bug Blocks: | | | |
Description
Xingxing Xia
2020-03-27 11:53:44 UTC
(In reply to Xingxing Xia from comment #0)

> Steps to Reproduce:
> Yesterday at 10 am, launched a 4.4.0-0.nightly-2020-03-25-223508 cluster in AWS and did no other action, then at 7:15 pm stopped all nodes. Here is the screen record; note the shell prompt's timestamps of yesterday and today:

The content between the line above and the line below was entered twice while posting the bug and can be ignored :) Please just see the content after that.

> [xxia@laptop 2020-03-27 19:05:25 CST ~]$ vi tmp.txt4
> [xxia@laptop 2020-03-27 19:07:41 CST ~]$ cat tmp.txt4

The problem is that the CSR signer wasn't rotated. This is caused by these lines:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ef3599ba51489e5e891b1a3bc40e60a801edfd01/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L496-L514

When we have:

    useAfter  Mar 26 02:07:00
    now       Mar 27 10:39:12 (roughly)

we always hit the condition on line 499 and return a negative delay, which in turn leads to a hot loop in
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ef3599ba51489e5e891b1a3bc40e60a801edfd01/pkg/cmd/recoverycontroller/csrcontroller.go#L153-L162

We should:

a) write proper unit tests around the code in targetconfigcontroller.go that outline the cases we want to see happen
b) either add (useAfter.Sub(now) > 0) or check that Not After is before now

Today I prepared several DR envs, shut down for verification the next day. Tested in a 4.4.0-0.nightly-2020-04-13-113747 IPI AWS env:
Today restarted all nodes from the shutdown state after >24 hours (30 hours in fact), and got:
```
[xxia 2020-04-15 16:38:42 CST my]$ scripts/check_secret_expiry_within.sh 31days
Checking validity within 31days
2020-04-14T02:08:36Z 2020-05-14T02:08:37Z openshift-config-managed kube-controller-manager-client-cert-key
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-config-managed kube-scheduler-client-cert-key
2020-04-15T08:23:24Z 2020-05-15T08:23:25Z openshift-kube-apiserver-operator aggregator-client-signer
2020-04-15T08:23:24Z 2020-05-15T08:23:25Z openshift-kube-apiserver aggregator-client
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver external-loadbalancer-serving-certkey
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver internal-loadbalancer-serving-certkey
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver kubelet-client
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver kubelet-client-2
...
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver kubelet-client-6
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver localhost-serving-cert-certkey
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver service-network-serving-certkey
2020-04-15T08:24:45Z 2020-05-15T08:24:46Z openshift-kube-controller-manager-operator csr-signer
2020-04-14T02:08:36Z 2020-05-14T02:08:37Z openshift-kube-controller-manager kube-controller-manager-client-cert-key
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-scheduler kube-scheduler-client-cert-key
```
The certs are all renewed.
```
[xxia 2020-04-15 16:42:40 CST my]$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-2sckq   3m41s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
...
csr-vpvnj   19m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[xxia 2020-04-15 16:44:46 CST my]$ oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-2sckq approved
...
```
After approving the CSRs, the nodes were still NotReady:
```
[xxia 2020-04-15 16:59:05 CST my]$ oc get node -w
NAME                                              STATUS     ROLES    AGE   VERSION
ip-10-0-133-104.ap-northeast-1.compute.internal   NotReady   master   30h   v1.17.1
ip-10-0-142-46.ap-northeast-1.compute.internal    NotReady   worker   30h   v1.17.1
ip-10-0-145-242.ap-northeast-1.compute.internal   NotReady   master   30h   v1.17.1
ip-10-0-157-193.ap-northeast-1.compute.internal   NotReady   worker   30h   v1.17.1
ip-10-0-160-122.ap-northeast-1.compute.internal   NotReady   worker   30h   v1.17.1
ip-10-0-167-85.ap-northeast-1.compute.internal    NotReady   master   30h   v1.17.1
^C[xxia 2020-04-15 17:16:01 CST my]
```
Checked `oc get co` and `oc get po --all-namespaces`; no abnormality found, though.
I am looking into this. The QA cluster got pruned, but I've managed to reproduce it in my cluster.

I've traced this to kube-apiserver's client CA not being live-reloaded, which causes CSRs signed with the KCM csr-signer not to be trusted and results in 401 Unauthorized. The kubelet uses a CSR for its credentials and can't start. While testing it I've noticed KAS's accepted CAs do live-reload from client-ca/ca-bundle.crt, but the trust isn't affected, though. David referenced this pull, https://github.com/kubernetes/kubernetes/pull/83579, which should have made the client CA live-reloaded. I'll continue root-causing it tomorrow.

I've figured out why it is not reloading the certificates on the authentication path; I should have a fix tomorrow. This, along with https://bugzilla.redhat.com/show_bug.cgi?id=1825987, is required for the automatic cert rotation to fully work.

The following items are left:

- https://github.com/openshift/cluster-kube-controller-manager-operator/pull/401 - Sync new client cert-key on recovery for 4.4
- https://github.com/kubernetes/kubernetes/pull/90360 - Fix client-ca dynamic reload in apiserver

For the latter, we need to land it upstream first, backport it to the current origin master, and only then back to 4.4. We'll continue landing all of the above ASAP, but that won't make the 4.4.0 cut. Since the manual steps described in https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html#dr-recovering-expired-certs still work, we have a fallback solution in place for 4.4.0.

I am closing the CSR signer issue as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1821683, where the fix was tracked. The kube-apiserver client CA reloading is a different issue, and we are tracking it in its own BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1831045.

*** This bug has been marked as a duplicate of bug 1821683 ***