Description of problem:
Auto recovery does not work when the cluster is suspended before the 12-hour mark and restarted after the 24-hour mark; nodes stay NotReady.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-25-223508

How reproducible:
Always

Steps to Reproduce:
Yesterday at 10 am, launched a 4.4.0-0.nightly-2020-03-25-223508 cluster in AWS and took no other action; at 7:15 pm, stopped all nodes. Here is the screen record; note the shell prompts' timestamps of yesterday and today:

asked tomasz https://coreos.slack.com/archives/CB48XQ4KZ/p1583919086295400?thread_ts=1583919037.295300

[xxia 2020-03-26 19:12:33 CST my]$ scripts/check_secret_expiry_2.sh  # then stop all nodes on AWS console.
2020-03-26T01:51:36Z 2020-03-27T01:51:36Z openshift-kube-apiserver-operator  aggregator-client-signer
2020-03-26T02:07:01Z 2020-03-27T01:51:38Z openshift-kube-apiserver  aggregator-client
2020-03-26T02:07:00Z 2020-03-27T01:51:41Z openshift-kube-controller-manager-operator  csr-signer
2020-03-26T01:51:41Z 2020-03-27T01:51:41Z openshift-kube-controller-manager-operator  csr-signer-signer

Today, restarted all nodes in the AWS console:

[xxia 2020-03-27 18:39:55 CST my]$ sleep 10m # restarting
[xxia 2020-03-27 18:52:45 CST my]$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-28ms2   13m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2f6xj   25s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2nxq4   7m15s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2q4z7   13m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2s8ck   9m39s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2tkdv   5m55s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-44mn2   60s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[xxia 2020-03-27 19:04:31 CST my]$ scripts/check_secret_expiry_2.sh
2020-03-26T02:07:00Z 2020-03-27T01:51:41Z openshift-kube-controller-manager-operator  csr-signer
2020-03-26T01:51:41Z 2020-03-27T01:51:41Z openshift-kube-controller-manager-operator  csr-signer-signer
[xxia 2020-03-27 19:03:13 CST my]$ oc get no
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-129-145.us-east-2.compute.internal   NotReady   master   32h   v1.17.1
ip-10-0-143-206.us-east-2.compute.internal   NotReady   worker   32h   v1.17.1
ip-10-0-149-179.us-east-2.compute.internal   NotReady   worker   32h   v1.17.1
ip-10-0-157-56.us-east-2.compute.internal    NotReady   master   32h   v1.17.1
ip-10-0-169-65.us-east-2.compute.internal    NotReady   master   32h   v1.17.1
ip-10-0-175-10.us-east-2.compute.internal    NotReady   worker   32h   v1.17.1
[xxia 2020-03-27 19:05:49 CST my]$ cat scripts/check_secret_expiry_2.sh
oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date='+24hours' +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"'

Actual results:
Nodes are NotReady and the csr-signer and csr-signer-signer secrets were not recovered.

Expected results:
Auto recovery succeeds without the above issue.

Additional info:
Though, `oc get co` shows no abnormal ClusterOperators; all are "True False False".
The problem is that the csr-signer wasn't rotated. This is caused by these lines:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ef3599ba51489e5e891b1a3bc40e60a801edfd01/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L496-L514

When we have:
  useAfter  Mar 26 02:07:00
  now       Mar 27 10:39:12 (roughly)
we always hit the condition on line 499 and return a negative delay, which in turn leads to a hot loop in
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ef3599ba51489e5e891b1a3bc40e60a801edfd01/pkg/cmd/recoverycontroller/csrcontroller.go#L153-L162

We should:
a) write proper unit tests around the code in targetconfigcontroller.go that outline the cases we want to see happen
b) either add a (useAfter.Sub(now) > 0) check or check that Not After is before now.
Today I prepared several DR environments and shut them down, for verification the next day.
Tested in a 4.4.0-0.nightly-2020-04-13-113747 IPI AWS env. Today restarted all nodes after more than 24 hours shut down (30 hours in fact), and got:

[xxia 2020-04-15 16:38:42 CST my]$ scripts/check_secret_expiry_within.sh 31days
Checking validity within 31days
2020-04-14T02:08:36Z 2020-05-14T02:08:37Z openshift-config-managed  kube-controller-manager-client-cert-key
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-config-managed  kube-scheduler-client-cert-key
2020-04-15T08:23:24Z 2020-05-15T08:23:25Z openshift-kube-apiserver-operator  aggregator-client-signer
2020-04-15T08:23:24Z 2020-05-15T08:23:25Z openshift-kube-apiserver  aggregator-client
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver  external-loadbalancer-serving-certkey
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver  internal-loadbalancer-serving-certkey
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver  kubelet-client
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver  kubelet-client-2
...
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-apiserver  kubelet-client-6
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver  localhost-serving-cert-certkey
2020-04-14T02:08:34Z 2020-05-14T02:08:35Z openshift-kube-apiserver  service-network-serving-certkey
2020-04-15T08:24:45Z 2020-05-15T08:24:46Z openshift-kube-controller-manager-operator  csr-signer
2020-04-14T02:08:36Z 2020-05-14T02:08:37Z openshift-kube-controller-manager  kube-controller-manager-client-cert-key
2020-04-14T02:08:33Z 2020-05-14T02:08:34Z openshift-kube-scheduler  kube-scheduler-client-cert-key

The certs are all renewed.

[xxia 2020-04-15 16:42:40 CST my]$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-2sckq   3m41s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
...
csr-vpvnj   19m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[xxia 2020-04-15 16:44:46 CST my]$ oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-2sckq approved
...

After approving the CSRs, the nodes are still NotReady:

[xxia 2020-04-15 16:59:05 CST my]$ oc get node -w
NAME                                              STATUS     ROLES    AGE   VERSION
ip-10-0-133-104.ap-northeast-1.compute.internal   NotReady   master   30h   v1.17.1
ip-10-0-142-46.ap-northeast-1.compute.internal    NotReady   worker   30h   v1.17.1
ip-10-0-145-242.ap-northeast-1.compute.internal   NotReady   master   30h   v1.17.1
ip-10-0-157-193.ap-northeast-1.compute.internal   NotReady   worker   30h   v1.17.1
ip-10-0-160-122.ap-northeast-1.compute.internal   NotReady   worker   30h   v1.17.1
ip-10-0-167-85.ap-northeast-1.compute.internal    NotReady   master   30h   v1.17.1
^C[xxia 2020-04-15 17:16:01 CST my]

Checked `oc get co` and `oc get po --all-namespaces`; no abnormality found, though.
I am looking into this. The QA cluster got pruned but I've managed to reproduce it in my cluster.
I've traced this to the kube-apiserver's client CA not being live-reloaded, which causes client certificates signed by the KCM csr-signer not to be trusted and results in 401 Unauthorized. The kubelet uses a CSR-obtained certificate for its credentials, so it can't start. While testing I noticed that the KAS's list of accepted CAs does live-reload from client-ca/ca-bundle.crt, but the actual trust isn't affected. David referenced https://github.com/kubernetes/kubernetes/pull/83579, which should have made the client CA live-reloaded. I'll continue root-causing it tomorrow.
I've figured out why it is not reloading the certificates on the authentication path; I should have a fix tomorrow.
This, along with https://bugzilla.redhat.com/show_bug.cgi?id=1825987, is required for automatic cert rotation to fully work.

The following items are left:
- https://github.com/openshift/cluster-kube-controller-manager-operator/pull/401 - Sync new client cert-key on recovery for 4.4
- https://github.com/kubernetes/kubernetes/pull/90360 - Fix client-ca dynamic reload in apiserver

The latter needs to land upstream first, then be backported to the current origin master, and only then back to 4.4. We'll continue landing all of the above ASAP, but it won't make the 4.4.0 cut. Since the manual steps described in https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html#dr-recovering-expired-certs still work, we have a fallback solution in place for 4.4.0.
I am closing the CSR signer issue as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1821683, where the investigation and the fix were tracked. The kube-apiserver client CA reloading is a different issue, and we are tracking it in its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1831045

*** This bug has been marked as a duplicate of bug 1821683 ***