Verified Result:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-21-130047   True        False         304d    Cluster version is 4.10.0-0.nightly-2021-12-21-130047

After changing the NTP server time and syncing the master and worker nodes' clocks, no CSR was generated.

[root@ip-10-0-26-251 .ssh]# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-132-119.us-east-2.compute.internal   Ready    worker   304d   v1.22.1+6859754
ip-10-0-135-42.us-east-2.compute.internal    Ready    master   304d   v1.22.1+6859754
ip-10-0-166-194.us-east-2.compute.internal   Ready    master   304d   v1.22.1+6859754
ip-10-0-191-218.us-east-2.compute.internal   Ready    worker   304d   v1.22.1+6859754
ip-10-0-194-98.us-east-2.compute.internal    Ready    worker   304d   v1.22.1+6859754
ip-10-0-214-31.us-east-2.compute.internal    Ready    master   304d   v1.22.1+6859754

The certificate did not rotate automatically, and the kubelet could not start up, with the error below:

Oct 23 12:26:05 ip-10-0-135-42 hyperkube[1588]: E1023 12:26:05.793101 1588 transport.go:112] kubernetes.io/kube-apiserver-client-kubelet: Current certificate is expired
...ansport.go:112] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2021-12-23 12:08:55.149157194 +0000 UTC m=+0.584949896" shutdownThreshold="5m0s"

No pending CSR was found:

$ oc get csr
NAME                                             AGE   SIGNERNAME                            REQUESTOR                                                                          REQUESTEDDURATION   CONDITION
system:openshift:openshift-authenticator-b69rj   37m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-authentication-operator:authentication-operator   <none>              Approved
system:openshift:openshift-monitoring-7qdqs      37m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            <none>              Approved
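For reference, the expiry can be confirmed offline by running openssl against the kubelet client certificate PEM on the node (typically /var/lib/kubelet/pki/kubelet-client-current.pem — the exact path is an assumption here). A minimal sketch, using a throwaway self-signed certificate in place of the real file:

```shell
# Generate a throwaway self-signed cert standing in for the kubelet client
# cert (hypothetical stand-in; on a node you would point at the real PEM).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
    -days 1 -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# Print the validity window; -checkend 0 reports whether it is already expired.
openssl x509 -noout -dates -in /tmp/demo.crt
openssl x509 -noout -checkend 0 -in /tmp/demo.crt
```

On an expired certificate, `-checkend 0` prints "Certificate will expire" and exits non-zero, which makes it usable in scripts without parsing dates.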
Failed to reproduce on AWS; opened a new bug to track this issue: https://bugzilla.redhat.com/show_bug.cgi?id=2036361

I tried to reproduce the BZ (the kubelet could not start up after certificate rotation) several times. The OpenShift console could not load or log in, and threw a 401 error.

Console error:
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"39a034bf"}

My main test steps:

1. On the NTP server:

[root@ip-10-0-31-53 ec2-user]# cat /etc/chrony.conf
driftfile /var/lib/chrony/drift
makestep 1.0 3
allow 10.0.0.0/12
local stratum 1
logdir /var/log/chrony
manual

[root@ip-10-0-31-53 ec2-user]# systemctl restart chronyd

2. Create the MachineConfigs:

[ocpadmin@ec2-18-217-45-133 nto]$ oc create -f master-mc.yaml
machineconfig.machineconfiguration.openshift.io/99-master-chrony created
[ocpadmin@ec2-18-217-45-133 nto]$ oc create -f worker-mc.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-chrony created

[ocpadmin@ec2-18-217-45-133 nto]$ cat worker-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-chrony
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,pool%20ip-10-0-31-53.us-east-2.compute.internal%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A
        mode: 420
        overwrite: true
        path: /etc/chrony.conf

[ocpadmin@ec2-18-217-45-133 nto]$ cat master-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-chrony
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,pool%20ip-10-0-31-53.us-east-2.compute.internal%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A
        mode: 420
        overwrite: true
        path: /etc/chrony.conf

3. Change the NTP server date:

[root@ip-10-0-31-53 ec2-user]# date
Mon Oct 31 01:34:15 EDT 2022

4. Restart chronyd and the kubelet on the master nodes.

5. Check that the CSRs have been approved:

[root@ip-10-0-31-53 ec2-user]# oc get csr
NAME                                             AGE   SIGNERNAME                            REQUESTOR                                                                          REQUESTEDDURATION   CONDITION
system:openshift:openshift-authenticator-998s5   14m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-authentication-operator:authentication-operator   <none>              Approved
system:openshift:openshift-monitoring-mn9xj      14m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            <none>              Approved

The kubelet startup failure could not be reproduced. Is there any specific hardware on which the customer deploys OCP?
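As a sanity check, the percent-encoded chrony.conf carried in the MachineConfig's Ignition data URL can be decoded locally to confirm what actually lands in /etc/chrony.conf on the nodes. A sketch using python3's urllib for the decoding:

```shell
# Body of the Ignition data URL from the MachineConfigs above.
encoded='pool%20ip-10-0-31-53.us-east-2.compute.internal%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A'

# Percent-decode it; prints the chrony.conf the nodes will receive:
#   pool ip-10-0-31-53.us-east-2.compute.internal iburst
#   driftfile /var/lib/chrony/drift
#   makestep 1.0 3
#   rtcsync
#   logdir /var/log/chrony
python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.argv[1]))' "$encoded"
```

The same round trip in reverse (urllib.parse.quote) is a convenient way to build these data URLs without hand-encoding.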
Thanks for Andreas' and David's suggestions. I have verified on my cluster as below:

[ocpadmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-07-004348   True        False         41m     Cluster version is 4.10.0-0.nightly-2022-01-07-004348

$ oc describe service/node-tuning-operator -n openshift-cluster-node-tuning-operator | grep Endpoints
Endpoints:         10.129.0.21:60000
$ export METRICS_ENDPOINT="10.129.0.21:60000"

$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null" | tee openssl_output_before.txt
$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_before.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 11:04:47 2022 GMT
notAfter=Jan  7 11:04:48 2024 GMT

$ oc get pods -n openshift-cluster-node-tuning-operator
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-57878bc8f5-66p6r   1/1     Running   0          54m
tuned-4tdqr                                     1/1     Running   0          44m
tuned-gxxm4                                     1/1     Running   0          51m
tuned-mjnnh                                     1/1     Running   0          44m
tuned-tz4bz                                     1/1     Running   0          51m
tuned-wxvvp                                     1/1     Running   0          51m

$ oc get secret -n openshift-cluster-node-tuning-operator
NAME                                           TYPE                                  DATA   AGE
builder-dockercfg-np4g6                        kubernetes.io/dockercfg               1      48m
builder-token-qsv2z                            kubernetes.io/service-account-token   4      48m
builder-token-wc2pj                            kubernetes.io/service-account-token   4      48m
cluster-node-tuning-operator-dockercfg-fp2b5   kubernetes.io/dockercfg               1      48m
cluster-node-tuning-operator-token-9b4zs       kubernetes.io/service-account-token   4      48m
cluster-node-tuning-operator-token-md65z       kubernetes.io/service-account-token   4      57m
default-dockercfg-kvwd8                        kubernetes.io/dockercfg               1      48m
default-token-4gz8m                            kubernetes.io/service-account-token   4      58m
default-token-c6tld                            kubernetes.io/service-account-token   4      48m
deployer-dockercfg-xxkwc                       kubernetes.io/dockercfg               1      48m
deployer-token-76xsd                           kubernetes.io/service-account-token   4      48m
deployer-token-fjkmk                           kubernetes.io/service-account-token   4      48m
node-tuning-operator-tls                       kubernetes.io/tls                     2      55m
tuned-dockercfg-55gms                          kubernetes.io/dockercfg               1      48m
tuned-token-hc8vm                              kubernetes.io/service-account-token   4      57m
tuned-token-sfswb                              kubernetes.io/service-account-token   4      48m

[ocpadmin@ec2-18-217-45-133 ~]$ oc delete secret node-tuning-operator-tls -n openshift-cluster-node-tuning-operator
secret "node-tuning-operator-tls" deleted

$ oc logs cluster-node-tuning-operator-57878bc8f5-66p6r -n openshift-cluster-node-tuning-operator | tail -10
I0107 11:10:22.425041       1 trace.go:205] Trace[691670059]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (07-Jan-2022 11:09:52.423) (total time: 30001ms):
Trace[691670059]: ---"Objects listed" error:Get "https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-cluster-node-tuning-operator/daemonsets?resourceVersion=12519": dial tcp 172.30.0.1:443: i/o timeout 30001ms (11:10:22.424)
Trace[691670059]: [30.001192828s] [30.001192828s] END
E0107 11:10:22.425061       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: Get "https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-cluster-node-tuning-operator/daemonsets?resourceVersion=12519": dial tcp 172.30.0.1:443: i/o timeout
I0107 11:12:34.929912       1 controller.go:595] created profile liqcui-vmc410-bqjzp-worker-f5k78 [openshift-node]
I0107 11:12:36.653687       1 controller.go:595] created profile liqcui-vmc410-bqjzp-worker-s4c5k [openshift-node]
I0107 12:01:06.917646       1 server.go:144] cert and key changed, need to restart the server.
I0107 12:01:06.917698       1 server.go:107] restarting metrics server to rotate certificates
I0107 12:01:06.917706       1 server.go:60] stopping metrics server
I0107 12:01:07.991194       1 server.go:51] starting metrics server

[ocpadmin@ec2-18-217-45-133 ~]$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:00:05 2022 GMT
notAfter=Jan  7 12:00:06 2024 GMT

$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null" | tee openssl_output_after.txt
$ oc get secret/node-tuning-operator-tls -o json -n openshift-cluster-node-tuning-operator | jq -r '.data | ."tls.crt"' | base64 -d | sed '/-END CERTIFICATE-/q' > cert_secret_after.txt
[ocpadmin@ec2-18-217-45-133 ~]$ diff cert_after.txt cert_secret_after.txt

$ oc delete secret node-tuning-operator-tls -n openshift-cluster-node-tuning-operator
secret "node-tuning-operator-tls" deleted
[ocpadmin@ec2-18-217-45-133 ~]$ date
Fri Jan  7 12:12:23 UTC 2022
[ocpadmin@ec2-18-217-45-133 ~]$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt2
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:12:20 2022 GMT
notAfter=Jan  7 12:12:21 2024 GMT

##########################################

[ocpadmin@ec2-18-217-45-133 ~]$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:25:07 2022 GMT
notAfter=Jan  7 12:25:08 2024 GMT

[ocpadmin@ec2-18-217-45-133 ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
liqcui-vmc410-bqjzp-master-0       Ready    master   107m   v1.22.1+6859754
liqcui-vmc410-bqjzp-master-1       Ready    master   107m   v1.22.1+6859754
liqcui-vmc410-bqjzp-master-2       Ready    master   107m   v1.22.1+6859754
liqcui-vmc410-bqjzp-worker-f5k78   Ready    worker   97m    v1.22.1+6859754
liqcui-vmc410-bqjzp-worker-s4c5k   Ready    worker   97m    v1.22.1+6859754

[ocpadmin@ec2-18-217-45-133 ~]$ oc delete secret/signing-key -n openshift-service-ca
secret "signing-key" deleted

The certificate was also rotated by deleting the signing-key secret:

$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:52:00 2022 GMT
notAfter=Jan  7 12:52:01 2024 GMT
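The verification above compares the certificate served on the wire against the one stored in the node-tuning-operator-tls secret. The jq/base64 extraction step can be sketched in isolation against a stub secret JSON (a hypothetical stand-in for the output of `oc get secret/node-tuning-operator-tls -o json`):

```shell
# Stub secret JSON in the same shape `oc get secret -o json` returns;
# in the real secret, tls.crt holds a base64-encoded PEM certificate.
secret_json=$(printf '{"data":{"tls.crt":"%s"}}' "$(printf 'FAKE-PEM-BODY' | base64)")

# Same extraction pipeline as run against the live cluster:
# pull the tls.crt field and base64-decode it back to PEM text.
echo "$secret_json" | jq -r '.data | ."tls.crt"' | base64 -d
```

The quoted `."tls.crt"` index is needed because the key contains a dot; without the quotes jq would read it as a nested path.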
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056