Description of problem:
The worker MachineConfigPool rolls out to a new MachineConfig after the kubelet cert rotates, but the pool's description (Updated condition message) still shows the old MachineConfig.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-03-221810

How reproducible:
always

Steps to Reproduce:
1. Create an SNO cluster on a baremetal platform.
2. Check the secret kube-apiserver-to-kubelet-signer and note the certificate's start and end times:
# oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
...
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: kube-apiserver-to-kubelet-signer
    auth.openshift.io/certificate-not-after: "2022-06-04T03:50:54Z"
    auth.openshift.io/certificate-not-before: "2021-06-04T03:50:54Z"
3. SSH to the node and set the system time to current time + 291days12hours:
# ./move_ahead_time.sh 291days12hours

move_ahead_time.sh:
# usage: ./move_ahead_time.sh 1year
# you can also use 12hours, 30days, 2month, 1year
date
sudo systemctl disable chronyd
sudo systemctl stop crio && sudo systemctl stop kubelet
MOVED_AHEAD_TIME=$1
future=$(date --date "+$MOVED_AHEAD_TIME" -u "+%Y-%m-%dT%H:%M:%SZ")
sudo date --set "$future"
date
sudo systemctl start crio && sudo systemctl start kubelet

4. Check that the secret kube-apiserver-to-kubelet-signer gets updated within the next 24 hours:
# oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1648007460
    auth.openshift.io/certificate-not-after: "2023-03-23T03:51:00Z"
    auth.openshift.io/certificate-not-before: "2022-03-23T03:50:59Z"
5. # oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
00-worker                                          f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-master-container-runtime                        f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-master-kubelet                                  f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-worker-container-runtime                        f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-worker-kubelet                                  f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-master-generated-registries                     f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-master-ssh                                                                                 3.2.0             293d
99-worker-generated-registries                     f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-worker-ssh                                                                                 3.2.0             293d
rendered-master-078343783f187d05b77e1c7eeaead3b7   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             35h    // new config for mcp master
rendered-master-1837be79407b056e8fae1d37cbf5459e   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
rendered-worker-47864b741bb0b21bb726ced0df195252   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             35h    // new config for mcp worker
rendered-worker-8f9a0067e6e573e3a79185d8e64974ae   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
6. # oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-078343783f187d05b77e1c7eeaead3b7   True      False      False      1              1                   1                     0                      293d
worker   rendered-worker-47864b741bb0b21bb726ced0df195252   True      False      False      0              0                   0                     0                      293d
7. # oc get mcp worker -o yaml
...
spec:
  configuration:
    name: rendered-worker-47864b741bb0b21bb726ced0df195252
    source:
    ....
status:
  conditions:
  - lastTransitionTime: "2021-06-04T04:24:31Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae
    reason: ""
    status: "True"
    type: Updated

Actual results:
4. The secret kube-apiserver-to-kubelet-signer is updated.
5. Two new rendered MachineConfigs are generated: mc-1 and mc-2.
6. The master pool rolls out to mc-1 and the worker pool rolls out to mc-2.
7. The worker pool's description shows the old MC in the "All nodes are updated with" message.

Expected results:
7. The worker pool's description shows the correct MC in the "All nodes are updated with" message.

Additional info:
# oc get mcp worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-04T04:19:25Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  name: worker
  resourceVersion: "281426"
  uid: 64074cc2-dabd-4aaa-ab7c-e9a96bb26f6d
spec:
  configuration:
    name: rendered-worker-47864b741bb0b21bb726ced0df195252
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-04T04:24:31Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae
    reason: ""
    status: "True"
    type: Updated
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: ""
    reason: ""
    status: "False"
    type: Updating
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: ""
    reason: ""
    status: "False"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: ""
    reason: ""
    status: "False"
    type: Degraded
  configuration:
    name: rendered-worker-47864b741bb0b21bb726ced0df195252
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 0
  machineCount: 0
  observedGeneration: 3
  readyMachineCount: 0
  unavailableMachineCount: 0
  updatedMachineCount: 0
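For anyone triaging this, a quick check (not part of the original report, using standard oc jsonpath filtering) prints the pool's target rendered config next to the Updated condition message so the mismatch described in step 7 is easy to spot:

oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}{.status.conditions[?(@.type=="Updated")].message}{"\n"}'

On an affected cluster the first line shows rendered-worker-47864b741bb0b21bb726ced0df195252 while the message still references rendered-worker-8f9a0067e6e573e3a79185d8e64974ae.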
Will look into this. Do you know if this behaviour is SingleNode only?
This kubelet cert rotation case can only be tested on an SNO baremetal cluster, so I have only seen it in an SNO environment.
We don't have hardware to test it in a baremetal environment, but we can try to reproduce this issue in an AWS/GCP SNO cluster or on a regular cluster. @MinLi, meanwhile, can you please provide a must-gather of the cluster where you can reproduce this issue?
must-gather failed due to multiple errors, such as ImagePullBackOff for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:38b87c0e3b0e4ffa55888.. and some containers not being found. I need more time to rebuild a new cluster and get the must-gather.
I can't reproduce it so far due to other issues. After changing the date on the node and rebooting the system, the node is not accessible from the cluster.
Hi Sinny Kumari, this is my test case; you can follow the steps here: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-40820
Case: Validate kube-apiserver-to-kubelet-signer cert rotation without node rebooting in SNO

step 1: Create an SNO cluster on a baremetal environment (cloud provider platforms such as AWS don't support setting the system time ahead).
[root@ocp-edge49 ~]# oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2021-04-19T02:33:44Z"
  generation: 1
  name: cluster
  resourceVersion: "636"
  uid: e45bc488-7e94-40cf-9d3a-5dde09ddb460
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.minli-sno1-0.qe.lab.redhat.com:6443
  apiServerURL: https://api.minli-sno1-0.qe.lab.redhat.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: minli-sno1-0-4ngfq
  infrastructureTopology: SingleReplica   // note this
  platform: None
  platformStatus:
    type: None

step 2: Check the secret kube-apiserver-to-kubelet-signer and note the certificate's start and end times:
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
...
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: kube-apiserver-to-kubelet-signer
    auth.openshift.io/certificate-not-after: "2022-04-19T02:23:29Z"
    auth.openshift.io/certificate-not-before: "2021-04-19T02:23:29Z"
  creationTimestamp: "2021-04-19T02:33:40Z"

step 3: Check that all pods, nodes, and cluster operators are healthy. Also note the cert file timestamp on the node and copy the file so it can be compared with the one after 292 days:
[core@sno-0-0 ~]$ ll /etc/kubernetes/kubelet-ca.crt

step 4: SSH to the node and set the system time to current time + 291days12hours:
./move_ahead_time.sh 291days12hours
Note: after resetting the server time, you also need to set the same time on the client host where the oc client runs, otherwise the client's cert appears expired and you get the following error:
# oc get node
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2021-04-21T12:22:35+03:00 is before 2022-02-04T14:18:26Z
After resetting the client time (# date --set '2022-02-04T14:59:17Z'), you need to approve the pending CSRs:
[root@ocp-edge49 ~]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-6j45j   38m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-g7brm   23m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-vt5dn   53m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zhzdn   8m30s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[root@ocp-edge49 ~]# oc adm certificate approve csr-zhzdn
certificatesigningrequest.certificates.k8s.io/csr-zhzdn approved
[root@ocp-edge49 ~]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-6j45j   39m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-g7brm   24m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-vt5dn   54m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zhzdn   9m10s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
[root@ocp-edge49 ~]# oc get node
The connection to the server api.minli-sno1-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
[root@ocp-edge49 ~]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-6j45j   42m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-g7brm   27m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-h5hkm   3m11s   kubernetes.io/kubelet-serving                 system:node:sno-0-0                                                         Pending   // need to approve
csr-vt5dn   57m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zhzdn   12m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
[root@ocp-edge49 ~]# oc adm certificate approve csr-h5hkm

step 5: Check that the secret kube-apiserver-to-kubelet-signer gets updated within the next 24 hours, and check whether the node reboots during this period.
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1644027819
    auth.openshift.io/certificate-not-after: "2023-02-05T02:23:39Z"
    auth.openshift.io/certificate-not-before: "2022-02-05T02:23:38Z"
  creationTimestamp: "2021-04-19T02:33:40Z"
$ oc get mc
$ oc get mcp worker -o yaml
$ oc get mcp master -o yaml
...
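Side note, not part of the original case: instead of approving each pending CSR by name, a commonly used one-liner approves every CSR that does not yet have a status. Only use it when you expect all pending CSRs to be legitimate:

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve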
Hi Sinny Kumari, I have listed the steps in detail; please reach out to me if you still can't reproduce this bug.
Since I can't create my own SNO cluster on baremetal, I am working with Min Li to get access to a cluster where the bug is reproducible.
Looked into the logs from the reproducible cluster that MinLi shared. I think the Updated status message in the worker pool is not getting updated because there are no nodes in the worker pool to update. This issue is not specific to kubelet cert rotation; we can reproduce it on any SNO cluster, or on any cluster with 0 worker nodes, by applying a MachineConfig targeted at the worker pool. With the current behavior, the status will be updated accordingly once one or more nodes are present in or added to the worker pool. Lowering the priority because this doesn't impact SNO or regular clusters in any way, i.e. applying a MachineConfig and scaling nodes in a pool (e.g. worker) up or down should work fine. We can look at updating the status message in the future when we have some free cycles.
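To illustrate the zero-worker-node reproduction path described above (a minimal sketch; the MachineConfig name 99-worker-test-file and the file path are made up for this example), applying any worker-targeted MachineConfig on a cluster with 0 worker nodes shows the same stale message:

cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/mcp-status-test
        mode: 0644
        contents:
          source: data:,hello
EOF
# A new rendered-worker-* config is generated and spec.configuration.name points at it,
# but with no nodes to update the Updated condition message keeps referencing the
# previous rendered config.
oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}{.status.conditions[?(@.type=="Updated")].message}{"\n"}'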
This doesn't impact cluster behavior in any way. It would be nice to fix, but it isn't necessary because there are no nodes in the worker pool. With other high-priority bugs and new feature work, the MCO team won't be able to fix it soon. Closing this for now; a new bug can be opened if this has a direct known impact on the customer end.