Bug 1968409
| Summary: | SNO: mcp worker's description shows outdated info after a new MachineConfig has been applied | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | MinLi <minmli> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | aos-bugs, mkrejci, rioliu, skumari |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-08 17:58:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
# oc get mcp worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: "2021-06-04T04:19:25Z"
generation: 3
labels:
machineconfiguration.openshift.io/mco-built-in: ""
pools.operator.machineconfiguration.openshift.io/worker: ""
name: worker
resourceVersion: "281426"
uid: 64074cc2-dabd-4aaa-ab7c-e9a96bb26f6d
spec:
configuration:
name: rendered-worker-47864b741bb0b21bb726ced0df195252
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
machineConfigSelector:
matchLabels:
machineconfiguration.openshift.io/role: worker
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker: ""
paused: false
status:
conditions:
- lastTransitionTime: "2021-06-04T04:24:31Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae
reason: ""
status: "True"
type: Updated
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: ""
reason: ""
status: "False"
type: Updating
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: ""
reason: ""
status: "False"
type: NodeDegraded
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: ""
reason: ""
status: "False"
type: Degraded
configuration:
name: rendered-worker-47864b741bb0b21bb726ced0df195252
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 0
machineCount: 0
observedGeneration: 3
readyMachineCount: 0
unavailableMachineCount: 0
updatedMachineCount: 0
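
A quick way to surface the mismatch described in this bug (a minimal sketch, assuming cluster-admin access with oc) is to compare the rendered config the pool's spec points at with the one named in the Updated condition message:

# sketch: on an affected cluster these two names differ
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
rendered-worker-47864b741bb0b21bb726ced0df195252
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].message}{"\n"}'
All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae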
Will look into this. Do you know if this behaviour is SingleNode only?

This kubelet cert rotation case can only be tested on an SNO bare-metal cluster, so I have only seen it on an SNO environment.

We don't have hardware to test it in a bare-metal environment, but we can try to reproduce this issue on an aws/gcp SNO cluster or on a regular cluster. @MinLi Meanwhile, can you please provide a must-gather of the cluster where you can reproduce this issue?

must-gather failed due to multiple errors, such as ImagePullBackOff for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:38b87c0e3b0e4ffa55888.. and some containers not found. I need more time to rebuild a new cluster and get the must-gather.

I can't reproduce it so far due to other issues: after changing the date on the node and rebooting the system, the node is no longer accessible from the cluster.

Hi, Sinny Kumari, this is my test case; you can follow these steps: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-40820 (case: Validate kube-apiserver-to-kubelet-signer cert rotate without node rebooting in SNO)
step1:
create an SNO cluster on a bare-metal env (cloud provider platforms such as aws don't support setting the system time ahead); a quick topology check follows the infrastructure output below
[root@ocp-edge49 ~]# oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
creationTimestamp: "2021-04-19T02:33:44Z"
generation: 1
name: cluster
resourceVersion: "636"
uid: e45bc488-7e94-40cf-9d3a-5dde09ddb460
spec:
cloudConfig:
name: ""
platformSpec:
type: None
status:
apiServerInternalURI: https://api-int.minli-sno1-0.qe.lab.redhat.com:6443
apiServerURL: https://api.minli-sno1-0.qe.lab.redhat.com:6443
controlPlaneTopology: SingleReplica
etcdDiscoveryDomain: ""
infrastructureName: minli-sno1-0-4ngfq
infrastructureTopology: SingleReplica // take care
platform: None
platformStatus:
type: None
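
To double-check that the cluster is really single-node before proceeding, a short verification sketch (standard oc jsonpath queries; the node name will differ per cluster):

# both topology fields report SingleReplica on an SNO cluster
$ oc get infrastructure cluster -o jsonpath='{.status.controlPlaneTopology}{" "}{.status.infrastructureTopology}{"\n"}'
SingleReplica SingleReplica
# a single node should carry both the master and worker roles
$ oc get nodes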
step2:
check the kube-apiserver-to-kubelet-signer secret and note the certificate's start (not-before) and end (not-after) times; a jsonpath sketch follows the output below
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
...
kind: Secret
metadata:
annotations:
auth.openshift.io/certificate-issuer: kube-apiserver-to-kubelet-signer
auth.openshift.io/certificate-not-after: "2022-04-19T02:23:29Z"
auth.openshift.io/certificate-not-before: "2021-04-19T02:23:29Z"
creationTimestamp: "2021-04-19T02:33:40Z"
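
The two validity annotations can also be pulled out directly; a minimal sketch (dots inside the annotation keys have to be escaped in jsonpath):

# record the signer certificate's validity window for comparison after the time jump
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator \
    -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-before}{"\n"}{.metadata.annotations.auth\.openshift\.io/certificate-not-after}{"\n"}'
2021-04-19T02:23:29Z
2022-04-19T02:23:29Z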
step3:
check that all pods, nodes, and cluster operator statuses are normal
Also note the timestamp of the cert file on the node and keep a copy of it, so it can be compared with the file after ~292 days (see the sketch below):
[core@sno-0-0 ~]$ ll /etc/kubernetes/kubelet-ca.crt
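
A small sketch of what keeping that copy can look like (the backup filename is arbitrary); the file content and mtime should change once the signer rotates:

# keep a pre-rotation copy of the kubelet CA bundle for a later diff
[core@sno-0-0 ~]$ sudo cp -p /etc/kubernetes/kubelet-ca.crt /var/home/core/kubelet-ca.crt.before
# ... after the rotation in step 5:
[core@sno-0-0 ~]$ sudo diff /var/home/core/kubelet-ca.crt.before /etc/kubernetes/kubelet-ca.crt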
step4:
ssh to the node and set the system time to the current time + 291 days 12 hours
./move_ahead_time.sh 291days12hours
note: after resetting the server time, you also need to set the same time on the client host where the oc client runs; otherwise the client's cert is treated as expired and you get the following error:
# oc get node
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2021-04-21T12:22:35+03:00 is before 2022-02-04T14:18:26Z
After resetting the client time (# date --set '2022-02-04T14:59:17Z'), you need to approve the pending CSRs (a bulk-approval sketch follows the CSR listing below):
[root@ocp-edge49 ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-6j45j 38m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-g7brm 23m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-vt5dn 53m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-zhzdn 8m30s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
[root@ocp-edge49 ~]# oc adm certificate approve csr-zhzdn
certificatesigningrequest.certificates.k8s.io/csr-zhzdn approved
[root@ocp-edge49 ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-6j45j 39m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-g7brm 24m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-vt5dn 54m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-zhzdn 9m10s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
[root@ocp-edge49 ~]# oc get node
The connection to the server api.minli-sno1-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
[root@ocp-edge49 ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-6j45j 42m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-g7brm 27m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-h5hkm 3m11s kubernetes.io/kubelet-serving system:node:sno-0-0 Pending // need to approve
csr-vt5dn 57m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-zhzdn 12m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
[root@ocp-edge49 ~]# oc adm certificate approve csr-h5hkm
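
If several CSRs pile up while the clock jumps, they can be approved in one pass; a sketch using the standard go-template trick for listing pending requests (approve with care, since this trusts every outstanding request on the cluster):

# approve every CSR that has no status yet (i.e. is still Pending)
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve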
step5:
check whether the secret kube-apiserver-to-kubelet-signer gets updated within the next 24 hours, and check whether the node reboots during this period (see the sketch after the command listing below).
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
metadata:
annotations:
auth.openshift.io/certificate-issuer: openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1644027819
auth.openshift.io/certificate-not-after: "2023-02-05T02:23:39Z"
auth.openshift.io/certificate-not-before: "2022-02-05T02:23:38Z"
creationTimestamp: "2021-04-19T02:33:40Z"
$ oc get mc
$ oc get mcp worker -o yaml
$ oc get mcp master -o yaml
...
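
A compact way to run these checks (a sketch; the node name matches the one used earlier in this procedure):

# the issuer annotation should now carry the operator-generated signer name
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator \
    -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-issuer}{"\n"}'
# the worker pool spec should reference the new rendered config, while the
# Updated condition message still names the old one (this bug)
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].message}{"\n"}'
# the node should not have rebooted during the rollout
$ oc debug node/sno-0-0 -- chroot /host uptime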
Hi, Sinny Kumari, I have listed the steps in detail; please reach out to me if you still can't reproduce this bug.

Since I can't create my own SNO cluster on bare metal, I am working with Min Li to get access to a cluster where the bug is reproducible.

Looked into logs from the reproducible cluster that MinLi shared. I think we are seeing that the Updated status message in the worker pool is not getting refreshed because there is no node in the worker pool to update. This issue is not specific to kube cert rotation: we can reproduce it on any SNO cluster, or on any cluster with 0 worker nodes, by applying a MachineConfig targeted at the worker pool. With the current behavior, the status will be updated accordingly once one or more nodes are present in, or added to, the worker pool. Lowering the priority because this doesn't impact an SNO or regular cluster in any way, i.e. applying a MachineConfig and scaling nodes in a pool (or worker) up or down should work fine. We can look at updating the status message in the future when we have some free cycles.

This doesn't impact cluster behavior in any way. Nice to get fixed but not necessary, because there are no nodes in the worker pool. With other high-priority bugs and new feature work, the MCO team won't be able to fix it soon. Closing this for now; a new bug can be opened if this has a direct known impact from the customer end.
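
Since the stale message only needs a worker pool with zero nodes, a minimal reproducer sketch on any SNO cluster (the MachineConfig name and file content below are arbitrary examples, not from the original report):

# apply a trivial MachineConfig targeted at the (empty) worker pool
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-stale-status
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/test-stale-status
        mode: 0644
        contents:
          source: data:,hello
EOF
# spec moves to the newly rendered worker config...
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
# ...but the Updated message keeps naming the previous rendered config,
# because there are no worker nodes for the controller to update
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].message}{"\n"}'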
Description of problem:
The mcp worker rolls out to a new mc after kubelet cert rotation, but the description still shows the old mc info.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-03-221810

How reproducible:
always

Steps to Reproduce:
1. create an SNO cluster on a bare-metal platform
2. check the kube-apiserver-to-kubelet-signer secret and note the certificate's not-before and not-after times
# oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
...
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: kube-apiserver-to-kubelet-signer
    auth.openshift.io/certificate-not-after: "2022-06-04T03:50:54Z"
    auth.openshift.io/certificate-not-before: "2021-06-04T03:50:54Z"
3. ssh to the node and set the system time to the current time + 291 days 12 hours
# ./move_ahead_time.sh 291days12hours

move_ahead_time.sh:
# usage: ./move_ahead_time.sh 1year
# you can also use 12hours, 30days, 2month, 1year
date
sudo systemctl disable chronyd
sudo systemctl stop crio && sudo systemctl stop kubelet
MOVED_AHEAD_TIME=$1
future=$(date --date "+$MOVED_AHEAD_TIME" -u "+%Y-%m-%dT%H:%M:%SZ")
sudo date --set "$future"
date
sudo systemctl start crio && sudo systemctl start kubelet

4. check whether the secret kube-apiserver-to-kubelet-signer gets updated within the next 24 hours
# oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1648007460
    auth.openshift.io/certificate-not-after: "2023-03-23T03:51:00Z"
    auth.openshift.io/certificate-not-before: "2022-03-23T03:50:59Z"
5. # oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
00-worker                                          f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-master-container-runtime                        f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-master-kubelet                                  f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-worker-container-runtime                        f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-worker-kubelet                                  f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-master-generated-registries                     f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-master-ssh                                                                                 3.2.0             293d
99-worker-generated-registries                     f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-worker-ssh                                                                                 3.2.0             293d
rendered-master-078343783f187d05b77e1c7eeaead3b7   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             35h    // for mcp master
rendered-master-1837be79407b056e8fae1d37cbf5459e   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
rendered-worker-47864b741bb0b21bb726ced0df195252   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             35h    // for mcp worker
rendered-worker-8f9a0067e6e573e3a79185d8e64974ae   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
6. # oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-078343783f187d05b77e1c7eeaead3b7   True      False      False      1              1                   1                     0                      293d
worker   rendered-worker-47864b741bb0b21bb726ced0df195252   True      False      False      0              0                   0                     0                      293d
7. # oc get mcp worker -o yaml
...
spec:
  configuration:
    name: rendered-worker-47864b741bb0b21bb726ced0df195252
    source:
    ....
status:
  conditions:
  - lastTransitionTime: "2021-06-04T04:24:31Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae
    reason: ""
    status: "True"
    type: Updated

Actual results:
4. the secret kube-apiserver-to-kubelet-signer is updated
5. 2 new rendered mc are generated, mc-1 and mc-2
6. the mcp master rolls out to mc-1, the mcp worker rolls out to mc-2
7. the mcp worker description shows the old mc in the message "All nodes are updated with ..."

Expected results:
7. the mcp worker description shows the correct mc info in the message "All nodes are updated with ..."

Additional info: