Bug 1968409
| Summary: | SNO: mcp worker's description shows outdated info after a new MachineConfig has been applied | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | MinLi <minmli> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | aos-bugs, mkrejci, rioliu, skumari |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-08 17:58:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
# oc get mcp worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: "2021-06-04T04:19:25Z"
generation: 3
labels:
machineconfiguration.openshift.io/mco-built-in: ""
pools.operator.machineconfiguration.openshift.io/worker: ""
name: worker
resourceVersion: "281426"
uid: 64074cc2-dabd-4aaa-ab7c-e9a96bb26f6d
spec:
configuration:
name: rendered-worker-47864b741bb0b21bb726ced0df195252
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
machineConfigSelector:
matchLabels:
machineconfiguration.openshift.io/role: worker
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker: ""
paused: false
status:
conditions:
- lastTransitionTime: "2021-06-04T04:24:31Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae
reason: ""
status: "True"
type: Updated
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: ""
reason: ""
status: "False"
type: Updating
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: ""
reason: ""
status: "False"
type: NodeDegraded
- lastTransitionTime: "2021-06-04T04:24:32Z"
message: ""
reason: ""
status: "False"
type: Degraded
configuration:
name: rendered-worker-47864b741bb0b21bb726ced0df195252
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 0
machineCount: 0
observedGeneration: 3
readyMachineCount: 0
unavailableMachineCount: 0
updatedMachineCount: 0
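
A quick way to surface the mismatch described in this bug (a minimal sketch, assuming cluster-admin access with oc) is to compare the rendered config the pool's spec points at with the one named in the Updated condition message:

# sketch: on an affected cluster these two names differ
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
rendered-worker-47864b741bb0b21bb726ced0df195252
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].message}{"\n"}'
All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae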
Will look into this. Do you know if this behaviour is SingleNode only?

This kubelet cert rotation case can only be tested on an SNO bare-metal cluster, so I have only seen it on an SNO environment.

We don't have hardware to test it in a bare-metal environment, but we can try to reproduce this issue on an aws/gcp SNO cluster or on a regular cluster. @MinLi Meanwhile, can you please provide a must-gather of the cluster where you can reproduce this issue?

must-gather failed due to multiple errors, such as ImagePullBackOff for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:38b87c0e3b0e4ffa55888.. and some containers not found. I need more time to rebuild a new cluster and get the must-gather.

I can't reproduce it so far due to other issues: after changing the date on the node and rebooting the system, the node is no longer accessible from the cluster.

Hi, Sinny Kumari, this is my test case; you can follow these steps: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-40820 (case: Validate kube-apiserver-to-kubelet-signer cert rotate without node rebooting in SNO)
step1:
create an SNO cluster on a bare-metal env (cloud provider platforms such as aws don't support setting the system time ahead); a quick topology check follows the infrastructure output below
[root@ocp-edge49 ~]# oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
creationTimestamp: "2021-04-19T02:33:44Z"
generation: 1
name: cluster
resourceVersion: "636"
uid: e45bc488-7e94-40cf-9d3a-5dde09ddb460
spec:
cloudConfig:
name: ""
platformSpec:
type: None
status:
apiServerInternalURI: https://api-int.minli-sno1-0.qe.lab.redhat.com:6443
apiServerURL: https://api.minli-sno1-0.qe.lab.redhat.com:6443
controlPlaneTopology: SingleReplica
etcdDiscoveryDomain: ""
infrastructureName: minli-sno1-0-4ngfq
infrastructureTopology: SingleReplica // take care
platform: None
platformStatus:
type: None
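
To double-check that the cluster is really single-node before proceeding, a short verification sketch (standard oc jsonpath queries; the node name will differ per cluster):

# both topology fields report SingleReplica on an SNO cluster
$ oc get infrastructure cluster -o jsonpath='{.status.controlPlaneTopology}{" "}{.status.infrastructureTopology}{"\n"}'
SingleReplica SingleReplica
# a single node should carry both the master and worker roles
$ oc get nodes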
step2:
check the kube-apiserver-to-kubelet-signer secret and note the certificate's start (not-before) and end (not-after) times; a jsonpath sketch follows the output below
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
...
kind: Secret
metadata:
annotations:
auth.openshift.io/certificate-issuer: kube-apiserver-to-kubelet-signer
auth.openshift.io/certificate-not-after: "2022-04-19T02:23:29Z"
auth.openshift.io/certificate-not-before: "2021-04-19T02:23:29Z"
creationTimestamp: "2021-04-19T02:33:40Z"
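
The two validity annotations can also be pulled out directly; a minimal sketch (dots inside the annotation keys have to be escaped in jsonpath):

# record the signer certificate's validity window for comparison after the time jump
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator \
    -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-before}{"\n"}{.metadata.annotations.auth\.openshift\.io/certificate-not-after}{"\n"}'
2021-04-19T02:23:29Z
2022-04-19T02:23:29Z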
step3:
check that all pods, nodes, and cluster operator statuses are normal
Also note the timestamp of the cert file on the node and keep a copy of it, so it can be compared with the file after ~292 days (see the sketch below):
[core@sno-0-0 ~]$ ll /etc/kubernetes/kubelet-ca.crt
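
A small sketch of what keeping that copy can look like (the backup filename is arbitrary); the file content and mtime should change once the signer rotates:

# keep a pre-rotation copy of the kubelet CA bundle for a later diff
[core@sno-0-0 ~]$ sudo cp -p /etc/kubernetes/kubelet-ca.crt /var/home/core/kubelet-ca.crt.before
# ... after the rotation in step 5:
[core@sno-0-0 ~]$ sudo diff /var/home/core/kubelet-ca.crt.before /etc/kubernetes/kubelet-ca.crt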
step4:
ssh to the node and set the system time to the current time + 291 days 12 hours
./move_ahead_time.sh 291days12hours
note: after resetting the server time, you also need to set the same time on the client host where the oc client runs; otherwise the client's cert is treated as expired and you get the following error:
# oc get node
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2021-04-21T12:22:35+03:00 is before 2022-02-04T14:18:26Z
After resetting the client time (# date --set '2022-02-04T14:59:17Z'), you need to approve the pending CSRs (a bulk-approval sketch follows the CSR listing below):
[root@ocp-edge49 ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-6j45j 38m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-g7brm 23m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-vt5dn 53m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-zhzdn 8m30s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
[root@ocp-edge49 ~]# oc adm certificate approve csr-zhzdn
certificatesigningrequest.certificates.k8s.io/csr-zhzdn approved
[root@ocp-edge49 ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-6j45j 39m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-g7brm 24m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-vt5dn 54m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-zhzdn 9m10s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
[root@ocp-edge49 ~]# oc get node
The connection to the server api.minli-sno1-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
[root@ocp-edge49 ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-6j45j 42m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-g7brm 27m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-h5hkm 3m11s kubernetes.io/kubelet-serving system:node:sno-0-0 Pending // need to approve
csr-vt5dn 57m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-zhzdn 12m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
[root@ocp-edge49 ~]# oc adm certificate approve csr-h5hkm
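
If several CSRs pile up while the clock jumps, they can be approved in one pass; a sketch using the standard go-template trick for listing pending requests (approve with care, since this trusts every outstanding request on the cluster):

# approve every CSR that has no status yet (i.e. is still Pending)
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve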
step5:
check whether the secret kube-apiserver-to-kubelet-signer gets updated within the next 24 hours, and check whether the node reboots during this period (see the sketch after the command listing below).
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
metadata:
annotations:
auth.openshift.io/certificate-issuer: openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1644027819
auth.openshift.io/certificate-not-after: "2023-02-05T02:23:39Z"
auth.openshift.io/certificate-not-before: "2022-02-05T02:23:38Z"
creationTimestamp: "2021-04-19T02:33:40Z"
$ oc get mc
$ oc get mcp worker -o yaml
$ oc get mcp master -o yaml
...
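
A compact way to run these checks (a sketch; the node name matches the one used earlier in this procedure):

# the issuer annotation should now carry the operator-generated signer name
$ oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator \
    -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-issuer}{"\n"}'
# the worker pool spec should reference the new rendered config, while the
# Updated condition message still names the old one (this bug)
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].message}{"\n"}'
# the node should not have rebooted during the rollout
$ oc debug node/sno-0-0 -- chroot /host uptime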
Hi, Sinny Kumari, I have listed the steps in detail; please reach out to me if you still can't reproduce this bug.

Since I can't create my own SNO cluster on bare metal, I am working with Min Li to get access to a cluster where the bug is reproducible.

Looked into logs from the reproducible cluster that MinLi shared. I think we are seeing that the Updated status message in the worker pool is not getting refreshed because there is no node in the worker pool to update. This issue is not specific to kube cert rotation: we can reproduce it on any SNO cluster, or on any cluster with 0 worker nodes, by applying a MachineConfig targeted at the worker pool. With the current behavior, the status will be updated accordingly once one or more nodes are present in, or added to, the worker pool. Lowering the priority because this doesn't impact an SNO or regular cluster in any way, i.e. applying a MachineConfig and scaling nodes in a pool (or worker) up or down should work fine. We can look at updating the status message in the future when we have some free cycles.

This doesn't impact cluster behavior in any way. Nice to get fixed but not necessary, because there are no nodes in the worker pool. With other high-priority bugs and new feature work, the MCO team won't be able to fix it soon. Closing this for now; a new bug can be opened if this has a direct known impact from the customer end.
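
Since the stale message only needs a worker pool with zero nodes, a minimal reproducer sketch on any SNO cluster (the MachineConfig name and file content below are arbitrary examples, not from the original report):

# apply a trivial MachineConfig targeted at the (empty) worker pool
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-stale-status
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/test-stale-status
        mode: 0644
        contents:
          source: data:,hello
EOF
# spec moves to the newly rendered worker config...
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
# ...but the Updated message keeps naming the previous rendered config,
# because there are no worker nodes for the controller to update
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].message}{"\n"}'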
Description of problem:
The mcp worker rolls out to a new mc after kubelet cert rotation, but the description still shows the old mc info.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-03-221810

How reproducible:
always

Steps to Reproduce:
1. create an SNO cluster on a bare-metal platform
2. check the kube-apiserver-to-kubelet-signer secret and note the certificate's not-before and not-after times
# oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
...
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: kube-apiserver-to-kubelet-signer
    auth.openshift.io/certificate-not-after: "2022-06-04T03:50:54Z"
    auth.openshift.io/certificate-not-before: "2021-06-04T03:50:54Z"
3. ssh to the node and set the system time to the current time + 291 days 12 hours
# ./move_ahead_time.sh 291days12hours

move_ahead_time.sh:
# usage: ./move_ahead_time.sh 1year
# you can also use 12hours, 30days, 2month, 1year
date
sudo systemctl disable chronyd
sudo systemctl stop crio && sudo systemctl stop kubelet
MOVED_AHEAD_TIME=$1
future=$(date --date "+$MOVED_AHEAD_TIME" -u "+%Y-%m-%dT%H:%M:%SZ")
sudo date --set "$future"
date
sudo systemctl start crio && sudo systemctl start kubelet

4. check whether the secret kube-apiserver-to-kubelet-signer gets updated within the next 24 hours
# oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o yaml
kind: Secret
metadata:
  annotations:
    auth.openshift.io/certificate-issuer: openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1648007460
    auth.openshift.io/certificate-not-after: "2023-03-23T03:51:00Z"
    auth.openshift.io/certificate-not-before: "2022-03-23T03:50:59Z"
5. # oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
00-worker                                          f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-master-container-runtime                        f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-master-kubelet                                  f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-worker-container-runtime                        f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
01-worker-kubelet                                  f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-master-generated-registries                     f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-master-ssh                                                                                 3.2.0             293d
99-worker-generated-registries                     f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
99-worker-ssh                                                                                 3.2.0             293d
rendered-master-078343783f187d05b77e1c7eeaead3b7   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             35h    // for mcp master
rendered-master-1837be79407b056e8fae1d37cbf5459e   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
rendered-worker-47864b741bb0b21bb726ced0df195252   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             35h    // for mcp worker
rendered-worker-8f9a0067e6e573e3a79185d8e64974ae   f289dc9a2ba85bcfadbdb6ddbddbb5fa6278428c   3.2.0             293d
6. # oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-078343783f187d05b77e1c7eeaead3b7   True      False      False      1              1                   1                     0                      293d
worker   rendered-worker-47864b741bb0b21bb726ced0df195252   True      False      False      0              0                   0                     0                      293d
7. # oc get mcp worker -o yaml
...
spec:
  configuration:
    name: rendered-worker-47864b741bb0b21bb726ced0df195252
    source:
    ....
status:
  conditions:
  - lastTransitionTime: "2021-06-04T04:24:31Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-04T04:24:32Z"
    message: All nodes are updated with rendered-worker-8f9a0067e6e573e3a79185d8e64974ae
    reason: ""
    status: "True"
    type: Updated

Actual results:
4. the secret kube-apiserver-to-kubelet-signer is updated
5. 2 new rendered mc are generated, mc-1 and mc-2
6. the mcp master rolls out to mc-1, the mcp worker rolls out to mc-2
7. the mcp worker description shows the old mc in the message "All nodes are updated with ..."

Expected results:
7. the mcp worker description shows the correct mc info in the message "All nodes are updated with ..."

Additional info: