+++ This bug was initially created as a clone of Bug #1940899 +++

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824/artifacts/e2e-vsphere/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-57df68f778-kxvw7_machine-approver-controller.log

Unlike the other certificates, csr-qcnzr was approved twice according to the logs:

I0319 06:50:03.569825 1 controller.go:174] CSR csr-qcnzr approved
W0319 06:50:04.468312 1 warnings.go:70] certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
I0319 06:50:04.468604 1 controller.go:174] CSR csr-qcnzr approved

--- Additional comment from Michael Gugino on 2021-03-19 15:38:42 UTC ---

So, these are a dump of the client cert requests from each node. These are the certs that are needed to join the cluster. There should only be 1 approved CSR for each host; however, there are 2.

The masters were approved during the bootstrap process. They all have 2 CSRs with very close timestamps, and they are approved within 1 minute. This tells me there's a problem with the kubelet issuing duplicate CSR requests. The workers also have 2 CSRs with very close creation timestamps; however, they are not approved until about 10 minutes later.

There are a few issues here:

1) The kubelet is wrong; I'll clone this bug to them.
2) The CSR approver approved a single CSR twice (csr-qcnzr); this should never happen.
3) The CSR approver didn't bother to do anything with the worker CSRs for 10 minutes (with no indication in the controller as to why); it looks like our object cache was never updated, or there's some bug in our controller.
4) We approved more than one client cert for a given node; this should never happen.

For number 4), I thought maybe the certs had been rotated and a new one just happened to be approved; however, my investigation reveals this is not the case.
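For issue 2, the double approval can be confirmed directly from the machine-approver log linked above by counting the "CSR <name> approved" lines. A minimal sketch, assuming the controller log has been saved locally under the hypothetical name machine-approver.log:

# Count how many times each CSR appears as approved in the machine-approver log.
# Any CSR with a count greater than 1 shows the double-approval symptom.
import re
from collections import Counter

approvals = Counter()
with open("machine-approver.log") as log:  # hypothetical local copy of the controller log
    for line in log:
        match = re.search(r"CSR (csr-\S+) approved", line)
        if match:
            approvals[match.group(1)] += 1

for name, count in approvals.items():
    if count > 1:
        print(f"{name} approved {count} times")  # e.g. csr-qcnzr approved 2 times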
Per-node client CSRs (commonName, OID 2.5.4.3, taken from each CSR's subject):

CSR        commonName                                            approver        creationTimestamp     approved timestamp
csr-6sd2m  system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-1       KubectlApprove  2021-03-19T06:30:47Z  2021-03-19T06:31:01Z
csr-dznfw  system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d   NodeCSRApprove  2021-03-19T06:39:25Z  2021-03-19T06:50:03Z
csr-fhc57  system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-7knv5   NodeCSRApprove  2021-03-19T06:40:04Z  2021-03-19T06:50:03Z
csr-k9zvc  system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-0       KubectlApprove  2021-03-19T06:30:48Z  2021-03-19T06:31:01Z
csr-l9ltl  system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-2       KubectlApprove  2021-03-19T06:30:43Z  2021-03-19T06:31:01Z
csr-m6559  system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-1       KubectlApprove  2021-03-19T06:30:29Z  2021-03-19T06:30:41Z
csr-p464c  system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-0       KubectlApprove  2021-03-19T06:30:30Z  2021-03-19T06:30:41Z
csr-pglzf  system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-7knv5   NodeCSRApprove  2021-03-19T06:39:48Z  2021-03-19T06:50:03Z
csr-pls9t  system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-2       KubectlApprove  2021-03-19T06:30:26Z  2021-03-19T06:30:41Z
csr-qcnzr  system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d   NodeCSRApprove  2021-03-19T06:39:08Z  2021-03-19T06:50:03Z
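A dump like the one above can be reproduced, and the duplicate check automated, by decoding each CSR's PEM request with the cryptography library and grouping approved client CSRs by their subject commonName. A minimal sketch, assuming the CSRs were exported with "oc get csr -o json" to a hypothetical csrs.json:

# Decode each CSR's request, extract the node identity (subject commonName),
# and flag nodes that have more than one approved kubelet client CSR.
import base64
import json
from collections import defaultdict

from cryptography import x509
from cryptography.x509.oid import NameOID

with open("csrs.json") as f:  # hypothetical output of: oc get csr -o json > csrs.json
    items = json.load(f)["items"]

by_node = defaultdict(list)
for csr in items:
    pem = base64.b64decode(csr["spec"]["request"])
    req = x509.load_pem_x509_csr(pem)
    cn = req.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value
    conditions = csr.get("status", {}).get("conditions", [])
    approved = any(c.get("type") == "Approved" for c in conditions)
    # only look at kubelet client certs, not serving certs
    is_client = csr["spec"].get("signerName") == "kubernetes.io/kube-apiserver-client-kubelet"
    if approved and is_client and cn.startswith("system:node:"):
        by_node[cn].append(csr["metadata"]["name"])

for node, names in sorted(by_node.items()):
    if len(names) > 1:
        print(f"{node}: {len(names)} approved client CSRs ({', '.join(names)})")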
TL/DR: each node is issuing 2 client certs within seconds of each other; this is unrelated to the CSR approval (which is also buggy on vSphere).
Looking at the journal from an affected worker node, ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824/artifacts/e2e-vsphere/gather-extra/artifacts/nodes/ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d/journal

It appears that the kubelet started before CRI-O, died, then CRI-O started, and after that everything worked fine. I think it's expected for the kubelet to create a CSR whenever it's in bootstrap mode and it starts, so that part is fine. The bug is (probably?) why the kubelet is starting before CRI-O.

--snip--
Mar 19 06:39:13.993551 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.993524 1739 kubelet.go:453] Kubelet client is not nil
Mar 19 06:39:13.993922 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.993901 1739 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
Mar 19 06:39:13.994790 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.994750 1739 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
Mar 19 06:39:13.998086 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: E0319 06:39:13.998006 1739 remote_runtime.go:86] Version from runtime service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Mar 19 06:39:13.998227 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: E0319 06:39:13.998193 1739 kuberuntime_manager.go:205] Get runtime version failed: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Mar 19 06:39:13.998227 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: F0319 06:39:13.998221 1739 server.go:269] failed to run Kubelet: failed to create kubelet: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
--snip--
Mar 19 06:39:24.100515 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: Stopped Kubernetes Kubelet.
Mar 19 06:39:24.100665 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: kubelet.service: Consumed 0 CPU time
Mar 19 06:39:24.102432 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: Starting CRI-O Auto Update Script...
Mar 19 06:39:24.191706 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d crio[1754]: time="2021-03-19T06:39:24Z" level=info msg="Starting CRI-O, version: 1.21.0-29.rhaos4.8.git4fff699.el8-dev, git: ()"
Mar 19 06:39:24.197448 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d crio[1754]: time="2021-03-19 06:39:24.197388576Z" level=info msg="File /var/lib/crio/clean.shutdown not found. Wiping storage directory /var/lib/containers/storage because of suspected dirty shutdown"
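One way to confirm the ordering from a node journal is to compare the timestamp of the kubelet's first failed dial of crio.sock against the timestamp of the first "Starting CRI-O" message. A minimal sketch, assuming the journal above was saved locally under the hypothetical name worker-journal.txt:

# Report when the kubelet first failed to reach crio.sock and when CRI-O first
# reported starting; if the failure precedes the start, the kubelet came up
# before CRI-O. Assumes the node journal was saved as worker-journal.txt.
first_dial_failure = None
first_crio_start = None

with open("worker-journal.txt") as journal:
    for line in journal:
        # journald lines start with e.g. "Mar 19 06:39:13.998086 <host> ..."
        timestamp = " ".join(line.split()[:3])
        if first_dial_failure is None and "/var/run/crio/crio.sock: connect: no such file or directory" in line:
            first_dial_failure = timestamp
        if first_crio_start is None and "Starting CRI-O, version" in line:
            first_crio_start = timestamp

print("first crio.sock dial failure:", first_dial_failure)
print("first CRI-O startup message: ", first_crio_start)
# string comparison is good enough here because both events fall on the same day
if first_dial_failure and first_crio_start and first_dial_failure < first_crio_start:
    print("kubelet tried to reach CRI-O before CRI-O started")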
I don't see a similar error in recent 4.8 jobs: https://search.ci.openshift.org/?search=crio-wipe.service%3A+Main+process+exited&maxAge=336h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

I also can't find the error with the following keywords:

Failed to wipe storage cleanly
Failed to shutdown storage before wiping
crio-wipe.service: Failed with result
crio.service: Job crio.service/start failed with result 'dependency'
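A minimal sketch of building the corresponding search URLs for each of those keywords, reusing the query parameters from the link above (it only prints the URLs to check by hand):

# Build search.ci.openshift.org query URLs for each keyword, using the same
# parameters as the crio-wipe query above.
import urllib.parse

keywords = [
    "Failed to wipe storage cleanly",
    "Failed to shutdown storage before wiping",
    "crio-wipe.service: Failed with result",
    "crio.service: Job crio.service/start failed with result 'dependency'",
]

common = {
    "maxAge": "336h",
    "context": "1",
    "type": "junit",
    "maxMatches": "5",
    "maxBytes": "20971520",
    "groupBy": "job",
}

for kw in keywords:
    print("https://search.ci.openshift.org/?" + urllib.parse.urlencode({"search": kw, **common}))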
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438