Bug 1940950
| Summary: | vsphere: client/bootstrap CSR double create | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | MinLi <minmli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, minmli, tsweeney, zhsun |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1940899 | | |
| Clones: | 1943145 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:54:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1943145 | | |
Description
Michael Gugino
2021-03-19 15:43:47 UTC
TL/DR: each node is issuing 2 client certs within seconds of each other; this is unrelated to the CSR approval (which is also buggy on vsphere).

Looking at the journal from an affected worker node, ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824/artifacts/e2e-vsphere/gather-extra/artifacts/nodes/ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d/journal

It appears that the kubelet started before crio, then died, then crio started, and then everything worked fine. I think it's expected for the kubelet to create a CSR whenever it's in bootstrap mode and it starts, so that's fine. The bug is (probably?): why isn't the kubelet starting after CRI-O?

--snip--
Mar 19 06:39:13.993551 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.993524 1739 kubelet.go:453] Kubelet client is not nil
Mar 19 06:39:13.993922 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.993901 1739 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
Mar 19 06:39:13.994790 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.994750 1739 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
Mar 19 06:39:13.998086 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: E0319 06:39:13.998006 1739 remote_runtime.go:86] Version from runtime service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Mar 19 06:39:13.998227 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: E0319 06:39:13.998193 1739 kuberuntime_manager.go:205] Get runtime version failed: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Mar 19 06:39:13.998227 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: F0319 06:39:13.998221 1739 server.go:269] failed to run Kubelet: failed to create kubelet: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
--snip--
Mar 19 06:39:24.100515 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: Stopped Kubernetes Kubelet.
Mar 19 06:39:24.100665 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: kubelet.service: Consumed 0 CPU time
Mar 19 06:39:24.102432 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: Starting CRI-O Auto Update Script...
Mar 19 06:39:24.191706 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d crio[1754]: time="2021-03-19T06:39:24Z" level=info msg="Starting CRI-O, version: 1.21.0-29.rhaos4.8.git4fff699.el8-dev, git: ()"
Mar 19 06:39:24.197448 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d crio[1754]: time="2021-03-19 06:39:24.197388576Z" level=info msg="File /var/lib/crio/clean.shutdown not found. Wiping storage directory /var/lib/containers/storage because of suspected dirty shutdown"
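The ordering question above is normally answered with systemd unit ordering. As a minimal sketch only, not the actual OpenShift fix (the real kubelet.service and crio.service units on RHCOS are managed by the machine-config-operator, and the drop-in path here is assumed for illustration), a drop-in like the following would keep the kubelet from racing CRI-O:

```sh
# Hypothetical drop-in; unit and file names are assumptions, not the shipped fix.
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' > /etc/systemd/system/kubelet.service.d/10-after-crio.conf
[Unit]
# Wants= pulls crio.service into the same start transaction;
# After= delays kubelet startup until crio.service has started.
Wants=crio.service
After=crio.service
EOF
systemctl daemon-reload
```

Without such ordering the kubelet dials /var/run/crio/crio.sock before the socket exists, dies (as in the first excerpt), and issues a second bootstrap CSR when systemd restarts it.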
Comment

I don't see a similar error in recent 4.8 jobs:

https://search.ci.openshift.org/?search=crio-wipe.service%3A+Main+process+exited&maxAge=336h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

I also can't find errors with the following keywords:

- Failed to wipe storage cleanly
- Failed to shutdown storage before wiping
- crio-wipe.service: Failed with result
- crio.service: Job crio.service/start failed with result 'dependency'

Comment

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
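For anyone checking a cluster for the double-create symptom described above, a minimal sketch using standard oc output options (kubelet client CSRs use the kubernetes.io/kube-apiserver-client-kubelet signer; the column names are chosen here for illustration):

```sh
# List CSRs oldest-first; an affected node shows two client CSRs
# created within seconds of each other.
oc get csr --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,SIGNER:.spec.signerName,REQUESTOR:.spec.username,CREATED:.metadata.creationTimestamp
```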