https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824/artifacts/e2e-vsphere/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-57df68f778-kxvw7_machine-approver-controller.log

Unlike the other certificates, csr-qcnzr was approved twice according to the logs:

I0319 06:50:03.569825 1 controller.go:174] CSR csr-qcnzr approved
W0319 06:50:04.468312 1 warnings.go:70] certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
I0319 06:50:04.468604 1 controller.go:174] CSR csr-qcnzr approved
So, these are a dump of the client cert requests from each node. These are the certs that are needed to join the cluster. There should only be 1 approved CSR for each host, however, there are 2. The masters were approved during the bootstrap process. They all have 2 CSRs with very close creation timestamps, and they are approved within 1 minute. This tells me there's a problem with the kubelet issuing duplicate CSR requests. The workers also have 2 CSRs with very close creation timestamps, however, they are not approved until about 10 minutes later.

There are a few issues here:

1) The kubelet is issuing duplicate CSR requests; I'll clone this bug to them.
2) The CSR approver approved a single CSR twice (csr-qcnzr); this should never happen.
3) The CSR approver didn't do anything with the worker CSRs for 10 minutes (with no indication in the controller logs as to why); it looks like our object cache was never updated, or there's some bug in our controller.
4) We approved more than one client cert for a given node; this should never happen.

For number 4), I thought maybe the certs were rotated and a new one just happened to be approved; however, my investigation reveals this is not the case.

csr-6sd2m
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-1')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:47Z
approved Timestamp: 2021-03-19T06:31:01Z

csr-dznfw
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:39:25Z
approved Timestamp: 2021-03-19T06:50:03Z

csr-fhc57
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-7knv5')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:40:04Z
approved Timestamp: 2021-03-19T06:50:03Z

csr-k9zvc
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-0')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:48Z
approved Timestamp: 2021-03-19T06:31:01Z

csr-l9ltl
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-2')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:43Z
approved Timestamp: 2021-03-19T06:31:01Z

csr-m6559
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-1')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:29Z
approved Timestamp: 2021-03-19T06:30:41Z

csr-p464c
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-0')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:30Z
approved Timestamp: 2021-03-19T06:30:41Z

csr-pglzf
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-7knv5')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:39:48Z
approved Timestamp: 2021-03-19T06:50:03Z

csr-pls9t
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-2')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:26Z
approved Timestamp: 2021-03-19T06:30:41Z

csr-qcnzr
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:39:08Z
approved Timestamp: 2021-03-19T06:50:03Z
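(For reference, a dump like the one above can be produced by listing the CSRs and decoding the embedded PKCS#10 request. The original dump came from a separate script, so the Go sketch below only approximates it; the program name and output formatting are illustrative.)

// List all CSRs and print the CN, creation timestamp, approver, and approval
// timestamp. A minimal sketch assuming a kubeconfig at the default location.
package main

import (
	"context"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"time"

	certsv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	csrs, err := cs.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, csr := range csrs.Items {
		// Spec.Request is the PEM-encoded PKCS#10 request; the CN lives in its subject.
		block, _ := pem.Decode(csr.Spec.Request)
		if block == nil {
			continue
		}
		req, err := x509.ParseCertificateRequest(block.Bytes)
		if err != nil {
			continue
		}
		fmt.Printf("%s\n", csr.Name)
		fmt.Printf("cn name: %s\n", req.Subject.CommonName)
		fmt.Printf("creationTimestamp: %s\n", csr.CreationTimestamp.UTC().Format(time.RFC3339))
		approved := false
		for _, c := range csr.Status.Conditions {
			if c.Type == certsv1.CertificateApproved {
				approved = true
				fmt.Printf("approver: %s\n", c.Reason)
				fmt.Printf("approved Timestamp: %s\n", c.LastUpdateTime.UTC().Format(time.RFC3339))
			}
		}
		if !approved {
			fmt.Println("no status info found")
		}
		fmt.Println()
	}
}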
This test displays symptom 3 (the CSR approver didn't do anything with CSRs for x minutes):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1372555862157037568

csr-5z8zn
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:31:25Z
no status info found

csr-6zxxb
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:00:53Z
no status info found

csr-9x2qs
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:46:57Z
no status info found

csr-ckbfc
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-168-242.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:17:22Z
approver: NodeCSRApprove
approved Timestamp: 2021-03-18T15:17:22Z

csr-cl6qf
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:01:42Z
no status info found

csr-fnmbr
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-139-100.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:10:49Z
approver: KubectlApprove
approved Timestamp: 2021-03-18T15:10:54Z

csr-g4rjw
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:46:33Z
no status info found

csr-hw7nh
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-130-192.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:16:41Z
approver: NodeCSRApprove
approved Timestamp: 2021-03-18T15:16:41Z

csr-lqbjc
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-213-43.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:09:52Z
approver: KubectlApprove
approved Timestamp: 2021-03-18T15:09:53Z

csr-nqwj8
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-250-199.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:16:24Z
approver: NodeCSRApprove
approved Timestamp: 2021-03-18T15:16:24Z

csr-pggmb
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:16:45Z
no status info found

csr-rzl75
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:16:20Z
no status info found

csr-vh85w
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-170-17.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:10:51Z
approver: KubectlApprove
approved Timestamp: 2021-03-18T15:10:54Z

csr-zrk4j
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:31:49Z
no status info found

The CSRs in question are for two newly created masters (the old ones appear to have been deleted via the machine-api): ip-10-0-232-128.us-west-1.compute.internal and ip-10-0-247-235.us-west-1.compute.internal. These were both created around "creationTimestamp": "2021-03-18T15:59:04Z".

CSR approver logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1372555862157037568/artifacts/e2e-aws-disruptive/pods/openshift-cluster-machine-approver_machine-approver-586685f686-9gq7b_machine-approver-controller.log

The API server appears to be disrupted around 15:47:

15:47:04.646) (total time: 60044ms):
Trace[1568160156]: [1m0.044043849s] [1m0.044043849s] END
E0318 15:48:04.690950 1 reflector.go:138] github.com/openshift/cluster-machine-approver/status.go:99: Failed to watch *v1.ClusterOperator: failed to list *v1.ClusterOperator: the server was unable to return a response in the time allotted, but may still be processing the request (get clusteroperators.config.openshift.io)
I0318 15:48:47.079915 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0318 15:49:18.096941 1 trace.go:205] Trace[867160953]: "Reflector ListAndWatch" name:github.com/openshift/cluster-machine-approver/status.go:99 (18-Mar-2021 15:48:06.583) (total time: 71513ms):

It then seems to recover around 15:58. Since this is a disruptive test, this is probably expected.

Since the replacement master machines are in the Provisioned state and the cloud status indicates "Running", it appears the machine-controller was working correctly.

Machine controller logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1372555862157037568/artifacts/e2e-aws-disruptive/pods/openshift-machine-api_machine-api-controllers-958894c8-cg94r_machine-controller.log

These indicate that master-0 and master-2 were indeed deleted via the machine-api. Reviewing the machine-api logs, everything seemed to go fine in deleting and recreating the instance with the same name. The finalizer keeps the existing machine object around until the instance state goes to 'terminated', then the CI re-creates the machine object with the same name. The controller then correctly ignores the 'terminated' instance and creates a new one. Eventually, it's just stuck in a "machine has no node, requeing" loop. At this point, the CSRs should have been approved, but they weren't, and the node correctly re-issues a CSR request every 15 minutes until the test terminates.
> There should only be 1 approved CSR for each host, however, there are 2.

I was under the impression that kubelet requests two CSRs per node, one for the client certificate and one for the serving certificate. AFAIK this is true in OpenShift as well as regular Kubernetes.

> The CSR approver approved a single CSR twice (csr-qcnzr); this should never happen.

While ideally this shouldn't happen, is it actually harmful/detrimental? This is an artifact of using informers and building controllers that are eventually consistent. If we are using a cached client, this can and may happen, and the system should be able to account for it (I actually don't think it's an issue that it approved the same cert twice; the API request would be a no-op the second time). If we absolutely must not do this for some reason, then we should make sure we aren't using a cached API reader and always fetch the latest version.

> The CSR approver didn't do anything with the worker CSRs for 10 minutes

This sounds like a real bug; it sounds like the watch has not been set up correctly. I think we should focus on finding the issue here first rather than looking into the other two issues highlighted.
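To illustrate the "no-op the second time" point, this is roughly what a check-before-approve guard looks like. This is only a sketch against the certificates.k8s.io/v1 API, not the actual approver code; the reason/message strings and function names are illustrative.

package approver

import (
	"context"

	certsv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	certsclient "k8s.io/client-go/kubernetes/typed/certificates/v1"
)

// isApproved reports whether the CSR already carries an Approved condition.
func isApproved(csr *certsv1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certsv1.CertificateApproved && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// approveOnce approves the CSR only if the copy we have doesn't already show
// it as approved. Because that copy may come from an informer cache, two
// reconciles can still race; per the discussion above, the second approval is
// expected to be effectively a no-op.
func approveOnce(ctx context.Context, csrClient certsclient.CertificateSigningRequestInterface, csr *certsv1.CertificateSigningRequest) error {
	if isApproved(csr) {
		return nil
	}
	csr.Status.Conditions = append(csr.Status.Conditions, certsv1.CertificateSigningRequestCondition{
		Type:    certsv1.CertificateApproved,
		Status:  corev1.ConditionTrue,
		Reason:  "NodeCSRApprove", // reason string taken from the dump above
		Message: "CSR approved (illustrative message)",
	})
	_, err := csrClient.UpdateApproval(ctx, csr.Name, csr, metav1.UpdateOptions{})
	return err
}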
(In reply to Joel Speed from comment #3)

> > There should only be 1 approved CSR for each host, however, there are 2.
>
> I was under the impression that kubelet requests two CSRs per node, one for
> the client certificate and one for the serving certificate. AFAIK this is
> true in OpenShift as well as regular Kubernetes.

Yes, I slightly misspoke. There are typically two requests, 1 client and 1 server. I'm referring just to the client certs in these cases.

> While ideally this shouldn't happen, is it actually harmful/detrimental?
> This is an artifact of using informers and building controllers that are
> eventually consistent. If we are using a cached client, this can and may happen,
> and the system should be able to account for it (I actually don't think
> it's an issue that it approved the same cert twice; the API request
> would be a no-op the second time). If we absolutely must not do this for
> some reason, then we should make sure we aren't using a cached API reader and
> always fetch the latest version.

Yeah, the severity on these is pretty low IMO. Need to evaluate the level of effort for a fix. I think keeping the informer cache and then fetching the CSR without the cache at some point if it's unapproved might make sense. We could definitely do a hard get on whether or not a node already exists; that would be the biggest security win there (someone could submit a CSR request just after a node joins the cluster but before our cache syncs, if they can monitor node names, or if our cache fails to sync for some reason like it does today).

> > The CSR approver didn't do anything with the worker CSRs for 10 minutes
>
> This sounds like a real bug; it sounds like the watch has not been set up
> correctly. I think we should focus on finding the issue here first rather
> than looking into the other two issues highlighted.

I agree, this is the higher priority.
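Roughly what that hard get on node existence could look like, as a sketch assuming we have an uncached reader available (function and variable names are illustrative, not the approver's current code):

package approver

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeAlreadyExists asks the API server directly (bypassing the informer
// cache) whether a node with the given name is already registered. A client
// CSR for a name that already has a node could then be held for manual review
// instead of being auto-approved. Illustrative sketch only.
func nodeAlreadyExists(ctx context.Context, apiReader client.Reader, nodeName string) (bool, error) {
	node := &corev1.Node{}
	err := apiReader.Get(ctx, client.ObjectKey{Name: nodeName}, node)
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}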
> Yes, I slightly misspoke. There are typically two requests, 1 client and 1 server. I'm referring just to the client certs in these cases.

Ahh, I didn't notice that nuance! I wonder if this could be something to do with the kubelet not completing the cycle correctly, i.e. not getting to store the created cert before it is restarted as part of an MCO update or something. If we were to move to individual machine bootstrap tokens (which I think we should eventually) then this would cause issues, but I don't think it's huge/disruptive right now, just odd. Would be good to find out what's gone wrong here.

> Yeah, the severity on these is pretty low IMO. Need to evaluate the level of
> effort for a fix. I think keeping the informer cache and then fetching the
> CSR without the cache at some point if it's unapproved might make sense. We
> could definitely do a hard get on whether or not a node already exists; that
> would be the biggest security win there (someone could submit a CSR request
> just after a node joins the cluster but before our cache syncs, if they can
> monitor node names, or if our cache fails to sync for some reason like it
> does today).

From the controller-runtime manager we can get an uncached client that is set up ready for us, so agreed, let's update it to use that to re-fetch the cert when we think it's unapproved, as a double check. That should be a relatively straightforward fix.

As for the not reconciling for 10 minutes, it sounds like it somehow missed the events and/or just didn't put them on the queue for some reason; we will need to do some testing to work out why.
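For reference, the uncached client comes straight off the manager; a rough sketch of wiring it in and re-fetching the CSR before approval (type and function names are illustrative, not the actual change):

package approver

import (
	"context"

	certsv1 "k8s.io/api/certificates/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

type csrReconciler struct {
	client    client.Client // cached, informer-backed reads/writes
	apiReader client.Reader // uncached, always hits the API server
}

// newCSRReconciler wires in both the cached client and the manager's API
// reader. Sketch only; the actual approver wiring may differ.
func newCSRReconciler(mgr manager.Manager) *csrReconciler {
	return &csrReconciler{
		client:    mgr.GetClient(),
		apiReader: mgr.GetAPIReader(),
	}
}

// freshCSR re-fetches the CSR directly from the API server so an approval
// decision is based on the latest state rather than a possibly stale cache
// entry.
func (r *csrReconciler) freshCSR(ctx context.Context, name string) (*certsv1.CertificateSigningRequest, error) {
	csr := &certsv1.CertificateSigningRequest{}
	if err := r.apiReader.Get(ctx, client.ObjectKey{Name: name}, csr); err != nil {
		return nil, err
	}
	return csr, nil
}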
I tried this out yesterday on AWS and was unable to reproduce. For completeness, we should also try to reproduce on vSphere and see if there are continuing failures there. That said, if we are not seeing the wait issue on other platforms, I don't think this is a cluster-machine-approver issue.
I'm probably not going to get to this soon. Feel free to take a look or decide if we want to address this. It's pretty minor.
The double approve issue isn't really an issue. We are aware of the issue of CSRs not being approved for longer periods and are tracking that in other bugs. I don't think we need to keep this one open separately for now.