https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824/artifacts/e2e-vsphere/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-57df68f778-kxvw7_machine-approver-controller.log

Unlike the other certificates, csr-qcnzr was approved twice according to the logs:

I0319 06:50:03.569825 1 controller.go:174] CSR csr-qcnzr approved
W0319 06:50:04.468312 1 warnings.go:70] certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
I0319 06:50:04.468604 1 controller.go:174] CSR csr-qcnzr approved
So, these are a dump of the client cert requests from each node. These are the certs that are needed to join the cluster. There should only be 1 approved CSR for each host, however, there are 2. The masters were approved during the bootstrap process. They all have 2 CSRs with very close creation timestamps, and they are approved within 1 minute. This tells me there's a problem with the kubelet issuing duplicate CSR requests. The workers also have 2 CSRs with very close creation timestamps, however, they are not approved until about 10 minutes later.

There are a few issues here:

1) The kubelet is issuing duplicate CSR requests; I'll clone this bug to them.
2) The CSR approver approved a single CSR twice (csr-qcnzr); this should never happen.
3) The CSR approver didn't do anything with the worker CSRs for 10 minutes (with no indication in the controller logs as to why); it looks like our object cache was never updated, or there's some bug in our controller.
4) We approved more than one client cert for a given node; this should never happen.

For number 4), I thought maybe the certs were rotated and a new one just happened to be approved; however, my investigation reveals this is not the case.

csr-6sd2m
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-1')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:47Z
approved Timestamp: 2021-03-19T06:31:01Z

csr-dznfw
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:39:25Z
approved Timestamp: 2021-03-19T06:50:03Z

csr-fhc57
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-7knv5')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:40:04Z
approved Timestamp: 2021-03-19T06:50:03Z

csr-k9zvc
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-0')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:48Z
approved Timestamp: 2021-03-19T06:31:01Z

csr-l9ltl
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-2')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:43Z
approved Timestamp: 2021-03-19T06:31:01Z

csr-m6559
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-1')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:29Z
approved Timestamp: 2021-03-19T06:30:41Z

csr-p464c
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-0')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:30Z
approved Timestamp: 2021-03-19T06:30:41Z

csr-pglzf
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-7knv5')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:39:48Z
approved Timestamp: 2021-03-19T06:50:03Z

csr-pls9t
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-master-2')>
approver: KubectlApprove
creationTimestamp: 2021-03-19T06:30:26Z
approved Timestamp: 2021-03-19T06:30:41Z

csr-qcnzr
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d')>
approver: NodeCSRApprove
creationTimestamp: 2021-03-19T06:39:08Z
approved Timestamp: 2021-03-19T06:50:03Z
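(For reference, a dump like the one above can be produced by listing the CSRs and decoding the embedded PKCS#10 request. The original dump came from a separate script, so the Go sketch below only approximates it; the program name and output formatting are illustrative.)

// List all CSRs and print the CN, creation timestamp, approver, and approval
// timestamp. A minimal sketch assuming a kubeconfig at the default location.
package main

import (
	"context"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"time"

	certsv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	csrs, err := cs.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, csr := range csrs.Items {
		// Spec.Request is the PEM-encoded PKCS#10 request; the CN lives in its subject.
		block, _ := pem.Decode(csr.Spec.Request)
		if block == nil {
			continue
		}
		req, err := x509.ParseCertificateRequest(block.Bytes)
		if err != nil {
			continue
		}
		fmt.Printf("%s\n", csr.Name)
		fmt.Printf("cn name: %s\n", req.Subject.CommonName)
		fmt.Printf("creationTimestamp: %s\n", csr.CreationTimestamp.UTC().Format(time.RFC3339))
		approved := false
		for _, c := range csr.Status.Conditions {
			if c.Type == certsv1.CertificateApproved {
				approved = true
				fmt.Printf("approver: %s\n", c.Reason)
				fmt.Printf("approved Timestamp: %s\n", c.LastUpdateTime.UTC().Format(time.RFC3339))
			}
		}
		if !approved {
			fmt.Println("no status info found")
		}
		fmt.Println()
	}
}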
This test displays symptom 3 (the CSR approver didn't do anything with CSRs for x minutes):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1372555862157037568

csr-5z8zn
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:31:25Z
no status info found

csr-6zxxb
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:00:53Z
no status info found

csr-9x2qs
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:46:57Z
no status info found

csr-ckbfc
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-168-242.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:17:22Z
approver: NodeCSRApprove
approved Timestamp: 2021-03-18T15:17:22Z

csr-cl6qf
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:01:42Z
no status info found

csr-fnmbr
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-139-100.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:10:49Z
approver: KubectlApprove
approved Timestamp: 2021-03-18T15:10:54Z

csr-g4rjw
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:46:33Z
no status info found

csr-hw7nh
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-130-192.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:16:41Z
approver: NodeCSRApprove
approved Timestamp: 2021-03-18T15:16:41Z

csr-lqbjc
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-213-43.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:09:52Z
approver: KubectlApprove
approved Timestamp: 2021-03-18T15:09:53Z

csr-nqwj8
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-250-199.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:16:24Z
approver: NodeCSRApprove
approved Timestamp: 2021-03-18T15:16:24Z

csr-pggmb
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:16:45Z
no status info found

csr-rzl75
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-232-128.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:16:20Z
no status info found

csr-vh85w
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-170-17.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T15:10:51Z
approver: KubectlApprove
approved Timestamp: 2021-03-18T15:10:54Z

csr-zrk4j
cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ip-10-0-247-235.us-west-1.compute.internal')>
creationTimestamp: 2021-03-18T16:31:49Z
no status info found

The CSRs in question are for two newly created masters (the old ones appear to have been deleted via the machine-api): ip-10-0-232-128.us-west-1.compute.internal and ip-10-0-247-235.us-west-1.compute.internal. These were both created around "creationTimestamp": "2021-03-18T15:59:04Z".

CSR approver logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1372555862157037568/artifacts/e2e-aws-disruptive/pods/openshift-cluster-machine-approver_machine-approver-586685f686-9gq7b_machine-approver-controller.log

The API server appears to be disrupted around 15:47:

15:47:04.646) (total time: 60044ms):
Trace[1568160156]: [1m0.044043849s] [1m0.044043849s] END
E0318 15:48:04.690950 1 reflector.go:138] github.com/openshift/cluster-machine-approver/status.go:99: Failed to watch *v1.ClusterOperator: failed to list *v1.ClusterOperator: the server was unable to return a response in the time allotted, but may still be processing the request (get clusteroperators.config.openshift.io)
I0318 15:48:47.079915 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0318 15:49:18.096941 1 trace.go:205] Trace[867160953]: "Reflector ListAndWatch" name:github.com/openshift/cluster-machine-approver/status.go:99 (18-Mar-2021 15:48:06.583) (total time: 71513ms):

It then seems to recover around 15:58. Since this is a disruptive test, this is probably expected.

Since the replacement master machines are in the Provisioned state and the cloud status indicates "Running", it appears the machine-controller was working correctly.

Machine controller logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1372555862157037568/artifacts/e2e-aws-disruptive/pods/openshift-machine-api_machine-api-controllers-958894c8-cg94r_machine-controller.log

These indicate that master-0 and master-2 were indeed deleted via the machine-api. Reviewing the machine-api logs, everything seemed to go fine in deleting and recreating the instance with the same name. The finalizer keeps the existing machine object around until the instance state goes to 'terminated', then the CI re-creates the machine object with the same name. The controller then correctly ignores the 'terminated' instance and creates a new one. Eventually, it's just stuck in a "machine has no node, requeing" loop. At this point, the CSRs should have been approved, but they weren't, and the node correctly re-issues a CSR request every 15 minutes until the test terminates.
> There should only be 1 approved CSR for each host, however, there are 2.

I was under the impression that kubelet requests two CSRs per node, one for the client certificate and one for the serving certificate. AFAIK this is true in OpenShift as well as regular Kubernetes.

> The CSR approver approved a single CSR twice (csr-qcnzr); this should never happen.

While ideally this shouldn't happen, is it actually harmful/detrimental? This is an artifact of using informers and building controllers that are eventually consistent. If we are using a cached client, this can and may happen, and the system should be able to account for it (I actually don't think it's an issue that it approved the same cert twice; the API request would be a no-op the second time). If we absolutely must not do this for some reason, then we should make sure we aren't using a cached API reader and always fetch the latest version.

> The CSR approver didn't do anything with the worker CSRs for 10 minutes

This sounds like a real bug; it sounds like the watch has not been set up correctly. I think we should focus on finding the issue here first rather than looking into the other two issues highlighted.
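To illustrate the "no-op the second time" point, this is roughly what a check-before-approve guard looks like. This is only a sketch against the certificates.k8s.io/v1 API, not the actual approver code; the reason/message strings and function names are illustrative.

package approver

import (
	"context"

	certsv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	certsclient "k8s.io/client-go/kubernetes/typed/certificates/v1"
)

// isApproved reports whether the CSR already carries an Approved condition.
func isApproved(csr *certsv1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certsv1.CertificateApproved && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// approveOnce approves the CSR only if the copy we have doesn't already show
// it as approved. Because that copy may come from an informer cache, two
// reconciles can still race; per the discussion above, the second approval is
// expected to be effectively a no-op.
func approveOnce(ctx context.Context, csrClient certsclient.CertificateSigningRequestInterface, csr *certsv1.CertificateSigningRequest) error {
	if isApproved(csr) {
		return nil
	}
	csr.Status.Conditions = append(csr.Status.Conditions, certsv1.CertificateSigningRequestCondition{
		Type:    certsv1.CertificateApproved,
		Status:  corev1.ConditionTrue,
		Reason:  "NodeCSRApprove", // reason string taken from the dump above
		Message: "CSR approved (illustrative message)",
	})
	_, err := csrClient.UpdateApproval(ctx, csr.Name, csr, metav1.UpdateOptions{})
	return err
}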
(In reply to Joel Speed from comment #3)

> > There should only be 1 approved CSR for each host, however, there are 2.
>
> I was under the impression that kubelet requests two CSRs per node, one for
> the client certificate and one for the serving certificate. AFAIK this is
> true in OpenShift as well as regular Kubernetes.

Yes, I slightly misspoke. There are typically two requests, 1 client and 1 server. I'm referring just to the client certs in these cases.

> While ideally this shouldn't happen, is it actually harmful/detrimental?
> This is an artifact of using informers and building controllers that are
> eventually consistent. If we are using a cached client, this can and may happen,
> and the system should be able to account for it (I actually don't think
> it's an issue that it approved the same cert twice; the API request
> would be a no-op the second time). If we absolutely must not do this for
> some reason, then we should make sure we aren't using a cached API reader and
> always fetch the latest version.

Yeah, the severity on these is pretty low IMO. Need to evaluate the level of effort for a fix. I think keeping the informer cache and then fetching the CSR without the cache at some point if it's unapproved might make sense. We could definitely do a hard get on whether or not a node already exists; that would be the biggest security win there (someone could submit a CSR request just after a node joins the cluster but before our cache syncs, if they can monitor node names, or if our cache fails to sync for some reason like it does today).

> > The CSR approver didn't do anything with the worker CSRs for 10 minutes
>
> This sounds like a real bug; it sounds like the watch has not been set up
> correctly. I think we should focus on finding the issue here first rather
> than looking into the other two issues highlighted.

I agree, this is the higher priority.
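Roughly what that hard get on node existence could look like, as a sketch assuming we have an uncached reader available (function and variable names are illustrative, not the approver's current code):

package approver

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeAlreadyExists asks the API server directly (bypassing the informer
// cache) whether a node with the given name is already registered. A client
// CSR for a name that already has a node could then be held for manual review
// instead of being auto-approved. Illustrative sketch only.
func nodeAlreadyExists(ctx context.Context, apiReader client.Reader, nodeName string) (bool, error) {
	node := &corev1.Node{}
	err := apiReader.Get(ctx, client.ObjectKey{Name: nodeName}, node)
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}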
> Yes, I slightly misspoke. There are typically two requests, 1 client and 1 server. I'm referring just to the client certs in these cases.

Ahh, I didn't notice that nuance! I wonder if this could be something to do with the kubelet not completing the cycle correctly, i.e. not getting to store the created cert before it is restarted as part of an MCO update or something. If we were to move to individual machine bootstrap tokens (which I think we should eventually) then this would cause issues, but I don't think it's huge/disruptive right now, just odd. Would be good to find out what's gone wrong here.

> Yeah, the severity on these is pretty low IMO. Need to evaluate the level of
> effort for a fix. I think keeping the informer cache and then fetching the
> CSR without the cache at some point if it's unapproved might make sense. We
> could definitely do a hard get on whether or not a node already exists; that
> would be the biggest security win there (someone could submit a CSR request
> just after a node joins the cluster but before our cache syncs, if they can
> monitor node names, or if our cache fails to sync for some reason like it
> does today).

From the controller-runtime manager we can get an uncached client that is set up ready for us, so agreed, let's update it to use that to re-fetch the cert when we think it's unapproved, as a double check. That should be a relatively straightforward fix.

As for the not reconciling for 10 minutes, it sounds like it somehow missed the events and/or just didn't put them on the queue for some reason; we will need to do some testing to work out why.
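For reference, the uncached client comes straight off the manager; a rough sketch of wiring it in and re-fetching the CSR before approval (type and function names are illustrative, not the actual change):

package approver

import (
	"context"

	certsv1 "k8s.io/api/certificates/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

type csrReconciler struct {
	client    client.Client // cached, informer-backed reads/writes
	apiReader client.Reader // uncached, always hits the API server
}

// newCSRReconciler wires in both the cached client and the manager's API
// reader. Sketch only; the actual approver wiring may differ.
func newCSRReconciler(mgr manager.Manager) *csrReconciler {
	return &csrReconciler{
		client:    mgr.GetClient(),
		apiReader: mgr.GetAPIReader(),
	}
}

// freshCSR re-fetches the CSR directly from the API server so an approval
// decision is based on the latest state rather than a possibly stale cache
// entry.
func (r *csrReconciler) freshCSR(ctx context.Context, name string) (*certsv1.CertificateSigningRequest, error) {
	csr := &certsv1.CertificateSigningRequest{}
	if err := r.apiReader.Get(ctx, client.ObjectKey{Name: name}, csr); err != nil {
		return nil, err
	}
	return csr, nil
}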
I tried this out yesterday on AWS and was unable to reproduce. For completeness, we should also try to reproduce on vSphere and see if there are continuing failures there. That said, if we are not seeing the wait issue on other platforms, I don't think this is a cluster-machine-approver issue.
I'm probably not going to get to this soon. Feel free to take a look or decide if we want to address this. It's pretty minor.
The double approve issue isn't really an issue. We are aware of the issue of CSRs not being approved for longer periods and are tracking that in other bugs. I don't think we need to keep this one open separately for now.