Bug 1940972

Summary:	vsphere: [4.6] CSR approval delayed for unknown reason
Product:	OpenShift Container Platform	Reporter:	Michael Gugino <mgugino>
Component:	Cloud Compute	Assignee:	Danil Grigorev <dgrigore>
Cloud Compute sub component:	Other Providers	QA Contact:	Milind Yadav <miyadav>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	agarcial, aos-bugs, ccoleman, dgrigore, ehashman, mfojtik, mgugino, wking, xxia, zhsun
Version:	4.6
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: CSR approval delayed for unknown reason Consequence: New machines appearing in cluster during installation were not approved quick enough, prolonging cluster installation. Fix: Increase cache resync period to 10 minutes instead of default 10 hours, to mitigate occasional api server unavaliability in early installation phases. Result: Master machines are approved in time.	Story Points:	---
Clone Of:	1940551	Environment:
Last Closed:	2021-07-27 22:54:33 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1940551, 1941107
Bug Blocks:	1950114

Description Michael Gugino 2021-03-19 16:53:48 UTC

+++ This bug was initially created as a clone of Bug #1940551 +++

https://search.ci.openshift.org/?search=Managed+cluster+should+have+same+number+of+Machines+and+Nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

--- New Info ---


We can see, the older CSRs are never approved, but the 3rd (newest) one is for a particular worker:

csr-hntnq
   cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm')>
   creationTimestamp: 2021-03-19T08:45:57Z
   no status info found
csr-hs9rt
   cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm')>
   creationTimestamp: 2021-03-19T08:30:55Z
   no status info found
csr-mb2tf
   cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm')>
   creationTimestamp: 2021-03-19T09:01:01Z
   approver: NodeCSRApprove
   approved Timestampe: 2021-03-19T09:01:01Z

Logs from machine-approver:

I0319 08:30:56.088043       1 main.go:147] CSR csr-hs9rt added
I0319 08:30:56.164759       1 main.go:182] CSR csr-hs9rt not authorized: failed to find machine for node ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm
I0319 08:30:56.164779       1 main.go:218] Error syncing csr csr-hs9rt: failed to find machine for node ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm


Machine creation timestamp: "creationTimestamp": "2021-03-19T08:27:17Z"


A different worker CSR was approved just seconds before the first failed CSR for worker-kjftm:

csr-24bhn
   cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-f9l4s')>
   creationTimestamp: 2021-03-19T08:30:52Z
   approver: NodeCSRApprove
   approved Timestampe: 2021-03-19T08:30:56Z



Machine successfully provisioned at: "lastTransitionTime": "2021-03-19T08:29:02Z"

Clearly the machine already existed.  The CSRs are not being generated too fast, there is one created every 15 minutes until approval, which is expected.

Duping the CSR data with openssl command doesn't show any changes between the approved or unapproved ones in the subjects.

Either there is a bug in the logic, or we're failing to sync machines into the controller cache for some reason.

Comment 1 Michael Gugino 2021-03-23 19:36:43 UTC

Link to artifacts from failed job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere/1372820393508737024/artifacts/e2e-vsphere/gather-extra/artifacts/

Comment 5 Michael Gugino 2021-04-14 15:08:22 UTC

After some discussion with Stefan, there are a multitude of compounding factors here.

First, our machine-controller initially connects to the bootstrap node's API server. This works for some time, and then the installer cleans up the the bootstrap node. The resulted in the following:

I0319 08:30:07.116056 1 actuator.go:109] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator checking if machine exists
I0319 08:30:54.691441 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:54.691843 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:57.313905 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:40:38.506786 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:45:55.651595 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c

So, we get the EOF from the streamwatcher, the streamwatcher then re-initiates the 'listing and watching.' It either is reusing an existing tcp/http connection, or is racing and the LB routes the connection back to the bootstrap node by chance (or it's sticky, etc), TBD.

Next, that stream seems to have a tiemout of some unknown lenght, it appears to be just less than 10 minutes, and we see the "Listing and watching" line again. Since the boostrap node has gone away, we're connecting to a real API server at this point.

Next, since the events from updating the machine to re-reconcile were lost when the bootstrap node was stopped uncleanly, we don't reconcile anything until the 15 minute default timeout.

If we miss a reconcile event for any reason after creation, that is the amount of time same as the kubelet re-issuing the client CSR.

Recommended actions:

We could bump our reconcile default timeout to 10 minutes. That should close the loop of double CSR create with the kubelet. We should investigate the possibility of catching the stream being recreated for Listing/Watching machines trigger full reconcile at that time. This might need to be handled upstream with controller runtime or another component, TBD.

Comment 6 Joel Speed 2021-04-14 17:11:13 UTC

I've spent a little time to look through the informer code and, given we are using controller runtime, it seems the only real way to get a reasonable timeout is to set a low resync period. Given we have a 15 minute limit and the watch timeout is randomly between 5 and 10 minutes anyway, my suggestion would be to go for something like either 5 or 7 minutes (there's jitter on this anyway), which should allow us to get at least 2 resyncs within any 15 minute period. This should ensure we get to approve all CSRS within our timeout

> We should investigate the possibility of catching the stream being recreated for Listing/Watching machines trigger full reconcile at that time.  This might need to be handled upstream with controller runtime or another component, TBD.

The code for this stream being restarted is deeply embedded in client-go code, it would be very difficult for us to get anything into this part if we want to continue using controller-runtime to manage our informers.
If we were to ditch controller runtime we could set up our own listerwatcher which would mean we could add some logic here in every time the List is called

Comment 7 Michael Gugino 2021-04-14 17:55:35 UTC

> Given we have a 15 minute limit and the watch timeout is randomly between 5 and 10 minutes anyway, my suggestion would be to go for something like either 5 or 7 minutes

This watch timeout seems to be independent of the reconcile timeout.

We want to avoid requeuing a stale object and doing lots of extra work.  If we set requeue time to 11 minutes, that should close the gap.  Whenever the watch reconnects, that will update the cache.  If the watch times out after 5-10 minutes, then provided we get a working one after this timeout (which would be the case in the bootstrap node scenario described here), by the time we requeue, we'll have the fresh object.

The only other question is whether or not cluster machine approver hits an incremental backoff (it used to) that will push us outside the 15 minute window for CSR approval (otherwise the kubelet will create a subsequent CSR.  This won't break anything unless we add a test or there's an alert for pending CSRs already, and those should probably exist in the future, at least for client bootstrap CSRs).

Comment 8 Joel Speed 2021-04-15 10:18:58 UTC

> This watch timeout seems to be independent of the reconcile timeout.

I thought we were talking about the resync period not the reconcile timeout? When I say resync period I mean the maximal time between a full reconcile of each resource?

If you look at the informer code in client-go, you'll see that the way this works is that there's a call to a function that starts the ListWatch, in here, if the resync channel receives (which it does after the timout) then everything is cancelled and the ListWatch is started again.
It's that initial listing operation that causes the full reconcile. And since this list operation isn't cached (in fact it updates the cache) it should be safe to requeue here as the cache will be up to date.

That said, it looks as though the resync period only actually starts **after** the list completes.
This means that changing the resync period and expecting that to be an absolute cap (eg setting it to something less than 15 minutes but more than 10) is not helpful if the List operation takes 10 minutes to timeout.

Client-go code I'm referring to: https://github.com/kubernetes/client-go/blob/44bedf8611f162b2a2433e520cf5568f999124f4/tools/cache/reflector.go#L251-L447

Comment 9 Michael Gugino 2021-04-15 14:25:27 UTC

So, looks like the option I'm talking about is already implemented at 10 minutes, so we shouldn't even be hitting this: https://github.com/openshift/machine-api-operator/pull/764/files

Looking at the machine controller logs more closely, not only do the watches die and then reconnect and silently timeout, but the reconcile is paused in the middle of processing:

I0319 08:29:57.065010       1 machine_scope.go:102] ci-op-91xvwzi0-ec220-s9j6f-master-1: patching machine
I0319 08:30:07.092703       1 controller.go:425] ci-op-91xvwzi0-ec220-s9j6f-master-1: going into phase "Running"
I0319 08:30:07.115156       1 controller.go:170] ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciling Machine
I0319 08:30:07.116056       1 actuator.go:109] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator checking if machine exists
I0319 08:30:54.691441       1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:54.691843       1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:57.313905       1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:40:38.506786       1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:45:55.651595       1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c
I0319 08:45:56.402719       1 reconciler.go:179] ci-op-91xvwzi0-ec220-s9j6f-master-2: already exists
I0319 08:45:56.402765       1 controller.go:278] ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciling machine triggers idempotent update
I0319 08:45:56.402771       1 actuator.go:123] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator updating machine
I0319 08:45:56.485890       1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c
I0319 08:46:14.854384       1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:46:27.065446       1 machine_scope.go:102] ci-op-91xvwzi0-ec220-s9j6f-master-2: patching machine
E0319 08:46:27.105347       1 actuator.go:57] ci-op-91xvwzi0-ec220-s9j6f-master-2 error: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout
E0319 08:46:27.105385       1 controller.go:280] ci-op-91xvwzi0-ec220-s9j6f-master-2: error updating machine: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout

So, there must be a lock or bug somewhere in controller-runtime to stops the execution of the reconcile, or a bug in the vSphere SDK that has/causes an extremely high timeout.  I've seen similar reconcile pauses on Azure previously, but I don't recall if there are these watch stream issues.

Comment 10 Michael Gugino 2021-04-19 20:36:37 UTC

*** Bug 1951213 has been marked as a duplicate of this bug. ***

Comment 12 Elana Hashman 2021-04-27 21:42:04 UTC

I'm still seeing 15m timeouts on CSR approvals for kubelet serving certs in CI: https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none

Example failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000

On node log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000/artifacts/e2e-azure-serial/nodes/ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf/journal

Apr 23 04:28:33.749586 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: I0423 04:28:33.749483    1499 certificate_manager.go:412] Rotating certificates
Apr 23 04:28:33.763631 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: I0423 04:28:33.763587    1499 log.go:184] http: TLS handshake error from 10.0.0.5:50142: no serving certificate available for the kubelet
...
Apr 23 04:43:33.763808 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: E0423 04:43:33.763802    1499 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition

(15m timeout)


From approver log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-66f68b694d-tpm28_machine-approver-controller.log

I0423 04:28:33.760349       1 controller.go:114] Reconciling CSR: csr-fpmvx
I0423 04:28:33.761130       1 csr_check.go:150] csr-fpmvx: CSR does not appear to be client csr
I0423 04:28:33.761200       1 csr_check.go:441] retrieving serving cert from ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf (10.0.32.5:10250)
I0423 04:28:33.763579       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0423 04:28:33.763611       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf
E0423 04:28:33.763620       1 csr_check.go:196] csr-fpmvx: Serving Cert: No target machine for node "ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf"
I0423 04:28:33.763627       1 controller.go:172] csr-fpmvx: CSR not authorized

Comment 13 Michael Gugino 2021-04-28 01:45:17 UTC

Based on the controller log, it looks like it's missing the other patch: https://github.com/openshift/cluster-machine-approver/pull/118

That is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1952282

Looking at the query, I don't see anything from the last 72 hours besides OpenStack, so I think it's fixed now.

Comment 14 Milind Yadav 2021-05-12 12:53:53 UTC

Validated on - 4.8.0-0.nightly-2021-05-12-072240




Logs : https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/nodes/ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw/
Journal logs 

.
.
.
.
.
May 12 07:57:59.360354 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:57:59.360300    1425 log.go:184] http: TLS handshake error from 10.0.0.5:43022: no serving certificate available for the kubelet
May 12 07:58:56.972828 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:58:56.972791    1425 log.go:184] http: TLS handshake error from 10.0.32.5:49816: no serving certificate available for the kubelet
May 12 07:59:09.716126 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:09.716059    1425 log.go:184] http: TLS handshake error from 10.0.32.5:50056: no serving certificate available for the kubelet
May 12 07:59:11.433589 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:11.433545    1425 log.go:184] http: TLS handshake error from 10.128.2.9:36954: no serving certificate available for the kubelet
May 12 07:59:16.203601 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:16.203562    1425 log.go:184] http: TLS handshake error from 10.0.32.5:50204: no serving certificate available for the kubelet
May 12 07:59:21.284202 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.284157    1425 log.go:184] http: TLS handshake error from 10.0.0.5:48476: no serving certificate available for the kubelet
May 12 07:59:21.293279 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.293240    1425 csr.go:249] certificate signing request csr-z7z44 is approved, waiting to be issued
May 12 07:59:21.344658 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.341279    1425 csr.go:245] certificate signing request csr-z7z44 is issued
May 12 07:59:21.344658 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.341381    1425 reflector.go:225] Stopping reflector *v1.CertificateSigningRe.
.
.


I could validate it  for the certificate - csr-z7z44 from the logs :https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-6458fd8964-wnv7q_machine-approver-controller.log
.
.
I0512 07:57:59.359043       1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:57:59.361271       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:57:59.361300       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
E0512 07:57:59.361310       1 csr_check.go:196] csr-z7z44: Serving Cert: No target machine for node "ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw"
I0512 07:57:59.361325       1 controller.go:172] csr-z7z44: CSR not authorized
..
.
.
.
I0512 07:59:21.282281       1 controller.go:114] Reconciling CSR: csr-z7z44
I0512 07:59:21.282797       1 csr_check.go:150] csr-z7z44: CSR does not appear to be client csr
I0512 07:59:21.282851       1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:59:21.285221       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:59:21.285244       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
I0512 07:59:21.297589       1 controller.go:179] CSR csr-z7z44 approved.
.
.
.


Additional Info:

Moved to verified based on above results

Comment 17 errata-xmlrpc 2021-07-27 22:54:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438