Bug 1940972
Summary: | vsphere: [4.6] CSR approval delayed for unknown reason | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino>
Component: | Cloud Compute | Assignee: | Danil Grigorev <dgrigore>
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | agarcial, aos-bugs, ccoleman, dgrigore, ehashman, mfojtik, mgugino, wking, xxia, zhsun
Version: | 4.6 | |
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: CSR approval was delayed for an unknown reason.
Consequence: New machines appearing in the cluster during installation were not approved quickly enough, prolonging cluster installation.
Fix: Resync the controller cache every 10 minutes instead of the default 10 hours, to mitigate occasional API server unavailability in the early installation phases.
Result: Master machines are approved in time.
|
Story Points: | --- | |
Clone Of: | 1940551 | Environment: |
Last Closed: | 2021-07-27 22:54:33 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1940551, 1941107 | |
Bug Blocks: | 1950114 | |
Description
Michael Gugino
2021-03-19 16:53:48 UTC
Link to artifacts from the failed job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere/1372820393508737024/artifacts/e2e-vsphere/gather-extra/artifacts/

After some discussion with Stefan, there are a multitude of compounding factors here.

First, our machine-controller initially connects to the bootstrap node's API server. This works for some time, and then the installer cleans up the bootstrap node. That resulted in the following:

I0319 08:30:07.116056 1 actuator.go:109] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator checking if machine exists
I0319 08:30:54.691441 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:54.691843 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:57.313905 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:40:38.506786 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:45:55.651595 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c

So, we get the EOF from the streamwatcher, and the streamwatcher then re-initiates the "listing and watching". It is either reusing an existing TCP/HTTP connection, or it is racing and the LB routes the connection back to the bootstrap node by chance (or it's sticky, etc.), TBD.

Next, that stream seems to have a timeout of some unknown length; it appears to be just less than 10 minutes, and we see the "Listing and watching" line again. Since the bootstrap node has gone away, we're connecting to a real API server at this point.

Next, since the events that would have triggered a re-reconcile of the machine were lost when the bootstrap node was stopped uncleanly, we don't reconcile anything until the 15-minute default timeout. If we miss a reconcile event for any reason after creation, that delay is the same as the interval after which the kubelet re-issues its client CSR.

Recommended actions:

We could bump our default reconcile timeout to 10 minutes. That should close the loop of a double CSR create with the kubelet.

We should investigate the possibility of catching the stream being recreated for listing/watching Machines and triggering a full reconcile at that time. This might need to be handled upstream with controller-runtime or another component, TBD.

I've spent a little time looking through the informer code and, given we are using controller-runtime, it seems the only real way to get a reasonable timeout is to set a low resync period. Given we have a 15 minute limit and the watch timeout is randomly between 5 and 10 minutes anyway, my suggestion would be to go for something like either 5 or 7 minutes (there's jitter on this anyway), which should allow us to get at least two resyncs within any 15-minute period. This should ensure we get to approve all CSRs within our timeout.
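For reference, a minimal sketch of how a shorter resync period could be wired up through controller-runtime's manager options. This is not the actual machine-api-operator code; the 10-minute value mirrors the fix described above, and newer controller-runtime releases expose the same knob under the cache options instead:

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Resync the shared cache every ~10 minutes instead of controller-runtime's
	// ~10 hour default, so every Machine is re-reconciled well inside the
	// kubelet's 15-minute CSR window even if watch events are lost.
	syncPeriod := 10 * time.Minute

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Controllers built against this manager share its cache and therefore
	// inherit the shortened resync period (with jitter added by the informers).
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```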
> We should investigate the possibility of catching the stream being recreated for listing/watching Machines and triggering a full reconcile at that time. This might need to be handled upstream with controller-runtime or another component, TBD.
The code for this stream being restarted is deeply embedded in client-go; it would be very difficult for us to get anything into this part if we want to continue using controller-runtime to manage our informers.
If we were to ditch controller-runtime, we could set up our own ListerWatcher, which would let us add some logic every time List is called, as in the sketch below.
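A rough sketch of what that could look like with client-go directly; this is illustrative only (it lists Pods rather than Machines so it stays self-contained, and the onList hook just stands in for "trigger a full reconcile"):

```go
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// hookedListerWatcher wraps any cache.ListerWatcher and runs a callback every
// time the reflector (re)lists, i.e. every time the watch stream is recreated.
type hookedListerWatcher struct {
	cache.ListerWatcher
	onList func()
}

func (h *hookedListerWatcher) List(options metav1.ListOptions) (runtime.Object, error) {
	h.onList() // the stream is being (re)created: e.g. enqueue a full reconcile
	return h.ListerWatcher.List(options)
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pods stand in for Machines purely to keep the example self-contained.
	base := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())
	lw := &hookedListerWatcher{ListerWatcher: base, onList: func() {
		log.Println("List called (watch stream recreated): trigger full reconcile here")
	}}

	informer := cache.NewSharedIndexInformer(lw, &corev1.Pod{}, 10*time.Minute, cache.Indexers{})

	stop := make(chan struct{})
	defer close(stop)
	informer.Run(stop)
}
```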
> Given we have a 15 minute limit and the watch timeout is randomly between 5 and 10 minutes anyway, my suggestion would be to go for something like either 5 or 7 minutes
This watch timeout seems to be independent of the reconcile timeout.
We want to avoid requeuing a stale object and doing lots of extra work. If we set the requeue time to 11 minutes, that should close the gap. Whenever the watch reconnects, that will update the cache. If the watch times out after 5-10 minutes, then provided we get a working one after this timeout (which would be the case in the bootstrap node scenario described here), by the time we requeue we'll have the fresh object.
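As a hypothetical sketch of that requeue idea (not the actual machine controller; the reconciler type is a stand-in and the 11-minute figure is the one proposed above):

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// MachineReconciler is a stand-in for the real machine controller.
type MachineReconciler struct{}

func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... normal reconcile work would happen here ...

	// Requeue unconditionally after 11 minutes: even if a watch event is lost
	// (as when the bootstrap API server disappears mid-stream), the object is
	// revisited before the kubelet's 15-minute CSR timeout expires, and by then
	// the cache should hold a fresh copy.
	return ctrl.Result{RequeueAfter: 11 * time.Minute}, nil
}
```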
The only other question is whether or not the cluster-machine-approver hits an incremental backoff (it used to) that would push us outside the 15-minute window for CSR approval; otherwise the kubelet will create a subsequent CSR. That won't break anything unless we add a test, or unless there's already an alert for pending CSRs, and those should probably exist in the future, at least for client bootstrap CSRs.
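To show why an incremental backoff matters for that window, a small illustration using client-go's exponential failure rate limiter; the 5s base and 5m cap are made-up parameters, not the approver's actual settings:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Hypothetical backoff: 5s base delay, doubling per failure, capped at 5m.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute)

	var total time.Duration
	for i := 1; i <= 8; i++ {
		d := rl.When("csr-example")
		total += d
		fmt.Printf("retry %d: wait %v (cumulative %v)\n", i, d, total)
	}
	// With these parameters the cumulative wait passes 15 minutes after eight
	// consecutive failures, i.e. it can push approval outside the kubelet's
	// CSR window, which is the concern raised above.
}
```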
> This watch timeout seems to be independent of the reconcile timeout.

I thought we were talking about the resync period, not the reconcile timeout? When I say resync period I mean the maximum time between full reconciles of each resource.

If you look at the informer code in client-go, you'll see that the way this works is that there's a call to a function that starts the ListWatch; in there, if the resync channel receives (which it does after the timeout), then everything is cancelled and the ListWatch is started again. It's that initial listing operation that causes the full reconcile. And since this list operation isn't cached (in fact it updates the cache), it should be safe to requeue here, as the cache will be up to date.

That said, it looks as though the resync period only actually starts **after** the list completes. This means that changing the resync period and expecting it to be an absolute cap (e.g. setting it to something less than 15 minutes but more than 10) is not helpful if the List operation takes 10 minutes to time out.

Client-go code I'm referring to: https://github.com/kubernetes/client-go/blob/44bedf8611f162b2a2433e520cf5568f999124f4/tools/cache/reflector.go#L251-L447

So, it looks like the option I'm talking about is already implemented at 10 minutes, so we shouldn't even be hitting this: https://github.com/openshift/machine-api-operator/pull/764/files

Looking at the machine controller logs more closely, not only do the watches die, reconnect, and silently time out, but the reconcile is paused in the middle of processing:

I0319 08:29:57.065010 1 machine_scope.go:102] ci-op-91xvwzi0-ec220-s9j6f-master-1: patching machine
I0319 08:30:07.092703 1 controller.go:425] ci-op-91xvwzi0-ec220-s9j6f-master-1: going into phase "Running"
I0319 08:30:07.115156 1 controller.go:170] ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciling Machine
I0319 08:30:07.116056 1 actuator.go:109] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator checking if machine exists
I0319 08:30:54.691441 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:54.691843 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:57.313905 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:40:38.506786 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:45:55.651595 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c
I0319 08:45:56.402719 1 reconciler.go:179] ci-op-91xvwzi0-ec220-s9j6f-master-2: already exists
I0319 08:45:56.402765 1 controller.go:278] ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciling machine triggers idempotent update
I0319 08:45:56.402771 1 actuator.go:123] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator updating machine
I0319 08:45:56.485890 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c
I0319 08:46:14.854384 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:46:27.065446 1 machine_scope.go:102] ci-op-91xvwzi0-ec220-s9j6f-master-2: patching machine
E0319 08:46:27.105347 1 actuator.go:57] ci-op-91xvwzi0-ec220-s9j6f-master-2 error: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout
E0319 08:46:27.105385 1 controller.go:280] ci-op-91xvwzi0-ec220-s9j6f-master-2: error updating machine: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout

So, there must be a lock or bug somewhere in controller-runtime that stops the execution of the reconcile, or a bug in the vSphere SDK that has or causes an extremely high timeout. I've seen similar reconcile pauses on Azure previously, but I don't recall if they involved these watch stream issues.
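One defensive pattern against that kind of stall, sketched with only the standard library (the URL and the 30-second budget are placeholders; this is not the actual vSphere actuator code): bound the external vCenter call with a context deadline so a hung connection cannot pause the reconcile for many minutes.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// createSession is a stand-in for a vSphere SDK call such as session creation;
// the real actuator and govmomi wiring are not shown here.
func createSession(ctx context.Context, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// Cap the external call at 30 seconds so a hung connection to vCenter
	// fails fast instead of stalling the reconcile loop, as seen in the
	// ~15-minute gap in the logs above.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := createSession(ctx, "https://vcenter.example.com/rest/com/vmware/cis/session"); err != nil {
		fmt.Println("vCenter call failed or timed out:", err)
	}
}
```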
"https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout E0319 08:46:27.105385 1 controller.go:280] ci-op-91xvwzi0-ec220-s9j6f-master-2: error updating machine: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout So, there must be a lock or bug somewhere in controller-runtime to stops the execution of the reconcile, or a bug in the vSphere SDK that has/causes an extremely high timeout. I've seen similar reconcile pauses on Azure previously, but I don't recall if there are these watch stream issues. *** Bug 1951213 has been marked as a duplicate of this bug. *** I'm still seeing 15m timeouts on CSR approvals for kubelet serving certs in CI: https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none Example failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000 On node log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000/artifacts/e2e-azure-serial/nodes/ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf/journal Apr 23 04:28:33.749586 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: I0423 04:28:33.749483 1499 certificate_manager.go:412] Rotating certificates Apr 23 04:28:33.763631 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: I0423 04:28:33.763587 1499 log.go:184] http: TLS handshake error from 10.0.0.5:50142: no serving certificate available for the kubelet ... Apr 23 04:43:33.763808 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: E0423 04:43:33.763802 1499 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition (15m timeout) From approver log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-66f68b694d-tpm28_machine-approver-controller.log I0423 04:28:33.760349 1 controller.go:114] Reconciling CSR: csr-fpmvx I0423 04:28:33.761130 1 csr_check.go:150] csr-fpmvx: CSR does not appear to be client csr I0423 04:28:33.761200 1 csr_check.go:441] retrieving serving cert from ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf (10.0.32.5:10250) I0423 04:28:33.763579 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error I0423 04:28:33.763611 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf E0423 04:28:33.763620 1 csr_check.go:196] csr-fpmvx: Serving Cert: No target machine for node "ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf" I0423 04:28:33.763627 1 controller.go:172] csr-fpmvx: CSR not authorized Based on the controller log, it looks like it's missing the other patch: https://github.com/openshift/cluster-machine-approver/pull/118 That is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1952282 Looking at the query, I don't see anything from the last 72 hours besides OpenStack, so I think it's fixed now. 
Validated on 4.8.0-0.nightly-2021-05-12-072240

Journal logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/nodes/ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw/

...
May 12 07:57:59.360354 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:57:59.360300 1425 log.go:184] http: TLS handshake error from 10.0.0.5:43022: no serving certificate available for the kubelet
May 12 07:58:56.972828 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:58:56.972791 1425 log.go:184] http: TLS handshake error from 10.0.32.5:49816: no serving certificate available for the kubelet
May 12 07:59:09.716126 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:09.716059 1425 log.go:184] http: TLS handshake error from 10.0.32.5:50056: no serving certificate available for the kubelet
May 12 07:59:11.433589 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:11.433545 1425 log.go:184] http: TLS handshake error from 10.128.2.9:36954: no serving certificate available for the kubelet
May 12 07:59:16.203601 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:16.203562 1425 log.go:184] http: TLS handshake error from 10.0.32.5:50204: no serving certificate available for the kubelet
May 12 07:59:21.284202 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.284157 1425 log.go:184] http: TLS handshake error from 10.0.0.5:48476: no serving certificate available for the kubelet
May 12 07:59:21.293279 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.293240 1425 csr.go:249] certificate signing request csr-z7z44 is approved, waiting to be issued
May 12 07:59:21.344658 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.341279 1425 csr.go:245] certificate signing request csr-z7z44 is issued
May 12 07:59:21.344658 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.341381 1425 reflector.go:225] Stopping reflector *v1.CertificateSigningRe...
...

I could validate it for certificate csr-z7z44 from the approver logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-6458fd8964-wnv7q_machine-approver-controller.log

...
I0512 07:57:59.359043 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:57:59.361271 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:57:59.361300 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
E0512 07:57:59.361310 1 csr_check.go:196] csr-z7z44: Serving Cert: No target machine for node "ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw"
I0512 07:57:59.361325 1 controller.go:172] csr-z7z44: CSR not authorized
...
I0512 07:59:21.282281 1 controller.go:114] Reconciling CSR: csr-z7z44
I0512 07:59:21.282797 1 csr_check.go:150] csr-z7z44: CSR does not appear to be client csr
I0512 07:59:21.282851 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:59:21.285221 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:59:21.285244 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
I0512 07:59:21.297589 1 controller.go:179] CSR csr-z7z44 approved
...

Additional Info: Moved to VERIFIED based on the above results.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438