+++ This bug was initially created as a clone of Bug #1940551 +++

https://search.ci.openshift.org/?search=Managed+cluster+should+have+same+number+of+Machines+and+Nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

--- New Info ---

For this particular worker we can see that the older CSRs are never approved, but the third (newest) one is:

csr-hntnq
  cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm')>
  creationTimestamp: 2021-03-19T08:45:57Z
  no status info found

csr-hs9rt
  cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm')>
  creationTimestamp: 2021-03-19T08:30:55Z
  no status info found

csr-mb2tf
  cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm')>
  creationTimestamp: 2021-03-19T09:01:01Z
  approver: NodeCSRApprove
  approved timestamp: 2021-03-19T09:01:01Z

Logs from machine-approver:

I0319 08:30:56.088043 1 main.go:147] CSR csr-hs9rt added
I0319 08:30:56.164759 1 main.go:182] CSR csr-hs9rt not authorized: failed to find machine for node ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm
I0319 08:30:56.164779 1 main.go:218] Error syncing csr csr-hs9rt: failed to find machine for node ci-op-91xvwzi0-ec220-s9j6f-worker-kjftm

Machine creation timestamp:

"creationTimestamp": "2021-03-19T08:27:17Z"

A different worker's CSR was approved just seconds before the first failed CSR for worker-kjftm:

csr-24bhn
  cn name: <NameAttribute(oid=<ObjectIdentifier(oid=2.5.4.3, name=commonName)>, value='system:node:ci-op-91xvwzi0-ec220-s9j6f-worker-f9l4s')>
  creationTimestamp: 2021-03-19T08:30:52Z
  approver: NodeCSRApprove
  approved timestamp: 2021-03-19T08:30:56Z

Machine successfully provisioned at:

"lastTransitionTime": "2021-03-19T08:29:02Z"

Clearly the machine already existed. The CSRs are not being generated too fast; one is created every 15 minutes until approval, which is expected. Dumping the CSR data with openssl doesn't show any differences in the subjects between the approved and unapproved ones. Either there is a bug in the approval logic, or we're failing to sync machines into the controller cache for some reason.
Link to artifacts from failed job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere/1372820393508737024/artifacts/e2e-vsphere/gather-extra/artifacts/
After some discussion with Stefan, there are a multitude of compounding factors here.

First, our machine-controller initially connects to the bootstrap node's API server. This works for some time, and then the installer cleans up the bootstrap node. That resulted in the following:

I0319 08:30:07.116056 1 actuator.go:109] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator checking if machine exists
I0319 08:30:54.691441 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:54.691843 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:57.313905 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:40:38.506786 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:45:55.651595 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c

So we get the EOF from the streamwatcher, and the streamwatcher then re-initiates the 'listing and watching'. It is either reusing an existing tcp/http connection, or it is racing and the LB routes the connection back to the bootstrap node by chance (or it's sticky, etc.), TBD.

Next, that stream seems to have a timeout of some unknown length, apparently just under 10 minutes, and then we see the "Listing and watching" line again. Since the bootstrap node has gone away, we're connecting to a real API server at this point.

Next, since the events from updating the machine that would trigger a re-reconcile were lost when the bootstrap node was stopped uncleanly, we don't reconcile anything until the 15 minute default timeout. If we miss a reconcile event for any reason after creation, that is the same amount of time it takes the kubelet to re-issue the client CSR.

Recommended actions:

We could bump our reconcile default timeout to 10 minutes (as sketched below). That should close the loop of the kubelet creating a second CSR.

We should investigate the possibility of catching the stream being recreated for listing/watching machines and triggering a full reconcile at that time. This might need to be handled upstream in controller-runtime or another component, TBD.
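A minimal sketch, assuming a 2021-era controller-runtime where Reconcile takes a context, of what the 10-minute requeue would look like; the requeueAfter constant and the empty reconciler body are placeholders, not the actual machine-api code:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueAfter is a placeholder value: shorter than the kubelet's 15-minute
// CSR retry, so a machine whose watch event was lost is still reconciled
// (and its CSR approvable) before the kubelet issues a second request.
const requeueAfter = 10 * time.Minute

type MachineReconciler struct{}

func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... normal machine reconciliation would happen here ...

	// Always come back within requeueAfter, even if no watch event arrives.
	return ctrl.Result{RequeueAfter: requeueAfter}, nil
}
```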
I've spent a little time looking through the informer code and, given we are using controller-runtime, it seems the only real way to get a reasonable timeout is to set a low resync period. Given we have a 15 minute limit and the watch timeout is randomly between 5 and 10 minutes anyway, my suggestion would be to go for something like either 5 or 7 minutes (there's jitter on this anyway), which should allow us to get at least 2 resyncs within any 15 minute period. This should ensure we get to approve all CSRs within our timeout.

> We should investigate the possibility of catching the stream being recreated for listing/watching machines and triggering a full reconcile at that time. This might need to be handled upstream in controller-runtime or another component, TBD.

The code that restarts this stream is deeply embedded in client-go; it would be very difficult for us to hook into this part if we want to continue using controller-runtime to manage our informers. If we were to ditch controller-runtime, we could set up our own ListerWatcher, which would let us add some logic there every time List is called (see the sketch below).
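To make the ListerWatcher alternative concrete, here's a rough sketch using client-go directly; listMachines, watchMachines and enqueueAllMachines are hypothetical helpers standing in for the real client and workqueue wiring:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

// newMachineListWatch wraps the List call so we get a hook every time the
// reflector (re)starts its List+Watch, e.g. after the watch stream drops.
func newMachineListWatch(
	listMachines func(metav1.ListOptions) (runtime.Object, error),
	watchMachines func(metav1.ListOptions) (watch.Interface, error),
	enqueueAllMachines func(),
) *cache.ListWatch {
	return &cache.ListWatch{
		ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
			// The reflector calls List whenever the watch is re-established,
			// so trigger a full reconcile of all machines at that point.
			enqueueAllMachines()
			return listMachines(opts)
		},
		WatchFunc: watchMachines,
	}
}
```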
> Given we have a 15 minute limit and the watch timeout is randomly between 5 and 10 minutes anyway, my suggestion would be to go for something like either 5 or 7 minutes

This watch timeout seems to be independent of the reconcile timeout. We want to avoid requeuing a stale object and doing lots of extra work. If we set the requeue time to 11 minutes, that should close the gap: whenever the watch reconnects, it will update the cache. If the watch times out after 5-10 minutes, then provided we get a working one after this timeout (which would be the case in the bootstrap node scenario described here), by the time we requeue we'll have the fresh object (see the timing sketch below).

The only other question is whether the cluster-machine-approver hits an incremental backoff (it used to) that would push us outside the 15 minute window for CSR approval; otherwise the kubelet will create a subsequent CSR. This won't break anything unless we add a test or there's already an alert for pending CSRs, and those should probably exist in the future, at least for client bootstrap CSRs.
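To spell out the timing argument, a trivial sketch with the values from this thread (purely illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	watchTimeoutMax := 10 * time.Minute // re-list happens at most ~10m after the stream breaks
	requeueAfter := 11 * time.Minute    // proposed requeue interval
	kubeletCSRRetry := 15 * time.Minute // kubelet re-issues the client CSR after this long

	// The cache is refreshed by the re-list before the requeue fires...
	fmt.Println(watchTimeoutMax < requeueAfter) // true
	// ...and the requeued reconcile still lands inside the kubelet's window.
	fmt.Println(requeueAfter < kubeletCSRRetry) // true
}
```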
> This watch timeout seems to be independent of the reconcile timeout.

I thought we were talking about the resync period, not the reconcile timeout? When I say resync period I mean the maximum time between full reconciles of each resource.

If you look at the informer code in client-go, you'll see that the way this works is that there's a call to a function that starts the ListWatch; in there, if the resync channel receives (which it does after the timeout), then everything is cancelled and the ListWatch is started again. It's that initial listing operation that causes the full reconcile. And since this list operation isn't cached (in fact it updates the cache), it should be safe to requeue here, as the cache will be up to date.

That said, it looks as though the resync period only actually starts **after** the list completes. This means that changing the resync period and expecting it to be an absolute cap (e.g. setting it to something less than 15 minutes but more than 10) is not helpful if the List operation takes 10 minutes to time out.

Client-go code I'm referring to: https://github.com/kubernetes/client-go/blob/44bedf8611f162b2a2433e520cf5568f999124f4/tools/cache/reflector.go#L251-L447
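For reference, a sketch of where the resync period is plugged in if we stay with controller-runtime; the field name/location may differ between controller-runtime versions, the 7-minute value is illustrative, and (as noted above) the timer only starts once the initial List has completed:

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	resync := 7 * time.Minute
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Informers started by this manager re-deliver every cached object
		// roughly once per SyncPeriod (with jitter), driving a full reconcile
		// of all Machines even when no watch event arrives.
		SyncPeriod: &resync,
	})
}
```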
So it looks like the option I'm talking about is already implemented at 10 minutes, so we shouldn't even be hitting this: https://github.com/openshift/machine-api-operator/pull/764/files

Looking at the machine controller logs more closely, not only do the watches die, reconnect, and then silently time out, but the reconcile is also paused in the middle of processing:

I0319 08:29:57.065010 1 machine_scope.go:102] ci-op-91xvwzi0-ec220-s9j6f-master-1: patching machine
I0319 08:30:07.092703 1 controller.go:425] ci-op-91xvwzi0-ec220-s9j6f-master-1: going into phase "Running"
I0319 08:30:07.115156 1 controller.go:170] ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciling Machine
I0319 08:30:07.116056 1 actuator.go:109] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator checking if machine exists
I0319 08:30:54.691441 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:54.691843 1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0319 08:30:57.313905 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:40:38.506786 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:45:55.651595 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c
I0319 08:45:56.402719 1 reconciler.go:179] ci-op-91xvwzi0-ec220-s9j6f-master-2: already exists
I0319 08:45:56.402765 1 controller.go:278] ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciling machine triggers idempotent update
I0319 08:45:56.402771 1 actuator.go:123] ci-op-91xvwzi0-ec220-s9j6f-master-2: actuator updating machine
I0319 08:45:56.485890 1 session.go:113] Find template by instance uuid: 897c36d4-7ae5-49e2-9220-72e684ef793c
I0319 08:46:14.854384 1 reflector.go:243] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I0319 08:46:27.065446 1 machine_scope.go:102] ci-op-91xvwzi0-ec220-s9j6f-master-2: patching machine
E0319 08:46:27.105347 1 actuator.go:57] ci-op-91xvwzi0-ec220-s9j6f-master-2 error: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout
E0319 08:46:27.105385 1 controller.go:280] ci-op-91xvwzi0-ec220-s9j6f-master-2: error updating machine: ci-op-91xvwzi0-ec220-s9j6f-master-2: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: i/o timeout

So there must be a lock or bug somewhere in controller-runtime that stops the execution of the reconcile, or a bug in the vSphere SDK that has/causes an extremely high timeout. I've seen similar reconcile pauses on Azure previously, but I don't recall whether there were these watch stream issues.
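On the "extremely high timeout" theory, one defensive option (a sketch only, not existing machine-api code; reconcileTags is a hypothetical stand-in for the vSphere session/tag calls) would be to bound those calls with a context deadline so a hung connection can't stall the reconcile for the better part of 15 minutes:

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// reconcileTagsWithTimeout caps the vSphere round-trips; anything slower is
// surfaced as an error and retried on the next reconcile instead of silently
// blocking the controller's worker.
func reconcileTagsWithTimeout(ctx context.Context, reconcileTags func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute) // illustrative bound
	defer cancel()

	if err := reconcileTags(ctx); err != nil {
		return fmt.Errorf("failed to reconcile tags: %w", err)
	}
	return nil
}
```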
*** Bug 1951213 has been marked as a duplicate of this bug. ***
I'm still seeing 15m timeouts on CSR approvals for kubelet serving certs in CI:
https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none

Example failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000

Node journal: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000/artifacts/e2e-azure-serial/nodes/ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf/journal

Apr 23 04:28:33.749586 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: I0423 04:28:33.749483 1499 certificate_manager.go:412] Rotating certificates
Apr 23 04:28:33.763631 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: I0423 04:28:33.763587 1499 log.go:184] http: TLS handshake error from 10.0.0.5:50142: no serving certificate available for the kubelet
...
Apr 23 04:43:33.763808 ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf hyperkube[1499]: E0423 04:43:33.763802 1499 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition (15m timeout)

From the approver log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1385443101878784000/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-66f68b694d-tpm28_machine-approver-controller.log

I0423 04:28:33.760349 1 controller.go:114] Reconciling CSR: csr-fpmvx
I0423 04:28:33.761130 1 csr_check.go:150] csr-fpmvx: CSR does not appear to be client csr
I0423 04:28:33.761200 1 csr_check.go:441] retrieving serving cert from ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf (10.0.32.5:10250)
I0423 04:28:33.763579 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0423 04:28:33.763611 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf
E0423 04:28:33.763620 1 csr_check.go:196] csr-fpmvx: Serving Cert: No target machine for node "ci-op-3kp20sxt-ce9aa-94m4k-worker-centralus1-rdcbf"
I0423 04:28:33.763627 1 controller.go:172] csr-fpmvx: CSR not authorized
Based on the controller log, it looks like it's missing the other patch: https://github.com/openshift/cluster-machine-approver/pull/118

That is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1952282

Looking at the query, I don't see anything from the last 72 hours besides OpenStack, so I think it's fixed now.
Validated on 4.8.0-0.nightly-2021-05-12-072240

Journal logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/nodes/ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw/

...
May 12 07:57:59.360354 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:57:59.360300 1425 log.go:184] http: TLS handshake error from 10.0.0.5:43022: no serving certificate available for the kubelet
May 12 07:58:56.972828 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:58:56.972791 1425 log.go:184] http: TLS handshake error from 10.0.32.5:49816: no serving certificate available for the kubelet
May 12 07:59:09.716126 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:09.716059 1425 log.go:184] http: TLS handshake error from 10.0.32.5:50056: no serving certificate available for the kubelet
May 12 07:59:11.433589 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:11.433545 1425 log.go:184] http: TLS handshake error from 10.128.2.9:36954: no serving certificate available for the kubelet
May 12 07:59:16.203601 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:16.203562 1425 log.go:184] http: TLS handshake error from 10.0.32.5:50204: no serving certificate available for the kubelet
May 12 07:59:21.284202 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.284157 1425 log.go:184] http: TLS handshake error from 10.0.0.5:48476: no serving certificate available for the kubelet
May 12 07:59:21.293279 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.293240 1425 csr.go:249] certificate signing request csr-z7z44 is approved, waiting to be issued
May 12 07:59:21.344658 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.341279 1425 csr.go:245] certificate signing request csr-z7z44 is issued
May 12 07:59:21.344658 ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw hyperkube[1425]: I0512 07:59:21.341381 1425 reflector.go:225] Stopping reflector *v1.CertificateSigningRe...

I could validate it for the certificate csr-z7z44 from the machine-approver logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-6458fd8964-wnv7q_machine-approver-controller.log

...
I0512 07:57:59.359043 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:57:59.361271 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:57:59.361300 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
E0512 07:57:59.361310 1 csr_check.go:196] csr-z7z44: Serving Cert: No target machine for node "ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw"
I0512 07:57:59.361325 1 controller.go:172] csr-z7z44: CSR not authorized
...
I0512 07:59:21.282281 1 controller.go:114] Reconciling CSR: csr-z7z44
I0512 07:59:21.282797 1 csr_check.go:150] csr-z7z44: CSR does not appear to be client csr
I0512 07:59:21.282851 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:59:21.285221 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:59:21.285244 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
I0512 07:59:21.297589 1 controller.go:179] CSR csr-z7z44 approved
...

Additional Info: Moved to verified based on the above results.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438