Bug 1755469
| Summary: | All nodes go NotReady after 24 hours, several pending CSRs | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Walid A. <wabouham> | |
| Component: | kube-controller-manager | Assignee: | David Eads <deads> | |
| Status: | CLOSED ERRATA | QA Contact: | Walid A. <wabouham> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.2.0 | CC: | aos-bugs, brad.ison, deads, dmoessne, gblomqui, jbasquil, jhou, joedward, knarra, maszulik, mfojtik, mgugino, mifiedle, schoudha, scuppett, sdodson, sttts, xtian | |
| Target Milestone: | --- | Keywords: | Regression, Reopened, TestBlocker | |
| Target Release: | 4.3.z | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1755608 1779801 (view as bug list) | Environment: | ||
| Last Closed: | 2020-05-13 21:25:14 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1755608 | |||
| Bug Blocks: | 1779801 | |||
|
Description
Walid A.
2019-09-25 14:39:32 UTC
According to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c27 It looks like the CA is rotated after 24 hours, I'm unsure how this might impact the library. This line should prevent any problems with mismatch CA's and cause a fall-back to machine-api-based validation: https://github.com/openshift/cluster-machine-approver/pull/38/files#diff-a482a1a664767249971b0b71ca9baaf0R156 But I'm not 100% certain that the handshake will fail for the purposes of retrieving a serving cert. There are a variety of checks that should be non-fatal, but if any of them are encountered, we'll never try the machine-api as a fall back: https://github.com/openshift/cluster-machine-approver/pull/38/files#diff-a482a1a664767249971b0b71ca9baaf0R278 It's unclear if csr address lists, emails, and names will be in the identical order there were once in, for example. I recommend we always try to validate the machine-api bits in case of any error, rather than returning early if we hit an error in that function. PR for 4.2: https://github.com/openshift/cluster-machine-approver/pull/49 Bugzilla bot doesn't want to cooperate, thinks we should target this bug at 4.2.z instead of 4.2.0. As Mike pointed machine-approver logs went missing so is difficult to understand the root of the problem here. Interesting `oc get csr` gives a list of pending CSR requested by `system:serviceaccount:openshift-machine-config-operator:node-bootstrapper` which should be only the case for the bootstrapping client certs. So one guess is that something getting corrupted at some point for a node might be generating indefinite CSRs causing the machine approver to reach the short circuiting limit, hence preventing it from handling the renewal properly We saw this once before but weren't able to reproduce it at the time. We can talk about the PR to fallback to the machine-api flow separately, but there's essentially no way that will solve this problem. Like Alberto said, these CSRs are coming from the bootstrap user, which should *only* be true for new first-time client certificates on new nodes. Those CSRs have never gone through the flow affected by the recent changes, and wouldn't hit the clause changed in the PR. However, I think I may have just seen the beginning of this on an AWS cluster. I scaled up a worker MachineSet, and the actuator somehow created 2 instances for the new Machine. The one created slightly later was the one associated with the Machine object and had its CSRs approved correctly. The earlier one is effectively "leaked" -- it has no backing Machine object and is stuck in a state where it's just repeatedly creating CSRs for its initial client certificate. These come from the bootstrap user, and once enough of them have been created the machine-approver will stop approving new certificates. Seems exactly like what we're seeing here. Just a note: The CA is not normally rotated after 24 hours. The comment in the other ticket is because we were testing with a short-lived CA. I think the 24 hour timeframe here comes from the fact that new CSRs are created every 15-20 minutes, and it takes about 24 hours to hit the limits of the machine-approver. QE, can you confirm if you see any "leaked" instances for these clusters? On AWS you would see multiple EC2 instances in the console with the same name for example, but one would have an IP address / hostname for which no corresponding Node object exists. We worked with Walid to get access to the cluster. The leaked instance was a red herring. That is a separate issue that causes similar symptoms, but doesn't seem to result in the cluster becoming entirely unhealthy. We will file a another bug for that. Also, I misspoke with regards to the 24 hour validity of the CA. I forgot that the installer *does* create an initial CA valid for 24 hours, and the kube-controller-manager-operator creates the longer-lived CA afterwards. At this point I'm convinced this is *not* an issue with the cluster-machine-approver. Here's what we're seeing: - Cluster comes online with initial 24 hour CA. - The machine-approver approves new client certs and serving renewals as expected. - As the 24 hour limit approaches the serving certs renew more and more frequently. - Eventually the 24 hour CA expires, which means client certs expire and new ones are requested with the bootstrap credentials. - The machine-approver correctly refuses to approve these because a Node object already exists with that name. It seems the kube-controller-manager-operator *does* generate new CAs and puts them into the ConfigMap, but the kube-controller-manager is not restarted and so continues signing with the short-lived CA until it expires. Later manually approving the pending CSRs results in this log from the kube-controller-manager: 2019-09-27T13:02:22.144091695+00:00 stderr F E0927 13:02:22.144058 1 certificate_controller.go:147] Sync csr-f9rgf failed with : error auto signing csr: the signer has expired: 2019-09-25 09:51:08 +0000 UTC I've seen some issues related to "backpressure" for certificate rotation. Could this be related? If the timing is right, could the CAs be updated in an order that doesn't allow the kube-controller-manager to be restarted properly? To make progress we'll need to see 1. `oc get secrets -A -oyaml` 2. `oc get configmaps -A -oyaml` 3. `oc get csrs -oyaml` 4. `oc get clusteroperators -oyaml` 5. `oc adm must-gather` output if it works. In 4.2, we *thought* we automatically reloaded the kube-controller-manager with https://github.com/openshift/origin/pull/23717. if must-gather does not work, plan to leave the environments up with an ssh bastion and jump box instructions. I've dug into the pod logs and the biggest offender is that the certsyncer which is responsible for syncing certs (as the name suggests) is stuck unable to reach apiserver, which is working just fine. I'm seeing tons and tons of: reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 2753797 (2754963) │ reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.Secret ended with: too old resource version: 2705909 (2707212) weaved with: reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps?resourceVersio│ n=417111&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.Secret: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/secrets?resourceVersion=3934│ 21&timeout=5m26s&timeoutSeconds=326&watch=true: dial tcp [::1]:6443: connect: connection refused which will prevent proper cert syncing, thus restarting kcm thus proper working of CSR. Having said that, I'm sending this over to the networking team for them to verify their bits. Just a bit more evidence why I don't think it's the SDN: - this connection is 100% host network - it's localhost -> localhost - the SDN seems to be generally up, and hasn't restarted There are a few connection disruptions, but there's no way it's a "down network" for localhost conns. From looking through kcm-o logs, specifically cert-syncer it looks like it's not picking up csr-signer secret, a problem that was solved in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/279. I'm trying to figure out the exact operator version being used there, but based on that PR it's safe to move it out of 4.2. Can I ask you to re-try this with latest release, given the above? I have not seen this issue on recent builds, and also on most recent cluster version version is 4.2.0-0.nightly-2019-10-03-224032 in AWS: # oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-135-127.us-west-2.compute.internal Ready worker 30h v1.14.6+463c73f1f ip-10-0-140-32.us-west-2.compute.internal Ready worker 31h v1.14.6+463c73f1f ip-10-0-141-70.us-west-2.compute.internal Ready master 31h v1.14.6+463c73f1f ip-10-0-151-184.us-west-2.compute.internal Ready worker 31h v1.14.6+463c73f1f ip-10-0-158-95.us-west-2.compute.internal Ready master 31h v1.14.6+463c73f1f ip-10-0-168-128.us-west-2.compute.internal Ready master 31h v1.14.6+463c73f1f ip-10-0-174-121.us-west-2.compute.internal Ready worker 31h v1.14.6+463c73f1f ip-10-0-9-25.us-west-2.compute.internal Ready workload 30h v1.14.6+463c73f1f # oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.2.0-0.nightly-2019-10-03-224032 True False 31h Cluster version is 4.2.0-0.nightly-2019-10-03-224032 # oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.2.0-0.nightly-2019-10-03-224032 True False False 31h cloud-credential 4.2.0-0.nightly-2019-10-03-224032 True False False 31h cluster-autoscaler 4.2.0-0.nightly-2019-10-03-224032 True False False 31h console 4.2.0-0.nightly-2019-10-03-224032 True False False 31h dns 4.2.0-0.nightly-2019-10-03-224032 True False False 31h image-registry 4.2.0-0.nightly-2019-10-03-224032 True False False 31h ingress 4.2.0-0.nightly-2019-10-03-224032 True False False 31h insights 4.2.0-0.nightly-2019-10-03-224032 True False False 31h kube-apiserver 4.2.0-0.nightly-2019-10-03-224032 True False False 31h kube-controller-manager 4.2.0-0.nightly-2019-10-03-224032 True False False 31h kube-scheduler 4.2.0-0.nightly-2019-10-03-224032 True False False 31h machine-api 4.2.0-0.nightly-2019-10-03-224032 True False False 31h machine-config 4.2.0-0.nightly-2019-10-03-224032 True False False 31h marketplace 4.2.0-0.nightly-2019-10-03-224032 True False False 31h monitoring 4.2.0-0.nightly-2019-10-03-224032 True False False 31h network 4.2.0-0.nightly-2019-10-03-224032 True False False 31h node-tuning 4.2.0-0.nightly-2019-10-03-224032 True False False 31h openshift-apiserver 4.2.0-0.nightly-2019-10-03-224032 True False False 31h openshift-controller-manager 4.2.0-0.nightly-2019-10-03-224032 True False False 31h openshift-samples 4.2.0-0.nightly-2019-10-03-224032 True False False 31h operator-lifecycle-manager 4.2.0-0.nightly-2019-10-03-224032 True False False 31h operator-lifecycle-manager-catalog 4.2.0-0.nightly-2019-10-03-224032 True False False 31h operator-lifecycle-manager-packageserver 4.2.0-0.nightly-2019-10-03-224032 True False False 12h service-ca 4.2.0-0.nightly-2019-10-03-224032 True False False 31h service-catalog-apiserver 4.2.0-0.nightly-2019-10-03-224032 True False False 31h service-catalog-controller-manager 4.2.0-0.nightly-2019-10-03-224032 True False False 31h storage 4.2.0-0.nightly-2019-10-03-224032 True False False 31h # oc get csr No resources found. I've opened https://github.com/openshift/library-go/pull/546 to add more logging data for next time, but based on previous comment I think we can close this bug. Feel free to re-open if the problems re-appears. Can you provide us with logs from kcm-o and kcm? As well as must-gather dump from a cluster where this appears? *** Bug 1755608 has been marked as a duplicate of this bug. *** Adding TestBlocker from duplicate bug 1755608 I have not seen this issue anymore on OCP 4.2 and 4.3 in: - AWS: 4.2.2 and 4.3.0-0.nightly-2019-11-02-092336 - GCP: 4.3.0-0.nightly-2019-11-08-094604 and 4.3.0-0.nightly-2019-11-02-092336 These clusters were running for several days, some for reliability testing and all nodes are showing Ready. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062 qe_test_coverage flag has been set to minus because i see that the fix applies for a openshift version <=4.8 |