Description of problem:

We saw this issue on two separate clusters: Azure and AWS IPI installs of 4.2.0-0.nightly-2019-09-24-025718, each with 3 master and 3 worker nodes. After 24 hours all nodes go NotReady.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-24-025718   True        False         26h     Cluster version is 4.2.0-0.nightly-2019-09-24-025718

# oc get nodes
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-136-9.us-west-2.compute.internal    NotReady   master   26h   v1.14.6+12a7c657c
ip-10-0-141-42.us-west-2.compute.internal   NotReady   worker   26h   v1.14.6+12a7c657c
ip-10-0-144-58.us-west-2.compute.internal   NotReady   worker   26h   v1.14.6+12a7c657c
ip-10-0-155-76.us-west-2.compute.internal   NotReady   master   26h   v1.14.6+12a7c657c
ip-10-0-174-46.us-west-2.compute.internal   NotReady   worker   26h   v1.14.6+12a7c657c
ip-10-0-174-90.us-west-2.compute.internal   NotReady   master   26h   v1.14.6+12a7c657c

# oc get csr
NAME        AGE    REQUESTOR                                                                    CONDITION
csr-24xt2   165m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2k4jk   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-4fczm   42m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-56k9q   120m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-64kbw   165m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-6dnph   58m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-79wm9   3h5m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-7pr7h   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-87fxc   73m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9bjmw   89m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9kbw7   150m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bm46g   3h     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cnl94   150m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cvsb4   3h5m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ddhl4   58m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-f66j4   11m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-fmbn8   104m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-gs22k   42m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hg8pw   73m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hx9qf   89m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jd7mp   135m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jfwq2   11m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jrxzd   73m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-k2446   3h5m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-k7g7s   3h     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-kfmqw   104m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-l6ggq   120m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-l6wnn   165m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-mcc9n   135m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-nshk6   150m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rrkxc   135m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rwzcl   3h     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rz7w4   42m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-sjhqm   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-tsqhs   120m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-vj8sg   104m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wk9lb   12m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-z5dzz   89m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zq7h4   58m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

# oc logs -n openshift-cluster-machine-approver machine-approver-d599647f4-kk9tr
Error from server: Get https://10.0.136.9:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-d599647f4-kk9tr/machine-approver-controller: remote error: tls: internal error

# oc -n openshift-config-managed get configmap csr-controller-ca -o json | jq -r '.data["ca-bundle.crt"]' | openssl x509 -noout -subject -dates
subject= /CN=kube-csr-signer_@1569388028
notBefore=Sep 25 05:07:07 2019 GMT
notAfter=Oct 25 05:07:08 2019 GMT

`oc adm must-gather` is timing out and `oc debug node` hangs.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-24-025718

How reproducible:
Reproduced on two separate clusters, one in AWS and one in Azure.

Steps to Reproduce:
1. IPI install of an OCP 4.2 cluster in Azure or AWS, 3 masters, 3 worker nodes
2. Wait 24 hours
3. oc get nodes shows all nodes NotReady
4. oc get csr shows several pending openshift-machine-config-operator:node-bootstrapper CSRs

Actual results:
All nodes go NotReady after 24 hours.

Expected results:
Cert rotation/auto-approval succeeds and all nodes remain Ready.

Additional info:
May be related to BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1737611
According to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c27 it looks like the CA is rotated after 24 hours; I'm unsure how this might impact the library. This line should prevent any problems with mismatched CAs and cause a fall-back to machine-api-based validation: https://github.com/openshift/cluster-machine-approver/pull/38/files#diff-a482a1a664767249971b0b71ca9baaf0R156

But I'm not 100% certain that the handshake will fail for the purposes of retrieving a serving cert. There are a variety of checks that should be non-fatal, but if any of them fails, we'll never try the machine-api as a fall-back: https://github.com/openshift/cluster-machine-approver/pull/38/files#diff-a482a1a664767249971b0b71ca9baaf0R278

It's unclear, for example, whether the CSR address lists, emails, and names will be in the identical order they were once in. I recommend we always try to validate the machine-api bits in case of any error, rather than returning early if we hit an error in that function.
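For illustration, here is a minimal Go sketch of the flow being recommended above. The function names are made up for the example and are not the actual cluster-machine-approver code; the point is only the shape: any error from the serving-cert checks falls through to the machine-api-based validation instead of rejecting the CSR outright.

```go
package main

import (
	"errors"
	"fmt"
)

// validateAgainstServingCert stands in for the serving-cert path discussed above
// (TLS handshake with the node, comparing the CSR's addresses/names). Illustrative only.
func validateAgainstServingCert(csrName string) error {
	return errors.New("handshake failed or CSR fields did not match the existing cert")
}

// validateAgainstMachineAPI stands in for the machine-api path (matching the CSR's
// addresses against a Machine object). Illustrative only.
func validateAgainstMachineAPI(csrName string) bool {
	return true
}

// authorize shows the recommended shape: on *any* error from the serving-cert
// checks, fall back to the machine-api validation rather than returning early.
func authorize(csrName string) bool {
	if err := validateAgainstServingCert(csrName); err != nil {
		fmt.Printf("serving-cert validation for %s failed (%v); falling back to machine-api\n", csrName, err)
		return validateAgainstMachineAPI(csrName)
	}
	return true
}

func main() {
	fmt.Println("approved:", authorize("csr-example"))
}
```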
PR for 4.2: https://github.com/openshift/cluster-machine-approver/pull/49

The Bugzilla bot doesn't want to cooperate; it thinks we should target this bug at 4.2.z instead of 4.2.0.
As Mike pointed out, the machine-approver logs went missing, so it is difficult to understand the root of the problem here. Interestingly, `oc get csr` gives a list of pending CSRs requested by `system:serviceaccount:openshift-machine-config-operator:node-bootstrapper`, which should only be the case for the bootstrapping client certs. So one guess is that something got corrupted at some point for a node and it is generating CSRs indefinitely, causing the machine-approver to reach its short-circuiting limit and hence preventing it from handling the renewal properly.
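As a rough sketch of the kind of short-circuit guard being described (the limit and names here are hypothetical; the real values in cluster-machine-approver may differ): once enough CSRs are pending, nothing further gets approved, including legitimate renewals.

```go
package main

import (
	"fmt"

	certificatesv1beta1 "k8s.io/api/certificates/v1beta1"
)

// maxPendingCSRs is a hypothetical limit for this sketch.
const maxPendingCSRs = 100

// isPending reports whether a CSR has been neither approved nor denied yet.
func isPending(csr certificatesv1beta1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certificatesv1beta1.CertificateApproved || c.Type == certificatesv1beta1.CertificateDenied {
			return false
		}
	}
	return true
}

// shouldShortCircuit mirrors the behaviour described above: a flood of bootstrap
// CSRs from a wedged or leaked node pushes the pending count over the limit, and
// the approver then refuses to approve anything, which also blocks renewals.
func shouldShortCircuit(csrs []certificatesv1beta1.CertificateSigningRequest) bool {
	pending := 0
	for _, csr := range csrs {
		if isPending(csr) {
			pending++
		}
	}
	return pending > maxPendingCSRs
}

func main() {
	var csrs []certificatesv1beta1.CertificateSigningRequest // normally listed from the API
	fmt.Println("short-circuit:", shouldShortCircuit(csrs))
}
```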
We saw this once before but weren't able to reproduce it at the time. We can talk about the PR to fallback to the machine-api flow separately, but there's essentially no way that will solve this problem. Like Alberto said, these CSRs are coming from the bootstrap user, which should *only* be true for new first-time client certificates on new nodes. Those CSRs have never gone through the flow affected by the recent changes, and wouldn't hit the clause changed in the PR.

However, I think I may have just seen the beginning of this on an AWS cluster. I scaled up a worker MachineSet, and the actuator somehow created 2 instances for the new Machine. The one created slightly later was the one associated with the Machine object and had its CSRs approved correctly. The earlier one is effectively "leaked" -- it has no backing Machine object and is stuck in a state where it's just repeatedly creating CSRs for its initial client certificate. These come from the bootstrap user, and once enough of them have been created the machine-approver will stop approving new certificates. Seems exactly like what we're seeing here.

Just a note: the CA is not normally rotated after 24 hours. The comment in the other ticket is because we were testing with a short-lived CA. I think the 24 hour timeframe here comes from the fact that new CSRs are created every 15-20 minutes, and it takes about 24 hours to hit the limits of the machine-approver.

QE, can you confirm if you see any "leaked" instances for these clusters? On AWS you would see multiple EC2 instances in the console with the same name, for example, but one would have an IP address / hostname for which no corresponding Node object exists; see the sketch below for the kind of check I mean.
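As a diagnostic aid only (not tooling we ship), a small client-go sketch of that check, written against the pre-context List signature of the 1.14-era client-go: given hostnames taken from the EC2 console or AWS CLI, report which ones have no corresponding Node object, i.e. candidates for "leaked" instances.

```go
package main

import (
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from $KUBECONFIG (falls back to in-cluster config if unset).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Collect the names of all Node objects in the cluster.
	nodes, err := client.CoreV1().Nodes().List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	known := map[string]bool{}
	for _, n := range nodes.Items {
		known[n.Name] = true
	}

	// Hostnames to check are passed as arguments, e.g.
	//   go run main.go ip-10-0-141-42.us-west-2.compute.internal ...
	for _, host := range os.Args[1:] {
		if !known[host] {
			fmt.Printf("%s has no Node object -- possibly a leaked instance\n", host)
		}
	}
}
```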
We worked with Walid to get access to the cluster. The leaked instance was a red herring. That is a separate issue that causes similar symptoms, but doesn't seem to result in the cluster becoming entirely unhealthy. We will file another bug for that.

Also, I misspoke with regard to the 24 hour validity of the CA. I forgot that the installer *does* create an initial CA valid for 24 hours, and the kube-controller-manager-operator creates the longer-lived CA afterwards.

At this point I'm convinced this is *not* an issue with the cluster-machine-approver. Here's what we're seeing:

- Cluster comes online with the initial 24 hour CA.
- The machine-approver approves new client certs and serving renewals as expected.
- As the 24 hour limit approaches, the serving certs renew more and more frequently.
- Eventually the 24 hour CA expires, which means client certs expire and new ones are requested with the bootstrap credentials.
- The machine-approver correctly refuses to approve these because a Node object already exists with that name.

It seems the kube-controller-manager-operator *does* generate new CAs and puts them into the ConfigMap, but the kube-controller-manager is not restarted and so continues signing with the short-lived CA until it expires. Manually approving the pending CSRs later results in this log from the kube-controller-manager:

2019-09-27T13:02:22.144091695+00:00 stderr F E0927 13:02:22.144058 1 certificate_controller.go:147] Sync csr-f9rgf failed with : error auto signing csr: the signer has expired: 2019-09-25 09:51:08 +0000 UTC

I've seen some issues related to "backpressure" for certificate rotation. Could this be related? If the timing is right, could the CAs be updated in an order that doesn't allow the kube-controller-manager to be restarted properly?
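For reference, a small, purely illustrative Go sketch of the expiry check behind that "the signer has expired" error: parse a signer CA/cert PEM (for example the csr-controller-ca bundle dumped earlier in this bug, saved to a file) and compare notAfter against the current time.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"io/ioutil"
	"os"
	"time"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: check-signer <pem-file>")
		os.Exit(1)
	}
	data, err := ioutil.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	// Walk every PEM block in the bundle and report its validity window.
	for {
		var block *pem.Block
		block, data = pem.Decode(data)
		if block == nil {
			break
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			continue // skip non-certificate blocks
		}
		expired := time.Now().After(cert.NotAfter)
		fmt.Printf("%s: notBefore=%s notAfter=%s expired=%v\n",
			cert.Subject.CommonName, cert.NotBefore, cert.NotAfter, expired)
	}
}
```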
To make progress we'll need to see:

1. `oc get secrets -A -oyaml`
2. `oc get configmaps -A -oyaml`
3. `oc get csrs -oyaml`
4. `oc get clusteroperators -oyaml`
5. `oc adm must-gather` output, if it works.

In 4.2, we *thought* we automatically reloaded the kube-controller-manager with https://github.com/openshift/origin/pull/23717.

If must-gather does not work, plan to leave the environments up with an ssh bastion and jump box instructions.
I've dug into the pod logs and the biggest offender is the cert syncer, which is responsible for syncing certs (as the name suggests) and is stuck unable to reach the apiserver, even though the apiserver itself is working just fine. I'm seeing tons and tons of:

reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 2753797 (2754963)
reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.Secret ended with: too old resource version: 2705909 (2707212)

interleaved with:

reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps?resourceVersion=417111&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused
reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.Secret: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/secrets?resourceVersion=393421&timeout=5m26s&timeoutSeconds=326&watch=true: dial tcp [::1]:6443: connect: connection refused

This prevents proper cert syncing, which prevents the kcm from being restarted with the new certs, which in turn prevents CSRs from being handled properly. Having said that, I'm sending this over to the networking team for them to verify their bits.
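For context, a minimal illustration (not the real cert-syncer code) of the client-go informer pattern those reflector messages come from: the syncer watches ConfigMaps and Secrets in openshift-kube-controller-manager, and if the list/watch against the local apiserver keeps failing, the on-disk cert material never gets refreshed. The kubeconfig path is an assumption for the example.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog"
)

func main() {
	// Assumption: illustrative kubeconfig path; the real syncer runs in-cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Shared informers scoped to the kube-controller-manager namespace.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute,
		informers.WithNamespace("openshift-kube-controller-manager"),
	)

	handler := cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(old, new interface{}) {
			// The real syncer would write the updated cert material to disk here.
			klog.Info("cert-bearing object changed; would sync to disk")
		},
	}
	factory.Core().V1().ConfigMaps().Informer().AddEventHandler(handler)
	factory.Core().V1().Secrets().Informer().AddEventHandler(handler)

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```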
Just a bit more evidence why I don't think it's the SDN:

- this connection is 100% host network
- it's localhost -> localhost
- the SDN seems to be generally up, and hasn't restarted

There are a few connection disruptions, but there's no way it's a "down network" for localhost conns.
From looking through the kcm-o logs, specifically the cert-syncer, it looks like it's not picking up the csr-signer secret, a problem that was solved in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/279. I'm trying to figure out the exact operator version being used there, but based on that PR it's safe to move this out of 4.2. Can I ask you to re-try this with the latest release, given the above?
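If someone wants to double-check this symptom on a live cluster, here is a rough diagnostic sketch that compares the csr-signer secret in the API with the cert the syncer should have written to disk for the static pod. The namespace, secret key, on-disk path, and kubeconfig path below are assumptions for illustration, and the client-go calls use the pre-context signatures of that release line; adjust to the actual layout on the master being inspected.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"io/ioutil"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	// Assumptions for illustration only.
	signerNamespace = "openshift-kube-controller-manager"
	onDiskCertPath  = "/etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/csr-signer/tls.crt"
)

// notAfter returns the expiry of the first certificate in a PEM blob.
func notAfter(pemBytes []byte) string {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return "no PEM data"
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return err.Error()
	}
	return cert.NotAfter.String()
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // illustrative path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// What the API currently holds.
	secret, err := client.CoreV1().Secrets(signerNamespace).Get("csr-signer", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("csr-signer secret notAfter: ", notAfter(secret.Data["tls.crt"]))

	// What the cert syncer actually wrote to disk (run on the master itself).
	onDisk, err := ioutil.ReadFile(onDiskCertPath)
	if err != nil {
		panic(err)
	}
	fmt.Println("on-disk signer cert notAfter:", notAfter(onDisk))
}
```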
I have not seen this issue on recent builds. The most recent cluster is running version 4.2.0-0.nightly-2019-10-03-224032 in AWS:

# oc get nodes
NAME                                         STATUS   ROLES      AGE   VERSION
ip-10-0-135-127.us-west-2.compute.internal   Ready    worker     30h   v1.14.6+463c73f1f
ip-10-0-140-32.us-west-2.compute.internal    Ready    worker     31h   v1.14.6+463c73f1f
ip-10-0-141-70.us-west-2.compute.internal    Ready    master     31h   v1.14.6+463c73f1f
ip-10-0-151-184.us-west-2.compute.internal   Ready    worker     31h   v1.14.6+463c73f1f
ip-10-0-158-95.us-west-2.compute.internal    Ready    master     31h   v1.14.6+463c73f1f
ip-10-0-168-128.us-west-2.compute.internal   Ready    master     31h   v1.14.6+463c73f1f
ip-10-0-174-121.us-west-2.compute.internal   Ready    worker     31h   v1.14.6+463c73f1f
ip-10-0-9-25.us-west-2.compute.internal      Ready    workload   30h   v1.14.6+463c73f1f

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-10-03-224032   True        False         31h     Cluster version is 4.2.0-0.nightly-2019-10-03-224032

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
cloud-credential                           4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
cluster-autoscaler                         4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
console                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
dns                                        4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
image-registry                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
ingress                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
insights                                   4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
kube-apiserver                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
kube-controller-manager                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
kube-scheduler                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
machine-api                                4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
machine-config                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
marketplace                                4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
monitoring                                 4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
network                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
node-tuning                                4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
openshift-apiserver                        4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
openshift-controller-manager               4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
openshift-samples                          4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-10-03-224032   True        False         False      12h
service-ca                                 4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
service-catalog-apiserver                  4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
service-catalog-controller-manager         4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
storage                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h

# oc get csr
No resources found.
I've opened https://github.com/openshift/library-go/pull/546 to add more logging data for next time, but based on the previous comment I think we can close this bug. Feel free to re-open if the problem re-appears.
Can you provide us with logs from the kcm-o and kcm, as well as a must-gather dump from a cluster where this appears?
*** Bug 1755608 has been marked as a duplicate of this bug. ***
Adding TestBlocker from duplicate bug 1755608
I have not seen this issue anymore on OCP 4.2 and 4.3 in:

- AWS: 4.2.2 and 4.3.0-0.nightly-2019-11-02-092336
- GCP: 4.3.0-0.nightly-2019-11-08-094604 and 4.3.0-0.nightly-2019-11-02-092336

These clusters were running for several days, some for reliability testing, and all nodes are showing Ready.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
The qe_test_coverage flag has been set to minus because I see that the fix applies to OpenShift versions <= 4.8.