Bug 1755469 - All nodes go NotReady after 24 hours, several pending CSRs
Summary: All nodes go NotReady after 24 hours, several pending CSRs
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.2.0
Hardware: All
OS: Linux
Target Milestone: ---
Target Release: 4.3.0
Assignee: David Eads
QA Contact: Walid A.
Duplicates: 1755608
Depends On: 1755608
Blocks: 1779801
Reported: 2019-09-25 14:39 UTC by Walid A.
Modified: 2021-01-20 21:06 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-05-13 21:25:14 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-05-13 21:25:18 UTC

Description Walid A. 2019-09-25 14:39:32 UTC
Description of problem:
We saw this issue in two separate clusters: Azure and AWS IPI installs of 4.2.0-0.nightly-2019-09-24-025718, each with 3 master and 3 worker nodes. After 24 hours, all nodes go NotReady.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-24-025718   True        False         26h     Cluster version is 4.2.0-0.nightly-2019-09-24-025718

# oc get nodes
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-136-9.us-west-2.compute.internal    NotReady   master   26h   v1.14.6+12a7c657c
ip-10-0-141-42.us-west-2.compute.internal   NotReady   worker   26h   v1.14.6+12a7c657c
ip-10-0-144-58.us-west-2.compute.internal   NotReady   worker   26h   v1.14.6+12a7c657c
ip-10-0-155-76.us-west-2.compute.internal   NotReady   master   26h   v1.14.6+12a7c657c
ip-10-0-174-46.us-west-2.compute.internal   NotReady   worker   26h   v1.14.6+12a7c657c
ip-10-0-174-90.us-west-2.compute.internal   NotReady   master   26h   v1.14.6+12a7c657c

# oc get csr
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-24xt2   165m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-2k4jk   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-4fczm   42m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-56k9q   120m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-64kbw   165m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-6dnph   58m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-79wm9   3h5m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-7pr7h   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-87fxc   73m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9bjmw   89m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9kbw7   150m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bm46g   3h     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cnl94   150m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cvsb4   3h5m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ddhl4   58m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-f66j4   11m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-fmbn8   104m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-gs22k   42m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hg8pw   73m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hx9qf   89m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jd7mp   135m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jfwq2   11m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jrxzd   73m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-k2446   3h5m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-k7g7s   3h     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-kfmqw   104m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-l6ggq   120m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-l6wnn   165m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-mcc9n   135m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-nshk6   150m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rrkxc   135m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rwzcl   3h     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rz7w4   42m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-sjhqm   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-tsqhs   120m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-vj8sg   104m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wk9lb   12m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-z5dzz   89m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zq7h4   58m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

# oc logs -n openshift-cluster-machine-approver machine-approver-d599647f4-kk9tr
Error from server: Get remote error: tls: internal error

# oc -n openshift-config-managed get configmap csr-controller-ca -o json | jq -r '.data["ca-bundle.crt"]' | openssl x509 -noout -subject -dates
subject= /CN=kube-csr-signer_@1569388028
notBefore=Sep 25 05:07:07 2019 GMT
notAfter=Oct 25 05:07:08 2019 GMT

`oc adm must-gather` is timing out, and `oc debug node` is hanging.
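For reference, the `notAfter` check shown above can be scripted. A minimal sketch (the helper `signer_expired` is hypothetical, not part of any OpenShift tooling) that parses the output of `openssl x509 -noout -dates` and decides whether the signer has expired:

```python
from datetime import datetime, timezone

def signer_expired(openssl_dates_output: str, now: datetime) -> bool:
    """Parse the notAfter= line emitted by `openssl x509 -noout -dates`
    and report whether the certificate has expired at the given instant."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            ts = line.split("=", 1)[1].strip()
            if ts.endswith(" GMT"):
                ts = ts[: -len(" GMT")]
            expiry = datetime.strptime(ts, "%b %d %H:%M:%S %Y")
            return now >= expiry.replace(tzinfo=timezone.utc)
    raise ValueError("no notAfter= line in openssl output")

# The csr-controller-ca signer from this bug, checked the day after install:
output = "notBefore=Sep 25 05:07:07 2019 GMT\nnotAfter=Oct 25 05:07:08 2019 GMT"
print(signer_expired(output, datetime(2019, 9, 26, 12, 0, tzinfo=timezone.utc)))  # False: still valid
```

Note that this configmap CA is the month-long one; the signer that actually expired here is the installer's initial 24-hour CA discussed in the comments below.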

Version-Release number of selected component (if applicable):


How reproducible:

Reproduced on two separate clusters, one in AWS and one in Azure.

Steps to Reproduce:
1. IPI Install of an OCP 4.2 cluster in Azure or AWS, 3 masters, 3 worker nodes
2. Wait 24 hours
3. oc get nodes shows all nodes not ready
4. oc get csr shows several pending openshift-machine-config-operator:node-bootstrapper csrs

Actual results:
All nodes go NotReady after 24 hours

Expected results:

Cert rotation/auto-approval successful.  All nodes remain Ready

Additional info:
May be related to BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1737611

Comment 5 Michael Gugino 2019-09-25 19:31:56 UTC
According to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c27

It looks like the CA is rotated after 24 hours; I'm unsure how this might impact the library. This line should prevent any problems with mismatched CAs and cause a fall-back to machine-api-based validation: https://github.com/openshift/cluster-machine-approver/pull/38/files#diff-a482a1a664767249971b0b71ca9baaf0R156  But I'm not 100% certain that the handshake will fail for the purposes of retrieving a serving cert.

There are a variety of checks that should be non-fatal, but if any of them are encountered, we'll never try the machine-api as a fall back: https://github.com/openshift/cluster-machine-approver/pull/38/files#diff-a482a1a664767249971b0b71ca9baaf0R278

It's unclear if CSR address lists, emails, and names will be in the identical order they were once in, for example.

I recommend we always try to validate the machine-api bits in case of any error, rather than returning early if we hit an error in that function.
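That recommendation can be sketched as the following control flow (the names here are hypothetical; the real approver is Go code, this only illustrates the proposed behavior):

```python
class ValidationError(Exception):
    """Raised when the node-client-cert validation path hits a problem."""

def approve(csr, node_cert_checks, machine_api_check) -> bool:
    """Proposed flow: any error in the node-client-cert path is treated
    as non-fatal, and we always fall back to machine-api-based validation
    instead of returning early on the first error."""
    try:
        if all(check(csr) for check in node_cert_checks):
            return True
    except ValidationError:
        pass  # don't return early: fall through to the machine-api path
    return machine_api_check(csr)
```

With this shape, a CSR whose serving-cert checks blow up still gets a chance to be validated against its Machine, rather than being stuck pending forever.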

Comment 6 Michael Gugino 2019-09-25 20:08:00 UTC
PR for 4.2: https://github.com/openshift/cluster-machine-approver/pull/49

Bugzilla bot doesn't want to cooperate; it thinks we should target this bug at 4.2.z instead of 4.2.0.

Comment 7 Alberto 2019-09-26 08:19:13 UTC
As Mike pointed out, the machine-approver logs went missing, so it is difficult to understand the root of the problem here. Interestingly, `oc get csr` gives a list of pending CSRs requested by `system:serviceaccount:openshift-machine-config-operator:node-bootstrapper`, which should only be the case for the bootstrapping client certs. So one guess is that something getting corrupted at some point on a node might be generating indefinite CSRs, causing the machine-approver to reach its short-circuit limit and hence preventing it from handling the renewal properly.

Comment 8 Brad Ison 2019-09-26 13:03:15 UTC
We saw this once before but weren't able to reproduce it at the time. We can talk about the PR to fall back to the machine-api flow separately, but there's essentially no way that will solve this problem. Like Alberto said, these CSRs are coming from the bootstrap user, which should *only* be true for new first-time client certificates on new nodes. Those CSRs have never gone through the flow affected by the recent changes, and wouldn't hit the clause changed in the PR.

However, I think I may have just seen the beginning of this on an AWS cluster. I scaled up a worker MachineSet, and the actuator somehow created 2 instances for the new Machine. The one created slightly later was the one associated with the Machine object and had its CSRs approved correctly. The earlier one is effectively "leaked" -- it has no backing Machine object and is stuck in a state where it's just repeatedly creating CSRs for its initial client certificate. These come from the bootstrap user, and once enough of them have been created the machine-approver will stop approving new certificates. Seems exactly like what we're seeing here.

Just a note: The CA is not normally rotated after 24 hours. The comment in the other ticket is because we were testing with a short-lived CA. I think the 24 hour timeframe here comes from the fact that new CSRs are created every 15-20 minutes, and it takes about 24 hours to hit the limits of the machine-approver.
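The back-of-the-envelope arithmetic for that timeframe looks like this (the 15-20 minute renewal interval is from the comment above; the exact approver limit isn't stated here, so 100 is just an assumed threshold):

```python
def hours_to_hit_limit(limit: int, interval_minutes: float) -> float:
    """How long a node retrying its bootstrap CSR takes to accumulate
    `limit` pending CSRs at one request per interval."""
    return limit * interval_minutes / 60

# One CSR every 15-20 minutes against an assumed limit of 100 pending CSRs:
print(hours_to_hit_limit(100, 15))  # 25.0 hours at the fast end
print(hours_to_hit_limit(100, 20))  # ~33.3 hours at the slow end
```

Either way the order of magnitude lands at roughly a day, consistent with the 24-hour symptom.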

QE, can you confirm if you see any "leaked" instances for these clusters? On AWS you would see multiple EC2 instances in the console with the same name for example, but one would have an IP address / hostname for which no corresponding Node object exists.

Comment 13 Brad Ison 2019-09-27 15:10:54 UTC
We worked with Walid to get access to the cluster. The leaked instance was a red herring. That is a separate issue that causes similar symptoms, but doesn't seem to result in the cluster becoming entirely unhealthy. We will file another bug for that.

Also, I misspoke with regards to the 24 hour validity of the CA. I forgot that the installer *does* create an initial CA valid for 24 hours, and the kube-controller-manager-operator creates the longer-lived CA afterwards.

At this point I'm convinced this is *not* an issue with the cluster-machine-approver. Here's what we're seeing:

- Cluster comes online with initial 24 hour CA.
- The machine-approver approves new client certs and serving renewals as expected.
- As the 24 hour limit approaches the serving certs renew more and more frequently.
- Eventually the 24 hour CA expires, which means client certs expire and new ones are requested with the bootstrap credentials.
- The machine-approver correctly refuses to approve these because a Node object already exists with that name.
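The last step above can be sketched as a simple predicate (a hypothetical helper, not the approver's actual code):

```python
def approve_bootstrap_csr(requested_node_name: str, existing_nodes: set) -> bool:
    """A first-time client CSR from the bootstrap credentials is only
    approved when no Node with that name exists yet. A node whose client
    cert has expired and that re-bootstraps under its own name is
    therefore refused, which is exactly the pending-CSR pile-up seen here."""
    return requested_node_name not in existing_nodes
```

So the approver is behaving as designed; the real defect is that the signer expired underneath the running nodes.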

It seems the kube-controller-manager-operator *does* generate new CAs and puts them into the ConfigMap, but the kube-controller-manager is not restarted and so continues signing with the short-lived CA until it expires. Later manually approving the pending CSRs results in this log from the kube-controller-manager:

2019-09-27T13:02:22.144091695+00:00 stderr F E0927 13:02:22.144058       1 certificate_controller.go:147] Sync csr-f9rgf failed with : error auto signing csr: the signer has expired: 2019-09-25 09:51:08 +0000 UTC

I've seen some issues related to "backpressure" for certificate rotation. Could this be related? If the timing is right, could the CAs be updated in an order that doesn't allow the kube-controller-manager to be restarted properly?

Comment 14 David Eads 2019-09-27 18:24:05 UTC
To make progress we'll need to see 

1. `oc get secrets -A -oyaml`
2. `oc get configmaps -A -oyaml`
3. `oc get csrs -oyaml`
4. `oc get clusteroperators -oyaml`
5. `oc adm must-gather` output if it works.

In 4.2, we *thought* we automatically reloaded the kube-controller-manager with https://github.com/openshift/origin/pull/23717.

If must-gather does not work, plan to leave the environments up with an ssh bastion and jump-box instructions.

Comment 21 Maciej Szulik 2019-10-02 15:51:23 UTC
I've dug into the pod logs and the biggest offender is the cert-syncer, which is responsible for syncing certs (as the name suggests): it is stuck, unable to reach the apiserver, which is otherwise working just fine. I'm seeing tons and tons of:

reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 2753797 (2754963)
reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.Secret ended with: too old resource version: 2705909 (2707212)

interleaved with:

reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps?resourceVersion=417111&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused
reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.Secret: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/secrets?resourceVersion=393421&timeout=5m26s&timeoutSeconds=326&watch=true: dial tcp [::1]:6443: connect: connection refused

which prevents proper cert syncing, and with it the restarting of the KCM and the proper handling of CSRs.
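For context, the two log signatures call for different recovery behavior; a sketch of the distinction (assumed categories for illustration, not client-go's actual code):

```python
def reflector_recovery(err_msg: str) -> str:
    """'too old resource version' is benign: the reflector just re-lists
    to obtain a fresh resourceVersion and resumes watching. 'connection
    refused' means the endpoint itself is unreachable, and only retrying
    with backoff helps, so a steady stream of these points at the
    connection to the apiserver rather than at the watch protocol."""
    if "too old resource version" in err_msg:
        return "relist"
    if "connection refused" in err_msg:
        return "backoff-and-retry"
    return "unknown"
```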

Having said that, I'm sending this over to the networking team for them to verify their bits.

Comment 25 Casey Callendrello 2019-10-02 17:17:13 UTC
Just a bit more evidence why I don't think it's the SDN:
- this connection is 100% host network
- it's localhost -> localhost
- the SDN seems to be generally up, and hasn't restarted

There are a few connection disruptions, but there's no way it's a "down network" for localhost conns.

Comment 28 Maciej Szulik 2019-10-04 15:16:12 UTC
From looking through the kcm-o logs, specifically the cert-syncer, it looks like it's not picking up the csr-signer secret, a problem
that was solved in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/279. I'm trying to figure
out the exact operator version being used there, but based on that PR it's safe to move it out of 4.2.

Given the above, can I ask you to re-try this with the latest release?

Comment 29 Walid A. 2019-10-06 01:29:08 UTC
I have not seen this issue on recent builds, including the most recent cluster version, 4.2.0-0.nightly-2019-10-03-224032, in AWS:

 # oc get nodes
NAME                                         STATUS   ROLES      AGE   VERSION
ip-10-0-135-127.us-west-2.compute.internal   Ready    worker     30h   v1.14.6+463c73f1f
ip-10-0-140-32.us-west-2.compute.internal    Ready    worker     31h   v1.14.6+463c73f1f
ip-10-0-141-70.us-west-2.compute.internal    Ready    master     31h   v1.14.6+463c73f1f
ip-10-0-151-184.us-west-2.compute.internal   Ready    worker     31h   v1.14.6+463c73f1f
ip-10-0-158-95.us-west-2.compute.internal    Ready    master     31h   v1.14.6+463c73f1f
ip-10-0-168-128.us-west-2.compute.internal   Ready    master     31h   v1.14.6+463c73f1f
ip-10-0-174-121.us-west-2.compute.internal   Ready    worker     31h   v1.14.6+463c73f1f
ip-10-0-9-25.us-west-2.compute.internal      Ready    workload   30h   v1.14.6+463c73f1f

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-10-03-224032   True        False         31h     Cluster version is 4.2.0-0.nightly-2019-10-03-224032

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
cloud-credential                           4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
cluster-autoscaler                         4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
console                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
dns                                        4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
image-registry                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
ingress                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
insights                                   4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
kube-apiserver                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
kube-controller-manager                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
kube-scheduler                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
machine-api                                4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
machine-config                             4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
marketplace                                4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
monitoring                                 4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
network                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
node-tuning                                4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
openshift-apiserver                        4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
openshift-controller-manager               4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
openshift-samples                          4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-10-03-224032   True        False         False      12h
service-ca                                 4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
service-catalog-apiserver                  4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
service-catalog-controller-manager         4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h
storage                                    4.2.0-0.nightly-2019-10-03-224032   True        False         False      31h

# oc get csr
No resources found.

Comment 30 Maciej Szulik 2019-10-07 20:31:06 UTC
I've opened https://github.com/openshift/library-go/pull/546 to add more logging data for next time, but based on the previous comment I think we can close this bug.
Feel free to re-open if the problem re-appears.

Comment 35 Maciej Szulik 2019-10-28 20:07:51 UTC
Can you provide us with logs from kcm-o and kcm? As well as must-gather dump from a cluster where this appears?

Comment 39 Alberto 2019-11-06 13:19:05 UTC
*** Bug 1755608 has been marked as a duplicate of this bug. ***

Comment 40 Mike Fiedler 2019-11-06 15:44:45 UTC
Adding TestBlocker from duplicate bug 1755608

Comment 42 Walid A. 2019-11-10 15:12:19 UTC
I have not seen this issue anymore on OCP 4.2 and 4.3 in:
- AWS: 4.2.2 and 4.3.0-0.nightly-2019-11-02-092336
- GCP: 4.3.0-0.nightly-2019-11-08-094604 and 4.3.0-0.nightly-2019-11-02-092336

These clusters were running for several days, some for reliability testing and all nodes are showing Ready.

Comment 44 errata-xmlrpc 2020-05-13 21:25:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

