Bug 1684049
| Summary: | Worker node client certificate rotation fails after it expires | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | weiwei jiang <wjiang> |
| Component: | Master | Assignee: | David Eads <deads> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | akrzos, aos-bugs, deads, erich, florin-alexandru.peter, hongkliu, jeder, jiazha, jokerman, juzhao, mifiedle, mmccomas, nelluri, sponnaga, wsun, xtian |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-40 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:44:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1671928 | | |
| Attachments: | | | |
Description weiwei jiang 2019-02-28 10:21:43 UTC
To debug, `openshift-must-gather inspect clusteroperators` is very helpful (see https://github.com/openshift/must-gather). It gathers the KAS and KCM configurations and logs and the current operator states. For kubelet cert issues, `openshift-must-gather inspect csr`, the actual certificates being used by the kubelet, and the kubelet logs are necessary. Adding Eric in case he wants to start a KCS article. Adding Seth in case he has other information he would want (this could just as easily arrive as a kubelet issue).

Created attachment 1539548 [details]
cluster info for Comment 4

Created attachment 1539549 [details]
kubeconfig file

Attached kubeconfig file if you need to debug.
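For the kubelet cert issues mentioned above, a minimal way to eyeball the CSR flow and the kubelet's on-disk client certificate might look like the sketch below. This is not taken from the report; the `<csr_name>` and `<node_name>` placeholders and the certificate path are assumptions about a typical 4.x node, and it assumes `oc debug node/...` is available in this release.

```
# List certificate signing requests; working client-cert rotation should keep
# producing CSRs that end up Approved,Issued
oc get csr

# Manually approve a pending kubelet CSR if the automatic approver is stuck
oc adm certificate approve <csr_name>

# On a node, check the validity window of the current kubelet client certificate
# (path assumed to be the usual kubelet location)
oc debug node/<node_name> -- chroot /host \
  openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem
```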
At least one problem in your cluster was solved by https://github.com/openshift/origin/pull/22178:

```
F0228 18:18:29.430158       1 hooks.go:188] PostStartHook "crd-discovery-available" failed: unable to retrieve the complete list of server APIs: packages.apps.redhat.com/v1alpha1: Unauthorized
```

This is preventing future rollouts.

Created attachment 1539676 [details]
cluster operator information

Created attachment 1539707 [details]
cluster operator information when worker nodes are in NotReady status
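Since the fatal error points at the aggregated `packages.apps.redhat.com/v1alpha1` group, one rough way to narrow such failures down (not part of this report) is to look at the aggregated APIService objects and the pods backing them. The APIService name below is inferred from the group/version in the log, not confirmed elsewhere here:

```
# List aggregated API services and their availability
oc get apiservices

# Inspect the package-server APIService the log reports as Unauthorized
# (name assumed from the group/version in the error message)
oc describe apiservice v1alpha1.packages.apps.redhat.com

# Check the pods backing that service in the OLM namespace
oc get pods -n openshift-operator-lifecycle-manager
```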
Adding the Regression keyword; the worker nodes could be in Running status with our previous payload, such as 4.0.0-0.nightly-2019-02-26-125216.

(In reply to Junqi Zhao from comment #14)
> add Regression keyword, the worker nodes could be in Running status with our
> previous payload, such as 4.0.0-0.nightly-2019-02-26-125216

I mean worker nodes could be in Running status for more than a day or even longer.

Just to clarify: comment 9 indicated that https://github.com/openshift/origin/pull/22178 is necessary for the cluster to survive having aggregated apiservers like metrics and package server going down. I don't know why those two are problematic and not the kube-apiserver, but that pull adds toleration for downed elements. Without that, revisions will fail and certs will expire. Your cluster gives you about 8 hours warning before death right now. Alerts should be firing because the clusteroperator is failing. If you can reproduce after 22178, we may have something new to look at.

The latest payload 4.0.0-0.nightly-2019-03-04-033148 doesn't merge the above PR in, and thus the env set up with it can reproduce the issue, so wait for a build that includes the fix PR.

```
$ oc get no -w   # watch nodes
...
ip-10-0-170-155.us-east-2.compute.internal   Ready      worker   5h6m   v1.12.4+4dd65df23d
ip-10-0-170-155.us-east-2.compute.internal   NotReady   worker   5h7m   v1.12.4+4dd65df23d
...
$ oc get po -n openshift-kube-apiserver
kube-apiserver-ip-10-0-143-237.us-east-2.compute.internal   1/1   Running            0    96m
kube-apiserver-ip-10-0-151-159.us-east-2.compute.internal   0/1   CrashLoopBackOff   19   77m
kube-apiserver-ip-10-0-162-22.us-east-2.compute.internal    1/1   Running            0    97m
```

(In reply to Xingxing Xia from comment #19)
> The latest payload 4.0.0-0.nightly-2019-03-04-033148 doesn't merge above PR
> in, and thus the env set up with it can reproduce it, so wait for build that
> includes the fix PR
> $ oc get po -n openshift-kube-apiserver
> kube-apiserver-ip-10-0-143-237.us-east-2.compute.internal   1/1   Running            0    96m
> kube-apiserver-ip-10-0-151-159.us-east-2.compute.internal   0/1   CrashLoopBackOff   19   77m
> kube-apiserver-ip-10-0-162-22.us-east-2.compute.internal    1/1   Running            0    97m

```
$ oc -n openshift-kube-apiserver describe pod kube-apiserver-ip-10-0-151-159.us-east-2.compute.internal
E0304 14:11:32.058871       1 reflector.go:134] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)
I0304 14:11:32.068426       1 cache.go:39] Caches are synced for AvailableConditionController controller
I0304 14:11:32.069364       1 cache.go:39] Caches are synced for autoregister controller
I0304 14:11:32.152218       1 cache.go:39] Caches are synced for APIServiceRegistrationController controller
I0304 14:11:32.258916       1 controller_utils.go:1034] Caches are synced for crd-autoregister controller
E0304 14:11:32.391394       1 autoregister_controller.go:190] v1beta1.cloudcredential.openshift.io failed with : apiservices.apiregistration.k8s.io "v1beta1.cloudcredential.openshift.io" already exists
E0304 14:11:32.391653       1 autoregister_controller.go:190] v1beta1.machine.openshift.io failed with : apiservices.apiregistration.k8s.io "v1beta1.machine.openshift.io" already exists
E0304 14:11:32.463740       1 autoregister_controller.go:190] v1alpha1.marketplace.redhat.com failed with : apiservices.apiregistration.k8s.io "v1alpha1.marketplace.redhat.com" already exists
W0304 14:11:32.470201       1 lease.go:226] Resetting endpoints for master service "kubernetes" to [10.0.137.255 10.0.144.117 10.0.169.231]
E0304 14:11:32.534603       1 memcache.go:140] couldn't get resource list for packages.apps.redhat.com/v1alpha1: Unauthorized
E0304 14:11:32.657012       1 memcache.go:140] couldn't get resource list for packages.apps.redhat.com/v1alpha1: Unauthorized
I0304 14:11:32.860646       1 storage_scheduling.go:100] all system priority classes are created successfully or already exist.
F0304 14:11:32.929751       1 hooks.go:188] PostStartHook "crd-discovery-available" failed: unable to retrieve the complete list of server APIs: packages.apps.redhat.com/v1alpha1: Unauthorized
```

The problem does not seem to be occurring on 4.0.0-0.nightly-2019-03-04-095138:

- kube-apiserver pods are stable
- no crashloop
- describe on the kube-apiserver pods shows no error messages
- after 3 hours, nodes are still Ready
- oc get clusteroperator does show that kube-scheduler, kube-controller-manager and openshift-apiserver had an issue; investigating the logs:

```
# oc get clusteroperators
NAME                                   VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
cluster-autoscaler                               True        False         False     152m
cluster-storage-operator                         True        False         False     150m
console                                          True        False         False     144m
dns                                              True        False         False     165m
image-registry                                   True        False         False     149m
ingress                                          True        False         False     145m
kube-apiserver                                   True        False         False     162m
kube-controller-manager                          True        False         False     35m
kube-scheduler                                   True        False         False     34m
machine-api                                      True        False         False     153m
machine-config                                   True        False         False     35m
marketplace-operator                             True        False         False     150m
monitoring                                       True        False         False     143m
network                                          True        False         False     145m
node-tuning                                      True        False         False     145m
openshift-apiserver                              True        False         False     34m
openshift-authentication                         True        False         False     154m
openshift-cloud-credential-operator              True        False         False     153m
openshift-controller-manager                     True        False         False     150m
openshift-samples                                True        False         False     149m
operator-lifecycle-manager                       True        False         False     152m
```

We will continue soaking. This is on 4.0.0-0.nightly-2019-03-04-095138.
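A soak check like the one just described could be scripted roughly as follows; the 10-minute interval and the `apiserver` label selector (which appears later in this report) are choices of this sketch, not prescribed anywhere:

```
# Periodically record node readiness, kube-apiserver pod health and operator status
while true; do
  date
  oc get nodes
  oc get pods -n openshift-kube-apiserver -l apiserver
  oc get clusteroperators
  sleep 600   # interval is arbitrary
done
```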
Comment 21 is incorrect; after about 5 hours, openshift-kube-apiserver started crashlooping:

```
kube-apiserver-ip-10-0-129-56.us-east-2.compute.internal    1/1   Running            0    88m
kube-apiserver-ip-10-0-155-119.us-east-2.compute.internal   0/1   CrashLoopBackOff   17   68m
kube-apiserver-ip-10-0-171-110.us-east-2.compute.internal   1/1   Running            0    89m
```

```
Last State:  Terminated
  Reason:    Error
  Message:   34] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
E0304 17:18:04.525810       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.526366       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.526795       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.527313       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.527745       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.528458       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.529128       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.529612       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.530116       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.530649       1 handler_proxy.go:131] error resolving openshift-operator-lifecycle-manager/v1alpha1-packages-apps-redhat-com: service "v1alpha1-packages-apps-redhat-com" not found
I0304 17:18:04.627336       1 cache.go:39] Caches are synced for AvailableConditionController controller
W0304 17:18:04.738029       1 lease.go:226] Resetting endpoints for master service "kubernetes" to [10.0.129.56 10.0.155.119 10.0.171.110]
E0304 17:18:04.880644       1 memcache.go:140] couldn't get resource list for packages.apps.redhat.com/v1alpha1: Unauthorized
F0304 17:18:05.004825       1 hooks.go:188] PostStartHook "crd-discovery-available" failed: unable to retrieve the complete list of server APIs: packages.apps.redhat.com/v1alpha1: Unauthorized
```

Verified in 4.0.0-0.nightly-2019-03-04-234414 envs; the envs didn't hit the bug again.

```
$ oc get no -w   # didn't see a NotReady one
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-108.us-east-2.compute.internal   Ready    master   7h10m   v1.12.4+5dc94f3fda
ip-10-0-131-97.us-east-2.compute.internal    Ready    worker   6h48m   v1.12.4+5dc94f3fda
ip-10-0-144-198.us-east-2.compute.internal   Ready    worker   6h49m   v1.12.4+5dc94f3fda
ip-10-0-156-137.us-east-2.compute.internal   Ready    master   7h10m   v1.12.4+5dc94f3fda
ip-10-0-160-55.us-east-2.compute.internal    Ready    master   7h10m   v1.12.4+5dc94f3fda
ip-10-0-172-102.us-east-2.compute.internal   Ready    worker   6h49m   v1.12.4+5dc94f3fda

$ oc get po -n openshift-kube-apiserver -w -l apiserver   # didn't see a crashing one
NAME                                                        READY   STATUS    RESTARTS   AGE
kube-apiserver-ip-10-0-129-108.us-east-2.compute.internal   1/1     Running   2          59m
kube-apiserver-ip-10-0-156-137.us-east-2.compute.internal   1/1     Running   2          60m
kube-apiserver-ip-10-0-160-55.us-east-2.compute.internal    1/1     Running   2          61m
```

We have 3 clusters up for 13h, 15h and 22h and all are healthy:

- All nodes Ready
- no clusteroperators failing
- no crashlooping kube-apiserver pods

Marking this verified.
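As an extra spot check during this kind of verification (again, not part of the original report), one could also confirm that the kube-apiserver clusteroperator reports no failing condition and that new kubelet client CSRs keep being issued, which is what working certificate rotation should produce over time:

```
# The kube-apiserver clusteroperator should not report a failing condition
oc get clusteroperator kube-apiserver -o yaml

# New kubelet client CSRs should keep appearing (and be Approved,Issued)
# over the cluster's lifetime if rotation is still being exercised
oc get csr --sort-by=.metadata.creationTimestamp
```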
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758