Bug 1618659 - Fail to minor version upgrade node due to node's csr was not approved
Summary: Fail to minor version upgrade node due to node's csr was not approved
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1619408
Reported: 2018-08-17 09:00 UTC by liujia
Modified: 2021-12-10 17:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
With the oc_csr_approve module recently backported to release 3.10, csr approvals are more robust. The task to approve the node csr was skipped previously for minor version upgrades. By using the new module, and not skipping the approval task in minor upgrades, the node csr is approved as expected and the node runs normally.
Clone Of:
Environment:
Last Closed: 2018-09-04 07:10:49 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2578 0 None None None 2018-09-04 07:11:05 UTC

Description liujia 2018-08-17 09:00:27 UTC
Description of problem:
When upgrading OCP from v3.10.14 to v3.10.28, the master upgrade succeeded, but the node upgrade failed. The node approval task was skipped and the node service check then timed out:
Task [openshift_node : Approve the node] ***************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Friday 17 August 2018  06:09:48 +0000 (0:00:03.593)       0:48:32.550 ********* 
skipping: [node1] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
...
TASK [openshift_node : Check status of node service] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:57
Friday 17 August 2018  06:09:48 +0000 (0:00:00.031)       0:48:32.581 ********* 
FAILED - RETRYING: Check status of node service (30 retries left).
...
FAILED - RETRYING: Check status of node service (4 retries left).
fatal: [node1]: FAILED! => {"ansible_job_id": "756056770276.32182", "attempts": 28, "changed": false, "failed": true, "finished": 1, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

=======================================
#journalctl -u atomic-openshift-node.service

Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389100    5503 server.go:501] Successfully initialized cloud provider: "gce" from the config file: "/etc/origin/cloudprovider/gce.conf"
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389197    5503 server.go:739] cloud provider determined current node name to be qe-jliu-r-node-registry-router-1
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389237    5503 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.390784    5503 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new one
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.416628    5503 csr.go:105] csr for this node already exists, reusing
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.420230    5503 csr.go:113] csr for this node is still valid
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service start operation timed out. Terminating.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: Failed to start OpenShift Node.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service failed.
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 systemd[1]: Starting OpenShift Node...
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 atomic-openshift-node[6316]: I0817 02:40:22.545480    6317 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]

[root@qe-jliu-r-master-etcd-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-5w89l                                              46m       system:node:qe-jliu-r-master-etcd-1                       Pending
csr-8sj5b                                              59m       system:node:qe-jliu-r-master-etcd-1                       Pending
csr-jdhmk                                              33m       system:node:qe-jliu-r-master-etcd-1                       Pending
csr-jhmfl                                              1h        system:node:qe-jliu-r-master-etcd-1                       Pending
csr-nqdml                                              1h        system:node:qe-jliu-r-master-etcd-1                       Pending
csr-rnlw2                                              1h        system:node:qe-jliu-r-master-etcd-1                       Pending
csr-rp4d7                                              1h        system:node:qe-jliu-r-master-etcd-1                       Pending
csr-tbw5v                                              20m       system:node:qe-jliu-r-master-etcd-1                       Pending
csr-xdzzl                                              7m        system:node:qe-jliu-r-master-etcd-1                       Pending
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc   1h        system:serviceaccount:openshift-infra:node-bootstrapper   Pending

This seems to be caused by the recently merged PR 9530, which skips the task [openshift_node : Approve the node], so the node never became ready (the master node works well and was upgraded successfully). After approving the pending CSRs manually, the node works well.

#kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve

[root@qe-jliu-r-master-etcd-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-5w89l                                              1h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-8sj5b                                              1h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-jdhmk                                              1h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-jhmfl                                              2h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-nqdml                                              1h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-rnlw2                                              2h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-rp4d7                                              2h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-tbw5v                                              1h        system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-wxnhx                                              38m       system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-xdzzl                                              50m       system:node:qe-jliu-r-master-etcd-1                       Approved,Issued
csr-xvc7r                                              34m       system:node:qe-jliu-r-node-registry-router-1              Approved,Issued
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc   1h        system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
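The manual workaround above relies on `grep Pending`, which matches the word anywhere on the line. As a rough sketch only, the same filtering can be done on the CONDITION column explicitly. The function names `pending_csrs` and `approve_command` are hypothetical helpers, not part of openshift-ansible, and the code assumes the four-column `oc get csr` layout shown in this report.

```python
from typing import List


def pending_csrs(oc_get_csr_output: str) -> List[str]:
    """Return names of CSRs whose CONDITION column is exactly 'Pending'.

    Assumes the four-column layout (NAME AGE REQUESTOR CONDITION)
    printed by `oc get csr` in this report.
    """
    names = []
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        # Skip blank lines, malformed rows, and the header row.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        name, _age, _requestor, condition = fields
        if condition == "Pending":
            names.append(name)
    return names


def approve_command(names: List[str]) -> List[str]:
    """Build the argv for approving the given CSRs (empty if none)."""
    if not names:
        return []
    return ["kubectl", "certificate", "approve", *names]


# Rows copied from the listings in this report.
SAMPLE = """\
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-xdzzl                                              7m        system:node:qe-jliu-r-master-etcd-1                       Pending
csr-xvc7r                                              34m       system:node:qe-jliu-r-node-registry-router-1              Approved,Issued
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc   1h        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
"""

if __name__ == "__main__":
    print(approve_command(pending_csrs(SAMPLE)))
```

The resulting argument list could be handed to `subprocess.run` on a master host; that step is left out here because it needs a live cluster.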


Version-Release number of the following components:
ansible-2.4.6.0-1.el7ae.noarch
openshift-ansible-3.10.28-1.git.0.9242c73None.noarch

How reproducible:
always

Steps to Reproduce:
1. Run upgrade ocp from v3.10.14 to v3.10.28

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2018-08-20 16:52:45 UTC
This task is removing node certificates during the upgrade process.


TASK [openshift_node : Remove previous bootstrap certificates] *****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/bootstrap_changes.yml:92
Friday 17 August 2018  05:34:26 +0000 (0:00:02.073)       0:13:09.914 ********* 
changed: [qe-jliu-r-master-etcd-1.0817-80n.qe.rhcloud.com] => {"changed": true, "failed": false, "path": "/etc/origin/node/certificates", "state": "absent"}

We either need to land the oc_adm_csr changes to ensure that we fix the node approval and then reinstate that task or we need to also conditionally remove the certs.

Comment 4 Johnny Liu 2018-08-28 06:56:22 UTC
This issue still reproduces with openshift-ansible-3.10.35-1.git.0.e5b821eNone.noarch.

This is blocking v3.10 minor version update.

Comment 5 Scott Dodson 2018-08-28 12:42:09 UTC
This should've been resolved by http://github.com/openshift/openshift-ansible/pull/9751

Can we please test with 3.10.36-1 or later?

Comment 8 Johnny Liu 2018-08-29 03:27:40 UTC
Re-tested this bug with openshift-ansible-3.10.38-1.git.0.8cfad6d.el7.noarch. It still reproduces, but there are fewer pending CSRs than before.

[root@qe-jialiu3102-master-etcd-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-6ktzf                                              1h        system:admin                                              Approved,Issued
csr-bz4hk                                              1h        system:admin                                              Approved,Issued
csr-dpbpl                                              57m       system:node:qe-jialiu3102-master-etcd-1                   Approved,Issued
csr-f7fvr                                              56m       system:node:qe-jialiu3102-master-etcd-1                   Approved,Issued
csr-jsbtg                                              56m       system:node:qe-jialiu3102-node-1                          Approved,Issued
csr-m92fw                                              56m       system:node:qe-jialiu3102-node-registry-router-1          Approved,Issued
csr-mkkg5                                              16m       system:node:qe-jialiu3102-master-etcd-1                   Approved,Issued
node-csr-06NTOIbTWS_kwM1ArCXM7354PVzzTbn-BhdyusiDkrU   56m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-ALO1Hz0cT2p7PL0B7VFbMTgz8Xji16YEDZWCYeiCC04   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-CzQEqfHAKPf0ezTNNMpUKjfMKIcGHdlpMUQ_NdrN3ds   56m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-txdILUcMgrJtYoFtquQxphl0t9pkVCyEp1vySxyD3L0   16m       system:admin                                              Approved,Issued


After approving the pending node CSR manually, the node service starts successfully.

[root@qe-jialiu3102-master-etcd-1 ~]# kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve
certificatesigningrequest.certificates.k8s.io "node-csr-ALO1Hz0cT2p7PL0B7VFbMTgz8Xji16YEDZWCYeiCC04" approved

Comment 10 Scott Dodson 2018-08-29 11:44:16 UTC
Russ, I think now that we've backported the CSR approval changes we should re-introduce the csr check using the new module. The logs in comment 9 look as if the approval was skipped altogether.

This may be a side effect of us doing some bootstrapping tasks on all upgrades, which may clear out certificates. I can't think of any other reason why a minor upgrade would require new certificates.
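The listings in this report show two kinds of requests: `node-csr-*` entries from the node-bootstrapper service account and `csr-*` entries from `system:node:*`. A single approval pass can therefore miss a request that a node submits afterwards. The following is only an illustrative sketch of an approve-and-recheck loop. It is not the code of the oc_csr_approve module or of the linked pull requests. The name `approve_until_clear`, its parameters, and the default of 30 retries (borrowed from the retry count in the log above) are all assumptions.

```python
import time
from typing import Callable, Iterable, List


def approve_until_clear(
    list_pending: Callable[[], List[str]],
    approve: Callable[[Iterable[str]], None],
    retries: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Approve pending CSRs until a check finds none, or retries run out.

    list_pending and approve are injected so the loop can be exercised
    without a cluster; in practice they would wrap `oc get csr` and
    `oc adm certificate approve` (hypothetical wiring, not shown).
    Returns True if a check found no pending CSRs, False otherwise.
    """
    for _ in range(retries):
        pending = list_pending()
        if not pending:
            return True
        approve(pending)
        # Give the node time to submit any follow-up request.
        sleep(delay_seconds)
    # Final check after the last approval round.
    return not list_pending()
```

One limitation of this sketch: an empty first check returns immediately, so it does not cover a node that has not yet submitted its request. A real check would also need to confirm that each expected node holds an approved certificate.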

Comment 11 Russell Teague 2018-08-29 12:43:26 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/9817

Comment 12 Scott Dodson 2018-08-29 16:46:36 UTC
In 3.10.41-1

Comment 13 Johnny Liu 2018-08-30 02:32:52 UTC
Verified this bug with openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch, and PASS.

The upgrade completed successfully, the node service started successfully, and no pending CSRs are seen.

[root@qe-jialiu3102-master-etcd-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-49p68                                              41m       system:node:qe-jialiu3102-master-etcd-1                   Approved,Issued
csr-9bz59                                              40m       system:node:qe-jialiu3102-node-1                          Approved,Issued
csr-fsqbc                                              44m       system:admin                                              Approved,Issued
csr-g9lm7                                              44m       system:admin                                              Approved,Issued
csr-l77kf                                              40m       system:node:qe-jialiu3102-master-etcd-1                   Approved,Issued
csr-njcz8                                              18m       system:node:qe-jialiu3102-master-etcd-1                   Approved,Issued
csr-pzjmd                                              9m        system:node:qe-jialiu3102-node-registry-router-1          Approved,Issued
csr-v24p7                                              7m        system:node:qe-jialiu3102-node-1                          Approved,Issued
csr-vtvhq                                              40m       system:node:qe-jialiu3102-node-registry-router-1          Approved,Issued
node-csr-1swVnqF9z1UTUCyptMNbDbDH5cAeblkinZEnjeVUepU   9m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-C-XW9iAMNjgCOOWReg92hn3R3mc2zSWkdTdTDgkxQeo   40m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-InEz0zDMViHe_9POYYqJQGXzCFpoaCE57VygOIBnn-Y   18m       system:admin                                              Approved,Issued
node-csr-qxOiXzkQ537wAG2yt3O7Azu_42G7gcxTSM1uLJCyYro   40m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-xWpNVF82Gfxpm2nPiYiQZYXsM-aUagX6jxvR7b8nl0o   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued

Comment 15 errata-xmlrpc 2018-09-04 07:10:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2578

