Bug 1618659
| Summary: | Fail to minor version upgrade node due to node's csr was not approved | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
| Component: | Cluster Version Operator | Assignee: | Russell Teague <rteague> |
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.10.0 | CC: | aos-bugs, dmoessne, jialiu, jokerman, mgugino, mifiedle, mmccomas, sdodson, smunilla |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 3.10.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
With the oc_csr_approve module recently backported to release 3.10, csr approvals are more robust. The task to approve the node csr was skipped previously for minor version upgrades. By using the new module, and not skipping the approval task in minor upgrades, the node csr is approved as expected and the node runs normally.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-09-04 07:10:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1619408 | ||
This task is removing node certificates during the upgrade process.
TASK [openshift_node : Remove previous bootstrap certificates] *****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/bootstrap_changes.yml:92
Friday 17 August 2018 05:34:26 +0000 (0:00:02.073) 0:13:09.914 *********
changed: [qe-jliu-r-master-etcd-1.0817-80n.qe.rhcloud.com] => {"changed": true, "failed": false, "path": "/etc/origin/node/certificates", "state": "absent"}
We either need to land the oc_adm_csr changes to ensure that we fix the node approval and then reinstate that task or we need to also conditionally remove the certs.
This issue still reproduced with openshift-ansible-3.10.35-1.git.0.e5b821eNone.noarch. This is blocking v3.10 minor version update.

This should've been resolved by http://github.com/openshift/openshift-ansible/pull/9751 Can we please test with 3.10.36-1 or later?

Re-tested this bug with openshift-ansible-3.10.38-1.git.0.8cfad6d.el7.noarch: still reproduced, but fewer CSRs are pending than before.

[root@qe-jialiu3102-master-etcd-1 ~]# oc get csr
NAME AGE REQUESTOR CONDITION
csr-6ktzf 1h system:admin Approved,Issued
csr-bz4hk 1h system:admin Approved,Issued
csr-dpbpl 57m system:node:qe-jialiu3102-master-etcd-1 Approved,Issued
csr-f7fvr 56m system:node:qe-jialiu3102-master-etcd-1 Approved,Issued
csr-jsbtg 56m system:node:qe-jialiu3102-node-1 Approved,Issued
csr-m92fw 56m system:node:qe-jialiu3102-node-registry-router-1 Approved,Issued
csr-mkkg5 16m system:node:qe-jialiu3102-master-etcd-1 Approved,Issued
node-csr-06NTOIbTWS_kwM1ArCXM7354PVzzTbn-BhdyusiDkrU 56m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
node-csr-ALO1Hz0cT2p7PL0B7VFbMTgz8Xji16YEDZWCYeiCC04 7m system:serviceaccount:openshift-infra:node-bootstrapper Pending
node-csr-CzQEqfHAKPf0ezTNNMpUKjfMKIcGHdlpMUQ_NdrN3ds 56m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
node-csr-txdILUcMgrJtYoFtquQxphl0t9pkVCyEp1vySxyD3L0 16m system:admin Approved,Issued

After approving the pending node csr manually, the node service starts successfully.

[root@qe-jialiu3102-master-etcd-1 ~]# kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve
certificatesigningrequest.certificates.k8s.io "node-csr-ALO1Hz0cT2p7PL0B7VFbMTgz8Xji16YEDZWCYeiCC04" approved

Russ, I think now that we've backported the CSR approval changes we should re-introduce the csr check using the new module.

The logs in comment 9 look as if the approval was skipped altogether. This may be a side effect of us doing some bootstrapping tasks on all upgrades which may clear out certificates? I can't think of any other reason why a minor upgrade would require new certificates.

In 3.10.41-1

Verified this bug with openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch, and PASS. The upgrade completed successfully, the node service started successfully, and no pending csr is seen.

[root@qe-jialiu3102-master-etcd-1 ~]# oc get csr
NAME AGE REQUESTOR CONDITION
csr-49p68 41m system:node:qe-jialiu3102-master-etcd-1 Approved,Issued
csr-9bz59 40m system:node:qe-jialiu3102-node-1 Approved,Issued
csr-fsqbc 44m system:admin Approved,Issued
csr-g9lm7 44m system:admin Approved,Issued
csr-l77kf 40m system:node:qe-jialiu3102-master-etcd-1 Approved,Issued
csr-njcz8 18m system:node:qe-jialiu3102-master-etcd-1 Approved,Issued
csr-pzjmd 9m system:node:qe-jialiu3102-node-registry-router-1 Approved,Issued
csr-v24p7 7m system:node:qe-jialiu3102-node-1 Approved,Issued
csr-vtvhq 40m system:node:qe-jialiu3102-node-registry-router-1 Approved,Issued
node-csr-1swVnqF9z1UTUCyptMNbDbDH5cAeblkinZEnjeVUepU 9m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
node-csr-C-XW9iAMNjgCOOWReg92hn3R3mc2zSWkdTdTDgkxQeo 40m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
node-csr-InEz0zDMViHe_9POYYqJQGXzCFpoaCE57VygOIBnn-Y 18m system:admin Approved,Issued
node-csr-qxOiXzkQ537wAG2yt3O7Azu_42G7gcxTSM1uLJCyYro 40m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
node-csr-xWpNVF82Gfxpm2nPiYiQZYXsM-aUagX6jxvR7b8nl0o 7m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:2578
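
The verification in the comments above amounts to checking that `oc get csr` lists no request in the Pending condition. A minimal shell sketch of that check follows. The function name `count_pending_csrs` is hypothetical, and the sketch assumes the four-column output (NAME, AGE, REQUESTOR, CONDITION) shown in this report.

```shell
#!/bin/sh
# Count CSRs whose CONDITION (last) column is exactly "Pending".
# Reads `oc get csr` output, including its header row, on stdin.
# count_pending_csrs is a hypothetical helper, not part of oc or
# openshift-ansible.
count_pending_csrs() {
    # NR > 1 skips the header; n + 0 prints 0 when nothing matched.
    awk 'NR > 1 && $NF == "Pending" { n++ } END { print n + 0 }'
}

# Example usage (assumes a logged-in oc client):
#   pending=$(oc get csr | count_pending_csrs)
#   [ "$pending" -eq 0 ] || echo "$pending CSR(s) still pending" >&2
```

Matching on the last column avoids counting a row merely because the word appears elsewhere on the line.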
Description of problem:
Upgrading OCP from v3.10.14 to v3.10.28, the master upgrade succeeded, but the node upgrade failed at the following tasks:

TASK [openshift_node : Approve the node] ***************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Friday 17 August 2018 06:09:48 +0000 (0:00:03.593) 0:48:32.550 *********
skipping: [node1] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
...
TASK [openshift_node : Check status of node service] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:57
Friday 17 August 2018 06:09:48 +0000 (0:00:00.031) 0:48:32.581 *********
FAILED - RETRYING: Check status of node service (30 retries left).
...
FAILED - RETRYING: Check status of node service (4 retries left).
fatal: [node1]: FAILED! => {"ansible_job_id": "756056770276.32182", "attempts": 28, "changed": false, "failed": true, "finished": 1, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
=======================================
#journalctl -u atomic-openshift-node.service
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389100 5503 server.go:501] Successfully initialized cloud provider: "gce" from the config file: "/etc/origin/cloudprovider/gce.conf"
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389197 5503 server.go:739] cloud provider determined current node name to be qe-jliu-r-node-registry-router-1
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389237 5503 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.390784 5503 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new one
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.416628 5503 csr.go:105] csr for this node already exists, reusing
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.420230 5503 csr.go:113] csr for this node is still valid
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service start operation timed out. Terminating.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: Failed to start OpenShift Node.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service failed.
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 systemd[1]: Starting OpenShift Node...
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 atomic-openshift-node[6316]: I0817 02:40:22.545480 6317 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]

[root@qe-jliu-r-master-etcd-1 ~]# oc get csr
NAME AGE REQUESTOR CONDITION
csr-5w89l 46m system:node:qe-jliu-r-master-etcd-1 Pending
csr-8sj5b 59m system:node:qe-jliu-r-master-etcd-1 Pending
csr-jdhmk 33m system:node:qe-jliu-r-master-etcd-1 Pending
csr-jhmfl 1h system:node:qe-jliu-r-master-etcd-1 Pending
csr-nqdml 1h system:node:qe-jliu-r-master-etcd-1 Pending
csr-rnlw2 1h system:node:qe-jliu-r-master-etcd-1 Pending
csr-rp4d7 1h system:node:qe-jliu-r-master-etcd-1 Pending
csr-tbw5v 20m system:node:qe-jliu-r-master-etcd-1 Pending
csr-xdzzl 7m system:node:qe-jliu-r-master-etcd-1 Pending
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc 1h system:serviceaccount:openshift-infra:node-bootstrapper Pending

This seems to be caused by the newly merged pr9530, which skips the task [openshift_node : Approve the node], so the node was not ready (the master node works well and upgraded successfully). After approving the pending CSRs manually, the node works well.
#kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve
[root@qe-jliu-r-master-etcd-1 ~]# oc get csr
NAME AGE REQUESTOR CONDITION
csr-5w89l 1h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-8sj5b 1h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-jdhmk 1h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-jhmfl 2h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-nqdml 1h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-rnlw2 2h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-rp4d7 2h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-tbw5v 1h system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-wxnhx 38m system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-xdzzl 50m system:node:qe-jliu-r-master-etcd-1 Approved,Issued
csr-xvc7r 34m system:node:qe-jliu-r-node-registry-router-1 Approved,Issued
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc 1h system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued

Version-Release number of the following components:
ansible-2.4.6.0-1.el7ae.noarch
openshift-ansible-3.10.28-1.git.0.9242c73None.noarch

How reproducible:
always

Steps to Reproduce:
1. Run upgrade ocp from v3.10.14 to v3.10.28

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
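
The manual workaround in the description selects rows with `grep Pending`, which matches the word anywhere on the line. As a slightly stricter sketch, the hypothetical helper `pending_csr_names` below prints a name only when the last column is exactly `Pending`, and `approve_pending_csrs` passes those names to `kubectl certificate approve`, the same command used in this report. Both function names are assumptions for illustration, and `xargs -r` is a GNU extension assumed to be available on the master host.

```shell
#!/bin/sh
# Print the NAME of every CSR whose CONDITION (last) column is exactly
# "Pending". Reads `kubectl get csr` output, including its header row,
# on stdin. Hypothetical helper; not part of kubectl or openshift-ansible.
pending_csr_names() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Approve every pending CSR. Needs credentials allowed to approve CSRs.
# -r (GNU xargs) skips the approve call when nothing is pending.
approve_pending_csrs() {
    kubectl get csr | pending_csr_names | xargs -r kubectl certificate approve
}
```

Running `approve_pending_csrs` on a master approves all pending requests at once, which matches the test-cluster workaround here. On other clusters the REQUESTOR column is worth reviewing before approving.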