Description of problem:
Upgrading OCP from v3.10.14 to v3.10.28: the master upgrade succeeded, but the node upgrade failed at the tasks below.

TASK [openshift_node : Approve the node] ***************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Friday 17 August 2018 06:09:48 +0000 (0:00:03.593) 0:48:32.550 *********
skipping: [node1] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
...
TASK [openshift_node : Check status of node service] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:57
Friday 17 August 2018 06:09:48 +0000 (0:00:00.031) 0:48:32.581 *********
FAILED - RETRYING: Check status of node service (30 retries left).
...
FAILED - RETRYING: Check status of node service (4 retries left).
fatal: [node1]: FAILED! => {"ansible_job_id": "756056770276.32182", "attempts": 28, "changed": false, "failed": true, "finished": 1, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

=======================================
# journalctl -u atomic-openshift-node.service
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389100 5503 server.go:501] Successfully initialized cloud provider: "gce" from the config file: "/etc/origin/cloudprovider/gce.conf"
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389197 5503 server.go:739] cloud provider determined current node name to be qe-jliu-r-node-registry-router-1
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.389237 5503 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.390784 5503 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new one
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.416628 5503 csr.go:105] csr for this node already exists, reusing
Aug 17 02:35:17 qe-jliu-r-node-registry-router-1 atomic-openshift-node[5503]: I0817 02:35:17.420230 5503 csr.go:113] csr for this node is still valid
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service start operation timed out. Terminating.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: Failed to start OpenShift Node.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Aug 17 02:40:17 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service failed.
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 systemd[1]: Starting OpenShift Node...
Aug 17 02:40:22 qe-jliu-r-node-registry-router-1 atomic-openshift-node[6316]: I0817 02:40:22.545480 6317 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]

[root@qe-jliu-r-master-etcd-1 ~]# oc get csr
NAME                                                  AGE  REQUESTOR                                                CONDITION
csr-5w89l                                             46m  system:node:qe-jliu-r-master-etcd-1                      Pending
csr-8sj5b                                             59m  system:node:qe-jliu-r-master-etcd-1                      Pending
csr-jdhmk                                             33m  system:node:qe-jliu-r-master-etcd-1                      Pending
csr-jhmfl                                             1h   system:node:qe-jliu-r-master-etcd-1                      Pending
csr-nqdml                                             1h   system:node:qe-jliu-r-master-etcd-1                      Pending
csr-rnlw2                                             1h   system:node:qe-jliu-r-master-etcd-1                      Pending
csr-rp4d7                                             1h   system:node:qe-jliu-r-master-etcd-1                      Pending
csr-tbw5v                                             20m  system:node:qe-jliu-r-master-etcd-1                      Pending
csr-xdzzl                                             7m   system:node:qe-jliu-r-master-etcd-1                      Pending
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc  1h   system:serviceaccount:openshift-infra:node-bootstrapper  Pending

This seems to be caused by the newly merged PR 9530, which skips the task [openshift_node : Approve the node], so the node never becomes ready (the master node works well and upgrades successfully). After approving the CSRs manually, the node works well.

# kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve

[root@qe-jliu-r-master-etcd-1 ~]# oc get csr
NAME                                                  AGE  REQUESTOR                                                CONDITION
csr-5w89l                                             1h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-8sj5b                                             1h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-jdhmk                                             1h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-jhmfl                                             2h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-nqdml                                             1h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-rnlw2                                             2h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-rp4d7                                             2h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-tbw5v                                             1h   system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-wxnhx                                             38m  system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-xdzzl                                             50m  system:node:qe-jliu-r-master-etcd-1                      Approved,Issued
csr-xvc7r                                             34m  system:node:qe-jliu-r-node-registry-router-1             Approved,Issued
node-csr-zkZacz4-7xZaevamNB1U_NjqESYbAWphXdksbPrv3Hc  1h   system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued

Version-Release number of the following components:
ansible-2.4.6.0-1.el7ae.noarch
openshift-ansible-3.10.28-1.git.0.9242c73None.noarch

How reproducible:
always

Steps to Reproduce:
1. Run the OCP upgrade from v3.10.14 to v3.10.28

Actual results:
The upgrade fails.

Expected results:
The upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
This task is removing node certificates during the upgrade process.

TASK [openshift_node : Remove previous bootstrap certificates] *****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/bootstrap_changes.yml:92
Friday 17 August 2018 05:34:26 +0000 (0:00:02.073) 0:13:09.914 *********
changed: [qe-jliu-r-master-etcd-1.0817-80n.qe.rhcloud.com] => {"changed": true, "failed": false, "path": "/etc/origin/node/certificates", "state": "absent"}

We either need to land the oc_adm_csr changes to ensure that we fix the node approval and then reinstate that task or we need to also conditionally remove the certs.
This issue still reproduces with openshift-ansible-3.10.35-1.git.0.e5b821eNone.noarch. It is blocking the v3.10 minor version update.
This should've been resolved by http://github.com/openshift/openshift-ansible/pull/9751
Can we please test with 3.10.36-1 or later?
Re-tested this bug with openshift-ansible-3.10.38-1.git.0.8cfad6d.el7.noarch. It still reproduces, but there are fewer pending CSRs than before.

[root@qe-jialiu3102-master-etcd-1 ~]# oc get csr
NAME                                                  AGE  REQUESTOR                                                CONDITION
csr-6ktzf                                             1h   system:admin                                             Approved,Issued
csr-bz4hk                                             1h   system:admin                                             Approved,Issued
csr-dpbpl                                             57m  system:node:qe-jialiu3102-master-etcd-1                  Approved,Issued
csr-f7fvr                                             56m  system:node:qe-jialiu3102-master-etcd-1                  Approved,Issued
csr-jsbtg                                             56m  system:node:qe-jialiu3102-node-1                         Approved,Issued
csr-m92fw                                             56m  system:node:qe-jialiu3102-node-registry-router-1         Approved,Issued
csr-mkkg5                                             16m  system:node:qe-jialiu3102-master-etcd-1                  Approved,Issued
node-csr-06NTOIbTWS_kwM1ArCXM7354PVzzTbn-BhdyusiDkrU  56m  system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued
node-csr-ALO1Hz0cT2p7PL0B7VFbMTgz8Xji16YEDZWCYeiCC04  7m   system:serviceaccount:openshift-infra:node-bootstrapper  Pending
node-csr-CzQEqfHAKPf0ezTNNMpUKjfMKIcGHdlpMUQ_NdrN3ds  56m  system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued
node-csr-txdILUcMgrJtYoFtquQxphl0t9pkVCyEp1vySxyD3L0  16m  system:admin                                             Approved,Issued

After approving the pending node CSR manually, the node service starts successfully.

[root@qe-jialiu3102-master-etcd-1 ~]# kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve
certificatesigningrequest.certificates.k8s.io "node-csr-ALO1Hz0cT2p7PL0B7VFbMTgz8Xji16YEDZWCYeiCC04" approved
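
For anyone applying the same workaround, the pipeline above can be tightened slightly. The following is only a sketch, assuming `oc get csr` / `kubectl get csr` prints CONDITION as the last column, as in the listings in this bug. The function name `pending_csrs` is hypothetical, and `xargs -r` (skip the command when there is no input) is a GNU extension.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get csr` style output on stdin and print
# the NAME of every CSR whose last column (CONDITION) is exactly "Pending".
# Unlike `grep Pending`, this cannot match the word inside a CSR name, and
# the header row is skipped because its last field is "CONDITION".
pending_csrs() {
    awk '$NF == "Pending" { print $1 }'
}

# Example use on a master (not executed here):
#   kubectl get csr | pending_csrs | xargs -r kubectl certificate approve
```

This is still a blanket approval of every pending request, so it is a stopgap for test clusters rather than a replacement for the installer's own approval step.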
Russ, I think now that we've backported the CSR approval changes we should re-introduce the CSR check using the new module. The logs in comment 9 look as if the approval was skipped altogether. This may be a side effect of us doing some bootstrapping tasks on all upgrades, which may clear out certificates? I can't think of any other reason why a minor upgrade would require new certificates.
Proposed: https://github.com/openshift/openshift-ansible/pull/9817
In 3.10.41-1
Verified this bug with openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch: PASS. The upgrade completed successfully, the node service started successfully, and no pending CSRs are seen.

[root@qe-jialiu3102-master-etcd-1 ~]# oc get csr
NAME                                                  AGE  REQUESTOR                                                CONDITION
csr-49p68                                             41m  system:node:qe-jialiu3102-master-etcd-1                  Approved,Issued
csr-9bz59                                             40m  system:node:qe-jialiu3102-node-1                         Approved,Issued
csr-fsqbc                                             44m  system:admin                                             Approved,Issued
csr-g9lm7                                             44m  system:admin                                             Approved,Issued
csr-l77kf                                             40m  system:node:qe-jialiu3102-master-etcd-1                  Approved,Issued
csr-njcz8                                             18m  system:node:qe-jialiu3102-master-etcd-1                  Approved,Issued
csr-pzjmd                                             9m   system:node:qe-jialiu3102-node-registry-router-1         Approved,Issued
csr-v24p7                                             7m   system:node:qe-jialiu3102-node-1                         Approved,Issued
csr-vtvhq                                             40m  system:node:qe-jialiu3102-node-registry-router-1         Approved,Issued
node-csr-1swVnqF9z1UTUCyptMNbDbDH5cAeblkinZEnjeVUepU  9m   system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued
node-csr-C-XW9iAMNjgCOOWReg92hn3R3mc2zSWkdTdTDgkxQeo  40m  system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued
node-csr-InEz0zDMViHe_9POYYqJQGXzCFpoaCE57VygOIBnn-Y  18m  system:admin                                             Approved,Issued
node-csr-qxOiXzkQ537wAG2yt3O7Azu_42G7gcxTSM1uLJCyYro  40m  system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued
node-csr-xWpNVF82Gfxpm2nPiYiQZYXsM-aUagX6jxvR7b8nl0o  7m   system:serviceaccount:openshift-infra:node-bootstrapper  Approved,Issued
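
The "no pending CSR" check in this verification could also be scripted instead of read off the table. A minimal sketch, assuming the same four-column `oc get csr` output with CONDITION last; the function name `no_pending_csrs` is hypothetical and not part of openshift-ansible.

```shell
#!/bin/sh
# Hypothetical check: read `oc get csr` style output on stdin and exit 0
# only if no row has "Pending" as its last column (CONDITION); exit 1 otherwise.
no_pending_csrs() {
    awk '$NF == "Pending" { n++ } END { exit (n > 0) }'
}

# Example use after an upgrade (not executed here):
#   oc get csr | no_pending_csrs && echo "no pending CSRs"
```

The exit status makes it usable as a post-upgrade gate in a test script.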
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2578