Description of problem:

Installation variables:
vm_type: m3.large
node_number: 3
auth_type: allowall

Installation stuck at TASK [Approve node certificates when bootstrapping]

TASK [Dump the bootstrap hostnames] ********************************************
Tuesday 28 August 2018 14:08:32 +0800 (0:00:00.242) 0:12:26.156 ********
ok: [ec2-54-173-107-165.compute-1.amazonaws.com] => {
    "msg": [
        "ip-172-18-0-218.ec2.internal",
        "ip-172-18-23-190.ec2.internal",
        "ip-172-18-13-240.ec2.internal",
        "ip-172-18-29-86.ec2.internal",
        "ip-172-18-3-188.ec2.internal"
    ]
}

TASK [Approve node certificates when bootstrapping] ****************************
Tuesday 28 August 2018 14:08:32 +0800 (0:00:00.090) 0:12:26.247 ********
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).

The task stayed stuck at "29 retries left" and did not continue with further retries for a long time.

CSRs are pending for nodes: ip-172-18-23-190.ec2.internal, ip-172-18-3-188.ec2.internal, ip-172-18-13-240.ec2.internal

# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-0-218.ec2.internal    Ready     master    51m       v1.11.0+d4cacc0
ip-172-18-13-240.ec2.internal   Ready     <none>    46m       v1.11.0+d4cacc0
ip-172-18-23-190.ec2.internal   Ready     <none>    46m       v1.11.0+d4cacc0
ip-172-18-29-86.ec2.internal    Ready     compute   46m       v1.11.0+d4cacc0
ip-172-18-3-188.ec2.internal    Ready     compute   46m       v1.11.0+d4cacc0

# oc get csr
NAME                                                   AGE   REQUESTOR                                                 CONDITION
csr-5pfrl                                              8m    system:node:ip-172-18-23-190.ec2.internal                 Pending
csr-7gsfg                                              48m   system:node:ip-172-18-0-218.ec2.internal                  Approved,Issued
csr-7qd8k                                              46m   system:node:ip-172-18-23-190.ec2.internal                 Approved,Issued
csr-cbvqt                                              46m   system:node:ip-172-18-13-240.ec2.internal                 Approved,Issued
csr-ccms4                                              21m   system:node:ip-172-18-23-190.ec2.internal                 Pending
csr-cv6tz                                              33m   system:node:ip-172-18-23-190.ec2.internal                 Pending
csr-cx8xk                                              50m   system:admin                                              Approved,Issued
csr-dtr5s                                              46m   system:node:ip-172-18-29-86.ec2.internal                  Approved,Issued
csr-jg2gd                                              46m   system:node:ip-172-18-3-188.ec2.internal                  Approved,Issued
csr-jqwjs                                              46m   system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-kg5dj                                              46m   system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-knfvn                                              34m   system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-kq7zh                                              21m   system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-l97cp                                              8m    system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-m8pks                                              8m    system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-nmpcn                                              46m   system:node:ip-172-18-0-218.ec2.internal                  Approved,Issued
csr-tkkgt                                              50m   system:admin                                              Approved,Issued
csr-w8mkp                                              21m   system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-wmgqt                                              34m   system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-wqkpg                                              46m   system:node:ip-172-18-23-190.ec2.internal                 Pending
node-csr-7dFzjOYLfpb4_Y1PHiI82fdySE45jP0tIlUO76ri-cM   46m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-9YRqjhAGecWHO5Of5x8sl4JQhPuPVQv03Jcr5qI1Jb8   46m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-pEcYR7djB4SMDXSQ1e6U2oRL1qHYXwq3On82Gm09Mkc   46m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-v_iwUUogoHmbiYv1E3BgSovXhGC65TG9GXKWgcBnxo4   46m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued

Version-Release number of the following components:
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Installation stuck at TASK [Approve node certificates when bootstrapping]

Expected results:

Additional info:
It seems to be easily reproduced with a large number of nodes.

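For anyone triaging a similar hang, the pending CSRs can be summarized per requestor instead of reading the table by eye. The following is only a rough sketch: it assumes the four-column `oc get csr` table layout shown in this report, the function name `pending_csrs_by_requestor` is made up for illustration, and the sample rows are a subset copied from the output above.

```python
from collections import defaultdict
from typing import Dict, List


def pending_csrs_by_requestor(oc_get_csr_output: str) -> Dict[str, List[str]]:
    """Group the names of Pending CSRs by requestor.

    Assumes the plain table printed by `oc get csr` in this report:
    NAME, AGE, REQUESTOR, CONDITION, separated by whitespace.
    """
    pending = defaultdict(list)
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything not in 4 columns.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        name, _age, requestor, condition = fields
        if condition == "Pending":
            pending[requestor].append(name)
    return dict(pending)


# A few rows copied from the `oc get csr` output in this comment.
SAMPLE = """\
NAME AGE REQUESTOR CONDITION
csr-5pfrl 8m system:node:ip-172-18-23-190.ec2.internal Pending
csr-7gsfg 48m system:node:ip-172-18-0-218.ec2.internal Approved,Issued
csr-jqwjs 46m system:node:ip-172-18-13-240.ec2.internal Pending
csr-kg5dj 46m system:node:ip-172-18-3-188.ec2.internal Pending
csr-l97cp 8m system:node:ip-172-18-3-188.ec2.internal Pending
"""

if __name__ == "__main__":
    for requestor, names in sorted(pending_csrs_by_requestor(SAMPLE).items()):
        print(requestor, len(names), ",".join(names))
```

In practice the text would come from running `oc get csr` on a master and piping the output into the script, rather than from the embedded sample.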
I also hit this issue; it seems easy to reproduce on larger clusters. In a 3-node cluster (1 master node + 1 infra node + 1 compute node), installation completes successfully. In a 5-node cluster (3 master nodes + 1 infra node + 1 compute node), installation fails. In comment 0, the cluster is 1 master node + 2 infra nodes + 2 compute nodes. This was introduced recently (since the .24 build) and is blocking master HA environment setup and QE's testing.
PR Created: https://github.com/openshift/openshift-ansible/pull/9800
The PR is merged into openshift-ansible-3.11.0-0.25.0; moving this bug to ON_QA.
Verified this bug with openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch: PASS.

TASK [Dump the bootstrap hostnames] ********************************************
Wednesday 29 August 2018 11:37:05 +0800 (0:00:00.265) 0:00:50.049 ******
ok: [ec2-52-54-185-242.compute-1.amazonaws.com] => {
    "msg": [
        "ip-172-18-12-30.ec2.internal",
        "ip-172-18-3-190.ec2.internal",
        "ip-172-18-7-182.ec2.internal",
        "ip-172-18-2-12.ec2.internal",
        "ip-172-18-8-189.ec2.internal"
    ]
}

TASK [Approve node certificates when bootstrapping] ****************************
Wednesday 29 August 2018 11:37:05 +0800 (0:00:00.083) 0:00:50.132 ******
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (27 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (26 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (25 retries left).
changed: [ec2-52-54-185-242.compute-1.amazonaws.com] => {"attempts": 7, "changed": true, "client_approve_results": [], "rc": 0, "server_approve_results": ["certificatesigningrequest.certificates.k8s.io/csr-gzgp8 approved\n", "certificatesigningrequest.certificates.k8s.io/csr-8m2cf approved\n"]}

[root@ip-172-18-12-30 ~]# oc get csr
NAME                                                   AGE   REQUESTOR                                                 CONDITION
csr-4fb8x                                              22m   system:node:ip-172-18-3-190.ec2.internal                  Approved,Issued
csr-4kbph                                              24m   system:admin                                              Approved,Issued
csr-6qrkd                                              24m   system:admin                                              Approved,Issued
csr-7bb2s                                              19m   system:node:ip-172-18-12-30.ec2.internal                  Approved,Issued
csr-8lvvf                                              24m   system:admin                                              Approved,Issued
csr-8m2cf                                              18m   system:node:ip-172-18-8-189.ec2.internal                  Approved,Issued
csr-bgx75                                              22m   system:node:ip-172-18-7-182.ec2.internal                  Approved,Issued
csr-bvl28                                              19m   system:node:ip-172-18-3-190.ec2.internal                  Approved,Issued
csr-cz2b7                                              24m   system:admin                                              Approved,Issued
csr-gzgp8                                              18m   system:node:ip-172-18-2-12.ec2.internal                   Approved,Issued
csr-lfbrj                                              24m   system:admin                                              Approved,Issued
csr-n5dnc                                              24m   system:admin                                              Approved,Issued
csr-pj4pt                                              19m   system:node:ip-172-18-7-182.ec2.internal                  Approved,Issued
csr-q7kcz                                              19m   system:node:ip-172-18-8-189.ec2.internal                  Approved,Issued
csr-t6dj9                                              19m   system:node:ip-172-18-2-12.ec2.internal                   Approved,Issued
csr-zqxlr                                              22m   system:node:ip-172-18-12-30.ec2.internal                  Approved,Issued
node-csr-6PKGRUMJDch4nq6Hd45QCer0so5VZjCe7DdCrbmkMBI   19m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-p2foh5KptJCXUbKmE3DtN3HSEOUwKaiCnq5nPCZWzo4   19m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued

Hit this again with openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch:

TASK [Approve node certificates when bootstrapping] ****************************
Tuesday 04 September 2018 17:24:17 +0800 (0:00:00.083) 0:17:50.596 *****
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (3 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (2 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
fatal: [share3-wmengah311-master-etcd-zone1-1.0904-beu.qe.rhcloud.com]: FAILED! => {"attempts": 30, "changed": false, "msg": "Cound not find csr for nodes: share3-wmengah311-master-etcd-zone1-1", "state": "unknown"}

[root@share3-wmengah311-master-etcd-zone1-1 ~]# oc get node
NAME                                    STATUS    ROLES     AGE       VERSION
share3-wmengah311-master-etcd-zone2-1   Ready     master    29m       v1.11.0+d4cacc0
share3-wmengah311-master-etcd-zone2-2   Ready     master    29m       v1.11.0+d4cacc0

[root@share3-wmengah311-master-etcd-zone1-1 ~]# oc get csr
NAME                                                   AGE   REQUESTOR                                                 CONDITION
csr-5h8hn                                              11m   system:admin                                              Pending
csr-d6n6v                                              11m   system:admin                                              Pending
csr-dk679                                              11m   system:admin                                              Pending
csr-fg4ms                                              11m   system:admin                                              Approved,Issued
csr-hdbv7                                              8m    system:node:share3-wmengah311-master-etcd-zone2-2         Pending
csr-htrq2                                              7m    system:node:share3-wmengah311-master-etcd-zone2-1         Pending
csr-k742q                                              11m   system:admin                                              Approved,Issued
csr-lv872                                              8m    system:node:share3-wmengah311-master-etcd-zone2-1         Pending
csr-lz9mg                                              7m    system:node:share3-wmengah311-master-etcd-zone2-2         Pending
csr-n2nvz                                              3m    system:node:share3-wmengah311-master-etcd-zone1-1         Pending
csr-wwg5d                                              1m    system:node:share3-wmengah311-master-etcd-zone1-1         Pending
csr-xv2vv                                              11m   system:admin                                              Approved,Issued
csr-xvnml                                              7m    system:node:share3-wmengah311-master-etcd-zone1-1         Pending
csr-znd2l                                              8m    system:node:share3-wmengah311-master-etcd-zone1-1         Pending
node-csr--kVBjtGE8wbaDAscL3IH5_6rWZGku46qSV0pL_XqNgs   7m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-2UaLHRSRPOXm9nhI98SBfyzgNkX3V-C7SIxP5tbqJqU   7m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-67v1yVAJlG-gIHZHvOA90d6Dhjb9YhCou94wvg2EARM   7m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-Tp8zgZHGhOkFjteTYtr_OlDakdJ_25omEPT07u4eIuA   7m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-_e2GsHH_F2ccF0SY21cBhgB_rQtgW2lqbh2zfmfrm-8   7m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-zqu_gDIn5YLNGWw5zXaU0mIe8gseNLlQoQUmBNCvVaw   7m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending

Seeing similar issue with openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch.
(In reply to Marek Goldmann from comment #10)
> Seeing similar issue with
> openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch.

The same here with version 3.10.
Also witnessed in openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch. Not able to get through the deployment.
We should probably consider that to be a new bug, since QE has already VERIFIED this bug and we may be dealing with a new problem. With the new bug, can you include your complete inventory, as well as the output of `oc get nodes` and `oc get csr -o yaml`? The last will likely contain private data for signed certificates, so please mark it as a private attachment unless it's just a test environment you don't care about. My suspicion is that the name on the CSR is different from what we're expecting it to be.
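To make that suspicion concrete, the check amounts to comparing the node names the installer expects against the names that actually appear on the CSRs. The sketch below is not how openshift-ansible does it; it is a hypothetical helper (`nodes_without_csr`) that assumes node CSRs can be recognized by a `system:node:<name>` requestor, as in the `oc get csr` output earlier in this bug. The short hostname in the example is invented purely to show what a mismatch would look like.

```python
from typing import Iterable, List

NODE_PREFIX = "system:node:"


def nodes_without_csr(expected_nodes: Iterable[str],
                      requestors: Iterable[str]) -> List[str]:
    """Return expected node names that no `system:node:<name>` requestor matches.

    The comparison is an exact string match, so a short hostname will not
    match a fully qualified one (and vice versa).
    """
    seen = {r[len(NODE_PREFIX):] for r in requestors if r.startswith(NODE_PREFIX)}
    return [node for node in expected_nodes if node not in seen]


if __name__ == "__main__":
    # Requestors in the style of the CSR listings in this bug.
    requestors = [
        "system:node:ip-172-18-0-218.ec2.internal",
        "system:node:ip-172-18-3-188.ec2.internal",
        "system:admin",
    ]
    # "ip-172-18-3-188" (short form) is a made-up mismatch for illustration.
    expected = ["ip-172-18-0-218.ec2.internal", "ip-172-18-3-188"]
    print(nodes_without_csr(expected, requestors))
```

A non-empty result would point at a naming difference between the inventory and what the nodes register as, which is the kind of detail the requested inventory and `oc get csr -o yaml` output would confirm or rule out.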
Created a new 3.10 bug, https://bugzilla.redhat.com/show_bug.cgi?id=1625817, to track the 3.10 issue. Since this original bug has been fixed and verified by QE, I am changing its status back to VERIFIED. If something still needs to be fixed in this bug, feel free to re-open it.
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.