Bug 1622945 - [3.11] Installation stuck at TASK [Approve node certificates when bootstrapping]
Summary: [3.11] Installation stuck at TASK [Approve node certificates when bootstrapping]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Michael Gugino
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1565405 1623204 1623248 1625817
Reported: 2018-08-28 08:39 UTC by Junqi Zhao
Modified: 2018-12-21 15:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1623204 1623248 1625817
Environment:
Last Closed: 2018-12-21 15:23:13 UTC
Target Upstream Version:
Embargoed:


Description Junqi Zhao 2018-08-28 08:39:03 UTC
Description of problem:
Installation variables:
vm_type: m3.large
node_number: 3
auth_type: allowall

Installation stuck at TASK [Approve node certificates when bootstrapping]

TASK [Dump the bootstrap hostnames] ********************************************
Tuesday 28 August 2018  14:08:32 +0800 (0:00:00.242)       0:12:26.156 ******** 
ok: [ec2-54-173-107-165.compute-1.amazonaws.com] => {
    "msg": [
        "ip-172-18-0-218.ec2.internal", 
        "ip-172-18-23-190.ec2.internal", 
        "ip-172-18-13-240.ec2.internal", 
        "ip-172-18-29-86.ec2.internal", 
        "ip-172-18-3-188.ec2.internal"
    ]
}

TASK [Approve node certificates when bootstrapping] ****************************
Tuesday 28 August 2018  14:08:32 +0800 (0:00:00.090)       0:12:26.247 ******** 
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).

The task stalled after printing "29 retries left" and made no further retries for a long time.
CSRs are pending for nodes: ip-172-18-23-190.ec2.internal, ip-172-18-3-188.ec2.internal, ip-172-18-13-240.ec2.internal
# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-0-218.ec2.internal    Ready     master    51m       v1.11.0+d4cacc0
ip-172-18-13-240.ec2.internal   Ready     <none>    46m       v1.11.0+d4cacc0
ip-172-18-23-190.ec2.internal   Ready     <none>    46m       v1.11.0+d4cacc0
ip-172-18-29-86.ec2.internal    Ready     compute   46m       v1.11.0+d4cacc0
ip-172-18-3-188.ec2.internal    Ready     compute   46m       v1.11.0+d4cacc0
# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-5pfrl                                              8m        system:node:ip-172-18-23-190.ec2.internal                 Pending
csr-7gsfg                                              48m       system:node:ip-172-18-0-218.ec2.internal                  Approved,Issued
csr-7qd8k                                              46m       system:node:ip-172-18-23-190.ec2.internal                 Approved,Issued
csr-cbvqt                                              46m       system:node:ip-172-18-13-240.ec2.internal                 Approved,Issued
csr-ccms4                                              21m       system:node:ip-172-18-23-190.ec2.internal                 Pending
csr-cv6tz                                              33m       system:node:ip-172-18-23-190.ec2.internal                 Pending
csr-cx8xk                                              50m       system:admin                                              Approved,Issued
csr-dtr5s                                              46m       system:node:ip-172-18-29-86.ec2.internal                  Approved,Issued
csr-jg2gd                                              46m       system:node:ip-172-18-3-188.ec2.internal                  Approved,Issued
csr-jqwjs                                              46m       system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-kg5dj                                              46m       system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-knfvn                                              34m       system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-kq7zh                                              21m       system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-l97cp                                              8m        system:node:ip-172-18-3-188.ec2.internal                  Pending
csr-m8pks                                              8m        system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-nmpcn                                              46m       system:node:ip-172-18-0-218.ec2.internal                  Approved,Issued
csr-tkkgt                                              50m       system:admin                                              Approved,Issued
csr-w8mkp                                              21m       system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-wmgqt                                              34m       system:node:ip-172-18-13-240.ec2.internal                 Pending
csr-wqkpg                                              46m       system:node:ip-172-18-23-190.ec2.internal                 Pending
node-csr-7dFzjOYLfpb4_Y1PHiI82fdySE45jP0tIlUO76ri-cM   46m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-9YRqjhAGecWHO5Of5x8sl4JQhPuPVQv03Jcr5qI1Jb8   46m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-pEcYR7djB4SMDXSQ1e6U2oRL1qHYXwq3On82Gm09Mkc   46m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-v_iwUUogoHmbiYv1E3BgSovXhGC65TG9GXKWgcBnxo4   46m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
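
A listing like the one above can be summarized per node with a short script. The following is a minimal sketch, assuming the plain four-column text output of `oc get csr` shown here; the function name `pending_csrs_by_node` and the trimmed `SAMPLE` text are illustrative and not part of openshift-ansible.

```python
from collections import defaultdict
from typing import Dict, List

NODE_PREFIX = "system:node:"


def pending_csrs_by_node(oc_get_csr_output: str) -> Dict[str, List[str]]:
    """Map node name -> names of its Pending CSRs, from plain `oc get csr` text."""
    pending = defaultdict(list)
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        # Expect four columns: NAME, AGE, REQUESTOR, CONDITION.
        # Skip the header row and anything that does not fit that shape.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        name, _age, requestor, condition = fields
        if condition == "Pending" and requestor.startswith(NODE_PREFIX):
            pending[requestor[len(NODE_PREFIX):]].append(name)
    return dict(pending)


# A few rows copied from the listing above (column spacing trimmed).
SAMPLE = """\
NAME        AGE       REQUESTOR                                   CONDITION
csr-5pfrl   8m        system:node:ip-172-18-23-190.ec2.internal   Pending
csr-7gsfg   48m       system:node:ip-172-18-0-218.ec2.internal    Approved,Issued
csr-jqwjs   46m       system:node:ip-172-18-13-240.ec2.internal   Pending
csr-kg5dj   46m       system:node:ip-172-18-3-188.ec2.internal    Pending
csr-ccms4   21m       system:node:ip-172-18-23-190.ec2.internal   Pending
"""

if __name__ == "__main__":
    for node, names in sorted(pending_csrs_by_node(SAMPLE).items()):
        print(node, names)
```

On a live master the same function could be fed the captured output of `oc get csr` instead of `SAMPLE`.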


Version-Release number of the following components:
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

How reproducible:
Always
Steps to Reproduce:
1. 
2.
3.

Actual results:
Installation stuck at TASK [Approve node certificates when bootstrapping]

Expected results:

Additional info:
It seems to be easily reproduced with a larger number of nodes.

Comment 2 Johnny Liu 2018-08-28 08:53:29 UTC
I also hit this issue; it seems easy to reproduce on larger clusters.

In a 3-node cluster (1 master node + 1 infra node + 1 compute node), installation completed successfully.

In a 5-node cluster (3 master nodes + 1 infra node + 1 compute node), installation failed.

In comment 0, the cluster is 1 master node + 2 infra nodes + 2 compute nodes.

This was introduced recently (since the .24 build) and is blocking master HA environment setup and QE's testing.

Comment 5 Michael Gugino 2018-08-28 14:50:07 UTC
PR Created: https://github.com/openshift/openshift-ansible/pull/9800

Comment 7 Johnny Liu 2018-08-29 03:54:41 UTC
The PR is merged into openshift-ansible-3.11.0-0.25.0; moving this bug to ON_QA.

Comment 8 Johnny Liu 2018-08-29 03:57:01 UTC
Verified this bug with openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch; result: PASS.


TASK [Dump the bootstrap hostnames] ********************************************
Wednesday 29 August 2018  11:37:05 +0800 (0:00:00.265)       0:00:50.049 ****** 
ok: [ec2-52-54-185-242.compute-1.amazonaws.com] => {
    "msg": [
        "ip-172-18-12-30.ec2.internal", 
        "ip-172-18-3-190.ec2.internal", 
        "ip-172-18-7-182.ec2.internal", 
        "ip-172-18-2-12.ec2.internal", 
        "ip-172-18-8-189.ec2.internal"
    ]
}

TASK [Approve node certificates when bootstrapping] ****************************
Wednesday 29 August 2018  11:37:05 +0800 (0:00:00.083)       0:00:50.132 ****** 
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (27 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (26 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (25 retries left).
changed: [ec2-52-54-185-242.compute-1.amazonaws.com] => {"attempts": 7, "changed": true, "client_approve_results": [], "rc": 0, "server_approve_results": ["certificatesigningrequest.certificates.k8s.io/csr-gzgp8 approved\n", "certificatesigningrequest.certificates.k8s.io/csr-8m2cf approved\n"]}
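
The retry pattern visible in this task output (repeated failures, then success with "attempts": 7) can be illustrated with a generic polling loop. This is a minimal sketch, not Ansible's or openshift-ansible's implementation; `retry_until` and `make_fake_check` are hypothetical names, and Ansible's own accounting of retries versus attempts may differ.

```python
import time
from typing import Callable, Tuple


def retry_until(check: Callable[[], bool],
                max_attempts: int = 30,
                delay_seconds: float = 0.0) -> Tuple[bool, int]:
    """Call check() until it returns True or max_attempts is reached.

    Returns (succeeded, number_of_attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return True, attempt
        if attempt < max_attempts:
            # Wait before the next attempt; no sleep after the final failure.
            time.sleep(delay_seconds)
    return False, max_attempts


def make_fake_check(succeed_on: int) -> Callable[[], bool]:
    """Build a stand-in check that starts succeeding on its Nth call."""
    calls = {"n": 0}

    def check() -> bool:
        calls["n"] += 1
        return calls["n"] >= succeed_on

    return check


if __name__ == "__main__":
    # Stand-in for "all expected node CSRs are approved", true on the 7th poll.
    print(retry_until(make_fake_check(7)))
```

In a real check, the callable would query the cluster (for example by inspecting `oc get csr` output) rather than count calls.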

[root@ip-172-18-12-30 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-4fb8x                                              22m       system:node:ip-172-18-3-190.ec2.internal                  Approved,Issued
csr-4kbph                                              24m       system:admin                                              Approved,Issued
csr-6qrkd                                              24m       system:admin                                              Approved,Issued
csr-7bb2s                                              19m       system:node:ip-172-18-12-30.ec2.internal                  Approved,Issued
csr-8lvvf                                              24m       system:admin                                              Approved,Issued
csr-8m2cf                                              18m       system:node:ip-172-18-8-189.ec2.internal                  Approved,Issued
csr-bgx75                                              22m       system:node:ip-172-18-7-182.ec2.internal                  Approved,Issued
csr-bvl28                                              19m       system:node:ip-172-18-3-190.ec2.internal                  Approved,Issued
csr-cz2b7                                              24m       system:admin                                              Approved,Issued
csr-gzgp8                                              18m       system:node:ip-172-18-2-12.ec2.internal                   Approved,Issued
csr-lfbrj                                              24m       system:admin                                              Approved,Issued
csr-n5dnc                                              24m       system:admin                                              Approved,Issued
csr-pj4pt                                              19m       system:node:ip-172-18-7-182.ec2.internal                  Approved,Issued
csr-q7kcz                                              19m       system:node:ip-172-18-8-189.ec2.internal                  Approved,Issued
csr-t6dj9                                              19m       system:node:ip-172-18-2-12.ec2.internal                   Approved,Issued
csr-zqxlr                                              22m       system:node:ip-172-18-12-30.ec2.internal                  Approved,Issued
node-csr-6PKGRUMJDch4nq6Hd45QCer0so5VZjCe7DdCrbmkMBI   19m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-p2foh5KptJCXUbKmE3DtN3HSEOUwKaiCnq5nPCZWzo4   19m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued

Comment 9 Weihua Meng 2018-09-04 09:50:02 UTC
Hit this issue again with:
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch

TASK [Approve node certificates when bootstrapping] ****************************
Tuesday 04 September 2018  17:24:17 +0800 (0:00:00.083)       0:17:50.596 ***** 
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (3 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (2 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
fatal: [share3-wmengah311-master-etcd-zone1-1.0904-beu.qe.rhcloud.com]: FAILED! => {"attempts": 30, "changed": false, "msg": "Cound not find csr for nodes: share3-wmengah311-master-etcd-zone1-1", "state": "unknown"}

[root@share3-wmengah311-master-etcd-zone1-1 ~]# oc get node 
NAME                                    STATUS    ROLES     AGE       VERSION
share3-wmengah311-master-etcd-zone2-1   Ready     master    29m       v1.11.0+d4cacc0
share3-wmengah311-master-etcd-zone2-2   Ready     master    29m       v1.11.0+d4cacc0

[root@share3-wmengah311-master-etcd-zone1-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-5h8hn                                              11m       system:admin                                              Pending
csr-d6n6v                                              11m       system:admin                                              Pending
csr-dk679                                              11m       system:admin                                              Pending
csr-fg4ms                                              11m       system:admin                                              Approved,Issued
csr-hdbv7                                              8m        system:node:share3-wmengah311-master-etcd-zone2-2         Pending
csr-htrq2                                              7m        system:node:share3-wmengah311-master-etcd-zone2-1         Pending
csr-k742q                                              11m       system:admin                                              Approved,Issued
csr-lv872                                              8m        system:node:share3-wmengah311-master-etcd-zone2-1         Pending
csr-lz9mg                                              7m        system:node:share3-wmengah311-master-etcd-zone2-2         Pending
csr-n2nvz                                              3m        system:node:share3-wmengah311-master-etcd-zone1-1         Pending
csr-wwg5d                                              1m        system:node:share3-wmengah311-master-etcd-zone1-1         Pending
csr-xv2vv                                              11m       system:admin                                              Approved,Issued
csr-xvnml                                              7m        system:node:share3-wmengah311-master-etcd-zone1-1         Pending
csr-znd2l                                              8m        system:node:share3-wmengah311-master-etcd-zone1-1         Pending
node-csr--kVBjtGE8wbaDAscL3IH5_6rWZGku46qSV0pL_XqNgs   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-2UaLHRSRPOXm9nhI98SBfyzgNkX3V-C7SIxP5tbqJqU   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-67v1yVAJlG-gIHZHvOA90d6Dhjb9YhCou94wvg2EARM   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-Tp8zgZHGhOkFjteTYtr_OlDakdJ_25omEPT07u4eIuA   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-_e2GsHH_F2ccF0SY21cBhgB_rQtgW2lqbh2zfmfrm-8   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending
node-csr-zqu_gDIn5YLNGWw5zXaU0mIe8gseNLlQoQUmBNCvVaw   7m        system:serviceaccount:openshift-infra:node-bootstrapper   Pending

Comment 10 Marek Goldmann 2018-09-04 12:28:42 UTC
Seeing similar issue with openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch.

Comment 11 Serena Cortopassi 2018-09-04 14:48:40 UTC
(In reply to Marek Goldmann from comment #10)
> Seeing similar issue with
> openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch.

The same here with version 3.10.

Comment 12 Rhys Oxenham 2018-09-05 12:59:54 UTC
Also witnessed in openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch

Not able to get through the deployment.

Comment 13 Scott Dodson 2018-09-05 17:21:12 UTC
We should probably consider that to be a new bug since QE has already VERIFIED this bug and we may be dealing with a new problem.

With the new bug can you include your complete inventory, as well as the output of `oc get nodes` and `oc get csr -o yaml`

The last will likely contain private data for signed certificates so please mark it as a private attachment unless it's just a test environment you don't care about.

My suspicion is that the name on the CSR is different than what we're expecting it to be.
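
One way to test that suspicion is to compare the node names the installer expects against the names embedded in the CSR requestors (`system:node:<name>`). The sketch below assumes the requestor strings have already been collected, for example from the REQUESTOR column of `oc get csr`; `nodes_without_csr` and the `example.com` host names are hypothetical.

```python
from typing import Iterable, List

NODE_PREFIX = "system:node:"


def nodes_without_csr(expected_nodes: Iterable[str],
                      requestors: Iterable[str]) -> List[str]:
    """Return expected node names that no `system:node:<name>` requestor matches exactly."""
    seen = {r[len(NODE_PREFIX):] for r in requestors if r.startswith(NODE_PREFIX)}
    return [node for node in expected_nodes if node not in seen]


if __name__ == "__main__":
    # Hypothetical data: the installer expects fully qualified names,
    # but one node requested its certificate under a short host name.
    expected = ["node1.example.com", "node2.example.com"]
    requestors = [
        "system:node:node1",
        "system:node:node2.example.com",
        "system:admin",
    ]
    print(nodes_without_csr(expected, requestors))
```

A non-empty result would point to a naming mismatch (short name versus FQDN, for instance) rather than a missing CSR.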

Comment 14 Wei Sun 2018-09-06 03:21:46 UTC
Created a new 3.10 bug, https://bugzilla.redhat.com/show_bug.cgi?id=1625817, to track the 3.10 issue. Since this original bug has been fixed and verified by QE, I am changing its status back to VERIFIED. If something still needs to be fixed in this bug, feel free to re-open it.

Comment 19 Luke Meyer 2018-12-21 15:23:13 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.

