Bug 1608784 - install OCP v3.11 failed at TASK [Approve bootstrap nodes]
Summary: install OCP v3.11 failed at TASK [Approve bootstrap nodes]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Michael Gugino
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-07-26 09:52 UTC by Weihua Meng
Modified: 2018-10-11 07:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:22:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2652 0 None None None 2018-10-11 07:22:40 UTC

Description Weihua Meng 2018-07-26 09:52:14 UTC
Description of problem:
Installation of OCP v3.11 failed.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.9.0.git.0.195bae3None.noarch.rpm

How reproducible:
Always (3 out of 3)

Steps to Reproduce:
1. Install OCP 3.11 on RHEL Atomic Host
on AWS EC2
vm_type: m4.xlarge
1 master + 1 infra + 1 compute

Actual results:
Install failed.
TASK [Approve bootstrap nodes] *************************************************
Thursday 26 July 2018  04:52:37 -0400 (0:00:00.085)       0:21:38.148 ********* 
fatal: [ec2-xxx.compute-1.amazonaws.com]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.", "nodes": [{"client_accepted": true, "csrs": {"csr-8qvc5": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata": {"creationTimestamp": "2018-07-26T08:45:30Z", "generateName": "csr-", "name": "csr-8qvc5", "namespace": "", "resourceVersion": "689", "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-8qvc5", "uid": "40841d9d-90b0-11e8-a412-0e9ba41fd52c"}, "spec": {"groups": ["system:masters", "system:cluster-admins", "system:authenticated"], "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQkJEQ0JyQUlCQURCS01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14TVRBdkJnTlZCQU1US0hONQpjM1JsYlRwdWIyUmxPbWx3TFRFM01pMHhPQzB3TFRFNU5pNWxZekl1YVc1MFpYSnVZV3d3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFRVmlzcDd1akJ4aWxON0w4amc1MnkxM3dnOERZRm0vTGNVRGxDR1FubWYKZytObERjNE5Wei80MThXM055TDdza1pvcGJySHE1N0hJdjVMVlBNYXJkK0FvQUF3Q2dZSUtvWkl6ajBFQXdJRApSd0F3UkFJZ1RDNStnaTk1ajg2TlpuNzlQQVVUbjZ3SU1aNnJxT2ZJR0ZQMyszSnZBbllDSUdDcWNoUnVOSDE2CldmY1ltTGllRGh0UThzbmpqeGRuWDFpZDN2S29LRFZWCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=", "usages": ["digital signature", "key encipherment", "client auth"], "username": "system:admin"}, "status": {"certificate": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUNoekNDQVcrZ0F3SUJBZ0lVRk1jaElISGR4ZHdZOHB4cUlzbTdxWjZHalRvd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0pqRWtNQ0lHQTFVRUF3d2JiM0JsYm5Ob2FXWjBMWE5wWjI1bGNrQXhOVE15TlRrME5Ua3dNQjRYRFRFNApNRGN5TmpBNE5ERXdNRm9YRFRFNU1EY3lOakE0TkRFd01Gb3dTakVWTUJNR0ExVUVDaE1NYzNsemRHVnRPbTV2ClpHVnpNVEV3THdZRFZRUURFeWh6ZVhOMFpXMDZibTlrWlRwcGNDMHhOekl0TVRndE1DMHhPVFl1WldNeUxtbHUKZEdWeWJtRnNNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVGWXJLZTdvd2NZcFRleS9JNE9kcwp0ZDhJUEEyQlp2eTNGQTVRaGtKNW40UGpaUTNPRFZjLytOZkZ0emNpKzdKR2FLVzZ4NnVleHlMK1MxVHpHcTNmCmdLTlVNRkl3RGdZRFZSMFBBUUgvQkFRREFnV2dNQk1HQTFVZEpRUU1NQW9HQ0NzR0FRVUZCd01DTUF3R0ExVWQKRXdFQi93UUNNQUF3SFFZRFZSME9CQllFRkM5Tmk3VXk0ay9mOVhoZlhOWmFYR29NMXdmMU1BMEdDU3FHU0liMwpEUUVCQ3dVQUE0SUJBUUFQMUEzL0JDbnNrTWRxekV5V0svVHpsMm9heHhacVJiNndnR2FIQ0xGV0xvcThDMkNXCmVZU2MwWDNSWUZuQ2dOM3gzblFMQXpmOERIY3NMZ1psNGFMVjh4WmVpUGF0b0l6YlQ4aE96Z3NPYXBGM0pBZ28KUW1IVnRId1lnZFVZRHY3WUdYYWd4ZEg2Uk5zK05rbTZHN2N2djlXR1lLMm9TZSs2MDd4RlRPNmlkTVBZSlNxdApibFc2OUs3ZTloWFlPbFpNeElXUHNtOGpBWVlhS281WUNDR2JtZDZlSXdlUTBpWHJ0TVFJUlFXMTlNLythRDdYCk5FYWZKR2JUaU5QdDV4Nzg3Uk00YytEYUpYbm42SzRldURmakFET3pKM0dnVFp1L3Vjd2lrYjFBMVd4UERpY3IKdW5xRDZLMm1ndlRzWk1YY1RSdnNTeGFsWTVudEMrYTgzSVQyCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K", "conditions": [{"lastUpdateTime": "2018-07-26T08:45:40Z", "message": "Auto approving kubelet client certificate after SubjectAccessReview.", "reason": "AutoApproved", "type": "Approved"}]}}, "csr-q7z49": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata": {"creationTimestamp": "2018-07-26T08:48:43Z", "generateName": "csr-", "name": "csr-q7z49", "namespace": "", "resourceVersion": "1971", "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-q7z49", "uid": "b3bfb38b-90b0-11e8-a412-0e9ba41fd52c"}, "spec": {"groups": ["system:nodes", "system:authenticated"], "request": 
...

Failure summary:


  1. Hosts:    ec2-xxx.compute-1.amazonaws.com
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Report approval errors
     Message:  Node approval failed
tools/launch_instance.rb:458:in `block in run_ansible_playbook': ansible failed execution, see logs (RuntimeError)
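
The `request` and `certificate` fields in the task output above are base64-encoded PEM blocks. The following is a minimal sketch of how one might decode such a field while triaging; the helper name `decode_pem_field` is hypothetical, and only the first 48 characters of the `request` value from the log are used here, since the full output is truncated.

```python
import base64


def decode_pem_field(value):
    """Decode a base64-encoded PEM string as found in a CSR's JSON fields."""
    return base64.b64decode(value).decode("ascii")


# First 48 characters of the "request" value for csr-8qvc5 in the log above.
REQUEST_PREFIX = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K"

if __name__ == "__main__":
    # Prints the PEM header line: -----BEGIN CERTIFICATE REQUEST-----
    print(decode_pem_field(REQUEST_PREFIX), end="")
```

Decoding the complete value would give the full PEM text, which could then be inspected with a certificate tool of your choice.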

on master
[root@ip-172-18-0-196 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-8qvc5                                              21m       system:admin                                              Approved,Issued
csr-q7z49                                              18m       system:node:ip-172-18-0-196.ec2.internal                  Approved,Issued
csr-rrmtd                                              17m       system:node:ip-172-18-0-196.ec2.internal                  Approved,Issued
csr-t99p6                                              21m       system:admin                                              Approved,Issued
node-csr-G9BQPQAbXwUU3uQC_ZrFpBWagFmujL0rPxkfbiRkWmU   17m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-Mq2fQJQsj5g7v3-raLL4Htn-p2NT577oHnHrCMmUFM0   17m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
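
To check a listing like the one above without reading it by eye, a small script can parse the table and report any CSR that is not both Approved and Issued. This is only a sketch: it assumes the four-column layout shown above, the function names are hypothetical, and the sample holds just two rows copied from the listing.

```python
def parse_csr_table(text):
    """Parse the plain-text table printed by `oc get csr` into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # lines[0] is the NAME/AGE/REQUESTOR/CONDITION header
        name, age, requestor, condition = line.split(None, 3)
        rows.append({
            "name": name,
            "age": age,
            "requestor": requestor,
            "conditions": condition.strip().split(","),
        })
    return rows


def not_issued(rows):
    """Return the names of CSRs that are not both Approved and Issued."""
    return [r["name"] for r in rows
            if not {"Approved", "Issued"} <= set(r["conditions"])]


SAMPLE = """\
NAME        AGE       REQUESTOR                                  CONDITION
csr-8qvc5   21m       system:admin                               Approved,Issued
csr-q7z49   18m       system:node:ip-172-18-0-196.ec2.internal   Approved,Issued
"""

if __name__ == "__main__":
    # Every CSR in the sample is Approved,Issued, so this prints an empty list.
    print(not_issued(parse_csr_table(SAMPLE)))
```

An empty result matches what the listing shows: every CSR on the master was already approved and issued even though the task timed out.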

Expected results:
Install succeeds.

Comment 3 Michael Gugino 2018-08-09 23:09:35 UTC
I tried this on Fedora Atomic Host, 3 hosts, install succeeds as expected.

Please attach failure logs directly to this BZ.  The links above are all expired.

Also, always attach inventory and variables, as well as what playbook you are running.

Comment 4 Weihua Meng 2018-08-10 02:30:29 UTC
That is good news.
If the install succeeds two weeks later, the bug was likely fixed during that period.

(In reply to Michael Gugino from comment #3)
> I tried this on Fedora Atomic Host, 3 hosts, install succeeds as expected.

Which playbook was used?
Is it the one I used when I reported the bug two weeks ago?
If not, then changes made during those two weeks likely fixed it.

> 
> Please attach failure logs directly to this BZ.  The links above are all
> expired.

They expired because more than two weeks have passed.
I did not realize the logs would be needed for so long; sorry about that.

> 
> Also, always attach inventory and variables, as well as what playbook you
> are running.

Comment 5 Michael Gugino 2018-08-10 13:27:55 UTC
@wmeng

I need install logs, inventory, and need to know what/how you ran ansible-playbook.

Please retry whatever was done to discover this problem and provide this information so I can try to figure out what the problem is.

Comment 7 Weihua Meng 2018-08-12 02:02:05 UTC
Removing testblocker, as more than two weeks have passed and the issue does not occur with the latest build, 3.11.0-0.13.0.

The bug can still be reproduced with OCP v3.11.0-0.9.0.

Comment 8 Michael Gugino 2018-08-13 15:39:49 UTC
I don't see any immediate reason why this would have failed in v3.11.0-0.9.0.  The output of the csr module appears to indicate that all the csrs are approved; the problem is a timeout with no additional info.

Results:

   "results":[
      {
         "cmd":"/usr/local/bin/oc adm certificate approve csr-6p2xq",
         "results":{ },
         "returncode":0
      },
      {
         "cmd":"/usr/local/bin/oc adm certificate approve csr-75qvw",
         "results":{ },
         "returncode":0
      },
      {
         "cmd":"/usr/local/bin/oc adm certificate approve csr-n6hk8",
         "results":{ },
         "returncode":0
      },
      {
         "cmd":"/usr/local/bin/oc adm certificate approve node-csr-4nCWplUj64E5xCyQ8-mVxTTDExShGyZ0Z6synaGCwZI",
         "results":{ },
         "returncode":0
      },
      {
         "cmd":"/usr/local/bin/oc adm certificate approve node-csr-aE-RL4RCYc5kqZaVP64iPdcDE8Kpt8xCGbF4Kr8w3mM",
         "results":{ },
         "returncode":0
      },
      {
         "cmd":"/usr/local/bin/oc adm certificate approve node-csr-zEm4fCqhtwG_QLnsOBibyEP6N2vFQcqQ6UnpWxFa1hE",
         "results":{ },
         "returncode":0
      }


As you can see, there are only 6 results posted, 2 for each of 3 nodes, but with the 4 bootstrap hosts listed below we should have 8 total:

TASK [Dump the bootstrap hostnames] ********************************************
Sunday 12 August 2018  09:27:44 +0800 (0:00:00.218)       0:20:19.480 ********* 
ok: [qe-wmengah31109-master-etcd-1.0812-v8n.qe.rhcloud.com] => {
    "msg": [
        "qe-wmengah31109-master-etcd-1", 
        "qe-wmengah31109-node-registry-router-1", 
        "qe-wmengah31109-node-1", 
        "qe-wmengah31109-node-2"
    ]
}

Most likely fixed by: 446e64cd3744b72fce9512ab1225e75475a3104b but it's not clear why.
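
The arithmetic in comment 8 can be sketched as a short check. It assumes two CSRs per bootstrapped host, as the comment counts them, and uses the hostnames and `oc adm certificate approve` commands quoted above; the function names are hypothetical and this is not the installer's own logic.

```python
import re

CSRS_PER_NODE = 2  # assumption taken from comment 8: two CSRs per bootstrapped host

_APPROVE_RE = re.compile(r"oc adm certificate approve (\S+)$")


def approved_names(results):
    """Collect CSR names from approve commands that exited with return code 0."""
    names = []
    for entry in results:
        match = _APPROVE_RE.search(entry["cmd"])
        if match and entry["returncode"] == 0:
            names.append(match.group(1))
    return names


def missing_approvals(hostnames, results):
    """Expected CSR count minus the number of successful approvals."""
    return CSRS_PER_NODE * len(hostnames) - len(approved_names(results))


HOSTNAMES = [
    "qe-wmengah31109-master-etcd-1",
    "qe-wmengah31109-node-registry-router-1",
    "qe-wmengah31109-node-1",
    "qe-wmengah31109-node-2",
]

RESULTS = [
    {"cmd": "/usr/local/bin/oc adm certificate approve " + name, "returncode": 0}
    for name in [
        "csr-6p2xq",
        "csr-75qvw",
        "csr-n6hk8",
        "node-csr-4nCWplUj64E5xCyQ8-mVxTTDExShGyZ0Z6synaGCwZI",
        "node-csr-aE-RL4RCYc5kqZaVP64iPdcDE8Kpt8xCGbF4Kr8w3mM",
        "node-csr-zEm4fCqhtwG_QLnsOBibyEP6N2vFQcqQ6UnpWxFa1hE",
    ]
]

if __name__ == "__main__":
    # 4 hosts * 2 CSRs = 8 expected, 6 approved, so this prints 2.
    print(missing_approvals(HOSTNAMES, RESULTS))
```

A non-zero result means at least one host never had both of its CSRs approved, which is consistent with the task timing out.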

Comment 9 Scott Dodson 2018-08-16 13:28:21 UTC
Can we please test with openshift-ansible-3.11.0-0.16.0 which contains the commit mentioned in the previous comment?

Comment 10 Weihua Meng 2018-08-17 06:22:15 UTC
Fixed.
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch

Installation succeeded and cluster is working well.

Comment 12 errata-xmlrpc 2018-10-11 07:22:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

