Description of problem:
install OCP v3.11 failed

Version-Release number of the following components:
openshift-ansible-3.11.0-0.9.0.git.0.195bae3None.noarch.rpm

How reproducible:
Always (3 out of 3)

Steps to Reproduce:
1. Install OCP 3.11 on RHEL Atomic Host on AWS EC2
   vm_type: m4.xlarge
   1 master + 1 infra + 1 compute

Actual results:
Install failed.

TASK [Approve bootstrap nodes] *************************************************
Thursday 26 July 2018 04:52:37 -0400 (0:00:00.085) 0:21:38.148 *********
fatal: [ec2-xxx.compute-1.amazonaws.com]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.", "nodes": [{"client_accepted": true, "csrs": {"csr-8qvc5": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata": {"creationTimestamp": "2018-07-26T08:45:30Z", "generateName": "csr-", "name": "csr-8qvc5", "namespace": "", "resourceVersion": "689", "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-8qvc5", "uid": "40841d9d-90b0-11e8-a412-0e9ba41fd52c"}, "spec": {"groups": ["system:masters", "system:cluster-admins", "system:authenticated"], "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQkJEQ0JyQUlCQURCS01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14TVRBdkJnTlZCQU1US0hONQpjM1JsYlRwdWIyUmxPbWx3TFRFM01pMHhPQzB3TFRFNU5pNWxZekl1YVc1MFpYSnVZV3d3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFRVmlzcDd1akJ4aWxON0w4amc1MnkxM3dnOERZRm0vTGNVRGxDR1FubWYKZytObERjNE5Wei80MThXM055TDdza1pvcGJySHE1N0hJdjVMVlBNYXJkK0FvQUF3Q2dZSUtvWkl6ajBFQXdJRApSd0F3UkFJZ1RDNStnaTk1ajg2TlpuNzlQQVVUbjZ3SU1aNnJxT2ZJR0ZQMyszSnZBbllDSUdDcWNoUnVOSDE2CldmY1ltTGllRGh0UThzbmpqeGRuWDFpZDN2S29LRFZWCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=", "usages": ["digital signature", "key encipherment", "client auth"], "username": "system:admin"}, "status": {"certificate": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUNoekNDQVcrZ0F3SUJBZ0lVRk1jaElISGR4ZHdZOHB4cUlzbTdxWjZHalRvd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0pqRWtNQ0lHQTFVRUF3d2JiM0JsYm5Ob2FXWjBMWE5wWjI1bGNrQXhOVE15TlRrME5Ua3dNQjRYRFRFNApNRGN5TmpBNE5ERXdNRm9YRFRFNU1EY3lOakE0TkRFd01Gb3dTakVWTUJNR0ExVUVDaE1NYzNsemRHVnRPbTV2ClpHVnpNVEV3THdZRFZRUURFeWh6ZVhOMFpXMDZibTlrWlRwcGNDMHhOekl0TVRndE1DMHhPVFl1WldNeUxtbHUKZEdWeWJtRnNNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVGWXJLZTdvd2NZcFRleS9JNE9kcwp0ZDhJUEEyQlp2eTNGQTVRaGtKNW40UGpaUTNPRFZjLytOZkZ0emNpKzdKR2FLVzZ4NnVleHlMK1MxVHpHcTNmCmdLTlVNRkl3RGdZRFZSMFBBUUgvQkFRREFnV2dNQk1HQTFVZEpRUU1NQW9HQ0NzR0FRVUZCd01DTUF3R0ExVWQKRXdFQi93UUNNQUF3SFFZRFZSME9CQllFRkM5Tmk3VXk0ay9mOVhoZlhOWmFYR29NMXdmMU1BMEdDU3FHU0liMwpEUUVCQ3dVQUE0SUJBUUFQMUEzL0JDbnNrTWRxekV5V0svVHpsMm9heHhacVJiNndnR2FIQ0xGV0xvcThDMkNXCmVZU2MwWDNSWUZuQ2dOM3gzblFMQXpmOERIY3NMZ1psNGFMVjh4WmVpUGF0b0l6YlQ4aE96Z3NPYXBGM0pBZ28KUW1IVnRId1lnZFVZRHY3WUdYYWd4ZEg2Uk5zK05rbTZHN2N2djlXR1lLMm9TZSs2MDd4RlRPNmlkTVBZSlNxdApibFc2OUs3ZTloWFlPbFpNeElXUHNtOGpBWVlhS281WUNDR2JtZDZlSXdlUTBpWHJ0TVFJUlFXMTlNLythRDdYCk5FYWZKR2JUaU5QdDV4Nzg3Uk00YytEYUpYbm42SzRldURmakFET3pKM0dnVFp1L3Vjd2lrYjFBMVd4UERpY3IKdW5xRDZLMm1ndlRzWk1YY1RSdnNTeGFsWTVudEMrYTgzSVQyCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K", "conditions": [{"lastUpdateTime": "2018-07-26T08:45:40Z", "message": "Auto approving kubelet client certificate after SubjectAccessReview.", "reason": "AutoApproved", "type": "Approved"}]}}, "csr-q7z49": {"apiVersion": "certificates.k8s.io/v1beta1", "kind": "CertificateSigningRequest", "metadata": {"creationTimestamp": "2018-07-26T08:48:43Z", "generateName": "csr-", "name": "csr-q7z49", "namespace": "", "resourceVersion": "1971", "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-q7z49", "uid": "b3bfb38b-90b0-11e8-a412-0e9ba41fd52c"}, "spec": {"groups": ["system:nodes", "system:authenticated"], "request": ... Failure summary: 1. 
Hosts: ec2-xxx.compute-1.amazonaws.com
Play: Approve any pending CSR requests from inventory nodes
Task: Report approval errors
Message: Node approval failed

tools/launch_instance.rb:458:in `block in run_ansible_playbook': ansible failed execution, see logs (RuntimeError)

on master
[root@ip-172-18-0-196 ~]# oc get csr
NAME                                                   AGE   REQUESTOR                                                 CONDITION
csr-8qvc5                                              21m   system:admin                                              Approved,Issued
csr-q7z49                                              18m   system:node:ip-172-18-0-196.ec2.internal                  Approved,Issued
csr-rrmtd                                              17m   system:node:ip-172-18-0-196.ec2.internal                  Approved,Issued
csr-t99p6                                              21m   system:admin                                              Approved,Issued
node-csr-G9BQPQAbXwUU3uQC_ZrFpBWagFmujL0rPxkfbiRkWmU   17m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-Mq2fQJQsj5g7v3-raLL4Htn-p2NT577oHnHrCMmUFM0   17m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued

Expected results:
Install succeeds.
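The `oc get csr` listing in the description can also be checked mechanically. The following is only a rough sketch: `parse_csr_table` and `not_issued` are hypothetical helper names, not part of `oc` or openshift-ansible, and the parser assumes the four whitespace-separated columns shown in the listing (NAME, AGE, REQUESTOR, CONDITION) with no spaces inside a field.

```python
# Hypothetical helpers for illustration; not part of oc or openshift-ansible.

def parse_csr_table(text):
    """Parse `oc get csr` text output into a list of dicts.

    Assumes four whitespace-separated columns and a single header row.
    """
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, age, requestor, condition = line.split()
        rows.append(
            {"name": name, "age": age, "requestor": requestor, "condition": condition}
        )
    return rows


def not_issued(rows):
    """Return the names of CSRs whose CONDITION does not include 'Issued'."""
    return [r["name"] for r in rows if "Issued" not in r["condition"].split(",")]


# The listing captured on the master in this report.
OC_GET_CSR = """
NAME AGE REQUESTOR CONDITION
csr-8qvc5 21m system:admin Approved,Issued
csr-q7z49 18m system:node:ip-172-18-0-196.ec2.internal Approved,Issued
csr-rrmtd 17m system:node:ip-172-18-0-196.ec2.internal Approved,Issued
csr-t99p6 21m system:admin Approved,Issued
node-csr-G9BQPQAbXwUU3uQC_ZrFpBWagFmujL0rPxkfbiRkWmU 17m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
node-csr-Mq2fQJQsj5g7v3-raLL4Htn-p2NT577oHnHrCMmUFM0 17m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
"""

rows = parse_csr_table(OC_GET_CSR)
print(len(rows), "CSRs listed; not issued:", not_issued(rows))
```

For this listing every CSR is already `Approved,Issued`, so the check reports nothing outstanding even though the installer timed out.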
I tried this on Fedora Atomic Host, 3 hosts, install succeeds as expected. Please attach failure logs directly to this BZ. The links above are all expired. Also, always attach inventory and variables, as well as what playbook you are running.
That is good news. If the install succeeded after two weeks, it was likely fixed during this period.

(In reply to Michael Gugino from comment #3)
> I tried this on Fedora Atomic Host, 3 hosts, install succeeds as expected.

Which playbook was used? Is it the one I used when I reported the bug two weeks ago? If not, then changes made during those two weeks likely fixed it.

> Please attach failure logs directly to this BZ. The links above are all
> expired.

They are gone because more than two weeks have passed. I did not realize the logs would be needed for such a long time; sorry about that.

> Also, always attach inventory and variables, as well as what playbook you
> are running.
@wmeng I need install logs, inventory, and need to know what/how you ran ansible-playbook. Please retry whatever was done to discover this problem and provide this information so I can try to figure out what the problem is.
Removing testblocker: more than two weeks have passed, and the issue does not occur with the latest build, 3.11.0-0.13.0.

OCP v3.11.0-0.9.0 can reproduce this bug.
I don't see any immediate reason why this would have failed in v3.11.0-0.9.0. The output of the csr module appears to indicate that all the CSRs are approved; the problem is a timeout with no additional info.

Results:

"results":[
   {
      "cmd":"/usr/local/bin/oc adm certificate approve csr-6p2xq",
      "results":{ },
      "returncode":0
   },
   {
      "cmd":"/usr/local/bin/oc adm certificate approve csr-75qvw",
      "results":{ },
      "returncode":0
   },
   {
      "cmd":"/usr/local/bin/oc adm certificate approve csr-n6hk8",
      "results":{ },
      "returncode":0
   },
   {
      "cmd":"/usr/local/bin/oc adm certificate approve node-csr-4nCWplUj64E5xCyQ8-mVxTTDExShGyZ0Z6synaGCwZI",
      "results":{ },
      "returncode":0
   },
   {
      "cmd":"/usr/local/bin/oc adm certificate approve node-csr-aE-RL4RCYc5kqZaVP64iPdcDE8Kpt8xCGbF4Kr8w3mM",
      "results":{ },
      "returncode":0
   },
   {
      "cmd":"/usr/local/bin/oc adm certificate approve node-csr-zEm4fCqhtwG_QLnsOBibyEP6N2vFQcqQ6UnpWxFa1hE",
      "results":{ },
      "returncode":0
   }

As you can see, there are only 6 results posted, 2 for each of 3 nodes, but with 4 hosts in the bootstrap list we should have 8 total:

TASK [Dump the bootstrap hostnames] ********************************************
Sunday 12 August 2018 09:27:44 +0800 (0:00:00.218) 0:20:19.480 *********
ok: [qe-wmengah31109-master-etcd-1.0812-v8n.qe.rhcloud.com] => {
    "msg": [
        "qe-wmengah31109-master-etcd-1",
        "qe-wmengah31109-node-registry-router-1",
        "qe-wmengah31109-node-1",
        "qe-wmengah31109-node-2"
    ]
}

Most likely fixed by 446e64cd3744b72fce9512ab1225e75475a3104b, but it's not clear why.
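The arithmetic in the comment above (6 approvals seen, 8 expected) can be written out as a small check. This is a minimal sketch, not code from openshift-ansible: `missing_approvals` and `CSRS_PER_NODE` are hypothetical names, and the figure of 2 CSRs per node is taken from the comment rather than from the installer source.

```python
# Hypothetical helper for illustration; not part of openshift-ansible.

CSRS_PER_NODE = 2  # assumption taken from the comment: 2 CSRs per bootstrapped node


def missing_approvals(hostnames, approve_results, per_node=CSRS_PER_NODE):
    """Return expected approvals minus approvals that returned 0."""
    expected = per_node * len(hostnames)
    approved = sum(1 for r in approve_results if r.get("returncode") == 0)
    return expected - approved


# Hostnames from the "Dump the bootstrap hostnames" task output.
hostnames = [
    "qe-wmengah31109-master-etcd-1",
    "qe-wmengah31109-node-registry-router-1",
    "qe-wmengah31109-node-1",
    "qe-wmengah31109-node-2",
]

# CSR names from the approve results posted in the comment.
csr_names = [
    "csr-6p2xq",
    "csr-75qvw",
    "csr-n6hk8",
    "node-csr-4nCWplUj64E5xCyQ8-mVxTTDExShGyZ0Z6synaGCwZI",
    "node-csr-aE-RL4RCYc5kqZaVP64iPdcDE8Kpt8xCGbF4Kr8w3mM",
    "node-csr-zEm4fCqhtwG_QLnsOBibyEP6N2vFQcqQ6UnpWxFa1hE",
]
approve_results = [
    {
        "cmd": "/usr/local/bin/oc adm certificate approve " + name,
        "results": {},
        "returncode": 0,
    }
    for name in csr_names
]

# 4 hosts * 2 CSRs = 8 expected; 6 approvals were posted.
print("missing approvals:", missing_approvals(hostnames, approve_results))
```

A non-zero result here is consistent with one host never having its CSRs approved, which would match the timeout, though the comment does not establish which host that was.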
Can we please test with openshift-ansible-3.11.0-0.16.0 which contains the commit mentioned in the previous comment?
Fixed.

openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch

Installation succeeded and cluster is working well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652