Description of problem:

We have automation to install OpenShift on OpenStack in a repeatable way. A recent 3.10 install completes successfully, but on the attempt to scale to 250 nodes the install gets stuck on the approval step and we see several hundred Pending certificate signing requests (CSRs). The scaleup operation ran until about 161 nodes and eventually failed to approve nodes. The log message was:

TASK [Approve bootstrap nodes] *************************************************
task path: /home/cloud-user/openshift-ansible/playbooks/openshift-node/private/join.yml:40

Version-Release number of selected component (if applicable):

$ oc version
oc v3.10.10
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.10
kubernetes v1.10.0+b81c8f8

$ git describe
v3.10.0-rc.0-115-g1d59617

How reproducible:

We can often reproduce this CSR problem.

Steps to Reproduce:
1. Install OpenStack.
2. Install OpenShift on OpenStack.
3. Attempt to scale up to 250 nodes and observe the failure to approve nodes.

Actual results:

The openshift-ansible playbook openshift-ansible/playbooks/openshift-node/scaleup.yml fails with the following error:

TASK [Approve bootstrap nodes] *************************************************
task path: /home/cloud-user/openshift-ansible/playbooks/openshift-node/private/join.yml:40
Tuesday 03 July 2018 12:56:29 -0400 (0:00:00.179) 0:08:23.501 **********
fatal: [master-1.scale-ci.example.com]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested."

When I went to the cluster I saw just over 500 CSRs in "Pending" state:

root@master-1: /home/openshift # oc get csr --all-namespaces | grep Pending | wc -l
507

Expected results:

I expected the scale up to succeed.

Additional info:

I will attach the logs in further comments.
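The pending-CSR check in the report above (`oc get csr | grep Pending | wc -l`) can be sketched as a small parser over the listing output. This is a hypothetical helper, not code from openshift-ansible, and the sample text below is illustrative, not real cluster output.

```python
# Sketch: count Pending CSRs, equivalent to `oc get csr | grep Pending | wc -l`.
# SAMPLE_OC_GET_CSR is made-up example output for illustration only.
SAMPLE_OC_GET_CSR = """\
NAME           AGE   REQUESTOR                                                 CONDITION
node-csr-aaaa  28m   system:serviceaccount:openshift-infra:node-bootstrapper   Pending
csr-gkvck      22h   system:node:ip-172-31-5-189.us-west-2.compute.internal    Approved,Issued
node-csr-bbbb  5m    system:serviceaccount:openshift-infra:node-bootstrapper   Pending
"""

def count_pending(oc_output: str) -> int:
    """Count lines whose CONDITION column contains Pending."""
    return sum(1 for line in oc_output.splitlines() if "Pending" in line)

if __name__ == "__main__":
    print(count_pending(SAMPLE_OC_GET_CSR))
```

In the failing cluster this count was 507; on a healthy cluster it should drop to zero shortly after nodes join.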
*** Bug 1597904 has been marked as a duplicate of this bug. ***
*** Bug 1597871 has been marked as a duplicate of this bug. ***
Scott, is https://github.com/openshift/openshift-ansible/pull/9079 the fix for this bug?

As stated before, I was unable to reproduce the issue with openshift-ansible-3.10.12-1.git.264.fa89aae.el7.noarch.rpm. I'm still not sure what the issue was.
I encountered this problem again on the build dated 06-29: there were over 400 certificate signing requests (CSRs) in Pending state when the scale up to 250 nodes was run. I was able to catch this before the playbook timed out and manually approve the Pending requests, but I believe the playbook would have failed had I let it run.

$ git describe
v3.10.0-rc.0-129-g61563cb
$ git status
# On branch release-3.10

This is still a problem for scaleup on 3.10.
I retested on both openshift-ansible-3.10.10-1.git.248.0bb6b58.el7.noarch.rpm and openshift-ansible-3.10.14-1.git.273.a64b86b.el7.noarch.rpm. Unfortunately I still had no luck reproducing it. I'm assuming the bug can only be reproduced at large scale. Matt, I would appreciate it very much if you could help to verify it. (I suggest testing with openshift-ansible-3.10.15-1 or later, which has the fix from comment 4.)
Created attachment 1457563 [details]
The log file for an 8 node scaleup.

I scaled from 242 to 249 (only 8 nodes) and saw the CSR problem again today. I watched the 'oc get csr' output and saw all CSRs go from Pending to Approved, but the scale up of even this small number of nodes failed.
Comment on attachment 1457564 [details] The log file for an 8 node scaleup. This is the log file from 242 to 249 where we saw the csr problem.
I am seeing this behavior with only a 1 node scaleup. The CSRs are getting approved but are not issued. The change was introduced somewhere around version 3.6. The problem in openshift-3.6 was that the certificates controller wasn't running: https://github.com/openshift/origin/issues/13500

Here is a link where Clayton explains the change: https://github.com/openshift/openshift-ansible/issues/4685

I am seeing similar behavior in 3.10, with a CSR approved but not issued:

node-csr-aBq1AF-GQKHD0uFLKU0ASdT4VnwUqbmB0IJXmeoV6TI   28m   system:serviceaccount:openshift-infra:node-bootstrapper   Approved

whereas a valid CSR looks like:

csr-gkvck   22h   system:node:ip-172-31-5-189.us-west-2.compute.internal   Approved,Issued

Maybe check the master controllers log. Instead of watching 'oc get csr' you can use 'oc observe csr'.
In this particular case we're pretty sure what happens is that the masters are included in the list of nodes that we expect to see a pending CSR for. However, since the masters were created hours ago, those CSRs were approved back then and have since been purged from the API. We track two CSRs per host, a client-side and a server-side CSR. Logs in private comments in this bug show the following, which indicates the module believes none of the masters have been approved even though they were approved previously:

{"client_accepted": false, "csrs": {}, "denied": false, "name": "master-0.scale-ci.example.com", "server_accepted": false},
{"client_accepted": false, "csrs": {}, "denied": false, "name": "master-2.scale-ci.example.com", "server_accepted": false},
{"client_accepted": false, "csrs": {}, "denied": false, "name": "master-1.scale-ci.example.com", "server_accepted": false}

https://github.com/openshift/openshift-ansible/pull/9137 fixes this by removing the masters from the hosts we expect to find CSRs for.
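The idea behind the PR 9137 fix, as described above, amounts to not waiting on CSRs from hosts that bootstrapped long ago. A minimal sketch of that host filtering, with made-up host names and a deliberately simplified structure rather than the actual Ansible module code:

```python
# Sketch of the fix: exclude masters from the set of hosts whose CSRs we
# wait on, since master CSRs were approved earlier and may have been
# purged from the API. Host names are illustrative.
def hosts_expecting_csrs(all_nodes, masters):
    master_set = set(masters)
    return [h for h in all_nodes if h not in master_set]

all_nodes = [
    "master-0.scale-ci.example.com",
    "master-1.scale-ci.example.com",
    "app-node-0.scale-ci.example.com",
    "app-node-1.scale-ci.example.com",
]
masters = ["master-0.scale-ci.example.com", "master-1.scale-ci.example.com"]

print(hosts_expecting_csrs(all_nodes, masters))
# ['app-node-0.scale-ci.example.com', 'app-node-1.scale-ci.example.com']
```

With the masters filtered out, the approval task only times out if a genuinely new node fails to submit its client and server CSRs.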
https://github.com/openshift/openshift-ansible/pull/9152 backport to release-3.10
Testing process:
1) Install 3.10 using 3.10.15 openshift-ansible.
2) Delete all CSRs: `oc get csr`, then `oc delete csr csr-1234`, etc., until there are none.
3) Scale up one additional node; this should fail.
4) `oc adm certificate approve` all pending CSRs, then remove them again.
5) Update to a version of openshift-ansible with this fix and scale up an additional node; this should succeed.
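The bulk approval in step 4 is done per CSR name. A hedged sketch that builds the `oc adm certificate approve` command lines from a list of pending names (the names here are made up; in practice they would come from `oc get csr` output):

```python
# Sketch: generate one `oc adm certificate approve <name>` command per
# pending CSR, as step 4 of the testing process requires. Names are
# illustrative placeholders.
def approve_commands(pending_names):
    return [f"oc adm certificate approve {name}" for name in pending_names]

pending = ["csr-1234", "node-csr-abc"]
for cmd in approve_commands(pending):
    print(cmd)
```

Each generated line can then be run against the cluster (or piped through a shell) to clear the Pending backlog before retesting.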
In openshift-ansible-3.10.17-1 and later
I performed a testing procedure similar to what Scott Dodson listed in comment 16.
1) A pre-fix 3.10 cluster was installed.
2) I was able to scale up once and show that the scaleup failed to validate CSRs.
3) I approved the CSRs manually.
4) I pulled sdodson's git branch that contained the proposed fix.
5) I ran an additional scale up operation, which was successful.

We have not had the opportunity to test the RPM of this fix yet, but the process above satisfied me that a valid fix was coming.
I followed the steps in comment #16 and was able to reproduce the issue. After upgrading the openshift-ansible package to 3.10.18 the issue did not happen.

Verified with openshift-ansible-3.10.18-1.git.314.cfe4f91.el7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816