Bug 1597906

Summary: OpenShift on Openstack - pending csrs on scaleup
Product: OpenShift Container Platform
Reporter: Matt Bruzek <mbruzek>
Component: Unknown
Assignee: Eric Paris <eparis>
Status: CLOSED DUPLICATE
QA Contact: Johnny Liu <jialiu>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, jokerman, mmccomas, xtian
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-05 01:36:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Matt Bruzek 2018-07-03 21:37:45 UTC
Description of problem:

We have automation that installs OpenShift on OpenStack in a repeatable way. The recent 3.10 install completes successfully. When we attempt to scale up to 250 nodes, the install gets stuck on the approval step, and I see several hundred certificate signing requests (CSRs) in the Pending state.

The scale-up operation got to about 161 nodes and then failed to approve nodes. The log message was:

TASK [Approve bootstrap nodes] *************************************************
task path: /home/cloud-user/openshift-ansible/playbooks/openshift-node/private/join.yml:40

Version-Release number of selected component (if applicable):
$ oc version
oc v3.10.10
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.10
kubernetes v1.10.0+b81c8f8

$ git describe
v3.10.0-rc.0-115-g1d59617

How reproducible: We hit this CSR problem often.


Steps to Reproduce:
1. Install OpenStack
2. Install OpenShift on OpenStack
3. Attempt to scale up to 250 nodes and notice the failure to approve nodes. 

Actual results:

The openshift-ansible playbook openshift-ansible/playbooks/openshift-node/scaleup.yml fails with the following error:


TASK [Approve bootstrap nodes] *************************************************
task path: /home/cloud-user/openshift-ansible/playbooks/openshift-node/private/join.yml:40
Tuesday 03 July 2018  12:56:29 -0400 (0:00:00.179)       0:08:23.501 **********
fatal: [master-1.scale-ci.example.com]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.

When I checked the cluster, I saw just over 500 CSRs in the "Pending" state.

root@master-1: /home/openshift # oc get csr --all-namespaces | grep Pending | wc -l
507
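
The count above comes from grepping the table output. For triage, the same number can be taken from the JSON form of the CSR list, which does not depend on column layout. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes that a CSR with neither an Approved nor a Denied condition in its status is the kind that `oc get csr` displays as Pending.

```python
import json
import subprocess
from typing import Any, Dict, List


def pending_csr_names(csr_list: Dict[str, Any]) -> List[str]:
    """Return the names of CSRs that carry no Approved/Denied condition.

    `csr_list` is the parsed output of `oc get csr -o json`.
    Assumption: such undecided CSRs are the ones shown as Pending.
    """
    pending = []
    for item in csr_list.get("items", []):
        # `status` or `conditions` may be absent on an undecided CSR.
        conditions = (item.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(item["metadata"]["name"])
    return pending


def fetch_csr_list() -> Dict[str, Any]:
    """Fetch all CSRs as JSON.

    Assumes `oc` is on PATH and already logged in with rights to list CSRs.
    """
    result = subprocess.run(
        ["oc", "get", "csr", "-o", "json"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return json.loads(result.stdout)
```

On a master, `len(pending_csr_names(fetch_csr_list()))` should give a figure comparable to the `grep Pending | wc -l` count.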

Expected results:
I expected the scale up to succeed.
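
Until the root cause is understood, one possible stopgap is to approve the pending CSRs by hand with `oc adm certificate approve`. This is a workaround sketch only, not a fix and not what the playbook does. The helper name and the batch size of 50 are assumptions chosen for illustration, and the CSR names are whatever the cluster reports as Pending.

```python
from typing import Iterable, List


def approve_commands(names: Iterable[str], batch_size: int = 50) -> List[List[str]]:
    """Build `oc adm certificate approve` argument lists, one per batch.

    Batching keeps each command line short when hundreds of CSRs are pending.
    The default batch size is arbitrary (an assumption, not a tested limit).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    names = list(names)
    return [
        ["oc", "adm", "certificate", "approve"] + names[i:i + batch_size]
        for i in range(0, len(names), batch_size)
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a master. Approving CSRs blindly skips any validation of the requesting node, so this is only reasonable on a test cluster such as the scale lab described here.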

Additional info:

I will attach the logs in further comments.

Comment 1 Xiaoli Tian 2018-07-05 01:36:44 UTC

*** This bug has been marked as a duplicate of bug 1597904 ***