Bug 1597908 - OpenShift on OpenStack fails to approve Pending csrs on scaleup
Summary: OpenShift on OpenStack fails to approve Pending csrs on scaleup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.10.z
Assignee: Scott Dodson
QA Contact: Matt Bruzek
URL:
Whiteboard:
Duplicates: 1597871 1597904
Depends On:
Blocks:
 
Reported: 2018-07-03 21:39 UTC by Matt Bruzek
Modified: 2019-03-15 21:10 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:17:50 UTC
Target Upstream Version:


Attachments
The log file for an 8 node scaleup. (1.89 MB, text/plain)
2018-07-09 19:00 UTC, Matt Bruzek


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:18:17 UTC

Description Matt Bruzek 2018-07-03 21:39:14 UTC
Description of problem:

We have automation to install OpenShift on OpenStack in a repeatable way. The recent 3.10 install completes successfully. On the attempt to scale to 250 nodes, the install gets stuck on the approval step and I see several hundred Pending certificate signing requests (CSRs).

The scaleup operation ran until it reached about 161 nodes and then failed to approve nodes. The log message was:

TASK [Approve bootstrap nodes] *************************************************
task path: /home/cloud-user/openshift-ansible/playbooks/openshift-node/private/join.yml:40

Version-Release number of selected component (if applicable):
$ oc version
oc v3.10.10
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.10
kubernetes v1.10.0+b81c8f8

$ git describe
v3.10.0-rc.0-115-g1d59617

How reproducible: We hit this CSR problem often.


Steps to Reproduce:
1. Install OpenStack
2. Install OpenShift on OpenStack
3. Attempt to scale up to 250 nodes and notice the failure to approve nodes. 

Actual results:

The openshift-ansible playbook openshift-ansible/playbooks/openshift-node/scaleup.yml fails with the following error:


TASK [Approve bootstrap nodes] *************************************************
task path: /home/cloud-user/openshift-ansible/playbooks/openshift-node/private/join.yml:40
Tuesday 03 July 2018  12:56:29 -0400 (0:00:00.179)       0:08:23.501 **********
fatal: [master-1.scale-ci.example.com]: FAILED! => {"changed": true, "finished": false, "msg": "Timed out accepting certificate signing requests. Failing as requested.

When I checked the cluster, I saw just over 500 CSRs in "Pending" state.

root@master-1: /home/openshift # oc get csr --all-namespaces | grep Pending | wc -l                                                       
507 

Expected results:
I expected the scale up to succeed.

Additional info:

I will attach the logs in further comments.

Comment 2 Xiaoli Tian 2018-07-05 01:39:39 UTC
*** Bug 1597904 has been marked as a duplicate of this bug. ***

Comment 3 Scott Dodson 2018-07-05 12:49:47 UTC
*** Bug 1597871 has been marked as a duplicate of this bug. ***

Comment 4 Gan Huang 2018-07-06 00:57:05 UTC
Scott, is this the fix for the bug? https://github.com/openshift/openshift-ansible/pull/9079

As stated before I was unable to reproduce the issue with openshift-ansible-3.10.12-1.git.264.fa89aae.el7.noarch.rpm. I'm still not sure what the issue was.

Comment 6 Matt Bruzek 2018-07-07 03:08:23 UTC
I encountered this problem again on the build dated 06-29. There were over 400 certificate signing requests (CSRs) in Pending state when the scale up to 250 nodes was run.

I was able to catch this before the playbook timed out and manually approve the Pending requests, but I believe the playbook would have failed had I let it run.

$ git describe
v3.10.0-rc.0-129-g61563cb
$ git status
# On branch release-3.10

This is still a problem on 3.10 for scaleup.

Comment 7 Gan Huang 2018-07-09 03:31:04 UTC
I retested on both openshift-ansible-3.10.10-1.git.248.0bb6b58.el7.noarch.rpm and openshift-ansible-3.10.14-1.git.273.a64b86b.el7.noarch.rpm. Unfortunately, I still could not reproduce it.

I'm assuming the bug can only be reproduced at large scale.

Matt, I would very much appreciate it if you could help verify it.

(I suggest testing with openshift-ansible-3.10.15-1 or later, which has the fix from comment 4.)

Comment 10 Matt Bruzek 2018-07-09 18:57:50 UTC
Created attachment 1457563 [details]
The log file for an 8 node scaleup.

I scaled from 242 to 249 (only 8 nodes) and saw the CSR problem again today. I watched the 'oc get csr' output and saw all CSRs go from Pending to Approved, but even this small scale up failed.

Comment 12 Matt Bruzek 2018-07-09 19:00:54 UTC
Comment on attachment 1457564 [details]
The log file for an 8 node scaleup.

This is the log file from 242 to 249 where we saw the csr problem.

Comment 13 Zvonko Kosic 2018-07-11 11:27:24 UTC
I am seeing this behavior with a scaleup of only 1 node.
The CSRs are getting approved but are not issued.

The change was introduced somewhere at version 3.6 ... 

The problem in openshift-3.6 was that the certificates
controller wasn't running. 

https://github.com/openshift/origin/issues/13500

Here is a link where Clayton explains the change:

https://github.com/openshift/openshift-ansible/issues/4685

I am seeing similar behavior in 3.10 with csr approved but not issued


node-csr-aBq1AF-GQKHD0uFLKU0ASdT4VnwUqbmB0IJXmeoV6TI   28m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved

where a valid csr looks like 

csr-gkvck                                              22h       system:node:ip-172-31-5-189.us-west-2.compute.internal    Approved,Issued

Maybe check the master controllers log.

Instead of repeatedly running 'oc get csr', you can use 'oc observe csr'.
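To make the "approved but not issued" check less manual, here is a minimal Python sketch that scans `oc get csr` table output and lists CSRs whose condition contains Approved but not Issued. The function name and the header row in the sample are assumptions for illustration; the two data rows are the ones quoted in this comment.

```python
def approved_not_issued(table_text):
    """Return names of CSRs whose CONDITION has Approved but not Issued.

    Assumes the default table layout of `oc get csr`, where the first
    column is the name and the last column is the condition.
    """
    stuck = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        name, condition = fields[0], fields[-1]
        conditions = condition.split(",")
        if "Approved" in conditions and "Issued" not in conditions:
            stuck.append(name)
    return stuck


SAMPLE = """\
NAME                                                   AGE       REQUESTOR                                                 CONDITION
node-csr-aBq1AF-GQKHD0uFLKU0ASdT4VnwUqbmB0IJXmeoV6TI   28m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved
csr-gkvck                                              22h       system:node:ip-172-31-5-189.us-west-2.compute.internal    Approved,Issued
"""

print(approved_not_issued(SAMPLE))
```

Run against the sample, this reports only the node-csr-... entry, which matches the stuck request shown in this comment.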

Comment 14 Scott Dodson 2018-07-11 12:55:48 UTC
In this particular case we're pretty sure what happens is that the masters are included in the list of nodes that we expect to see a pending CSR for. However, since the masters were created hours earlier, their CSRs had already been approved and purged from the API. We track two CSRs per host: a client-side and a server-side CSR.

The logs in the private comments on this bug contain the following, which shows that the module believes none of the masters have been approved, even though they were approved previously.

{"client_accepted": false, "csrs": {}, "denied": false, "name": "master-0.scale-ci.example.com", "server_accepted": false},
{"client_accepted": false, "csrs": {}, "denied": false, "name": "master-2.scale-ci.example.com", "server_accepted": false},
{"client_accepted": false, "csrs": {}, "denied": false, "name": "master-1.scale-ci.example.com", "server_accepted": false}

https://github.com/openshift/openshift-ansible/pull/9137 fixes this by removing the masters from the hosts we expect to find CSRs for.
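The failure mode can be illustrated with a small Python sketch. This is not the module's actual code: `all_accepted`, the new node's hostname, and the state dictionary are hypothetical, and only the master entries mirror the log lines quoted above.

```python
def all_accepted(expected_hosts, csr_state):
    """True once every expected host has both its client and server CSR accepted."""
    return all(
        csr_state.get(host, {}).get("client_accepted")
        and csr_state.get(host, {}).get("server_accepted")
        for host in expected_hosts
    )


MASTERS = [
    "master-0.scale-ci.example.com",
    "master-1.scale-ci.example.com",
    "master-2.scale-ci.example.com",
]
# Hypothetical hostname for a freshly scaled-up node.
NEW_NODES = ["app-node-0.scale-ci.example.com"]

# Master CSRs were purged from the API, so they look "not accepted".
state = {m: {"client_accepted": False, "server_accepted": False} for m in MASTERS}
# The new node's CSRs were found and approved in this run.
state.update({n: {"client_accepted": True, "server_accepted": True} for n in NEW_NODES})

expected_before_fix = MASTERS + NEW_NODES
expected_after_fix = [h for h in expected_before_fix if h not in MASTERS]

print(all_accepted(expected_before_fix, state))  # False: waits on masters until timeout
print(all_accepted(expected_after_fix, state))   # True: only new nodes are expected
```

With the masters in the expected list the completion check can never pass, so the task runs until it times out; dropping them lets the check depend only on the nodes being added.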

Comment 15 Scott Dodson 2018-07-11 13:08:02 UTC
https://github.com/openshift/openshift-ansible/pull/9152 backport to release-3.10

Comment 16 Scott Dodson 2018-07-11 13:12:28 UTC
Testing process

1) Install 3.10 using 3.10.15 openshift-ansible
2) Delete all CSRs
oc get csr
oc delete csr csr-1234  
etc until there are none
3) Scale up one additional node, this should fail
4) Approve all pending CSRs with `oc adm certificate approve`, then remove them again
5) Update to a version of openshift-ansible with this fix, scale up an additional node, this should succeed
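Steps 2 and 4 are tedious by hand on a large cluster. The Python sketch below builds the `oc` commands from the JSON printed by `oc get csr -o json`, assuming that a CSR with no status conditions is still pending. The helper names and the sample data are hypothetical, and the commands are only constructed here, not executed.

```python
def all_csr_names(csr_list):
    """Names of every CSR in a parsed `oc get csr -o json` document."""
    return [item["metadata"]["name"] for item in csr_list.get("items", [])]


def pending_csr_names(csr_list):
    """Names of CSRs with no status conditions (assumed to mean Pending)."""
    names = []
    for item in csr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions") or []
        if not conditions:
            names.append(item["metadata"]["name"])
    return names


def approve_commands(names):
    """argv lists for step 4: approve each pending CSR."""
    return [["oc", "adm", "certificate", "approve", name] for name in names]


def delete_commands(names):
    """argv lists for step 2 and the cleanup in step 4: delete each CSR."""
    return [["oc", "delete", "csr", name] for name in names]


# Hypothetical sample shaped like `oc get csr -o json` output.
SAMPLE = {
    "items": [
        {"metadata": {"name": "csr-pending"}, "status": {}},
        {
            "metadata": {"name": "csr-approved"},
            "status": {"conditions": [{"type": "Approved"}]},
        },
    ]
}

for argv in approve_commands(pending_csr_names(SAMPLE)):
    print(" ".join(argv))
for argv in delete_commands(all_csr_names(SAMPLE)):
    print(" ".join(argv))
```

On a real cluster, the parsed JSON would come from something like `json.loads(subprocess.check_output(["oc", "get", "csr", "-o", "json"]))`, and each argv list could be passed to `subprocess.run`.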

Comment 17 Scott Dodson 2018-07-16 13:26:25 UTC
Fixed in openshift-ansible-3.10.17-1 and later.

Comment 18 Matt Bruzek 2018-07-19 18:22:10 UTC
I performed a testing procedure similar to the one Scott Dodson listed in comment 16.

1) A pre-fixed 3.10 cluster was installed.
2) Ran one scale up and confirmed that it failed to validate CSRs.
3) Approved the CSRs manually.
4) Pulled sdodson's git branch that contained the proposed fix.
5) Ran an additional scale up operation which was successful.

We have not had the opportunity to test the RPM of this fix at this time but the process above satisfied me that a valid fix was coming.

Comment 20 Vikas Laad 2018-07-19 19:26:16 UTC
I followed the steps in comment #16 and was able to reproduce the issue. After upgrading the openshift-ansible package to 3.10.18, the issue did not happen.

Verified with openshift-ansible-3.10.18-1.git.314.cfe4f91.el7

Comment 22 errata-xmlrpc 2018-07-30 19:17:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

