Bug 1366740

Summary: scalability: serviceaccounts not created immediately in HA cluster when creating projects quickly
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Master
Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED DUPLICATE
QA Contact: weiwei jiang <wjiang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: aos-bugs, jokerman, mifiedle, mmccomas, tstclair, wsun
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-12 18:24:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
master-config.yaml (no flags)

Description Mike Fiedler 2016-08-12 16:39:31 UTC
Description of problem:

Please re-assign if REST API isn't the correct component.

For horizontal scale testing we have a cluster loader script which creates a large number of projects, populates them with secrets, build configs, DCs, RCs, etc., and runs deployments in each project.  The script has been used successfully since 3.2.

A recent change (since 3.3.0.11) has altered the behavior of project creation.  When projects are created consecutively (not simultaneously), the serviceaccounts are no longer created immediately once roughly the 24th project is created.  There is a long delay (minutes) before the serviceaccounts appear, which causes deployments to fail when trying to populate the projects.

This is a blocker for horizontal scalability testing in an environment with 300 nodes and 1000 projects.   In the past, this issue was not encountered - all projects could deploy DCs immediately.

This only seems to occur in HA (multi-master) environments.


Version-Release number of selected component (if applicable): 

3.3.0.18


How reproducible:

Always

Steps to Reproduce:
0. Install an HA cluster.  Mine has 3 master/etcd hosts, 1 master load balancer, 2 registry/router hosts, and 5 nodes.
1.  for i in {1..50}; do oc new-project project$i; done
2.  for i in {1..50}; do echo project$i; oc get sa -n project$i --no-headers| wc -l; done

Actual results:

After the 23rd project (or so), the projects will not have any serviceaccounts.  Wait for a while (minutes) and then run the oc get sa loop again, and more of the projects will have their serviceaccounts.

Attempts to run a deployment result in events similar to this popping up for the namespace:

DeploymentConfig             Warning   FailedRetry   {deployments-controller }   deploymentconfig0-1: About to stop retrying deploymentconfig0-1: couldn't create deployer pod for cncf13/deploymentconfig0-1: pods "deploymentconfig0-1-deploy" is forbidden: service account cncf13/deployer was not found, retry after the service account is created

Expected results:

Serviceaccounts are created immediately, and DC deployments are operational immediately after project creation.

Comment 1 Mike Fiedler 2016-08-12 16:51:58 UTC
Sleeping 1 second between project creations does not help; I still see projects with no SAs.  Sleeping 10 seconds does seem to help.  I have not bisected.

Comment 3 Jordan Liggitt 2016-08-12 16:57:54 UTC
Likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1364431

Can you include the contents of the master-config.yaml?

Comment 4 Mike Fiedler 2016-08-12 18:18:37 UTC
Created attachment 1190534 [details]
master-config.yaml

Config attached, let me know if there is a tune-able.

Comment 5 Jordan Liggitt 2016-08-12 18:24:38 UTC
Yeah, dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1364431

The client config overrides had a typo in them: instead of setting "qps" to 200 and 300, they set "ops" to 200/300.  The server ignored the unknown field and defaulted qps to 5.
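For reference, the broken stanza in master-config.yaml looks roughly like this (the exact override section names are from a typical 3.3 install and may differ in your config):

```yaml
# Broken: "ops" is not a recognized field, so the server silently
# ignores it and falls back to the default client QPS of 5.
masterClients:
  externalKubernetesClientConnectionOverrides:
    ops: 200        # should be: qps: 200
  openshiftLoopbackClientConnectionOverrides:
    ops: 300        # should be: qps: 300
```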

Until you have an install containing https://github.com/openshift/openshift-ansible/pull/2287, you can work around this by editing the config.

Change:
"ops: 200" to "qps: 200"
"ops: 300" to "qps: 300"

Also check the node config, the same typo exists there.
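A quick way to apply both edits is a sed one-liner; this is a sketch assuming the standard config paths from a 3.3 install (/etc/origin/...), so adjust them to your layout, and restart the masters/nodes after editing:

```shell
# Replace the misspelled "ops:" key with "qps:" in the master config,
# keeping a .bak backup of the original file.
sed -i.bak 's/\bops:/qps:/g' /etc/origin/master/master-config.yaml

# The node config carries the same typo; fix it on each node too.
sed -i.bak 's/\bops:/qps:/g' /etc/origin/node/node-config.yaml
```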

*** This bug has been marked as a duplicate of bug 1364431 ***

Comment 6 Mike Fiedler 2016-08-12 18:26:44 UTC
Changing ops to qps works around it.