Bug 1366740 - scalability: serviceaccounts not created immediately in HA cluster when creating projects quickly
Keywords:
Status: CLOSED DUPLICATE of bug 1364431
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jordan Liggitt
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-12 16:39 UTC by Mike Fiedler
Modified: 2016-10-30 22:54 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-12 18:24:38 UTC
Target Upstream Version:
Embargoed:


Attachments
master-config.yaml (4.04 KB, text/plain), 2016-08-12 18:18 UTC, Mike Fiedler

Description Mike Fiedler 2016-08-12 16:39:31 UTC
Description of problem:

Please re-assign if REST API isn't the correct component.

For horizontal scale testing we have a cluster loader script which creates a large number of projects, populates them with secrets, build configs, DCs, RCs, etc., and runs deployments in each project. The script has been used successfully since 3.2.

A recent change (since 3.3.0.11) has changed the behavior of project creation. When projects are created consecutively (not simultaneously), the serviceaccounts are not created immediately after roughly the 24th project. There is a long delay (minutes) before the serviceaccounts are created, which causes deployments to fail when trying to populate the projects.

This is a blocker for horizontal scalability testing in an environment with 300 nodes and 1000 projects.   In the past, this issue was not encountered - all projects could deploy DCs immediately.

This only seems to occur in HA (multi-master) environments.


Version-Release number of selected component (if applicable): 

3.3.0.18


How reproducible:

Always

Steps to Reproduce:
0. Install an HA cluster.   Mine has 3 master/etcd, 1 master load balancer, 2 registry/router and 5 nodes
1.  for i in {1..50}; do oc new-project project$i; done
2.  for i in {1..50}; do echo project$i; oc get sa -n project$i --no-headers| wc -l; done
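Rather than re-running the loop in step 2 by hand, the check can be made deterministic with a small polling helper. This is a sketch, not part of the original report; the `wait_for` name and the timeout values are my own, and the commented usage assumes `oc` is logged in to the cluster:

```shell
#!/bin/sh
# wait_for TIMEOUT CMD...: poll CMD once per second until it succeeds,
# giving up (return status 1) after roughly TIMEOUT seconds.
wait_for() {
  timeout=$1; shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Hypothetical usage: block until the deployer SA exists in each project
# before attempting a deployment (300 s timeout per project).
# for i in $(seq 1 50); do
#   wait_for 300 oc get sa deployer -n "project$i" || echo "project$i: no deployer SA"
# done
```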

Actual results:

After the 23rd project (or so), the projects will not have any serviceaccounts. Wait a while (minutes) and run the oc get sa loop again, and more projects will have their serviceaccounts.

Attempts to run a deployment result in events similar to this appearing in the namespace:

DeploymentConfig             Warning   FailedRetry   {deployments-controller }   deploymentconfig0-1: About to stop retrying deploymentconfig0-1: couldn't create deployer pod for cncf13/deploymentconfig0-1: pods "deploymentconfig0-1-deploy" is forbidden: service account cncf13/deployer was not found, retry after the service account is created

Expected results:

Serviceaccounts are created immediately, and DC deployments are operational immediately after project creation.

Comment 1 Mike Fiedler 2016-08-12 16:51:58 UTC
Sleeping 1 second between project creations does not help; I still see projects with no SAs. Sleeping 10 seconds does seem to help. I have not bisected further.

Comment 3 Jordan Liggitt 2016-08-12 16:57:54 UTC
Likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1364431

Can you include the contents of the master-config.yaml?

Comment 4 Mike Fiedler 2016-08-12 18:18:37 UTC
Created attachment 1190534 [details]
master-config.yaml

Config attached, let me know if there is a tune-able.

Comment 5 Jordan Liggitt 2016-08-12 18:24:38 UTC
Yeah, dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1364431

The client config overrides had a typo in them: instead of setting "qps" to 200 and 300, they set "ops" to 200 and 300. The server ignored the unknown field and defaulted qps to 5.

Until you have an install containing https://github.com/openshift/openshift-ansible/pull/2287, you can work around it by editing the config.

Change:
"ops: 200" to "qps: 200"
"ops: 300" to "qps: 300"

Also check the node config, the same typo exists there.
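For reference, a hypothetical excerpt of what the corrected overrides could look like in master-config.yaml. The surrounding field names below are assumptions based on the 3.3 masterClients schema, not copied from the attached config:

```yaml
masterClients:
  # "ops" was silently ignored as an unknown field, so qps fell back to 5.
  externalKubernetesClientConnectionOverrides:
    qps: 200    # was mistyped as "ops: 200"
  openshiftLoopbackClientConnectionOverrides:
    qps: 300    # was mistyped as "ops: 300"
```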

*** This bug has been marked as a duplicate of bug 1364431 ***

Comment 6 Mike Fiedler 2016-08-12 18:26:44 UTC
Changing ops to qps works around it.

