1882568 – kube-apiserver crash/restarts creating large numbers of projects on Azure cluster - api unreachable

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Documentation
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Shubha Narayanan
QA Contact:	Mike Gahagan
Docs Contact:	Latha S
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-25 01:58 UTC by Mike Fiedler
Modified:	2023-09-15 00:48 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-09-09 06:46:36 UTC
Target Upstream Version:
Embargoed:

Description Mike Fiedler 2020-09-25 01:58:22 UTC

Description of problem:

While creating 5000 empty projects on Azure, the kube-apiserver container starts continuously exiting and restarting. The API is unreachable (not even intermittently), so oc adm must-gather is not possible. I will add a private comment with the location of the master journals and a tarball of all master pod logs.

The cluster has 3 masters and 3 computes of size Standard_D4s_v3 (4 vCPU/16 GiB memory). The same test on equivalently sized AWS instances (m4.xlarge) succeeds.


Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-09-22-073212


How reproducible: Mostly.  1 successful run creating 5K projects, 3 failed.


Steps to Reproduce:
1. Create Azure cluster with Standard_D4s_v3 instances.  3 masters, 3 computes
2. for i in {0..4999}; do echo $i; oc new-project --skip-config-write test$i; done
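
A variant of the loop in step 2 that stops at the first failed creation can make it easier to see how far a run got before the API became unreachable. This is only a sketch: create_projects is a hypothetical helper name, and the only oc invocation it uses is the one from step 2.

```shell
# Sketch only: create_projects is a hypothetical helper, not part of oc.
# It creates COUNT empty projects named PREFIX0..PREFIX(COUNT-1) and stops
# at the first failure, reporting how many creations succeeded before it.
# The body runs in a subshell so its variables do not leak to the caller.
create_projects() (
  count="$1"
  prefix="${2:-test}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # Same command as step 2 of the reproduction steps.
    if ! oc new-project --skip-config-write "${prefix}${i}" >/dev/null; then
      echo "failed at ${prefix}${i} after ${i} successful creations" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "created ${count} projects"
)
```

Calling create_projects 5000 corresponds to the original loop; the exit status is non-zero if any creation fails.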


Actual results:

Cluster becomes unresponsive. No oc commands work. The kube-apiserver container on all masters continuously exits and restarts.

Comment 2 Stefan Schimanski 2020-09-25 09:12:55 UTC

The API servers time out creating RBAC objects. This is very probably due to slow etcd on Azure.

Moving to etcd for them to look at metrics.

Comment 3 Sam Batschelet 2020-09-25 14:08:55 UTC

IPI for Azure has 8 vCPUs and has specific build requirements, outlined below [1]. If the test is conducted with hardware below these thresholds, then I would say a perf failure is expected. We should be testing based on what we ship.

Minimum Azure Requirements Summary:

- at least Standard_D8s_v3 (8 vCPU, 32GiB memory)
- 1 TiB Premium SSD (P30)
- host caching to ReadOnly

Closing as not a bug. If you can rerun the test with the minimum hardware requirements and still hit these same failure cases, we can explore at that time.

[1]https://docs.google.com/document/d/1yPpakMC1OSOWeeM4m_bHDuLECLPXrpq5ow2C9HEcs1A/edit#heading=h.lvwt62wax7yu
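
The thresholds in the summary above can be expressed as a simple check. The sketch below is illustrative only: meets_azure_minimum is a hypothetical helper, the caller supplies the values (it does not query Azure), and it does not verify that the disk is a Premium SSD.

```shell
# Hypothetical helper, not from the bug report or any Red Hat tool.
# Arguments: vCPU count, memory in GiB, OS disk size in GiB, host caching mode.
# Returns 0 when all values meet the minimums listed in comment 3,
# otherwise prints each shortfall to stderr and returns 1.
meets_azure_minimum() (
  rc=0
  [ "$1" -ge 8 ]    || { echo "vCPU count $1 is below 8" >&2; rc=1; }
  [ "$2" -ge 32 ]   || { echo "memory $2 GiB is below 32 GiB" >&2; rc=1; }
  # 1 TiB = 1024 GiB
  [ "$3" -ge 1024 ] || { echo "OS disk $3 GiB is below 1024 GiB" >&2; rc=1; }
  [ "$4" = "ReadOnly" ] || { echo "host caching is $4, expected ReadOnly" >&2; rc=1; }
  return "$rc"
)
```

For example, a Standard_D4s_v3 master (4 vCPU, 16 GiB) fails the vCPU and memory checks regardless of its disk settings.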

Comment 4 Mike Fiedler 2020-09-25 14:30:24 UTC

If those are the minimum requirements, they need to be in the documentation.

Comment 5 Mike Fiedler 2020-09-25 14:31:11 UTC

Or better yet, the default installation configuration for IPI on Azure.

Comment 6 Sam Batschelet 2020-09-25 14:32:28 UTC

Sounds great, thank you!

Comment 8 Shubha Narayanan 2021-08-26 14:50:30 UTC

Mike,
I am planning to update these requirements in docs here: https://docs.openshift.com/container-platform/4.8/installing/installing_azure/installing-azure-account.html#installation-azure-limits_installing-azure-account 
We already have the vCPU requirements mentioned here. However, we still need to add:
- 1 TiB Premium SSD (P30)
- host caching to ReadOnly

Per my analysis, these seem to apply to the OS Disk component. Can you confirm which component they should be mapped to?

Comment 9 Mike Fiedler 2021-08-27 13:51:37 UTC

Confirming that this is relevant to OS Disk.

Comment 10 Shubha Narayanan 2021-08-31 12:58:41 UTC

PR - https://github.com/openshift/openshift-docs/pull/35887

Comment 12 Mike Gahagan 2021-08-31 18:44:02 UTC

Confirmed we now have minimum disk size and performance requirements in the docs.

Comment 14 Shubha Narayanan 2021-09-09 06:46:36 UTC

@sbatsche - Please respond to the above query.

Since this bug is specific to Azure and any similar fixes would need to be handled as a separate case, I am marking this bug closed.

Comment 15 Red Hat Bugzilla 2023-09-15 00:48:47 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
