Bug 1882568

Summary: kube-apiserver crash/restarts creating large numbers of projects on Azure cluster - api unreachable
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Documentation Assignee: Shubha Narayanan <snarayan>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Gahagan <mgahagan>
Severity: high Docs Contact: Latha S <lmurthy>
Priority: high    
Version: 4.6 CC: ahoffer, aos-bugs, jokerman, lmurthy, mfojtik, sbatsche, tsze, xxia
Target Milestone: --- Keywords: Performance, Reopened
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-09 06:46:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Mike Fiedler 2020-09-25 01:58:22 UTC
Description of problem:

While creating 5000 empty projects on Azure, the kube-apiserver container starts continuously exiting and restarting. The API is completely unreachable (not even intermittently reachable), so oc adm must-gather is not possible. I will add a private comment with the location of the master journals and a tarball of all master pod logs.

The cluster has 3 masters and 3 compute nodes of size Standard_D4s_v3 (4 vCPU/16 GiB memory). The same test on equivalently sized AWS instances (m4.xlarge) succeeds.


Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-09-22-073212


How reproducible: Mostly.  1 successful run creating 5K projects, 3 failed.


Steps to Reproduce:
1. Create Azure cluster with Standard_D4s_v3 instances.  3 masters, 3 computes
2. for i in {0..4999}; do echo $i; oc new-project --skip-config-write test$i; done
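
The loop in step 2 keeps issuing requests after the API stops responding. A minimal sketch of a variant that stops at the first failed creation and reports how far it got is shown here. The names OC_CMD and create_projects are hypothetical and not part of this report; only the oc new-project invocation is taken from step 2.

```shell
# Sketch only: OC_CMD and create_projects are hypothetical names, not from the report.
# OC_CMD defaults to the real `oc` client and can point at a stub for dry runs.
OC_CMD="${OC_CMD:-oc}"

# Create <count> projects named <prefix>0 .. <prefix>(count-1).
# Stops at the first failed creation and returns non-zero.
# The function body runs in a subshell so its variables stay local.
create_projects() (
  count="$1"
  prefix="${2:-test}"
  i=0
  while [ "$i" -lt "$count" ]; do
    if ! "$OC_CMD" new-project --skip-config-write "${prefix}${i}" >/dev/null 2>&1; then
      echo "failed at ${prefix}${i} after ${i} successful creations"
      return 1
    fi
    i=$((i + 1))
  done
  echo "created ${count} projects"
)
```

Running create_projects 5000 would correspond to the 5K-project run described above. The point at which it stops gives a rough indication of how many projects were created before the API became unreachable.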


Actual results:

The cluster becomes unresponsive. No oc commands work. The kube-apiserver container on all masters continuously exits and restarts.

Comment 2 Stefan Schimanski 2020-09-25 09:12:55 UTC
The API servers time out creating RBAC objects. This is very probably due to slow etcd on Azure.

Moving to etcd for them to look at metrics.

Comment 3 Sam Batschelet 2020-09-25 14:08:55 UTC
IPI for Azure has 8 vCPUs and specific build requirements, outlined below [1]. If the test is conducted with hardware below these thresholds, then I would say the perf failure is expected. We should be testing based on what we ship.

Minimum Azure Requirements Summary:

- at least Standard_D8s_v3 (8 vCPU, 32GiB memory)
- 1 TiB Premium SSD (P30)
- host caching to ReadOnly
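
As a rough illustration, the minimums listed above can be written as a pre-flight check. This is a sketch under stated assumptions: meets_azure_minimums is a hypothetical helper, the caller supplies the values by hand (nothing is queried from Azure), and 1 TiB is treated as 1024 GiB. The Premium SSD tier itself is not checked.

```shell
# Sketch only: meets_azure_minimums is a hypothetical helper, not an existing tool.
# Arguments: <vCPUs> <memory GiB> <OS disk GiB> <host caching mode>
# Thresholds come from the summary above: 8 vCPU, 32 GiB, 1 TiB (1024 GiB), ReadOnly.
# Prints one line per unmet requirement and returns non-zero if any is unmet.
# The function body runs in a subshell so its variables stay local.
meets_azure_minimums() (
  vcpus="$1"; mem_gib="$2"; disk_gib="$3"; caching="$4"
  rc=0
  [ "$vcpus" -ge 8 ] || { echo "vCPU: $vcpus < 8"; rc=1; }
  [ "$mem_gib" -ge 32 ] || { echo "memory: $mem_gib GiB < 32 GiB"; rc=1; }
  [ "$disk_gib" -ge 1024 ] || { echo "OS disk: $disk_gib GiB < 1024 GiB"; rc=1; }
  [ "$caching" = "ReadOnly" ] || { echo "host caching: $caching is not ReadOnly"; rc=1; }
  return "$rc"
)
```

For example, meets_azure_minimums 4 16 1024 ReadOnly models the Standard_D4s_v3 size from the description (4 vCPU, 16 GiB) and flags the vCPU and memory shortfalls. The disk and caching values in that call are placeholders, since the report does not state them.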

Closing as not a bug. If you can rerun the test with the minimum hardware requirements and still hit these same failure cases, we can explore them at that time.

[1]https://docs.google.com/document/d/1yPpakMC1OSOWeeM4m_bHDuLECLPXrpq5ow2C9HEcs1A/edit#heading=h.lvwt62wax7yu

Comment 4 Mike Fiedler 2020-09-25 14:30:24 UTC
If those are the minimum requirements, they need to be in the documentation.

Comment 5 Mike Fiedler 2020-09-25 14:31:11 UTC
Or better yet, the default installation configuration for IPI on Azure.

Comment 6 Sam Batschelet 2020-09-25 14:32:28 UTC
Sounds great, thank you!

Comment 8 Shubha Narayanan 2021-08-26 14:50:30 UTC
Mike,
I am planning to update these requirements in docs here: https://docs.openshift.com/container-platform/4.8/installing/installing_azure/installing-azure-account.html#installation-azure-limits_installing-azure-account 
We already have the vCPU requirements mentioned here. However, we need to add:
- 1 TiB Premium SSD (P30)
- host caching to ReadOnly

Based on my analysis, these seem to apply to the OS Disk component. Can you confirm which component they should be mapped to?

Comment 9 Mike Fiedler 2021-08-27 13:51:37 UTC
Confirming that this is relevant to OS Disk.

Comment 10 Shubha Narayanan 2021-08-31 12:58:41 UTC
PR - https://github.com/openshift/openshift-docs/pull/35887

Comment 12 Mike Gahagan 2021-08-31 18:44:02 UTC
Confirmed we now have minimum disk size and performance requirements in the docs.

Comment 14 Shubha Narayanan 2021-09-09 06:46:36 UTC
@sbatsche - Please respond to the above query.

Since this bug is specific to Azure, and any similar fixes would need to be handled as a separate scenario, I am marking this bug closed.

Comment 15 Red Hat Bugzilla 2023-09-15 00:48:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days