Bug 1882568 - kube-apiserver crash/restarts creating large numbers of projects on Azure cluster - api unreachable
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.7.0
Assignee: Vikram Goyal
QA Contact: Xiaoli Tian
Vikram Goyal
Depends On:
Reported: 2020-09-25 01:58 UTC by Mike Fiedler
Modified: 2020-09-26 07:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-09-25 14:08:55 UTC
Target Upstream Version:


Description Mike Fiedler 2020-09-25 01:58:22 UTC
Description of problem:

While creating 5000 empty projects on Azure, the kube-apiserver container starts continuously exiting and restarting. The API is unreachable (not even intermittently), so oc adm must-gather is not possible. I will add a private comment with the location of the master journals and a tarball of all master pod logs.

The cluster has 3 masters and 3 compute nodes of size Standard_D4s_v3 (4 vCPU/16 GiB memory). The same test on equivalently sized AWS instances (m4.xlarge) succeeds.

Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-09-22-073212

How reproducible: Mostly.  1 successful run creating 5K projects, 3 failed.

Steps to Reproduce:
1. Create Azure cluster with Standard_D4s_v3 instances.  3 masters, 3 computes
2. for i in {0..4999}; do echo $i; oc new-project --skip-config-write test$i; done
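
A minimal sketch of the same loop that stops at the first failed project creation, which makes it easier to see roughly where the API became unreachable. This is not part of the original report: the function name create_projects and the OC override variable are hypothetical, added only for illustration; the oc new-project invocation is the one from step 2.

```shell
#!/usr/bin/env bash
# Create projects test<first>..test<last>, stopping at the first failure.
# OC is a hypothetical override for the client binary (defaults to oc).
create_projects() {
  local first=$1 last=$2 i
  for ((i = first; i <= last; i++)); do
    if ! "${OC:-oc}" new-project --skip-config-write "test$i" >/dev/null; then
      # Report the first project that could not be created and give up.
      echo "failed at test$i" >&2
      return 1
    fi
  done
  echo "created $((last - first + 1)) projects"
}
```

To mirror the reproduction step, call it as: create_projects 0 4999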

Actual results:

Cluster becomes unresponsive.   No oc commands work.  kube-apiserver container on all masters continuously exits and restarts.

Comment 2 Stefan Schimanski 2020-09-25 09:12:55 UTC
The API servers time out creating RBAC objects. This is very probably due to slow etcd on Azure.

Moving to etcd for them to look at metrics.

Comment 3 Sam Batschelet 2020-09-25 14:08:55 UTC
IPI for Azure has 8 vCPUs and has specific build requirements, outlined below[1]. If the test is conducted on hardware below these thresholds, then I would say a perf failure is expected. We should be testing based on what we ship.

Minimum Azure Requirements Summary:

- at least Standard_D8s_v3 (8 vCPU, 32GiB memory)
- 1 TiB Premium SSD (P30)
- host caching to ReadOnly

Closing as not a bug. If you can rerun the test with the minimum hardware requirements and still hit these same failure cases, we can explore it at that time.
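
The thresholds listed in this comment can be sketched as a small pre-flight check before rerunning the test. This is an illustrative helper, not an official tool: the function name meets_azure_minimum and its argument order are hypothetical, and the caller is assumed to supply the values (for example, read from the Azure portal). The 1024 GiB figure corresponds to the 1 TiB P30 disk named above.

```shell
#!/usr/bin/env bash
# meets_azure_minimum VCPUS MEM_GIB DISK_GIB CACHING
# Returns 0 only if every value meets the minimums quoted in this comment:
# 8 vCPU, 32 GiB memory, 1 TiB (1024 GiB) disk, host caching ReadOnly.
meets_azure_minimum() {
  local vcpus=$1 mem_gib=$2 disk_gib=$3 caching=$4
  [ "$vcpus" -ge 8 ] && [ "$mem_gib" -ge 32 ] &&
    [ "$disk_gib" -ge 1024 ] && [ "$caching" = "ReadOnly" ]
}
```

For the Standard_D4s_v3 masters in the original report (4 vCPU, 16 GiB), the check fails regardless of disk settings; the disk size and caching values here are placeholders, since the report does not state them: meets_azure_minimum 4 16 1024 ReadOnly || echo "below minimum"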


Comment 4 Mike Fiedler 2020-09-25 14:30:24 UTC
If those are the minimum requirements, they need to be in the documentation.

Comment 5 Mike Fiedler 2020-09-25 14:31:11 UTC
Or better yet, the default installation configuration for IPI on Azure.

Comment 6 Sam Batschelet 2020-09-25 14:32:28 UTC
Sounds great, thank you!
