Bug 2070277
| Summary: | cluster-autoscaler-default will fail when automated etcd defrag is running on large scale OpenShift Container Platform 4 - Cluster | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael McCune <mimccune> |
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
| Cloud Compute sub component: | Cluster Autoscaler | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | aos-bugs, ematysek, mimccune, openshift-bugzilla-robot, zhsun |
| Version: | 4.9 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.9.z | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 2069095 | Environment: | |
| Last Closed: | 2022-04-20 14:49:49 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2069095 | ||
| Bug Blocks: | |||
|
Comment 6
Eric Matysek
2022-04-13 18:51:51 UTC
Verified
clusterversion: 4.9.0-0.nightly-2022-04-14-125839
1. Set up OpenShift Container Platform on AWS
2. Enable the autoscaler with a ClusterAutoscaler resource
3. Install Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operator
4. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
5. About 2 hours later, follow https://docs.openshift.com/container-platform/4.8/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks to trigger `etcd defrag`. Run it for each `etcd` member, with the leader last.
$ oc rsh -n openshift-etcd etcd-zhsunazure-2kw7c-master-0
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl endpoint status -w table --cluster
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.6:2379 | 50928b3f1313232 | 3.5.0 | 1.7 GB | false | false | 6 | 337966 | 337966 | |
| https://10.0.0.8:2379 | 1ac574d30e667edc | 3.5.0 | 1.7 GB | true | false | 6 | 337966 | 337966 | |
| https://10.0.0.7:2379 | ec66c6beb10bb6b3 | 3.5.0 | 1.7 GB | false | false | 6 | 337966 | 337966 | |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://10.0.0.6:2379 defrag
Finished defragmenting etcd member[https://10.0.0.6:2379]
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://10.0.0.7:2379 defrag
Finished defragmenting etcd member[https://10.0.0.7:2379]
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://10.0.0.8:2379 defrag
Finished defragmenting etcd member[https://10.0.0.8:2379]
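The "leader last" ordering used above can be derived from the status table rather than read off by eye. The following is a minimal sketch, not part of the original verification: `leader_last` and `defrag_all` are hypothetical helper names, and the parsing assumes the column layout shown in the table above (endpoint in the first column, IS LEADER in the fifth).

```shell
# Read an `etcdctl endpoint status -w table` listing on stdin and print the
# endpoints one per line, followers first and the leader last.
# Assumes the column layout shown in the transcript above.
leader_last() {
  awk -F'|' '
    /https?:\/\// {
      ep = $2; is_leader = $6
      gsub(/[ \t]/, "", ep)
      gsub(/[ \t]/, "", is_leader)
      if (is_leader == "true") leader = ep
      else print ep
    }
    END { if (leader != "") print leader }
  '
}

# Hypothetical wrapper: defragment every member in that order, stopping at
# the first failure. Intended to run inside the etcdctl container after
# `unset ETCDCTL_ENDPOINTS`, as in the transcript.
defrag_all() {
  for ep in $(etcdctl endpoint status -w table --cluster | leader_last); do
    etcdctl --command-timeout=30s --endpoints="$ep" defrag || return 1
  done
}
```

For the table above, `leader_last` would list 10.0.0.6 and 10.0.0.7 before the leader 10.0.0.8, which matches the order the defrag commands were run in.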
The cluster-autoscaler-default pod did not restart at any point:
$ oc get po
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-default-6dfb9f8bcc-jhgpp 1/1 Running 0 3h51m
cluster-autoscaler-operator-654fb45c66-4mdz6 2/2 Running 0 4h33m
cluster-baremetal-operator-569b4ff4dd-5q9wj 2/2 Running 0 4h33m
machine-api-controllers-785cc9fdf-lb969 7/7 Running 0 4h26m
machine-api-operator-ff7dd5bd7-cf4xv 2/2 Running 0 4h33m
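The restart check can also be scripted instead of read from the listing. This is a rough sketch under the assumption that RESTARTS is the fourth column, as in the output above; `restarted_pods` is a hypothetical helper name.

```shell
# Read `oc get po` output on stdin and print "<pod> <restarts>" for every
# pod whose RESTARTS column (assumed to be column 4) is non-zero.
# Prints nothing when no pod has restarted.
restarted_pods() {
  awk 'NR > 1 && $4 + 0 > 0 { print $1, $4 }'
}

# Example use (the namespace is an assumption based on the autoscaler's
# --namespace argument):
#   oc get po -n openshift-machine-api | restarted_pods
```

Empty output for the listing above corresponds to the "RESTARTS 0" result reported in this comment.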
$ oc edit po cluster-autoscaler-default-6dfb9f8bcc-jhgpp
spec:
  containers:
  - args:
    - --logtostderr
    - --v=1
    - --cloud-provider=clusterapi
    - --namespace=openshift-machine-api
    - --leader-elect-lease-duration=137s
    - --leader-elect-renew-deadline=107s
    - --leader-elect-retry-period=26s
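The leader-election arguments can be pulled out of the pod spec without opening an editor. A minimal sketch, assuming `grep -o` is available (GNU and BSD grep both support it); `leader_elect_flags` is a hypothetical helper name.

```shell
# Read a pod spec (or any text) on stdin and print only the
# --leader-elect-* arguments it contains, one per line.
leader_elect_flags() {
  grep -o -- '--leader-elect-[a-z-]*=[^[:space:]]*'
}

# Example use (the namespace is an assumption taken from the --namespace
# argument above):
#   oc get po cluster-autoscaler-default-6dfb9f8bcc-jhgpp \
#     -n openshift-machine-api -o yaml | leader_elect_flags
```

Using `oc get ... -o yaml` here instead of `oc edit` keeps the check read-only.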
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.29 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1363 |