Bug 2069095
Summary: cluster-autoscaler-default will fail when automated etcd defrag is running on large scale OpenShift Container Platform 4 - Cluster

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Cloud Compute |
| Sub component | Cluster Autoscaler |
| Version | 4.9 |
| Target Release | 4.10.z |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | OpenShift BugZilla Robot <openshift-bugzilla-robot> |
| Assignee | Michael McCune <mimccune> |
| QA Contact | Milind Yadav <miyadav> |
| CC | aos-bugs, mimccune, miyadav, sreber |
| Doc Type | If docs needed, set a value |
| Clones | 2070277 (view as bug list) |
| Bug Depends On | 2063194 |
| Bug Blocks | 2070277 |
| Last Closed | 2022-04-21 13:16:01 UTC |
Comment 6
Milind Yadav
2022-04-06 10:14:34 UTC
In addition to comment #5 above, I did not see "lost master" in the logs.

> Needed to confirm whether these results look good, or whether we need to check any other info as well?
Michael McCune

I think Simon is better suited to answer that for the cluster. From the perspective of the cluster-autoscaler, I think as long as it doesn't crash and hands over leadership properly, it's successful for me.
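A quick way to watch for both failure modes during a defrag run (a sketch, not taken from the bug itself: the deployment name and namespace are the ones appearing in the logs below, while the grep patterns are illustrative):

```
# Terminal 1: the RESTARTS column for cluster-autoscaler-default should stay 0.
oc -n openshift-machine-api get pods -w | grep cluster-autoscaler-default

# Terminal 2: watch for leader-election churn or a "lost master" message.
oc -n openshift-machine-api logs deployment/cluster-autoscaler-default --follow \
  | grep -iE 'leaderelection|lost master|new leader'
```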
Michael McCune

Sorry, didn't mean to clear the needinfo. @sreber, did you have any details to add to Milind's question?

Simon Reber

(In reply to Milind Yadav from comment #6)
> In addition to comment #5 above, I did not see "lost master" in the logs.

I think this looks good. What you want to make sure is that the `cluster-autoscaler-default-*` pod is not restarting at any time. More specifically, once you have the setup from https://bugzilla.redhat.com/show_bug.cgi?id=2069095#c5, you can follow https://docs.openshift.com/container-platform/4.8/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks to trigger `etcd defrag` manually (the automated controller only triggers it when 45% of etcd is considered fragmented). Run it for each `etcd` member (the leader last) and verify whether `cluster-autoscaler-default-*` restarts. If it remains stable and working (does not restart), we can consider the issue tracked in this Red Hat Support Case addressed to the extent possible/feasible. The key is to verify the above, but also to make sure that the rest of the cluster-autoscaling functionality continues to work as expected.
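The procedure Simon describes could be scripted roughly as follows. This is a sketch only, not the exact commands used for verification: the pod name and endpoints are taken from the status table in the verification below (the leader, 10.0.163.254, is defragmented last), and `ETCDCTL_ENDPOINTS` is unset for the same reason it is in the recorded session.

```
# Sketch: defragment each member via the etcdctl container, leader last.
ETCD_POD=etcd-ip-10-0-163-254.us-east-2.compute.internal
for ep in https://10.0.138.10:2379 https://10.0.221.0:2379 https://10.0.163.254:2379; do
  oc rsh -n openshift-etcd "$ETCD_POD" sh -c \
    "unset ETCDCTL_ENDPOINTS; etcdctl --command-timeout=30s --endpoints=$ep defrag"
done
```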
Milind Yadav

Thanks, Simon, for the references. In parallel with running the defragmentation, I ran the workload that causes the cluster to scale, and it looks good; no unexpected crashes.

Cluster-autoscaler-operator logs:

```
...
I0407 04:10:42.991317       1 validator.go:161] Validation webhook called for ClusterAutoscaler: default
I0407 04:10:42.995473       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.048466       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.048489       1 clusterautoscaler_controller.go:270] Creating ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.054665       1 clusterautoscaler_controller.go:224] Created ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.054772       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.061560       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.067814       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.093271       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.097483       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.102242       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.107606       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.113883       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.121381       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.161828       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.166709       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.171402       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:48.229330       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:48.235000       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:48.242293       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:59.467780       1 validator.go:58] Validation webhook called for MachineAutoscaler: mas1
I0407 04:10:59.471528       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:10:59.486305       1 validator.go:58] Validation webhook called for MachineAutoscaler: mas1
I0407 04:10:59.507874       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:10:59.514366       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.219911       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.365730       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.379363       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:24:59.596571       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:40:17.355885       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:40:17.381273       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
...
```
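For context, the scale-driving workload was a work-queue Job with 100 completions (see the `oc get jobs` output below). A minimal stand-in might look like the following; the image, parallelism, command, and CPU request are assumptions chosen so that pods stay Pending until the autoscaler adds a node, not the exact job used here:

```
oc create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  generateName: work-queue-
spec:
  completions: 100      # matches the 100/100 completions seen below
  parallelism: 20       # assumption: enough parallel pods to exhaust capacity
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: work
        image: registry.access.redhat.com/ubi8/ubi-minimal:latest
        command: ["sleep", "60"]   # assumption: placeholder work item
        resources:
          requests:
            cpu: "2"               # assumption: forces Pending pods -> scale-up
EOF
```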
Defragmentation:

```
[miyadav@miyadav ~]$ oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
etcd-ip-10-0-138-10.us-east-2.compute.internal    4/4   Running   0   87m   10.0.138.10    ip-10-0-138-10.us-east-2.compute.internal    <none>   <none>
etcd-ip-10-0-163-254.us-east-2.compute.internal   4/4   Running   0   90m   10.0.163.254   ip-10-0-163-254.us-east-2.compute.internal   <none>   <none>
etcd-ip-10-0-221-0.us-east-2.compute.internal     4/4   Running   0   89m   10.0.221.0     ip-10-0-221-0.us-east-2.compute.internal     <none>   <none>

[miyadav@miyadav ~]$ oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal etcdctl endpoint status --cluster -w table
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|   https://10.0.221.0:2379 | 7f49a444454d5fa9 |   3.5.0 |  2.8 GB |     false |      false |        10

[miyadav@miyadav ~]$ oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|   https://10.0.221.0:2379 | 7f49a444454d5fa9 |   3.5.0 |  3.0 GB |     false |      false |        10 |     577845 |             577845 |        |
| https://10.0.163.254:2379 | 96592f389ac22ff2 |   3.5.0 |  2.7 GB |      true |      false |        10 |     577846 |             577846 |        |
|  https://10.0.138.10:2379 | f033163df9ebfad9 |   3.5.0 |  3.0 GB |     false |      false |        10 |     577849 |             577849 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

The cluster scaled up, adding a new machine, and then scaled down after the jobs completed successfully:

```
[miyadav@miyadav ~]$ oc get jobs
NAME               COMPLETIONS   DURATION   AGE
work-queue-4scfk   100/100       8m1s       14m

[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   99m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   99m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   99m
miyadav-0704-czr6m-worker-us-east-2a-lb7rj   Running   m5.4xlarge   us-east-2   us-east-2a   14m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   97m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   97m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   97m

[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   102m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   102m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   102m
miyadav-0704-czr6m-worker-us-east-2a-lb7rj   Running   m5.4xlarge   us-east-2   us-east-2a   17m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   100m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   100m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   100m

[miyadav@miyadav ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-138-10.us-east-2.compute.internal    Ready    master   102m   v1.23.5+1f952b3
ip-10-0-138-112.us-east-2.compute.internal   Ready    worker   96m    v1.23.5+1f952b3
ip-10-0-144-206.us-east-2.compute.internal   Ready    worker   14m    v1.23.5+1f952b3
ip-10-0-163-254.us-east-2.compute.internal   Ready    master   101m   v1.23.5+1f952b3
ip-10-0-173-223.us-east-2.compute.internal   Ready    worker   97m    v1.23.5+1f952b3
ip-10-0-193-68.us-east-2.compute.internal    Ready    worker   96m    v1.23.5+1f952b3
ip-10-0-221-0.us-east-2.compute.internal     Ready    master   101m   v1.23.5+1f952b3
```
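As a side note on the 45% fragmentation threshold Simon mentioned: fragmentation can be estimated from the JSON form of the same status command. A sketch, assuming `jq` is available on the workstation and the `Endpoint`/`Status.dbSize`/`Status.dbSizeInUse` field names reported by etcd 3.5:

```
# Estimate per-member fragmentation: 1 - dbSizeInUse/dbSize.
oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal \
  etcdctl endpoint status --cluster -w json \
  | jq '.[] | {endpoint: .Endpoint, fragmented: (1 - .Status.dbSizeInUse / .Status.dbSize)}'
```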
After the scale-down, the extra worker machine is gone:

```
[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   108m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   108m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   108m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   106m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   106m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   106m
```

Additional info: Based on these results, moving to VERIFIED.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1356