Hi Michael, Simon,

Thanks for the detailed steps in comment #1 (Description). I followed all the steps, installed the mentioned operators, and here are the logs (after 2 hrs or so):

[miyadav@miyadav ~]$ oc logs -f cluster-autoscaler-operator-6589f54589-wx9jl -c cluster-autoscaler-operator
I0406 07:36:40.830761       1 main.go:13] Go Version: go1.17.5
I0406 07:36:40.831034       1 main.go:14] Go OS/Arch: linux/amd64
I0406 07:36:40.831150       1 main.go:15] Version: cluster-autoscaler-operator v4.10.0-202203311829.p0.g8bcdccc.assembly.stream-dirty
W0406 07:36:40.837248       1 leaderelection.go:51] unable to get cluster infrastructure status, using HA cluster values for leader election: infrastructures.config.openshift.io "cluster" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-autoscaler-operator" cannot get resource "infrastructures" in API group "config.openshift.io" at the cluster scope
I0406 07:36:41.887922       1 request.go:665] Waited for 1.040054041s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
W0406 07:36:46.949503       1 machineautoscaler_controller.go:150] Removing support for unregistered target type: cluster.k8s.io/v1beta1, Kind=MachineDeployment
W0406 07:36:50.740796       1 machineautoscaler_controller.go:150] Removing support for unregistered target type: cluster.k8s.io/v1beta1, Kind=MachineSet
I0406 07:36:51.888023       1 request.go:665] Waited for 1.048171568s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/events.k8s.io/v1beta1?timeout=32s
W0406 07:36:54.542488       1 machineautoscaler_controller.go:150] Removing support for unregistered target type: machine.openshift.io/v1beta1, Kind=MachineDeployment
I0406 07:36:54.542800       1 main.go:36] Starting cluster-autoscaler-operator
I0406 07:36:54.542988       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0406 07:39:14.631980       1 leaderelection.go:258] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
I0406 07:39:14.637921       1 status.go:386] No ClusterAutoscaler. Reporting available.
I0406 07:39:14.637938       1 status.go:234] Operator status available: at version 4.10.8
I0406 07:39:14.739388       1 webhookconfig.go:72] Webhook configuration status: updated
E0406 07:42:24.712791       1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps cluster-autoscaler-operator-leader)
...

Projects that got created:
...
project-5999   Active
project-6000   Active
project-6001   Active
project-6002   Active
project-6003   Active
project-6004   Active
project-6005   Active
project-6006   Active
project-6007   Active
project-6008   Active
project-6009   Active
project-6010   Active
project-6011   Active
project-6012   Active
project-6013   Active
project-6014   Active
project-6015   Active
project-6016   Active
project-6017   Active
project-6018   Active
project-6019   Active
project-6020   Active
project-6021   Active
project-6022   Active
...

[miyadav@miyadav ~]$ oc debug node/ip-10-0-164-204.us-east-2.compute.internal
Starting pod/ip-10-0-164-204us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.164.204
If you don't see a command prompt, try pressing enter.
chroot /host
sh-4.4#
sh-4.4# vi /tmp/kubeconfig
sh-4.4# oc get nodes --kubeconfig /tmp/kubeconfig
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-231.us-east-2.compute.internal   Ready    worker   54m   v1.23.5+1f952b3
ip-10-0-147-246.us-east-2.compute.internal   Ready    master   58m   v1.23.5+1f952b3
ip-10-0-164-204.us-east-2.compute.internal   Ready    master   57m   v1.23.5+1f952b3
ip-10-0-172-19.us-east-2.compute.internal    Ready    worker   52m   v1.23.5+1f952b3
ip-10-0-198-56.us-east-2.compute.internal    Ready    master   59m   v1.23.5+1f952b3
ip-10-0-205-71.us-east-2.compute.internal    Ready    worker   54m   v1.23.5+1f952b3
sh-4.4# for i in {5000..7125}; do oc new-project project-$i --kubeconfig /tmp/kubeconfig; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt --kubeconfig /tmp/kubeconfig; done
...

Additional info: Ran the script above on a master node.

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.8    True        False         120m    Cluster version is 4.10.8

Need to confirm whether these results look good, or whether we need to check any other info as well?
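For reproducibility, the load-generation loop above can be kept as a small standalone script. The sketch below is a dry run that only collects and prints the `oc` commands instead of executing them (the kubeconfig path and CA-bundle source file are taken from the transcript; the shortened project range is for illustration only — the transcript used `{5000..7125}`):

```shell
#!/bin/bash
# Dry-run sketch of the project/configmap load loop from the transcript.
# Collects the oc commands into a variable and prints them; to run for
# real, execute each printed line against a cluster.
KUBECONFIG_PATH=/tmp/kubeconfig                                       # path used in the transcript
CA_BUNDLE=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt     # configmap source from the transcript

CMDS=$(
  for i in 5000 5001 5002; do   # transcript used the full range {5000..7125}
    echo "oc new-project project-$i --kubeconfig $KUBECONFIG_PATH"
    echo "oc create configmap project-$i --from-file=$CA_BUNDLE --kubeconfig $KUBECONFIG_PATH"
  done
)
printf '%s\n' "$CMDS"
```

The point of the loop is simply to grow the etcd database (thousands of projects plus a large configmap in each), which is what makes the subsequent defragmentation meaningful.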
In addition to comment #5 above, I did not see "lost master" in the logs.
> Needed to confirm if these results looks good , or we need to check any other info as well ?

I think Simon is better suited to answer that for the cluster. From the perspective of the cluster-autoscaler, as long as it doesn't crash and hands over leadership properly, it's successful for me.
Sorry, didn't mean to clear the needinfo. @sreber, did you have any details to add to Milind's question?
(In reply to Milind Yadav from comment #6)
> in addition to above comment#5 I did not see "lost master" in logs ..

I think this looks good. What you want to make sure of is that the `cluster-autoscaler-default-*` pod is not restarting at any time. More specifically, once you have the setup from https://bugzilla.redhat.com/show_bug.cgi?id=2069095#c5, you can follow https://docs.openshift.com/container-platform/4.8/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks to trigger an `etcd defrag` manually (the automated controller only triggers it when 45% of etcd is considered fragmented). Run it for each `etcd` member (the leader last) and verify whether `cluster-autoscaler-default-*` restarts or not. If it remains stable and working (does not restart), we can consider the issue tracked in this Red Hat Support Case addressed to the extent possible/feasible. The key is to verify the above, but also to make sure that the rest of the ClusterAutoscaler functionality continues to work as expected.
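The procedure described above reduces to a short command sequence per etcd member, run leader-last. A minimal sketch, printed here as a checklist rather than executed live (the pod name placeholder is illustrative; on a real cluster, run each line after `oc rsh` into the pod, matching the transcript later in this bug):

```shell
# Checklist for the manual etcd defrag referenced above. Run inside each
# etcd pod in turn, saving the current leader for last. Stored in a
# variable and printed, since the real commands need a live cluster.
DEFRAG_STEPS='oc rsh -n openshift-etcd <etcd-pod-name>
unset ETCDCTL_ENDPOINTS
etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
etcdctl endpoint status -w table --cluster'
printf '%s\n' "$DEFRAG_STEPS"
```

Between members, the restart count of the `cluster-autoscaler-default-*` pod should stay at 0; `unset ETCDCTL_ENDPOINTS` is needed so `defrag` targets only the local member instead of the whole cluster.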
Thanks, Simon, for the references.

In parallel with running the defragmentation, I ran the workload that caused the cluster to scale, and it looks good; no unexpected crashes.

Cluster-autoscaler-operator logs:
...
I0407 04:10:42.991317       1 validator.go:161] Validation webhook called for ClusterAutoscaler: default
I0407 04:10:42.995473       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.048466       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.048489       1 clusterautoscaler_controller.go:270] Creating ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.054665       1 clusterautoscaler_controller.go:224] Created ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.054772       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.061560       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.067814       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.093271       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.097483       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.102242       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.107606       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.113883       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.121381       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.161828       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.166709       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.171402       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:48.229330       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:48.235000       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:48.242293       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:59.467780       1 validator.go:58] Validation webhook called for MachineAutoscaler: mas1
I0407 04:10:59.471528       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:10:59.486305       1 validator.go:58] Validation webhook called for MachineAutoscaler: mas1
I0407 04:10:59.507874       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:10:59.514366       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.219911       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.365730       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.379363       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:24:59.596571       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:40:17.355885       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:40:17.381273       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
...
Defragmentation:

[miyadav@miyadav ~]$ oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
etcd-ip-10-0-138-10.us-east-2.compute.internal    4/4   Running   0   87m   10.0.138.10    ip-10-0-138-10.us-east-2.compute.internal    <none>   <none>
etcd-ip-10-0-163-254.us-east-2.compute.internal   4/4   Running   0   90m   10.0.163.254   ip-10-0-163-254.us-east-2.compute.internal   <none>   <none>
etcd-ip-10-0-221-0.us-east-2.compute.internal     4/4   Running   0   89m   10.0.221.0     ip-10-0-221-0.us-east-2.compute.internal     <none>   <none>

[miyadav@miyadav ~]$ oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal etcdctl endpoint status --cluster -w table
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.221.0:2379   | 7f49a444454d5fa9 |   3.5.0 |  2.8 GB |     false |      false |        10
[miyadav@miyadav ~]$
[miyadav@miyadav ~]$ $ oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal
bash: $: command not found...
[miyadav@miyadav ~]$ oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# sh-4.4# unset ETCDCTL_ENDPOINTS
sh: sh-4.4#: command not found
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.221.0:2379   | 7f49a444454d5fa9 |   3.5.0 |  3.0 GB |     false |      false |        10 |     577845 |             577845 |        |
| https://10.0.163.254:2379 | 96592f389ac22ff2 |   3.5.0 |  2.7 GB |      true |      false |        10 |     577846 |             577846 |        |
| https://10.0.138.10:2379  | f033163df9ebfad9 |   3.5.0 |  3.0 GB |     false |      false |        10 |     577849 |             577849 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The cluster scaled up, adding a new machine, and then scaled down after the jobs completed successfully.

[miyadav@miyadav ~]$ oc get jobs
NAME               COMPLETIONS   DURATION   AGE
work-queue-4scfk   100/100       8m1s       14m

[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   99m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   99m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   99m
miyadav-0704-czr6m-worker-us-east-2a-lb7rj   Running   m5.4xlarge   us-east-2   us-east-2a   14m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   97m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   97m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   97m

[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   102m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   102m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   102m
miyadav-0704-czr6m-worker-us-east-2a-lb7rj   Running   m5.4xlarge   us-east-2   us-east-2a   17m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   100m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   100m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   100m

[miyadav@miyadav ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-138-10.us-east-2.compute.internal    Ready    master   102m   v1.23.5+1f952b3
ip-10-0-138-112.us-east-2.compute.internal   Ready    worker   96m    v1.23.5+1f952b3
ip-10-0-144-206.us-east-2.compute.internal   Ready    worker   14m    v1.23.5+1f952b3
ip-10-0-163-254.us-east-2.compute.internal   Ready    master   101m   v1.23.5+1f952b3
ip-10-0-173-223.us-east-2.compute.internal   Ready    worker   97m    v1.23.5+1f952b3
ip-10-0-193-68.us-east-2.compute.internal    Ready    worker   96m    v1.23.5+1f952b3
ip-10-0-221-0.us-east-2.compute.internal     Ready    master   101m   v1.23.5+1f952b3

[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   108m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   108m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   108m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   106m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   106m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   106m
...

Additional info: Based on the results above, moving to VERIFIED.
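For the record, the stability check used throughout this bug reduces to watching the autoscaler pods' restart counts before and after each defrag. A minimal sketch, stored and printed rather than executed (the `grep` pattern matches the pod names seen earlier in this bug; run the printed lines against a live cluster):

```shell
# Verification commands for the autoscaler's stability across defrags:
# the RESTARTS column for cluster-autoscaler pods must stay at 0.
# Printed here as a checklist, since they need a live cluster.
CHECKS='oc get pods -n openshift-machine-api | grep cluster-autoscaler
oc get clusterautoscaler default -o yaml'
printf '%s\n' "$CHECKS"
```

Together with the scale-up/scale-down shown above, a zero restart count is what justifies the move to VERIFIED.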
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1356