Description of problem:

I am working with engineering on shutdown/resume of clusters (putting the VMs to sleep to save money before the initial 24h certificate rotation). My last test succeeded, but I am left with 10 alerts that will not clear:

- CloudCredentialOperatorDown
- ClusterAutoscalerOperatorDown
- ClusterMachineApproverDown
- ClusterVersionOperatorDown
- KubeAPIDown
- KubeControllerManagerDown
- KubeletDown
- KubeSchedulerDown (that one is showing twice)
- MachineAPIOperatorDown

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-20-053050

Steps to Reproduce:
1. Deploy a cluster
2. Once deployment has finished and everything has stabilized, stop the VMs (e.g. using the AWS Console)
3. 26 hours later, resume the VMs
4. Wait for CSRs to appear (one for each node) and approve them (example commands below)
5. Wait for all nodes to show Ready
6. Wait for all cluster operators to stabilize and become Available (10 minutes or so)
7. Open the Web Console and examine the Alerts

Actual results:
- Alerts keep firing (even a day later)

Expected results:
- Alerts clear as the cluster stabilizes

Additional info:
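For step 4, one way to list and approve the node CSRs with oc (the <csr-name> placeholder is just illustrative):

$ oc get csr
$ oc adm certificate approve <csr-name>

or, to approve all current CSRs in one go:

$ oc get csr -o name | xargs oc adm certificate approve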
All those alerts use an `absent(up(<<SOMETHING>>))`-style expression, so investigating one of them should probably resolve all of them. Could you paste the output of the following Prometheus query:

1. `up{job="scheduler"}` (from now, and from before the cluster was put to sleep)

Also, did the IP addresses of the master nodes by any chance change after resuming from sleep?
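For reference, one quick way to run that query against the in-cluster Prometheus (pod and container names below are the usual openshift-monitoring defaults, and this assumes curl is available inside the Prometheus container):

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="scheduler"}'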
@Wolfgang, could you also clarify whether the VMs are stopped or paused?
According to liggitt this is a known issue with KUBERNETES_SERVICE_HOST, and he recommends using the kubernetes.default.svc FQDN instead: https://github.com/kubernetes/kubernetes/issues/40973#issuecomment-383969128
@Pavel: We stop them. Basically AWS instance stop. And then resume by starting them again.

Ansible code:

- when: ACTION == 'stop'
  name: Stop instances by guid tags
  ec2_instance:
    state: stopped
    wait: no
    filters:
      "tag:guid": "{{ guid }}"

- when: ACTION == 'start'
  name: Start instances by guid tags
  ec2_instance:
    state: started
    wait: no
    filters:
      "tag:guid": "{{ guid }}"

On OpenStack we use `openstack server stop` and `openstack server start`.
Just a note: Prometheus uses the Kubernetes client-go library for discovery, and that library directly requires KUBERNETES_SERVICE_HOST [1]. So if this is a known issue and the recommendation is not to use this env var, then the official Kubernetes library should reflect that.

[1]: https://github.com/kubernetes/client-go/blob/master/rest/config.go#L464
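As a quick sanity check, the variables client-go reads are injected into every pod's environment and can be inspected like this (pod name is just an example, and this assumes env/grep are available in the image):

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- env | grep KUBERNETES_SERVICE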
We are aware that's the case. David created a fix for the upstream kubelet: https://github.com/kubernetes/kubernetes/pull/91500
*** Bug 1845199 has been marked as a duplicate of this bug. ***
Verified. Stopped cluster for 24 hours by stopping VMs from AWS console. Started after 24 hours, approved CSRs and waited for 15-20 minutes. Do not see any Down alerts from console.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-20-011219   True        False         27h     Cluster version is 4.6.0-0.nightly-2020-06-20-011219

$ oc get nodes -o wide
NAME                                         STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION          CONTAINER-RUNTIME
ip-10-0-140-103.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.140.103   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-150-169.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.150.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-162-173.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.162.173   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-168-242.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.168.242   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-206-17.us-east-2.compute.internal    Ready    worker   27h   v1.18.3+e1ba7b6   10.0.206.17    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-210-169.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.210.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cloud-credential                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cluster-autoscaler                         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
config-operator                            4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
console                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      26m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
dns                                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
etcd                                       4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
image-registry                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
ingress                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
insights                                   4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-apiserver                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-controller-manager                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-scheduler                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-api                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-approver                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-config                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
marketplace                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
monitoring                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
network                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
node-tuning                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
openshift-apiserver                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
openshift-controller-manager               4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
openshift-samples                          4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
service-ca                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
storage                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196