Bug 1839098
| Summary: | Resume a cluster from sleep, Alerts keep firing even though everything is fine | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wolfgang Kulhanek <wkulhane> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | alegrand, anpicker, aos-bugs, deads, erooth, jokerman, kakkoyun, lcosic, mloibl, nagrawal, pkrupa, rphillips, surbania, tnozicka, vareti |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:00:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1845369 | | |
Description
Wolfgang Kulhanek
2020-05-22 13:32:41 UTC
All of those alerts use an `absent(up(<<SOMETHING>>))` expression, so investigating one of them should probably resolve all of them. Could you paste the output of the following Prometheus queries:
1. up{job="scheduler"} (from now and from before the cluster was put to sleep)
Also, did the IP addresses of the master nodes by any chance change after resuming from sleep?
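If it helps to capture that output programmatically rather than from the console, below is a minimal sketch using the Prometheus Go client. It is not part of the bug report; the Prometheus address is a placeholder and would normally be the cluster's Prometheus or Thanos querier endpoint.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point this at the cluster's Prometheus/Thanos querier.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query for the scheduler's "up" series, the metric the alerts
	// above wrap in absent(...).
	result, warnings, err := v1.NewAPI(client).Query(ctx, `up{job="scheduler"}`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```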
@Wolfgang, could you also clarify whether the VMs are stopped or paused?

According to liggitt this is a known issue with KUBERNETES_SERVICE_HOST, and the recommendation is to use the kubernetes.default.svc FQDN instead: https://github.com/kubernetes/kubernetes/issues/40973#issuecomment-383969128

@Pavel: We stop them, basically an AWS instance stop, and then resume by starting them again.
Ansible Code:
```yaml
- when: ACTION == 'stop'
  name: Stop instances by guid tags
  ec2_instance:
    state: stopped
    wait: no
    filters:
      "tag:guid": "{{ guid }}"

- when: ACTION == 'start'
  name: Start instances by guid tags
  ec2_instance:
    state: started
    wait: no
    filters:
      "tag:guid": "{{ guid }}"
```
On OpenStack we use `openstack server stop` and `openstack server start`
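As background for the KUBERNETES_SERVICE_HOST discussion above, here is a minimal Go sketch (not from the bug report, and not the actual kubelet fix) showing where client-go's in-cluster configuration picks up the API server address from the injected environment variables, and how a client could instead be pointed at the kubernetes.default.svc name as suggested in the linked issue.

```go
package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/rest"
)

func main() {
	// rest.InClusterConfig() builds its API server URL from these injected
	// env vars (plus the service-account token and CA mounted in the pod).
	fmt.Printf("apiserver from env: https://%s:%s\n",
		os.Getenv("KUBERNETES_SERVICE_HOST"),
		os.Getenv("KUBERNETES_SERVICE_PORT"))

	cfg, err := rest.InClusterConfig() // fails when run outside a pod
	if err != nil {
		panic(err)
	}

	// Illustrative override only: address the API server via its service DNS
	// name instead of the env-var-derived host, per the linked recommendation.
	cfg.Host = "https://kubernetes.default.svc"
	fmt.Println("using apiserver host:", cfg.Host)
}
```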
Just a note: Prometheus uses the Kubernetes client-go library for discovery, and that library directly requires KUBERNETES_SERVICE_HOST [1]. So if this is a known issue and the recommendation is not to use this env var, then the official Kubernetes library should reflect that.

[1]: https://github.com/kubernetes/client-go/blob/master/rest/config.go#L464

We are aware that is the case. David created a fix for the upstream kubelet: https://github.com/kubernetes/kubernetes/pull/91500

*** Bug 1845199 has been marked as a duplicate of this bug. ***

Verified. Stopped the cluster for 24 hours by stopping the VMs from the AWS console. Started it after 24 hours, approved the CSRs, and waited 15-20 minutes. I do not see any Down alerts in the console.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-20-011219   True        False         27h     Cluster version is 4.6.0-0.nightly-2020-06-20-011219

$ oc get nodes -o wide
NAME                                         STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION          CONTAINER-RUNTIME
ip-10-0-140-103.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.140.103   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-150-169.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.150.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-162-173.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.162.173   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-168-242.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.168.242   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-206-17.us-east-2.compute.internal    Ready    worker   27h   v1.18.3+e1ba7b6   10.0.206.17    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-210-169.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.210.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cloud-credential                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cluster-autoscaler                         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
config-operator                            4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
console                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      26m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
dns                                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
etcd                                       4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
image-registry                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
ingress                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
insights                                   4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-apiserver                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-controller-manager                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-scheduler                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-api                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-approver                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-config                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
marketplace                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
monitoring                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
network                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
node-tuning                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
openshift-apiserver                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
openshift-controller-manager               4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
openshift-samples                          4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
service-ca                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
storage                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196