Bug 1839098
| Summary: | Resume a cluster from sleep, Alerts keep firing even though everything is fine | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wolfgang Kulhanek <wkulhane> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | alegrand, anpicker, aos-bugs, deads, erooth, jokerman, kakkoyun, lcosic, mloibl, nagrawal, pkrupa, rphillips, surbania, tnozicka, vareti |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:00:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1845369 | | |
Description
Wolfgang Kulhanek
2020-05-22 13:32:41 UTC
All of those alerts use an `absent(up(<<SOMETHING>>))` expression, so investigating one of them should probably resolve all of them. Could you paste the output of the following Prometheus queries:
1. up{job="scheduler"} (from now and from before the cluster was put to sleep)
Also, did the IP addresses of the master nodes by any chance change after resuming from sleep?
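If it helps to capture that output programmatically rather than from the console, below is a minimal sketch using the Prometheus Go client. It is not part of the bug report; the Prometheus address is a placeholder and would normally be the cluster's Prometheus or Thanos querier endpoint.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point this at the cluster's Prometheus/Thanos querier.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query for the scheduler's "up" series, the metric the alerts
	// above wrap in absent(...).
	result, warnings, err := v1.NewAPI(client).Query(ctx, `up{job="scheduler"}`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```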
@Wolfgang, could you also clarify whether the VMs are stopped or paused?

According to liggitt this is a known issue with KUBERNETES_SERVICE_HOST, and the recommendation is to use the kubernetes.default.svc FQDN instead: https://github.com/kubernetes/kubernetes/issues/40973#issuecomment-383969128

@Pavel: We stop them, basically an AWS instance stop, and then resume by starting them again.
Ansible Code:
```yaml
- when: ACTION == 'stop'
  name: Stop instances by guid tags
  ec2_instance:
    state: stopped
    wait: no
    filters:
      "tag:guid": "{{ guid }}"

- when: ACTION == 'start'
  name: Start instances by guid tags
  ec2_instance:
    state: started
    wait: no
    filters:
      "tag:guid": "{{ guid }}"
```
On OpenStack we use `openstack server stop` and `openstack server start`
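As background for the KUBERNETES_SERVICE_HOST discussion above, here is a minimal Go sketch (not from the bug report, and not the actual kubelet fix) showing where client-go's in-cluster configuration picks up the API server address from the injected environment variables, and how a client could instead be pointed at the kubernetes.default.svc name as suggested in the linked issue.

```go
package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/rest"
)

func main() {
	// rest.InClusterConfig() builds its API server URL from these injected
	// env vars (plus the service-account token and CA mounted in the pod).
	fmt.Printf("apiserver from env: https://%s:%s\n",
		os.Getenv("KUBERNETES_SERVICE_HOST"),
		os.Getenv("KUBERNETES_SERVICE_PORT"))

	cfg, err := rest.InClusterConfig() // fails when run outside a pod
	if err != nil {
		panic(err)
	}

	// Illustrative override only: address the API server via its service DNS
	// name instead of the env-var-derived host, per the linked recommendation.
	cfg.Host = "https://kubernetes.default.svc"
	fmt.Println("using apiserver host:", cfg.Host)
}
```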
Just a note: Prometheus uses the Kubernetes client-go library for discovery, and that library directly requires KUBERNETES_SERVICE_HOST [1]. So if this is a known issue and the recommendation is not to use this env var, then the official Kubernetes library should reflect that.

[1]: https://github.com/kubernetes/client-go/blob/master/rest/config.go#L464

We are aware that is the case. David created a fix for the upstream kubelet: https://github.com/kubernetes/kubernetes/pull/91500

*** Bug 1845199 has been marked as a duplicate of this bug. ***

Verified. Stopped the cluster for 24 hours by stopping the VMs from the AWS console. Started it after 24 hours, approved the CSRs, and waited 15-20 minutes. I do not see any Down alerts in the console.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-20-011219   True        False         27h     Cluster version is 4.6.0-0.nightly-2020-06-20-011219

$ oc get nodes -o wide
NAME                                         STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION          CONTAINER-RUNTIME
ip-10-0-140-103.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.140.103   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-150-169.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.150.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-162-173.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.162.173   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-168-242.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.168.242   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-206-17.us-east-2.compute.internal    Ready    worker   27h   v1.18.3+e1ba7b6   10.0.206.17    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-210-169.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.210.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cloud-credential                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cluster-autoscaler                         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
config-operator                            4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
console                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      26m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
dns                                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
etcd                                       4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
image-registry                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
ingress                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
insights                                   4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-apiserver                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-controller-manager                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-scheduler                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-api                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-approver                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-config                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
marketplace                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
monitoring                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
network                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
node-tuning                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
openshift-apiserver                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
openshift-controller-manager               4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
openshift-samples                          4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
service-ca                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
storage                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196