Bug 1839098 - Resume a cluster from sleep, Alerts keep firing even though everything is fine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1845199
Depends On:
Blocks: 1845369
 
Reported: 2020-05-22 13:32 UTC by Wolfgang Kulhanek
Modified: 2020-10-27 16:00 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:00:29 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub kubernetes/kubernetes issue 40973 (closed): KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT weren't set in pod env (last updated 2020-11-03)
- GitHub kubernetes/kubernetes pull 91500 (closed): reduce race risk in kubelet for missing KUBERNETES_SERVICE_HOST (last updated 2020-11-03)
- GitHub openshift/origin pull 25075 (closed): Bug 1839098: UPSTREAM: 91500: reduce race risk in kubelet for missing KUBERNETES_SERVICE_HOST (last updated 2020-11-03)
- Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27)

Description Wolfgang Kulhanek 2020-05-22 13:32:41 UTC
Description of problem:

I am working with engineering on Shutdown / Resume of clusters (put the VMs to sleep to save money before the initial 24h certificate rotation).

My last test succeeded, but I am left with 10 alerts that will not clear:
- CloudCredentialOperatorDown
- ClusterAutoscalerOperatorDown
- ClusterMachineApproverDown
- ClusterVersionOperatorDown
- KubeAPIDown
- KubeControllerManagerDown
- KubeletDown
- KubeSchedulerDown (that one is showing twice)
- MachineAPIOperatorDown


Version-Release number of selected component (if applicable):
 4.5.0-0.nightly-2020-05-20-053050

Steps to Reproduce:
1. Deploy a cluster
2. Once the deployment has finished and everything has stabilized, stop the VMs (e.g. using the AWS Console)
3. Resume the VMs 26 hours later
4. Wait for the CSRs to appear (one for each node) and approve them
5. Wait for all Nodes to show Ready
6. Wait for all Cluster Operators to stabilize and become available (10 mins or so)
7. Open Web Console and examine Alerts

Actual results:
- Alerts firing (even a day later)

Expected results:
- Alerts being cleaned up as the cluster stabilizes


Additional info:

Comment 1 Pawel Krupa 2020-05-22 14:02:34 UTC
All of these alerts use an `absent(up{...})` expression, so investigating one should probably resolve all of them. Could you paste the output of the following Prometheus query:

1. up{job="scheduler"} (from now and from before cluster was put to sleep)

Also, did the IP addresses of the master nodes by any chance change after resuming from sleep?
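For reference, a query like the one above can be issued against the cluster's Prometheus HTTP API as an instant query. A minimal Go sketch that only builds the query URL (the route hostname is a made-up placeholder; authentication is omitted):

```go
package main

import (
	"fmt"
	"net/url"
)

// promQueryURL builds an instant-query URL for the Prometheus HTTP API
// (GET /api/v1/query?query=<expr>). The expression must be URL-escaped.
func promQueryURL(base, expr string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(expr)
}

func main() {
	// Hypothetical route hostname; substitute your cluster's
	// openshift-monitoring Prometheus route and supply a bearer token.
	base := "https://prometheus-k8s-openshift-monitoring.apps.example.com"
	fmt.Println(promQueryURL(base, `up{job="scheduler"}`))
}
```

Running the same query via the web console's Monitoring UI works just as well; the sketch is only meant to show the exact expression being asked for.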

Comment 3 Pawel Krupa 2020-05-25 09:12:21 UTC
@Wolfgang, could you also clarify whether the VMs are stopped or paused?

Comment 13 Ryan Phillips 2020-05-27 14:10:21 UTC
According to liggitt, this is a known issue with KUBERNETES_SERVICE_HOST; he recommends using the kubernetes.default.svc FQDN instead: https://github.com/kubernetes/kubernetes/issues/40973#issuecomment-383969128

Comment 14 Wolfgang Kulhanek 2020-05-27 16:23:07 UTC
@Pawel: We stop them, basically an AWS instance stop, and then resume by starting them again.

Ansible Code:

    - when: ACTION == 'stop'
      name: Stop instances by guid tags
      ec2_instance:
        state: stopped
        wait: no
        filters:
          "tag:guid": "{{ guid }}"

    - when: ACTION == 'start'
      name: Start instances by guid tags
      ec2_instance:
        state: started
        wait: no
        filters:
          "tag:guid": "{{ guid }}"

On OpenStack we use `openstack server stop` and `openstack server start`

Comment 16 Pawel Krupa 2020-05-28 08:14:28 UTC
Just a note: Prometheus uses the Kubernetes client-go library for service discovery, and that library directly requires KUBERNETES_SERVICE_HOST [1]. So if this is a known issue and the recommendation is not to use this env var, then the official Kubernetes client library should reflect that.


[1]: https://github.com/kubernetes/client-go/blob/master/rest/config.go#L464

Comment 17 Tomáš Nožička 2020-05-28 10:35:03 UTC
We are aware that this is the case. David created a fix for the upstream kubelet: https://github.com/kubernetes/kubernetes/pull/91500

Comment 19 Ryan Phillips 2020-06-08 21:05:54 UTC
*** Bug 1845199 has been marked as a duplicate of this bug. ***

Comment 22 Sunil Choudhary 2020-06-22 12:33:04 UTC
Verified. I stopped the cluster for 24 hours by stopping the VMs from the AWS console, started them again after 24 hours, approved the CSRs, and waited 15-20 minutes. I do not see any Down alerts in the console.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-20-011219   True        False         27h     Cluster version is 4.6.0-0.nightly-2020-06-20-011219

$ oc get nodes -o wide
NAME                                         STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME
ip-10-0-140-103.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.140.103   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-150-169.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.150.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-162-173.us-east-2.compute.internal   Ready    worker   27h   v1.18.3+e1ba7b6   10.0.162.173   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-168-242.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.168.242   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-206-17.us-east-2.compute.internal    Ready    worker   27h   v1.18.3+e1ba7b6   10.0.206.17    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev
ip-10-0-210-169.us-east-2.compute.internal   Ready    master   27h   v1.18.3+e1ba7b6   10.0.210.169   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006192241-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-23.dev.rhaos4.6.git6902246.el8-dev

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cloud-credential                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
cluster-autoscaler                         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
config-operator                            4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
console                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      26m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
dns                                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
etcd                                       4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
image-registry                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
ingress                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
insights                                   4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-apiserver                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-controller-manager                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-scheduler                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-api                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-approver                           4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
machine-config                             4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
marketplace                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      28m
monitoring                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
network                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
node-tuning                                4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
openshift-apiserver                        4.6.0-0.nightly-2020-06-20-011219   True        False         False      27m
openshift-controller-manager               4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
openshift-samples                          4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-20-011219   True        False         False      22m
service-ca                                 4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h
storage                                    4.6.0-0.nightly-2020-06-20-011219   True        False         False      27h

Comment 24 errata-xmlrpc 2020-10-27 16:00:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

