Description of problem:
A 4.4 cluster was running on RHV for several days; then one of the masters became NotReady and several cluster operators degraded.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-27-020932

How reproducible:
1 time of 1 test

Steps to Reproduce:
1. Leave a cluster running on RHV for several days.

Actual results:
One of the masters becomes NotReady and several cluster operators are degraded.

$ oc get nodes
NAME                       STATUS     ROLES    AGE   VERSION
six-9cdwv-master-0         Ready      master   11d   v1.17.1
six-9cdwv-master-1         Ready      master   11d   v1.17.1
six-9cdwv-master-2         NotReady   master   11d   v1.17.1
six-9cdwv-worker-0-9hklz   Ready      worker   11d   v1.17.1
six-9cdwv-worker-0-kvfzq   Ready      worker   11d   v1.17.1

$ oc get co | grep -Eiv 'true.*false.*false'
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                      4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
kube-apiserver            4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
kube-controller-manager   4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
kube-scheduler            4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
machine-config            4.4.0-0.nightly-2020-02-27-020932   False       False        True       19h
monitoring                4.4.0-0.nightly-2020-02-27-020932   False       True          True       19h
network                   4.4.0-0.nightly-2020-02-27-020932   True        True          True       11d
openshift-apiserver       4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d

Expected results:
The cluster is healthy/functional.

Additional info:
Still copying the logs, will attach later.
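For reference, the two checks above can be scripted; a minimal triage sketch, assuming oc is already logged in to the affected cluster and jq is available:

# List nodes whose Ready condition is not "True" (NotReady/Unknown).
oc get nodes -o json \
  | jq -r '.items[]
      | select(([.status.conditions[] | select(.type == "Ready" and .status == "True")] | length) == 0)
      | .metadata.name'

# List cluster operators that are not Available=True / Progressing=False / Degraded=False,
# same idea as the grep -Eiv filter above.
oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'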
All right, here's an update regarding the bug. It was first encountered on an OCP on RHV cluster that I manually deployed and provided to OpenShift QE for regression testing. After that first cluster degraded in the way described in comment #1, I decided to deploy a new one. I waited a few hours after the deployment successfully finished and then checked the health of the cluster. It looked okay, see here: https://pastebin.com/5J0PmCsn

OpenShift QE then did some preparation and the cluster was left untouched for two or three days. After that, OCP QE told me that all the masters were in NotReady state:

[root@system-ge-6 ocp4_installation]# oc get nodes
NAME                       STATUS     ROLES    AGE     VERSION
six-f9hq8-master-0         NotReady   master   3d12h   v1.17.1
six-f9hq8-master-1         NotReady   master   3d12h   v1.17.1
six-f9hq8-master-2         NotReady   master   3d12h   v1.17.1
six-f9hq8-worker-0-r9rv2   Ready      worker   3d12h   v1.17.1
six-f9hq8-worker-0-zx287   Ready      worker   3d12h   v1.17.1

Xingxing Xia noticed that there was high memory usage when describing the master nodes:

[root@system-ge-6 xxia-debug]# oc describe no six-f9hq8-master-{0..2} | grep -E "(^Name|cpu.%|memory.%)"
Name:               six-f9hq8-master-0
  cpu               2612m (74%)   0 (0%)
  memory            6809Mi (99%)  512Mi (7%)
Name:               six-f9hq8-master-1
  cpu               2472m (70%)   0 (0%)
  memory            6489Mi (94%)  512Mi (7%)
Name:               six-f9hq8-master-2
  cpu               2592m (74%)   0 (0%)
  memory            6809Mi (99%)  512Mi (7%)

I could confirm that when describing the nodes with the oc CLI tool, the memory usage seemed very high:

[root@system-ge-6 ~]# ./oc describe no six-f9hq8-master-0 | grep "memory.*%"
  memory            6809Mi (99%)  512Mi (7%)

However, at the same time, when I SSHed to master-0, I could see almost 2 GB of memory available from the OS point of view:

[core@six-f9hq8-master-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       5.9Gi       145Mi        17Mi       1.7Gi       1.8Gi
Swap:            0B          0B          0B

Another interesting piece of information is that the master nodes stopped reporting at roughly 6-hour intervals from one another:

[root@system-ge-6 ~]# ./oc describe no six-f9hq8-master-{0..2} | grep "Ready"
  Type    Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----    ------    -----------------                 ------------------                ------              -------
  Ready   Unknown   Sun, 15 Mar 2020 08:09:27 +0200   Sun, 15 Mar 2020 08:15:53 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready   Unknown   Sat, 14 Mar 2020 20:24:59 +0200   Sat, 14 Mar 2020 20:26:28 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready   Unknown   Sun, 15 Mar 2020 02:11:25 +0200   Sun, 15 Mar 2020 02:16:58 +0200   NodeStatusUnknown   Kubelet stopped posting node status.

Unfortunately, the kubelet logs on the master nodes don't go far enough into the past, and running oc adm must-gather against the cluster does not work; I suspect that's because the openshift-apiserver operator is down. Therefore I couldn't gather much more info.

However, Xingxing Xia came up with another interesting insight. This is from the OCP on RHV env: pods on the master-0 node sorted by the amount of memory they request. As you can see, the kube-apiserver and etcd pods are the most demanding, as is to be expected.
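A note that may explain part of the discrepancy above: the memory percentages printed by oc describe node come from its "Allocated resources" section, i.e. the sum of pod memory requests measured against the node's allocatable memory, whereas free reports actual usage on the host. A small sketch to compare the two views side by side (node name taken from the output above; the oc adm top command needs the cluster metrics stack to be up, which may not be the case on this cluster):

NODE=six-f9hq8-master-0

# Scheduler view: requested vs. allocatable (what oc describe reports as percentages).
oc describe node "$NODE" | grep -A 8 'Allocated resources'

# Metrics view: actual CPU/memory usage, closer to what free shows on the host.
oc adm top node "$NODE"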
However, we can also see 4 openshift-ovirt-infra pods among the top consumers:

$ oc describe no six-f9hq8-master-0 | grep "%.*%.*%.*%" | sort -h -r -k 7
  openshift-kube-apiserver            kube-apiserver-six-f9hq8-master-0             180m (5%)    0 (0%)   1174Mi (17%)   0 (0%)   2d19h
  openshift-etcd                      etcd-six-f9hq8-master-0                       430m (12%)   0 (0%)   860Mi (12%)    0 (0%)   3d19h
  openshift-kube-controller-manager   kube-controller-manager-six-f9hq8-master-0    220m (6%)    0 (0%)   500Mi (7%)     0 (0%)   3d18h
  openshift-sdn                       ovs-9rx9v                                     200m (5%)    0 (0%)   400Mi (5%)     0 (0%)   3d19h
  openshift-ovirt-infra               haproxy-six-f9hq8-master-0                    200m (5%)    0 (0%)   400Mi (5%)     0 (0%)   3d19h
  openshift-sdn                       sdn-9wzg2                                     100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h
  openshift-ovirt-infra               mdns-publisher-six-f9hq8-master-0             100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h
  openshift-ovirt-infra               keepalived-six-f9hq8-master-0                 100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h
  openshift-ovirt-infra               coredns-six-f9hq8-master-0                    100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h

This is from another, non-RHV env maintained by OpenShift QE:

$ oc describe no <node_name_censored> | grep "%.*%.*%.*%" | sort -h -r -k 7
  openshift-kube-apiserver            kube-apiserver-<censored>             330m (9%)    0 (0%)   1174Mi (8%)   0 (0%)   73m
  openshift-etcd                      etcd-<censored>                       430m (12%)   0 (0%)   860Mi (6%)    0 (0%)   7h12m
  openshift-kube-controller-manager   kube-controller-manager-<censored>    220m (6%)    0 (0%)   500Mi (3%)    0 (0%)   7h
  openshift-sdn                       ovs-j6tj6                             100m (2%)    0 (0%)   400Mi (2%)    0 (0%)   7h18m

I then investigated further and found out that this set of four openshift-ovirt-infra pods runs on each of the master nodes and requests about 1 GiB of memory on each of them. I don't know if this is the root cause, but I thought I'd leave this information here in case it's useful.
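To quantify the openshift-ovirt-infra footprint per master, the memory-request column (column 7 above) can be summed per node; a rough sketch, assuming all of those requests are expressed in Mi as in the output above:

for node in six-f9hq8-master-{0..2}; do
  total=$(oc describe node "$node" \
    | grep ' openshift-ovirt-infra ' \
    | awk '{gsub("Mi", "", $7); sum += $7} END {print sum+0}')
  echo "$node: ${total} Mi requested by openshift-ovirt-infra pods"
done

On the cluster above this comes out to roughly 1000 Mi per master (400 + 200 + 200 + 200), matching the ~1 GiB estimate.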
Also, this is the current state of the cluster operators:

[root@system-ge-6 ocp4_installation]# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-03-12-052849   True        True          True       4d19h
cloud-credential                           4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
cluster-autoscaler                         4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
console                                    4.4.0-0.nightly-2020-03-12-052849   False       False         True       2d9h
csi-snapshot-controller                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      34h
dns                                        4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
etcd                                       4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
image-registry                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
ingress                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
insights                                   4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
kube-apiserver                             4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
kube-controller-manager                    4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
kube-scheduler                             4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-12-052849   True        False         False      34h
machine-api                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
machine-config                             4.4.0-0.nightly-2020-03-12-052849   False       False         True       2d15h
marketplace                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
monitoring                                 4.4.0-0.nightly-2020-03-12-052849   False       True          True       2d15h
network                                    4.4.0-0.nightly-2020-03-12-052849   True        True          True       4d20h
node-tuning                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
openshift-apiserver                        4.4.0-0.nightly-2020-03-12-052849   False       False         True       2d3h
openshift-controller-manager               4.4.0-0.nightly-2020-03-12-052849   False       True          False      2d3h
openshift-samples                          4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d19h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-12-052849   False       True          False      2d3h
service-ca                                 4.4.0-0.nightly-2020-03-12-052849   True        True          False      4d20h
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
storage                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
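Since oc adm must-gather is not working, the Degraded condition messages can still be pulled straight from the ClusterOperator objects, which may help narrow down which component to look at first; a sketch assuming jq is available:

# Print the Degraded condition message for every degraded cluster operator.
oc get co -o json \
  | jq -r '.items[]
      | .metadata.name as $name
      | .status.conditions[]
      | select(.type == "Degraded" and .status == "True")
      | "\($name): \(.message)"'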
4.4.0-0.nightly-2020-02-27-020932 is too old; a fix has already merged in the attached bug.

*** This bug has been marked as a duplicate of bug 1802687 ***