Description of problem:
A 4.4 cluster was running on RHV for several days; then one of the masters became NotReady and several cluster operators degraded.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-27-020932

How reproducible:
1 time of 1 test

Steps to Reproduce:
1. Leave a cluster running on RHV for several days.

Actual results:
One of the masters becomes NotReady and several cluster operators are degraded.

$ oc get nodes
NAME                       STATUS     ROLES    AGE   VERSION
six-9cdwv-master-0         Ready      master   11d   v1.17.1
six-9cdwv-master-1         Ready      master   11d   v1.17.1
six-9cdwv-master-2         NotReady   master   11d   v1.17.1
six-9cdwv-worker-0-9hklz   Ready      worker   11d   v1.17.1
six-9cdwv-worker-0-kvfzq   Ready      worker   11d   v1.17.1

$ oc get co | grep -Eiv 'true.*false.*false'
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                      4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
kube-apiserver            4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
kube-controller-manager   4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
kube-scheduler            4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d
machine-config            4.4.0-0.nightly-2020-02-27-020932   False       False        True       19h
monitoring                4.4.0-0.nightly-2020-02-27-020932   False       True          True       19h
network                   4.4.0-0.nightly-2020-02-27-020932   True        True          True       11d
openshift-apiserver       4.4.0-0.nightly-2020-02-27-020932   True        False         True       11d

Expected results:
The cluster is healthy/functional.

Additional info:
Still copying the logs, will attach later.
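For reference, the two checks above can be scripted; a minimal triage sketch, assuming oc is already logged in to the affected cluster and jq is available:

# List nodes whose Ready condition is not "True" (NotReady/Unknown).
oc get nodes -o json \
  | jq -r '.items[]
      | select(([.status.conditions[] | select(.type == "Ready" and .status == "True")] | length) == 0)
      | .metadata.name'

# List cluster operators that are not Available=True / Progressing=False / Degraded=False,
# same idea as the grep -Eiv filter above.
oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'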
All right, here's an update regarding the bug. It was first encountered on an OCP on RHV cluster that I manually deployed and provided to OpenShift QE for regression testing. After that first cluster degraded in the way described in comment #1, I decided to deploy a new one. I waited a few hours after the deployment successfully finished and then checked the health of the cluster. It looked okay, see here: https://pastebin.com/5J0PmCsn

OpenShift QE then did some preparation and the cluster was left untouched for two or three days. After that, OCP QE told me that all the masters were in NotReady state:

[root@system-ge-6 ocp4_installation]# oc get nodes
NAME                       STATUS     ROLES    AGE     VERSION
six-f9hq8-master-0         NotReady   master   3d12h   v1.17.1
six-f9hq8-master-1         NotReady   master   3d12h   v1.17.1
six-f9hq8-master-2         NotReady   master   3d12h   v1.17.1
six-f9hq8-worker-0-r9rv2   Ready      worker   3d12h   v1.17.1
six-f9hq8-worker-0-zx287   Ready      worker   3d12h   v1.17.1

Xingxing Xia noticed that there was high memory usage when describing the master nodes:

[root@system-ge-6 xxia-debug]# oc describe no six-f9hq8-master-{0..2} | grep -E "(^Name|cpu.%|memory.%)"
Name:               six-f9hq8-master-0
  cpu               2612m (74%)   0 (0%)
  memory            6809Mi (99%)  512Mi (7%)
Name:               six-f9hq8-master-1
  cpu               2472m (70%)   0 (0%)
  memory            6489Mi (94%)  512Mi (7%)
Name:               six-f9hq8-master-2
  cpu               2592m (74%)   0 (0%)
  memory            6809Mi (99%)  512Mi (7%)

I could confirm that when describing the nodes with the oc CLI tool, the memory usage seemed very high:

[root@system-ge-6 ~]# ./oc describe no six-f9hq8-master-0 | grep "memory.*%"
  memory            6809Mi (99%)  512Mi (7%)

However, at the same time, when I SSHed to master-0, I could see almost 2 GB of memory available from the OS point of view:

[core@six-f9hq8-master-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       5.9Gi       145Mi        17Mi       1.7Gi       1.8Gi
Swap:            0B          0B          0B

Another interesting piece of information is that the master nodes stopped reporting at roughly 6-hour intervals from one another:

[root@system-ge-6 ~]# ./oc describe no six-f9hq8-master-{0..2} | grep "Ready"
  Type    Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----    ------    -----------------                 ------------------                ------              -------
  Ready   Unknown   Sun, 15 Mar 2020 08:09:27 +0200   Sun, 15 Mar 2020 08:15:53 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready   Unknown   Sat, 14 Mar 2020 20:24:59 +0200   Sat, 14 Mar 2020 20:26:28 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready   Unknown   Sun, 15 Mar 2020 02:11:25 +0200   Sun, 15 Mar 2020 02:16:58 +0200   NodeStatusUnknown   Kubelet stopped posting node status.

Unfortunately, the kubelet logs on the master nodes don't go far enough into the past, and running oc adm must-gather against the cluster does not work; I suspect that's because the openshift-apiserver operator is down. Therefore I couldn't gather much more info.

However, Xingxing Xia came up with another interesting insight. This is from the OCP on RHV env: pods on the master-0 node sorted by the amount of memory they request. As you can see, the kube-apiserver and etcd pods are the most demanding, as is to be expected.
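A note that may explain part of the discrepancy above: the memory percentages printed by oc describe node come from its "Allocated resources" section, i.e. the sum of pod memory requests measured against the node's allocatable memory, whereas free reports actual usage on the host. A small sketch to compare the two views side by side (node name taken from the output above; the oc adm top command needs the cluster metrics stack to be up, which may not be the case on this cluster):

NODE=six-f9hq8-master-0

# Scheduler view: requested vs. allocatable (what oc describe reports as percentages).
oc describe node "$NODE" | grep -A 8 'Allocated resources'

# Metrics view: actual CPU/memory usage, closer to what free shows on the host.
oc adm top node "$NODE"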
However, we can also see 4 openshift-ovirt-infra pods among the top consumers:

$ oc describe no six-f9hq8-master-0 | grep "%.*%.*%.*%" | sort -h -r -k 7
  openshift-kube-apiserver            kube-apiserver-six-f9hq8-master-0             180m (5%)    0 (0%)   1174Mi (17%)   0 (0%)   2d19h
  openshift-etcd                      etcd-six-f9hq8-master-0                       430m (12%)   0 (0%)   860Mi (12%)    0 (0%)   3d19h
  openshift-kube-controller-manager   kube-controller-manager-six-f9hq8-master-0    220m (6%)    0 (0%)   500Mi (7%)     0 (0%)   3d18h
  openshift-sdn                       ovs-9rx9v                                     200m (5%)    0 (0%)   400Mi (5%)     0 (0%)   3d19h
  openshift-ovirt-infra               haproxy-six-f9hq8-master-0                    200m (5%)    0 (0%)   400Mi (5%)     0 (0%)   3d19h
  openshift-sdn                       sdn-9wzg2                                     100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h
  openshift-ovirt-infra               mdns-publisher-six-f9hq8-master-0             100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h
  openshift-ovirt-infra               keepalived-six-f9hq8-master-0                 100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h
  openshift-ovirt-infra               coredns-six-f9hq8-master-0                    100m (2%)    0 (0%)   200Mi (2%)     0 (0%)   3d19h

This is from another, non-RHV env maintained by OpenShift QE:

$ oc describe no <node_name_censored> | grep "%.*%.*%.*%" | sort -h -r -k 7
  openshift-kube-apiserver            kube-apiserver-<censored>             330m (9%)    0 (0%)   1174Mi (8%)   0 (0%)   73m
  openshift-etcd                      etcd-<censored>                       430m (12%)   0 (0%)   860Mi (6%)    0 (0%)   7h12m
  openshift-kube-controller-manager   kube-controller-manager-<censored>    220m (6%)    0 (0%)   500Mi (3%)    0 (0%)   7h
  openshift-sdn                       ovs-j6tj6                             100m (2%)    0 (0%)   400Mi (2%)    0 (0%)   7h18m

I then investigated further and found out that this set of four openshift-ovirt-infra pods runs on each of the master nodes and requests about 1 GiB of memory on each of them. I don't know if this is the root cause, but I thought I'd leave this information here in case it's useful.
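To quantify the openshift-ovirt-infra footprint per master, the memory-request column (column 7 above) can be summed per node; a rough sketch, assuming all of those requests are expressed in Mi as in the output above:

for node in six-f9hq8-master-{0..2}; do
  total=$(oc describe node "$node" \
    | grep ' openshift-ovirt-infra ' \
    | awk '{gsub("Mi", "", $7); sum += $7} END {print sum+0}')
  echo "$node: ${total} Mi requested by openshift-ovirt-infra pods"
done

On the cluster above this comes out to roughly 1000 Mi per master (400 + 200 + 200 + 200), matching the ~1 GiB estimate.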
Also, this is the current state of the cluster operators:

[root@system-ge-6 ocp4_installation]# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-03-12-052849   True        True          True       4d19h
cloud-credential                           4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
cluster-autoscaler                         4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
console                                    4.4.0-0.nightly-2020-03-12-052849   False       False         True       2d9h
csi-snapshot-controller                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      34h
dns                                        4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
etcd                                       4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
image-registry                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
ingress                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
insights                                   4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
kube-apiserver                             4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
kube-controller-manager                    4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
kube-scheduler                             4.4.0-0.nightly-2020-03-12-052849   True        False         True       4d20h
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-12-052849   True        False         False      34h
machine-api                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
machine-config                             4.4.0-0.nightly-2020-03-12-052849   False       False         True       2d15h
marketplace                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
monitoring                                 4.4.0-0.nightly-2020-03-12-052849   False       True          True       2d15h
network                                    4.4.0-0.nightly-2020-03-12-052849   True        True          True       4d20h
node-tuning                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
openshift-apiserver                        4.4.0-0.nightly-2020-03-12-052849   False       False         True       2d3h
openshift-controller-manager               4.4.0-0.nightly-2020-03-12-052849   False       True          False      2d3h
openshift-samples                          4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d19h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-12-052849   False       True          False      2d3h
service-ca                                 4.4.0-0.nightly-2020-03-12-052849   True        True          False      4d20h
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
storage                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      4d20h
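Since oc adm must-gather is not working, the Degraded condition messages can still be pulled straight from the ClusterOperator objects, which may help narrow down which component to look at first; a sketch assuming jq is available:

# Print the Degraded condition message for every degraded cluster operator.
oc get co -o json \
  | jq -r '.items[]
      | .metadata.name as $name
      | .status.conditions[]
      | select(.type == "Degraded" and .status == "True")
      | "\($name): \(.message)"'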
4.4.0-0.nightly-2020-02-27-020932 is too old; a fix has already merged in the attached bug.

*** This bug has been marked as a duplicate of bug 1802687 ***