There are a number of log entries like the following in the kube-controller-manager; I'm sending this over to the networking team for investigation:

utils.go:413] couldn't find ipfamilies for headless service: nfs-provisioner/nfs-provisioner. This could happen if controller manager is connected to an old apiserver that does not support ip families yet. EndpointSlices for this Service will use IPv4 as the IP Family based on familyOf(ClusterIP:172.30.78.48).
utils.go:413] couldn't find ipfamilies for headless service: openshift-kube-scheduler/scheduler. This could happen if controller manager is connected to an old apiserver that does not support ip families yet. EndpointSlices for this Service will use IPv4 as the IP Family based on familyOf(ClusterIP:172.30.184.131).
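A quick way to confirm what the controller manager is falling back on is to inspect the Service specs directly. This is only an illustrative check; the fields are standard upstream Service fields, and the namespace/name pairs are the ones from the messages above:

$ oc get svc nfs-provisioner -n nfs-provisioner -o jsonpath='{.spec.clusterIP}{" "}{.spec.ipFamilies}{"\n"}'
$ oc get svc scheduler -n openshift-kube-scheduler -o jsonpath='{.spec.clusterIP}{" "}{.spec.ipFamilies}{"\n"}'

If .spec.ipFamilies comes back empty, the EndpointSlice controller infers the family from the ClusterIP, which is exactly what the info message describes.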
The logs in #comment 2 are benign and are not even logged at "error" level; they are "info" messages and are not fatal to that component. What is fatal is this:

$ oc logs -c kube-controller-manager -p kube-controller-manager-stage-542-pxdh2-control-plane-1 -n openshift-kube-controller-manager
W0709 12:05:41.020127 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Event ended with: Internal error occurred: etcdserver: no leader
E0709 12:05:44.672416 1 cronjob_controller.go:123] Failed to extract job list: etcdserver: leader changed
E0709 12:05:44.688082 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: leader changed
I0709 12:05:45.325942 1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
E0709 12:05:45.326068 1 leaderelection.go:301] Failed to release lock: resource name may not be empty
F0709 12:05:45.326177 1 controllermanager.go:320] leaderelection lost

Moreover, I am getting a lot of etcd errors when communicating with the cluster using my `oc` client, such as:

> Error from server: etcdserver: no leader
> Error from server: etcdserver: leader changed

I am also getting:

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |"
Unable to connect to the server: dial tcp 10.0.96.211:6443: i/o timeout

I have had a look at the etcd logs and it seems host networking is bad here.

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |"
2021-07-09 04:29:08.933936 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.935325 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.935435 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.935532 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.945405 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.947084 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.948171 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a 2021-07-09 04:29:08.949152 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a 2021-07-09 04:29:09.046211 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a 2021-07-09 04:29:09.047176 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a 2021-07-09 04:29:29.626250 E | rafthttp: failed to dial 9535fda3cd024a74 on stream MsgApp v2 (peer 9535fda3cd024a74 failed to find local node 3ad773b358e20ef9) 2021-07-09 04:30:55.545541 E | rafthttp: failed to dial 18be3c63101419ff on stream MsgApp v2 (EOF) 2021-07-09 04:31:57.295725 E | rafthttp: failed to dial 18be3c63101419ff on stream Message (peer 18be3c63101419ff failed to find local node 3ad773b358e20ef9) 2021-07-09 04:31:57.307035 E | rafthttp: failed to dial 18be3c63101419ff on stream Message (peer 18be3c63101419ff failed to find local node 3ad773b358e20ef9) 2021-07-09 04:31:57.308150 E | rafthttp: failed to dial 18be3c63101419ff on stream MsgApp v2 (peer 18be3c63101419ff failed to find local node 3ad773b358e20ef9) 2021-07-09 04:32:03.964732 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:34316->10.0.98.72:2380: i/o timeout) 2021-07-09 04:32:04.043822 E | rafthttp: failed to 
read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:34318->10.0.98.72:2380: i/o timeout) 2021-07-09 04:32:04.742957 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:34538->10.0.97.202:2380: i/o timeout) 2021-07-09 04:32:58.426070 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:34524->10.0.98.72:2380: i/o timeout) 2021-07-09 04:32:58.750314 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:34534->10.0.98.72:2380: i/o timeout) 2021-07-09 04:35:15.265285 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:53896->10.0.97.202:2380: i/o timeout) 2021-07-09 04:35:15.441075 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:35260->10.0.98.72:2380: i/o timeout) 2021-07-09 04:35:15.711267 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:35268->10.0.98.72:2380: i/o timeout) 2021-07-09 04:35:53.876677 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:36868->10.0.98.72:2380: i/o timeout) 2021-07-09 04:35:54.044087 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:36890->10.0.98.72:2380: i/o timeout) 2021-07-09 04:36:26.109441 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:53894->10.0.97.202:2380: i/o timeout) 2021-07-09 04:36:26.127098 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:56190->10.0.97.202:2380: i/o timeout) 2021-07-09 04:38:27.107773 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:37416->10.0.98.72:2380: i/o timeout) 2021-07-09 04:39:00.495644 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:40668->10.0.98.72:2380: i/o timeout) 2021-07-09 04:39:00.710517 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:37420->10.0.98.72:2380: i/o timeout) 2021-07-09 04:39:18.375294 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:41840->10.0.98.72:2380: i/o timeout) 2021-07-09 04:40:36.300887 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:57316->10.0.97.202:2380: i/o timeout) 2021-07-09 04:40:36.409797 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:57314->10.0.97.202:2380: i/o timeout) 2021-07-09 04:40:45.964761 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:42066->10.0.98.72:2380: i/o timeout) 2021-07-09 04:40:58.425899 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:43068->10.0.98.72:2380: i/o timeout) 2021-07-09 04:41:41.098213 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:43202->10.0.98.72:2380: i/o timeout) 2021-07-09 04:41:41.187260 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:41864->10.0.98.72:2380: i/o timeout) 2021-07-09 04:43:49.043777 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:44554->10.0.98.72:2380: i/o timeout) 2021-07-09 04:43:57.829097 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:45866->10.0.98.72:2380: i/o timeout) 2021-07-09 04:43:57.837113 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:34060->10.0.97.202:2380: i/o timeout) 
2021-07-09 04:43:58.076222 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:34062->10.0.97.202:2380: i/o timeout) 2021-07-09 04:44:28.151754 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:37042->10.0.97.202:2380: i/o timeout) 2021-07-09 04:44:42.712540 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:37388->10.0.97.202:2380: i/o timeout) 2021-07-09 04:44:42.813469 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:45948->10.0.98.72:2380: i/o timeout) 2021-07-09 04:45:25.864828 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:46522->10.0.98.72:2380: i/o timeout) 2021-07-09 04:45:25.865599 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:37612->10.0.97.202:2380: i/o timeout) 2021-07-09 04:45:26.409536 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:37048->10.0.97.202:2380: i/o timeout) 2021-07-09 04:46:55.326730 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:38206->10.0.97.202:2380: i/o timeout) 2021-07-09 04:47:31.612182 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:47112->10.0.98.72:2380: i/o timeout) 2021-07-09 04:47:42.627019 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:39186->10.0.97.202:2380: i/o timeout) 2021-07-09 04:47:43.076385 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:38214->10.0.97.202:2380: i/o timeout) 2021-07-09 04:48:14.780419 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:48524->10.0.98.72:2380: i/o timeout) 2021-07-09 04:48:14.865003 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:39730->10.0.97.202:2380: i/o timeout) 2021-07-09 04:48:22.041452 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:45868->10.0.98.72:2380: i/o timeout) 2021-07-09 04:48:22.126519 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:49036->10.0.98.72:2380: i/o timeout) 2021-07-09 04:50:46.126766 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:40132->10.0.97.202:2380: i/o timeout) 2021-07-09 04:50:46.409398 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:39736->10.0.97.202:2380: i/o timeout) 2021-07-09 04:51:04.063458 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:49122->10.0.98.72:2380: i/o timeout) 2021-07-09 04:51:04.388521 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:41808->10.0.97.202:2380: i/o timeout) 2021-07-09 04:52:17.587898 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:42088->10.0.97.202:2380: i/o timeout) 2021-07-09 04:52:18.075816 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:41810->10.0.97.202:2380: i/o timeout) This has nothing to do with openshift-sdn, since this is host networking. openshift-sdn is also getting the same errors as the kube-controller-manager. I am trying to figure out what might break host networking on this cluster, but it's a bit of a "needle in a haystack" at this point. And no team owns host networking, so there is no one dedicated to investigating such problems.
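To narrow the haystack a little, two checks that tend to be useful here are mapping the member IDs from the rafthttp errors (18be3c63101419ff, 9535fda3cd024a74, 3ad773b358e20ef9) to actual members, and probing the raft peer port directly between the control-plane hosts. The commands below are only a sketch: they assume the usual etcdctl container in the etcd pods with its environment already pointed at the local member, and they reuse the pod/node names and peer IPs from the logs above.

$ oc exec -n openshift-etcd -c etcdctl etcd-stage-542-pxdh2-control-plane-2 -- etcdctl member list -w table

# Probe the peer port (2380) host-to-host, bypassing the pod network entirely
$ oc debug node/stage-542-pxdh2-control-plane-2 -- chroot /host timeout 5 bash -c 'echo > /dev/tcp/10.0.98.72/2380 && echo reachable'
$ oc debug node/stage-542-pxdh2-control-plane-2 -- chroot /host timeout 5 bash -c 'echo > /dev/tcp/10.0.97.202/2380 && echo reachable'

If the raw TCP connect is flaky while etcdctl still shows all three members, that points at the underlying host network or hypervisor rather than at etcd configuration.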
Well, I am also seeing a bunch of 2021-07-09 04:53:30.130641 W | etcdserver: read-only range request "key:\"/kubernetes.io/namespaces/default\" " with result "error:etcdserver: leader changed" took too long (5.271164223s) to execute 2021-07-09 04:53:30.130760 W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-config-managed/\" range_end:\"/kubernetes.io/secrets/openshift-config-managed0\" " with result "error:etcdserver: leader changed" took too long (5.287200807s) to execute 2021-07-09 04:53:30.130817 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took too long (5.28908571s) to execute 2021-07-09 04:53:30.130857 W | etcdserver: read-only range request "key:\"/kubernetes.io/services/endpoints/nfs-provisioner/example.com-nfs\" " with result "error:etcdserver: leader changed" took too long (5.465696086s) to execute 2021-07-09 04:53:30.130917 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-etcd/revision-pruner-3-stage-542-pxdh2-control-plane-1\" " with result "error:etcdserver: leader changed" took too long (5.525643307s) to execute 2021-07-09 04:53:30.130958 W | etcdserver: read-only range request "key:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshots/\" range_end:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshots0\" count_only:true " with result "error:etcdserver: leader changed" took too long (5.775741763s) to execute 2021-07-09 04:53:30.131004 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-apiserver/kube-apiserver-stage-542-pxdh2-control-plane-1\" " with result "error:etcdserver: leader changed" took too long (5.88763832s) to execute 2021-07-09 04:53:30.131036 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/openshift-authentication/oauth-openshift\" " with result "error:etcdserver: leader changed" took too long (5.897496644s) to execute 2021-07-09 04:53:30.131072 W | etcdserver: read-only range request "key:\"/openshift.io/routes/\" range_end:\"/openshift.io/routes0\" count_only:true " with result "error:etcdserver: leader changed" took too long (5.904803043s) to execute 2021-07-09 04:53:30.131101 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:etcdserver: leader changed" took too long (5.966481794s) to execute 2021-07-09 04:53:30.131130 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/kube-system/kube-controller-manager\" " with result "error:etcdserver: leader changed" took too long (5.984772551s) to execute 2021-07-09 04:53:30.131164 W | etcdserver: read-only range request "key:\"/kubernetes.io/namespaces/openshift-oauth-apiserver\" " with result "error:etcdserver: leader changed" took too long (6.097365556s) to execute 2021-07-09 04:53:30.131197 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-oauth-apiserver/\" range_end:\"/kubernetes.io/pods/openshift-oauth-apiserver0\" " with result "error:etcdserver: leader changed" took too long (6.270580896s) to execute 2021-07-09 04:53:30.131262 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-authentication/v4-0-config-system-cliconfig\" " with result "error:etcdserver: leader changed" took too long (6.270774531s) to execute 2021-07-09 04:53:30.131292 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/clustercsidrivers/\" 
range_end:\"/kubernetes.io/operator.openshift.io/clustercsidrivers0\" count_only:true " with result "error:etcdserver: leader changed" took too long (6.374487946s) to execute 2021-07-09 04:53:30.131319 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-oauth-apiserver/apiserver-58c5cff7cb-gltbt\" " with result "error:etcdserver: leader changed" took too long (6.401598016s) to execute 2021-07-09 04:53:30.131349 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/openshiftcontrollermanagers/cluster\" " with result "error:etcdserver: leader changed" took too long (6.401725823s) to execute 2021-07-09 04:53:30.131380 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/telemeter-client\" " with result "error:etcdserver: leader changed" took too long (6.401771618s) to execute 2021-07-09 04:53:30.131459 W | etcdserver: read-only range request "key:\"/kubernetes.io/operators.coreos.com/subscriptions/openshift-kubevirt-infra/\" range_end:\"/kubernetes.io/operators.coreos.com/subscriptions/openshift-kubevirt-infra0\" " with result "error:etcdserver: leader changed" took too long (6.433189998s) to execute 2021-07-09 04:53:30.131492 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-apiserver/revision-pruner-9-stage-542-pxdh2-control-plane-2\" " with result "error:etcdserver: leader changed" took too long (6.457181046s) to execute in the etcd logs. Moreover, I've had a look at the i/o timeout errors seen on etcd-stage-542-pxdh2-control-plane-2 and tried to corrolate that to the "other end", ex: $ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |" 2021-07-09 04:35:15.265285 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:53896->10.0.97.202:2380: i/o timeout) $ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-1 -n openshift-etcd | less 2021-07-09 04:35:15.277224 W | rafthttp: closed an existing TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer) 2021-07-09 04:35:15.277266 I | rafthttp: established a TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer) 2021-07-09 04:35:15.364395 W | rafthttp: lost the TCP streaming connection with peer 3ad773b358e20ef9 (stream Message reader) Also, $ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |" 2021-07-09 04:40:36.409797 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:57314->10.0.97.202:2380: i/o timeout) $ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-1 -n openshift-etcd | less raft2021/07/09 04:40:32 INFO: found conflict at index 35253 [existing term: 99, conflicting term: 100] raft2021/07/09 04:40:32 INFO: replace the unstable entries from index 35253 raft2021/07/09 04:40:32 INFO: raft.node: 9535fda3cd024a74 elected leader 3ad773b358e20ef9 at term 100 2021-07-09 04:40:32.768861 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:etcdserver: leader changed" took too long (1.309121912s) to execute 2021-07-09 04:40:32.806872 I | embed: rejected connection from "10.0.97.202:37774" (error "EOF", ServerName "") 2021-07-09 04:40:36.310474 W | rafthttp: closed an existing TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer) 2021-07-09 04:40:36.310500 I | rafthttp: established a TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer) 2021-07-09 04:40:36.384876 W | 
rafthttp: lost the TCP streaming connection with peer 18be3c63101419ff (stream Message reader) 2021-07-09 04:40:36.385019 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.97.202:45774->10.0.98.72:2380: i/o timeout) 2021-07-09 04:40:36.385044 I | rafthttp: peer 18be3c63101419ff became inactive (message send to peer failed) 2021-07-09 04:40:36.392982 I | rafthttp: peer 18be3c63101419ff became active 2021-07-09 04:40:36.393147 I | rafthttp: established a TCP streaming connection with peer 18be3c63101419ff (stream Message reader) 2021-07-09 04:40:36.394371 W | rafthttp: closed an existing TCP streaming connection with peer 18be3c63101419ff (stream Message writer) 2021-07-09 04:40:36.394394 I | rafthttp: established a TCP streaming connection with peer 18be3c63101419ff (stream Message writer) I am wondering if these nodes aren't starved of resources actually, given every problem these components seems to be indicating..... Not sure what OpenShift requires, but here's the memory output from node/stage-542-pxdh2-control-plane-1 sh-4.4# cat /proc/meminfo MemTotal: 16409920 kB MemFree: 1011400 kB MemAvailable: 7619488 kB Buffers: 1080 kB Cached: 6614072 kB SwapCached: 0 kB Active: 8879456 kB Inactive: 5080912 kB Active(anon): 7370952 kB Inactive(anon): 47000 kB Active(file): 1508504 kB Inactive(file): 5033912 kB Unevictable: 74328 kB Mlocked: 74328 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 1768 kB Writeback: 0 kB AnonPages: 6240412 kB Mapped: 1517636 kB Shmem: 57108 kB KReclaimable: 403396 kB Slab: 1091016 kB SReclaimable: 403396 kB SUnreclaim: 687620 kB KernelStack: 25168 kB PageTables: 49576 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 8204960 kB Committed_AS: 21091796 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB Percpu: 28288 kB HardwareCorrupted: 0 kB AnonHugePages: 2488320 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 0 kB DirectMap4k: 737120 kB DirectMap2M: 16039936 kB DirectMap1G: 2097152 kB sh-4.4# free total used free shared buff/cache available Mem: 16409920 8665292 718044 57272 7026584 7334008 Swap: 0 0 0
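Since resource starvation is on the table, a couple of quick checks go a bit further than a single meminfo snapshot. These are standard oc/kubectl commands and are illustrative only (node name reused from above):

$ oc adm top nodes
$ oc adm top pods -n openshift-etcd --containers

# On the node itself: load average, memory, and any OOM-killer activity
$ oc debug node/stage-542-pxdh2-control-plane-1 -- chroot /host bash -c 'uptime; free -m; dmesg -T | grep -iE "oom|out of memory" | tail -n 20'

If the nodes show sustained heavy load or OOM-killer activity, that would line up with the missed leases; if not, the disk and network paths remain the main suspects.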
Based on the above and below I am sending this bug to etcd for verification of the slow read operations I am seeing in the logs: $ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-1 -n openshift-etcd | less 2021-07-09 13:44:08.835780 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-apiserver-operator/kube-apiserver-operator-lock\" " with result "error:etcdserver: request timed out" took too long (7.314632493s) to execute 2021-07-09 13:44:08.835849 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumes/\" range_end:\"/kubernetes.io/persistentvolumes0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.382316227s) to execute 2021-07-09 13:44:08.835891 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/\" range_end:\"/kubernetes.io/pods0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.393195589s) to execute 2021-07-09 13:44:08.835945 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/prometheusrules/\" range_end:\"/kubernetes.io/monitoring.coreos.com/prometheusrules0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395178602s) to execute 2021-07-09 13:44:08.835995 W | etcdserver: read-only range request "key:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshotcontents/\" range_end:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshotcontents0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395384965s) to execute 2021-07-09 13:44:08.836055 W | etcdserver: read-only range request "key:\"/kubernetes.io/migration.k8s.io/storageversionmigrations/\" range_end:\"/kubernetes.io/migration.k8s.io/storageversionmigrations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.39545702s) to execute 2021-07-09 13:44:08.836112 W | etcdserver: read-only range request "key:\"/kubernetes.io/cronjobs/\" range_end:\"/kubernetes.io/cronjobs0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395538946s) to execute 2021-07-09 13:44:08.836148 W | etcdserver: read-only range request "key:\"/kubernetes.io/roles/\" range_end:\"/kubernetes.io/roles0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395580861s) to execute 2021-07-09 13:44:08.836181 W | etcdserver: read-only range request "key:\"/kubernetes.io/fileintegrity.openshift.io/fileintegritynodestatuses/\" range_end:\"/kubernetes.io/fileintegrity.openshift.io/fileintegritynodestatuses0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395642367s) to execute 2021-07-09 13:44:08.836213 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/\" range_end:\"/kubernetes.io/monitoring.coreos.com/servicemonitors0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.39568878s) to execute 2021-07-09 13:44:08.836259 W | etcdserver: read-only range request "key:\"/kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/\" range_end:\"/kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395763294s) to execute NOTE: I am not sure that (host) networking is not the culprit here, it might very well be. But those logs look suspicious and might hint at disk issues for etcd. 
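If disk is the suspect, one cheap way to get a number that is independent of etcd is the commonly used fio fsync benchmark run against the same filesystem as the etcd data directory. This is only a sketch and makes assumptions: that etcd data lives under /var/lib/etcd on the host, that fio is available there (it is not part of RHCOS by default, so it may need to be run from a tools container against the same mount), and that writing ~22MB of test data is acceptable on this staging cluster:

$ oc debug node/stage-542-pxdh2-control-plane-1 -- chroot /host bash -c 'mkdir -p /var/lib/etcd/fio-test && fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-test; rm -rf /var/lib/etcd/fio-test'

The interesting part of the output is the fsync/fdatasync latency percentiles: if the 99th percentile is in the tens of milliseconds or worse, the "request timed out" / "took too long" messages are better explained by storage than by the network.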
I am constantly getting etcd errors when communicating with this cluster, similar to what is detailed in #comment 3. I am also asking QE whether things were working up until a certain point when an operator was deployed that might have degraded the cluster: https://coreos.slack.com/archives/CH76YSYSC/p1625841459106900?thread_ts=1625835561.095200&cid=CH76YSYSC
** WARNING ** This BZ claims that this bug is of urgent severity and priority. Note that urgent priority means you have just declared an emergency within engineering. Engineers are asked to stop whatever they are doing, including putting important release work on hold and potentially risking the release of OCP, while working on this case. Be prepared to have a good justification ready, and make sure your own and engineering management are aware and have approved this. Urgent bugs are very expensive and have maximal management visibility. NOTE: This bug was assigned to an engineering manager with severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1980659#c9, I am unsetting the blocker status. The failure condition that was reproduced in 4.7.19 appears to be an infrastructure issue.
Created attachment 1803462 [details] p99 etcd_network_peer_round_trip_time_seconds_bucket
Created attachment 1803463 [details] p99 etcd_disk_wal_fsync_duration_seconds_bucket
Created attachment 1803464 [details] leader elections
Although the infra instances were reported to have been updated to use local NVMe, disk latency is still very high: p99 fsync duration is hovering around 100ms and spiking to almost 1s, and the cluster is in a constant leader-election loop. Storage will need to be improved before this test can be used as a valid signal.
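For reference, the usual guidance is that p99 WAL fsync latency should stay below roughly 10ms, so ~100ms with 1s spikes is far outside the envelope and constant re-elections are the expected symptom. A sketch of how the same percentiles can be pulled from the in-cluster Prometheus, assuming a local port-forward to the prometheus-k8s pod works in this environment (the metric names are the ones shown in the attachments above):

$ oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &
$ curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
$ curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))'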
There is nothing the OpenShift on OpenStack team can do. Please log a support ticket on PSI's helpdesk. Closing, as all the evidence points to an infra issue.
To summarize what we have done since: IT has given us a new profile, "ocp4-master-ssd", and with this profile we now have clusters running on the same Flexy template that don't have the etcd problem. However, when we run the staging pipeline, the etcd leader changes / etcd errors appear again.
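Since the leader changes only show up while the staging pipeline is running, it may be worth capturing how many elections each member sees during a run. Illustrative only; the app=etcd label selector and the etcdctl container match what current openshift-etcd pods use, but may differ by version:

$ for p in $(oc get pods -n openshift-etcd -l app=etcd -o name); do echo "== $p"; oc logs -c etcd -n openshift-etcd "$p" | grep -c "elected leader"; done

$ oc exec -n openshift-etcd -c etcdctl etcd-stage-542-pxdh2-control-plane-1 -- etcdctl endpoint status --cluster -w table

In the endpoint status table, the RAFT TERM column increments on every election, so comparing it before and after a pipeline run gives a rough count of how many leader changes the workload triggered.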