Bug 1980659 - several operators are crashlooping when communicating with the kube-apiserver, among other errors due to: Internal error occurred: etcdserver: no leader
Summary: several operators are crashlooping when communicating with the kube-apiserver...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Pierre Prinetti
QA Contact: Jon Uriarte
URL:
Whiteboard: EmergencyRequest
Depends On:
Blocks:
 
Reported: 2021-07-09 07:21 UTC by ge liu
Modified: 2021-08-16 18:25 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-21 07:48:01 UTC
Target Upstream Version:
Embargoed:


Attachments
p99 etcd_network_peer_round_trip_time_seconds_bucket (155.51 KB, image/png), 2021-07-20 01:31 UTC, Sam Batschelet
p99 etcd_disk_wal_fsync_duration_seconds_bucket (84.10 KB, image/png), 2021-07-20 01:32 UTC, Sam Batschelet
leader elections (24.22 KB, image/png), 2021-07-20 01:34 UTC, Sam Batschelet

Comment 2 Maciej Szulik 2021-07-09 10:43:44 UTC
There are a bunch of log entries like the ones below in the kube-controller-manager; I'm sending this over to the networking team for investigation:

utils.go:413] couldn't find ipfamilies for headless service: nfs-provisioner/nfs-provisioner. This could happen if controller manager is connected to an old apiserver that does not support ip families yet. EndpointSlices for this Service will use IPv4 as the IP Family based on familyOf(ClusterIP:172.30.78.48).                                                                                                                                                                              

utils.go:413] couldn't find ipfamilies for headless service: openshift-kube-scheduler/scheduler. This could happen if controller manager is connected to an old apiserver that does not support ip families yet. EndpointSlices for this Service will use IPv4 as the IP Family based on familyOf(ClusterIP:172.30.184.131).

Comment 3 Alexander Constantinescu 2021-07-09 12:20:42 UTC
The logs in comment 2 are benign; they aren't even logged as "error" but as "info", and they are not fatal to that component. What is fatal is:

$ oc logs -c kube-controller-manager -p  kube-controller-manager-stage-542-pxdh2-control-plane-1 -n openshift-kube-controller-manager
W0709 12:05:41.020127       1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Event ended with: Internal error occurred: etcdserver: no leader
E0709 12:05:44.672416       1 cronjob_controller.go:123] Failed to extract job list: etcdserver: leader changed
E0709 12:05:44.688082       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: leader changed
I0709 12:05:45.325942       1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
E0709 12:05:45.326068       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
F0709 12:05:45.326177       1 controllermanager.go:320] leaderelection lost

Moreover, I am getting a lot of etcd errors when communicating with the cluster using my `oc` client. Errors such as:

> Error from server: etcdserver: no leader

> Error from server: etcdserver: leader changed

I am also getting:

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |"
Unable to connect to the server: dial tcp 10.0.96.211:6443: i/o timeout

I have had a look at the etcd logs, and it seems host networking is unhealthy here.

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |"
2021-07-09 04:29:08.933936 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.935325 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.935435 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.935532 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.945405 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.947084 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.948171 E | rafthttp: failed to find member 9535fda3cd024a74 in cluster 8fd4a992706abc3a
2021-07-09 04:29:08.949152 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a
2021-07-09 04:29:09.046211 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a
2021-07-09 04:29:09.047176 E | rafthttp: failed to find member 18be3c63101419ff in cluster 8fd4a992706abc3a
2021-07-09 04:29:29.626250 E | rafthttp: failed to dial 9535fda3cd024a74 on stream MsgApp v2 (peer 9535fda3cd024a74 failed to find local node 3ad773b358e20ef9)
2021-07-09 04:30:55.545541 E | rafthttp: failed to dial 18be3c63101419ff on stream MsgApp v2 (EOF)
2021-07-09 04:31:57.295725 E | rafthttp: failed to dial 18be3c63101419ff on stream Message (peer 18be3c63101419ff failed to find local node 3ad773b358e20ef9)
2021-07-09 04:31:57.307035 E | rafthttp: failed to dial 18be3c63101419ff on stream Message (peer 18be3c63101419ff failed to find local node 3ad773b358e20ef9)
2021-07-09 04:31:57.308150 E | rafthttp: failed to dial 18be3c63101419ff on stream MsgApp v2 (peer 18be3c63101419ff failed to find local node 3ad773b358e20ef9)
2021-07-09 04:32:03.964732 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:34316->10.0.98.72:2380: i/o timeout)
2021-07-09 04:32:04.043822 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:34318->10.0.98.72:2380: i/o timeout)
2021-07-09 04:32:04.742957 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:34538->10.0.97.202:2380: i/o timeout)
2021-07-09 04:32:58.426070 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:34524->10.0.98.72:2380: i/o timeout)
2021-07-09 04:32:58.750314 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:34534->10.0.98.72:2380: i/o timeout)
2021-07-09 04:35:15.265285 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:53896->10.0.97.202:2380: i/o timeout)
2021-07-09 04:35:15.441075 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:35260->10.0.98.72:2380: i/o timeout)
2021-07-09 04:35:15.711267 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:35268->10.0.98.72:2380: i/o timeout)
2021-07-09 04:35:53.876677 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:36868->10.0.98.72:2380: i/o timeout)
2021-07-09 04:35:54.044087 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:36890->10.0.98.72:2380: i/o timeout)
2021-07-09 04:36:26.109441 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:53894->10.0.97.202:2380: i/o timeout)
2021-07-09 04:36:26.127098 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:56190->10.0.97.202:2380: i/o timeout)
2021-07-09 04:38:27.107773 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:37416->10.0.98.72:2380: i/o timeout)
2021-07-09 04:39:00.495644 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:40668->10.0.98.72:2380: i/o timeout)
2021-07-09 04:39:00.710517 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:37420->10.0.98.72:2380: i/o timeout)
2021-07-09 04:39:18.375294 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:41840->10.0.98.72:2380: i/o timeout)
2021-07-09 04:40:36.300887 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:57316->10.0.97.202:2380: i/o timeout)
2021-07-09 04:40:36.409797 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:57314->10.0.97.202:2380: i/o timeout)
2021-07-09 04:40:45.964761 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:42066->10.0.98.72:2380: i/o timeout)
2021-07-09 04:40:58.425899 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:43068->10.0.98.72:2380: i/o timeout)
2021-07-09 04:41:41.098213 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:43202->10.0.98.72:2380: i/o timeout)
2021-07-09 04:41:41.187260 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:41864->10.0.98.72:2380: i/o timeout)
2021-07-09 04:43:49.043777 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:44554->10.0.98.72:2380: i/o timeout)
2021-07-09 04:43:57.829097 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:45866->10.0.98.72:2380: i/o timeout)
2021-07-09 04:43:57.837113 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:34060->10.0.97.202:2380: i/o timeout)
2021-07-09 04:43:58.076222 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:34062->10.0.97.202:2380: i/o timeout)
2021-07-09 04:44:28.151754 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:37042->10.0.97.202:2380: i/o timeout)
2021-07-09 04:44:42.712540 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:37388->10.0.97.202:2380: i/o timeout)
2021-07-09 04:44:42.813469 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:45948->10.0.98.72:2380: i/o timeout)
2021-07-09 04:45:25.864828 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:46522->10.0.98.72:2380: i/o timeout)
2021-07-09 04:45:25.865599 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:37612->10.0.97.202:2380: i/o timeout)
2021-07-09 04:45:26.409536 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:37048->10.0.97.202:2380: i/o timeout)
2021-07-09 04:46:55.326730 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:38206->10.0.97.202:2380: i/o timeout)
2021-07-09 04:47:31.612182 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:47112->10.0.98.72:2380: i/o timeout)
2021-07-09 04:47:42.627019 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:39186->10.0.97.202:2380: i/o timeout)
2021-07-09 04:47:43.076385 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:38214->10.0.97.202:2380: i/o timeout)
2021-07-09 04:48:14.780419 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:48524->10.0.98.72:2380: i/o timeout)
2021-07-09 04:48:14.865003 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:39730->10.0.97.202:2380: i/o timeout)
2021-07-09 04:48:22.041452 E | rafthttp: failed to read 18be3c63101419ff on stream MsgApp v2 (read tcp 10.0.99.65:45868->10.0.98.72:2380: i/o timeout)
2021-07-09 04:48:22.126519 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:49036->10.0.98.72:2380: i/o timeout)
2021-07-09 04:50:46.126766 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:40132->10.0.97.202:2380: i/o timeout)
2021-07-09 04:50:46.409398 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:39736->10.0.97.202:2380: i/o timeout)
2021-07-09 04:51:04.063458 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.99.65:49122->10.0.98.72:2380: i/o timeout)
2021-07-09 04:51:04.388521 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:41808->10.0.97.202:2380: i/o timeout)
2021-07-09 04:52:17.587898 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:42088->10.0.97.202:2380: i/o timeout)
2021-07-09 04:52:18.075816 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:41810->10.0.97.202:2380: i/o timeout)

This has nothing to do with openshift-sdn, since this is host networking. openshift-sdn is also getting the same errors as the kube-controller-manager. 

I am trying to figure out what might be breaking host networking on this cluster, but it's a bit of a "needle in a haystack" at this point. And no team owns host networking, so there is no one dedicated to investigating such problems.
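
As a cross-check on the "failed to find member" errors above, it can help to compare the membership view each etcd instance reports. A minimal sketch, assuming the etcd static pods ship an etcdctl container with the client environment pre-configured (as recent 4.x releases do) and using the pod names seen in this bug:

$ for pod in etcd-stage-542-pxdh2-control-plane-1 etcd-stage-542-pxdh2-control-plane-2; do
    echo "== $pod =="
    # Each member should report the same member IDs and peer URLs.
    oc exec -n openshift-etcd -c etcdctl "$pod" -- etcdctl member list -w table
  done

If the member IDs from the rafthttp errors (18be3c63101419ff, 9535fda3cd024a74) are missing or inconsistent between members, that would point at a membership problem rather than plain packet loss.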

Comment 4 Alexander Constantinescu 2021-07-09 13:34:46 UTC
Well, I am also seeing a bunch of 

2021-07-09 04:53:30.130641 W | etcdserver: read-only range request "key:\"/kubernetes.io/namespaces/default\" " with result "error:etcdserver: leader changed" took too long (5.271164223s) to execute
2021-07-09 04:53:30.130760 W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-config-managed/\" range_end:\"/kubernetes.io/secrets/openshift-config-managed0\" " with result "error:etcdserver: leader changed" took too long (5.287200807s) to execute
2021-07-09 04:53:30.130817 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took too long (5.28908571s) to execute
2021-07-09 04:53:30.130857 W | etcdserver: read-only range request "key:\"/kubernetes.io/services/endpoints/nfs-provisioner/example.com-nfs\" " with result "error:etcdserver: leader changed" took too long (5.465696086s) to execute
2021-07-09 04:53:30.130917 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-etcd/revision-pruner-3-stage-542-pxdh2-control-plane-1\" " with result "error:etcdserver: leader changed" took too long (5.525643307s) to execute
2021-07-09 04:53:30.130958 W | etcdserver: read-only range request "key:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshots/\" range_end:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshots0\" count_only:true " with result "error:etcdserver: leader changed" took too long (5.775741763s) to execute
2021-07-09 04:53:30.131004 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-apiserver/kube-apiserver-stage-542-pxdh2-control-plane-1\" " with result "error:etcdserver: leader changed" took too long (5.88763832s) to execute
2021-07-09 04:53:30.131036 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/openshift-authentication/oauth-openshift\" " with result "error:etcdserver: leader changed" took too long (5.897496644s) to execute
2021-07-09 04:53:30.131072 W | etcdserver: read-only range request "key:\"/openshift.io/routes/\" range_end:\"/openshift.io/routes0\" count_only:true " with result "error:etcdserver: leader changed" took too long (5.904803043s) to execute
2021-07-09 04:53:30.131101 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:etcdserver: leader changed" took too long (5.966481794s) to execute
2021-07-09 04:53:30.131130 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/kube-system/kube-controller-manager\" " with result "error:etcdserver: leader changed" took too long (5.984772551s) to execute
2021-07-09 04:53:30.131164 W | etcdserver: read-only range request "key:\"/kubernetes.io/namespaces/openshift-oauth-apiserver\" " with result "error:etcdserver: leader changed" took too long (6.097365556s) to execute
2021-07-09 04:53:30.131197 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-oauth-apiserver/\" range_end:\"/kubernetes.io/pods/openshift-oauth-apiserver0\" " with result "error:etcdserver: leader changed" took too long (6.270580896s) to execute
2021-07-09 04:53:30.131262 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-authentication/v4-0-config-system-cliconfig\" " with result "error:etcdserver: leader changed" took too long (6.270774531s) to execute
2021-07-09 04:53:30.131292 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/clustercsidrivers/\" range_end:\"/kubernetes.io/operator.openshift.io/clustercsidrivers0\" count_only:true " with result "error:etcdserver: leader changed" took too long (6.374487946s) to execute
2021-07-09 04:53:30.131319 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-oauth-apiserver/apiserver-58c5cff7cb-gltbt\" " with result "error:etcdserver: leader changed" took too long (6.401598016s) to execute
2021-07-09 04:53:30.131349 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/openshiftcontrollermanagers/cluster\" " with result "error:etcdserver: leader changed" took too long (6.401725823s) to execute
2021-07-09 04:53:30.131380 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/telemeter-client\" " with result "error:etcdserver: leader changed" took too long (6.401771618s) to execute
2021-07-09 04:53:30.131459 W | etcdserver: read-only range request "key:\"/kubernetes.io/operators.coreos.com/subscriptions/openshift-kubevirt-infra/\" range_end:\"/kubernetes.io/operators.coreos.com/subscriptions/openshift-kubevirt-infra0\" " with result "error:etcdserver: leader changed" took too long (6.433189998s) to execute
2021-07-09 04:53:30.131492 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-apiserver/revision-pruner-9-stage-542-pxdh2-control-plane-2\" " with result "error:etcdserver: leader changed" took too long (6.457181046s) to execute

in the etcd logs. 

Moreover, I've had a look at the i/o timeout errors seen on etcd-stage-542-pxdh2-control-plane-2 and tried to correlate them with the "other end", e.g.:

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |"
2021-07-09 04:35:15.265285 E | rafthttp: failed to read 9535fda3cd024a74 on stream Message (read tcp 10.0.99.65:53896->10.0.97.202:2380: i/o timeout)

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-1 -n openshift-etcd | less
2021-07-09 04:35:15.277224 W | rafthttp: closed an existing TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer)
2021-07-09 04:35:15.277266 I | rafthttp: established a TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer)
2021-07-09 04:35:15.364395 W | rafthttp: lost the TCP streaming connection with peer 3ad773b358e20ef9 (stream Message reader)

Also, 

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-2 -n openshift-etcd | grep "E |"
2021-07-09 04:40:36.409797 E | rafthttp: failed to read 9535fda3cd024a74 on stream MsgApp v2 (read tcp 10.0.99.65:57314->10.0.97.202:2380: i/o timeout)

$ oc logs -c etcd etcd-stage-542-pxdh2-control-plane-1 -n openshift-etcd | less
raft2021/07/09 04:40:32 INFO: found conflict at index 35253 [existing term: 99, conflicting term: 100]
raft2021/07/09 04:40:32 INFO: replace the unstable entries from index 35253
raft2021/07/09 04:40:32 INFO: raft.node: 9535fda3cd024a74 elected leader 3ad773b358e20ef9 at term 100
2021-07-09 04:40:32.768861 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:etcdserver: leader changed" took too long (1.309121912s) to execute
2021-07-09 04:40:32.806872 I | embed: rejected connection from "10.0.97.202:37774" (error "EOF", ServerName "")
2021-07-09 04:40:36.310474 W | rafthttp: closed an existing TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer)
2021-07-09 04:40:36.310500 I | rafthttp: established a TCP streaming connection with peer 3ad773b358e20ef9 (stream Message writer)
2021-07-09 04:40:36.384876 W | rafthttp: lost the TCP streaming connection with peer 18be3c63101419ff (stream Message reader)
2021-07-09 04:40:36.385019 E | rafthttp: failed to read 18be3c63101419ff on stream Message (read tcp 10.0.97.202:45774->10.0.98.72:2380: i/o timeout)
2021-07-09 04:40:36.385044 I | rafthttp: peer 18be3c63101419ff became inactive (message send to peer failed)
2021-07-09 04:40:36.392982 I | rafthttp: peer 18be3c63101419ff became active
2021-07-09 04:40:36.393147 I | rafthttp: established a TCP streaming connection with peer 18be3c63101419ff (stream Message reader)
2021-07-09 04:40:36.394371 W | rafthttp: closed an existing TCP streaming connection with peer 18be3c63101419ff (stream Message writer)
2021-07-09 04:40:36.394394 I | rafthttp: established a TCP streaming connection with peer 18be3c63101419ff (stream Message writer)


I am starting to wonder whether these nodes are starved of resources, given the problems every one of these components seems to be reporting.

I'm not sure what OpenShift requires, but here's the memory output from node/stage-542-pxdh2-control-plane-1:

sh-4.4# cat /proc/meminfo 
MemTotal:       16409920 kB
MemFree:         1011400 kB
MemAvailable:    7619488 kB
Buffers:            1080 kB
Cached:          6614072 kB
SwapCached:            0 kB
Active:          8879456 kB
Inactive:        5080912 kB
Active(anon):    7370952 kB
Inactive(anon):    47000 kB
Active(file):    1508504 kB
Inactive(file):  5033912 kB
Unevictable:       74328 kB
Mlocked:           74328 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1768 kB
Writeback:             0 kB
AnonPages:       6240412 kB
Mapped:          1517636 kB
Shmem:             57108 kB
KReclaimable:     403396 kB
Slab:            1091016 kB
SReclaimable:     403396 kB
SUnreclaim:       687620 kB
KernelStack:       25168 kB
PageTables:        49576 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8204960 kB
Committed_AS:   21091796 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:            28288 kB
HardwareCorrupted:     0 kB
AnonHugePages:   2488320 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      737120 kB
DirectMap2M:    16039936 kB
DirectMap1G:     2097152 kB
sh-4.4# free
              total        used        free      shared  buff/cache   available
Mem:       16409920     8665292      718044       57272     7026584     7334008
Swap:             0           0           0
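
To get a quick read on whether the control plane really is starved, node and pod usage can be pulled from the cluster metrics; a minimal sketch, assuming the monitoring stack is still healthy enough to serve `oc adm top`. Note that Committed_AS above (~21 GB) already exceeds MemTotal (~16 GB), which at least makes memory pressure plausible:

# Node-level CPU/memory usage for the control-plane nodes.
$ oc adm top nodes
# Heaviest consumers in the control-plane namespaces.
$ oc adm top pods -n openshift-etcd
$ oc adm top pods -n openshift-kube-apiserver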

Comment 6 Alexander Constantinescu 2021-07-09 15:02:47 UTC
Based on the above and the logs below, I am sending this bug to the etcd team to verify the slow read operations I am seeing:

$  oc logs -c etcd etcd-stage-542-pxdh2-control-plane-1 -n openshift-etcd | less
2021-07-09 13:44:08.835780 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-apiserver-operator/kube-apiserver-operator-lock\" " with result "error:etcdserver: request timed out" took too long (7.314632493s) to execute
2021-07-09 13:44:08.835849 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumes/\" range_end:\"/kubernetes.io/persistentvolumes0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.382316227s) to execute
2021-07-09 13:44:08.835891 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/\" range_end:\"/kubernetes.io/pods0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.393195589s) to execute
2021-07-09 13:44:08.835945 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/prometheusrules/\" range_end:\"/kubernetes.io/monitoring.coreos.com/prometheusrules0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395178602s) to execute
2021-07-09 13:44:08.835995 W | etcdserver: read-only range request "key:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshotcontents/\" range_end:\"/kubernetes.io/snapshot.storage.k8s.io/volumesnapshotcontents0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395384965s) to execute
2021-07-09 13:44:08.836055 W | etcdserver: read-only range request "key:\"/kubernetes.io/migration.k8s.io/storageversionmigrations/\" range_end:\"/kubernetes.io/migration.k8s.io/storageversionmigrations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.39545702s) to execute
2021-07-09 13:44:08.836112 W | etcdserver: read-only range request "key:\"/kubernetes.io/cronjobs/\" range_end:\"/kubernetes.io/cronjobs0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395538946s) to execute
2021-07-09 13:44:08.836148 W | etcdserver: read-only range request "key:\"/kubernetes.io/roles/\" range_end:\"/kubernetes.io/roles0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395580861s) to execute
2021-07-09 13:44:08.836181 W | etcdserver: read-only range request "key:\"/kubernetes.io/fileintegrity.openshift.io/fileintegritynodestatuses/\" range_end:\"/kubernetes.io/fileintegrity.openshift.io/fileintegritynodestatuses0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395642367s) to execute
2021-07-09 13:44:08.836213 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/\" range_end:\"/kubernetes.io/monitoring.coreos.com/servicemonitors0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.39568878s) to execute
2021-07-09 13:44:08.836259 W | etcdserver: read-only range request "key:\"/kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/\" range_end:\"/kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (7.395763294s) to execute

NOTE: I am not ruling out (host) networking as the culprit here; it might very well be. But those logs look suspicious and might hint at disk issues for etcd.
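
A quick way to gauge how widespread these slow requests are is to count the "took too long" warnings on each member; a minimal sketch using the two pod names that appear in this bug (add any remaining members):

$ for pod in etcd-stage-542-pxdh2-control-plane-1 etcd-stage-542-pxdh2-control-plane-2; do
    # Count slow-request warnings in the current container log of each member.
    echo -n "$pod: "
    oc logs -c etcd -n openshift-etcd "$pod" | grep -c "took too long"
  done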

I am constantly getting etcd errors when communicating with this cluster, similar to what has been detailed in comment 3.

I am also asking QE whether things were working up until a certain point, and whether an operator deployed around that time might have degraded the cluster: https://coreos.slack.com/archives/CH76YSYSC/p1625841459106900?thread_ts=1625835561.095200&cid=CH76YSYSC

Comment 7 Michal Fojtik 2021-07-09 15:13:22 UTC
** WARNING **

This BZ claims that this bug is of urgent severity and priority. Note that urgent priority means that you have just declared an emergency within engineering.
Engineers are asked to stop whatever they are doing, including putting important release work on hold, potentially risking the release of OCP while working on this case.

Be prepared to have a good justification ready, and make sure your own management and engineering management are aware of this and have approved it. Urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug has been assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

Comment 11 Sam Batschelet 2021-07-12 16:51:51 UTC
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1980659#c9, I am unsetting the blocker status. The failure condition that was reproduced in 4.7.19 appears to be an infrastructure issue.

Comment 12 Sam Batschelet 2021-07-20 01:31:50 UTC
Created attachment 1803462 [details]
p99 etcd_network_peer_round_trip_time_seconds_bucket

Comment 13 Sam Batschelet 2021-07-20 01:32:56 UTC
Created attachment 1803463 [details]
p99 etcd_disk_wal_fsync_duration_seconds_bucket

Comment 14 Sam Batschelet 2021-07-20 01:34:24 UTC
Created attachment 1803464 [details]
leader elections

Comment 15 Sam Batschelet 2021-07-20 01:56:55 UTC
Although the infra instances were reported to have been updated to use local NVMe, disk latency is still very high: with the p99 fsync duration hovering around 100ms and spiking to almost 1s, the cluster is in a constant leader-election loop. Storage will need to be improved before this test can be used as a valid signal.
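
For anyone retracing this, the attached graphs can be pulled from the in-cluster Prometheus with queries along these lines; a minimal sketch, assuming the prometheus-k8s route in openshift-monitoring and a token that is allowed to query it:

$ TOKEN=$(oc whoami -t)
$ HOST=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
# p99 WAL fsync duration per etcd pod (comment 13 attachment).
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
# p99 peer round-trip time (comment 12 attachment).
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))'
# Leader elections over the past hour (comment 14 attachment).
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])'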

Comment 16 Martin André 2021-07-21 07:48:01 UTC
There is nothing the OpenShift on OpenStack team can do. Please log a support ticket on PSI's helpdesk.
Closing, as all evidence points to an infra issue.

Comment 17 To Hung Sze 2021-08-16 18:25:30 UTC
To summarize what we have done since: IT has given us a new profile, "ocp4-master-ssd", and with this profile we now have clusters running on the same Flexy template that don't have the etcd problem.
However, when we run the staging pipeline, etcd leader changes / etcd errors appear again.
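
If the staging-pipeline clusters are suspected of landing on slower storage despite the new profile, a common sanity check is an fsync-heavy fio run against the filesystem backing /var/lib/etcd, modeled on the upstream etcd disk benchmark recipe; a minimal sketch, assuming fio is available on the node (e.g. via a debug or toolbox container) and the test directory is cleaned up afterwards:

# From a control-plane node shell (e.g. `oc debug node/<node>`, then `chroot /host`).
# Writes small fdatasync'd records, similar to etcd's WAL pattern, and reports fsync latency percentiles.
$ fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-test

For etcd to be stable, the reported p99 fsync latency should come in well below the ~100ms observed in comment 15.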

