Description of problem:

The cluster abruptly lost control plane access. When we checked the master nodes, we noticed they had run out of memory.

Master nodes:
LoadAvg: [16 CPU] 0.19 (1%), 0.20 (1%), 0.26 (2%)
RAM: 62 GiB total

From crictl stats we see that etcd and kube-apiserver are consuming a huge amount of memory on the master nodes:

master1:
CONTAINER       CPU %    MEM       DISK      INODES
2cb5b2650ecca   397.60   66.15GB   209B      10      <- kube-apiserver
3e9c5df1f5df0   344.87   50.28GB   225.7kB   15      <- etcd

master2:
CONTAINER       CPU %    MEM       DISK      INODES
e69c2a41d4e4f   2.63     51.07GB   208B      10      <- etcd

master3:
CONTAINER       CPU %    MEM       DISK      INODES
a803b74abfcc9   2.00     43.28GB   209B      10      <- etcd

Even after restarting all three master nodes, the containers immediately consume all available memory again.

Version-Release number of selected component (if applicable):
Cluster Version: 4.8.17
Channel: stable-4.8
NetworkType: Calico
Azure/IPI

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
etcd and kube-apiserver take all the memory on the host.

Expected results:
etcd and kube-apiserver should not consume that much memory.

Additional info:
Progress made today:
- We were able to look at the logs from the sos report; unfortunately the oldest log only goes back to '2022-03-11T10:35:00.019918067+00:00'.
- No Prometheus metrics data was available as of this morning; we have asked the customer to take a capture of their Prometheus data.
- There are no logs from master-8, on which the apiserver crashed.

What we have observed:
- From viewing the logs, we clearly see API latency that is causing cluster degradation.

Examples from the kube-apiserver logs:

> 2022-03-11T19:46:04.201060548+00:00 stderr F Trace[512805231]: ---"Writing http response done" count:1 2235ms (19:46:00.945)

The apiserver is taking more than 2s to serialize and write a LIST of a single object, not including any round trip to etcd. This usually indicates:
- CPU saturation on the master node hosting that apiserver instance, or
- underlying network connectivity issues.

> Trace[460953672]: ---"Listing from storage done" 16274ms (19:46:00.603)

Listing from etcd (including the round trip between kube-apiserver and etcd) took too long; we see many of these slow traces.

Examples from the etcd logs:

> 2022-03-16T20:45:41.700325055Z {"level":"warn","ts":"2022-03-16T20:45:41.692Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"5.845569603s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/secrets/datastore-redis-rfs/redis-app-acl-old-rnd-rfs\" ","response":"range_response_count:0 size:8"}

etcd 'apply' is taking far longer than expected: "took":"5.845569603s" versus "expected-duration":"100ms".

> {"level":"info","ts":"2022-03-11T19:50:53.145Z","caller":"traceutil/trace.go:145","msg":"trace[1242538398] range","detail":"{range_begin:/kubernetes.io/machineconfiguration.openshift.io/machineconfigs/; range_end:/kubernetes.io/machineconfiguration.openshift.io/machineconfigs0; response_count:120; response_revision:1586712027; }","duration":"8.40039734s","start":"2022-03-11T19:50:44.745Z","end":"2022-03-11T19:50:53.145Z","steps":["trace[1242538398] 'agreement among raft nodes before linearized reading' (duration: 8.186373557s)","trace[1242538398] 'range keys from bolt db' (duration: 212.195986ms)"]}

The bolt db query itself is in the millisecond range, but the overall raft agreement took on the order of seconds. We need more data points to assess what the source is.

Next steps:
- The customer has uploaded the Prometheus data dump; we will provision it and look at the metrics, which should give us enough insight to pinpoint the root cause.
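For reference, a rough way to quantify how widespread the slow etcd applies are is to bucket the warnings by hour and dump the reported durations. This is only a sketch, assuming the same sos-report log paths as used above and that jq is available locally:

> # count "apply request took too long" warnings per hour on 2022-03-11 (hour taken from the CRI timestamp prefix)
> $ grep -rh "apply request took too long" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* \
>     | grep "^2022-03-11T" | cut -c12-13 | sort | uniq -c
>
> # show the reported durations and the keys involved
> $ grep -rh "apply request took too long" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* \
>     | sed 's/^[^{]*//' | jq -r '.took + "  " + .request'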
Looking at the etcd leader election activity for March 11 (from the etcd logs), between 2022-03-11T03:01:51 and 2022-03-11T19:50:

> $ grep -rih "is starting a new election at term" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | wc -l
> 79
>
> 2022-03-11T03:01:51.791151216+00:00 stderr F {"level":"info","ts":"2022-03-11T03:01:51.555Z","caller":"raft/raft.go:923","msg":"73a1998106ae1525 is starting a new election at term 779"}

There were about 60 leader elections from 03:00 to 07:00, averaging one every 4 minutes. Consequently, we saw 56 leader changes:

> $ grep -rih "elected leader" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | wc -l
> 56
>
> 2022-03-11T03:01:56.099136429+00:00 stderr F {"level":"info","ts":"2022-03-11T03:01:56.099Z","caller":"raft/node.go:325","msg":"raft.node: 73a1998106ae1525 elected leader 73a1998106ae1525 at term 780"}

So what is causing these leader elections? I see etcd peer communication errors that can trigger leader elections:

> $ grep -rih "lost TCP streaming connection with remote peer" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | grep "i/o timeout" | wc -l
> 307
>
> 2022-03-11T19:50:45.868828442+00:00 stderr F {"level":"warn","ts":"2022-03-11T19:50:45.868Z","caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"73a1998106ae1525","remote-peer-id":"cd9034cd5c6fe4a2","error":"read tcp 10.56.142.108:49336->10.56.142.111:2380: i/o timeout"}

Have we eliminated network connectivity issues or slow disks that can trigger election timeouts? I don't have logs from March 12 or later to see whether the etcd election activity has improved.
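One way to narrow this down further would be to break the 307 i/o timeouts down by local->remote peer link: if they concentrate on one link it points at a specific node or network path, while an even spread across all links points more toward node-wide starvation (CPU/disk). A sketch, assuming the same log layout, the error format shown above, and GNU sed:

> $ grep -rih "lost TCP streaming connection with remote peer" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* \
>     | grep "2022-03-11T" | grep "i/o timeout" \
>     | sed -E 's/.*read tcp ([0-9.]+):[0-9]+->([0-9.]+):2380.*/\1 -> \2/' \
>     | sort | uniq -c | sort -rn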
> https://bugzilla.redhat.com/show_bug.cgi?id=2064371#c14

We were not able to extract the dump; the tarball is corrupt.
> we use SSD that are rated up to 5000 iops and we hit that limit, so this explains slow down or freeze after 11mar 02:20.

This could explain the excessive etcd leader elections (https://bugzilla.redhat.com/show_bug.cgi?id=2064371#c21).

> The thing we need to understand is why the available memory went down between 10mar 18:00 and 11mar 02:20. After the 11 mar 02:20 the system was not usable.

- We don't have any pod logs from March 10 or before.
- We don't have a working Prometheus dump.

It's going to be challenging to work toward an RCA without data that covers the incident window; the oldest log is from "2022-03-11T02:48:35.305660977". Can you check whether logs were captured for March 10?
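On the disk side, once we have a working Prometheus dump, the etcd fsync/commit histograms (etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds) over the incident window would tell us whether the IOPS ceiling translated into the kind of latency that trips election timeouts. In the meantime, a common way to sanity-check the data disk is the fio fsync benchmark; a sketch only, assuming fio is installed on the node and pointed at a scratch directory on the same disk that backs /var/lib/etcd (the 99th percentile fdatasync latency is usually expected to stay under ~10ms):

> $ fio --rw=write --ioengine=sync --fdatasync=1 \
>       --directory=/var/lib/etcd-disk-check --size=22m --bs=2300 --name=etcd-disk-check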
> Which are the logs that we are looking for ?

Typically we need the following for deep debugging:
- pod logs (from a must-gather); if that is not possible, then at a minimum the kube-apiserver, openshift-apiserver, and etcd logs
- audit logs
- a Prometheus data capture
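For reference, these are typically collected with something like the following (this is a sketch; it assumes cluster-admin access, the default must-gather image, and that tar is available inside the prometheus container for the last step):

> # pod logs and general cluster state
> $ oc adm must-gather
>
> # kube-apiserver / openshift-apiserver audit logs from the master nodes
> $ oc adm node-logs --role=master --path=kube-apiserver/
> $ oc adm node-logs --role=master --path=openshift-apiserver/
>
> # Prometheus data capture (one common approach; pod name assumed)
> $ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- tar czf - /prometheus > prometheus-k8s-0.tar.gz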
The root cause in the support case that led to this bug being filed is understood. The pod count increased by 2.5x and multiple processes were listing pods across all namespaces; etcd and the apiserver struggled to render the 700MiB response payload for those concurrent requests, and resource exhaustion resulted. There are strategic efforts underway to reduce the impact of such requests, as well as efforts to identify and address costly clients that may or may not need to perform such expensive requests.

I would like to scope this bug down to ensuring that we update the OCP versions which use etcd 3.4 to at least 3.4.16, which includes a fix for the costly unmarshalling of response payloads that was done just to generate the log messages for slow range queries. This is likely to reduce etcd's memory consumption per request that triggers those log messages by 10-30%. See https://github.com/etcd-io/etcd/pull/12871 and https://github.com/etcd-io/etcd/issues/12835

The fixes for this specific resource hog are already present in etcd 3.5.0 and later, so 4.9 and later are fine. This will just reduce resource consumption in 4.6 through 4.8 and, hopefully, reduce the likelihood of a bad situation becoming critical.
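As an aside, identifying the costly clients mentioned above is usually a matter of aggregating the kube-apiserver audit logs by user for cluster-wide pod LISTs. A sketch, assuming audit logging is enabled, the logs have been pulled locally (e.g. via oc adm node-logs) into files matching audit*.log, and jq is available:

> $ jq -r 'select(.verb == "list" and .objectRef.resource == "pods" and (.objectRef.namespace // "") == "")
>          | .user.username' audit*.log | sort | uniq -c | sort -rn | head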
Rather than rescoping this down to just the etcd bump expected in 4.6-4.8, we will instead close this as NOTABUG with references to the other defects we are chasing. Please see the "Depends On" field of this bug for those references.