Bug 2064371
Summary: | [ocp 4.8.17] etcd high memory utilization in three master nodes | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Alfredo Pizarro <apizarro>
Component: | Etcd | Assignee: | Allen Ray <alray>
Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.6 | CC: | akashem, cruhm, dhellard, dpateriy, dtarabor, dwest, gscheffe, hshukla, igreen, jfindysz, kgordeev, knakai, lserot, ocasalsa, openshift-bugs-escalate, rhowe, rjamadar, sdodson, sscheink, ssonigra, travi, vismishr, wking
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-04-26 15:18:04 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2077975, 2078954, 2078988, 2087049, 2088449 | |
Bug Blocks: | | |
Description
Alfredo Pizarro
2022-03-15 16:35:45 UTC
progress made today:

- we were able to look at the logs from the sos report; unfortunately the oldest log only goes back to '2022-03-11T10:35:00.019918067+00:00'.
- no prometheus metrics data was available as of this morning; we have asked the customer to take a capture of their prometheus data.
- no logs from master-8, on which the apiserver crashed.

what we have observed:

- from the logs we clearly see API latency that is causing cluster degradation.

examples from kube-apiserver logs:

> 2022-03-11T19:46:04.201060548+00:00 stderr F Trace[512805231]: ---"Writing http response done" count:1 2235ms (19:46:00.945)

the apiserver is taking more than 2s to serialize and write a LIST of 1 object, not including any round trip to etcd. This usually indicates:
- cpu saturation on the master node the apiserver instance is on, or
- underlying network connectivity issues

> Trace[460953672]: ---"Listing from storage done" 16274ms (19:46:00.603)

listing from etcd (including the round trip between kube-apiserver and etcd) took too long; we see many of these slow traces.

examples from etcd logs:

> 2022-03-16T20:45:41.700325055Z {"level":"warn","ts":"2022-03-16T20:45:41.692Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"5.845569603s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/secrets/datastore-redis-rfs/redis-app-acl-old-rnd-rfs\" ","response":"range_response_count:0 size:8"}

the etcd 'apply' is taking far longer than expected - "took":"5.845569603s" against an "expected-duration":"100ms".

> {"level":"info","ts":"2022-03-11T19:50:53.145Z","caller":"traceutil/trace.go:145","msg":"trace[1242538398] range","detail":"{range_begin:/kubernetes.io/machineconfiguration.openshift.io/machineconfigs/; range_end:/kubernetes.io/machineconfiguration.openshift.io/machineconfigs0; response_count:120; response_revision:1586712027; }","duration":"8.40039734s","start":"2022-03-11T19:50:44.745Z","end":"2022-03-11T19:50:53.145Z","steps":["trace[1242538398] 'agreement among raft nodes before linearized reading' (duration: 8.186373557s)","trace[1242538398] 'range keys from bolt db' (duration: 212.195986ms)"]}

the bolt db query itself is in the ms range, but the raft agreement took on the order of seconds. we need more data points to assess what the source is.

next steps:
- the customer has uploaded the prometheus data dump; we will provision it and look at the metrics, which should give us some insight to pinpoint the root cause.

looking at the etcd leader election activity for March 11 (from the etcd logs), between 2022-03-11T03:01:51 and 2022-03-11T19:50:

> $ grep -rih "is starting a new election at term" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | wc -l
> 79
>
> 2022-03-11T03:01:51.791151216+00:00 stderr F {"level":"info","ts":"2022-03-11T03:01:51.555Z","caller":"raft/raft.go:923","msg":"73a1998106ae1525 is starting a new election at term 779"}

there were about 60 leader elections from 03:00 to 07:00, averaging one every 4 minutes. consequently, we saw 56 leader changes:

> $ grep -rih "elected leader" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | wc -l
> 56
>
> 2022-03-11T03:01:56.099136429+00:00 stderr F {"level":"info","ts":"2022-03-11T03:01:56.099Z","caller":"raft/node.go:325","msg":"raft.node: 73a1998106ae1525 elected leader 73a1998106ae1525 at term 780"}

so what's causing these leader elections?
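as a quick follow-up sketch (untested; the path assumes the same sos-report layout as the greps above), bucketing the election messages by hour would show whether the churn is continuous through the day or clustered around a particular window:

> $ grep -rh "is starting a new election at term" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "^2022-03-11T" | cut -c1-13 | sort | uniq -c

each output row is an hour bucket (e.g. 2022-03-11T03) with the number of election attempts seen in that hour.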
I see etcd peer communication errors that can trigger leader elections:

> $ grep -rih "lost TCP streaming connection with remote peer" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | grep "i/o timeout" | wc -l
> 307
>
> 2022-03-11T19:50:45.868828442+00:00 stderr F {"level":"warn","ts":"2022-03-11T19:50:45.868Z","caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"73a1998106ae1525","remote-peer-id":"cd9034cd5c6fe4a2","error":"read tcp 10.56.142.108:49336->10.56.142.111:2380: i/o timeout"}

have we eliminated network connectivity issues or slow disk that can trigger election timeouts? I don't have logs from March 12 or later to see whether etcd election activity has improved.

> https://bugzilla.redhat.com/show_bug.cgi?id=2064371#c14
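two quick checks that could help answer that question (a sketch only; the fio parameters follow the usual upstream etcd disk guidance, and the test directory is a placeholder that should live on the same device as /var/lib/etcd): break the i/o timeouts down by remote peer to see whether a single link dominates, and measure fsync latency on each master's etcd disk:

> # count the lost-connection timeouts per remote peer id
> $ grep -rh "lost TCP streaming connection with remote peer" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | grep -o '"remote-peer-id":"[a-f0-9]*"' | sort | uniq -c
>
> # fsync latency test on the etcd data disk (create and later remove the throwaway directory)
> $ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check

fio reports fdatasync latency percentiles; if the 99th percentile sits well above roughly 10ms, the disk alone can push etcd past its election timeout under load.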
we were not able to extract the dump; the tarball is corrupt.
> we use SSD that are rated up to 5000 iops and we hit that limit, so this explains slow down or freeze after 11mar 02:20.

This could explain the excessive etcd leader elections.

https://bugzilla.redhat.com/show_bug.cgi?id=2064371#c21

> The thing we need to understand is why the available memory went down between 10mar 18:00 and 11mar 02:20. After the 11 mar 02:20 the system was not usable.

- we don't have any pod logs from March 10 or before
- we don't have a working prometheus dump

it's going to be challenging to work toward an RCA if we don't have data that covers the incident window; the oldest log is from "2022-03-11T02:48:35.305660977". Can you check if logs were captured for March 10?

> Which are the logs that we are looking for ?
typically we need the following for deep debugging (a rough collection sketch follows this list):
- pod logs (from must-gather); if not, then at a minimum the kube-apiserver, openshift apiserver, and etcd logs
- audit logs
- prometheus data capture
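on OCP 4.x the collection usually looks roughly like the following (the audit-log gather script path is the one documented for must-gather and may differ slightly by version; there is no single oc command for the prometheus capture, which normally follows the support-provided procedure):

> # default must-gather: includes pod logs for the openshift-* namespaces (kube-apiserver, openshift-apiserver, etcd, ...)
> $ oc adm must-gather
>
> # audit logs from the API servers
> $ oc adm must-gather -- /usr/bin/gather_audit_logs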
The root cause in the support case which led to this bug being filed is understood. Pod count increased by 2.5x and multiple processes were listing pods in all namespaces; etcd and the apiserver struggled to render the 700MiB response payload for those concurrent requests, and resource exhaustion was the result. There are strategic efforts underway to reduce the impact of such requests, as well as efforts to identify and address costly clients that may or may not need to perform such expensive requests.

I would like to scope this bug down to ensuring that we update the OCP versions which use etcd 3.4 to at least 3.4.16, which includes a fix for the costly unmarshalling of response payloads done only to generate the log messages that report slow range queries. This is likely to reduce etcd's memory consumption per request that triggers those log messages by 10-30%. See https://github.com/etcd-io/etcd/pull/12871 and https://github.com/etcd-io/etcd/issues/12835.

The fixes for this specific resource hog are already present in etcd 3.5.0 and later, so 4.9 and later are fine. This will just reduce resource consumption in 4.6 through 4.8 and, hopefully, reduce the likelihood of a bad situation becoming critical.

Rather than rescoping this down to just the etcd bump expected in 4.6-4.8, we will instead close this as NOTABUG with references to the other defects we are chasing. Please see the "Bug Depends On" field of this bug for those references.
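As an illustration of how such costly clients can be identified from the audit logs (a sketch only; the field names are the standard Kubernetes audit event fields, and the log path is the usual on-master location, both of which may differ in a given capture), counting cluster-scoped pod LIST requests by user and user agent usually surfaces the offenders quickly:

> $ jq -r 'select(.stage=="ResponseComplete" and .verb=="list" and .objectRef.resource=="pods" and (.objectRef.namespace|not)) | "\(.user.username) \(.userAgent)"' /var/log/kube-apiserver/audit.log | sort | uniq -c | sort -rn | head

the same filter works against the audit files collected by a gather_audit_logs run.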