Bug 2064371
Summary: | [ocp 4.8.17] etcd high memory utilization in three master nodes | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Alfredo Pizarro <apizarro>
Component: | Etcd | Assignee: | Allen Ray <alray>
Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.6 | CC: | akashem, cruhm, dhellard, dpateriy, dtarabor, dwest, gscheffe, hshukla, igreen, jfindysz, kgordeev, knakai, lserot, ocasalsa, openshift-bugs-escalate, rhowe, rjamadar, sdodson, sscheink, ssonigra, travi, vismishr, wking
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-04-26 15:18:04 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2077975, 2078954, 2078988, 2087049, 2088449 | |
Bug Blocks: | | |
Description
Alfredo Pizarro
2022-03-15 16:35:45 UTC
progress made today:

- we were able to look at the logs from the sos report; unfortunately the oldest log only goes back to '2022-03-11T10:35:00.019918067+00:00'.
- no prometheus metrics data was available as of this morning; we have asked the customer to take a capture of their prometheus data.
- no logs from master-8, on which the apiserver crashed.

what we have observed:

- from the logs we clearly see API latency that is causing cluster degradation.

examples from kube-apiserver logs:

> 2022-03-11T19:46:04.201060548+00:00 stderr F Trace[512805231]: ---"Writing http response done" count:1 2235ms (19:46:00.945)

the apiserver is taking more than 2s to serialize and write a LIST of 1 object, not including any round trip to etcd. This usually indicates:
- cpu saturation on the master node the apiserver instance is on, or
- underlying network connectivity issues

> Trace[460953672]: ---"Listing from storage done" 16274ms (19:46:00.603)

listing from etcd (including the round trip between kube-apiserver and etcd) took too long; we see many of these slow traces.

examples from etcd logs:

> 2022-03-16T20:45:41.700325055Z {"level":"warn","ts":"2022-03-16T20:45:41.692Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"5.845569603s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/secrets/datastore-redis-rfs/redis-app-acl-old-rnd-rfs\" ","response":"range_response_count:0 size:8"}

the etcd 'apply' is taking far longer than expected - "took":"5.845569603s" against an "expected-duration":"100ms".

> {"level":"info","ts":"2022-03-11T19:50:53.145Z","caller":"traceutil/trace.go:145","msg":"trace[1242538398] range","detail":"{range_begin:/kubernetes.io/machineconfiguration.openshift.io/machineconfigs/; range_end:/kubernetes.io/machineconfiguration.openshift.io/machineconfigs0; response_count:120; response_revision:1586712027; }","duration":"8.40039734s","start":"2022-03-11T19:50:44.745Z","end":"2022-03-11T19:50:53.145Z","steps":["trace[1242538398] 'agreement among raft nodes before linearized reading' (duration: 8.186373557s)","trace[1242538398] 'range keys from bolt db' (duration: 212.195986ms)"]}

the bolt db query itself is in the ms range, but the raft agreement took on the order of seconds. we need more data points to assess what the source is.

next steps:
- the customer has uploaded the prometheus data dump; we will provision it and look at the metrics, which should give us some insight to pinpoint the root cause.

looking at the etcd leader election activity for March 11 (from the etcd logs), between 2022-03-11T03:01:51 and 2022-03-11T19:50:

> $ grep -rih "is starting a new election at term" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | wc -l
> 79
>
> 2022-03-11T03:01:51.791151216+00:00 stderr F {"level":"info","ts":"2022-03-11T03:01:51.555Z","caller":"raft/raft.go:923","msg":"73a1998106ae1525 is starting a new election at term 779"}

there were about 60 leader elections from 03:00 to 07:00, averaging one every 4 minutes. consequently, we saw 56 leader changes:

> $ grep -rih "elected leader" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | wc -l
> 56
>
> 2022-03-11T03:01:56.099136429+00:00 stderr F {"level":"info","ts":"2022-03-11T03:01:56.099Z","caller":"raft/node.go:325","msg":"raft.node: 73a1998106ae1525 elected leader 73a1998106ae1525 at term 780"}

so what's causing these leader elections?
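as a quick follow-up sketch (untested; the path assumes the same sos-report layout as the greps above), bucketing the election messages by hour would show whether the churn is continuous through the day or clustered around a particular window:

> $ grep -rh "is starting a new election at term" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "^2022-03-11T" | cut -c1-13 | sort | uniq -c

each output row is an hour bucket (e.g. 2022-03-11T03) with the number of election attempts seen in that hour.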
I see etcd peer communication errors that can trigger leader elections:

> $ grep -rih "lost TCP streaming connection with remote peer" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | grep "i/o timeout" | wc -l
> 307
>
> 2022-03-11T19:50:45.868828442+00:00 stderr F {"level":"warn","ts":"2022-03-11T19:50:45.868Z","caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"73a1998106ae1525","remote-peer-id":"cd9034cd5c6fe4a2","error":"read tcp 10.56.142.108:49336->10.56.142.111:2380: i/o timeout"}

have we eliminated network connectivity issues or slow disk that can trigger election timeouts? I don't have logs from March 12 or later to see whether etcd election activity has improved.

> https://bugzilla.redhat.com/show_bug.cgi?id=2064371#c14
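two quick checks that could help answer that question (a sketch only; the fio parameters follow the usual upstream etcd disk guidance, and the test directory is a placeholder that should live on the same device as /var/lib/etcd): break the i/o timeouts down by remote peer to see whether a single link dominates, and measure fsync latency on each master's etcd disk:

> # count the lost-connection timeouts per remote peer id
> $ grep -rh "lost TCP streaming connection with remote peer" sos-report/var/log/pods/openshift-etcd_etcd-*/etcd/* | grep "2022-03-11T" | grep -o '"remote-peer-id":"[a-f0-9]*"' | sort | uniq -c
>
> # fsync latency test on the etcd data disk (create and later remove the throwaway directory)
> $ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check

fio reports fdatasync latency percentiles; if the 99th percentile sits well above roughly 10ms, the disk alone can push etcd past its election timeout under load.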
we were not able to extract the dump; the tarball is corrupt.
> we use SSD that are rated up to 5000 iops and we hit that limit, so this explains slow down or freeze after 11mar 02:20.

This could explain the excessive etcd leader elections.

https://bugzilla.redhat.com/show_bug.cgi?id=2064371#c21

> The thing we need to understand is why the available memory went down between 10mar 18:00 and 11mar 02:20. After the 11 mar 02:20 the system was not usable.

- we don't have any pod logs from March 10 or before
- we don't have a working prometheus dump

it's going to be challenging to work toward an RCA if we don't have data that covers the incident window; the oldest log is from "2022-03-11T02:48:35.305660977". Can you check if logs were captured for March 10?

> Which are the logs that we are looking for ?
typically we need the following for deep debugging (a rough collection sketch follows this list):
- pod logs (from must-gather); if not, then at a minimum the kube-apiserver, openshift apiserver, and etcd logs
- audit logs
- prometheus data capture
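on OCP 4.x the collection usually looks roughly like the following (the audit-log gather script path is the one documented for must-gather and may differ slightly by version; there is no single oc command for the prometheus capture, which normally follows the support-provided procedure):

> # default must-gather: includes pod logs for the openshift-* namespaces (kube-apiserver, openshift-apiserver, etcd, ...)
> $ oc adm must-gather
>
> # audit logs from the API servers
> $ oc adm must-gather -- /usr/bin/gather_audit_logs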
The root cause in the support case which led to this bug being filed is understood. Pod count increased by 2.5x and multiple processes were listing pods in all namespaces; etcd and the apiserver struggled to render the 700MiB response payload for those concurrent requests, and resource exhaustion was the result. There are strategic efforts underway to reduce the impact of such requests, as well as efforts to identify and address costly clients that may or may not need to perform such expensive requests.

I would like to scope this bug down to ensuring that we update the OCP versions which use etcd 3.4 to at least 3.4.16, which includes a fix for the costly unmarshalling of response payloads done only to generate the log messages that report slow range queries. This is likely to reduce etcd's memory consumption per request that triggers those log messages by 10-30%. See https://github.com/etcd-io/etcd/pull/12871 and https://github.com/etcd-io/etcd/issues/12835.

The fixes for this specific resource hog are already present in etcd 3.5.0 and later, so 4.9 and later are fine. This will just reduce resource consumption in 4.6 through 4.8 and, hopefully, reduce the likelihood of a bad situation becoming critical.

Rather than rescoping this down to just the etcd bump expected in 4.6-4.8, we will instead close this as NOTABUG with references to the other defects we are chasing. Please see the "Bug Depends On" field of this bug for those references.
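As an illustration of how such costly clients can be identified from the audit logs (a sketch only; the field names are the standard Kubernetes audit event fields, and the log path is the usual on-master location, both of which may differ in a given capture), counting cluster-scoped pod LIST requests by user and user agent usually surfaces the offenders quickly:

> $ jq -r 'select(.stage=="ResponseComplete" and .verb=="list" and .objectRef.resource=="pods" and (.objectRef.namespace|not)) | "\(.user.username) \(.userAgent)"' /var/log/kube-apiserver/audit.log | sort | uniq -c | sort -rn | head

the same filter works against the audit files collected by a gather_audit_logs run.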