Version: quay.io/acm-d/acm-custom-registry:2.3.0-DOWNSTREAM-2021-06-28-15-34-59
OCP: 4.8.0-rc.1-x86_64

Steps to reproduce:
Deploy OCP with ACM/Assisted Installer. Deploy ~1000 spoke SNOs through ACM/Assisted Installer over the course of two weeks (the machines were received in portions, so the spokes were also added in portions over that period).

oc get managedclusters | wc -l
949

Tried adding a MachineConfig (adding an additional route) and the change never completed:

oc get node
NAME                                        STATUS                        ROLES    AGE   VERSION
e24-h02-000-r640.rdu2.scalelab.redhat.com   Ready                         master   16d   v1.21.0-rc.0+766a5fe
e24-h03-000-r640.rdu2.scalelab.redhat.com   Ready                         worker   16d   v1.21.0-rc.0+766a5fe
e24-h05-000-r640.rdu2.scalelab.redhat.com   Ready                         master   16d   v1.21.0-rc.0+766a5fe
e24-h06-000-r640.rdu2.scalelab.redhat.com   Ready,SchedulingDisabled      worker   16d   v1.21.0-rc.0+766a5fe
e24-h07-000-r640.rdu2.scalelab.redhat.com   NotReady,SchedulingDisabled   master   16d   v1.21.0-rc.0+766a5fe

Tried to delete a stuck pod:

[root@e24-h01-000-r640 ~]# oc delete pod -n openshift-nmstate nmstate-handler-z6zqm --force
Error from server: etcdserver: mvcc: database space exceeded

Went to check etcd:

[root@e24-h01-000-r640 ~]# oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
etcd-e24-h02-000-r640.rdu2.scalelab.redhat.com   4/4   Running   0   16d   2000::ace1   e24-h02-000-r640.rdu2.scalelab.redhat.com   <none>   <none>
etcd-e24-h05-000-r640.rdu2.scalelab.redhat.com   4/4   Running   4   16d   2000::ace3   e24-h05-000-r640.rdu2.scalelab.redhat.com   <none>   <none>
etcd-e24-h07-000-r640.rdu2.scalelab.redhat.com   4/4   Running   0   16d   2000::ace5   e24-h07-000-r640.rdu2.scalelab.redhat.com   <none>   <none>

[root@e24-h01-000-r640 ~]# oc rsh -n openshift-etcd etcd-e24-h07-000-r640.rdu2.scalelab.redhat.com etcdctl endpoint status --cluster -w table
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
{"level":"warn","ts":"2021-07-16T04:42:10.108Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///unixs://[2000::ace5]:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial unix [2000::ace5]:0: connect: no such file or directory\""}
Failed to get the status of endpoint unixs://[2000::ace5]:0 (context deadline exceeded)
(the same retry warning and context-deadline-exceeded failure were also logged for unixs://[2000::ace1]:0 and unixs://[2000::ace3]:0)
ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 169867226  | 169867226          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | true      | false      | 16        | 169867531  | 169867530          | alarm:NOSPACE on memberIDs 14288348196171437800, 8730137547783813330, 7377309355919782153
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 169867774  | 169867774          | alarm:NOSPACE on memberIDs 14288348196171437800, 8730137547783813330, 7377309355919782153
command terminated with exit code 1

Tried to defrag using this procedure:
https://docs.openshift.com/container-platform/4.7/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks

[root@e24-h01-000-r640 ~]# oc rsh -n openshift-etcd etcd-e24-h05-000-r640.rdu2.scalelab.redhat.com
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4# etcdctl endpoint status -w table --cluster
(the same three retry warnings and context-deadline-exceeded failures for unixs://[2000::ace5]:0, unixs://[2000::ace1]:0 and unixs://[2000::ace3]:0 were logged again)
ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 171765109  | 171765109          | alarm:NOSPACE on memberIDs 8730137547783813330, 14288348196171437800, 7377309355919782153
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | true      | false      | 16        | 171765385  | 171765385          | alarm:NOSPACE on memberIDs 7377309355919782153, 14288348196171437800, 8730137547783813330
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 171765714  | 171765714          | alarm:NOSPACE on memberIDs 8730137547783813330, 7377309355919782153, 14288348196171437800

The defrag didn't work - the DB size remained the same.
Running the following commands helped reduce the size:

sh-4.4# etcdctl compact $(etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*' -m1)
compacted revision 158774421
sh-4.4# etcdctl endpoint status -w json
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":14288348196171437800,"revision":158774421,"raft_term":16},"version":"3.4.14","dbSize":7475937280,"leader":8730137547783813330,"raftIndex":171897193,"raftTerm":16,"raftAppliedIndex":171897193,"errors":["memberID:14288348196171437800 alarm:NOSPACE ","memberID:8730137547783813330 alarm:NOSPACE ","memberID:7377309355919782153 alarm:NOSPACE "],"dbSizeInUse":826146816}}]
sh-4.4# etcdctl --command-timeout=120s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4# etcdctl endpoint status -w table
ENDPOINT         | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
127.0.0.1:2379   | c64a68fb43e456e8 | 3.4.14  | 577 MB  | false     | false      | 16        | 171903945  | 171903945          | alarm:NOSPACE on memberIDs 14288348196171437800, 8730137547783813330, 7377309355919782153
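Note that even after the compaction and defrag the ERRORS column still shows the NOSPACE alarms: etcd keeps the space-quota alarm raised (and the keyspace read-only) until it is explicitly cleared, and the defrag itself has to be repeated on each member against its own localhost endpoint. A minimal sketch of the remaining steps, assuming the same etcdctl shell inside the etcd pods (not taken from the report itself):

  # List and clear the space-quota alarms once the DB is back under the quota.
  # "alarm disarm" clears the alarms cluster-wide, so it can be run from any member.
  etcdctl alarm list
  etcdctl alarm disarm
  etcdctl alarm list   # should print nothing once the alarms are cleared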
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind that urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.
After review I found that a surge in manifestwork objects (over 2k writes per second) pushed the etcd store over the `ETCD_QUOTA_BACKEND_BYTES` quota of 7 GB in under 10 minutes. Because many of these writes were mutations of existing keys, the mitigation for the growth would be compaction; as compaction currently runs only every 10 minutes, it was a perfect storm. While I think it's important to understand the huge growth in manifestwork objects and to improve our ability to compact the store without dependencies, this looks like a very low-probability failure case, so I'm marking it a non-blocker for 4.8. The etcd team will outline a more complete RCA soon.
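As a rough back-of-envelope illustration of the above (the average retained size per update is an assumption for illustration, not a value measured on this cluster):

  # 2000 updates/s sustained over a 10-minute compaction window, assuming roughly
  # 8 KiB of retained history per update, already lands above the 7 GB quota.
  updates_per_sec=2000; window_sec=600; avg_kib_per_update=8
  echo "$(( updates_per_sec * window_sec * avg_kib_per_update / 1024 / 1024 )) GiB retained between compactions"   # ~9 GiB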
Moving to ACM team to triage state growth.
After defragmenting all etcd instances and disarming the alarms, the issue went away. Then I added 10 SNOs in bulk. The DB size jumped to:

sh-4.4# etcdctl endpoint status -w table
ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 4.3 GB  | false     | false      | 17        | 191609426  | 191609426          |
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 4.3 GB  | true      | false      | 17        | 191609428  | 191609428          |
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 4.3 GB  | false     | false      | 17        | 191609431  | 191609431          |

Then I edited the cluster's proxy (oc edit proxy cluster) to exclude an additional subnet from being proxied. A machineconfig update was triggered and the nodes were rebooted. One master doesn't come back:

[root@e24-h01-000-r640 f29-h01-000-r640]# oc get node
NAME                                        STATUS                        ROLES    AGE   VERSION
e24-h02-000-r640.rdu2.scalelab.redhat.com   NotReady,SchedulingDisabled   master   20d   v1.21.0-rc.0+766a5fe
e24-h03-000-r640.rdu2.scalelab.redhat.com   Ready                         worker   20d   v1.21.0-rc.0+766a5fe
e24-h05-000-r640.rdu2.scalelab.redhat.com   Ready                         master   20d   v1.21.0-rc.0+766a5fe
e24-h06-000-r640.rdu2.scalelab.redhat.com   Ready                         worker   20d   v1.21.0-rc.0+766a5fe
e24-h07-000-r640.rdu2.scalelab.redhat.com   Ready                         master   20d   v1.21.0-rc.0+766a5fe

and the issue is back:

ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 195391046  | 195391046          | alarm:NOSPACE on memberIDs 7377309355919782153, 14288348196171437800, 8730137547783813330
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 195391348  | 195391348          | alarm:NOSPACE on memberIDs 7377309355919782153, 14288348196171437800, 8730137547783813330
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | true      | false      | 20        | 195391594  | 195391594          | alarm:NOSPACE on memberIDs 8730137547783813330, 7377309355919782153, 14288348196171437800

sh-4.4# etcdctl endpoint status -w json
[{"Endpoint":"https://[2000::ace1]:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":8730137547783813330,"revision":163718978,"raft_term":20},"version":"3.4.14","dbSize":7515844608,"leader":14288348196171437800,"raftIndex":195412664,"raftTerm":20,"raftAppliedIndex":195412664,"errors":["memberID:8730137547783813330 alarm:NOSPACE ","memberID:7377309355919782153 alarm:NOSPACE ","memberID:14288348196171437800 alarm:NOSPACE "],"dbSizeInUse":7510331392}},{"Endpoint":"https://[2000::ace3]:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":14288348196171437800,"revision":163718978,"raft_term":20},"version":"3.4.14","dbSize":7515844608,"leader":14288348196171437800,"raftIndex":195412664,"raftTerm":20,"raftAppliedIndex":195412664,"errors":["memberID:7377309355919782153 alarm:NOSPACE ","memberID:14288348196171437800 alarm:NOSPACE ","memberID:8730137547783813330 alarm:NOSPACE "],"dbSizeInUse":7510331392}},{"Endpoint":"https://[2000::ace5]:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":7377309355919782153,"revision":163718978,"raft_term":20},"version":"3.4.14","dbSize":7516975104,"leader":14288348196171437800,"raftIndex":195412664,"raftTerm":20,"raftAppliedIndex":195412664,"errors":["memberID:7377309355919782153 alarm:NOSPACE ","memberID:14288348196171437800 alarm:NOSPACE ","memberID:8730137547783813330 alarm:NOSPACE "],"dbSizeInUse":7510298624}}]
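Since the earlier analysis points at manifestwork churn, a quick way to see how much ManifestWork state the hub is carrying is to count the objects per namespace (each managed cluster gets its own namespace). This is a hedged sketch, not a command from the report, and it assumes the work.open-cluster-management.io ManifestWork CRD is what is churning:

  # Total ManifestWork objects on the hub, then the busiest namespaces.
  oc get manifestwork -A --no-headers | wc -l
  oc get manifestwork -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head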
After defragmenting the DB, I noticed that the size keeps growing without any changes on my end. In ~20 minutes the size was already:

ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.0 GB  | false     | false      | 20        | 196212752  | 196212752          |
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.1 GB  | false     | false      | 20        | 196217519  | 196217479          |
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.1 GB  | true      | false      | 20        | 196222207  | 196222170          |
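To confirm that this regrowth is new write traffic (rising dbSizeInUse) rather than fragmentation, the endpoint status can be sampled periodically. A minimal sketch, assuming the same etcdctl shell inside an etcd pod:

  # Sample total vs in-use DB size once a minute for ten minutes.
  for i in $(seq 10); do
    date
    etcdctl --endpoints=https://localhost:2379 endpoint status -w json \
      | egrep -o '"dbSize(InUse)?":[0-9]*'
    sleep 60
  done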
and the issue is back:

ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 196469377  | 196469377          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 196469630  | 196469630          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | true      | false      | 20        | 196469893  | 196469893          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
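When the NOSPACE alarms recur like this, it can also help to see which resource prefixes dominate the keyspace. The following is a hedged sketch, run from the etcdctl shell; it assumes OpenShift's default /kubernetes.io/ etcd key prefix, and the grouping is only approximate because built-in and custom resources lay out their keys slightly differently:

  # List keys only (no values) and group by the path segment after /kubernetes.io/.
  etcdctl get /kubernetes.io/ --prefix --keys-only \
    | sed '/^$/d' | awk -F/ '{print $3}' | sort | uniq -c | sort -rn | head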
(In reply to Sam Batschelet from comment #4)
> After review I found that a surge in manifestwork objects (over 2k writes per
> second) pushed the etcd store over the `ETCD_QUOTA_BACKEND_BYTES` quota of
> 7 GB in under 10 minutes. Because many of these writes were mutations of
> existing keys, the mitigation for the growth would be compaction; as
> compaction currently runs only every 10 minutes, it was a perfect storm.
> While I think it's important to understand the huge growth in manifestwork
> objects and to improve our ability to compact the store without dependencies,
> this looks like a very low-probability failure case, so I'm marking it a
> non-blocker for 4.8. The etcd team will outline a more complete RCA soon.

I'm not sure it should be considered a blocker on the OCP 4.8 etcd side, but I still think it should be tracked as a blocker for the Telco release, specifically around the SE for VZ in 4.8.
This issue has been fixed. The root cause is that a ManifestWork containing two versions of the same CRD results in update conflicts and triggers extra resyncs, which creates additional requests to the apiserver. More details can be found here: https://github.com/open-cluster-management/backlog/issues/14345.
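Given that root cause, a hedged way to spot which ManifestWorks embed CRDs at all (so the duplicated versions can be inspected by hand) is sketched below. It assumes jq is available and that the ManifestWork API keeps its embedded objects under .spec.workload.manifests[]; neither comes from the report itself:

  # List ManifestWorks that carry CustomResourceDefinition manifests.
  oc get manifestwork -A -o json \
    | jq -r '.items[]
             | select(any(.spec.workload.manifests[]?; .kind == "CustomResourceDefinition"))
             | "\(.metadata.namespace)/\(.metadata.name)"'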
Verified with ACM 2.3.0-DOWNSTREAM-2021-07-26-18-43-26. With 2k SNOs, we applied a machineconfig to make the hub restart; we are not seeing an increase in manifestwork update requests, and the etcd DB size does not grow to 7.5 GB. We had two machineconfig update events: with the first one we did see around 300 MB of etcd DB size growth, which can be related to the applied machineconfig updating the status of some of the SNOs. The second machineconfig update happened one day after the first, and we didn't see any etcd DB size growth.
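For anyone re-running this verification, the manifestwork write pressure can also be sampled directly from the kube-apiserver's request counters before and after the machineconfig rollout and compared. This is a hedged sketch using the stock Kubernetes apiserver_request_total metric (not a command used in this verification):

  # Snapshot the apiserver request counters for manifestworks; run before and
  # after the rollout and diff the counter values, alongside the etcd DB size checks above.
  oc get --raw /metrics | grep '^apiserver_request_total' | grep 'resource="manifestworks"'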
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Advanced Cluster Management for Kubernetes version 2.3), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3016