Version: quay.io/acm-d/acm-custom-registry:2.3.0-DOWNSTREAM-2021-06-28-15-34-59
OCP: 4.8.0-rc.1-x86_64

Steps to reproduce:
Deploy OCP with ACM/Assisted Installer. Deploy ~1000 spoke SNOs through ACM/Assisted Installer over the course of two weeks (the machines were received in portions, so the spokes were also added in portions over that period).

oc get managedclusters | wc -l
949

Tried adding a MachineConfig (adding an additional route) and the change never completed:

oc get node
NAME                                        STATUS                        ROLES    AGE   VERSION
e24-h02-000-r640.rdu2.scalelab.redhat.com   Ready                         master   16d   v1.21.0-rc.0+766a5fe
e24-h03-000-r640.rdu2.scalelab.redhat.com   Ready                         worker   16d   v1.21.0-rc.0+766a5fe
e24-h05-000-r640.rdu2.scalelab.redhat.com   Ready                         master   16d   v1.21.0-rc.0+766a5fe
e24-h06-000-r640.rdu2.scalelab.redhat.com   Ready,SchedulingDisabled      worker   16d   v1.21.0-rc.0+766a5fe
e24-h07-000-r640.rdu2.scalelab.redhat.com   NotReady,SchedulingDisabled   master   16d   v1.21.0-rc.0+766a5fe

Tried to delete a stuck pod:

[root@e24-h01-000-r640 ~]# oc delete pod -n openshift-nmstate nmstate-handler-z6zqm --force
Error from server: etcdserver: mvcc: database space exceeded

Went to check etcd:

[root@e24-h01-000-r640 ~]# oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
etcd-e24-h02-000-r640.rdu2.scalelab.redhat.com   4/4   Running   0   16d   2000::ace1   e24-h02-000-r640.rdu2.scalelab.redhat.com   <none>   <none>
etcd-e24-h05-000-r640.rdu2.scalelab.redhat.com   4/4   Running   4   16d   2000::ace3   e24-h05-000-r640.rdu2.scalelab.redhat.com   <none>   <none>
etcd-e24-h07-000-r640.rdu2.scalelab.redhat.com   4/4   Running   0   16d   2000::ace5   e24-h07-000-r640.rdu2.scalelab.redhat.com   <none>   <none>

[root@e24-h01-000-r640 ~]# oc rsh -n openshift-etcd etcd-e24-h07-000-r640.rdu2.scalelab.redhat.com etcdctl endpoint status --cluster -w table
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
{"level":"warn","ts":"2021-07-16T04:42:10.108Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///unixs://[2000::ace5]:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial unix [2000::ace5]:0: connect: no such file or directory\""}
Failed to get the status of endpoint unixs://[2000::ace5]:0 (context deadline exceeded)
(the same retry warning and context-deadline-exceeded failure were also logged for unixs://[2000::ace1]:0 and unixs://[2000::ace3]:0)
ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 169867226  | 169867226          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | true      | false      | 16        | 169867531  | 169867530          | alarm:NOSPACE on memberIDs 14288348196171437800, 8730137547783813330, 7377309355919782153
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 169867774  | 169867774          | alarm:NOSPACE on memberIDs 14288348196171437800, 8730137547783813330, 7377309355919782153
command terminated with exit code 1

Tried to defrag using this procedure:
https://docs.openshift.com/container-platform/4.7/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks

[root@e24-h01-000-r640 ~]# oc rsh -n openshift-etcd etcd-e24-h05-000-r640.rdu2.scalelab.redhat.com
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4# etcdctl endpoint status -w table --cluster
(the same three retry warnings and context-deadline-exceeded failures for unixs://[2000::ace5]:0, unixs://[2000::ace1]:0 and unixs://[2000::ace3]:0 were logged again)
ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 171765109  | 171765109          | alarm:NOSPACE on memberIDs 8730137547783813330, 14288348196171437800, 7377309355919782153
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | true      | false      | 16        | 171765385  | 171765385          | alarm:NOSPACE on memberIDs 7377309355919782153, 14288348196171437800, 8730137547783813330
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | false     | false      | 16        | 171765714  | 171765714          | alarm:NOSPACE on memberIDs 8730137547783813330, 7377309355919782153, 14288348196171437800

The defrag didn't work - the DB size remained the same.
Running the following commands helped reduce the size:

sh-4.4# etcdctl compact $(etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*' -m1)
compacted revision 158774421
sh-4.4# etcdctl endpoint status -w json
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":14288348196171437800,"revision":158774421,"raft_term":16},"version":"3.4.14","dbSize":7475937280,"leader":8730137547783813330,"raftIndex":171897193,"raftTerm":16,"raftAppliedIndex":171897193,"errors":["memberID:14288348196171437800 alarm:NOSPACE ","memberID:8730137547783813330 alarm:NOSPACE ","memberID:7377309355919782153 alarm:NOSPACE "],"dbSizeInUse":826146816}}]
sh-4.4# etcdctl --command-timeout=120s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4# etcdctl endpoint status -w table
ENDPOINT         | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
127.0.0.1:2379   | c64a68fb43e456e8 | 3.4.14  | 577 MB  | false     | false      | 16        | 171903945  | 171903945          | alarm:NOSPACE on memberIDs 14288348196171437800, 8730137547783813330, 7377309355919782153
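Note that even after the compaction and defrag the ERRORS column still shows the NOSPACE alarms: etcd keeps the space-quota alarm raised (and the keyspace read-only) until it is explicitly cleared, and the defrag itself has to be repeated on each member against its own localhost endpoint. A minimal sketch of the remaining steps, assuming the same etcdctl shell inside the etcd pods (not taken from the report itself):

  # List and clear the space-quota alarms once the DB is back under the quota.
  # "alarm disarm" clears the alarms cluster-wide, so it can be run from any member.
  etcdctl alarm list
  etcdctl alarm disarm
  etcdctl alarm list   # should print nothing once the alarms are cleared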
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind that urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.
After review I found that a surge in manifestwork objects (over 2k writes per second) pushed the etcd store over the `ETCD_QUOTA_BACKEND_BYTES` quota of 7 GB in under 10 minutes. Because many of these writes were mutations of existing keys, the mitigation for the growth would be compaction; as compaction currently runs only every 10 minutes, it was a perfect storm. While I think it's important to understand the huge growth in manifestwork objects and to improve our ability to compact the store without dependencies, this looks like a very low-probability failure case, so I'm marking it a non-blocker for 4.8. The etcd team will outline a more complete RCA soon.
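As a rough back-of-envelope illustration of the above (the average retained size per update is an assumption for illustration, not a value measured on this cluster):

  # 2000 updates/s sustained over a 10-minute compaction window, assuming roughly
  # 8 KiB of retained history per update, already lands above the 7 GB quota.
  updates_per_sec=2000; window_sec=600; avg_kib_per_update=8
  echo "$(( updates_per_sec * window_sec * avg_kib_per_update / 1024 / 1024 )) GiB retained between compactions"   # ~9 GiB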
Moving to ACM team to triage state growth.
After defragmenting all etcd instances and disarming the alarms, the issue went away. Then I added 10 SNOs in bulk. The DB size jumped to:

sh-4.4# etcdctl endpoint status -w table
ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 4.3 GB  | false     | false      | 17        | 191609426  | 191609426          |
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 4.3 GB  | true      | false      | 17        | 191609428  | 191609428          |
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 4.3 GB  | false     | false      | 17        | 191609431  | 191609431          |

Then I edited the cluster's proxy (oc edit proxy cluster) to exclude an additional subnet from being proxied. A machineconfig update was triggered and the nodes were rebooted. One master doesn't come back:

[root@e24-h01-000-r640 f29-h01-000-r640]# oc get node
NAME                                        STATUS                        ROLES    AGE   VERSION
e24-h02-000-r640.rdu2.scalelab.redhat.com   NotReady,SchedulingDisabled   master   20d   v1.21.0-rc.0+766a5fe
e24-h03-000-r640.rdu2.scalelab.redhat.com   Ready                         worker   20d   v1.21.0-rc.0+766a5fe
e24-h05-000-r640.rdu2.scalelab.redhat.com   Ready                         master   20d   v1.21.0-rc.0+766a5fe
e24-h06-000-r640.rdu2.scalelab.redhat.com   Ready                         worker   20d   v1.21.0-rc.0+766a5fe
e24-h07-000-r640.rdu2.scalelab.redhat.com   Ready                         master   20d   v1.21.0-rc.0+766a5fe

and the issue is back:

ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 195391046  | 195391046          | alarm:NOSPACE on memberIDs 7377309355919782153, 14288348196171437800, 8730137547783813330
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 195391348  | 195391348          | alarm:NOSPACE on memberIDs 7377309355919782153, 14288348196171437800, 8730137547783813330
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | true      | false      | 20        | 195391594  | 195391594          | alarm:NOSPACE on memberIDs 8730137547783813330, 7377309355919782153, 14288348196171437800

sh-4.4# etcdctl endpoint status -w json
[{"Endpoint":"https://[2000::ace1]:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":8730137547783813330,"revision":163718978,"raft_term":20},"version":"3.4.14","dbSize":7515844608,"leader":14288348196171437800,"raftIndex":195412664,"raftTerm":20,"raftAppliedIndex":195412664,"errors":["memberID:8730137547783813330 alarm:NOSPACE ","memberID:7377309355919782153 alarm:NOSPACE ","memberID:14288348196171437800 alarm:NOSPACE "],"dbSizeInUse":7510331392}},{"Endpoint":"https://[2000::ace3]:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":14288348196171437800,"revision":163718978,"raft_term":20},"version":"3.4.14","dbSize":7515844608,"leader":14288348196171437800,"raftIndex":195412664,"raftTerm":20,"raftAppliedIndex":195412664,"errors":["memberID:7377309355919782153 alarm:NOSPACE ","memberID:14288348196171437800 alarm:NOSPACE ","memberID:8730137547783813330 alarm:NOSPACE "],"dbSizeInUse":7510331392}},{"Endpoint":"https://[2000::ace5]:2379","Status":{"header":{"cluster_id":15048564470140409604,"member_id":7377309355919782153,"revision":163718978,"raft_term":20},"version":"3.4.14","dbSize":7516975104,"leader":14288348196171437800,"raftIndex":195412664,"raftTerm":20,"raftAppliedIndex":195412664,"errors":["memberID:7377309355919782153 alarm:NOSPACE ","memberID:14288348196171437800 alarm:NOSPACE ","memberID:8730137547783813330 alarm:NOSPACE "],"dbSizeInUse":7510298624}}]
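Since the earlier analysis points at manifestwork churn, a quick way to see how much ManifestWork state the hub is carrying is to count the objects per namespace (each managed cluster gets its own namespace). This is a hedged sketch, not a command from the report, and it assumes the work.open-cluster-management.io ManifestWork CRD is what is churning:

  # Total ManifestWork objects on the hub, then the busiest namespaces.
  oc get manifestwork -A --no-headers | wc -l
  oc get manifestwork -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head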
After defragmenting the DB, I noticed that the size keeps growing without any changes on my end. In ~20 minutes the size was already:

ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.0 GB  | false     | false      | 20        | 196212752  | 196212752          |
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.1 GB  | false     | false      | 20        | 196217519  | 196217479          |
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.1 GB  | true      | false      | 20        | 196222207  | 196222170          |
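To confirm that this regrowth is new write traffic (rising dbSizeInUse) rather than fragmentation, the endpoint status can be sampled periodically. A minimal sketch, assuming the same etcdctl shell inside an etcd pod:

  # Sample total vs in-use DB size once a minute for ten minutes.
  for i in $(seq 10); do
    date
    etcdctl --endpoints=https://localhost:2379 endpoint status -w json \
      | egrep -o '"dbSize(InUse)?":[0-9]*'
    sleep 60
  done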
and the issue is back:

ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS
https://[2000::ace5]:2379   | 666177c5c5c3b109 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 196469377  | 196469377          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
https://[2000::ace1]:2379   | 7927add86ff438d2 | 3.4.14  | 7.5 GB  | false     | false      | 20        | 196469630  | 196469630          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
https://[2000::ace3]:2379   | c64a68fb43e456e8 | 3.4.14  | 7.5 GB  | true      | false      | 20        | 196469893  | 196469893          | alarm:NOSPACE on memberIDs 7377309355919782153, 8730137547783813330, 14288348196171437800
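When the NOSPACE alarms recur like this, it can also help to see which resource prefixes dominate the keyspace. The following is a hedged sketch, run from the etcdctl shell; it assumes OpenShift's default /kubernetes.io/ etcd key prefix, and the grouping is only approximate because built-in and custom resources lay out their keys slightly differently:

  # List keys only (no values) and group by the path segment after /kubernetes.io/.
  etcdctl get /kubernetes.io/ --prefix --keys-only \
    | sed '/^$/d' | awk -F/ '{print $3}' | sort | uniq -c | sort -rn | head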
(In reply to Sam Batschelet from comment #4)
> After review I found that a surge in manifestwork objects (over 2k writes per
> second) pushed the etcd store over the `ETCD_QUOTA_BACKEND_BYTES` quota of
> 7 GB in under 10 minutes. Because many of these writes were mutations of
> existing keys, the mitigation for the growth would be compaction; as
> compaction currently runs only every 10 minutes, it was a perfect storm.
> While I think it's important to understand the huge growth in manifestwork
> objects and to improve our ability to compact the store without dependencies,
> this looks like a very low-probability failure case, so I'm marking it a
> non-blocker for 4.8. The etcd team will outline a more complete RCA soon.

I'm not sure it should be considered a blocker on the OCP 4.8 etcd side, but I still think it should be tracked as a blocker for the Telco release, specifically around the SE for VZ in 4.8.
This issue has been fixed. The root cause is that a ManifestWork containing two versions of the same CRD results in update conflicts and triggers extra resyncs, which creates additional requests to the apiserver. More details can be found here: https://github.com/open-cluster-management/backlog/issues/14345.
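Given that root cause, a hedged way to spot which ManifestWorks embed CRDs at all (so the duplicated versions can be inspected by hand) is sketched below. It assumes jq is available and that the ManifestWork API keeps its embedded objects under .spec.workload.manifests[]; neither comes from the report itself:

  # List ManifestWorks that carry CustomResourceDefinition manifests.
  oc get manifestwork -A -o json \
    | jq -r '.items[]
             | select(any(.spec.workload.manifests[]?; .kind == "CustomResourceDefinition"))
             | "\(.metadata.namespace)/\(.metadata.name)"'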
Verified with ACM 2.3.0-DOWNSTREAM-2021-07-26-18-43-26. With 2k SNOs, we applied a machineconfig to make the hub restart; we are not seeing an increase in manifestwork update requests, and the etcd DB size does not grow to 7.5 GB. We had two machineconfig update events: with the first one we did see around 300 MB of etcd DB size growth, which can be related to the applied machineconfig updating the status of some of the SNOs. The second machineconfig update happened one day after the first, and we didn't see any etcd DB size growth.
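For anyone re-running this verification, the manifestwork write pressure can also be sampled directly from the kube-apiserver's request counters before and after the machineconfig rollout and compared. This is a hedged sketch using the stock Kubernetes apiserver_request_total metric (not a command used in this verification):

  # Snapshot the apiserver request counters for manifestworks; run before and
  # after the rollout and diff the counter values, alongside the etcd DB size checks above.
  oc get --raw /metrics | grep '^apiserver_request_total' | grep 'resource="manifestworks"'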
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Advanced Cluster Management for Kubernetes version 2.3), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3016