Bug 1492891 - etcd writes being blocked when hard quota hit
Summary: etcd writes being blocked when hard quota hit
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Chaloupka
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-18 21:55 UTC by Max Whittingham
Modified: 2023-09-14 04:08 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The etcd backend quota defaulted to 2GB. When a cluster hit this hard quota, it went into a hold state that blocked all writes to etcd storage. The default backend quota has been increased to 4GB to accommodate the storage needs of larger clusters.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:11:25 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2017:3188 (normal, SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update. Last Updated: 2017-11-29 02:34:54 UTC

Description Max Whittingham 2017-09-18 21:55:30 UTC
Description of problem:
etcd writes are blocked when the etcd_quota_backend_quota limit is hit.

Version-Release number of selected component (if applicable):
OpenShift: 3.6.173.0.5
etcd: 3.1.9


Actual results:
etcdctl3 alarm list


etcdctl3 endpoint status showed hitting the 2GB limit
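
For reference, a hedged sketch of the checks above (assuming etcdctl3 wraps the etcd v3 API client as configured on OpenShift masters; the endpoint and certificate paths are illustrative):

# List active alarms; a NOSPACE alarm means the backend quota was exceeded
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key \
    alarm list

# Show per-endpoint status, including the current backend database size
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key \
    endpoint status -w table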

Comment 1 Clayton Coleman 2017-09-18 22:06:46 UTC
The default quota is 2GB, which we hit on us-east-1 and came very close to hitting on us-west-2 and us-west-1. We increased the limit to 4GB. We should increase the out-of-the-box limit to 8GB for all starter clusters (which is the etcd3 suggested maximum) and add alerting to warn us when we cross the 4GB threshold.

For end users, we should probably increase the default to 4GB or 8GB and also recommend alerting on the size value.

The size of the /var/lib/etcd/member/db file should be very close to the internal size, and it's likely we can report on that as a proxy if it is simpler.
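
A hedged sketch of such a proxy check (the 4GB alert threshold and the warning text are illustrative, not existing tooling):

# The boltdb file closely tracks etcd's reported backend size, so its
# on-disk size can serve as a simple proxy for quota monitoring
db=/var/lib/etcd/member/db
size=$(stat -c%s "$db")
if [ "$size" -gt $((4 * 1024 * 1024 * 1024)) ]; then
    echo "WARNING: etcd db is ${size} bytes, above the 4GB alert threshold"
fi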

Comment 2 Clayton Coleman 2017-09-18 22:33:40 UTC
Hard quota blocks all writes when hit, which means the cluster goes into a hold state until the quota is increased.
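
Per the upstream etcd maintenance documentation, the cluster stays in that read-only state until space is reclaimed (or the quota is raised) and the alarm is cleared; a minimal sketch, assuming the v3 etcdctl environment shown earlier:

# Once compaction/defragmentation has freed space or the quota has been
# raised, clear the NOSPACE alarm so writes can resume
ETCDCTL_API=3 etcdctl alarm disarm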

Comment 5 Scott Dodson 2017-09-22 14:30:38 UTC
It looks like we're not doing anything to compact old revisions as outlined in the etcd maintenance documentation. Shouldn't we be enabling that so that we don't grow unbounded?

Relevant config vars from that cluster

ETCD_QUOTA_BACKEND_BYTES=4294967000
ETCD_DATA_DIR=/var/lib/etcd/
ETCD_SNAPSHOT_COUNT=60000
ETCD_HEARTBEAT_INTERVAL=2000
ETCD_ELECTION_TIMEOUT=20000
ETCD_DEBUG=false
ETCD_LOG_PACKAGE_LEVELS="*=INFO"


https://coreos.com/etcd/docs/latest/op-guide/maintenance.html
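
The maintenance guide linked above covers both manual and automatic compaction; a hedged sketch of the manual procedure (endpoint and certificate flags omitted for brevity):

# Fetch the current revision, then compact away all older revisions
rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
ETCDCTL_API=3 etcdctl compact "$rev"

# Defragment each member to return the freed space to the filesystem
ETCDCTL_API=3 etcdctl defrag

Automatic compaction can instead be enabled with ETCD_AUTO_COMPACTION_RETENTION (hours of history to retain) in /etc/etcd/etcd.conf.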

Comment 6 Jan Chaloupka 2017-09-25 09:58:57 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/5518

Comment 7 openshift-github-bot 2017-09-27 22:59:14 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/ead22bb1c3b6d6366502b14b97b7aae5605f8a58
Merge pull request #5518 from ingvagabund/set-quota-backend-bytes-explicitly

Automatic merge from submit-queue

set the etcd backend quota to 4GB by default

Bug: 1492891
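
For deployments that need a different value, the quota should also be settable explicitly in the Ansible inventory; a hedged sketch (the variable name is inferred from the PR branch name and should be verified against the openshift-ansible role defaults):

[OSEv3:vars]
# Value in bytes; 4294967296 = 4GB, matching the new default
etcd_quota_backend_bytes=4294967296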

Comment 8 Gaoyun Pei 2017-10-11 06:21:30 UTC
Verified this bug with openshift-ansible-3.7.0-0.143.2.git.0.39404c5.el7.noarch.rpm.

The etcd backend quota is now set to 4GB by default:

[root@ip-172-18-9-157 ~]# grep ETCD_QUOTA_BACKEND_BYTES /etc/etcd/etcd.conf
ETCD_QUOTA_BACKEND_BYTES=4294967296

Comment 12 errata-xmlrpc 2017-11-28 22:11:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

Comment 13 Red Hat Bugzilla 2023-09-14 04:08:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

