Bug 1753348 - Etcd performance degradation and etcd memory usage spiking after upgrade to 4.1.14
Summary: Etcd performance degradation and etcd memory usage spiking after upgrade to 4.1.14
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
Depends On:
Reported: 2019-09-18 16:22 UTC by Chance Zibolski
Modified: 2020-04-13 20:56 UTC
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2020-02-21 10:44:10 UTC
Target Upstream Version:

Attachments (Terms of Use)
etcd-memory-usage-spikes (421.09 KB, image/png)
2019-09-18 16:23 UTC, Chance Zibolski

Description Chance Zibolski 2019-09-18 16:22:45 UTC
Description of problem: I have a 4.1.16 cluster that has been experiencing etcd performance degradation since September 11th, shortly after finishing an upgrade to 4.1.14 and starting another upgrade to 4.1.15. The issue persists even now, after upgrading to 4.1.16.

The etcd pods in my cluster have shown rapid growth in memory usage starting on the 11th; prior to that, usage was stable and low. Additionally, I am regularly receiving Prometheus alerts indicating slow etcd communication.
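
For reference, the slow-communication alerts can be cross-checked in the Prometheus console with a query along these lines (a sketch only; the exact expression behind the alert may differ by release):

    # p99 round-trip time between etcd peers over the last 5 minutes
    histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))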

My recent update history:

    - completionTime: "2019-09-14T01:25:04Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-14T00:03:21Z"
      state: Completed
      verified: true
      version: 4.1.16
    - completionTime: "2019-09-14T00:03:21Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-11T19:58:11Z"
      state: Completed
      verified: true
      version: 4.1.15
    - completionTime: "2019-09-11T19:58:11Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
      startedTime: "2019-09-05T08:41:49Z"
      state: Completed
      verified: true
      version: 4.1.14
    - completionTime: "2019-09-05T08:41:49Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
      startedTime: "2019-08-27T18:36:55Z"
      state: Completed
      verified: true
      version: 4.1.13

ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Version-Release number of selected component (if applicable): OCP 4.1.16

How reproducible: This cluster is a long-lived dev cluster for the metering team; it is currently running and still experiencing the issue. Reach out to chancez on CoreOS Slack or send me an email for access to debug.

Steps to Reproduce:
1. Upgrade the cluster
2. ???
3. Look at etcd memory usage
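
To observe the memory behavior in step 3, queries like the following against the cluster's Prometheus are one option (a sketch; the job="etcd" label and the openshift-etcd namespace are assumptions and may vary by release):

    # Resident memory reported by each etcd member, in bytes
    process_resident_memory_bytes{job="etcd"}

    # Container-level working-set memory for the etcd pods (cAdvisor)
    container_memory_working_set_bytes{namespace="openshift-etcd"}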

Actual results: Etcd memory usage grows from ~6 GB to ~18 GB over roughly 2 hours before dropping back down. Near the peak of the spike, and during the period when usage is dropping, I can see Prometheus alerts firing on etcd communication latency.

Expected results: Etcd should remain stable and maintain existing performance characteristics.

Additional info: ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Comment 1 Chance Zibolski 2019-09-18 16:23:31 UTC
Created attachment 1616301

2 week view of etcd's memory usage is attached.
