Bug 1753348

Summary: Etcd performance degradation and memory usage spiking after upgrade to 4.1.14
Product: OpenShift Container Platform
Component: Etcd
Version: 4.1.z
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: unspecified
Status: CLOSED CURRENTRELEASE
Reporter: Chance Zibolski <chancez>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: aos-bugs, dmoessne, gblomqui, mfojtik, skolicha, sttts
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-02-21 10:44:10 UTC

Attachments: etcd-memory-usage-spikes

Description Chance Zibolski 2019-09-18 16:22:45 UTC
Description of problem: I have a 4.1.16 cluster that has been experiencing etcd performance degradation since September 11th, shortly after finishing an upgrade to 4.1.14 and starting another to 4.1.15. The issue persists even now, after upgrading to 4.1.16.

The etcd pods in my cluster have shown rapid growth in memory usage since the 11th; before then, usage was stable and low. I am also regularly receiving Prometheus alerts indicating slow etcd communication.
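
For reference, the memory numbers here come from the cluster's Prometheus. A minimal sketch of such a query, assuming a reachable openshift-monitoring Prometheus route and a token with monitoring access; the PROM_URL and TOKEN values below are placeholders, and the job="etcd" selector is the usual label for etcd metrics but may differ per cluster:

    import requests

    # Placeholder endpoint and token -- substitute your cluster's
    # Prometheus route and a token with monitoring access.
    PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
    TOKEN = "REDACTED"

    # Resident memory of each etcd member, as reported by etcd itself.
    QUERY = 'process_resident_memory_bytes{job="etcd"}'

    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": QUERY},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # self-signed router cert on this dev cluster
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        member = result["metric"].get("instance", "unknown")
        gib = float(result["value"][1]) / 2**30
        print(f"{member}: {gib:.1f} GiB resident")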

My recent update history:

    history:
    - completionTime: "2019-09-14T01:25:04Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-14T00:03:21Z"
      state: Completed
      verified: true
      version: 4.1.16
    - completionTime: "2019-09-14T00:03:21Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-11T19:58:11Z"
      state: Completed
      verified: true
      version: 4.1.15
    - completionTime: "2019-09-11T19:58:11Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
      startedTime: "2019-09-05T08:41:49Z"
      state: Completed
      verified: true
      version: 4.1.14
    - completionTime: "2019-09-05T08:41:49Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
      startedTime: "2019-08-27T18:36:55Z"
      state: Completed
      verified: true
      version: 4.1.13

ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9


Version-Release number of selected component (if applicable): OCP 4.1.16


How reproducible: This is a long-lived dev cluster for the metering team; it is currently running and experiencing the issue. Reach out to chancez on CoreOS Slack or send me an email for access to debug.


Steps to Reproduce:
1. Upgrade the cluster
2. ???
3. Look at etcd memory usage

Actual results: Etcd memory usage grows from ~6 GB to ~18 GB over about 2 hours before dropping back down. Near the peak of the spike, and while usage is dropping, I can see Prometheus alerts firing on etcd communication latency.
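
The communication-latency alerts can be cross-checked against etcd's own histograms. A sketch along the same lines as the query above; etcd_disk_wal_fsync_duration_seconds_bucket and etcd_network_peer_round_trip_time_seconds_bucket are standard etcd metrics, while the 5m window, 0.99 quantile, and the thresholds in the comments are illustrative assumptions:

    import requests

    # Same placeholder endpoint/token as the memory query above.
    PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
    TOKEN = "REDACTED"

    # p99 latency per etcd member over the last 5 minutes. As a rough rule
    # of thumb, sustained WAL fsync p99 above ~10 ms or peer RTT p99 above
    # ~50 ms lines up with etcd slow-communication alerts.
    LATENCY_QUERIES = {
        "wal_fsync_p99": (
            "histogram_quantile(0.99, "
            "rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))"
        ),
        "peer_rtt_p99": (
            "histogram_quantile(0.99, "
            "rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))"
        ),
    }

    for name, query in LATENCY_QUERIES.items():
        resp = requests.get(
            f"{PROM_URL}/api/v1/query",
            params={"query": query},
            headers={"Authorization": f"Bearer {TOKEN}"},
            verify=False,  # self-signed router cert on this dev cluster
        )
        resp.raise_for_status()
        for result in resp.json()["data"]["result"]:
            member = result["metric"].get("instance", "unknown")
            ms = float(result["value"][1]) * 1000
            print(f"{name} {member}: {ms:.1f} ms")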


Expected results: Etcd should remain stable and maintain existing performance characteristics.


Additional info: ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Comment 1 Chance Zibolski 2019-09-18 16:23:31 UTC
Created attachment 1616301
etcd-memory-usage-spikes

A 2-week view of etcd's memory usage is attached.
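
The attached graph can be approximated from raw data with a Prometheus range query; a sketch, reusing the same placeholder endpoint and token as the snippets above, with an assumed 1-hour step:

    import datetime
    import requests

    # Same placeholders as above.
    PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
    TOKEN = "REDACTED"

    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(days=14)

    # One sample per hour over the last two weeks, per etcd member.
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={
            "query": 'process_resident_memory_bytes{job="etcd"}',
            "start": start.timestamp(),
            "end": end.timestamp(),
            "step": "1h",
        },
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # self-signed router cert on this dev cluster
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        member = series["metric"].get("instance", "unknown")
        peak_gib = max(float(v) for _, v in series["values"]) / 2**30
        print(f"{member}: peak {peak_gib:.1f} GiB over 2 weeks")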