Description of problem:
I have a 4.1.16 cluster that has been experiencing etcd performance degradation since September 11th, shortly after finishing an upgrade from 4.1.14 and starting another to 4.1.15. The issue persists even now, after upgrading to 4.1.16. The etcd pods in the cluster have shown rapid growth in memory usage starting on the 11th; prior to that, usage was stable and low. Additionally, I am regularly receiving Prometheus alerts on etcd indicating slow communication.

My recent update history:

  history:
  - completionTime: "2019-09-14T01:25:04Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
    startedTime: "2019-09-14T00:03:21Z"
    state: Completed
    verified: true
    version: 4.1.16
  - completionTime: "2019-09-14T00:03:21Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
    startedTime: "2019-09-11T19:58:11Z"
    state: Completed
    verified: true
    version: 4.1.15
  - completionTime: "2019-09-11T19:58:11Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
    startedTime: "2019-09-05T08:41:49Z"
    state: Completed
    verified: true
    version: 4.1.14
  - completionTime: "2019-09-05T08:41:49Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
    startedTime: "2019-08-27T18:36:55Z"
    state: Completed
    verified: true
    version: 4.1.13

ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Version-Release number of selected component (if applicable):
OCP 4.1.16

How reproducible:
This cluster is a long-lived dev cluster for the metering team; it is currently running and experiencing the issue. Reach out to chancez on CoreOS Slack or send me an email for access to debug.

Steps to Reproduce:
1. Upgrade the cluster
2. ???
3. Look at etcd memory usage

Actual results:
etcd memory usage grows from ~6GB to ~18GB over roughly 2 hours before dropping back down. Around the peak of the spike, and while usage is dropping, I can see Prometheus alerts firing about etcd communication latency.

Expected results:
etcd should remain stable and maintain its existing performance characteristics.

Additional info:
ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9
Created attachment 1616301 [details]
etcd-memory-usage-spikes

A 2-week view of etcd's memory usage is attached.
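For reference, a rough sketch of how to look at this yourself, based on my assumptions about the names used on this release (I'm assuming the etcd static pods live in the openshift-etcd namespace with a container named etcd-member, and that the cAdvisor series use namespace/pod/container labels rather than the older pod_name/container_name):

  # Quick check of current etcd pod memory usage on the masters
  # (assumes the openshift-etcd namespace holds the etcd-member static pods)
  oc adm top pods -n openshift-etcd

  # PromQL approximating the attached memory graph
  # (container label is an assumption; it may be container_name on 4.1)
  sum by (pod) (container_memory_working_set_bytes{namespace="openshift-etcd", container="etcd-member"})

  # PromQL for member-to-member round-trip latency, which is the kind of
  # metric the "slow communications" alerts are based on
  histogram_quantile(0.99, sum by (instance, le) (rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])))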