Description of problem: I have a 4.1.16 cluster that has been experiencing etcd performance degradation since September 11th, which is shortly after finishing an upgrade from 4.1.14 and starting another to 4.1.15. The issue persists even now, after upgrading to 4.1.16.
The etcd pods in my cluster have shown rapid growth in memory usage since the 11th; before then, usage was stable and low. Additionally, I am regularly receiving Prometheus alerts on etcd indicating slow communication.
My recent update history:
- completionTime: "2019-09-14T01:25:04Z"
- completionTime: "2019-09-14T00:03:21Z"
- completionTime: "2019-09-11T19:58:11Z"
- completionTime: "2019-09-05T08:41:49Z"
ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9
Version-Release number of selected component (if applicable): OCP 4.1.16
How reproducible: This cluster is a long-lived dev cluster for the metering team; it is currently running and experiencing the issue. Reach out to chancez on CoreOS Slack or send me an email for access to debug.
Steps to Reproduce:
1. Upgrade the cluster
2. Look at etcd memory usage
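For anyone repeating the last step without the attached graph, the following is a minimal sketch of how etcd memory usage over a time window could be pulled from the Prometheus HTTP API and summarized. It rests on assumptions not taken from this report: the base URL is hypothetical, and the metric `process_resident_memory_bytes` with a `job="etcd"` label is a guess at how etcd is scraped in this cluster. Authentication and the HTTP request itself are left out.

```python
from urllib.parse import urlencode

# Assumed query: resident memory of the etcd processes. The metric and the
# job label are assumptions; adjust them to match the cluster's scrape config.
ETCD_MEMORY_QUERY = 'process_resident_memory_bytes{job="etcd"}'


def build_query_range_url(base_url, query, start, end, step="5m"):
    """Build a Prometheus /api/v1/query_range URL for the given window."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def summarize_gib(values):
    """Return (min, max) in GiB for one series from a matrix result.

    `values` is a list of [timestamp, "bytes-as-string"] pairs, which is the
    shape Prometheus uses for each series in a range-query response.
    """
    gib = [float(v) / 2**30 for _, v in values]
    return min(gib), max(gib)


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with the cluster's route.
    url = build_query_range_url(
        "https://prometheus.example.invalid",
        ETCD_MEMORY_QUERY,
        "2019-09-11T00:00:00Z",
        "2019-09-14T02:00:00Z",
    )
    print(url)
```

Fetching the URL with a bearer token and passing each series' `values` list to `summarize_gib` would give the low and high points of the window, which can be compared against the ~6GB and ~18GB figures reported here.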
Actual results: Etcd grows from ~6GB of memory usage to ~18GB over 2 hours before dropping back down. At the upper end of the spike, and while usage is dropping, I can see Prometheus alerts firing on etcd communication latency.
Expected results: Etcd should remain stable and maintain existing performance characteristics.
Additional info: ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9
Created attachment 1616301
A 2-week view of etcd's memory usage is attached.