Created attachment 1712932 [details]
dmesg from an affected node

Description of problem:
I've installed OCP 4.6 (on OpenStack); after about a week of normal use the cluster starts returning the timeout errors shown below.

Version-Release number of selected component (if applicable):
/root/ocp46-new/openshift-install 4.6.0-0.nightly-2020-08-16-072105
built from commit c52bc55044a19220ec292da3428f4c3d84fdfdc6
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:4cb280f7473d303ce9ec7ab9b4daa54f00147ea637bbc848c9d1f7a72af5660a

How reproducible:
This is not the first time something similar has happened; it eventually happened on 4.3, 4.4, and 4.5 as well.

Steps to Reproduce:
1. Install OCP 4.6
2. Use the cluster as normal for a week
3.

Actual results:
{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}

Expected results:
Auth carries on working.

Additional info:
I have ssh access to the masters in case additional logs are required. However, `podman ps` returns no containers.

dmesg shows a lot of oom-killer activity; short snippet: http://pastebin.test.redhat.com/897433 (full dmesg available in the attachment)

top shows that etcd is severely leaking:

 3418 root      20   0   22.9g  14.1g      0 R  55.8  90.3  49:46.42 etcd

I'll attach more logs in comments.
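For anyone triaging a similar node, the OOM and etcd memory observations above can be gathered with a couple of standard commands. This is only a sketch: substitute the affected node name, and note that the exact dmesg wording for OOM kills varies by kernel.

```
# Run from a workstation with cluster-admin access; falls back to ssh if the
# apiserver itself is timing out. Checks for oom-killer activity and lists the
# top memory consumers (etcd among them) on the host.
oc debug node/<master-node> -- chroot /host sh -c '
  dmesg | grep -iE "oom|killed process"
  top -b -n1 -o %MEM | head -n 20
'
```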
In private conversation, the reporter claimed etcd was leaking memory.
I have seen some interesting memory usage recently. It would be good if we could get a heap profile to analyze:

oc rsh -T -n openshift-etcd $ETCD_POD sh -c 'curl --cert $ETCDCTL_CERT --key $ETCDCTL_KEY --cacert $ETCDCTL_CACERT https://localhost:2379/debug/pprof/heap' > /tmp/heap

I understand that the apiserver might not be up, in which case you should be able to perform a similar action against the container directly on the host using `crictl exec`. Once we have the profile we should be able to better understand what is going on.
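A host-side variant along these lines should work if the apiserver is unreachable. This is only a sketch: it assumes the etcd pod's `etcdctl` sidecar container is still running and carries the ETCDCTL_CERT/KEY/CACERT environment variables (as it does when reached via `oc rsh`), so adjust the `--name` filter and paths if your layout differs.

```
# Run as root on the affected master.
CID=$(crictl ps --name etcdctl --quiet | head -n1)
crictl exec "$CID" sh -c \
  'curl -s --cert $ETCDCTL_CERT --key $ETCDCTL_KEY --cacert $ETCDCTL_CACERT \
   https://localhost:2379/debug/pprof/heap' > /tmp/heap
```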
After reviewing the cluster I did take memory profiles of etcd. But I also noticed that etcd in this cluster was compiled with the wrong version of golang. This happened during the build process with ART. We are working with the ART team to clean this up, but for the time being any build of etcd that uses golang 1.14 should be considered unstable. I will try to update you with details on when good builds will be available.

```
2020-08-28 09:36:34.116351 I | etcdmain: etcd Version: 3.4.9
2020-08-28 09:36:34.116356 I | etcdmain: Git SHA: 47063fb
2020-08-28 09:36:34.116360 I | etcdmain: Go Version: go1.14.4
2020-08-28 09:36:34.116366 I | etcdmain: Go OS/Arch: linux/amd64
```
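A quick way to confirm which Go toolchain each member is running, without grepping the startup logs; only a sketch, and the `app=etcd` label selector and `etcd` container name are assumptions that may need adjusting for the cluster in question:

```
# Print the etcd build info (including "Go Version") for every etcd member.
for pod in $(oc get pods -n openshift-etcd -l app=etcd -o name); do
  echo "== $pod =="
  oc exec -n openshift-etcd -c etcd "$pod" -- etcd --version
done
```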
For the reasons listed above I am marking this as a dupe of bug 1873590. Let's track it there.

*** This bug has been marked as a duplicate of bug 1873590 ***
Created attachment 1715584 [details] etcd disk and peer RTT latency
Created attachment 1715585 [details] heap profile
I took a deep look at this today. What I believe is happening is that podnetworkconnectivitychecks is writing and reading keys ranging from a couple of hundred bytes to over 1 MB, and because this cluster has been around longer, the key sizes keep growing. On clusters without exceptional storage this bloat begins to manifest as memory pressure.

```
etcdctl get /kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/apiserver-broker46lpt-s6bxc-master-1-to-kubernetes-apiserver-service-cluster | wc -c
1006802
```

The total size of these keys is around 9 MB:

```
etcdctl get --prefix /kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/ | wc -c
9260933
```

I looked at the heap from pprof (https://bugzilla.redhat.com/attachment.cgi?id=1715585) and it appears to me that all of the memory is in Unmarshal, which would be explainable by the I/O backpressure. As you can see from the metrics query (https://bugzilla.redhat.com/attachment.cgi?id=1715584), etcd latency is off the chart. I am in the process of uploading the must-gather (1 GB) to Google Drive; once that is complete I will reference the link.
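To see how the individual podnetworkconnectivitychecks keys are distributed by size (the per-key figures behind the attachment below), something along these lines works. A rough sketch, assuming the usual ETCDCTL_* environment is already set inside the etcd pod:

```
# List every key under the openshift-apiserver connectivity-check prefix
# together with its value size in bytes, largest last.
PREFIX=/kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/
for key in $(etcdctl get --prefix --keys-only "$PREFIX"); do
  printf '%s %s\n' "$(etcdctl get --print-value-only "$key" | wc -c)" "$key"
done | sort -n
```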
Created attachment 1715595 [details] cpu profile
Created attachment 1715596 [details] heap source
link to must-gather https://drive.google.com/file/d/1seggzMdNznqWRCY5rqN6-cI2c-P_Ylt1/view?usp=sharing

```
$ go/bin/go tool pprof ./heap
File: etcd
Type: inuse_space
Time: Sep 21, 2020 at 5:07pm (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 6904.41MB, 99.30% of 6953.01MB total
Dropped 137 nodes (cum <= 34.77MB)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
 6380.90MB 91.77% 91.77%  6380.90MB 91.77%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
  523.51MB  7.53% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*InternalRaftRequest).Marshal
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/embed.StartEtcd
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/etcdmain.Main
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/etcdmain.startEtcd
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).Txn
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).processInternalRaftRequestOnce
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).raftRequest
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).raftRequestOnce
```
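For anyone digging further into the same heap file, a couple of pprof commands help attribute the Unmarshal allocations. A sketch only; the function filter is just the heaviest entry from the `top` output above:

```
# Re-open the heap profile and drill into the dominant allocation site.
go tool pprof ./heap <<'EOF'
top -cum 20
peek raftpb.(*Entry).Unmarshal
EOF

# Or serve an interactive view (flame graph, call graph) in a browser.
go tool pprof -http=:8080 ./heap
```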
Created attachment 1715598 [details] podnetworkconnectivitychecks key sizes in bytes
I would like to add that etcd 3.4 has what appear to be performance issues which it would be nice to have time to address upstream. But for the record, I updated the etcd container to run the latest upstream 3.4.13 [1] and saw the exact same memory profile.

[1] gcr.io/etcd-development/etcd:v3.4.13
Verified with OCP 4.6.0-0.nightly-2020-09-27-075304.

Check the cluster health:

$ oc get nodes
NAME                                STATUS   ROLES    AGE   VERSION
kewang27osp2-ghtnn-master-0         Ready    master   41h   v1.19.0+fff8183
kewang27osp2-ghtnn-master-1         Ready    master   41h   v1.19.0+fff8183
kewang27osp2-ghtnn-master-2         Ready    master   41h   v1.19.0+fff8183
kewang27osp2-ghtnn-worker-0-84vdd   Ready    worker   40h   v1.19.0+fff8183
kewang27osp2-ghtnn-worker-0-ftzc6   Ready    worker   40h   v1.19.0+fff8183
kewang27osp2-ghtnn-worker-0-kzp66   Ready    worker   40h   v1.19.0+fff8183

$ oc get co | grep -v '.True.*False.*False'
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE

$ oc get pods -A | grep -vE 'Running|Completed'
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE

Nodes, cluster operators and pods are all healthy as expected.

$ oc debug node/kewang27osp2-ghtnn-master-0
sh-4.4# dmesg | grep oom
sh-4.4# top
top - 09:10:57 up 1 day, 16:24,  0 users,  load average: 2.01, 2.00, 1.77
Tasks: 454 total,   2 running, 452 sleeping,   0 stopped,   0 zombie
%Cpu(s): 30.7 us, 15.7 sy,  0.0 ni, 45.8 id,  3.0 wa,  1.8 hi,  2.4 si,  0.6 st
MiB Mem :  16033.5 total,    167.6 free,   9080.0 used,   6785.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9902.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 519361 root      20   0   38876   2608   2160 R  85.0   0.0   0:00.39 lsof
4108558 root      20   0 3105300   2.0g  67276 S  65.0  12.7 127:12.51 kube-apiserver
  12456 root      20   0 2040616 230756  28064 S  45.0   1.4  64:38.24 openshift-apise
   5713 root      20   0   10.3g   2.4g 192868 S  40.0  15.4 424:19.05 etcd

$ oc rsh -n openshift-etcd etcd-kewang27osp2-ghtnn-master-0
sh-4.4# etcd --version
etcd Version: 3.4.9
Git SHA: edaae9f
Go Version: go1.12.12
Go OS/Arch: linux/amd64

The total size of the podnetworkconnectivitychecks keys:

sh-4.4# etcdctl get --prefix /kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/ | wc -c
731478

etcd db size:

sh-4.4# etcdctl endpoint status
https://192.168.1.132:2379, 64e4d45778428cb9, 3.4.9, 182 MB, true, false, 3468, 2266207, 2266207,
https://192.168.3.174:2379, 1b4bae5b585a17be, 3.4.9, 182 MB, false, false, 3468, 2266207, 2266207,
https://192.168.1.208:2379, 3fc6b761bb2a2713, 3.4.9, 182 MB, false, false, 3468, 2266207, 2266207,

While the cluster was running, I performed some operations, including creating new resources and disconnecting one master for a few hours; the etcd db size increased normally.
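For anyone repeating this verification, the comma-separated `endpoint status` output above is easier to read in table form. A small sketch; the first command assumes it is run inside an etcd pod where the ETCDCTL_* variables are set, and `oc adm top` assumes cluster metrics are available:

```
# Show member status (db size, leader, raft terms) for the whole cluster as a table.
etcdctl endpoint status --cluster -w table

# From a workstation with cluster-admin access, watch memory use of the etcd pods.
oc adm top pods -n openshift-etcd
```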
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196