+++ This bug was initially created as a clone of Bug #1859883 +++

Description of problem:

During a scale test, ovnkube-node process memory grows exponentially. The scale test creates 1000 projects with 16 pods and 3 services per project. During the test, ovnkube-node pods on various nodes consume between 6G and 20G of memory. One example:

```
ovnkube-node-cbpq7   ovn-controller   0m      5145Mi
ovnkube-node-cbpq7   ovnkube-node     1176m   19239Mi   <<< ~20G
```

top output:

```
top - 08:40:28 up 2 days, 13:24,  0 users,  load average: 1.76, 1.76, 1.75
Tasks: 243 total,   3 running, 151 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.2 us,  1.8 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
KiB Mem : 32114260 total,   325308 free, 31091092 used,   697860 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 19246880 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 631104 root      20   0   59388   6852   5492 R  93.7  0.0   0:24.04 ovs-ofctl
 631306 root      20   0   59388   6340   5108 R  88.4  0.0   0:18.60 ovs-ofctl
2745884 root      20   0   25.6g  20.2g  12496 S  26.9 66.0 677:24.98 ovnkube
2265292 root      20   0 5069184  89136      0 S   4.3  0.3 109:27.83 kubelet
2750018 998       10 -10 4089828   3.3g  13308 S   2.0 10.8 210:06.68 ovs-vswitchd
      1 root      20   0  250276  13332   3516 S   1.7  0.0  36:07.81 systemd
2264938 dbus      20   0   85312   2060    828 S   0.7  0.0   1:06.59 dbus-daemon
2264963 root      20   0 2766268  87244   8548 S   0.7  0.3  16:32.19 crio
2749035 root      20   0  141988  20724      0 S   0.3  0.1   1:31.81 machine-config-
2749976 998       10 -10   61468   6304   3508 S   0.3  0.0   0:31.86 ovsdb-server
```

Overall, the ovnkube-node-cbpq7 pod consumes around 25G of memory. The ovs-daemons container on the same node at one point consumes more than 7G and ends up killed by the OOM killer:

```
ovs-node-dc7bl   ovs-daemons   826m   7346Mi
```

The node runs only 14 pods in total:

```
# oc get pods -A -o wide | grep ip-10-0-211-89.us-west-2.compute.internal
mastervertical1                          deploymentconfig1-1-fx6h7     1/1   Running             0    32h    10.131.32.6   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
mastervertical16                         deploymentconfig0-1-gxl9l     1/1   Running             0    31h    10.131.32.8   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
mastervertical39                         deploymentconfig2v0-1-8xxmd   1/1   Running             0    32h    10.131.32.5   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
mastervertical63                         deploymentconfig2v0-1-lks6v   1/1   Running             0    32h    10.131.32.7   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
mastervertical76                         deploymentconfig0-1-2bgcw     0/1   ContainerCreating   0    31h    <none>        ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
mastervertical81                         deploymentconfig2v0-1-m92v5   1/1   Running             0    32h    10.131.32.4   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-x9s2x                   1/1   Running             6    2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-dns                            dns-default-wn2kl             3/3   Running             17   2d13h  10.131.32.3   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-image-registry                 node-ca-b9rzq                 1/1   Running             6    2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-prnk4   2/2   Running             12   2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-monitoring                     node-exporter-cl2fw           2/2   Running             12   2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-multus                         multus-sv6jw                  1/1   Running             4    2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-ovn-kubernetes                 ovnkube-node-cbpq7            2/2   Running             18   2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
openshift-ovn-kubernetes                 ovs-node-dc7bl                1/1   Running             7    2d13h  10.0.211.89   ip-10-0-211-89.us-west-2.compute.internal   <none>   <none>
```

Total number of flows in br-int: 1774871

```
# oc exec -n openshift-ovn-kubernetes ovnkube-node-cbpq7 -c ovn-controller -- ovs-ofctl dump-aggregate br-int -O OpenFlow15
OFPST_AGGREGATE reply (OF1.5) (xid=0x2): packet_count=64349590 byte_count=126004125459 flow_count=1774871
```

A quick look at the logs shows a huge number of failures while adding pod logical ports, because ovnkube-node was unable to dump the br-int aggregate flows. It received around 2351 CNI requests to ADD and DELETE pod interfaces:

```
# egrep 'CNI request &{ADD | CNI request &{DEL' ovnkube-node-cbpq7-ovnkube-node.log | wc -l
2351
```

Most of the failures come from the inability to dump the aggregated br-int flows. ovnkube-node waits 60 seconds for `ovs-ofctl dump-aggregate` to return, which seems like a reasonable timeout, but the OVS daemon appears to stay busy for even longer and never answers the request. These observations from the logs are mostly symptoms; the root cause might be different. We need to take a memory profile of the ovnkube-node process to see what is retaining so much memory.

Logs (all ovnkube-master logs, ovnkube-node logs, the memory growth logs for ovnkube-node, ovn-controller and ovs-daemons, cluster status, etc.) are attached.

Version-Release number of selected component (if applicable): ocp4.5rc7

How reproducible: Always

Steps to Reproduce:
1. Install a cluster using 4.5.0-rc.7
2. Deploy a 100-node cluster
3. Run Cluster-density (MasterVertical) with at least 1000 namespaces

Actual results: Pod creation failures, a huge number of pod evictions, and OOM kills for other pods running on the same nodes.

Expected results: Pod creation should succeed, and pods should not be evicted because the CNI components are bloating memory.

Additional info: This bug is related to https://bugzilla.redhat.com/show_bug.cgi?id=1855408

--- Additional comment from Dan Williams on 2020-07-23 17:03:03 UTC ---

Another data point... OCP 4.5.3 (ovn-2.13-2.13.0-29), two different nodes, idle cluster, 215 total pods:

```
OFPST_AGGREGATE reply (OF1.5) (xid=0x2): packet_count=89326536 byte_count=142356692394 flow_count=4765
OFPST_AGGREGATE reply (OF1.5) (xid=0x2): packet_count=152685097 byte_count=86667393290 flow_count=4987
```

That's not a lot of flows, and certainly much less than 1.7 million...

--- Additional comment from Anil Vishnoi on 2020-07-30 21:28:15 UTC ---

I enabled pprof profiling for ovnkube-node and took a heap snapshot of one of the pods that was consuming around 18G of memory.
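For reference, heap snapshots like the ones below are typically pulled from a Go process through the standard net/http/pprof handlers. The sketch below shows that wiring under the assumption of a localhost-only listener on port 6060; it is not necessarily how ovnkube-node itself enables profiling.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Expose profiling on localhost only; the port is an arbitrary example.
	go func() {
		log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
	}()

	// ... the daemon's real work would run here; block forever for the sketch.
	select {}
}
```

With such an endpoint, `go tool pprof -sample_index=inuse_space http://127.0.0.1:6060/debug/pprof/heap` shows the live heap, while `-sample_index=alloc_space` shows cumulative allocations since process start; the two views are compared below.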
```
ovnkube-node-7gnhr   ovnkube-node   1534m   18113Mi
```

Looking at the heap profile snapshot, the active heap memory (inuse_space) used by the ovnkube-node process is just 467.48MB:

```
(pprof) top
Showing nodes accounting for 421.77MB, 90.22% of 467.48MB total
Dropped 111 nodes (cum <= 2.34MB)
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
  102.20MB 21.86% 21.86%   102.20MB 21.86%  k8s.io/api/core/v1.(*EnvVar).Unmarshal
   92.11MB 19.70% 41.56%    92.11MB 19.70%  k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).Unmarshal
   73.55MB 15.73% 57.30%   195.75MB 41.87%  k8s.io/api/core/v1.(*Container).Unmarshal
   70.69MB 15.12% 72.42%   282.94MB 60.53%  k8s.io/api/core/v1.(*PodSpec).Unmarshal
   27.50MB  5.88% 78.30%   127.11MB 27.19%  k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
   17.50MB  3.74% 82.05%    17.50MB  3.74%  k8s.io/api/core/v1.(*VolumeMount).Unmarshal
   13.50MB  2.89% 84.94%    23.01MB  4.92%  k8s.io/api/core/v1.(*PodStatus).Unmarshal
   11.71MB  2.51% 87.44%   304.94MB 65.23%  k8s.io/api/core/v1.(*PodList).Unmarshal
       7MB  1.50% 88.94%       13MB  2.78%  k8s.io/api/core/v1.(*VolumeSource).Unmarshal
       6MB  1.28% 90.22%    98.11MB 20.99%  k8s.io/apimachinery/pkg/apis/meta/v1.(*ManagedFieldsEntry).Unmarshal
```

But if we look at alloc_space (cumulative heap allocation from the start of the process up to the snapshot), ovnkube-node has allocated a huge amount of heap memory for a scale test that runs for only about 2 hours:

```
(pprof) top
Showing nodes accounting for 297.95GB, 97.30% of 306.22GB total
Dropped 741 nodes (cum <= 1.53GB)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  110.89GB 36.21% 36.21%   110.89GB 36.21%  bytes.makeSlice
   41.52GB 13.56% 49.77%    41.52GB 13.56%  strconv.appendEscapedRune
   34.50GB 11.27% 61.04%   175.92GB 57.45%  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/cni.ofctlExec
   33.25GB 10.86% 71.90%    33.25GB 10.86%  fmt.(*buffer).write
   33.25GB 10.86% 82.75%   141.20GB 46.11%  fmt.Errorf
   33.18GB 10.84% 93.59%    74.70GB 24.39%  strconv.appendQuotedWith
```

Given that the inuse_space heap is only ~460MB, the GC is clearly collecting the garbage and freeing heap memory, so this does not look like a memory leak in ovnkube-node as such. Instead, during the scale test ovnkube-node allocates large chunks of heap memory to handle incoming requests, and even after the GC collects the garbage and shrinks the live heap, the Go runtime does not release the RSS back to the OS. The Go runtime only returns memory to the OS after memory activity has been idle for roughly 8 minutes, but because of the memory bloat pods keep getting evicted and rescheduled, which keeps generating pod ADD/DELETE CNI requests to ovnkube-node, so memory keeps being allocated and released. Since memory activity is never idle, the Go runtime holds on to the memory and never returns it to the OS.

There are two possible fixes for this issue:
(1) Explicitly call debug.FreeOSMemory (from runtime/debug), which forces a GC and attempts to return the freed memory to the OS.
(2) Trace the parts of the ovnkube-node code that allocate these huge chunks of memory and optimize them to reduce the heap memory requirement.
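To make option (1) concrete, here is a minimal sketch of what such a workaround could look like, assuming a periodic background task inside the daemon; the helper name and the 5-minute interval are illustrative only, not the actual ovnkube-node change.

```go
package main

import (
	"runtime/debug"
	"time"
)

// startMemoryReclaimer periodically forces a GC and asks the runtime to return
// as much freed heap memory to the OS as possible, lowering RSS between bursts
// of CNI activity at the cost of extra GC work.
func startMemoryReclaimer(interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				debug.FreeOSMemory()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	startMemoryReclaimer(5*time.Minute, stop) // interval chosen arbitrarily for illustration
	// ... daemon work would run here ...
	select {}
}
```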
Looking at the alloc_space top output above, the following are the top heap memory consumers:

```
  110.89GB 36.21% 36.21%   110.89GB 36.21%  bytes.makeSlice
   34.50GB 11.27% 61.04%   175.92GB 57.45%  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/cni.ofctlExec
   33.25GB 10.86% 82.75%   141.20GB 46.11%  fmt.Errorf
```

Fixing these avoids the huge memory allocation in the first place, so we never get into this state. In my opinion, option (1) is the quicker fix, but it is a workaround and would probably degrade ovnkube-node's performance; option (2) is the right way to solve this issue. Currently looking into the logs collected from the GC trace and at the code paths that can be optimized to reduce memory allocation.
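As an illustration of where option (2) could go, the alloc_space profile suggests most of the allocation comes from buffering the full `ovs-ofctl` output (bytes.makeSlice) and then quoting that multi-megabyte output into error strings (fmt.Errorf, strconv.appendQuotedWith). The sketch below is hypothetical and is not the actual ovn-kubernetes fix: the helper name, the preallocation size, and the truncation limit are all assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// maxErrOutput caps how much command output gets echoed into error strings,
// so a failed dump over 1.7M flows no longer drags megabytes through fmt.Errorf.
const maxErrOutput = 4096

// runOfctl is a hypothetical wrapper around ovs-ofctl: it preallocates the
// output buffer (avoiding repeated bytes.makeSlice growth on large flow dumps)
// and, on failure, quotes only a truncated prefix of the output.
func runOfctl(args ...string) ([]byte, error) {
	cmd := exec.Command("ovs-ofctl", args...)

	var out bytes.Buffer
	out.Grow(1 << 20) // reserve 1 MiB up front; flow dumps on busy nodes are large
	cmd.Stdout = &out
	cmd.Stderr = &out

	if err := cmd.Run(); err != nil {
		snippet := out.Bytes()
		if len(snippet) > maxErrOutput {
			snippet = snippet[:maxErrOutput]
		}
		return nil, fmt.Errorf("ovs-ofctl %v failed: %v (output: %q)", args, err, snippet)
	}
	return out.Bytes(), nil
}

func main() {
	if _, err := runOfctl("dump-aggregate", "br-int", "-O", "OpenFlow15"); err != nil {
		fmt.Println(err)
	}
}
```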
Verified on the 4.5.15 candidate (4.5.0-0.nightly-2020-10-10-030038) with a 100-node cluster running 2500 pods. ovnkube memory increased from 160MB on an idle cluster to ~270MB on the loaded cluster (max pods per node set to 250).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4228