Bug 1859883 - ovnkube-node memory growing exponentially, possibly causing OOM for other pods
Summary: ovnkube-node memory growing exponentially, possibly causing OOM for other pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Anil Vishnoi
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-46
Depends On:
Blocks: 1862700
 
Reported: 2020-07-23 09:17 UTC by Anil Vishnoi
Modified: 2020-10-27 16:17 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1862700 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:16:44 UTC
Target Upstream Version:
Embargoed:


Attachments
ovn logs collected during the scale test. Must-gather failed to collect the logs. (4.02 MB, application/gzip)
2020-07-23 09:17 UTC, Anil Vishnoi
pods-heap-profile (19.12 KB, application/gzip)
2020-07-30 22:05 UTC, Anil Vishnoi
pods-heap-profile-2 (74.80 KB, application/gzip)
2020-07-30 22:07 UTC, Anil Vishnoi
pods-cpu-profile-at-high-memory (2.68 KB, application/gzip)
2020-07-30 22:08 UTC, Anil Vishnoi


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 226 0 None closed Bug 1859883: Fix ovnkube-node aggressive memory allocation to prevent … 2020-12-15 23:31:22 UTC
Red Hat Bugzilla 1855408 0 urgent CLOSED OVN cluster unstable after running minimal scale test 2021-02-24 15:15:16 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:17:06 UTC

Description Anil Vishnoi 2020-07-23 09:17:11 UTC
Created attachment 1702192 [details]
ovn logs collected during the scale test. Must-gather failed to collect the logs.

Description of problem:
During a scale test, ovnkube-node process memory grows exponentially. The scale test creates 1000 projects with 16 pods and 3 services per project. During the test, multiple ovnkube-node pods running on various nodes consume memory in the range of 6G to 20G. The following is one example:

```
ovnkube-node-cbpq7     ovn-controller   0m           5145Mi
ovnkube-node-cbpq7     ovnkube-node     1176m        19239Mi   <<< ~20G
```

top output

```
top - 08:40:28 up 2 days, 13:24,  0 users,  load average: 1.76, 1.76, 1.75
Tasks: 243 total,   3 running, 151 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.2 us,  1.8 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
KiB Mem : 32114260 total,   325308 free, 31091092 used,   697860 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 19246880 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 631104 root      20   0   59388   6852   5492 R  93.7  0.0   0:24.04 ovs-ofctl
 631306 root      20   0   59388   6340   5108 R  88.4  0.0   0:18.60 ovs-ofctl
2745884 root      20   0   25.6g  20.2g  12496 S  26.9 66.0 677:24.98 ovnkube
2265292 root      20   0 5069184  89136      0 S   4.3  0.3 109:27.83 kubelet
2750018 998       10 -10 4089828   3.3g  13308 S   2.0 10.8 210:06.68 ovs-vswitchd
      1 root      20   0  250276  13332   3516 S   1.7  0.0  36:07.81 systemd
2264938 dbus      20   0   85312   2060    828 S   0.7  0.0   1:06.59 dbus-daemon
2264963 root      20   0 2766268  87244   8548 S   0.7  0.3  16:32.19 crio
2749035 root      20   0  141988  20724      0 S   0.3  0.1   1:31.81 machine-config-
2749976 998       10 -10   61468   6304   3508 S   0.3  0.0   0:31.86 ovsdb-server
```

Overall, the ovnkube-node-cbpq7 pod is consuming around 25G of memory. The ovs-daemons container on the same node at one point consumed more than 7G and ended up being killed by the OOM killer:

```
ovs-node-dc7bl         ovs-daemons      826m         7346Mi
```

Total number of pods on this node: 14

```
# oc get pods -A -o wide | grep ip-10-0-211-89.us-west-2.compute.internal
mastervertical1                                    deploymentconfig1-1-fx6h7                                             1/1     Running             0          32h     10.131.32.6     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
mastervertical16                                   deploymentconfig0-1-gxl9l                                             1/1     Running             0          31h     10.131.32.8     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
mastervertical39                                   deploymentconfig2v0-1-8xxmd                                           1/1     Running             0          32h     10.131.32.5     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
mastervertical63                                   deploymentconfig2v0-1-lks6v                                           1/1     Running             0          32h     10.131.32.7     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
mastervertical76                                   deploymentconfig0-1-2bgcw                                             0/1     ContainerCreating   0          31h     <none>          ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
mastervertical81                                   deploymentconfig2v0-1-m92v5                                           1/1     Running             0          32h     10.131.32.4     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-cluster-node-tuning-operator             tuned-x9s2x                                                           1/1     Running             6          2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-dns                                      dns-default-wn2kl                                                     3/3     Running             17         2d13h   10.131.32.3     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-image-registry                           node-ca-b9rzq                                                         1/1     Running             6          2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-machine-config-operator                  machine-config-daemon-prnk4                                           2/2     Running             12         2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-monitoring                               node-exporter-cl2fw                                                   2/2     Running             12         2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-multus                                   multus-sv6jw                                                          1/1     Running             4          2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-ovn-kubernetes                           ovnkube-node-cbpq7                                                    2/2     Running             18         2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
openshift-ovn-kubernetes                           ovs-node-dc7bl                                                        1/1     Running             7          2d13h   10.0.211.89     ip-10-0-211-89.us-west-2.compute.internal    <none>           <none>
```

Total number of flows in br-int: 1774871

```
oc exec -n openshift-ovn-kubernetes ovnkube-node-cbpq7 -c ovn-controller -- ovs-ofctl dump-aggregate br-int -O OpenFlow15
OFPST_AGGREGATE reply (OF1.5) (xid=0x2): packet_count=64349590 byte_count=126004125459 flow_count=1774871
```

A quick look at the logs shows a huge number of failures while adding pod logical ports, because ovnkube-node was not able to dump the br-int aggregate flows. It received around 2351 CNI requests to ADD and DELETE pod interfaces:

```
# egrep 'CNI request &{ADD | CNI request &{DEL' ovnkube-node-cbpq7-ovnkube-node.log | wc -l
2351
```

These failures are mostly due to the inability to dump the br-int aggregated flows. ovnkube-node waits 60 seconds for ovs-ofctl dump-aggregate to dump the flows, which seems like a reasonable amount of time, but it appears the ovs daemon is busy for even longer and does not respond to the request. These observations from the logs are mostly symptoms; the root cause might be different. We need to do a memory profile of the ovnkube-node process to see what is retaining so much memory.
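
For context, the pattern described above amounts to shelling out to ovs-ofctl under a deadline. Below is a minimal Go sketch of that pattern; it is not the actual ofctlExec implementation, and the 60-second value and arguments simply mirror what is described in this report.

```
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// dumpAggregateFlows runs "ovs-ofctl dump-aggregate" against a bridge with a
// deadline. The 60s timeout mirrors the value mentioned above; this is an
// illustrative sketch, not the ovnkube-node code.
func dumpAggregateFlows(bridge string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "ovs-ofctl", "-O", "OpenFlow15", "dump-aggregate", bridge)
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		// This is the failure mode seen in the logs: ovs-vswitchd is too busy
		// to answer within the window, so the CNI ADD/DEL request fails.
		return "", fmt.Errorf("ovs-ofctl dump-aggregate %s timed out after 60s", bridge)
	}
	if err != nil {
		return "", fmt.Errorf("ovs-ofctl failed: %v, output: %q", err, string(out))
	}
	return string(out), nil
}

func main() {
	out, err := dumpAggregateFlows("br-int")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```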

Logs (all ovnkube-master logs, ovnkube-node logs, the memory growth logs for ovnkube-node, ovn-controller, and ovs-daemons, cluster status, etc.) are attached.

Version-Release number of selected component (if applicable):

OCP 4.5.0-rc.7

How reproducible:
Always

Steps to Reproduce:
1. Install cluster using 4.5.0-rc.7
2. Deploy a 100 nodes cluster
3. Run Cluster-density (Mastervertical) with at least 1000 namespaces

Actual results:
Pod creation failures, a huge number of pod evictions, and OOM kills for other pods running on the same nodes.

Expected results:
Pod creation should be successful, and pods should not be evicted because the CNI components are causing memory bloat.

Additional info:

This bug is related to the following BZ

https://bugzilla.redhat.com/show_bug.cgi?id=1855408

Comment 1 Dan Williams 2020-07-23 17:03:03 UTC
Another data point... OCP 4.5.3 (ovn-2.13-2.13.0-29), two different nodes, idle cluster, 215 total pods:

OFPST_AGGREGATE reply (OF1.5) (xid=0x2): packet_count=89326536 byte_count=142356692394 flow_count=4765
OFPST_AGGREGATE reply (OF1.5) (xid=0x2): packet_count=152685097 byte_count=86667393290 flow_count=4987

That's not a lot of flows, and certainly much less than 1.7 million...

Comment 2 Anil Vishnoi 2020-07-30 21:28:15 UTC
I enabled pprof profiling for ovnkube-node and took a heap snapshot of one of the pods that was consuming around 18G of memory:

ovnkube-node-7gnhr     ovnkube-node     1534m        18113Mi
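
For reference, pprof profiling in a Go daemon is typically exposed over HTTP via net/http/pprof; the sketch below shows the general pattern, with an illustrative listen address rather than the one actually used in ovnkube-node.

```
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the pprof endpoints; the address is illustrative.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

The heap views discussed below can then be pulled with "go tool pprof -inuse_space http://127.0.0.1:6060/debug/pprof/heap" (live heap) and "-alloc_space" (cumulative allocations since process start).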

Looking at the heap profile snapshot, it seems like the active heap memory (inuse_space) used by the ovnkube-node process is just 467.48MB:

(pprof) top
Showing nodes accounting for 421.77MB, 90.22% of 467.48MB total
Dropped 111 nodes (cum <= 2.34MB)
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
  102.20MB 21.86% 21.86%   102.20MB 21.86%  k8s.io/api/core/v1.(*EnvVar).Unmarshal
   92.11MB 19.70% 41.56%    92.11MB 19.70%  k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).Unmarshal
   73.55MB 15.73% 57.30%   195.75MB 41.87%  k8s.io/api/core/v1.(*Container).Unmarshal
   70.69MB 15.12% 72.42%   282.94MB 60.53%  k8s.io/api/core/v1.(*PodSpec).Unmarshal
   27.50MB  5.88% 78.30%   127.11MB 27.19%  k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
   17.50MB  3.74% 82.05%    17.50MB  3.74%  k8s.io/api/core/v1.(*VolumeMount).Unmarshal
   13.50MB  2.89% 84.94%    23.01MB  4.92%  k8s.io/api/core/v1.(*PodStatus).Unmarshal
   11.71MB  2.51% 87.44%   304.94MB 65.23%  k8s.io/api/core/v1.(*PodList).Unmarshal
       7MB  1.50% 88.94%       13MB  2.78%  k8s.io/api/core/v1.(*VolumeSource).Unmarshal
       6MB  1.28% 90.22%    98.11MB 20.99%  k8s.io/apimachinery/pkg/apis/meta/v1.(*ManagedFieldsEntry).Unmarshal 

But if we look at alloc_space (heap memory allocated from the start of the process up to the point of the snapshot), it seems like ovnkube-node is allocating a huge amount of heap memory for a scale test that runs for just 2 hours:

(pprof) top
Showing nodes accounting for 297.95GB, 97.30% of 306.22GB total
Dropped 741 nodes (cum <= 1.53GB)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  110.89GB 36.21% 36.21%   110.89GB 36.21%  bytes.makeSlice
   41.52GB 13.56% 49.77%    41.52GB 13.56%  strconv.appendEscapedRune
   34.50GB 11.27% 61.04%   175.92GB 57.45%  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/cni.ofctlExec
   33.25GB 10.86% 71.90%    33.25GB 10.86%  fmt.(*buffer).write
   33.25GB 10.86% 82.75%   141.20GB 46.11%  fmt.Errorf
   33.18GB 10.84% 93.59%    74.70GB 24.39%  strconv.appendQuotedWith

Given that the inuse_space heap memory is only ~460MB, it looks like the GC is cleaning up garbage objects to free heap memory. So, as per this heap profile, it doesn't seem like there is a memory leak in ovnkube-node as such.

It seems that during the scale test, ovnkube-node allocates a huge amount of memory to handle the incoming requests, but even after the GC cleans up the garbage and shrinks the actual heap, the Go runtime does not release the RSS memory back to the OS. The Go runtime returns memory to the OS when memory activity has been idle for roughly 8 minutes, but because of the memory bloat during the scale test, pods keep getting evicted from the nodes and Kubernetes keeps rescheduling them. That generates a constant stream of add/delete pod CNI requests to ovnkube-node, so memory keeps being allocated and released. Since memory activity is never idle, the Go runtime holds the memory and never returns it to the OS.
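
The distinction between the live heap and what the runtime is still holding from the OS can be illustrated with runtime.MemStats; the sketch below is a hypothetical debugging snippet (not code that exists in ovnkube-node) showing which counters correspond to the pprof views above.

```
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// HeapInuse roughly corresponds to pprof's inuse_space (live heap plus
	// fragmentation), while HeapSys - HeapReleased approximates what the Go
	// runtime is still holding from the OS and therefore shows up as RSS.
	fmt.Printf("HeapInuse:    %d MiB\n", m.HeapInuse/1024/1024)
	fmt.Printf("HeapSys:      %d MiB\n", m.HeapSys/1024/1024)
	fmt.Printf("HeapReleased: %d MiB\n", m.HeapReleased/1024/1024)
	// TotalAlloc is cumulative, comparable to pprof's alloc_space view.
	fmt.Printf("TotalAlloc:   %d MiB\n", m.TotalAlloc/1024/1024)
}
```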

There are two possible fixes for this issue:
(1) Explicitly call debug.FreeOSMemory (from Go's runtime/debug package), which triggers a GC and attempts to return released memory to the OS (see the sketch at the end of this comment).
(2) Trace the part of the ovnkube-node code that is allocating this huge amount of memory and optimize it to reduce the heap memory requirement. Looking at the alloc_space top output above, the following

110.89GB 36.21% 36.21%   110.89GB 36.21%  bytes.makeSlice
34.50GB 11.27% 61.04%   175.92GB 57.45%  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/cni.ofctlExec
33.25GB 10.86% 82.75%   141.20GB 46.11%  fmt.Errorf

are the top heap memory requesters. Fixing these would avoid the huge memory allocations in the first place, so we would never get to this stage.

In my opinion, option (1) is the quicker fix, but it's a workaround and would probably degrade ovnkube-node's performance. Option (2) is the right fix for this issue.
Currently looking into the logs collected from the GC trace and also looking at the code paths to optimize the memory allocation.
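
For reference, a minimal sketch of what option (1) could look like; the helper name and the 2-minute interval are hypothetical, and this is a workaround sketch, not the fix that was eventually merged.

```
package main

import (
	"runtime/debug"
	"time"
)

// startMemoryReclaimer periodically forces a GC and asks the runtime to
// return freed heap pages to the OS (option 1 above). The interval is
// arbitrary and the helper is hypothetical.
func startMemoryReclaimer(interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			debug.FreeOSMemory()
		}
	}()
}

func main() {
	startMemoryReclaimer(2 * time.Minute)
	select {} // stand-in for the daemon's real work
}
```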

Comment 3 Anil Vishnoi 2020-07-30 22:05:54 UTC
Created attachment 1703006 [details]
pods-heap-profile

Comment 4 Anil Vishnoi 2020-07-30 22:07:22 UTC
Created attachment 1703007 [details]
pods-heap-profile-2

Comment 5 Anil Vishnoi 2020-07-30 22:08:15 UTC
Created attachment 1703008 [details]
pods-cpu-profile-at-high-memory

Comment 8 Anurag saxena 2020-09-02 14:55:03 UTC
Mike, please re-assign to the right QA contact if needed.

Comment 11 Mike Fiedler 2020-09-29 18:15:12 UTC
Verified on 4.6.0-0.nightly-2020-09-28-061045

- 100 node OVN cluster on AWS.
- before running the workload, ovnkube-node was using ~256MB on each node
- created 1000 projects with 4000 running pods/services
- ovnkube-node memory usage increased to 512MB/node and stabilized at that level. No additional growth.

Comment 12 Anurag saxena 2020-09-29 20:26:29 UTC
Thank you, Mike Fiedler for verifying that.

Comment 14 errata-xmlrpc 2020-10-27 16:16:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

