Created attachment 1842645 [details]
Logs and ovn-appctl out

Description of problem:
On a 4.10 nightly baremetal cluster (500 nodes), the ovnkube-node pod consumes a considerable amount of memory while running a cluster-density workload (30 pods per node), but it does not return that memory after the test pods are deleted. The ovn-controller container appears to hold on to the memory indefinitely until we restart it manually.

Version-Release number of selected component (if applicable):
OCP - 4.10.0-0.nightly-2021-10-21-105053

[kni@e16-h12-b02-fc640 ~]$ oc rsh -c ovn-controller ovnkube-node-v4prw
sh-4.4# rpm -qa | grep ovn
ovn21.09-central-21.09.0-25.el8fdp.x86_64
ovn21.09-vtep-21.09.0-25.el8fdp.x86_64
ovn21.09-21.09.0-25.el8fdp.x86_64
ovn21.09-host-21.09.0-25.el8fdp.x86_64

How reproducible:
Often reproducible on a baremetal cluster.

Steps to Reproduce:
1. Deploy a healthy cluster.
2. Run a pod creation workload (30 pods per node) and watch the memory grow during the workload.
3. Delete the test pods; ovnkube-node does not release the memory.

Actual results:
ovnkube-node does not release the memory until the pod is restarted.

Expected results:
The memory should be released gradually, as it used to be.

Additional info:
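For reference, one rough way to watch the pod's memory while reproducing this (sketch only; assumes `oc adm top` works on the cluster and uses the pod name from the output above):

# Per-container memory of the ovnkube-node pod, refreshed every 30 seconds
watch -n 30 'oc adm top pod ovnkube-node-v4prw -n openshift-ovn-kubernetes --containers'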
[root@worker417-r640 ~]# ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 46699
total          : 32557
cache-conj-id  : 0
cache-expr     : 23326
cache-matches  : 9231
trim count     : 2
Mem usage (KB) : 93930

[root@worker417-r640 ~]# ovn-appctl -t ovn-controller lflow-cache/flush
CACHE FLUSHED

[root@worker417-r640 ~]# ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 16546
total          : 16546
cache-conj-id  : 0
cache-expr     : 12052
cache-matches  : 4494
trim count     : 3
Mem usage (KB) : 46946

Maybe the cache didn't fall below the watermark enough and thus it didn't trigger the automatic trim?
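A quick way to check that ratio from the stats above (sketch only; the 50% default threshold is the one described in the next comment):

total=32557
high_watermark=46699
# Cache entries as a percentage of the high watermark
echo "scale=1; $total * 100 / $high_watermark" | bc    # ~69.7%, well above the 50% default, so no automatic trim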
There's at least one problem with the way ovn-controller trims memory when scaling down. That's due to the fact that one load balancer VIP generates 3 openflows per backend but only one logical flow. ovn-controller is configured by default to trim memory when the lflow cache goes down under 50% of the previous high water mark. With load balancer flows that means we stop trimming memory a bit too early. We can actually see in the logs that automatic trimming stops happening and that the ratio between lflow cache entries and high watermark is approximately 65%.

We can fix this by making ovn-controller perform an unconditional trim, just once, a fixed number of seconds after the lflow cache was last updated. This would allow the system to reclaim all possible memory when ovn-controller becomes idle. I sent a patch for that upstream:
http://patchwork.ozlabs.org/project/ovn/list/?series=273500&state=*

Nevertheless, I'd like to make sure we're not hitting other issues too. Murali, would it be possible to run another test as follows?

1. Use the same ovn-kubernetes image as when the bug was reported:
   quay.io/itssurya/dev-images:scale-fixes-PR-839-second-deadlock
2. Make sure all ovnkube-node and ovnkube-master pods have been restarted and are using the new image.
3. Before running the test workload, choose one node, find its ovnkube-node pod and delete it, e.g.:
   oc delete pod ovnkube-node-xxx
   # This will recreate a pod, ovnkube-node-yyy, but we know for sure
   # ovn-controller started "clean" there.
4. Raise the memory trimming percentage:
   oc exec ovnkube-node-yyy -c ovn-controller -- ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=70
5. Run the test workload.
6. Clean up the test resources, wait a bit (30 seconds should be enough), then check memory usage of ovnkube-node-yyy.

Thanks,
Dumitru
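Independent of the steps above, a minimal polling sketch that could be run inside the ovn-controller container to watch RSS and cache size over time; the pidfile path and stats command are the same ones used elsewhere in this bug, the rest is an assumption about the environment:

controller_pid=$(cat /var/run/ovn/ovn-controller.pid)
while true; do
    # Process RSS as seen by the kernel
    grep VmRSS /proc/$controller_pid/status
    # Cache entries vs. high watermark; auto trim should fire once
    # "total" drops far enough below "high-watermark"
    ovn-appctl -t ovn-controller lflow-cache/show-stats | grep -E 'high-watermark|total|trim count|Mem usage'
    sleep 30
done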
Dumitru, I followed the steps (but using your image - quay.io/dceara0/dev-images:PR839-1118-01):

$ oc get pods -o wide | grep 139-
ovnkube-node-78lsf   4/4   Running   2 (5d3h ago)   5d3h   192.168.216.152   worker139-fc640   <none>   <none>

$ oc delete pod ovnkube-node-78lsf
pod "ovnkube-node-78lsf" deleted

$ oc get pods -o wide | grep 139-
ovnkube-node-rdl5d   4/4   Running   2 (40s ago)   44s   192.168.216.152   worker139-fc640   <none>   <none>

$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=70

Memory stats - After restart
----------------------------
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 16114
total          : 16113
cache-conj-id  : 0
cache-expr     : 11191
cache-matches  : 4922
trim count     : 0
Mem usage (KB) : 47243

During Workload
---------------
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 77804
total          : 77801
cache-conj-id  : 0
cache-expr     : 26795
cache-matches  : 51006
trim count     : 0
Mem usage (KB) : 224347

After cleanup
-------------
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 18680
total          : 16113
cache-conj-id  : 0
cache-expr     : 11191
cache-matches  : 4922
trim count     : 4
Mem usage (KB) : 47243

I still noticed the same problem; see the Grafana snapshot of ovnkube-node pod memory utilization:
https://snapshot.raintank.io/dashboard/snapshot/p8Vm5vRdEtrjZg4SGLipDNSlK3XcS8eu?viewPanel=142&orgId=2
Hi Murali,

Thanks for the test! Looking at the lflow cache stats "after cleanup" I see:

high-watermark : 18680
total          : 16113

This means we're still above the 70% watermark percentage configured for auto cache trimming. I connected to the setup and forced an additional memory trim by increasing the watermark percentage:

$ ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=90

This immediately triggered a trim in ovn-controller and memory usage went down from 2.3g RSS to ~1.0g RSS.

With the patch I sent for review (http://patchwork.ozlabs.org/project/ovn/list/?series=273500&state=*) this would happen automatically whenever ovn-controller detects that no logical flows have been added or removed for at least 30 seconds. So, when that patch (or something similar) is accepted we shouldn't see this problem anymore.

Moving to POST.

Regards,
Dumitru
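As a side note (not part of the reply above): until the patch lands, the same forced-trim workaround could in principle be applied to every node. A rough sketch, assuming the standard openshift-ovn-kubernetes namespace and app=ovnkube-node label:

for pod in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name); do
    # Same command Dumitru ran above, raising the trim watermark percentage
    oc -n openshift-ovn-kubernetes exec "$pod" -c ovn-controller -- \
        ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=90
done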
*** Bug 1988565 has been marked as a duplicate of this bug. ***
@dceara iiuc this should require no CMS configuration to work, right? I noticed in your comment you did "$ ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=90", but then you go on to say that is automatic with your patch. So I'm thinking the only potential configuration here for ovn-k is the timer (in case we want something more/less often than 30 sec). Is that right?
(In reply to Tim Rozet from comment #7)
> @dceara iiuc this should require no CMS configuration to work
> right? I noticed in your comment you did "$ ovs-vsctl set open .
> external_ids:ovn-trim-wmark-perc-lflow-cache=90". But then you go onto say
> that is automatic with your patch. So I'm thinking the only potential
> configuration here for ovn-k is the timer (in case we want something
> more/less often than 30 sec). Is that right?

Correct, ovn-k shouldn't need to do more than tweaking the timer at this point. However, a smaller value might be detrimental, as memory trimming can be a costly operation.
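For illustration only: if the new timer ends up exposed as an Open_vSwitch external_ids option like the existing lflow-cache knobs, ovn-k could tweak it the same way. The option name below is an assumption, not confirmed by the patch:

# Assumed/hypothetical option name; the real knob is whatever the upstream
# patch defines.
oc exec ovnkube-node-yyy -c ovn-controller -- \
    ovs-vsctl set open . external_ids:ovn-trim-timeout-ms=60000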
Thanks @dceara. Is this easily backportable to earlier versions of OVN? I'm thinking of backporting it to OCP, which would need 21.09 and 20.12.
(In reply to Tim Rozet from comment #10)
> thanks @dceara. Is this easily backportable to earlier versions of OVN?
> Thinking of backporting it in OCP, which would need 21.09 and 20.12.

Replying just from the perspective of feasibility:
- 21.09: should be straightforward
- 20.12: we would need to first port the patches added for bug 1967882

However, I think we need a wider audience discussion to see if we should backport these features downstream-only instead of bumping OCP to a newer (and better) OVN version (cc @mmichels).
Tested with the following script:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.184.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.184.25 external_ids:ovn-enable-lflow-cache=true external_ids:ovn-trim-wmark-perc-lflow-cache=10
systemctl restart ovn-controller

ovn-nbctl set NB_GLOBAL . options:northd_probe_interval=180000
ovn-nbctl set connection . inactivity_probe=180000
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=180
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=180000
ovn-sbctl set connection . inactivity_probe=180000

ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln_p1
ovn-nbctl lsp-set-addresses ln_p1 unknown
ovn-nbctl lsp-set-type ln_p1 localnet
ovn-nbctl lsp-set-options ln_p1 network_name=nattest

controller_pid=$(cat /var/run/ovn/ovn-controller.pid)
grep RSS /proc/$controller_pid/status > test_stat

i=1
for m in `seq 0 9`; do
    for n in `seq 1 99`; do
        ovn-nbctl lr-add r${i}
        ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
        ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
        ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
        ovn-nbctl lrp-set-gateway-chassis r${i}_public hv1
        # s1
        ovn-nbctl ls-add s${i}
        # s1 - r1
        ovn-nbctl lsp-add s${i} s${i}_r${i}
        ovn-nbctl lsp-set-type s${i}_r${i} router
        ovn-nbctl lsp-set-addresses s${i}_r${i} router
        ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}
        # s1 - vm1
        ovn-nbctl lsp-add s$i vm$i
        ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"
        ovs-vsctl add-port br-int vm$i -- set interface vm$i type=internal external_ids:iface-id=vm$i
        ovn-nbctl lrp-add r$i r${i}_public 40:44:00:00:$m:$n 172.16.$m.$n/16
        ovn-nbctl lsp-add public public_r${i}
        ovn-nbctl lsp-set-type public_r${i} router
        ovn-nbctl lsp-set-addresses public_r${i} router
        ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public
        let i++
        if [ $i -gt 300 ]; then break; fi
    done
    if [ $i -gt 300 ]; then break; fi
done

#add host vm1
ip netns add vm1
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 00:de:ad:01:00:01
ip netns exec vm1 ip addr add 173.0.1.2/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1

ip netns add vm2
ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 00:de:ad:01:00:02
ip netns exec vm2 ip addr add 173.0.2.2/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ovs-vsctl set Interface vm2 external_ids:iface-id=vm2

#set provide network
ovs-vsctl add-br nat_test
ip link set nat_test up
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=nattest:nat_test

ip netns add vm0
ovs-vsctl add-port nat_test vm0 -- set interface vm0 type=internal
ip link set vm0 netns vm0
ip netns exec vm0 ip link set vm0 address 00:00:00:00:00:01
ip netns exec vm0 ip addr add 172.16.0.100/16 dev vm0
ip netns exec vm0 ip link set vm0 up
ovs-vsctl set Interface vm0 external_ids:iface-id=vm0

ip netns exec vm1 ip route add default via 173.0.1.1
ip netns exec vm2 ip route add default via 173.0.2.1

ovn-nbctl --wait=hv sync
sleep 30
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1

echo "after add all ls" >> test_stat
grep RSS /proc/$controller_pid/status >> test_stat
ovn-appctl -t ovn-controller lflow-cache/show-stats >> test_stat

i=100
for m in `seq 0 9`; do
    for n in `seq 1 99`; do
        ovn-nbctl lr-del r${i}
        ovs-vsctl del-port vm$i
        ovn-nbctl ls-del s${i}
        let i++
        if [ $i -gt 300 ]; then break; fi
    done
    if [ $i -gt 300 ]; then break; fi
done

ovn-nbctl --wait=hv sync
sleep 60
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1
echo "after del ls" >> test_stat
grep RSS /proc/$controller_pid/status >> test_stat

result on ovn-2021-21.09.1-24:

VmRSS: 4628 kB
after add all ls
VmRSS: 986720 kB
Enabled: true
high-watermark : 201103
total          : 201103
cache-conj-id  : 0
cache-expr     : 195607
cache-matches  : 5496
trim count     : 0
Mem usage (KB) : 247754
after del ls
VmRSS: 986872 kB   <=== memory doesn't decrease
Enabled: true
high-watermark : 201103
total          : 27037
cache-conj-id  : 0
cache-expr     : 25159
cache-matches  : 1878
trim count     : 0   <== trim count is 0
Mem usage (KB) : 45089

result on ovn-2021-21.12.0-11:

VmRSS: 4676 kB
after add all ls
VmRSS: 1009264 kB
Enabled: true
high-watermark : 202005
total          : 202005
cache-expr     : 196507
cache-matches  : 5498
trim count     : 1
Mem usage (KB) : 229471
after del ls
VmRSS: 481368 kB   <=== memory decreased
Enabled: true
high-watermark : 27336
total          : 27336
cache-expr     : 25456
cache-matches  : 1880
trim count     : 2   <=== trim count is 2
Mem usage (KB) : 38447

Dumitru, does the result show that the feature takes effect?
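A quick sanity check on those numbers (my reading, assuming auto trim only fires once the cache drops below the configured percentage of the high watermark, as described earlier in this bug):

# 21.09 run: entries left after deletion vs. high watermark
echo "scale=1; 27037 * 100 / 201103" | bc    # ~13.4%, still above the configured 10%, so no auto trim
# 21.12 run: similar ratio, but the idle trim fires anyway and the watermark resets
echo "scale=1; 27336 * 100 / 202005" | bc    # ~13.5%, yet trim count reached 2 and RSS dropped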
(In reply to Jianlin Shi from comment #15)
> result on ovn-2021-21.09.1-24:
>
> VmRSS: 4628 kB
> after add all ls
> VmRSS: 986720 kB
> Enabled: true
> high-watermark : 201103
> total          : 201103
> cache-conj-id  : 0
> cache-expr     : 195607
> cache-matches  : 5496
> trim count     : 0
> Mem usage (KB) : 247754
> after del ls
> VmRSS: 986872 kB   <=== memory doesn't decrease
> Enabled: true
> high-watermark : 201103
> total          : 27037
> cache-conj-id  : 0
> cache-expr     : 25159
> cache-matches  : 1878
> trim count     : 0   <== trim count is 0
> Mem usage (KB) : 45089
>
> result on ovn-2021-21.12.0-11:
>
> VmRSS: 4676 kB
> after add all ls
> VmRSS: 1009264 kB
> Enabled: true
> high-watermark : 202005
> total          : 202005
> cache-expr     : 196507
> cache-matches  : 5498
> trim count     : 1
> Mem usage (KB) : 229471
> after del ls
> VmRSS: 481368 kB   <=== memory decreased
> Enabled: true
> high-watermark : 27336
> total          : 27336
> cache-expr     : 25456
> cache-matches  : 1880
> trim count     : 2   <=== trim count is 2
> Mem usage (KB) : 38447
>
> Dumitru, does the result show that the feature takes effect?

Looks good to me, thanks!
set VERIFIED per comment 15 and comment 16
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0674
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days