Bug 2024768
| Summary: | ovn-controller is not returning memory back to pool after pod deletion | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Murali Krishnasamy <murali> |
| Component: | OVN | Assignee: | Dumitru Ceara <dceara> |
| Status: | CLOSED ERRATA | QA Contact: | Jianlin Shi <jishi> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | FDP 21.K | CC: | ctrautma, dblack, dcbw, dceara, jiji, mmichels, smalleni, surya, trozet |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovn21.12-21.12.0-24.el8fdp | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-24 17:47:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1958349 | | |
| Attachments: | Logs and ovn-appctl out (attachment 1842645) | | |
[root@worker417-r640 ~]# ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 46699
total : 32557
cache-conj-id : 0
cache-expr : 23326
cache-matches : 9231
trim count : 2
Mem usage (KB) : 93930
[root@worker417-r640 ~]# ovn-appctl -t ovn-controller lflow-cache/flush
-> CACHE FLUSHED
[root@worker417-r640 ~]# ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 16546
total : 16546
cache-conj-id : 0
cache-expr : 12052
cache-matches : 4494
trim count : 3
Mem usage (KB) : 46946

Maybe the cache didn't fall far enough below the watermark, and thus didn't trigger the automatic trim?

There's at least one problem with the way ovn-controller trims memory when scaling down. It stems from the fact that one load balancer VIP generates three openflows per backend but only one logical flow. By default, ovn-controller trims memory when the lflow cache drops below 50% of the previous high watermark. With load balancer flows this means we stop trimming memory a bit too early. We can actually see in the logs that automatic trimming stops happening while the ratio between lflow cache entries and the high watermark is approximately 65%.

We can fix this by making ovn-controller perform an unconditional trim, just once, a fixed number of seconds after the lflow cache was last updated. This would allow the system to reclaim all possible memory when ovn-controller becomes idle. I sent a patch for that upstream: http://patchwork.ozlabs.org/project/ovn/list/?series=273500&state=*

Nevertheless, I'd like to make sure we're not hitting other issues too. Murali, would it be possible to run another test as follows?

1. Use the same ovn-kubernetes image as when the bug was reported: quay.io/itssurya/dev-images:scale-fixes-PR-839-second-deadlock
2. Make sure all ovnkube-node and ovnkube-master pods have been restarted and are using the new image.
3. Before running the test workload, choose one node, find its ovnkube-node pod and delete it, e.g.:
   oc delete pod ovnkube-node-xxx
   # This will recreate a pod, ovnkube-node-yyy, but we know for sure
   # ovn-controller started "clean" there.
4. Raise the memory trimming percentage:
   oc exec ovnkube-node-yyy -c ovn-controller -- ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=70
5. Run the test workload.
6. Clean up the test resources, wait a bit (30 seconds should be enough), then check memory usage of ovnkube-node-yyy.

Thanks,
Dumitru
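As a rough illustration of the ratio described above, a small helper like the following (not part of the original report; it only parses the lflow-cache/show-stats output quoted in this bug) can show whether the cache has fallen below a given trim watermark percentage:

# Illustrative only: compare the lflow cache size against its high watermark.
# The trim watermark percentage defaults to 50; the experiments in this bug use 70 and 90.
wmark_perc=50
stats=$(ovn-appctl -t ovn-controller lflow-cache/show-stats)
total=$(echo "$stats" | awk '$1 == "total" {print $3}')
hwm=$(echo "$stats" | awk '$1 == "high-watermark" {print $3}')
ratio=$((100 * total / hwm))
echo "total=$total high-watermark=$hwm => ${ratio}% of the watermark"
if [ "$ratio" -ge "$wmark_perc" ]; then
  echo "cache is still at or above ${wmark_perc}% of the high watermark; the automatic trim will not fire"
else
  echo "cache dropped below ${wmark_perc}% of the high watermark; the automatic trim should fire"
fi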
Dumitru, I followed the steps (but using your image - quay.io/dceara0/dev-images:PR839-1118-01):

$ oc get pods -o wide | grep 139-
ovnkube-node-78lsf   4/4   Running   2 (5d3h ago)   5d3h   192.168.216.152   worker139-fc640   <none>   <none>
$ oc delete pod ovnkube-node-78lsf
pod "ovnkube-node-78lsf" deleted
$ oc get pods -o wide | grep 139-
ovnkube-node-rdl5d   4/4   Running   2 (40s ago)   44s   192.168.216.152   worker139-fc640   <none>   <none>
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=70

Memory stats - After restart
----------------------------
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 16114
total : 16113
cache-conj-id : 0
cache-expr : 11191
cache-matches : 4922
trim count : 0
Mem usage (KB) : 47243

During Workload
---------------
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 77804
total : 77801
cache-conj-id : 0
cache-expr : 26795
cache-matches : 51006
trim count : 0
Mem usage (KB) : 224347

After cleanup
-------------
$ oc exec ovnkube-node-rdl5d -c ovn-controller -- ovn-appctl -t ovn-controller lflow-cache/show-stats
Enabled: true
high-watermark : 18680
total : 16113
cache-conj-id : 0
cache-expr : 11191
cache-matches : 4922
trim count : 4
Mem usage (KB) : 47243

I still noticed the same problem; see the grafana snapshot of ovnkube-node pod memory utilization: https://snapshot.raintank.io/dashboard/snapshot/p8Vm5vRdEtrjZg4SGLipDNSlK3XcS8eu?viewPanel=142&orgId=2

Hi Murali,

Thanks for the test! Looking at the lflow cache stats "after cleanup" I see:

high-watermark : 18680
total : 16113

This means we're still above the 70% watermark percentage configured for auto cache trimming. I connected to the setup and forced an additional memory trim by increasing the watermark percentage:

$ ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=90

This immediately triggered a trim in ovn-controller and memory usage went down from 2.3g RSS to ~1.0g RSS.

With the patch I sent for review (http://patchwork.ozlabs.org/project/ovn/list/?series=273500&state=*) this would happen automatically every time ovn-controller detects that no logical flows have been added or removed for at least 30 seconds. So, when that patch (or something similar) is accepted, we shouldn't be seeing this problem anymore.

Moving to POST.

Regards,
Dumitru

*** Bug 1988565 has been marked as a duplicate of this bug. ***

@dceara iiuc this should require no CMS configuration to work, right? I noticed in your comment you did "$ ovs-vsctl set open . external_ids:ovn-trim-wmark-perc-lflow-cache=90", but then you go on to say that is automatic with your patch. So I'm thinking the only potential configuration here for ovn-k is the timer (in case we want something more/less often than 30 sec). Is that right?

(In reply to Tim Rozet from comment #7)
> @dceara iiuc this should require no CMS configuration to work, right? I
> noticed in your comment you did "$ ovs-vsctl set open .
> external_ids:ovn-trim-wmark-perc-lflow-cache=90", but then you go on to say
> that is automatic with your patch. So I'm thinking the only potential
> configuration here for ovn-k is the timer (in case we want something
> more/less often than 30 sec). Is that right?

Correct, ovn-k shouldn't need to do more than tweaking the timer at this point. However, a smaller value might be detrimental; memory trimming can be a costly operation.
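If ovn-k does end up tweaking that timer, it would presumably be set the same way as the other lflow-cache knobs shown in this bug, i.e. via an external_ids key on the local Open_vSwitch record. The key name used in the sketch below, ovn-trim-timeout-ms, is an assumption (it is not quoted anywhere in this bug report); the ovn-controller(8) man page of the shipped version is the authority.

# Hypothetical sketch, assuming the inactivity-trim timer is exposed as
# external_ids:ovn-trim-timeout-ms (NOT confirmed in this bug): raise it from
# the 30-second default discussed above to 60 seconds.
oc exec ovnkube-node-yyy -c ovn-controller -- \
  ovs-vsctl set open . external_ids:ovn-trim-timeout-ms=60000
# Once the lflow cache has been idle for that long, the trim count should increase:
oc exec ovnkube-node-yyy -c ovn-controller -- \
  ovn-appctl -t ovn-controller lflow-cache/show-stats | grep 'trim count'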
thanks @dceara. Is this easily backportable to earlier versions of OVN? Thinking of backporting it in OCP, which would need 21.09 and 20.12.

(In reply to Tim Rozet from comment #10)
> thanks @dceara. Is this easily backportable to earlier versions of OVN?
> Thinking of backporting it in OCP, which would need 21.09 and 20.12.

Replying just from the perspective of feasibility:
- 21.09: should be straightforward
- 20.12: we would need to first port the patches added for bug 1967882

However, I think we need a wider-audience discussion to see whether we should backport these features downstream-only or instead bump OCP to a newer (and better) OVN version (cc @mmichels).

Tested with the following script:
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.184.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.184.25 external_ids:ovn-enable-lflow-cache=true external_ids:ovn-trim-wmark-perc-lflow-cache=10
systemctl restart ovn-controller
ovn-nbctl set NB_GLOBAL . options:northd_probe_interval=180000
ovn-nbctl set connection . inactivity_probe=180000
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=180
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=180000
ovn-sbctl set connection . inactivity_probe=180000
ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln_p1
ovn-nbctl lsp-set-addresses ln_p1 unknown
ovn-nbctl lsp-set-type ln_p1 localnet
ovn-nbctl lsp-set-options ln_p1 network_name=nattest
controller_pid=$(cat /var/run/ovn/ovn-controller.pid )
grep RSS /proc/$controller_pid/status > test_stat
i=1
for m in `seq 0 9`;do
for n in `seq 1 99`;do
ovn-nbctl lr-add r${i}
ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
ovn-nbctl lrp-set-gateway-chassis r${i}_public hv1
# s1
ovn-nbctl ls-add s${i}
# s1 - r1
ovn-nbctl lsp-add s${i} s${i}_r${i}
ovn-nbctl lsp-set-type s${i}_r${i} router
ovn-nbctl lsp-set-addresses s${i}_r${i} router
ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}
# s1 - vm1
ovn-nbctl lsp-add s$i vm$i
ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"
ovs-vsctl add-port br-int vm$i -- set interface vm$i type=internal external_ids:iface-id=vm$i
# note: r${i}_public was already created above with a different MAC; without
# --may-exist this second lrp-add fails and the port keeps its original MAC
ovn-nbctl lrp-add r$i r${i}_public 40:44:00:00:$m:$n 172.16.$m.$n/16
ovn-nbctl lsp-add public public_r${i}
ovn-nbctl lsp-set-type public_r${i} router
ovn-nbctl lsp-set-addresses public_r${i} router
ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public
let i++
if [ $i -gt 300 ];then
break;
fi
done
if [ $i -gt 300 ];then
break;
fi
done
#add host vm1
ip netns add vm1
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 00:de:ad:01:00:01
ip netns exec vm1 ip addr add 173.0.1.2/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1
ip netns add vm2
ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 00:de:ad:01:00:02
ip netns exec vm2 ip addr add 173.0.2.2/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ovs-vsctl set Interface vm2 external_ids:iface-id=vm2
#set provider network
ovs-vsctl add-br nat_test
ip link set nat_test up
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=nattest:nat_test
ip netns add vm0
ovs-vsctl add-port nat_test vm0 -- set interface vm0 type=internal
ip link set vm0 netns vm0
ip netns exec vm0 ip link set vm0 address 00:00:00:00:00:01
ip netns exec vm0 ip addr add 172.16.0.100/16 dev vm0
ip netns exec vm0 ip link set vm0 up
ovs-vsctl set Interface vm0 external_ids:iface-id=vm0
ip netns exec vm1 ip route add default via 173.0.1.1
ip netns exec vm2 ip route add default via 173.0.2.1
ovn-nbctl --wait=hv sync
sleep 30
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1
echo "after add all ls" >> test_stat
grep RSS /proc/$controller_pid/status >> test_stat
ovn-appctl -t ovn-controller lflow-cache/show-stats >> test_stat
i=100
for m in `seq 0 9`;do
for n in `seq 1 99`;do
ovn-nbctl lr-del r${i}
ovs-vsctl del-port vm$i
ovn-nbctl ls-del s${i}
let i++
if [ $i -gt 300 ];then
break;
fi
done
if [ $i -gt 300 ];then
break;
fi
done
ovn-nbctl --wait=hv sync
sleep 60
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1
echo "after del ls" >> test_stat
grep RSS /proc/$controller_pid/status >> test_stat
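For convenience, a hypothetical helper like the one below (not part of the verification run above) could summarize the RSS samples the script appends to test_stat; it assumes the three VmRSS lines recorded before the loop, after adding, and after deleting:

# Hypothetical helper, not part of the original verification: print the VmRSS
# samples from test_stat and the change between "after add" and "after del".
grep VmRSS test_stat
rss_after_add=$(grep VmRSS test_stat | sed -n 2p | awk '{print $2}')
rss_after_del=$(grep VmRSS test_stat | sed -n 3p | awk '{print $2}')
echo "RSS delta after deletion: $((rss_after_del - rss_after_add)) kB"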
result on ovn-2021-21.09.1-24:
VmRSS: 4628 kB
after add all ls
VmRSS: 986720 kB
Enabled: true
high-watermark : 201103
total : 201103
cache-conj-id : 0
cache-expr : 195607
cache-matches : 5496
trim count : 0
Mem usage (KB) : 247754
after del ls
VmRSS: 986872 kB
<=== memory doesn't decrease
Enabled: true
high-watermark : 201103
total : 27037
cache-conj-id : 0
cache-expr : 25159
cache-matches : 1878
trim count : 0
<== trim count is 0
Mem usage (KB) : 45089
result on ovn-2021-21.12.0-11:
VmRSS: 4676 kB
after add all ls
VmRSS: 1009264 kB
Enabled: true
high-watermark : 202005
total : 202005
cache-expr : 196507
cache-matches : 5498
trim count : 1
Mem usage (KB) : 229471
after del ls
VmRSS: 481368 kB
<=== memory decreased
Enabled: true
high-watermark : 27336
total : 27336
cache-expr : 25456
cache-matches : 1880
trim count : 2
<=== trim count is 2
Mem usage (KB) : 38447
Dumitru, does the result show that the feature takes effect?
(In reply to Jianlin Shi from comment #15)
> [test results on ovn-2021-21.09.1-24 and ovn-2021-21.12.0-11 quoted above]
> Dumitru, does the result show that the feature takes effect?

Looks good to me, thanks!

Set VERIFIED per comment 15 and comment 16.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0674

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
Created attachment 1842645 [details]
Logs and ovn-appctl out

Description of problem:

On a 4.10 nightly baremetal cluster (500 nodes), the ovnkube-node pod consumes a reasonable amount of memory while running a cluster-density workload (30 pods per node), but it does not return the memory to the pool after the test pods are deleted. The ovn-controller container appears to hold on to it indefinitely until we restart it manually.

Version-Release number of selected component (if applicable):

OCP - 4.10.0-0.nightly-2021-10-21-105053

[kni@e16-h12-b02-fc640 ~]$ oc rsh -c ovn-controller ovnkube-node-v4prw
sh-4.4# rpm -qa | grep ovn
ovn21.09-central-21.09.0-25.el8fdp.x86_64
ovn21.09-vtep-21.09.0-25.el8fdp.x86_64
ovn21.09-21.09.0-25.el8fdp.x86_64
ovn21.09-host-21.09.0-25.el8fdp.x86_64

How reproducible:

Often reproducible on a baremetal cluster.

Steps to Reproduce:
1. Deploy a healthy cluster.
2. Run a pod-creation workload (30 pods per node) and watch memory grow during the workload.
3. After deleting the test pods, observe that ovnkube-node does not release the memory back to the pool.

Actual results:
ovnkube-node does not release the memory until you restart it.

Expected results:
Memory is released gradually, as it used to be before.

Additional info:
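For anyone reproducing this, a minimal sketch (assuming the same OCP/ovn-kubernetes layout as in this report; the pod name is a placeholder) for sampling ovn-controller memory on one node during the workload and after cleanup:

# Sketch only: sample ovn-controller RSS and lflow cache stats inside one
# ovnkube-node pod. POD is a placeholder for the pod on the node being watched.
POD=ovnkube-node-xxxxx
oc exec "$POD" -c ovn-controller -- \
  sh -c 'grep VmRSS /proc/$(cat /var/run/ovn/ovn-controller.pid)/status'
oc exec "$POD" -c ovn-controller -- \
  ovn-appctl -t ovn-controller lflow-cache/show-stats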