Bug 1859924
| Field | Value |
| --- | --- |
| Summary | possible memory leak in sb-db raft cluster |
| Product | Red Hat Enterprise Linux Fast Datapath |
| Component | OVN |
| Version | RHEL 7.7 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Keywords | TestBlocker |
| Reporter | Anil Vishnoi <avishnoi> |
| Assignee | Numan Siddique <nusiddiq> |
| QA Contact | Jianlin Shi <jishi> |
| CC | achernet, anbhat, ctrautma, dblack, dcbw, i.maximets, jtaleric, mark.d.gray, msheth, numan.siddique, nusiddiq, rkhan, rsevilla, smalleni, trozet, yjoseph |
| Hardware | All |
| OS | Unspecified |
| Fixed In Version | ovn2.13-20.12.0-1 |
| Last Closed | 2021-03-25 19:03:31 UTC |
| Type | Bug |
| Bug Depends On | 1833373, 1876990, 1877002 |
| Bug Blocks | 1891002 |
Description
Anil Vishnoi, 2020-07-23 10:20:33 UTC
Anil, will this still be a problem after https://github.com/ovn-org/ovn-kubernetes/pull/1711 lands?

Over to Numan, as he's working on a patch to reduce the number of flows for each reject ACL.

Hi Anil, is it possible to attach the OVN north db file?

Hi Numan, I observed this issue in parallel to the issue reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=1855408. The logs are uploaded here: https://drive.google.com/file/d/18dIf6qNP3IQvQOlVAOVjH6ppZuF-jKoV/view?usp=sharing

Submitted the patches for review - https://patchwork.ozlabs.org/project/ovn/list/?submitter=77669 - which reduce the number of lflows in the SB DB. With the db attached to this bz and with OVN master, ovn-northd crashes on my laptop with 16 GB of memory. With OVN master plus the above patches, ovn-northd didn't crash, so these patches would certainly help ovn-northd memory usage. Without the patches, the number of logical flows generated from the attached db is 1869383; with the patches it is 667979.

Thanks Numan. Similarly, on another scale setup I found tons of lflows per service hairpin. I saw 434000 of these:

table=6 (ls_in_acl), priority=2000, match=(((!ct.trk || !ct.est || (ct.est && ct_label.blocked == 1))) && ip4 && (ip4.dst==172.30.0.83 && tcp && tcp.dst==6443)), action=(reg0 = 0; icmp4 { eth.dst <-> eth.src; ip4.dst <-> ip4.src; outport <-> inport; next(pipeline=egress,table=5); };)
table=6 (ls_in_acl), priority=2000, match=(((!ct.trk || !ct.est || (ct.est && ct_label.blocked == 1))) && ip4 && (ip4.dst==172.30.1.1 && tcp && tcp.dst==80)), action=(reg0 = 0; icmp4 { eth.dst <-> eth.src; ip4.dst <-> ip4.src; outport <-> inport; next(pipeline=egress,table=5); };)

They look like hairpin flows to me. Can you confirm what they are? Could we cut those down? The action always seems to be the same, but the match criteria hits each service.

(In reply to Tim Rozet from comment #7)
> Thanks Numan. Similarly on another scale setup I found tons of lflows per
> service hairpin. I saw 434000 of these: [...]
> They look like hairpin flows to me. Can you confirm what they are? Could we
> cut those down?

Yes. These are hairpin flows. I'm also looking into cutting down these flows if possible.

(In reply to Numan Siddique from comment #8)
> Yes. These are hairpin flows. I'm also looking into cutting down these
> flows if possible.

Correction. The flows you listed here are not hairpin flows, but lflows added for reject ACL actions. There definitely are many hairpin flows as well, though: in the attached db, there are around 250000 hairpin flows. I'm working on reducing these lflows.

The patches to address hairpin flows are submitted for review - https://bugzilla.redhat.com/show_bug.cgi?id=1859924 There is another BZ for the hairpin flows - https://bugzilla.redhat.com/show_bug.cgi?id=1833373

(In reply to Numan Siddique from comment #13)
> The patches to address hairpin flows are submitted for review -
> https://bugzilla.redhat.com/show_bug.cgi?id=1859924

this one - https://patchwork.ozlabs.org/project/ovn/list/?series=209175

> There is another BZ for the hairpin flows -
> https://bugzilla.redhat.com/show_bug.cgi?id=1833373

*** Bug 1885713 has been marked as a duplicate of this bug. ***

*** Bug 1884049 has been marked as a duplicate of this bug. ***

*** Bug 1891002 has been marked as a duplicate of this bug. ***

Adding TestBlocker flag as per email from Dustin, Fri Dec 4th 4:16pm.

Per comment 27, the issue is fixed in ovn2.13-20.12.0-1 and later.

Fix shipped in FDP 21.A in ovn2.13-20.12.0-1.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
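As a quick sanity check on the counts quoted in the comments above (1869383 lflows from the attached db without Numan's patches, 667979 with them), the reduction works out to roughly 64%. A minimal sketch of the arithmetic:

```shell
# Lflow counts quoted in the comments above (attached db, with and
# without the lflow-reduction patches).
before=1869383
after=667979

# Absolute and relative reduction.
awk -v b="$before" -v a="$after" \
  'BEGIN { printf "reduction: %d flows (%.1f%%)\n", b - a, 100 * (b - a) / b }'
# Prints: reduction: 1201404 flows (64.3%)
```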
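For anyone trying to reproduce the kind of per-stage counts quoted above (e.g. the 434000 reject-ACL flows in `ls_in_acl`), one approach is to group an `ovn-sbctl lflow-list` dump by its stage name. This is a hedged sketch: it runs on a tiny illustrative sample rather than the attached db, and it assumes the usual `table=N (stage_name)` dump format; on a live cluster you would feed the real `ovn-sbctl lflow-list` output through the same pipeline.

```shell
# Illustrative sample standing in for real "ovn-sbctl lflow-list" output;
# these lines are NOT from the attached db.
cat > sample_lflows.txt <<'EOF'
  table=6 (ls_in_acl), priority=2000, match=(ip4 && ip4.dst==172.30.0.83 && tcp.dst==6443), action=(...)
  table=6 (ls_in_acl), priority=2000, match=(ip4 && ip4.dst==172.30.1.1 && tcp.dst==80), action=(...)
  table=7 (ls_in_qos_mark), priority=0, match=(1), action=(next;)
EOF

# Pull out the stage name printed after "table=N", then count how many
# flows land in each stage, busiest stage first.
sed -n 's/.*table=[0-9]* *(\([a-z0-9_]*\) *).*/\1/p' sample_lflows.txt \
  | sort | uniq -c | sort -rn
```

On the sample this reports two `ls_in_acl` flows and one `ls_in_qos_mark` flow; against a real dump the top entries point at the stages worth optimizing.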