+++ This bug was initially created as a clone of Bug #1787318 +++

Description of problem:
With the attached scaled configuration if logical-switches are deleted ovn-controller might access freed memory and crash.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Start ovn-northd and point it to the attached northbound db (ovnnb_db.db).
2. Start ovn-controller.
3. Start OVS and bind the logical_switch_ports locally:

   for i in $(ovn-nbctl --bare --columns name find logical_switch_port type=\"\"); do
       vm=$(echo $i | cut -f 1 -d "-")
       ovs-vsctl add-port br-int $vm -- set interface $vm type=internal
       ovs-vsctl set interface $vm external_ids:iface-id=$i
   done

4. Delete all logical switches:

   for s in $(ovn-nbctl list logical_switch | grep -E "^name" | cut -f 2 -d ':' | cut -f 2 -d '"'); do
       ovn-nbctl ls-del $s
   done

Actual results:
ovn-controller might crash:

   Program received signal SIGSEGV, Segmentation fault.
   0x00000000004b8f47 in hmap_first_with_hash (hmap=hmap@entry=0x91da08, hmap=hmap@entry=0x91da08, hash=2346380341) at ./include/openvswitch/hmap.h:328
   328         return hmap_next_with_hash__(hmap->buckets[hash & hmap->mask], hash);

Expected results:
ovn-controller shouldn't use memory after it was freed.

Additional info:
Fixed upstream by commits:
2a4965c0e187db0c4218556ed9b06f988e88cb62: ovn-controller: Refactor I-P engine_run() tracking.
5ed53faecef12c09330ced445418c961cb1f8caf: ovn-controller: Add per node states to I-P engine.
2117ba0a91f36206d0f3665e8680c15f1f6fa0a0: ovn-controller: Add separate I-P engine node for processing ct-zones.
94cbc59dc0f1cb56e56d1551956efe5824561864: ovn-controller: Fix use of dangling pointers in I-P runtime_data.
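For anyone re-running the reproducer, steps 3 and 4 can be wrapped in a few shell functions so that a crash is noticed without attaching gdb. This is only a sketch: the function names, the example port name in the usage note, and the 5-second wait are assumptions and do not come from the report. It also assumes pidof is available and that ovn-nbctl can reach the northbound db.

```shell
#!/bin/sh
# Sketch only: all helper names below are hypothetical, not from the report.

# Derive the OVS interface name from a logical switch port name the way
# step 3 does: keep everything before the first "-".
port_to_vm() {
    printf '%s\n' "$1" | cut -f 1 -d "-"
}

# Return 0 if a process named ovn-controller is running, non-zero otherwise.
# Assumes pidof is installed on the host.
controller_alive() {
    pidof ovn-controller >/dev/null 2>&1
}

# Delete every logical switch (step 4), then report whether ovn-controller
# is still running. The sleep is an arbitrary grace period, not a measured
# value; switch names containing whitespace are not handled.
delete_switches_and_check() {
    for s in $(ovn-nbctl --bare --columns name list logical_switch); do
        ovn-nbctl ls-del "$s"
    done
    sleep 5
    if controller_alive; then
        echo "ovn-controller still running"
    else
        echo "ovn-controller not running (possible crash)"
        return 1
    fi
}
```

After sourcing the file, "port_to_vm vm1-lsp1" prints the interface name that step 3 would use for a port with that (made-up) name, and "delete_switches_and_check" performs step 4 followed by the liveness check. A surviving process does not prove the absence of memory corruption, so a negative result should be treated as inconclusive.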
Hi Dumitru, I failed to reproduce the issue on ovn2.11-2.11.1-24.el7fdp.x86_64 with steps in https://bugzilla.redhat.com/show_bug.cgi?id=1787318#c3.
Hi Jianlin,

The crash was made more visible by commit [1], but that commit was squashed into the patches for ovn2.11-2.11.1-26, which also fix the crash. The steps described in https://bugzilla.redhat.com/show_bug.cgi?id=1787318#c3 don't reproduce the issue because they exercise the code path added by [1].

I don't see a straightforward way of reproducing the issue without [1]. There are, in theory, code paths that would trigger the memory corruption, but I couldn't hit them.

Regards,
Dumitru

[1] https://github.com/ovn-org/ovn/commit/fc1e1640cd47f255c68488b0ec36052b0af58fd2#diff-452d44dee1f09b8a972c69ef7499a69c
set VERIFIED per comment 3
All these bugs have been verified and have shipped in FDP 20.G or earlier.