+++ This bug was initially created as a clone of Bug #1770295 +++ Description of problem: ovn-controller consumes to much memory and eventually core dumps Version-Release number of selected component (if applicable): ovn2.11-central-2.11.1-0.el7fdn.x86_64 rhosp-openvswitch-ovn-host-2.11-0.1.el7ost.noarch rhosp-openvswitch-ovn-common-2.11-0.1.el7ost.noarch puppet-ovn-12.4.0-2.el7ost.noarch ovn2.11-2.11.1-0.el7fdn.x86_64 ovn2.11-host-2.11.1-0.el7fdn.x86_64 rhosp-openvswitch-ovn-central-2.11-0.1.el7ost.noarch How reproducible: This environment Steps to Reproduce: 1. Let it sit for a while 2. 3. Actual results: ctl01 ovn_controller memory consumption is at 39.27G. ctl02 ovn_controller memory consumption is at 27.72G. ctl03 ovn_controller memory consumption is at 19.87G. Expected results: Additional info: --- Additional comment from David Hill on 2019-11-08 16:32:19 UTC --- We have an actual core dump for that process if you want it , it's available on supportshell/collabshell. --- Additional comment from ffernand on 2019-11-08 23:26:33 UTC --- Core (below) is a sigkill (15). I see the following smoking guns in: 30-sosreport-iad1-ctl01-02513543-2019-11-07-xhpxegf.tar.xz/sosreport-iad1-ctl01-02513543-2019-11-07-xhpxegf/ /var/log/containers/openvswitch/ovn-controller.log 2019-11-07T20:32:42.261Z|431093|lflow|WARN|error parsing match "!ct.new && ct.est && !ct.rpl && ct_label.blocked == 0 && (inport == "4b18b2ae-fc93-4e5b-b979-8e0a8a82792d" && ip4 && ip4.dst == 10.75.32.90/29 && tcp && tcp.dst == 22)": Value contains unmasked 1-bits. 2019-11-07T20:32:48.730Z|431094|timeval|WARN|Unreasonably long 12862ms poll interval (12762ms user, 70ms system) 2019-11-07T20:32:48.730Z|431095|timeval|WARN|faults: 32015 minor, 0 major 2019-11-07T20:32:48.730Z|431096|timeval|WARN|context switches: 0 voluntary, 12 involuntary 2019-11-07T20:32:48.730Z|431097|coverage|INFO|Dropped 3 log messages in last 38 seconds (most recently, 13 seconds ago) due to excessive rate 2019-11-07T20:32:48.730Z|431098|coverage|INFO|Skipping details of duplicate event coverage for hash=a3f3a1ae 2019-11-07T20:32:48.730Z|431099|poll_loop|INFO|Dropped 1 log messages in last 13 seconds (most recently, 13 seconds ago) due to excessive rate 2019-11-07T20:32:48.730Z|431100|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage) =-=- 10-sosreport-iad1-ctl02-2019-11-07-qyqaohs.tar.xz/sosreport-iad1-ctl02-2019-11-07-qyqaohs/ /var/log/containers/openvswitch/ovn-controller.log 2019-11-07T18:30:05.597Z|352338|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (inport == "1e39978a-ade7-4a05-aed4-72423d887ffa" && ip4 && ip4.dst == 10.75.32.90/29 && tcp && tcp.dst == 10022)": Value contains unmasked 1-bits. 2019-11-07T18:32:22.785Z|352417|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage) 2019-11-07T18:32:33.772Z|352418|timeval|WARN|Unreasonably long 10919ms poll interval (10836ms user, 58ms system) 2019-11-07T18:32:33.772Z|352419|timeval|WARN|faults: 32622 minor, 0 major 2019-11-07T18:32:33.772Z|352420|timeval|WARN|context switches: 0 voluntary, 2 involuntary 2019-11-07T18:32:33.772Z|352421|poll_loop|INFO|Dropped 3 log messages in last 11 seconds (most recently, 11 seconds ago) due to excessive rate 2019-11-07T18:32:33.772Z|352422|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage) =-=- 40-sosreport-iad1-ctl03-02513543-2019-11-07-atiskus.tar.xz/sosreport-iad1-ctl03-02513543-2019-11-07-atiskus/ 2019-11-07T20:34:25.165Z|297788|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (inport == "1e39978a-ade7-4a05-aed4-72423d887ffa" && ip4 && ip4.dst == 10.75.32.90/29 && tcp && tcp.dst == 10022)": Value contains unmasked 1-bits. 2019-11-07T20:34:38.075Z|297789|timeval|WARN|Unreasonably long 13035ms poll interval (12911ms user, 87ms system) 2019-11-07T20:34:38.075Z|297790|timeval|WARN|faults: 32411 minor, 0 major 2019-11-07T20:34:38.075Z|297791|timeval|WARN|disk: 0 reads, 8 writes 2019-11-07T20:34:38.075Z|297792|timeval|WARN|context switches: 0 voluntary, 466 involuntary 2019-11-07T20:34:38.075Z|297793|poll_loop|INFO|Dropped 44 log messages in last 38 seconds (most recently, 35 seconds ago) due to excessive rate 2019-11-07T20:34:38.076Z|297794|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (94% CPU usage) ===== Core: [ffernand@c7vm support_02513543]$ gdb -c 20-core.1 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. [New LWP 16] [New LWP 17] [New LWP 15] [New LWP 1] Reading symbols from /usr/bin/ovn-controller...Reading symbols from /usr/lib/debug/usr/bin/ovn-controller.debug...done. done. warning: Ignoring non-absolute filename: <%0*Zx, 0x%0*Zx)> Missing separate debuginfo for %0*Zx, 0x%0*Zx) Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/63/023e5151c4d5bbdb7f6f85a6f95a6b95dc780e warning: Could not load shared library symbols for %0*Zx, 0x%0*Zx). Do you need "set solib-search-path" or "set sysroot"? Warning: couldn't activate thread debugging using libthread_db: Cannot find new threads: generic error warning: File "/usr/lib64/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py". To enable execution of this file add add-auto-load-safe-path /usr/lib64/libthread_db-1.0.so line to your configuration file "/home/ffernand/.gdbinit". To completely disable this security protection add set auto-load safe-path / line to your configuration file "/home/ffernand/.gdbinit". For more information about this security protection see the "Auto-loading safe path" section in the GDB manual. E.g., run from the shell: info "(gdb)Auto-loading safe path" warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available. Warning: couldn't activate thread debugging using libthread_db: Cannot find new threads: generic error warning: File "/usr/lib64/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py". warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available. Core was generated by `/usr/bin/ovn-controller --pidfile --log-file unix:/run/openvswitch/db.sock'. Program terminated with signal 11, Segmentation fault. #0 0x00007fec77afabb7 in _L_unlock_138 () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 python-libs-2.7.5-86.el7.x86_64 unbound-libs-1.6.6-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64 (gdb) bt full #0 0x00007fec77afabb7 in _L_unlock_138 () from /lib64/libc.so.6 No symbol table info available. #1 0x00007fec77afabc3 in _L_unlock_138 () from /lib64/libc.so.6 No symbol table info available. #2 0x00007fec77afaa23 in abort () from /lib64/libc.so.6 No symbol table info available. #3 0x000055c597101bc4 in fatal_signal_run () at lib/fatal-signal.c:213 namebuf = "\360\t\000h\354\177", '\000' <repeats 11 times>, "\020\200" sig_nr = 15 #4 0x000055c59716cdda in poll_block () at lib/poll-loop.c:388 loop = 0x7fec680009f0 node = <optimized out> pollfds = 0x7fec680008c0 elapsed = 2023730222 retval = <optimized out> i = <optimized out> #5 0x000055c59715463e in ovsrcu_postpone_thread (arg=<optimized out>) at lib/ovs-rcu.c:360 seqno = 225 #6 0x000055c5971569df in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383 auxp = <optimized out> aux = {start = 0x55c5971545c0 <ovsrcu_postpone_thread>, arg = 0x0, name = "urcu", '\000' <repeats 11 times>} id = 1 subprogram_name = 0x7fec680008c0 "\210" #7 0x00007fec786a3ea5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #8 0x00007fec77bc18cd in lseek64 () from /lib64/libc.so.6 No symbol table info available. #9 0x0000000000000000 in ?? () No symbol table info available. (gdb) (gdb) thread apply all bt Thread 4 (LWP 1): #0 0x00007fec786aa54d in __lll_timedlock_wait () from /lib64/libpthread.so.0 #1 0x00000000ffffff00 in ?? () #2 0x00007fec786a5eb6 in pthread_mutex_trylock () from /lib64/libpthread.so.0 #3 0x00000000c6f084aa in ?? () #4 0x0000000000020020 in ?? () #5 0x0000000000000028 in ?? () #6 0x000055c6b6108fe0 in ?? () #7 0x000055c597cc9900 in ?? () #8 0x00007fec77b495a4 in calloc () from /lib64/libc.so.6 #9 0x000055c597155c88 in ovs_mutex_lock_at (l_=0x5, l_@entry=0x55c5975055e0 <mutex>, where=0xc6f084aa <Address 0xc6f084aa out of bounds>, where@entry=0x55c597231880 "lib/fatal-signal.c:196") at lib/ovs-thread.c:75 #10 0x000055c597101b81 in fatal_signal_run () at lib/fatal-signal.c:196 #11 0x000055c59716cdda in poll_block () at lib/poll-loop.c:388 #12 0x000055c5970b6c6a in main (argc=4, argv=0x7ffc9a00aba8) at ovn/controller/ovn-controller.c:2129 Thread 3 (LWP 15): #0 0x00007fec77bc2a45 in socketpair () from /lib64/libc.so.6 #1 0x00007fec77bbb370 in __vsyslog_chk () from /lib64/libc.so.6 #2 0x00007fec75a6d848 in ?? () #3 0x00007fec77bbb717 in __vsyslog_chk () from /lib64/libc.so.6 #4 0x00007fec77bbba12 in openlog () from /lib64/libc.so.6 #5 0x0000003000000020 in ?? () #6 0x00007fec75a6d920 in ?? () #7 0x00007fec75a6d860 in ?? () #8 0x0000000000000000 in ?? () Thread 2 (LWP 17): #0 0x00007fec786aa54d in __lll_timedlock_wait () from /lib64/libpthread.so.0 #1 0x00000000ffffffff in ?? () #2 0x00007fec786a5eb6 in pthread_mutex_trylock () from /lib64/libpthread.so.0 #3 0x00000000c6f084aa in ?? () #4 0x0000000000020510 in ?? () #5 0x0000000000000028 in ?? () #6 0x00007fec6c000af0 in ?? () #7 0x00007fec74a6bb30 in ?? () #8 0x00007fec77b495a4 in calloc () from /lib64/libc.so.6 #9 0x000055c597155c88 in ovs_mutex_lock_at (l_=0x5, l_@entry=0x55c5975055e0 <mutex>, where=0xc6f084aa <Address 0xc6f084aa out of bounds>, where@entry=0x55c597231880 "lib/fatal-signal.c:196") at lib/ovs-thread.c:75 #10 0x000055c597101b81 in fatal_signal_run () at lib/fatal-signal.c:196 #11 0x000055c59716cdda in poll_block () at lib/poll-loop.c:388 #12 0x000055c59716c0ac in stopwatch_thread (ign=<optimized out>) at lib/stopwatch.c:459 #13 0x000055c5971569df in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383 #14 0x00007fec786a3ea5 in start_thread () from /lib64/libpthread.so.0 #15 0x00007fec77bc18cd in lseek64 () from /lib64/libc.so.6 #16 0x0000000000000000 in ?? () Thread 1 (LWP 16): #0 0x00007fec77afabb7 in _L_unlock_138 () from /lib64/libc.so.6 #1 0x00007fec77afabc3 in _L_unlock_138 () from /lib64/libc.so.6 #2 0x00007fec77afaa23 in abort () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- #3 0x000055c597101bc4 in fatal_signal_run () at lib/fatal-signal.c:213 #4 0x000055c59716cdda in poll_block () at lib/poll-loop.c:388 #5 0x000055c59715463e in ovsrcu_postpone_thread (arg=<optimized out>) at lib/ovs-rcu.c:360 #6 0x000055c5971569df in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383 #7 0x00007fec786a3ea5 in start_thread () from /lib64/libpthread.so.0 #8 0x00007fec77bc18cd in lseek64 () from /lib64/libc.so.6 #9 0x0000000000000000 in ?? () (gdb) --- Additional comment from ffernand on 2019-11-08 23:50:27 UTC --- Controller 1 is likely leaking memory: # for x in $(find . -name "free_-m") ; do echo $x ; cat $x ; echo ; done ./10-sosreport-iad1-ctl02-2019-11-07-qyqaohs.tar.xz/sosreport-iad1-ctl02-2019-11-07-qyqaohs/sos_commands/memory/free_-m total used free shared buff/cache available Mem: 288589 153855 47259 60 87474 131938 Swap: 0 0 0 ./30-sosreport-iad1-ctl01-02513543-2019-11-07-xhpxegf.tar.xz/sosreport-iad1-ctl01-02513543-2019-11-07-xhpxegf/sos_commands/memory/free_-m total used free shared buff/cache available Mem: 288589 213246 1810 61 73531 72790 Swap: 0 0 0 ./40-sosreport-iad1-ctl03-02513543-2019-11-07-atiskus.tar.xz/sosreport-iad1-ctl03-02513543-2019-11-07-atiskus/sos_commands/memory/free_-m total used free shared buff/cache available Mem: 289409 141210 1110 71 147088 145570 Swap --- Additional comment from ffernand on 2019-11-08 23:56:47 UTC --- This is to be confirmed, but could be a potential cause for leaks that is not in the used code: $ git show dc0e10c068c20c4e59c9c86ecee26baf8ed50e90 commit dc0e10c068c20c4e59c9c86ecee26baf8ed50e90 Author: Han Zhou <hzhou> Date: Mon Oct 21 16:33:40 2019 -0700 ovn-controller.c: Fix memory leak of local_datapath->ports. Fixes: 89f5048f960c ("ovn-controller: Minimize SB DB port_binding lookups.") Acked-by: Numan Siddique <numans> Acked-by: Dumitru Ceara <dceara> Signed-off-by: Han Zhou <hzhou> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c index b46a1d151..9ab98be5c 100644 --- a/controller/ovn-controller.c +++ b/controller/ovn-controller.c @@ -972,6 +972,7 @@ en_runtime_data_cleanup(struct engine_node *node) HMAP_FOR_EACH_SAFE (cur_node, next_node, hmap_node, &data->local_datapaths) { free(cur_node->peer_ports); + free(cur_node->ports); hmap_remove(&data->local_datapaths, &cur_node->hmap_node); free(cur_node); } @@ -1002,6 +1003,7 @@ en_runtime_data_run(struct engine_node *node) struct local_datapath *cur_node, *next_node; HMAP_FOR_EACH_SAFE (cur_node, next_node, hmap_node, local_datapaths) { free(cur_node->peer_ports); + free(cur_node->ports); hmap_remove(local_datapaths, &cur_node->hmap_node); free(cur_node); }
run stress test with following script: systemctl start ovn-northd systemctl start openvswitch ovn-nbctl set-connection ptcp:6641 ovn-sbctl set-connection ptcp:6642 ovs-vsctl set Open_vSwitch . external-ids:system-id=hv0 external-ids:ovn-remote=tcp:20.0.30.26:6642 external-ids:ovn-encap-type=geneve external-ids:ovn-encap-ip=20.0.30.26 systemctl start ovn-controller ctl_pid=$(ps aux | grep ovn-controller | grep -v grep | awk '{print $2}') free -m | tee memory.log pmap $ctl_pid >> memory.log echo "start" for i in {11..99} do for j in {11..13} do #free -m ovn-nbctl ls-add s${j}w$i ovn-nbctl lsp-add s${j}w$i a${j}p${i}1 ovn-nbctl lsp-set-addresses a${j}p${i}1 "00:00:00:$j:$i:02 192.$i.$j.2" ovn-nbctl lsp-add s${j}w$i a${j}p${i}2 ovn-nbctl lsp-set-addresses a${j}p${i}2 "00:00:00:$j:$i:03 192.$i.$j.3" ovn-nbctl lsp-add s${j}w$i a${j}p${i}3 ovn-nbctl lsp-set-addresses a${j}p${i}3 "00:00:00:$j:$i:04 192.$i.$j.4" ovs-vsctl add-port br-int a${j}p${i}1 -- set Interface a${j}p${i}1 type=internal external_ids:iface-id=a${j}p${i}1 ovs-vsctl add-port br-int a${j}p${i}2 -- set Interface a${j}p${i}2 type=internal external_ids:iface-id=a${j}p${i}2 ovs-vsctl add-port br-int a${j}p${i}3 -- set Interface a${j}p${i}3 type=internal external_ids:iface-id=a${j}p${i}3 ovn-nbctl create Address_Set name=set$i$j "addresses=192.$i.$j.2,192.$i.$j.3,192.$i.$j.4" ovn-nbctl pg-add pg${i}$j a${j}p${i}1 a${j}p${i}2 a${j}p${i}3 ovn-nbctl acl-add s${j}w$i to-lport 1000 "inport==@pg${i}${j} && ip4 && ip4.src==\$set$i$j" allow #ovs-ofctl dump-flows br-int | wc -l #sleep 5 #ovs-ofctl dump-flows br-int | wc -l #free -m #ovs-ofctl dump-flows br-int | wc -l | tee -a memory.log #echo "i:$i, j:$j" >> memory.log #free -m | tee -a memory.log done done sleep 5 ovs-ofctl dump-flows br-int | wc -l | tee -a memory.log echo "finish add ports" >> memory.log free -m | tee -a memory.log pmap $ctl_pid >> memory.log echo "start del ports" for i in {11..99} do for j in {11..13} do ovn-nbctl ls-del s${j}w$i done done sleep 5 ovs-ofctl dump-flows br-int | wc -l | tee -a memory.log echo "finish del ports" >> memory.log free -m | tee -a memory.log pmap $ctl_pid >> memory.log on ovn2.11.1-13: total used free shared buff/cache available Mem: 63686 1218 61631 63 837 61961 Swap: 31999 0 31999 103151: ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --- no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controo ller.pid --detach ...... total 281120K 23810 finish add ports total used free shared buff/cache available Mem: 63686 1424 61371 69 890 61748 Swap: 31999 0 31999 103151: ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --- no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controo ller.pid --detach ...... total 328256K 6 finish del ports total used free shared buff/cache available Mem: 63686 1428 61364 69 892 61744 Swap: 31999 0 31999 103151: ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --- no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controo ller.pid --detach ...... total 327832K on ovn2.11.1-15: total used free shared buff/cache available Mem: 63686 1194 61651 69 840 61981 Swap: 31999 0 31999 110033: ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --- no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controo ller.pid --detach ...... total 281120K 23810 finish add ports total used free shared buff/cache available Mem: 63686 1402 61394 69 889 61771 Swap: 31999 0 31999 110033: ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --- no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controo ller.pid --detach ...... total 316448K 6 finish del ports total used free shared buff/cache available Mem: 63686 1405 61387 69 892 61767 Swap: 31999 0 31999 110033: ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --- no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controo ller.pid --detach ...... total 316184K <==== memory for ovn-controller on 15 is much less than that on 13. set VERIFIED per above test. @Dumitru, how do you think?
(In reply to Jianlin Shi from comment #5) > > <==== memory for ovn-controller on 15 is much less than that on 13. > > set VERIFIED per above test. > > @Dumitru, how do you think? Looks good to me. Regards, Dumitru
set VERIFIED per comment 6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0190