+++ This bug was initially created as a clone of Bug #1667007 +++

Description of problem:

Memory leak in the neutron agents.

Environment:
  1 undercloud
  3 controllers
  3 computes
  1 public network
  1 private network
  1 router in the private network connected to the public network
  4 VMs
  1 FIP

Memory usage of the neutron agents is increasing continuously:

[root@undercloud-0 stack]# top -n1 -b -o %MEM | head -15
top - 03:27:32 up 8 days, 47 min,  2 users,  load average: 0.85, 0.96, 1.00
Tasks: 442 total,   6 running, 388 sleeping,   0 stopped,  48 zombie
%Cpu(s): 38.0 us, 25.4 sy,  0.0 ni, 35.2 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
KiB Mem : 16266508 total,   729112 free, 12441984 used,  3095412 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2323544 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
61429 42435     20   0 1839628   1.5g   5336 S  6.2  9.9  56:24.16 neutron-openvsw
60913 42435     20   0 1812496   1.5g   5364 S  0.0  9.8 112:17.25 neutron-dhcp-ag
46424 42434     20   0 2747520 632520   5360 S  6.2  3.9 104:52.61 mysqld
59411 42418     20   0  753480 508328   3536 S  0.0  3.1  29:16.57 heat-engine
59413 42418     20   0  748004 503048   3540 S  0.0  3.1  30:42.06 heat-engine
59410 42418     20   0  636752 391792   3536 S  0.0  2.4  30:03.96 heat-engine
59409 42418     20   0  611932 366980   3536 S  0.0  2.3  29:38.96 heat-engine
42350 42439     20   0 2988360 350748   3120 S  0.0  2.2  56:17.05 beam.smp

[root@compute-0 heat-admin]# top -n1 -b -o %MEM | head -15
top - 08:28:57 up 1 day, 22:13,  1 user,  load average: 0.01, 0.08, 0.12
Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.1 us,  3.1 sy,  0.0 ni, 93.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  5944896 total,   711668 free,  1251384 used,  3981844 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4192928 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND
29215 42435     20   0  704140 483864   7252 S  0.0  8.1 12:34.94 neutron-openvsw
29075 42436     20   0 1750364 138488  15480 S  6.2  2.3 16:05.75 nova-compute
45236 qemu      20   0  775256 126300  15900 S  0.0  2.1 99:01.81 qemu-kvm
28820 root      20   0  809300  63584   3612 S  0.0  1.1  9:17.30 ceilometer-poll
28658 root      20   0  362276  62096  10132 S  0.0  1.0 25:48.92 ceilometer-poll
 3342 openvsw+  10 -10  309812  47972  16552 S  0.0  0.8  4:03.48 ovs-vswitchd
17600 root      20   0 1330368  44276  17036 S  6.2  0.7  5:04.45 dockerd-current
45041 root      20   0  175848  32380   1564 S  0.0  0.5  0:00.09 privsep-helper
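A minimal sketch of how this growth can be tracked over time (a hypothetical helper, not a neutron tool; the PIDs are the two agent PIDs from the undercloud top output above):

# Hypothetical helper: sample VmRSS of the agent processes once per minute.
# The PIDs below are neutron-openvswitch-agent and neutron-dhcp-agent from
# the undercloud top output above; adjust them for other nodes.
import time

AGENT_PIDS = [61429, 60913]

def vmrss_kb(pid):
    """Return the resident set size of a process in kB, read from /proc."""
    with open('/proc/%d/status' % pid) as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

while True:
    print(', '.join('%d=%d kB' % (pid, vmrss_kb(pid)) for pid in AGENT_PIDS))
    time.sleep(60)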
Version-Release number of selected component (if applicable):

[root@undercloud-0 stack]# cat /etc/rhosp-release
Red Hat OpenStack Platform release 14.0.0 RC (Rocky)
[root@undercloud-0 stack]# cat core_puddle_version
2019-01-08.1
(undercloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[heat-admin@controller-1 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

How reproducible:

Steps to Reproduce:
1. Deploy the environment defined in the description of the problem.

--- Additional comment from Assaf Muller on 2019-01-17 15:02:42 CET ---

Can you confirm which agents you saw this in? OVS and DHCP agent only? Can you confirm that the system was idle during the 48h?

Thinking out loud: I'm wondering if, at least in the case of the OVS agent, it's possible to rapidly increase the timing of its continuous loop to reproduce this problem much quicker.

--- Additional comment from Candido Campos on 2019-01-17 15:23:55 CET ---

The agents affected are l3, dhcp and openvswitch:

[root@controller-2 heat-admin]# top -n1 -b -o %MEM | head -15
top - 08:29:23 up 1 day, 22:11,  1 user,  load average: 1.77, 1.95, 2.05
Tasks: 446 total,   3 running, 443 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.8 us, 15.5 sy,  0.0 ni, 56.3 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
KiB Mem : 16266500 total,   281420 free, 10327728 used,  5657352 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  5093944 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND
112607 42435     20   0 1102808 882740   7252 S  0.0  5.4 12:46.51 neutron-openvsw
111928 42435     20   0 1078696 857544   7336 S  0.0  5.3  4:01.36 neutron-l3-agen
111624 42435     20   0 1077792 856648   7344 S  5.9  5.3 26:37.14 neutron-dhcp-ag
 63554 42434     20   0 2327228 422840 145040 S  0.0  2.6 20:56.49 mysqld
 56997 42439     20   0 2464800 168780   4240 S  0.0  1.0 73:58.73 beam.smp
111377 42436     20   0  920592 164768   7636 S  0.0  1.0 10:08.90 httpd
111378 42436     20   0  847372 164608   7552 S  0.0  1.0  8:34.15 httpd
112033 42435     20   0  379576 145284   2264 S  0.0  0.9 24:04.93 neutron-server

Yes, I was not executing anything on the test bed. I only deployed the networks and VMs, and afterwards used the top command to monitor the memory.

We are trying to see where the leak is, but it seems to be in Python. guppy does not detect the increase in memory:

from guppy import hpy
h = hpy()
print h.heap()

but the memory does show up in the heap of the process if we check /proc/<pid>/smaps (allocated and dirty); see the sketch below.
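A minimal sketch of the /proc/<pid>/smaps check mentioned above (a hypothetical helper; it only totals the Rss and Private_Dirty counters, which smaps reports in kB):

# Hypothetical helper: total the Rss and Private_Dirty counters (kB) across
# all mappings of a process, as read from /proc/<pid>/smaps.
def smaps_totals(pid):
    totals = {'Rss': 0, 'Private_Dirty': 0}
    with open('/proc/%d/smaps' % pid) as smaps:
        for line in smaps:
            for field in totals:
                if line.startswith(field + ':'):
                    totals[field] += int(line.split()[1])
    return totals

# Example: totals for the neutron-openvswitch-agent PID on controller-2 above.
# print(smaps_totals(112607))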
The memory_profiler tool does detect the increase, so I am going to try to find the point where the memory grows:

1909    118.9 MiB    0.0 MiB                            'elapsed': elapsed})
1910    118.9 MiB    0.0 MiB             if elapsed < self.polling_interval:
1911    118.9 MiB    0.0 MiB                 time.sleep(self.polling_interval - elapsed)
1912                                     else:
1913                                         LOG.debug("Loop iteration exceeded interval "
1914                                                   "(%(polling_interval)s vs. %(elapsed)s)!",
1915                                                   {'polling_interval': self.polling_interval,
1916                                                    'elapsed': elapsed})
1917    118.9 MiB    0.0 MiB             self.iter_num = self.iter_num + 1

Filename: /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

Line #    Mem usage    Increment   Line Contents
================================================
1861    118.9 MiB    118.9 MiB     @profile
1862                               def check_ovs_status(self):
1863    118.9 MiB      0.0 MiB         try:
1864                                        # Check for the canary flow
1865    118.9 MiB      0.0 MiB             status = self.int_br.check_canary_table()
1866                                    except Exception:
1867                                        LOG.exception("Failure while checking for the canary flow")
1868                                        status = constants.OVS_DEAD
1869    118.9 MiB      0.0 MiB         if status == constants.OVS_RESTARTED:
1870                                        LOG.warning("OVS is restarted. OVSNeutronAgent will reset "
1871                                                    "bridges and recover ports.")
1872    118.9 MiB      0.0 MiB         elif status == constants.OVS_DEAD:
1873                                        LOG.warning("OVS is dead. OVSNeutronAgent will keep running "
1874                                                    "and checking OVS status periodically.")
1875    118.9 MiB      0.0 MiB         return status

Filename: /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

Line #    Mem usage    Increment   Line Contents
================================================
1846    118.9 MiB    118.9 MiB     @profile
1847                               def _agent_has_updates(self, polling_manager):
1848    118.9 MiB      0.0 MiB         return (polling_manager.is_polling_required or
1849    118.9 MiB      0.0 MiB                 self.updated_ports or
1850    118.9 MiB      0.0 MiB                 self.deleted_ports or
1851    118.9 MiB      0.0 MiB                 self.deactivated_bindings or
1852    118.9 MiB      0.0 MiB                 self.activated_bindings or
1853    118.9 MiB      0.0 MiB                 self.sg_agent.firewall_refresh_needed())

Filename: /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

But I don't know if I will be able to do it this way. If I cannot, I will try it with the valgrind tool.

It is a good idea to accelerate the loops :)

--- Additional comment from on 2019-01-29 10:38:29 CET ---

I've also hit this issue, but only with the L3 and openvswitch agents. It helped for me to rebuild the openvswitch python bindings with this commit:

https://github.com/openvswitch/ovs/commit/e120ff1f8e4dbb0b889b26e0be082376a32090bc

(it was committed after the latest openvswitch release)

--- Additional comment from Candido Campos on 2019-01-31 09:09:07 CET ---

With the fix in the openvswitch python library (ovs), the problem is resolved:

[root@controller-0 heat-admin]# top -p 9600 -p 11474 -p 35636
top - 08:04:18 up 15:36,  1 user,  load average: 3.32, 3.47, 3.70
Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
%Cpu(s): 41.0 us, 19.8 sy,  0.0 ni, 37.4 id,  0.1 wa,  0.0 hi,  1.7 si,  0.0 st
KiB Mem : 16266508 total,  6036340 free,  7333036 used,  2897132 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  8330248 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND
 9600 42435     20   0  293436  89424   6392 S  0.7  0.5  7:36.56 neutron-dhcp-ag
11474 42435     20   0  296124  90684   6480 S  0.0  0.6  0:40.40 neutron-l3-agen
35636 42435     20   0  337648 117364   7252 S  0.0  0.7  3:59.12 neutron-openvsw

[root@controller-0 heat-admin]# docker exec -t -i -u root neutron_ovs_agent bash
()[root@controller-0 /]# cat /usr/lib64/python2.7/site-packages/ovs/version.py
# Generated automatically -- do not modify!    -*- buffer-read-only: t -*-
VERSION = "2.11.90"
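A minimal sketch to confirm which python-ovs binding the agent process actually imports (run inside the container; it assumes the rebuilt package reports the 2.11.90 version shown above):

# Print the version of the ovs python bindings visible to the agent's Python,
# matching the manual check of version.py above.
import ovs.version
print(ovs.version.VERSION)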
--- Additional comment from Candido Campos on 2019-02-03 14:20:26 CET ---

The leak is solved with the change in the ovs library:

https://github.com/openvswitch/ovs/commit/e120ff1f8e4dbb0b889b26e0be082376a32090bc

--- Additional comment from Candido Campos on 2019-02-03 14:26:00 CET ---

This bug affects OSP13 and OSP14, so the fix should be ported to OVS 2.9 and 2.10, at least.