Bug 1672553

Summary: [ovs2.9] Memory leak in C Python JSON library
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Timothy Redaelli <tredaelli>
Component: openvswitch Assignee: Timothy Redaelli <tredaelli>
openvswitch sub component: other QA Contact: qding
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: atragler, ctrautma, jhsiao, kfida, pvauter, qding, ralongi
Version: FDP 19.B   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openvswitch-2.9.0-97.el7fdn Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1672551 Environment:
Last Closed: 2019-04-29 09:26:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Timothy Redaelli 2019-02-05 10:26:49 UTC
+++ This bug was initially created as a clone of Bug #1672551 +++

+++ This bug was initially created as a clone of Bug #1667007 +++

Description of problem:

Memory leak in neutron agents:

1 undercloud

3 controllers 

3 computes 

1 public network 

1 private network 

1 router in private network connect to public

4 vms 

1 fip

The memory of the neutron agents is increasing continuously:


 [root@undercloud-0 stack]# top -n1 -b -o %MEM | head -15
top - 03:27:32 up 8 days, 47 min,  2 users,  load average: 0.85, 0.96, 1.00
Tasks: 442 total,   6 running, 388 sleeping,   0 stopped,  48 zombie
%Cpu(s): 38.0 us, 25.4 sy,  0.0 ni, 35.2 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
KiB Mem : 16266508 total,   729112 free, 12441984 used,  3095412 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2323544 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  61429 42435     20   0 1839628   1.5g   5336 S   6.2  9.9  56:24.16 neutron-openvsw
  60913 42435     20   0 1812496   1.5g   5364 S   0.0  9.8 112:17.25 neutron-dhcp-ag
  46424 42434     20   0 2747520 632520   5360 S   6.2  3.9 104:52.61 mysqld
  59411 42418     20   0  753480 508328   3536 S   0.0  3.1  29:16.57 heat-engine
  59413 42418     20   0  748004 503048   3540 S   0.0  3.1  30:42.06 heat-engine
  59410 42418     20   0  636752 391792   3536 S   0.0  2.4  30:03.96 heat-engine
  59409 42418     20   0  611932 366980   3536 S   0.0  2.3  29:38.96 heat-engine
  42350 42439     20   0 2988360 350748   3120 S   0.0  2.2  56:17.05 beam.smp


[root@compute-0 heat-admin]# top -n1 -b -o %MEM | head -15
top - 08:28:57 up 1 day, 22:13,  1 user,  load average: 0.01, 0.08, 0.12
Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.1 us,  3.1 sy,  0.0 ni, 93.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  5944896 total,   711668 free,  1251384 used,  3981844 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4192928 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  29215 42435     20   0  704140 483864   7252 S   0.0  8.1  12:34.94 neutron-openvsw
  29075 42436     20   0 1750364 138488  15480 S   6.2  2.3  16:05.75 nova-compute
  45236 qemu      20   0  775256 126300  15900 S   0.0  2.1  99:01.81 qemu-kvm
  28820 root      20   0  809300  63584   3612 S   0.0  1.1   9:17.30 ceilometer-poll
  28658 root      20   0  362276  62096  10132 S   0.0  1.0  25:48.92 ceilometer-poll
   3342 openvsw+  10 -10  309812  47972  16552 S   0.0  0.8   4:03.48 ovs-vswitchd
  17600 root      20   0 1330368  44276  17036 S   6.2  0.7   5:04.45 dockerd-current
  45041 root      20   0  175848  32380   1564 S   0.0  0.5   0:00.09 privsep-helper


Version-Release number of selected component (if applicable):


 [root@undercloud-0 stack]# cat /etc/rhosp-release 
Red Hat OpenStack Platform release 14.0.0 RC (Rocky)
 [root@undercloud-0 stack]# cat core_puddle_version
2019-01-08.1 [root@undercloud-0 stack]# 

(undercloud) [stack@undercloud-0 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

[heat-admin@controller-1 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)


How reproducible:


Steps to Reproduce:

Deploy the environment defined in the description of the problem.

--- Additional comment from Assaf Muller on 2019-01-17 15:02:42 CET ---

Can you confirm which agents you saw this in? OVS and DHCP agent only?
Can you confirm that the system was idle during the 48h?

Thinking out loud: I'm wondering if, at least in the case of the OVS agent, it's possible to shorten the interval of its continuous loop in order to reproduce this problem much more quickly.

--- Additional comment from Candido Campos on 2019-01-17 15:23:55 CET ---

The agents affected are the l3, dhcp, and openvswitch agents:

[root@controller-2 heat-admin]# top -n1 -b -o %MEM | head -15
top - 08:29:23 up 1 day, 22:11,  1 user,  load average: 1.77, 1.95, 2.05
Tasks: 446 total,   3 running, 443 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.8 us, 15.5 sy,  0.0 ni, 56.3 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
KiB Mem : 16266500 total,   281420 free, 10327728 used,  5657352 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  5093944 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 112607 42435     20   0 1102808 882740   7252 S   0.0  5.4  12:46.51 neutron-openvsw
 111928 42435     20   0 1078696 857544   7336 S   0.0  5.3   4:01.36 neutron-l3-agen
 111624 42435     20   0 1077792 856648   7344 S   5.9  5.3  26:37.14 neutron-dhcp-ag
  63554 42434     20   0 2327228 422840 145040 S   0.0  2.6  20:56.49 mysqld
  56997 42439     20   0 2464800 168780   4240 S   0.0  1.0  73:58.73 beam.smp
 111377 42436     20   0  920592 164768   7636 S   0.0  1.0  10:08.90 httpd
 111378 42436     20   0  847372 164608   7552 S   0.0  1.0   8:34.15 httpd
 112033 42435     20   0  379576 145284   2264 S   0.0  0.9  24:04.93 neutron-server


Yes, I was not executing anything in the test bed. I only deployed the networks and VMs, and afterwards ran the top command to monitor the memory.


We are trying to see where the leak is, but it seems to be in Python.
guppy does not detect the increase in memory:

from guppy import hpy   # guppy heap profiler

h = hpy()

print h.heap()           # reports Python-level objects only


but the memory does show up in the heap of the process if we check /proc/<pid>/smaps (allocated and dirty)
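
This points at native (C-level) allocations, which guppy cannot see. A minimal sketch (my own, not from the report) of how the growth can be tracked is to sum the dirty pages of the [heap] mapping in smaps:

import sys

def heap_private_dirty_kb(pid):
    """Sum the Private_Dirty pages (in kB) of the [heap] mapping of a process."""
    total = 0
    in_heap = False
    with open('/proc/%s/smaps' % pid) as smaps:
        for line in smaps:
            fields = line.split()
            if not fields:
                continue
            if line[0] in '0123456789abcdef':
                # mapping header: "start-end perms offset dev inode [name]"
                in_heap = len(fields) >= 6 and fields[-1] == '[heap]'
            elif in_heap and fields[0] == 'Private_Dirty:':
                total += int(fields[1])
    return total

if __name__ == '__main__':
    print(heap_private_dirty_kb(sys.argv[1]))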


The memory_profiler tool does detect the increase, and I am going to try to pinpoint where it grows:

  1909    118.9 MiB      0.0 MiB                      'elapsed': elapsed})
  1910    118.9 MiB      0.0 MiB           if elapsed < self.polling_interval:
  1911    118.9 MiB      0.0 MiB               time.sleep(self.polling_interval - elapsed)
  1912                                     else:
  1913                                         LOG.debug("Loop iteration exceeded interval "
  1914                                                   "(%(polling_interval)s vs. %(elapsed)s)!",
  1915                                                   {'polling_interval': self.polling_interval,
  1916                                                    'elapsed': elapsed})
  1917    118.9 MiB      0.0 MiB           self.iter_num = self.iter_num + 1


Filename: /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

Line #    Mem usage    Increment   Line Contents
================================================
  1861    118.9 MiB    118.9 MiB       @profile
  1862                                 def check_ovs_status(self):
  1863    118.9 MiB      0.0 MiB           try:
  1864                                         # Check for the canary flow
  1865    118.9 MiB      0.0 MiB               status = self.int_br.check_canary_table()
  1866                                     except Exception:
  1867                                         LOG.exception("Failure while checking for the canary flow")
  1868                                         status = constants.OVS_DEAD
  1869    118.9 MiB      0.0 MiB           if status == constants.OVS_RESTARTED:
  1870                                         LOG.warning("OVS is restarted. OVSNeutronAgent will reset "
  1871                                                     "bridges and recover ports.")
  1872    118.9 MiB      0.0 MiB           elif status == constants.OVS_DEAD:
  1873                                         LOG.warning("OVS is dead. OVSNeutronAgent will keep running "
  1874                                                     "and checking OVS status periodically.")
  1875    118.9 MiB      0.0 MiB           return status


Filename: /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

Line #    Mem usage    Increment   Line Contents
================================================
  1846    118.9 MiB    118.9 MiB       @profile
  1847                                 def _agent_has_updates(self, polling_manager):
  1848    118.9 MiB      0.0 MiB           return (polling_manager.is_polling_required or
  1849    118.9 MiB      0.0 MiB                   self.updated_ports or
  1850    118.9 MiB      0.0 MiB                   self.deleted_ports or
  1851    118.9 MiB      0.0 MiB                   self.deactivated_bindings or
  1852    118.9 MiB      0.0 MiB                   self.activated_bindings or
  1853    118.9 MiB      0.0 MiB                   self.sg_agent.firewall_refresh_needed())


Filename: /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
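
(For reference, a generic memory_profiler driver looks like the sketch below, not the exact invocation used here: decorate the suspect methods with @profile and run the script normally; the line-by-line report above is what it prints. The function name is a stand-in, not the agent's real entry point.)

from memory_profiler import profile   # pip install memory_profiler

@profile
def suspect_loop():
    # stand-in workload; in the agent the decorated methods were e.g.
    # check_ovs_status() and _agent_has_updates() shown above
    data = []
    for i in range(100000):
        data.append({'iteration': i})
    return data

if __name__ == '__main__':
    suspect_loop()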



but I don't know whether I will be able to do it. If I cannot, I will try with the valgrind tool.


It is a good idea to accelerate the loops :)

--- Additional comment from  on 2019-01-29 10:38:29 CET ---

I've also hit this issue, but only with the L3 and openvswitch agents.
Rebuilding the openvswitch Python bindings with this commit fixed it for me: https://github.com/openvswitch/ovs/commit/e120ff1f8e4dbb0b889b26e0be082376a32090bc (it was committed after the latest openvswitch release)
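
A minimal sketch for checking whether a given build of the bindings still leaks (my own, under the assumption that the leak is in the C JSON parser used by the ovs Python library, as the commit above indicates): parse JSON in a tight loop and watch the process RSS.

import resource

import ovs.json   # the OVS Python bindings under test

SAMPLE = '{"method": "echo", "params": [], "id": 0}'

i = 0
while i < 500000:
    ovs.json.from_string(SAMPLE)
    if i % 100000 == 0:
        # ru_maxrss is reported in kB on Linux; with the leaking parser this
        # keeps climbing, with the fixed bindings it stays flat
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    i += 1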

--- Additional comment from Candido Campos on 2019-01-31 09:09:07 CET ---




With the fix applied to the openvswitch Python library (ovs), the problem is resolved:

[root@controller-0 heat-admin]# top -p 9600 -p 11474 -p 35636

top - 08:04:18 up 15:36,  1 user,  load average: 3.32, 3.47, 3.70
Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
%Cpu(s): 41.0 us, 19.8 sy,  0.0 ni, 37.4 id,  0.1 wa,  0.0 hi,  1.7 si,  0.0 st
KiB Mem : 16266508 total,  6036340 free,  7333036 used,  2897132 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  8330248 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                      
   9600 42435     20   0  293436  89424   6392 S   0.7  0.5   7:36.56 neutron-dhcp-ag                                                                                                                              
  11474 42435     20   0  296124  90684   6480 S   0.0  0.6   0:40.40 neutron-l3-agen                                                                                                                              
  35636 42435     20   0  337648 117364   7252 S   0.0  0.7   3:59.12 neutron-openvsw                                                                                                                              


[root@controller-0 heat-admin]# 
[root@controller-0 heat-admin]# 
[root@controller-0 heat-admin]# docker exec -t -i -u root neutron_ovs_agent bash
()[root@controller-0 /]# cat /usr/lib64/python2.7/site-packages/ovs/version.py
# Generated automatically -- do not modify!    -*- buffer-read-only: t -*-
VERSION = "2.11.90"

--- Additional comment from Candido Campos on 2019-02-03 14:20:26 CET ---

The leak is solved with the change in the ovs library:

https://github.com/openvswitch/ovs/commit/e120ff1f8e4dbb0b889b26e0be082376a32090bc

--- Additional comment from Candido Campos on 2019-02-03 14:26:00 CET ---

This bug affects OSP 13 and OSP 14, so the fix should be backported to OVS 2.9 and 2.10, at least.

Comment 3 qding 2019-03-27 09:54:18 UTC
The issue has been verified per https://bugzilla.redhat.com/show_bug.cgi?id=1667007#c5.
The source code was checked in openvswitch-2.9.0-101.el7fdp.

Comment 5 errata-xmlrpc 2019-04-29 09:26:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0898