Bug 1704451

Summary: memory leak in teamd
Product: Red Hat Enterprise Linux 7 Reporter: mcolombo
Component: libteamAssignee: Xin Long <lxin>
Status: CLOSED ERRATA QA Contact: LiLiang <liali>
Severity: high Docs Contact: Marc Muehlfeld <mmuehlfe>
Priority: high    
Version: 7.6CC: jmaxwell, lxin, mleitner, network-qe, ptalbert, rkhan, sbrivio
Target Milestone: rcFlags: liali: needinfo-
Target Release: 7.7   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: libteam-1.29-1.el7 Doc Type: Bug Fix
Doc Text:
.A memory leak in `libteam` has been fixed Previously, the `libteam` library used an incorrect JSON API when a user queried the status of a network team. As a consequence, the `teamdctl <team_device> state` command leaked memory. With this update, the library uses the correct API, and querying the status of a team no longer leaks memory.
Story Points: ---
Clone Of:
: 1767685 (view as bug list) Environment:
Last Closed: 2020-03-31 20:05:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: Red Hat1723958    

Description mcolombo 2019-04-29 19:15:54 UTC
Description of problem:
teamdctl team state appears to consumed memory if run over a period of time. 

Version-Release number of selected component (if applicable):
teamd

How reproducible:
Every time

Steps to Reproduce:
1.repeatedly query "teamdctl ateam state"
2.check ps -auwwx for memory consumption
3.

Actual results:
Customer notes upto 2GB of memory ussage over 24hr. I found memory consumption go from .2% to 1.3% in just under an hour. 

Expected results:
memory usage to state the same. 

Additional info:
Customer provided a script for how they are tracking this. 


# cat team_mem_leak.sh 
#!/bin/bash
  

while true
do
        /usr/bin/teamdctl ateam0 state > /dev/null
        usleep 100000
        /usr/bin/date "+%Y-%m-%dT%H:%M:%S $(ps -auwwx | fgrep /usr/bin/teamd |fgrep -v fgrep)"
done


This does have some impact as the customer is using SolarFlare ptp daemon with a team interface, sfptpd runs 'teamdctl ateam0 state' to determine the state of the team.


here are the results of my test withthe customers script on the mentioned teamd package. 

# head -5 mem_leak.txt | column -t
USER                       PID   %CPU  %MEM  VSZ     RSS    TTY    STAT  START  TIME   COMMAND
2019-04-29T13:58:29  root  716   0.0   0.2     33672  2076   ?     S      13:51  0:00     /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
root                 4323  0.0   0.0   112660  864    ttyS0  R+    13:58  0:00   grep     -F              /usr/bin/teamd
2019-04-29T13:58:29  root  716   0.0   0.2     33672  2076   ?     S      13:51  0:00     /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
2019-04-29T13:58:29  root  716   0.0   0.2     33672  2076   ?     S      13:51  0:00     /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
---------------------snip------------------------
root                 25202  0.0  0.0  112660  860    ttyS0  R  14:40  0:00   grep  -F              /usr/bin/teamd
2019-04-29T14:40:43  root   716  0.1  1.3     45452  13852  ?  S      13:51  0:03  /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
2019-04-29T14:40:44  root   716  0.1  1.3     45452  13852  ?  S      13:51  0:03  /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
2019-04-29T14:40:44  root   716  0.1  1.3     45452  13852  ?  S      13:51  0:03  /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
root                 25223  0.0  0.0  112660  864    ttyS0  R  14:40  0:00   grep  -F              /usr/bin/teamd
[root@firewalld ~]# tail -2 mem_leak.txt | column -t
2019-04-29T14:40:44  root   716  0.1  1.3     45452  13852  ?  S      13:51  0:03  /usr/bin/teamd  -o              -n  -U  -D  -N  -t  ateam0  -c  {"hwaddr":  "DE:19:5C:DB:52:DC",  "link_watch":  {"name":  "ethtool"},  "runner":  {"name":  "activebackup"}}
root                 25223  0.0  0.0  112660  864    ttyS0  R  14:40  0:00   grep  -F              /usr/bin/teamd

Comment 4 Xin Long 2019-06-03 17:49:48 UTC
I could reproduce this issue easily on my local host, but it seems the memleak detector tool, valgrind can't work on teamd.
by reviewing the libteam code, I'm thinking it could be caused by the jasson lib use, need to confirm it later.

Comment 5 Xin Long 2019-06-05 19:31:45 UTC
The leak happened in teamd_config_actual_dump() when calling get_port_obj(), not sure why the later json_decref() couldn't free all its children objs. It seems a jansson lib issue.

Comment 6 Xin Long 2019-06-06 07:47:15 UTC
(In reply to Xin Long from comment #5)
> The leak happened in teamd_config_actual_dump() when calling get_port_obj(),
> not sure why the later json_decref() couldn't free all its children objs. It
> seems a jansson lib issue.

teamd_config_actual_dump shouldn't have used json_object_set() ,which can json_incref the child object causing later json_decref parent object not to be able to free it. This can be verified by:

# cat jmemleak.c
#include <stdio.h>
#include <jansson.h>

int main()
{
	json_t *root, *child, *gchild_1, *gchild_2;

	root = json_object();
	child = json_object();
	gchild_1 = json_object();
	gchild_2 = json_object();

	json_object_set(root, "ports", child);
	json_object_set(child, "eth1", gchild_1);
	json_object_set(child, "eth2", gchild_2);

	json_decref(root);
	return 0;
}

# gcc jmemleak.c -o jmemleak -ljansson

# valgrind --leak-check=full ./jmemleak
==20039== Memcheck, a memory error detector
==20039== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==20039== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==20039== Command: ./jmemleak
==20039==
==20039==
==20039== HEAP SUMMARY:
==20039==     in use at exit: 706 bytes in 8 blocks
==20039==   total heap usage: 11 allocs, 3 frees, 960 bytes allocated
==20039==
==20039== 706 (72 direct, 634 indirect) bytes in 1 blocks are definitely lost in loss record 8 of 8
==20039==    at 0x4C29E63: malloc (vg_replace_malloc.c:309)
==20039==    by 0x4E3EC6A: json_object (value.c:53)
==20039==    by 0x4006BC: main (in /tmp/jmemleak)
==20039==
==20039== LEAK SUMMARY:
==20039==    definitely lost: 72 bytes in 1 blocks
==20039==    indirectly lost: 634 bytes in 7 blocks
==20039==      possibly lost: 0 bytes in 0 blocks
==20039==    still reachable: 0 bytes in 0 blocks
==20039==         suppressed: 0 bytes in 0 blocks
==20039==
==20039== For counts of detected and suppressed errors, rerun with: -v
==20039== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)


Fix in libteam (change to use json_object_set_new which doesn't do json_incref):

diff --git a/teamd/teamd_config.c b/teamd/teamd_config.c
index 69b25de..34fef1f 100644
--- a/teamd/teamd_config.c
+++ b/teamd/teamd_config.c
@@ -84,7 +84,7 @@ static int get_port_obj(json_t **pport_obj, json_t *config_json,
 		ports_obj = json_object();
 		if (!ports_obj)
 			return -ENOMEM;
-		err = json_object_set(config_json, "ports", ports_obj);
+		err = json_object_set_new(config_json, "ports", ports_obj);
 		if (err) {
 			json_decref(ports_obj);
 			return -ENOMEM;
@@ -95,7 +95,7 @@ static int get_port_obj(json_t **pport_obj, json_t *config_json,
 		port_obj = json_object();
 		if (!port_obj)
 			return -ENOMEM;
-		err = json_object_set(ports_obj, port_name, port_obj);
+		err = json_object_set_new(ports_obj, port_name, port_obj);
 		if (err) {
 			json_decref(port_obj);
 			return -ENOMEM;

Comment 14 LiLiang 2019-10-18 10:11:35 UTC
Hi Long,

Why i can't reproduce this bz using below script, can you tell me how to reproduce it?

while true
do
        /usr/bin/teamdctl ateam0 state > /dev/null
        usleep 100000
        /usr/bin/date "+%Y-%m-%dT%H:%M:%S $(ps -auwwx | fgrep /usr/bin/teamd |fgrep -v fgrep)"
done

Comment 15 LiLiang 2019-10-21 09:03:16 UTC
reproduced:
2019-10-21T04:59:18 root     27348  0.7  0.2 133052 67856 ?        S    03:54   0:29 teamd -d -t ateam0 -c {"link_watch" : {"name" : "ethtool"}}
^C
[root@hp-dl388g8-22 ~]# rpm -q libteam
libteam-1.27-9.el7.x86_64


verified:
2019-10-21T04:59:22 root      28164  1.0  0.0  66576  1396 ?        S    03:58   0:38 teamd -d -t ateam0 -c {"link_watch" : {"name" : "ethtool"}}
^C
[root@dell-per740-19 ~]# rpm -q libteam
libteam-1.29-1.el7.x86_64

Comment 16 LiLiang 2019-10-21 09:06:37 UTC
reproducer:
ip link add aeth0 type dummy
teamd -d -t ateam0 -c '{"link_watch" : {"name" : "ethtool"}}'
teamdctl ateam0 port add aeth0

while true; do
	teamdctl ateam0 config dump actual > /dev/null
	date "+%Y-%m-%dT%H:%M:%S $(ps -auwwx | grep teamd |grep -v grep)"
done

Comment 18 errata-xmlrpc 2020-03-31 20:05:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1133