Bug 2125607

Summary:

winbind leaks memory for each NTLM auth request [rhel-7.9.z]

Product:

Red Hat Enterprise Linux 7

Reporter:

Anton Bobrov <abobrov>

Component:

samba

Assignee:

Andreas Schneider <asn>

Status:

CLOSED MIGRATED

QA Contact:

Denis Karpelevich <dkarpele>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

7.9

CC:

aboscatt, asn, dkarpele, gdeschner, jhuo, pfilipen

Target Milestone:

Keywords:

MigratedToJIRA, Triaged, ZStream

Target Release:

---

Flags:

abobrov: needinfo+

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2023-09-05 13:03:51 UTC

Type:

Bug

Attachments:

Description	Flags
valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all --num-callers=50	none

Description Anton Bobrov 2022-09-09 12:49:48 UTC

Created attachment 1910673 [details]
valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all --num-callers=50

Description of problem:

The customer has reported very slow winbind memory growth that accumulates over a long period of time and, as a result, requires periodic service restarts, which is inconvenient and unpredictable.

The collected valgrind leak report is attached. It looks like it's leaking indirectly via LDAP handles and the various small allocations associated with them in libldap and its underlying dependencies.

There are quite a few things in the valgrind leak report, but it appears that new LDAP handles are being continuously created, like so:

[ ..... ]
==23510==    by 0x13571D40: ldap_int_open_connection (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0x135850CC: ldap_new_connection (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0x135711DE: ldap_open_defconn (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0x135863D7: ldap_send_initial_request (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0x1357B418: ldap_sasl_bind (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0x1357B848: ldap_sasl_bind_s (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0x1357C0E4: ldap_simple_bind_s (in /usr/lib64/libldap-2.4.so.2.10.7)
==23510==    by 0xA8D574F: ??? (in /usr/lib64/libsmbldap.so.2)
==23510==    by 0xA8D66A4: ??? (in /usr/lib64/libsmbldap.so.2)
==23510==    by 0xA8D6D4A: smbldap_search (in /usr/lib64/libsmbldap.so.2)
==23510==    by 0xA8D6D96: smbldap_search_suffix (in /usr/lib64/libsmbldap.so.2)
==23510==    by 0x2043FAC6: smbldap_search_domain_info (in /usr/lib64/samba/libsmbldaphelper-samba4.so)
==23510==    by 0x20223949: pdb_ldapsam_init_common (in /usr/lib64/samba/pdb/ldapsam.so)
==23510==    by 0x6A278A8: make_pdb_method_name (in /usr/lib64/libsamba-passdb.so.0.27.2)
==23510==    by 0x6A27BA3: ??? (in /usr/lib64/libsamba-passdb.so.0.27.2)
==23510==    by 0x6A29CB8: initialize_password_db (in /usr/lib64/libsamba-passdb.so.0.27.2)
==23510==    by 0x12EE8B: main (in /usr/sbin/winbindd)
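
A report of this size is easier to read once the loss records are tallied by the functions that appear in their stacks. The following is only a rough sketch of that idea, not a tool used in this bug: the function name `bytes_by_function` is made up for illustration, and it assumes the usual memcheck layout of a "... bytes in N blocks are ... lost in loss record ..." header followed by `at`/`by` frames.

```python
import re
from collections import Counter

# Header of one memcheck loss record, e.g.
# "==1== 1,000 (40 direct, 960 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 2"
HEADER = re.compile(
    r"^==\d+==\s+([\d,]+) (?:\([^)]*\) )?bytes in [\d,]+ blocks are "
    r"(definitely lost|indirectly lost|possibly lost|still reachable) in loss record"
)
# One stack frame, e.g. "==1==    by 0x1357B418: ldap_sasl_bind (in /usr/lib64/...)"
FRAME = re.compile(r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")


def bytes_by_function(lines):
    """Sum the bytes of every loss record whose stack contains a function.

    Each record is counted at most once per function name, so recursive
    frames do not inflate the total.
    """
    totals = Counter()
    size = None      # byte count of the record currently being read
    seen = set()     # function names already credited for this record
    for line in lines:
        header = HEADER.match(line)
        if header:
            size = int(header.group(1).replace(",", ""))
            seen = set()
            continue
        frame = FRAME.match(line)
        if frame and size is not None:
            name = frame.group(1)
            if name not in seen:
                seen.add(name)
                totals[name] += size
    return totals
```

Calling `bytes_by_function(open(path))` on the attached log and printing `most_common(20)` would show whether frames such as `ldap_sasl_bind` or `pdb_ldapsam_init_common` account for a large share of the reported bytes, or only for a one-time allocation at startup.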

In the winbind code, the original intent appears to have been to cache the LDAP handle and its associated connection, and to free it only on LDAP_SERVER_DOWN or some other reconnect condition. However, it looks like (I'm not familiar with the related code at all) a new handle is created every time via the

pdb_ldapsam_init_common()/pdb_init_ldapsam_common() path

and nothing ever calls the libldap ldap_unbind() API (which is the libldap way to discard an LDAP handle and free all resources associated with it) unless there is an error condition, e.g. a connection problem.
    


Comment 3 Andreas Schneider 2022-09-12 16:21:35 UTC

Samba uses talloc [1], a hierarchical, reference-counted memory pool system with destructors.


pdb_ldapsam_init_common()
  -> pdb_init_ldapsam_common()
     -> smbldap_init()
        -> talloc_set_destructor(*smbldap_state, smbldap_state_destructor);

The memory context of smbldap is the `pdb_method` pointer passed to pdb_ldapsam_init_common(). So when TALLOC_FREE(pdb_method) is called, smbldap_state_destructor will be called, which will do the ldap_unbind().

The smbldap state is stored in the private_data field of the pdb method, so it will be reused as long as the pdb_method exists. It only exists once, as the module is registered only once. I do not see a memory leak on the Samba side here.


[1] https://talloc.samba.org/talloc/doc/html/index.html
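
The ownership argument above can be illustrated with a small toy model. This is not talloc and not Samba code: `Ctx` is a hypothetical stand-in for a talloc context, written in Python only to show how freeing a parent reaches a child's destructor. The names `pdb_method`, `smbldap_state` and `ldap_unbind` are reused from the comment purely as labels.

```python
class Ctx:
    """Toy stand-in for a hierarchical memory context with a destructor."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.destructor = None
        if parent is not None:
            parent.children.append(self)

    def free(self):
        # Run this context's destructor, then free every child,
        # then detach from the parent.
        if self.destructor is not None:
            self.destructor(self)
        for child in list(self.children):
            child.free()
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None


def demo():
    events = []
    pdb_method = Ctx("pdb_method")
    # The LDAP state hangs off pdb_method, as described in the comment.
    state = Ctx("smbldap_state", parent=pdb_method)
    state.destructor = lambda ctx: events.append("ldap_unbind for " + ctx.name)
    # Freeing the parent is enough; nobody frees the state explicitly.
    pdb_method.free()
    return events
```

In this model `demo()` records exactly one `ldap_unbind` event, triggered by freeing the parent. The practical consequence for the bug is that the handle would only look "leaked" to valgrind if `pdb_method` itself is never freed before exit.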



Looking at the valgrind log, the pdb memcache could be a problem.

Comment 4 Anton Bobrov 2022-09-13 13:08:28 UTC

OK, like I said, I have no idea about that code; it just looks like it has accumulated some baggage via the LDAP handle there. If it is leaking elsewhere, it would make sense to change this bug's summary line once you confirm the real root cause.

Comment 5 Andreas Schneider 2022-10-24 15:06:35 UTC

Could you ask the customer if he is willing to test a package with a possible fix?

Comment 27 Andreas Schneider 2023-04-13 12:44:00 UTC

That means we have additional memory leaks.

To find memory leaks, we would need to run AddressSanitizer with the memory leak detector turned on. However, there are several small leaks which prevent it from even starting up. I would need to address them first.

This will be a bigger task. What the customer can do is run the test binaries with valgrind; maybe it will catch something. However, this is normally not fun, as valgrind slows things down a lot.

Comment 28 Andreas Schneider 2023-04-14 13:45:21 UTC

Well, it would be nice if we could find out what is causing it.

Is it when:

* Users authenticate with NTLM
* Users authenticate with Kerberos
* We query user information

The hotfix fixes a memory leak; however, there might be more.

Comment 30 Andreas Schneider 2023-04-18 11:54:03 UTC

The logs would be too big to digest. The question is which workload increases the memory usage. If we knew that, we would know in which area of the code to look. It is hard to find this leak, as we do not have a clean shutdown with all memory freed. This is a goal for one of the next Samba releases. However, we are short on manpower, so removing all memory leaks will take some time.

I will have another look at the valgrind logs to see if I can spot anything suspicious.

What you can ask the customer is whether he knows which workload makes the memory leak grow faster.
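
One low-overhead way to answer the workload question is to sample the resident set size of the winbindd processes over time and compare the growth rate across periods with different activity (NTLM logins, Kerberos logins, user lookups). The sketch below is only a suggestion, not something that was run for this bug. It assumes a Linux host where `/proc/<pid>/status` contains a `VmRSS` line, and all function names are illustrative.

```python
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_kib(fh.read())


def growth_kib_per_hour(samples):
    """Average growth between the first and last sample.

    samples is a list of (unix_seconds, rss_kib) pairs with at least two
    entries that span a non-zero amount of time.
    """
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    return (rss1 - rss0) * 3600.0 / (t1 - t0)
```

Collecting `(time.time(), read_rss_kib(pid))` every few minutes for each winbindd PID, and labelling the periods by the dominant workload, would give a per-workload growth rate that can be compared directly.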

Comment 35 RHEL Program Management 2023-09-05 12:11:48 UTC

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 36 RHEL Program Management 2023-09-05 13:03:51 UTC

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues.