Bug 1703436

Summary: sssd not thread-safe in innetgr()
Product: Red Hat Enterprise Linux 8 Reporter: Ulrich Sibiller <uli42>
Component: sssdAssignee: Sumit Bose <sbose>
Status: CLOSED ERRATA QA Contact: Jakub Vavra <jvavra>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 8.1CC: apeddire, arajendr, ashankar, atikhono, codonell, dj, dlavu, fweimer, grajaiya, jhrozek, knweiss, lslebodn, mnewsome, mzidek, pbrezina, pfrankli, sbose, sssd-maint, thalman, tscherf
Target Milestone: rcKeywords: Triaged
Target Release: 8.0Flags: pm-rhel: mirror+
Hardware: x86_64   
OS: Linux   
Whiteboard: sync-to-jira
Fixed In Version: sssd-2.5.0-1.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-09 19:46:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
example source code none

Description Ulrich Sibiller 2019-04-26 12:15:00 UTC
Created attachment 1559039 [details]
example source code

Description of problem:

"man 3 innetgr" states "The function is thread-safe." On our system it normally is but fails to be thread-safe if sssd is activated.

We hit this bug while debugging a GPFS (Spectrum Scale) issue. IBM support came up with the attached code example that triggers the problem without gpfs being involved.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Adapt "groups" and "hosts" arrays in the two attached source code files. The code will check if the machine in the hosts array is member of the netgroup mentioned in the  groups array. The host must be a member in the respective netgroup for this test.
2. compile the first example: gcc -lpthread -o netg-lock netg-lock.c
3. compile the second example: gcc -lpthread -o netg-lock2 netg-lock2.c
4. run netg-lock. It will call innetgr() with 256 threads, protected by pthread_mutex_lock/unlock(). This run will print out statistics that report every single check having been successful.
5. run netg-lock2. It will call innetgr() with 256 threads, _not_ protected by pthread_mutex_lock/unlock(). This run will print out statistics that report many failing requests.
 


Actual results:

netg-lock is successful, netg-lock2 is not

Expected results:

both examples should be successful 

Additional info:
Our netgroups configuration is quite big and nested. We have 66 netgroups altogether that have 19037 host entries.

The bug is triggered for a host that is member of a nested group.

$ ldapsearch -x -b <searchbase> ObjectClass=nisNetgroup dn | grep dn: | wc -l
66

$ ldapsearch -x -b dc=rd,dc=corpintra,dc=net ObjectClass=nisNetgroup | grep nisNetgroupTriple: | wc -l
19037

Comment 2 Jakub Hrozek 2019-04-26 13:28:10 UTC
Thank you for the bug report. 

We're going to look at the bug, but because there are quite a few bug reports and cases against SSSD, I wanted to ask if you have a RH subscription? If yes, it would help us to prioritize the work better, if you could also open a RH support case.. 

Coming to the technical aspect of the problem -- after a quick discussion with a colleague, currently I'm not sure what could be the issue. innetgr() is implemented in libc completely as a wrapper around setnetgrent/getnetgrent/endnetgrent (https://sourceware.org/git/?p=glibc.git;a=blob;f=inet/getnetgrent_r.c;hb=1c81d55fc4b07b51adf68558ba74ce975153e580#l451) so nss_sss does not implement innetgr on its own. So either one of setnetgrent/getnetgrent/endnetgrent  is thread-unsafe or libc itself (unlikely?). And sssd deamon is single threaded, using an event loop for pseudo-concurrency. So on the technical side, this needs more investigation, but as I said, having a support case would help prioritize the investigation better.

Comment 3 Sumit Bose 2019-04-29 10:08:45 UTC
(In reply to Ulrich Sibiller from comment #0)
> Created attachment 1559039 [details]
> example source code
> 
> Description of problem:
> 
> "man 3 innetgr" states "The function is thread-safe." On our system it
> normally is but fails to be thread-safe if sssd is activated.
> 

I agree this sentence is in the man page nut there is the table with lists innetgr() as "MT-Unsafe race:netgrent locale" and the sentence "In the above table, netgrent in race:netgrent signifies that if any of the functions setnetgrent(), getnetgrent_r(), innetgr(), getnetgrent(), or endnetgrent() are used in parallel in different threads of a program, then data races could occur."

Additionally the glibc documentation (info libc) only says "MT-Unsafe race:netgrent locale".

I switch the component for this ticket to 'glibc' to ask the maintainers for clarification if innetgr() is expected to the thread-safe for the described use case.

bye,
Sumit

Comment 4 Florian Weimer 2019-04-29 11:17:47 UTC
I think the code backing innetgr in glibc and sssd should be thread-safe as-is.  Unlike many other NSS functions, local state is used for it, in the result buffer.  nss_sss also uses it to keep local state across _nss_sss_setnetgrent and internal_getnetgrent_r.  See the handling of “the struct __netgrent *result” argument.

I believe the race:netgrent annotation for innetgr is a bug.  The glibc manual even lists thread safety as a benefit of innetgr:

     This function tests whether the triple specified by the parameters
     HOST, USER, and DOMAIN is part of the netgroup NETGROUP.  Using
     this function has the advantage that

       1. no other netgroup function can use the global netgroup state
          since internal locking is used and
       2. the function is implemented more efficiently than successive
          calls to the other ‘set’/‘get’/‘endnetgrent’ functions.

I think we need to find out why innetgr is not thread-safe with sssd and then re-evaluate whether we should fix this issue, and update the glibc manual accordingly.  I would appreciate if you could help with that since I'm not familiar with sssd internals.  Thanks.

Comment 5 Sumit Bose 2019-04-29 12:34:21 UTC
Thank you Florian for the details.

I will move the ticket back to the SSSD component.

I guess the issue might be caused by the way concurrent netgroup requests are handled by the SSSD NSS responder process. Can you explain 

"""

       1. no other netgroup function can use the global netgroup state
          since internal locking is used and
       2. the function is implemented more efficiently than successive
          calls to the other ‘set’/‘get’/‘endnetgrent’ functions.
""""
with some more details? I wonder if nss_sss or SSSD's NSS responder can detect somehow that the original call is innetgr() and not a sequence of ‘set’/‘get’/‘endnetgrent’ functions. But after a short look at the code in inet/getnetgrent_r.c it looks to me that innetgr() is still implemented as a sequence of set/multiple gets/end calls and I also didn't see any special locking for innetgr() only for the individual ‘set’/‘get’/‘endnetgrent’ functions.

bye,
Sumit

Comment 6 Florian Weimer 2019-04-29 12:41:49 UTC
I think you can detect the call sequence using the struct __netgrent * argument.  See the nss/nss_files/files-netgrp.c code in glibc.

Comment 7 Sumit Bose 2019-04-29 16:09:47 UTC
(In reply to Florian Weimer from comment #6)
> I think you can detect the call sequence using the struct __netgrent *
> argument.  See the nss/nss_files/files-netgrp.c code in glibc.

Ok, I see. How should NSS modules outside of the glibc source tree use struct __netgrent, since it is not defined in a public header. Currently SSSD and e.g. nss-pam-ldapd are using a copy of the struct definition in their own source tree. But this might be risky, since glibc is free to change in internals of the struct at will (see e.g. https://pagure.io/SSSD/sssd/issue/3993 for a similar case).

Comment 8 Florian Weimer 2019-05-06 11:07:16 UTC
I will try to document this better on the glibc side. I don't think there is a way around defining struct __netgrent in a public header file.

The sssd sources already contain a copy of the definition of struct __netgrent because it is impossible to implement netgroup support without it. (The first few members are used for passing data to the glibc NSS framework.) I think you can reuse the data member to store a pointer to your own internal data.  You must set the data member to NULL in the endnetgrent function and free the resources there.

Comment 9 Sumit Bose 2019-05-06 11:26:48 UTC
(In reply to Florian Weimer from comment #8)
> I will try to document this better on the glibc side. I don't think there is
> a way around defining struct __netgrent in a public header file.
> 
> The sssd sources already contain a copy of the definition of struct
> __netgrent because it is impossible to implement netgroup support without
> it. (The first few members are used for passing data to the glibc NSS
> framework.) I think you can reuse the data member to store a pointer to your
> own internal data.  You must set the data member to NULL in the endnetgrent
> function and free the resources there.

Hi Florian,

thank you for taking care of this in glibc.

I will check from the SSSD side how we can use the data member from __netgrent to make netgroup calls thread safe.

bye,
Sumit

Comment 17 Sumit Bose 2021-03-18 07:52:21 UTC
Upstream ticket:
https://github.com/SSSD/sssd/issues/5540

Comment 18 Pavel Březina 2021-04-21 10:02:23 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/5541

* `master`
    * 29abf94e3aec2c54c76adf61318099097c41ea77 - intg test: test is innetgr() is thread-safe
    * 88eec1c22f188adc49f41b81fe5af03c995c690d - nss client: make innetgr() thread safe

Comment 25 errata-xmlrpc 2021-11-09 19:46:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4435