1703436 – sssd not thread-safe in innetgr()

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1703436 - sssd not thread-safe in innetgr()

Summary: sssd not thread-safe in innetgr()

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	8.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	8.0
Assignee:	Sumit Bose
QA Contact:	Jakub Vavra
Docs Contact:
URL:
Whiteboard:	sync-to-jira
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-04-26 12:15 UTC by Ulrich Sibiller
Modified:	2024-12-20 18:50 UTC (History)
CC List:	20 users (show)
Fixed In Version:	sssd-2.5.0-1.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-11-09 19:46:33 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Attachments	(Terms of Use)
example source code (757 bytes, application/gzip) 2019-04-26 12:15 UTC, Ulrich Sibiller	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	SSSD sssd issues 5540	None	open	sssd not thread-safe in innetgr()	2021-03-29 14:21:58 UTC
Red Hat Product Errata	RHBA-2021:4435	None	None	None	2021-11-09 19:46:53 UTC
Sourceware	16181	P2	NEW	Threade-safe implementation of innetgr and definition of struct __netgrent	2020-12-07 14:19:33 UTC

Description Ulrich Sibiller 2019-04-26 12:15:00 UTC

Created attachment 1559039 [details]
example source code

Description of problem:

"man 3 innetgr" states "The function is thread-safe." On our system it normally is but fails to be thread-safe if sssd is activated.

We hit this bug while debugging a GPFS (Spectrum Scale) issue. IBM support came up with the attached code example that triggers the problem without gpfs being involved.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Adapt "groups" and "hosts" arrays in the two attached source code files. The code will check if the machine in the hosts array is member of the netgroup mentioned in the  groups array. The host must be a member in the respective netgroup for this test.
2. compile the first example: gcc -lpthread -o netg-lock netg-lock.c
3. compile the second example: gcc -lpthread -o netg-lock2 netg-lock2.c
4. run netg-lock. It will call innetgr() with 256 threads, protected by pthread_mutex_lock/unlock(). This run will print out statistics that report every single check having been successful.
5. run netg-lock2. It will call innetgr() with 256 threads, _not_ protected by pthread_mutex_lock/unlock(). This run will print out statistics that report many failing requests.
 


Actual results:

netg-lock is successful, netg-lock2 is not

Expected results:

both examples should be successful 

Additional info:
Our netgroups configuration is quite big and nested. We have 66 netgroups altogether that have 19037 host entries.

The bug is triggered for a host that is member of a nested group.

$ ldapsearch -x -b <searchbase> ObjectClass=nisNetgroup dn | grep dn: | wc -l
66

$ ldapsearch -x -b dc=rd,dc=corpintra,dc=net ObjectClass=nisNetgroup | grep nisNetgroupTriple: | wc -l
19037

Comment 2 Jakub Hrozek 2019-04-26 13:28:10 UTC

Thank you for the bug report. 

We're going to look at the bug, but because there are quite a few bug reports and cases against SSSD, I wanted to ask if you have a RH subscription? If yes, it would help us to prioritize the work better, if you could also open a RH support case.. 

Coming to the technical aspect of the problem -- after a quick discussion with a colleague, currently I'm not sure what could be the issue. innetgr() is implemented in libc completely as a wrapper around setnetgrent/getnetgrent/endnetgrent (https://sourceware.org/git/?p=glibc.git;a=blob;f=inet/getnetgrent_r.c;hb=1c81d55fc4b07b51adf68558ba74ce975153e580#l451) so nss_sss does not implement innetgr on its own. So either one of setnetgrent/getnetgrent/endnetgrent  is thread-unsafe or libc itself (unlikely?). And sssd deamon is single threaded, using an event loop for pseudo-concurrency. So on the technical side, this needs more investigation, but as I said, having a support case would help prioritize the investigation better.

Comment 3 Sumit Bose 2019-04-29 10:08:45 UTC

(In reply to Ulrich Sibiller from comment #0)
> Created attachment 1559039 [details]
> example source code
> 
> Description of problem:
> 
> "man 3 innetgr" states "The function is thread-safe." On our system it
> normally is but fails to be thread-safe if sssd is activated.
> 

I agree this sentence is in the man page nut there is the table with lists innetgr() as "MT-Unsafe race:netgrent locale" and the sentence "In the above table, netgrent in race:netgrent signifies that if any of the functions setnetgrent(), getnetgrent_r(), innetgr(), getnetgrent(), or endnetgrent() are used in parallel in different threads of a program, then data races could occur."

Additionally the glibc documentation (info libc) only says "MT-Unsafe race:netgrent locale".

I switch the component for this ticket to 'glibc' to ask the maintainers for clarification if innetgr() is expected to the thread-safe for the described use case.

bye,
Sumit

Comment 4 Florian Weimer 2019-04-29 11:17:47 UTC

I think the code backing innetgr in glibc and sssd should be thread-safe as-is.  Unlike many other NSS functions, local state is used for it, in the result buffer.  nss_sss also uses it to keep local state across _nss_sss_setnetgrent and internal_getnetgrent_r.  See the handling of “the struct __netgrent *result” argument.

I believe the race:netgrent annotation for innetgr is a bug.  The glibc manual even lists thread safety as a benefit of innetgr:

     This function tests whether the triple specified by the parameters
     HOST, USER, and DOMAIN is part of the netgroup NETGROUP.  Using
     this function has the advantage that

       1. no other netgroup function can use the global netgroup state
          since internal locking is used and
       2. the function is implemented more efficiently than successive
          calls to the other ‘set’/‘get’/‘endnetgrent’ functions.

I think we need to find out why innetgr is not thread-safe with sssd and then re-evaluate whether we should fix this issue, and update the glibc manual accordingly.  I would appreciate if you could help with that since I'm not familiar with sssd internals.  Thanks.

Comment 5 Sumit Bose 2019-04-29 12:34:21 UTC

Thank you Florian for the details.

I will move the ticket back to the SSSD component.

I guess the issue might be caused by the way concurrent netgroup requests are handled by the SSSD NSS responder process. Can you explain 

"""

       1. no other netgroup function can use the global netgroup state
          since internal locking is used and
       2. the function is implemented more efficiently than successive
          calls to the other ‘set’/‘get’/‘endnetgrent’ functions.
""""
with some more details? I wonder if nss_sss or SSSD's NSS responder can detect somehow that the original call is innetgr() and not a sequence of ‘set’/‘get’/‘endnetgrent’ functions. But after a short look at the code in inet/getnetgrent_r.c it looks to me that innetgr() is still implemented as a sequence of set/multiple gets/end calls and I also didn't see any special locking for innetgr() only for the individual ‘set’/‘get’/‘endnetgrent’ functions.

bye,
Sumit

Comment 6 Florian Weimer 2019-04-29 12:41:49 UTC

I think you can detect the call sequence using the struct __netgrent * argument.  See the nss/nss_files/files-netgrp.c code in glibc.

Comment 7 Sumit Bose 2019-04-29 16:09:47 UTC

(In reply to Florian Weimer from comment #6)
> I think you can detect the call sequence using the struct __netgrent *
> argument.  See the nss/nss_files/files-netgrp.c code in glibc.

Ok, I see. How should NSS modules outside of the glibc source tree use struct __netgrent, since it is not defined in a public header. Currently SSSD and e.g. nss-pam-ldapd are using a copy of the struct definition in their own source tree. But this might be risky, since glibc is free to change in internals of the struct at will (see e.g. https://pagure.io/SSSD/sssd/issue/3993 for a similar case).

Comment 8 Florian Weimer 2019-05-06 11:07:16 UTC

I will try to document this better on the glibc side. I don't think there is a way around defining struct __netgrent in a public header file.

The sssd sources already contain a copy of the definition of struct __netgrent because it is impossible to implement netgroup support without it. (The first few members are used for passing data to the glibc NSS framework.) I think you can reuse the data member to store a pointer to your own internal data.  You must set the data member to NULL in the endnetgrent function and free the resources there.

Comment 9 Sumit Bose 2019-05-06 11:26:48 UTC

(In reply to Florian Weimer from comment #8)
> I will try to document this better on the glibc side. I don't think there is
> a way around defining struct __netgrent in a public header file.
> 
> The sssd sources already contain a copy of the definition of struct
> __netgrent because it is impossible to implement netgroup support without
> it. (The first few members are used for passing data to the glibc NSS
> framework.) I think you can reuse the data member to store a pointer to your
> own internal data.  You must set the data member to NULL in the endnetgrent
> function and free the resources there.

Hi Florian,

thank you for taking care of this in glibc.

I will check from the SSSD side how we can use the data member from __netgrent to make netgroup calls thread safe.

bye,
Sumit

Comment 17 Sumit Bose 2021-03-18 07:52:21 UTC

Upstream ticket:
https://github.com/SSSD/sssd/issues/5540

Comment 18 Pavel Březina 2021-04-21 10:02:23 UTC

Pushed PR: https://github.com/SSSD/sssd/pull/5541

* `master`
    * 29abf94e3aec2c54c76adf61318099097c41ea77 - intg test: test is innetgr() is thread-safe
    * 88eec1c22f188adc49f41b81fe5af03c995c690d - nss client: make innetgr() thread safe

Comment 25 errata-xmlrpc 2021-11-09 19:46:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4435

Note You need to log in before you can comment on or make changes to this bug.

apeddire
arajendr
ashankar
atikhono
codonell
dj
dlavu
fweimer
grajaiya
jhrozek
knweiss
lslebodn
mnewsome
mzidek
pbrezina
pfrankli
sbose
sssd-maint
thalman
tscherf