Bug 1502686
| Summary: | crash - /usr/libexec/sssd/sssd_nss in nss_setnetgrent_timeout | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | aheverle |
| Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> |
| Status: | CLOSED ERRATA | QA Contact: | Madhuri <mupadhye> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.4 | CC: | amitkuma, apeddire, atolani, atripath, enewland, fidencio, gparente, grajaiya, jhrozek, jowright, jpriddy, kludhwan, knweiss, lmanasko, lslebodn, mbliss, minyu, mkosek, mzidek, nsoman, pbrezina, raines, rbdiri, sbose, sgoveas, sssd-maint, tscherf |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | sssd-1.16.0-5.el7 | Doc Type: | Bug Fix |
| Doc Text: | Previously, the *Name Service Switch* (NSS) responder's code used a faulty memory hierarchy for keeping the in-memory representation of a netgroup. Consequently, if the in-memory representation of a netgroup had expired and the netgroup was requested, the "sssd_nss" process sometimes terminated unexpectedly. With this update, the memory hierarchy has been corrected. As a result, the crash no longer occurs when a netgroup whose in-memory representation has expired is requested. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1625213 (view as bug list) | Environment: | |
| Last Closed: | 2018-04-10 17:18:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1625213 | | |
Description (aheverle, 2017-10-16 12:34:47 UTC)
backtrace:

(gdb) bt
#0  0x00007fab82f111f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fab82f128e8 in __GI_abort () at abort.c:90
#2  0x00007fab836aacfc in talloc_abort (reason=0x7fab836b3818 "Bad talloc magic value - unknown value") at ../talloc.c:426
#3  0x00007fab836ab05d in talloc_abort_unknown_value () at ../talloc.c:444
#4  talloc_chunk_from_ptr (ptr=0x55ef28d0ce40) at ../talloc.c:463
#5  __talloc_get_name (ptr=0x55ef28d0ce40) at ../talloc.c:1486
#6  talloc_check_name (ptr=ptr@entry=0x55ef28d0ce40, name=name@entry=0x55ef26e7bc0a "struct nss_enum_ctx") at ../talloc.c:1509
#7  0x000055ef26e60ec7 in nss_setnetgrent_timeout (ev=<optimized out>, te=<optimized out>, current_time=..., pvt=0x55ef28d0ce40) at src/responder/nss/nss_enum.c:270
#8  0x00007fab838c0c97 in tevent_common_loop_timer_delay (ev=0x55ef28cb3a30) at ../tevent_timed.c:369
#9  0x00007fab838c1f49 in epoll_event_loop (tvalp=0x7ffe5f6be230, epoll_ev=0x55ef28cb3cb0) at ../tevent_epoll.c:659
#10 epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:930
#11 0x00007fab838c02a7 in std_event_loop_once (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:114
#12 0x00007fab838bc0cd in _tevent_loop_once (ev=ev@entry=0x55ef28cb3a30, location=location@entry=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:721
#13 0x00007fab838bc2fb in tevent_common_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:844
#14 0x00007fab838c0247 in std_event_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:145
#15 0x00007fab8742eb33 in server_loop (main_ctx=0x55ef28cb4ec0) at src/util/server.c:718
#16 0x000055ef26e5e04d in main (argc=6, argv=<optimized out>) at src/responder/nss/nsssrv.c:560

And it looks like a similar crash in RHEL 6: https://bugzilla.redhat.com/show_bug.cgi?id=1478525#c5

So our assumption that the cache_req refactoring fixed the crash was wrong.

Question for other developers: do we want to track it as part of https://pagure.io/SSSD/sssd/issue/3523, or would it be better to have a separate ticket for it?

(In reply to Lukas Slebodnik from comment #5)
> Question for other developers: do we want to track it as part of
> https://pagure.io/SSSD/sssd/issue/3523, or would it be better to have a
> separate ticket for it?

I agree that this looks very similar and I would link this to https://pagure.io/SSSD/sssd/issue/3523 as well.

Upstream ticket: https://pagure.io/SSSD/sssd/issue/3523

* master: f6a1cef87abdd983d6b5349cd341c9a249826577
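To make the "faulty memory hierarchy" described in the Doc Text concrete, here is a minimal sketch of the pattern behind frames #2-#7 above. This is illustrative code only, not SSSD source; the struct and variable names are invented. It shows a tevent timer whose private data is owned by a shorter-lived talloc parent, so when that parent is freed the callback later runs on a dangling pointer and talloc aborts with "Bad talloc magic value".

```c
/* Illustrative only, not SSSD code.
 * Build (assuming pkg-config files for talloc/tevent are installed):
 *   gcc demo.c $(pkg-config --cflags --libs talloc tevent) */
#include <sys/time.h>
#include <talloc.h>
#include <tevent.h>

struct enum_state {              /* stand-in for the real nss_enum_ctx */
    int unused;
};

static void timeout_handler(struct tevent_context *ev,
                            struct tevent_timer *te,
                            struct timeval current_time,
                            void *pvt)
{
    /* With a dangling 'pvt', this typically aborts with
     * "Bad talloc magic value", as in nss_setnetgrent_timeout(). */
    struct enum_state *state = talloc_get_type_abort(pvt, struct enum_state);
    (void)ev; (void)te; (void)current_time; (void)state;
}

int main(void)
{
    TALLOC_CTX *long_lived = talloc_new(NULL);
    struct tevent_context *ev = tevent_context_init(long_lived);

    /* Faulty hierarchy: the callback's private data hangs off a parent
     * that goes away when the cached entry expires. */
    TALLOC_CTX *expiring = talloc_new(long_lived);
    struct enum_state *state = talloc_zero(expiring, struct enum_state);

    tevent_add_timer(ev, long_lived, tevent_timeval_current_ofs(1, 0),
                     timeout_handler, state);

    talloc_free(expiring);   /* simulated expiry: 'state' is freed ...  */
    tevent_loop_once(ev);    /* ... but the timer still fires with it   */

    /* The fix described in the Doc Text amounts to giving the private
     * data a parent that lives at least as long as the timer referencing
     * it (or removing the timer when the entry expires). */
    talloc_free(long_lived);
    return 0;
}
```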
Verified with sssd-1.16.0-9.el7.x86_64.
Verification steps:
1. Configure an LDAP server with one instance
2. Configure the sssd client with a typo in the LDAP server address in the default configuration file
3. Remove the cache and the sssd logs
# rm -f /var/log/sssd/* /var/lib/sss/db/*
4. # service sssd restart
From the sssd domain log messages, sssd is offline:
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [dp_get_options] (0x0400): Option ldap_offline_timeout has value 60
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Going offline!
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Initialize check_if_online_ptask.
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline
# service sssd status
Redirecting to /bin/systemctl status sssd.service
● sssd.service - System Security Services Daemon
Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2017-12-05 03:46:16 EST; 22min ago
Main PID: 24025 (sssd)
CGroup: /system.slice/sssd.service
├─24025 /usr/sbin/sssd -i --logger=files
├─24026 /usr/libexec/sssd/sssd_be --domain LDAP --uid 0 --gid 0 --logger=files
├─24027 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
└─24028 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099138 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099168 2017) [sssd[pam]] [l...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099185 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...460
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...af0
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com systemd[1]: Started System Security Services Daemon.
Dec 05 03:46:17 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[be[LDAP]][24026]: Backend is offline
5. Request the netgroup (a minimal C equivalent of this lookup is sketched after the verification output below) and confirm the sssd processes are still running
# getent netgroup -s sss netgroup_user; sleep 16; pgrep -lf sssd
24076 sssd
24077 sssd_be
24078 sssd_nss
24079 sssd_pam
# cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
domains = LDAP
services = nss, pam
[domain/LDAP]
ldap_search_base = dc=example,dc=com
debug_level = 9
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://typo.server.example.com:636
ldap_tls_cacert = /etc/openldap/certs/cacert.pem
use_fully_qualified_names = True
[nss]
debug_level = 9
[pam]
debug_level = 9
The sssd service is running without any crash.
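For context, the `getent netgroup -s sss` call in step 5 exercises the glibc netgroup API, which the sss module configured in /etc/nsswitch.conf forwards to the sssd_nss responder that used to crash. Below is a rough C equivalent of that lookup; it is only an illustration, not part of the QE steps, it reuses the netgroup_user name from the test setup above, and unlike `-s sss` it consults whatever sources nsswitch.conf lists rather than forcing the sss module.

```c
/* Rough C equivalent of "getent netgroup netgroup_user" (illustrative). */
#define _GNU_SOURCE
#include <stdio.h>
#include <netdb.h>

int main(void)
{
    char *host, *user, *domain;

    /* setnetgrent() starts the enumeration; with "sss" in nsswitch.conf,
     * the request is answered by the sssd_nss responder. */
    if (!setnetgrent("netgroup_user")) {
        fprintf(stderr, "netgroup lookup failed\n");
        return 1;
    }

    /* Print each (host,user,domain) triple of the netgroup. */
    while (getnetgrent(&host, &user, &domain))
        printf("(%s,%s,%s)\n",
               host ? host : "", user ? user : "", domain ? domain : "");

    endnetgrent();
    return 0;
}
```

Per the Doc Text, before the fix such a lookup could abort sssd_nss once the netgroup's in-memory representation had expired; with sssd-1.16.0-9.el7 the responder keeps running.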
Is there a way to get the updated package now, or is the errata coming soon? I keep getting sssd_nss SIGABRTs on servers (two last night), and the only solution is a power cycle, as even a login as the local root user hangs and never completes.

I can add that it happens at log rotation, around 3:30 am, during the cron.daily/logwatch script. It happened on over 10 servers last night. Some seem to recover immediately with a new sssd_nss process running, but others, my busiest NFS servers, lock up without even the local or serial console working to get a login, and I have to power-cycle them. abrt also seems to fail on those servers: after the power cycle the /var/spool/abrt/ccpp-...new directory is just empty (and has that "new" suffix). On the systems that recovered, the abrt directory is complete and a backtrace shows:
#4 0x000055e2fef11f17 in nss_setnetgrent_timeout (ev=<optimized out>,
te=<optimized out>, current_time=..., pvt=0x55e300853270)
at src/responder/nss/nss_enum.c:270
*** Bug 1538555 has been marked as a duplicate of this bug. ***

I see a new sssd package was just created, but it did not include a fix for this. What more info is needed? I can confirm that the patch from https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied it, and I still get constant crashes at logrotate time on the systems where I have not.

(In reply to Paul Raines from comment #48)
> I see a new sssd package was just created, but it did not include a fix for
> this. What more info is needed? I can confirm that the patch from
> https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied
> it, and I still get constant crashes at logrotate time on the systems where
> I have not.

Paul, which package are you talking about exactly? This is fixed in sssd-1.16.0-5.el7 (which will be part of the RHEL 7.5 release).

sssd-1.15.2-50.el7_4.11.src.rpm was just released yesterday by https://access.redhat.com/errata/RHBA-2018:0402

When is 7.5 expected to be released?

(In reply to Paul Raines from comment #50)
> sssd-1.15.2-50.el7_4.11.src.rpm was just released yesterday by
> https://access.redhat.com/errata/RHBA-2018:0402

There's no z-stream bug request for 7.4 yet (thus, the patch wasn't backported there). In case you have a subscription, I'd strongly recommend you work with support; then we can have this bug cloned to RHEL 7.4.

> When is 7.5 expected to be released?

Beta has already been released: https://www.redhat.com/en/blog/red-hat-enterprise-linux-75-beta-now-available

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929