Bug 1502686 - crash - /usr/libexec/sssd/sssd_nss in nss_setnetgrent_timeout
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.4
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: rc
Assigned To: SSSD Maintainers
QA Contact: Madhuri
Keywords: ZStream
Blocks: 1625213
Reported: 2017-10-16 08:34 EDT by aheverle
Modified: 2018-09-24 09:51 EDT
CC List: 27 users

Fixed In Version: sssd-1.16.0-5.el7
Doc Type: Bug Fix
Doc Text:
Previously, the *Name Service Switch* (NSS) responder's code used a faulty memory hierarchy for keeping the in-memory representation of a netgroup. Consequently, if the in-memory representation of a netgroup had expired and the netgroup was requested, the "sssd_nss" process sometimes terminated unexpectedly. With this update, the memory hierarchy has been corrected. As a result, the crash no longer occurs when a netgroup whose internal representation has expired is requested. (An illustrative sketch of the memory-hierarchy issue follows the tracker metadata below.)
Clones: 1625213
Last Closed: 2018-04-10 13:18:11 EDT
Type: Bug




External Trackers:
Red Hat Product Errata RHEA-2018:0929 (last updated 2018-04-10 13:19 EDT)

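The Doc Text above mentions a faulty talloc memory hierarchy. As a minimal illustration (with made-up names such as long_lived, per_request, and struct netgrent_rep; this is not the actual SSSD code or fix): in talloc, freeing a parent context recursively frees all of its children, so the context an object is parented to determines its lifetime.

/* Illustrative sketch only. Build (assumption):
 *   gcc demo_hierarchy.c $(pkg-config --cflags --libs talloc)
 */
#include <talloc.h>

struct netgrent_rep {            /* hypothetical stand-in for the      */
    const char *name;            /* in-memory netgroup representation  */
};

int main(void)
{
    TALLOC_CTX *long_lived  = talloc_new(NULL);  /* e.g. responder context   */
    TALLOC_CTX *per_request = talloc_new(NULL);  /* freed when a request ends */

    /* Faulty hierarchy: the cached object hangs off short-lived memory,
     * so it disappears while other code may still hold a pointer to it. */
    struct netgrent_rep *rep = talloc_zero(per_request, struct netgrent_rep);
    rep->name = "netgroup_user";

    /* Corrected hierarchy: reparent the object under long-lived memory. */
    talloc_steal(long_lived, rep);

    talloc_free(per_request);    /* rep survives; it belongs to long_lived   */
    talloc_free(long_lived);     /* rep is freed together with its parent    */
    return 0;
}

Reparenting a cached object from per-request memory to a long-lived context, as talloc_steal() does here, is one way such a lifetime bug can be fixed; the actual SSSD patch is in the upstream ticket linked in the comments below.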
Description aheverle 2017-10-16 08:34:47 EDT
-bash-4.2$ sudo su -
ABRT has detected 1 problem(s). For more info run: abrt-cli list --since 1507745585
[root@server ~]#
[root@server ~]# abrt-cli list --since 1507745585
id 59a03095e17c65c3ff2761c5c00835085ee3015d
reason:         sssd_nss killed by SIGABRT
time:           Thu 05 Oct 2017 03:24:01 AM EDT
cmdline:        /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
package:        sssd-common-1.15.2-50.el7_4.2
uid:            0 (root)
count:          8
Directory:      /var/spool/abrt/ccpp-2017-10-05-03:19:01-783
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-05-03:19:01-783' for creating a case in Red Hat Customer Portal

The Autoreporting feature is disabled. Please consider enabling it by issuing
'abrt-auto-reporting enabled' as a user with root privileges
Comment 4 Lukas Slebodnik 2017-10-16 08:47:33 EDT
backtrace:
(gdb) bt
#0  0x00007fab82f111f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fab82f128e8 in __GI_abort () at abort.c:90
#2  0x00007fab836aacfc in talloc_abort (reason=0x7fab836b3818 "Bad talloc magic value - unknown value") at ../talloc.c:426
#3  0x00007fab836ab05d in talloc_abort_unknown_value () at ../talloc.c:444
#4  talloc_chunk_from_ptr (ptr=0x55ef28d0ce40) at ../talloc.c:463
#5  __talloc_get_name (ptr=0x55ef28d0ce40) at ../talloc.c:1486
#6  talloc_check_name (ptr=ptr@entry=0x55ef28d0ce40, name=name@entry=0x55ef26e7bc0a "struct nss_enum_ctx") at ../talloc.c:1509
#7  0x000055ef26e60ec7 in nss_setnetgrent_timeout (ev=<optimized out>, te=<optimized out>, current_time=..., pvt=0x55ef28d0ce40) at src/responder/nss/nss_enum.c:270
#8  0x00007fab838c0c97 in tevent_common_loop_timer_delay (ev=0x55ef28cb3a30) at ../tevent_timed.c:369
#9  0x00007fab838c1f49 in epoll_event_loop (tvalp=0x7ffe5f6be230, epoll_ev=0x55ef28cb3cb0) at ../tevent_epoll.c:659
#10 epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:930
#11 0x00007fab838c02a7 in std_event_loop_once (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:114
#12 0x00007fab838bc0cd in _tevent_loop_once (ev=ev@entry=0x55ef28cb3a30, location=location@entry=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:721
#13 0x00007fab838bc2fb in tevent_common_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:844
#14 0x00007fab838c0247 in std_event_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:145
#15 0x00007fab8742eb33 in server_loop (main_ctx=0x55ef28cb4ec0) at src/util/server.c:718
#16 0x000055ef26e5e04d in main (argc=6, argv=<optimized out>) at src/responder/nss/nsssrv.c:560

And it looks like a similar crash on RHEL 6: https://bugzilla.redhat.com/show_bug.cgi?id=1478525#c5

So our assumption that the cache_req refactoring fixed the crash was wrong.
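For readers unfamiliar with the abort path in the backtrace: it is the classic pattern of a tevent timer firing on private data whose talloc memory has already been freed. A minimal standalone sketch (assumed names and build line, not SSSD code) that exercises the same talloc_check_name() abort seen in frames #2-#7:

/* demo_timer.c — minimal sketch of the crash class above.
 * Build (assumption): gcc demo_timer.c $(pkg-config --cflags --libs talloc tevent)
 */
#include <stdio.h>
#include <sys/time.h>
#include <talloc.h>
#include <tevent.h>

struct enum_ctx {                /* hypothetical stand-in for  */
    const char *name;            /* struct nss_enum_ctx        */
};

static void timeout_handler(struct tevent_context *ev,
                            struct tevent_timer *te,
                            struct timeval current_time,
                            void *pvt)
{
    /* If pvt was freed in the meantime, this check reads a dead talloc
     * chunk; talloc typically aborts here with "Bad talloc magic value",
     * like talloc_check_name() in frame #6 of the backtrace. */
    struct enum_ctx *ctx = talloc_get_type(pvt, struct enum_ctx);
    printf("timer fired for %s\n", ctx ? ctx->name : "(unknown)");
}

int main(void)
{
    struct tevent_context *ev = tevent_context_init(NULL);
    struct enum_ctx *ctx = talloc_zero(ev, struct enum_ctx);
    ctx->name = "netgroup_user";

    /* The timer is parented to ev, not to ctx, so it outlives ctx. */
    tevent_add_timer(ev, ev, tevent_timeval_current_ofs(0, 1000),
                     timeout_handler, ctx);

    talloc_free(ctx);            /* expire the object early...        */
    tevent_loop_once(ev);        /* ...then dispatch the stale timer  */

    talloc_free(ev);
    return 0;
}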
Comment 5 Lukas Slebodnik 2017-10-16 08:49:18 EDT
Question for other developers:
Do we want to track it as part of https://pagure.io/SSSD/sssd/issue/3523,
or would it be better to have a separate ticket for it?
Comment 6 Sumit Bose 2017-10-16 09:44:29 EDT
(In reply to Lukas Slebodnik from comment #5)
> Question for other developers:
> Do we want to track it as part of https://pagure.io/SSSD/sssd/issue/3523,
> or would it be better to have a separate ticket for it?

I agree that this looks very similar and I would link this to https://pagure.io/SSSD/sssd/issue/3523 as well.
Comment 7 Lukas Slebodnik 2017-10-16 10:12:08 EDT
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3523
Comment 31 Fabiano Fidêncio 2017-11-13 11:35:05 EST
* master: f6a1cef87abdd983d6b5349cd341c9a249826577
Comment 35 Madhuri 2017-12-05 06:13:26 EST
Verified with
sssd-1.16.0-9.el7.x86_64

Verification steps:
1. Configure an LDAP server with one instance
2. Configure the SSSD client with a typo in the LDAP server URI in the default configuration file
3. Remove cache and sssd logs
          # rm -f /var/log/sssd/* /var/lib/sss/db/*

4. # service sssd restart

The SSSD domain log messages show that SSSD is offline:
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [dp_get_options] (0x0400): Option ldap_offline_timeout has value 60
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Going offline!
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Initialize check_if_online_ptask.
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline
./sssd_LDAP.log:(Tue Dec  5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline

# service sssd status
Redirecting to /bin/systemctl status sssd.service
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-12-05 03:46:16 EST; 22min ago
 Main PID: 24025 (sssd)
   CGroup: /system.slice/sssd.service
           ├─24025 /usr/sbin/sssd -i --logger=files
           ├─24026 /usr/libexec/sssd/sssd_be --domain LDAP --uid 0 --gid 0 --logger=files
           ├─24027 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           └─24028 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16:099138 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16:099168 2017) [sssd[pam]] [l...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16:099185 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16 2017) [sssd[pam]] [ldb] (0x...460
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16 2017) [sssd[pam]] [ldb] (0x...af0
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec  5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com systemd[1]: Started System Security Services Daemon.
Dec 05 03:46:17 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[be[LDAP]][24026]: Backend is offline

5. Request a netgroup:
# getent netgroup -s sss netgroup_user; sleep 16; pgrep -lf sssd
24076 sssd
24077 sssd_be
24078 sssd_nss
24079 sssd_pam

# cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
domains = LDAP
services = nss, pam

[domain/LDAP]
ldap_search_base = dc=example,dc=com
debug_level = 9
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://typo.server.example.com:636
ldap_tls_cacert = /etc/openldap/certs/cacert.pem
use_fully_qualified_names = True

[nss]
debug_level = 9

[pam]
debug_level = 9


The sssd service is running without any crash.
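As an additional cross-check of the same code path, the lookup in step 5 can also be driven from C through glibc's netgroup API, which is what getent netgroup uses underneath; with sss configured for the netgroup database in /etc/nsswitch.conf, the request goes to the sssd_nss process that used to crash. A small sketch, assuming the netgroup_user netgroup from the verification steps:

/* netgr_lookup.c — enumerate a netgroup via glibc's NSS API.
 * Build: gcc netgr_lookup.c -o netgr_lookup
 */
#define _GNU_SOURCE
#include <netdb.h>
#include <stdio.h>

int main(void)
{
    char *host, *user, *domain;

    if (!setnetgrent("netgroup_user")) {       /* start the enumeration  */
        fprintf(stderr, "setnetgrent: lookup failed\n");
        return 1;
    }
    while (getnetgrent(&host, &user, &domain)) /* one (host,user,domain) */
        printf("(%s,%s,%s)\n",                 /* triple per iteration   */
               host ? host : "-", user ? user : "-", domain ? domain : "-");
    endnetgrent();                             /* finish the enumeration */
    return 0;
}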
Comment 36 Paul Raines 2017-12-30 09:30:46 EST
Is there a way to get the updated package now, or is the errata coming soon?
I keep getting sssd_nss SIGABRT crashes on servers (two last night), and the only solution is a power cycle, as even logging in as the local root user hangs and never completes.
Comment 38 Paul Raines 2018-01-01 12:55:45 EST
I can add that it happens at log rotation, at around 3:30 AM, during the cron.daily/logwatch script. It happened on over 10 servers last night. Some seem to recover immediately with a new sssd_nss process running, but others, my busiest NFS servers, lock up without even the local or serial console working to get a login; I have to power-cycle them. abrt seems to fail on those servers too: after the power cycle, the /var/spool/abrt/ccpp-...new directory is just empty (and has that "new" suffix). On the systems that recovered, the abrt directory is full and a backtrace shows:

#4  0x000055e2fef11f17 in nss_setnetgrent_timeout (ev=<optimized out>,
    te=<optimized out>, current_time=..., pvt=0x55e300853270)
    at src/responder/nss/nss_enum.c:270
Comment 46 German Parente 2018-03-06 04:06:06 EST
*** Bug 1538555 has been marked as a duplicate of this bug. ***
Comment 48 Paul Raines 2018-03-09 09:11:29 EST
I see a new sssd package was just created but it did not include a fix for this. What more info is needed? I can confirm that the patch from https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied it, and I still get constant crashes at logrotate time on the systems where I have not.
Comment 49 Fabiano Fidêncio 2018-03-09 09:19:16 EST
(In reply to Paul Raines from comment #48)
> I see a new sssd package was just created but it did not include a fix for
> this. What more info is needed? I can confirm that the patch from
> https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied
> it, and I still get constant crashes at logrotate time on the systems where
> I have not.

Paul,

Which package are you talking about exactly?
This is fixed on sssd-1.16.0-5.el7 (which will be part of the RHEL-7.5 release).
Comment 50 Paul Raines 2018-03-09 09:31:37 EST
sssd-1.15.2-50.el7_4.11.src.rpm just released yesterday by https://access.redhat.com/errata/RHBA-2018:0402

When is 7.5 expected to be released?
Comment 51 Fabiano Fidêncio 2018-03-09 09:40:06 EST
(In reply to Paul Raines from comment #50)
> sssd-1.15.2-50.el7_4.11.src.rpm just released yesterday by
> https://access.redhat.com/errata/RHBA-2018:0402

There's no z-stream bug request for 7.4 yet (thus, the patch wasn't backported there).

If you have a subscription, I'd strongly recommend working with support; then we can have this bug cloned to RHEL-7.4.

> 
> When is 7.5 expected to be released?

Beta has already been released: https://www.redhat.com/en/blog/red-hat-enterprise-linux-75-beta-now-available
Comment 55 errata-xmlrpc 2018-04-10 13:18:11 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929
