Red Hat Bugzilla – Bug 1502686
crash - /usr/libexec/sssd/sssd_nss in nss_setnetgrent_timeout
Last modified: 2018-09-24 09:51:59 EDT
-bash-4.2$ sudo su -
ABRT has detected 1 problem(s). For more info run: abrt-cli list --since 1507745585
[root@server ~]#
[root@server ~]# abrt-cli list --since 1507745585
id 59a03095e17c65c3ff2761c5c00835085ee3015d
reason:         sssd_nss killed by SIGABRT
time:           Thu 05 Oct 2017 03:24:01 AM EDT
cmdline:        /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
package:        sssd-common-1.15.2-50.el7_4.2
uid:            0 (root)
count:          8
Directory:      /var/spool/abrt/ccpp-2017-10-05-03:19:01-783
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-05-03:19:01-783' for creating a case in Red Hat Customer Portal
The Autoreporting feature is disabled. Please consider enabling it by issuing
'abrt-auto-reporting enabled' as a user with root privileges
backtrace:
(gdb) bt
#0  0x00007fab82f111f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fab82f128e8 in __GI_abort () at abort.c:90
#2  0x00007fab836aacfc in talloc_abort (reason=0x7fab836b3818 "Bad talloc magic value - unknown value") at ../talloc.c:426
#3  0x00007fab836ab05d in talloc_abort_unknown_value () at ../talloc.c:444
#4  talloc_chunk_from_ptr (ptr=0x55ef28d0ce40) at ../talloc.c:463
#5  __talloc_get_name (ptr=0x55ef28d0ce40) at ../talloc.c:1486
#6  talloc_check_name (ptr=ptr@entry=0x55ef28d0ce40, name=name@entry=0x55ef26e7bc0a "struct nss_enum_ctx") at ../talloc.c:1509
#7  0x000055ef26e60ec7 in nss_setnetgrent_timeout (ev=<optimized out>, te=<optimized out>, current_time=..., pvt=0x55ef28d0ce40) at src/responder/nss/nss_enum.c:270
#8  0x00007fab838c0c97 in tevent_common_loop_timer_delay (ev=0x55ef28cb3a30) at ../tevent_timed.c:369
#9  0x00007fab838c1f49 in epoll_event_loop (tvalp=0x7ffe5f6be230, epoll_ev=0x55ef28cb3cb0) at ../tevent_epoll.c:659
#10 epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:930
#11 0x00007fab838c02a7 in std_event_loop_once (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:114
#12 0x00007fab838bc0cd in _tevent_loop_once (ev=ev@entry=0x55ef28cb3a30, location=location@entry=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:721
#13 0x00007fab838bc2fb in tevent_common_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:844
#14 0x00007fab838c0247 in std_event_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:145
#15 0x00007fab8742eb33 in server_loop (main_ctx=0x55ef28cb4ec0) at src/util/server.c:718
#16 0x000055ef26e5e04d in main (argc=6, argv=<optimized out>) at src/responder/nss/nsssrv.c:560

And it looks like a similar crash in RHEL 6:
https://bugzilla.redhat.com/show_bug.cgi?id=1478525#c5

So our assumption that the cache_req refactoring fixed the crash was wrong.
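For anyone triaging similar dumps: frames #2-#6 are talloc's pointer validation. talloc_check_name() (used by talloc_get_type()) reads a magic value stored in the chunk header just in front of the pointer; when the memory behind the pointer has already been freed or reused, the check fails and talloc_abort() is invoked. The sketch below is not SSSD code, just a minimal illustration of that failure mode with a tevent timer outliving its private data (struct name, callback and the 1-second timer are purely illustrative):

/*
 * Minimal standalone reproduction of the abort pattern above -- NOT SSSD
 * code. A tevent timer outlives the talloc context its private data
 * points to, so when the timer fires, talloc's magic check fails and
 * talloc_abort() is called (reported as "access after free", or
 * "unknown value" if the memory was already reused).
 *
 * Build: gcc crash_sketch.c $(pkg-config --cflags --libs talloc tevent)
 */
#include <sys/time.h>
#include <talloc.h>
#include <tevent.h>

struct enum_ctx {                      /* stand-in for struct nss_enum_ctx */
    int dummy;
};

static void timeout_cb(struct tevent_context *ev,
                       struct tevent_timer *te,
                       struct timeval now, void *pvt)
{
    /* pvt points at memory freed below: undefined behaviour, which
     * talloc's magic check turns into the abort seen in frames #2-#6. */
    struct enum_ctx *ctx = talloc_get_type(pvt, struct enum_ctx);
    (void)ctx;
}

int main(void)
{
    struct tevent_context *ev = tevent_context_init(NULL);
    struct enum_ctx *ctx = talloc_zero(ev, struct enum_ctx);

    /* The timer is parented on ev, so freeing ctx does NOT cancel it. */
    tevent_add_timer(ev, ev, tevent_timeval_current_ofs(1, 0),
                     timeout_cb, ctx);

    talloc_free(ctx);      /* ctx is gone ...                        */
    tevent_loop_once(ev);  /* ... but the timer still fires -> abort */
    return 0;
}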
Question for other developers: Do we want to track it as part of https://pagure.io/SSSD/sssd/issue/3523, or would it be better to have a separate ticket for it?
(In reply to Lukas Slebodnik from comment #5)
> Question for other developers:
> Do we want to track it as part of https://pagure.io/SSSD/sssd/issue/3523?
> or it will be better to have separate ticket for it.

I agree that this looks very similar and I would link this to https://pagure.io/SSSD/sssd/issue/3523 as well.
Upstream ticket: https://pagure.io/SSSD/sssd/issue/3523
* master: f6a1cef87abdd983d6b5349cd341c9a249826577
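For reference (I have not inspected the commit itself), the usual cure for this class of crash is to tie the lifetime of the tevent timer to the structure its callback dereferences, e.g. by making the timer a talloc child of that structure so freeing it also removes the pending timer. A hypothetical sketch of that pattern, with purely illustrative names (enum_ctx, schedule_timeout), not the actual upstream change:

/*
 * NOT the actual upstream patch -- a generic sketch of the common
 * remedy: parent the tevent timer on the structure its callback uses,
 * so freeing the structure also cancels the timer.
 *
 * Build: gcc fix_sketch.c $(pkg-config --cflags --libs talloc tevent)
 */
#include <errno.h>
#include <stdint.h>
#include <sys/time.h>
#include <talloc.h>
#include <tevent.h>

struct enum_ctx {
    struct tevent_timer *timeout;      /* owned by this context */
};

static void timeout_cb(struct tevent_context *ev,
                       struct tevent_timer *te,
                       struct timeval now, void *pvt)
{
    struct enum_ctx *ctx = talloc_get_type_abort(pvt, struct enum_ctx);
    /* ... expire the cached enumeration result here ... */
    (void)ctx;
}

static int schedule_timeout(struct tevent_context *ev,
                            struct enum_ctx *ctx, uint32_t lifetime)
{
    /* Parent the timer on ctx (2nd argument), not on ev: tevent's
     * destructor unlinks the timer when ctx is freed, so timeout_cb
     * can never run against a dangling pointer. */
    ctx->timeout = tevent_add_timer(ev, ctx,
                                    tevent_timeval_current_ofs(lifetime, 0),
                                    timeout_cb, ctx);
    return (ctx->timeout == NULL) ? ENOMEM : 0;
}

int main(void)
{
    struct tevent_context *ev = tevent_context_init(NULL);
    struct enum_ctx *ctx = talloc_zero(ev, struct enum_ctx);

    schedule_timeout(ev, ctx, 1);
    talloc_free(ctx);   /* also frees and thereby cancels ctx->timeout */
    talloc_free(ev);
    return 0;
}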
Verified with sssd-1.16.0-9.el7.x86_64

Verification steps:

1. Configure an LDAP server with one instance.

2. Configure the sssd client with a typo in the LDAP server URI in the default configuration file.

3. Remove the cache and sssd logs:
# rm -f /var/log/sssd/* /var/lib/sss/db/*

4. # service sssd restart

From the sssd domain log messages, sssd is offline:

./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [dp_get_options] (0x0400): Option ldap_offline_timeout has value 60
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Going offline!
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Initialize check_if_online_ptask.
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline

# service sssd status
Redirecting to /bin/systemctl status sssd.service
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-12-05 03:46:16 EST; 22min ago
 Main PID: 24025 (sssd)
   CGroup: /system.slice/sssd.service
           ├─24025 /usr/sbin/sssd -i --logger=files
           ├─24026 /usr/libexec/sssd/sssd_be --domain LDAP --uid 0 --gid 0 --logger=files
           ├─24027 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           └─24028 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099138 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099168 2017) [sssd[pam]] [l...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099185 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...460
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...af0
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com systemd[1]: Started System Security Services Daemon.
Dec 05 03:46:17 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[be[LDAP]][24026]: Backend is offline
5. Request the netgroup:

# getent netgroup -s sss netgroup_user; sleep 16; pgrep -lf sssd
24076 sssd
24077 sssd_be
24078 sssd_nss
24079 sssd_pam

(The 16-second sleep gives the setnetgrent timeout handler time to fire before checking that sssd_nss is still running.)

# cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
domains = LDAP
services = nss, pam

[domain/LDAP]
ldap_search_base = dc=example,dc=com
debug_level = 9
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://typo.server.example.com:636
ldap_tls_cacert = /etc/openldap/certs/cacert.pem
use_fully_qualified_names = True

[nss]
debug_level = 9

[pam]
debug_level = 9

The sssd service keeps running without any crash.
Is there a way to get the updated package now, or is the errata coming soon? I keep getting sssd_nss SIGABRT crashes on servers (two last night), and the only solution is a power cycle, since even logging in as the local root user hangs and never completes.
I can add that it happens at log rotation, around 3:30 AM, during the cron.daily/logwatch script. It happened on over 10 servers last night. Some seem to recover immediately with a new sssd_nss process running, but others, my busiest NFS servers, lock up without even the local or serial console working to get a login; I have to power cycle. abrt seems to fail on those servers too: after the power cycle the /var/spool/abrt/ccpp-...new directory is just empty (and still has that "new" suffix). On the systems that recovered, the abrt directory is complete and the backtrace shows:

#4 0x000055e2fef11f17 in nss_setnetgrent_timeout (ev=<optimized out>, te=<optimized out>, current_time=..., pvt=0x55e300853270) at src/responder/nss/nss_enum.c:270
*** Bug 1538555 has been marked as a duplicate of this bug. ***
I see a new sssd package was just created, but it did not include a fix for this. What more info is needed? I can confirm that the patch from https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied it, and I still get constant crashes at logrotate time on the systems where I have not.
(In reply to Paul Raines from comment #48)
> I see a new sssd package was just created but did not include a fix for
> this. What more info is needed? I can confirm the patch from
> https://pagure.io/SSSD/sssd/issue/3523 works on those systems I applied it
> and I still get constant crashes at logrotate time on those systems where I
> have not.

Paul,

Which package are you talking about exactly?

This is fixed in sssd-1.16.0-5.el7 (which will be part of the RHEL-7.5 release).
sssd-1.15.2-50.el7_4.11.src.rpm was just released yesterday by https://access.redhat.com/errata/RHBA-2018:0402.

When is 7.5 expected to be released?
(In reply to Paul Raines from comment #50)
> sssd-1.15.2-50.el7_4.11.src.rpm just released yesterday by
> https://access.redhat.com/errata/RHBA-2018:0402

There's no z-stream bug request for 7.4 yet (thus, the patch wasn't backported there). If you have a subscription, I'd strongly recommend working with support so that we can have this bug cloned to RHEL-7.4.

> When is 7.5 expected to release?

The beta has already been released: https://www.redhat.com/en/blog/red-hat-enterprise-linux-75-beta-now-available
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0929