Bug 1502686
| Summary: | crash - /usr/libexec/sssd/sssd_nss in nss_setnetgrent_timeout | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | aheverle |
| Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> |
| Status: | CLOSED ERRATA | QA Contact: | Madhuri <mupadhye> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.4 | CC: | amitkuma, apeddire, atolani, atripath, enewland, fidencio, gparente, grajaiya, jhrozek, jowright, jpriddy, kludhwan, knweiss, lmanasko, lslebodn, mbliss, minyu, mkosek, mzidek, nsoman, pbrezina, raines, rbdiri, sbose, sgoveas, sssd-maint, tscherf |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | sssd-1.16.0-5.el7 | Doc Type: | Bug Fix |
| Doc Text: | Previously, the *Name Service Switch* (NSS) responder's code used a faulty memory hierarchy for keeping the in-memory representation of a netgroup. Consequently, if the in-memory representation of a netgroup had expired and the netgroup was requested, the "sssd_nss" process sometimes terminated unexpectedly. With this update, the memory hierarchy has been corrected. As a result, the crash no longer occurs when a netgroup whose in-memory representation has expired is requested. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1625213 (view as bug list) | Environment: | |
| Last Closed: | 2018-04-10 17:18:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1625213 | | |
Description (aheverle, 2017-10-16 12:34:47 UTC)
backtrace:

(gdb) bt
#0  0x00007fab82f111f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fab82f128e8 in __GI_abort () at abort.c:90
#2  0x00007fab836aacfc in talloc_abort (reason=0x7fab836b3818 "Bad talloc magic value - unknown value") at ../talloc.c:426
#3  0x00007fab836ab05d in talloc_abort_unknown_value () at ../talloc.c:444
#4  talloc_chunk_from_ptr (ptr=0x55ef28d0ce40) at ../talloc.c:463
#5  __talloc_get_name (ptr=0x55ef28d0ce40) at ../talloc.c:1486
#6  talloc_check_name (ptr=ptr@entry=0x55ef28d0ce40, name=name@entry=0x55ef26e7bc0a "struct nss_enum_ctx") at ../talloc.c:1509
#7  0x000055ef26e60ec7 in nss_setnetgrent_timeout (ev=<optimized out>, te=<optimized out>, current_time=..., pvt=0x55ef28d0ce40) at src/responder/nss/nss_enum.c:270
#8  0x00007fab838c0c97 in tevent_common_loop_timer_delay (ev=0x55ef28cb3a30) at ../tevent_timed.c:369
#9  0x00007fab838c1f49 in epoll_event_loop (tvalp=0x7ffe5f6be230, epoll_ev=0x55ef28cb3cb0) at ../tevent_epoll.c:659
#10 epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:930
#11 0x00007fab838c02a7 in std_event_loop_once (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:114
#12 0x00007fab838bc0cd in _tevent_loop_once (ev=ev@entry=0x55ef28cb3a30, location=location@entry=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:721
#13 0x00007fab838bc2fb in tevent_common_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent.c:844
#14 0x00007fab838c0247 in std_event_loop_wait (ev=0x55ef28cb3a30, location=0x7fab8744eec7 "src/util/server.c:718") at ../tevent_standard.c:145
#15 0x00007fab8742eb33 in server_loop (main_ctx=0x55ef28cb4ec0) at src/util/server.c:718
#16 0x000055ef26e5e04d in main (argc=6, argv=<optimized out>) at src/responder/nss/nsssrv.c:560

And it looks like a similar crash in RHEL 6: https://bugzilla.redhat.com/show_bug.cgi?id=1478525#c5

So our assumption that the cache_req refactoring fixed the crash was wrong.

Question for other developers: do we want to track it as part of https://pagure.io/SSSD/sssd/issue/3523, or would it be better to have a separate ticket for it?

(In reply to Lukas Slebodnik from comment #5)
> Question for other developers: do we want to track it as part of
> https://pagure.io/SSSD/sssd/issue/3523, or would it be better to have a
> separate ticket for it?

I agree that this looks very similar and I would link this to https://pagure.io/SSSD/sssd/issue/3523 as well.

Upstream ticket: https://pagure.io/SSSD/sssd/issue/3523

* master: f6a1cef87abdd983d6b5349cd341c9a249826577
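To make the "faulty memory hierarchy" described in the Doc Text concrete, here is a minimal sketch of the pattern behind frames #2-#7 above. This is illustrative code only, not SSSD source; the struct and variable names are invented. It shows a tevent timer whose private data is owned by a shorter-lived talloc parent, so when that parent is freed the callback later runs on a dangling pointer and talloc aborts with "Bad talloc magic value".

```c
/* Illustrative only, not SSSD code.
 * Build (assuming pkg-config files for talloc/tevent are installed):
 *   gcc demo.c $(pkg-config --cflags --libs talloc tevent) */
#include <sys/time.h>
#include <talloc.h>
#include <tevent.h>

struct enum_state {              /* stand-in for the real nss_enum_ctx */
    int unused;
};

static void timeout_handler(struct tevent_context *ev,
                            struct tevent_timer *te,
                            struct timeval current_time,
                            void *pvt)
{
    /* With a dangling 'pvt', this typically aborts with
     * "Bad talloc magic value", as in nss_setnetgrent_timeout(). */
    struct enum_state *state = talloc_get_type_abort(pvt, struct enum_state);
    (void)ev; (void)te; (void)current_time; (void)state;
}

int main(void)
{
    TALLOC_CTX *long_lived = talloc_new(NULL);
    struct tevent_context *ev = tevent_context_init(long_lived);

    /* Faulty hierarchy: the callback's private data hangs off a parent
     * that goes away when the cached entry expires. */
    TALLOC_CTX *expiring = talloc_new(long_lived);
    struct enum_state *state = talloc_zero(expiring, struct enum_state);

    tevent_add_timer(ev, long_lived, tevent_timeval_current_ofs(1, 0),
                     timeout_handler, state);

    talloc_free(expiring);   /* simulated expiry: 'state' is freed ...  */
    tevent_loop_once(ev);    /* ... but the timer still fires with it   */

    /* The fix described in the Doc Text amounts to giving the private
     * data a parent that lives at least as long as the timer referencing
     * it (or removing the timer when the entry expires). */
    talloc_free(long_lived);
    return 0;
}
```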
Verified with sssd-1.16.0-9.el7.x86_64.
Verification steps:
1. Configure an LDAP server with one instance
2. Configure the sssd client with a typo in the LDAP server address in the default configuration file
3. Remove the cache and the sssd logs
# rm -f /var/log/sssd/* /var/lib/sss/db/*
4. # service sssd restart
From the sssd domain log messages, sssd is offline:
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [dp_get_options] (0x0400): Option ldap_offline_timeout has value 60
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Going offline!
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_mark_offline] (0x2000): Initialize check_if_online_ptask.
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline
./sssd_LDAP.log:(Tue Dec 5 03:46:16 2017) [sssd[be[LDAP]]] [be_ptask_offline_cb] (0x0400): Back end is offline
# service sssd status
Redirecting to /bin/systemctl status sssd.service
● sssd.service - System Security Services Daemon
Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2017-12-05 03:46:16 EST; 22min ago
Main PID: 24025 (sssd)
CGroup: /system.slice/sssd.service
├─24025 /usr/sbin/sssd -i --logger=files
├─24026 /usr/libexec/sssd/sssd_be --domain LDAP --uid 0 --gid 0 --logger=files
├─24027 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
└─24028 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099138 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099168 2017) [sssd[pam]] [l...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16:099185 2017) [sssd[pam]] [l...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...460
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...af0
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ut"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[24025]: (Tue Dec 5 03:46:16 2017) [sssd[pam]] [ldb] (0x...ck"
Dec 05 03:46:16 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com systemd[1]: Started System Security Services Daemon.
Dec 05 03:46:17 ibm-x3650m4-01-vm-06.lab.eng.bos.redhat.com sssd[be[LDAP]][24026]: Backend is offline
5. Request the netgroup (a minimal C equivalent of this lookup is sketched after the verification output below) and confirm the sssd processes are still running
# getent netgroup -s sss netgroup_user; sleep 16; pgrep -lf sssd
24076 sssd
24077 sssd_be
24078 sssd_nss
24079 sssd_pam
# cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
domains = LDAP
services = nss, pam
[domain/LDAP]
ldap_search_base = dc=example,dc=com
debug_level = 9
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://typo.server.example.com:636
ldap_tls_cacert = /etc/openldap/certs/cacert.pem
use_fully_qualified_names = True
[nss]
debug_level = 9
[pam]
debug_level = 9
The sssd service is running without any crash.
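For context, the `getent netgroup -s sss` call in step 5 exercises the glibc netgroup API, which the sss module configured in /etc/nsswitch.conf forwards to the sssd_nss responder that used to crash. Below is a rough C equivalent of that lookup; it is only an illustration, not part of the QE steps, it reuses the netgroup_user name from the test setup above, and unlike `-s sss` it consults whatever sources nsswitch.conf lists rather than forcing the sss module.

```c
/* Rough C equivalent of "getent netgroup netgroup_user" (illustrative). */
#define _GNU_SOURCE
#include <stdio.h>
#include <netdb.h>

int main(void)
{
    char *host, *user, *domain;

    /* setnetgrent() starts the enumeration; with "sss" in nsswitch.conf,
     * the request is answered by the sssd_nss responder. */
    if (!setnetgrent("netgroup_user")) {
        fprintf(stderr, "netgroup lookup failed\n");
        return 1;
    }

    /* Print each (host,user,domain) triple of the netgroup. */
    while (getnetgrent(&host, &user, &domain))
        printf("(%s,%s,%s)\n",
               host ? host : "", user ? user : "", domain ? domain : "");

    endnetgrent();
    return 0;
}
```

Per the Doc Text, before the fix such a lookup could abort sssd_nss once the netgroup's in-memory representation had expired; with sssd-1.16.0-9.el7 the responder keeps running.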
Is there a way to get the updated package now, or is the errata coming soon? I keep getting sssd_nss SIGABRTs on servers (two last night), and the only solution is a power cycle, as even a login as the local root user hangs and never completes.

I can add that it happens at log rotation, around 3:30 am, during the cron.daily/logwatch script. It happened on over 10 servers last night. Some seem to recover immediately with a new sssd_nss process running, but others, my busiest NFS servers, lock up without even the local or serial console working to get a login, and I have to power-cycle them. abrt also seems to fail on those servers: after the power cycle the /var/spool/abrt/ccpp-...new directory is just empty (and has that "new" suffix). On the systems that recovered, the abrt directory is complete and a backtrace shows:
#4 0x000055e2fef11f17 in nss_setnetgrent_timeout (ev=<optimized out>,
te=<optimized out>, current_time=..., pvt=0x55e300853270)
at src/responder/nss/nss_enum.c:270
*** Bug 1538555 has been marked as a duplicate of this bug. ***

I see a new sssd package was just created, but it did not include a fix for this. What more info is needed? I can confirm that the patch from https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied it, and I still get constant crashes at logrotate time on the systems where I have not.

(In reply to Paul Raines from comment #48)
> I see a new sssd package was just created, but it did not include a fix for
> this. What more info is needed? I can confirm that the patch from
> https://pagure.io/SSSD/sssd/issue/3523 works on the systems where I applied
> it, and I still get constant crashes at logrotate time on the systems where
> I have not.

Paul, which package are you talking about exactly? This is fixed in sssd-1.16.0-5.el7 (which will be part of the RHEL 7.5 release).

sssd-1.15.2-50.el7_4.11.src.rpm was just released yesterday by https://access.redhat.com/errata/RHBA-2018:0402

When is 7.5 expected to be released?

(In reply to Paul Raines from comment #50)
> sssd-1.15.2-50.el7_4.11.src.rpm was just released yesterday by
> https://access.redhat.com/errata/RHBA-2018:0402

There's no z-stream bug request for 7.4 yet (thus, the patch wasn't backported there). In case you have a subscription, I'd strongly recommend you work with support; then we can have this bug cloned to RHEL 7.4.

> When is 7.5 expected to be released?

Beta has already been released: https://www.redhat.com/en/blog/red-hat-enterprise-linux-75-beta-now-available

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929