Description of problem:
If enumeration is enabled, SSSD seems to hang a few minutes after a restart. The customer mentioned that once they run into this issue, none of the LDAP or *LOCAL* users are able to log in, and they have to manually restart the system. They used to run RHEL 7.2 with enumeration. After upgrading, SSSD stops being operational after a few minutes. We see lots of errors and are unable to tell whether they are all related.

~~~
Nov 16 11:02:31 host0 systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
Nov 16 11:04:01 host0 systemd[1]: sssd.service stop-final-sigterm timed out. Killing.
Nov 16 11:04:01 host0 systemd[1]: Unit sssd.service entered failed state.
Nov 16 11:04:01 host0 systemd[1]: sssd.service failed.
Nov 16 11:05:24 host0 sssd: Starting up
Nov 16 11:05:24 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:05:24 host0 sssd[nss]: Starting up
Nov 16 11:05:24 host0 sssd[pam]: Starting up
Nov 16 11:05:24 host0 sssd[sudo]: Starting up
Nov 16 11:05:24 host0 sssd[pac]: Starting up
Nov 16 11:06:05 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:06:46 host0 abrt-hook-ccpp: Process 3118 (sssd_be) of user 0 killed by SIGABRT - dumping core
Nov 16 11:06:48 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:16:45 host0 sssd[nss]: Starting up
Nov 16 11:19:02 host0 sssd[nss]: Shutting down
Nov 16 11:19:02 host0 sssd[pac]: Shutting down
Nov 16 11:19:02 host0 sssd[pam]: Shutting down
Nov 16 11:19:02 host0 sssd[sudo]: Shutting down
Nov 16 11:19:25 host0 systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
----------------
(Wed Nov 16 11:16:45 2016) [sssd] [mark_service_as_started] (0x0200): Marking nss as started.
(Wed Nov 16 11:16:45 2016) [sssd] [mark_service_as_started] (0x0080): Invalid parent pid: 2443
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0040): Returned with: 0
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [nss][17955]
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Child [nss] terminated with a signal
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [rba.gov.au][9579]
(Wed Nov 16 11:19:25 2016) [sssd] [watchdog_handler] (0x0010): Watchdog timer overflow, killing process!
(Wed Nov 16 11:19:25 2016) [sssd] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Wed Nov 16 11:19:25 2016) [sssd] [sbus_remove_watch] (0x2000): 0x7f5e64f34730/0x7f5e64f35bf0
~~~

We also see SSSD exiting with core dumps.

~~~
Nov 16 11:06:46 hlit-fsst-tep01.rba.gov.au audispd[1269]: node=hlit-fsst-tep01.rba.gov.au type=ANOM_ABEND msg=audit(1479254806.954:674): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=3118 comm="sssd_be" reason="memory violation" sig=6
Nov 16 11:06:46 hlit-fsst-tep01.rba.gov.au abrt-hook-ccpp[4266]: Process 3118 (sssd_be) of user 0 killed by SIGABRT - dumping core
~~~

Version-Release number of selected component (if applicable):
sssd-1.14.0-43.el7.x86_64

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
SSSD should be able to operate even with enumeration enabled. In RHEL 7.2, SSSD was a bit slow with enumeration, but it worked.

Additional info:
From the backtrace we see that we are receiving a signal and calling orderly_shutdown with status 1. Where is this signal coming from?

#31 0x00007f2448bd2a49 in __run_exit_handlers (status=status@entry=1, listp=0x7f2448f546c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#32 0x00007f2448bd2a95 in __GI_exit (status=status@entry=1) at exit.c:99
#33 0x00007f244cee898d in orderly_shutdown (status=1) at src/util/server.c:257
#34 <signal handler called>

Secondly, we see it hitting an issue when it tries to do some memory-related operations as part of orderly_shutdown.

#0 0x00007f2448bcf1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f2448bd08c8 in __GI_abort () at abort.c:90
#2 0x00007f244936717c in talloc_abort (reason=0x7f2449370638 "Bad talloc magic value - unknown value") at ../talloc.c:399
#3 0x00007f244936d469 in talloc_abort_unknown_value () at ../talloc.c:417
#4 talloc_chunk_from_ptr (ptr=0x7f245fa9e1d0) at ../talloc.c:436
#5 _talloc_free_internal (ptr=0x7f245fa9e1d0, location=0x7f2449370a12 "../talloc.c:2631") at ../talloc.c:1016
#6 0x00007f244936d01b in _talloc_free_children_internal (location=0x7f2449370a12 "../talloc.c:2631", ptr=0x7f245f9b5b10, tc=0x7f245f9b5ab0) at ../talloc.c:1525
#7 _talloc_free_internal (ptr=0x7f245f9b5b10, location=0x7f2449370a12 "../talloc.c:2631") at ../talloc.c:1072
Created attachment 1222311 [details] backtrace from the corefile
Created attachment 1222313 [details] sosreport
I found that the watchdog timeout is actually the timeout configuration option in sssd.conf. Adding 'timeout=600' to all sections of sssd.conf resolved the issue on our test system. The customer's logs show the same symptoms, so I am going to ask them to add the timeout option to their sssd.conf. We will keep you posted on the progress in this case.

Thank you,
Jatin
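For anyone wanting to try the same workaround, here is a minimal sketch of applying 'timeout=600' to every section of an sssd.conf. It is only an illustration: the helper name add_timeout and the sample file contents (including the example.com domain) are hypothetical, and it follows the "all sections" wording above without confirming that every section actually honors the option.

```python
import configparser
import io


def add_timeout(conf_text: str, seconds: int = 600) -> str:
    """Return conf_text with `timeout = <seconds>` set in every section.

    Hypothetical helper for illustration; it is not part of SSSD.
    """
    # interpolation=None so that any '%' in existing values is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    for section in parser.sections():
        parser.set(section, "timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical sample configuration; the domain name is a placeholder.
SAMPLE = """\
[sssd]
services = nss, pam
domains = example.com

[nss]

[domain/example.com]
id_provider = ldap
enumerate = true
"""

if __name__ == "__main__":
    print(add_timeout(SAMPLE))
```

Note that configparser drops comments when it rewrites a file, so on a real system it is probably safer to edit sssd.conf by hand and restart SSSD afterwards. Raising the timeout is a workaround for the watchdog kill, not a fix for the underlying blocking.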
Upstream ticket: https://fedorahosted.org/sssd/ticket/3266
master:
* e6a5f8c58539fc31fd81fac89cfc85703b4250ea
* 087162b85e191af51637904702813969b35eaadc

sssd-1-14:
* 0606a71b698c4acf954ba7284e62acbd0aa5e52d
* 442985a7af2262fab57f56c7a8cd40af10081610
*** Bug 1379774 has been marked as a duplicate of this bug. ***
Verified the bug on SSSD version: sssd-1.15.2-29.el7.x86_64

Steps followed during verification:
1. Set up LDAP with 25000 users.
2. Set up the client with enumerate = true in sssd.conf.
3. Run user auth and lookup to verify whether the sssd_be process hangs or not.

These steps are already automated in the SSSD Performance test suite for bz1418943; the regression round was executed. See beaker link: https://beaker.engineering.redhat.com/jobs/1860968

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: sssdbe can deadlock if a long running task like enumeration blocks it bz1418943
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: Sleeping for 5 seconds
:: [ PASS ] :: Command 'getent -s sss passwd puser15999' (Expected 0, got 0)
:: [ LOG ] :: Authentication successfull, as expected
:: [ LOG ] :: Duration: 3m 16s
:: [ LOG ] :: Assertions: 1 good, 0 bad
:: [ PASS ] :: RESULT: sssdbe can deadlock if a long running task like enumeration blocks it bz1418943
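The lookup check from step 3 can be approximated outside the test suite with a small script. This is only a sketch and not the beaker test itself: the helper name lookup_ok and the 30-second limit are assumptions, and a lookup that does not return within the limit is simply treated as a hang.

```python
import subprocess
import sys


def lookup_ok(cmd, timeout_s=30.0):
    """Run cmd and return True only if it exits 0 within timeout_s seconds.

    Hypothetical helper; the 30 second default is an arbitrary assumption.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # no answer in time: treat as a hung responder or backend
    except FileNotFoundError:
        return False  # the command is not installed on this machine
    return result.returncode == 0


if __name__ == "__main__":
    # Same lookup as in the test log; pass another user name as argv[1].
    user = sys.argv[1] if len(sys.argv) > 1 else "puser15999"
    ok = lookup_ok(["getent", "-s", "sss", "passwd", user])
    print("PASS" if ok else "FAIL")
```

Running it repeatedly while enumeration is in progress gives a rough signal of whether lookups keep being answered. It does not exercise authentication.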
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2294