Bug 1396912 - If a long-running task (e.g. enumeration) blocks the sssd_be process, sssd_be can deadlock
Summary: If a long-running task (e.g. enumeration) blocks the sssd_be process, sssd_be can deadlock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: SSSD Maintainers
QA Contact: Amith
URL:
Whiteboard:
Duplicates: 1379774
Depends On: 1416780
Blocks: 1418943
Reported: 2016-11-21 06:56 UTC by fjayalat
Modified: 2020-06-11 13:05 UTC (History)

Fixed In Version: sssd-1.15.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1418943 (view as bug list)
Environment:
Last Closed: 2017-08-01 09:02:33 UTC
Target Upstream Version:


Attachments
backtrace from the corefile (11.44 KB, text/plain)
2016-11-21 07:01 UTC, fjayalat
sosreport (12.75 MB, application/x-gzip)
2016-11-21 07:12 UTC, fjayalat


Links
System ID Priority Status Summary Last Updated
Github SSSD sssd issues 4299 None closed Deadlock in watchdog's signal handler 2020-05-20 13:17:53 UTC
Red Hat Product Errata RHEA-2017:2294 normal SHIPPED_LIVE sssd bug fix and enhancement update 2017-08-01 12:39:55 UTC

Description fjayalat 2016-11-21 06:56:51 UTC
Description of problem:

If enumeration is enabled, SSSD seems to hang a few minutes after a restart.

The customer reports that once this issue occurs, neither LDAP nor *LOCAL* users are able to log in, and the system has to be restarted manually.

They previously ran RHEL 7.2 with enumeration enabled. After the upgrade, SSSD is no longer operational after a few minutes.

We see many errors and cannot tell whether they are all related.

~~~

Nov 16 11:02:31 host0 systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
Nov 16 11:04:01 host0 systemd[1]: sssd.service stop-final-sigterm timed out. Killing.
Nov 16 11:04:01 host0 systemd[1]: Unit sssd.service entered failed state.
Nov 16 11:04:01 host0 systemd[1]: sssd.service failed.
Nov 16 11:05:24 host0 sssd: Starting up
Nov 16 11:05:24 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:05:24 host0 sssd[nss]: Starting up
Nov 16 11:05:24 host0 sssd[pam]: Starting up
Nov 16 11:05:24 host0 sssd[sudo]: Starting up
Nov 16 11:05:24 host0 sssd[pac]: Starting up
Nov 16 11:06:05 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:06:46 host0 abrt-hook-ccpp: Process 3118 (sssd_be) of user 0 killed by SIGABRT - dumping core
Nov 16 11:06:48 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:16:45 host0 sssd[nss]: Starting up
Nov 16 11:19:02 host0 sssd[nss]: Shutting down
Nov 16 11:19:02 host0 sssd[pac]: Shutting down
Nov 16 11:19:02 host0 sssd[pam]: Shutting down
Nov 16 11:19:02 host0 sssd[sudo]: Shutting down
Nov 16 11:19:25 host0 systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE

----------------

(Wed Nov 16 11:16:45 2016) [sssd] [mark_service_as_started] (0x0200): Marking nss as started.
(Wed Nov 16 11:16:45 2016) [sssd] [mark_service_as_started] (0x0080): Invalid parent pid: 2443
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0040): Returned with: 0
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [nss][17955]
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Child [nss] terminated with a signal
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [rba.gov.au][9579]
(Wed Nov 16 11:19:25 2016) [sssd] [watchdog_handler] (0x0010): Watchdog timer overflow, killing process!
(Wed Nov 16 11:19:25 2016) [sssd] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Wed Nov 16 11:19:25 2016) [sssd] [sbus_remove_watch] (0x2000): 0x7f5e64f34730/0x7f5e64f35bf0

~~~
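
To see how long the monitor waited between asking the backend to terminate and the watchdog firing, the debug lines above can be parsed mechanically. The following is a minimal sketch that assumes the line layout seen in this excerpt, `(timestamp) [process] [function] (level): message`. The names `parse_line`, `seconds_between` and `SAMPLE` are made up for illustration and are not part of SSSD.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt in this report:
# (Wed Nov 16 11:19:25 2016) [sssd] [watchdog_handler] (0x0010): message
LINE_RE = re.compile(
    r"^\((?P<ts>[^)]+)\) \[(?P<proc>.+?)\] \[(?P<func>\w+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)


def parse_line(line):
    """Split one debug line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        # %a and %b are locale dependent; English names are assumed here.
        "time": datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y"),
        "process": m.group("proc"),
        "function": m.group("func"),
        "level": int(m.group("level"), 16),
        "message": m.group("msg"),
    }


def seconds_between(lines, first_func, second_func):
    """Seconds from the latest first_func entry to the next second_func entry."""
    start = None
    for entry in filter(None, map(parse_line, lines)):
        if entry["function"] == first_func:
            start = entry["time"]
        elif entry["function"] == second_func and start is not None:
            return (entry["time"] - start).total_seconds()
    return None


# Two lines copied from the log excerpt above.
SAMPLE = [
    "(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [rba.gov.au][9579]",
    "(Wed Nov 16 11:19:25 2016) [sssd] [watchdog_handler] (0x0010): Watchdog timer overflow, killing process!",
]
```

Calling `seconds_between(SAMPLE, "monitor_quit", "watchdog_handler")` gives the gap between the last `monitor_quit` entry and the watchdog message, which is 23 seconds for the two lines shown.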



We also see SSSD exiting with core dumps.

~~~
Nov 16 11:06:46 hlit-fsst-tep01.rba.gov.au audispd[1269]: node=hlit-fsst-tep01.rba.gov.au type=ANOM_ABEND msg=audit(1479254806.954:674): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=3118 comm="sssd_be" reason="memory violation" sig=6
Nov 16 11:06:46 hlit-fsst-tep01.rba.gov.au abrt-hook-ccpp[4266]: Process 3118 (sssd_be) of user 0 killed by SIGABRT - dumping core
~~~




Version-Release number of selected component (if applicable):

sssd-1.14.0-43.el7.x86_64


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

SSSD should be able to operate with enumeration enabled, as it did in RHEL 7.2.
There, SSSD was a bit slow with enumeration, but it worked.

Additional info:


From the backtrace we see that the process receives a signal and calls orderly_shutdown with status 1. Where is this signal coming from?

#31 0x00007f2448bd2a49 in __run_exit_handlers (status=status@entry=1, listp=0x7f2448f546c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#32 0x00007f2448bd2a95 in __GI_exit (status=status@entry=1) at exit.c:99
#33 0x00007f244cee898d in orderly_shutdown (status=1) at src/util/server.c:257
#34 <signal handler called>


Secondly, we see that it hits a problem when it tries to perform memory-related operations as part of orderly_shutdown.

#0  0x00007f2448bcf1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f2448bd08c8 in __GI_abort () at abort.c:90
#2  0x00007f244936717c in talloc_abort (reason=0x7f2449370638 "Bad talloc magic value - unknown value") at ../talloc.c:399
#3  0x00007f244936d469 in talloc_abort_unknown_value () at ../talloc.c:417
#4  talloc_chunk_from_ptr (ptr=0x7f245fa9e1d0) at ../talloc.c:436
#5  _talloc_free_internal (ptr=0x7f245fa9e1d0, location=0x7f2449370a12 "../talloc.c:2631") at ../talloc.c:1016
#6  0x00007f244936d01b in _talloc_free_children_internal (location=0x7f2449370a12 "../talloc.c:2631", ptr=0x7f245f9b5b10, tc=0x7f245f9b5ab0) at ../talloc.c:1525
#7  _talloc_free_internal (ptr=0x7f245f9b5b10, location=0x7f2449370a12 "../talloc.c:2631") at ../talloc.c:1072
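
The two backtrace excerpts show exit() and its exit handlers running from inside a signal handler, and the monitor log above shows a watchdog_handler firing. A tick-counting watchdog of that general shape can be sketched as follows. This is a simplified model and not SSSD's code: the class `TickWatchdog`, the `max_ticks` threshold and the `install` helper are all hypothetical. Python runs signal handlers between bytecode instructions in the main thread, so the sketch illustrates only the structure, not the async-signal-safety hazard that exists in C.

```python
import os
import signal


class TickWatchdog:
    """Simplified model of a tick-counting watchdog (not SSSD's actual code)."""

    def __init__(self, max_ticks=3, on_overflow=None):
        self.max_ticks = max_ticks  # hypothetical threshold
        self.ticks = 0
        # Default action: leave immediately WITHOUT running exit handlers
        # or cleanup code, so nothing complex runs in handler context.
        self.on_overflow = on_overflow or (lambda: os._exit(1))

    def timer_fired(self, signum=None, frame=None):
        """Signal-handler body: count one tick, bail out on overflow."""
        self.ticks += 1
        if self.ticks > self.max_ticks:
            self.on_overflow()

    def loop_alive(self):
        """Called by the event loop each time it gets a chance to run."""
        self.ticks = 0


def install(watchdog, interval_s):
    """Deliver SIGALRM to watchdog.timer_fired every interval_s seconds (Unix only)."""
    signal.signal(signal.SIGALRM, watchdog.timer_fired)
    signal.setitimer(signal.ITIMER_REAL, interval_s, interval_s)
```

If a long-running task keeps the event loop from calling `loop_alive()`, the ticks accumulate and the overflow action runs. In C, the counterpart of `os._exit()` is `_exit()`, which POSIX lists as async-signal-safe. `exit()` is not on that list because it runs the registered exit handlers, which is the path seen in frames #31 to #33.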

Comment 1 fjayalat 2016-11-21 07:01:03 UTC
Created attachment 1222311 [details]
backtrace from the corefile

backtrace from the corefile

Comment 2 fjayalat 2016-11-21 07:12:24 UTC
Created attachment 1222313 [details]
sosreport

Comment 22 Jatin Nansi 2016-12-08 06:21:50 UTC
I found that the watchdog timeout is actually the timeout configuration option in sssd.conf. Adding 'timeout=600' to all sections of sssd.conf resolved the issue on our test system. The customer's logs show the same symptoms, so I am going to ask them to add the timeout option to their sssd.conf. We will keep you posted on the progress of this case.

Thank you,
Jatin
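
The workaround described in this comment, adding the timeout option to every section, can be scripted. The sketch below uses Python's standard `configparser`. The helper `add_timeout` and the contents of `EXAMPLE_CONF` are invented for illustration and are not the customer's configuration. Note that rewriting a file this way drops comments, and that this only raises the watchdog timeout as in the comment above; it is not the fix that later shipped.

```python
import configparser
import io


def add_timeout(conf_text, seconds):
    """Return conf_text with `timeout = seconds` set in every section."""
    # interpolation=None: option values may contain literal '%' characters.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    for section in parser.sections():
        parser[section]["timeout"] = str(seconds)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal configuration, for illustration only.
EXAMPLE_CONF = """\
[sssd]
services = nss, pam
domains = example.com

[nss]

[pam]

[domain/example.com]
id_provider = ldap
enumerate = true
"""
```

`add_timeout(EXAMPLE_CONF, 600)` returns the same four sections, each with an added `timeout = 600` line.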

Comment 29 Jakub Hrozek 2017-02-01 13:05:38 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/3266

Comment 30 Jakub Hrozek 2017-02-01 13:06:21 UTC
master:
* e6a5f8c58539fc31fd81fac89cfc85703b4250ea
* 087162b85e191af51637904702813969b35eaadc 

sssd-1-14:
* 0606a71b698c4acf954ba7284e62acbd0aa5e52d
* 442985a7af2262fab57f56c7a8cd40af10081610

Comment 31 Jakub Hrozek 2017-02-01 13:09:19 UTC
*** Bug 1379774 has been marked as a duplicate of this bug. ***

Comment 43 Amith 2017-05-18 03:37:01 UTC
Verified the bug on SSSD version: sssd-1.15.2-29.el7.x86_64

Steps followed during verification:

1. Set up LDAP with 25000 users.

2. Set up the client with enumerate = true in sssd.conf.

3. Run user authentication and lookup to verify whether the sssd_be process hangs.

These steps are already automated in the SSSD Performance test suite for bz1418943, and the regression round has been executed.
See beaker link: https://beaker.engineering.redhat.com/jobs/1860968
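
For anyone repeating the lookup check by hand, a hang can be detected by running the lookup with a deadline. This is a rough sketch and not the Beaker test itself: the helper names and the timeout value are assumptions, and only the `getent -s sss passwd <user>` command is taken from the test log below.

```python
import subprocess
import sys


def getent_command(user):
    """Lookup command from the test log: query only the sss NSS source."""
    return ["getent", "-s", "sss", "passwd", user]


def responds_in_time(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        print(f"timed out after {timeout_s}s: {' '.join(cmd)}", file=sys.stderr)
        return False
    except FileNotFoundError:
        print(f"command not found: {cmd[0]}", file=sys.stderr)
        return False
    return result.returncode == 0
```

On a configured client, `responds_in_time(getent_command("puser15999"), 30)` would return False if the lookup blocks for longer than the chosen 30 seconds, which is an arbitrary value here.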

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: sssdbe can deadlock if a long running task like enumeration blocks it bz1418943
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   LOG    ] :: Sleeping for 5 seconds
:: [   PASS   ] :: Command 'getent -s sss passwd puser15999' (Expected 0, got 0)
:: [   LOG    ] :: Authentication successfull, as expected
:: [   LOG    ] :: Duration: 3m 16s
:: [   LOG    ] :: Assertions: 1 good, 0 bad
:: [   PASS   ] :: RESULT: sssdbe can deadlock if a long running task like enumeration blocks it bz1418943

Comment 44 errata-xmlrpc 2017-08-01 09:02:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2294

