Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1396912

Summary:

If a long-running task (e.g. enumeration) blocks the sssd_be process, sssd_be can deadlock

Product:

Red Hat Enterprise Linux 7

Reporter:

fjayalat

Component:

sssd

Assignee:

SSSD Maintainers <sssd-maint>

Status:

CLOSED ERRATA

QA Contact:

Amith <apeetham>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

7.3

CC:

afarley, apeetham, arajendr, arusso, asakure, atolani, christoph.maser, den1987, fidencio, fjayalat, gparente, grajaiya, hartsjc, jhrozek, jnansi, knweiss, lslebodn, mkosek, mzidek, pbrezina, sgoveas, toby, tscherf

Target Milestone:

Keywords:

ZStream

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

sssd-1.15.0-2.el7

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1418943

Environment:

Last Closed:

2017-08-01 09:02:33 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1416780

Bug Blocks:

1418943

Attachments:

Description	Flags
backtrace from the corefile	none
sosreport	none

Description fjayalat 2016-11-21 06:56:51 UTC

Description of problem:

If enumeration is enabled, SSSD seems to hang a few minutes after a restart.

The customer mentioned that after running into this issue, none of the LDAP or *LOCAL* users are able to log in. They have to restart the system manually.

They used to run RHEL 7.2 with enumeration. After upgrading, SSSD stops being operational after a few minutes.

We see lots of errors and are unable to tell whether they are all related.

~~~

Nov 16 11:02:31 host0 systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
Nov 16 11:04:01 host0 systemd[1]: sssd.service stop-final-sigterm timed out. Killing.
Nov 16 11:04:01 host0 systemd[1]: Unit sssd.service entered failed state.
Nov 16 11:04:01 host0 systemd[1]: sssd.service failed.
Nov 16 11:05:24 host0 sssd: Starting up
Nov 16 11:05:24 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:05:24 host0 sssd[nss]: Starting up
Nov 16 11:05:24 host0 sssd[pam]: Starting up
Nov 16 11:05:24 host0 sssd[sudo]: Starting up
Nov 16 11:05:24 host0 sssd[pac]: Starting up
Nov 16 11:06:05 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:06:46 host0 abrt-hook-ccpp: Process 3118 (sssd_be) of user 0 killed by SIGABRT - dumping core
Nov 16 11:06:48 host0 sssd[be[rba.gov.au]]: Starting up
Nov 16 11:16:45 host0 sssd[nss]: Starting up
Nov 16 11:19:02 host0 sssd[nss]: Shutting down
Nov 16 11:19:02 host0 sssd[pac]: Shutting down
Nov 16 11:19:02 host0 sssd[pam]: Shutting down
Nov 16 11:19:02 host0 sssd[sudo]: Shutting down
Nov 16 11:19:25 host0 systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE

----------------

(Wed Nov 16 11:16:45 2016) [sssd] [mark_service_as_started] (0x0200): Marking nss as started.
(Wed Nov 16 11:16:45 2016) [sssd] [mark_service_as_started] (0x0080): Invalid parent pid: 2443
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0040): Returned with: 0
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [nss][17955]
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Child [nss] terminated with a signal
(Wed Nov 16 11:19:02 2016) [sssd] [monitor_quit] (0x0020): Terminating [rba.gov.au][9579]
(Wed Nov 16 11:19:25 2016) [sssd] [watchdog_handler] (0x0010): Watchdog timer overflow, killing process!
(Wed Nov 16 11:19:25 2016) [sssd] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Wed Nov 16 11:19:25 2016) [sssd] [sbus_remove_watch] (0x2000): 0x7f5e64f34730/0x7f5e64f35bf0

~~~~
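The `watchdog_handler` line in the log above ("Watchdog timer overflow, killing process!") fits the picture from the summary: a long-running task keeps the process from getting back to its main loop, so the watchdog never gets reset. Below is a toy model of that kind of heartbeat watchdog. It is only a sketch and not SSSD's code; the `Watchdog` class, the `run` helper, and the limit of 3 ticks are assumptions made up for illustration.

```python
class Watchdog:
    """Toy heartbeat watchdog (illustrative model, not SSSD's implementation)."""

    def __init__(self, max_ticks=3):
        # max_ticks is an assumed limit: how many timer ticks may pass
        # without a heartbeat before the watchdog gives up.
        self.max_ticks = max_ticks
        self.ticks = 0
        self.expired = False

    def heartbeat(self):
        # Called by the main loop whenever it gets a chance to run.
        self.ticks = 0

    def timer_tick(self):
        # Called by a periodic timer, independently of the main loop.
        self.ticks += 1
        if self.ticks > self.max_ticks:
            self.expired = True
        return self.expired


def run(schedule, max_ticks=3):
    """Replay a list of "tick"/"beat" events.

    Returns the index of the tick at which the watchdog expires,
    or None if it never expires.
    """
    wd = Watchdog(max_ticks)
    for index, event in enumerate(schedule):
        if event == "beat":
            wd.heartbeat()
        elif wd.timer_tick():
            return index
    return None
```

With a healthy loop, `run(["tick", "beat"] * 5)` returns `None`. With a blocked loop that never sends a heartbeat, `run(["tick"] * 10)` returns `3`, i.e. the watchdog fires on the fourth tick.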



We also see SSSD exiting with core dumps.

~~~
Nov 16 11:06:46 hlit-fsst-tep01.rba.gov.au audispd[1269]: node=hlit-fsst-tep01.rba.gov.au type=ANOM_ABEND msg=audit(1479254806.954:674): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=3118 comm="sssd_be" reason="memory violation" sig=6
Nov 16 11:06:46 hlit-fsst-tep01.rba.gov.au abrt-hook-ccpp[4266]: Process 3118 (sssd_be) of user 0 killed by SIGABRT - dumping core
~~~




Version-Release number of selected component (if applicable):

sssd-1.14.0-43.el7.x86_64


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

SSSD should be able to operate with enumeration enabled, as it did in RHEL 7.2.
SSSD was a bit slow with enumeration, but it worked.

Additional info:


From the backtrace we see that we are receiving a signal and calling orderly_shutdown with status 1. Where is this signal coming from?

#31 0x00007f2448bd2a49 in __run_exit_handlers (status=status@entry=1, listp=0x7f2448f546c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#32 0x00007f2448bd2a95 in __GI_exit (status=status@entry=1) at exit.c:99
#33 0x00007f244cee898d in orderly_shutdown (status=1) at src/util/server.c:257
#34 <signal handler called>


Secondly, we see it hitting an issue when it tries to do some memory-related operations as part of orderly_shutdown.

#0  0x00007f2448bcf1d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f2448bd08c8 in __GI_abort () at abort.c:90
#2  0x00007f244936717c in talloc_abort (reason=0x7f2449370638 "Bad talloc magic value - unknown value") at ../talloc.c:399
#3  0x00007f244936d469 in talloc_abort_unknown_value () at ../talloc.c:417
#4  talloc_chunk_from_ptr (ptr=0x7f245fa9e1d0) at ../talloc.c:436
#5  _talloc_free_internal (ptr=0x7f245fa9e1d0, location=0x7f2449370a12 "../talloc.c:2631") at ../talloc.c:1016
#6  0x00007f244936d01b in _talloc_free_children_internal (location=0x7f2449370a12 "../talloc.c:2631", ptr=0x7f245f9b5b10, tc=0x7f245f9b5ab0) at ../talloc.c:1525
#7  _talloc_free_internal (ptr=0x7f245f9b5b10, location=0x7f2449370a12 "../talloc.c:2631") at ../talloc.c:1072

Comment 1 fjayalat 2016-11-21 07:01:03 UTC

Created attachment 1222311
backtrace from the corefile

Comment 2 fjayalat 2016-11-21 07:12:24 UTC

Created attachment 1222313
sosreport

Comment 22 Jatin Nansi 2016-12-08 06:21:50 UTC

I found that the watchdog timeout is actually the timeout configuration option in sssd.conf. Adding 'timeout=600' to all sections of sssd.conf resolved the issue on our test system. The customer's logs show the same symptoms, so I am going to ask them to add the timeout option to their sssd.conf. We will keep you posted on the progress in this case.

Thank you,
Jatin
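For anyone who wants to try the same workaround, here is a rough sketch of adding `timeout = 600` to every section of an sssd.conf-style file with Python's standard `configparser`. The `add_timeout` helper, the `SAMPLE` contents, and the `example.com` domain are hypothetical and not taken from the customer's configuration. Note that `configparser` drops comments when writing, so treat this as an illustration rather than a tool to run against a production file.

```python
import configparser
import io


def add_timeout(conf_text, seconds=600):
    """Return conf_text with 'timeout = <seconds>' set in every section."""
    # RawConfigParser avoids '%' interpolation, which sssd.conf does not use.
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    for section in parser.sections():
        parser.set(section, "timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal configuration, for illustration only.
SAMPLE = """\
[sssd]
services = nss, pam
domains = example.com

[nss]

[domain/example.com]
id_provider = ldap
enumerate = true
"""
```

Calling `print(add_timeout(SAMPLE))` prints the same three sections, each with a `timeout = 600` line added.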

Comment 29 Jakub Hrozek 2017-02-01 13:05:38 UTC

Upstream ticket:
https://fedorahosted.org/sssd/ticket/3266

Comment 30 Jakub Hrozek 2017-02-01 13:06:21 UTC

master:
* e6a5f8c58539fc31fd81fac89cfc85703b4250ea
* 087162b85e191af51637904702813969b35eaadc 

sssd-1-14:
* 0606a71b698c4acf954ba7284e62acbd0aa5e52d
* 442985a7af2262fab57f56c7a8cd40af10081610

Comment 31 Jakub Hrozek 2017-02-01 13:09:19 UTC

*** Bug 1379774 has been marked as a duplicate of this bug. ***

Comment 43 Amith 2017-05-18 03:37:01 UTC

Verified the bug on SSSD version: sssd-1.15.2-29.el7.x86_64

Steps followed during verification:

1. Set up LDAP with 25000 users.

2. Set up the client with enumerate = true in sssd.conf.

3. Run user auth and lookup to verify whether the sssd_be process hangs.

These steps are already automated in the SSSD Performance test suite for bz1418943; the regression round has been executed.
See beaker link: https://beaker.engineering.redhat.com/jobs/1860968

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: sssdbe can deadlock if a long running task like enumeration blocks it bz1418943
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   LOG    ] :: Sleeping for 5 seconds
:: [   PASS   ] :: Command 'getent -s sss passwd puser15999' (Expected 0, got 0)
:: [   LOG    ] :: Authentication successfull, as expected
:: [   LOG    ] :: Duration: 3m 16s
:: [   LOG    ] :: Assertions: 1 good, 0 bad
:: [   PASS   ] :: RESULT: sssdbe can deadlock if a long running task like enumeration blocks it bz1418943
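The lookup check in the run above can be sketched outside the test suite as a command with a time limit, so that a hung sssd_be shows up as a failure instead of blocking the script forever. This is a minimal sketch and not the Beaker test itself; the `lookup_completes` helper and the 30 second default are assumptions.

```python
import subprocess


def lookup_completes(command, timeout_s=30.0):
    """Return True if command exits with status 0 within timeout_s seconds.

    A hang (timeout), a missing binary, or a non-zero exit all count
    as failure.
    """
    try:
        proc = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return proc.returncode == 0
```

On a client set up as in the steps above, the call would be `lookup_completes(["getent", "-s", "sss", "passwd", "puser15999"])`, using the same command that appears in the test log.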

Comment 44 errata-xmlrpc 2017-08-01 09:02:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2294