Bug 972699

Summary: sssd terminates every now and then
Product: Red Hat Enterprise Linux 6
Reporter: Peter Åstrand <astrand>
Component: sssd
Assignee: Jakub Hrozek <jhrozek>
Status: CLOSED DUPLICATE
QA Contact: Kaushik Banerjee <kbanerje>
Severity: high
Priority: unspecified
Version: 6.4
CC: aaron, astrand, grajaiya, jgalipea, lslebodn, mkosek, okos, pbrezina, roysjosh
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-09-05 10:14:49 UTC

Attachments:
sssd_default.log-20130609.gz
sssd.log-20130609.gz
sssd_nss.log-20130609.gz
sssd_pam.log-20130609.gz
Core was generated by `/usr/libexec/sssd/sssd_pam --debug-to-files'.
core.2013-09-03T15:03.gz
core.2013-09-03T15:43.gz

Description Peter Åstrand 2013-06-10 12:30:46 UTC
We are running sssd with the LDAP backend on a 64-bit system. Every now and then, sssd terminates and users can no longer log in. This has happened on two different systems. Installed packages:

sssd-tools-1.9.2-82.7.el6_4.x86_64
libsss_idmap-1.9.2-82.7.el6_4.x86_64
sssd-client-1.9.2-82.7.el6_4.x86_64
sssd-1.9.2-82.7.el6_4.x86_64
sssd-client-1.9.2-82.7.el6_4.i686

Sometimes sssd runs for weeks without problems, but other times it terminates within hours.

Earlier, we saw segfaults in the kernel log:

May  6 21:23:35 lola kernel: sssd_pam[1988]: segfault at 10 ip 0000000000410fc0 sp 00007fff60221660 error 4 in sssd_pam (deleted)[400000+75000]
May  6 21:23:35 lola sssd[pam]: Starting up
May  6 21:23:37 lola sssd[pam]: Starting up
May  6 21:23:41 lola sssd[pam]: Starting up
May  6 21:23:41 lola sssd[be[default]]: Shutting down
May  6 21:23:41 lola kernel: sssd_nss[1987]: segfault at c4 ip 0000003252024fbc sp 00007ffff9ab8620 error 4 in libdbus-1.so.3.4.0[3252000000+40000]

However, we are not seeing this any longer. Instead, at the time sssd terminates, we see this:

/var/log/messages:
Jun  5 12:21:59 lucy sssd[pam]: Starting up
Jun  5 12:22:01 lucy sssd[pam]: Starting up
Jun  5 12:22:05 lucy sssd[pam]: Starting up
Jun  5 12:22:05 lucy sssd[nss]: Shutting down
Jun  5 12:22:05 lucy sssd[be[default]]: Shutting down


sssd_default.log:
(Wed Jun  5 12:21:58 2013) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 0,0,Success
(Wed Jun  5 12:21:58 2013) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 0,0,Success
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 0,0,Success
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [sdap_async_sys_connect_timeout] (0x0100): The LDAP connection timed out
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'demoldap.thinlinc.com' as 'not working'
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP'
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_pam_handler_callback] (0x0100): Backend returned: (1, 9, <NULL>) [Provider is Offline (Autentiseringstjänsten kan inte hämta autentiseringsinformation)]
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_pam_handler_callback] (0x0100): Sending result [9][default]
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_pam_handler_callback] (0x0100): Sent result [9][default]
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4099][1][name=postfix]
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [3][1][name=cenddrif]
(Wed Jun  5 12:21:59 2013) [sssd[be[default]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.


sssd_nss.log:
(Wed Jun  5 12:21:28 2013) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [postfix@default]
(Wed Jun  5 12:21:59 2013) [sssd[nss]] [nss_cmd_initgroups_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 1, 11, Fast reply - offline
Will try to return what we have in cache
(Wed Jun  5 12:21:59 2013) [sssd[nss]] [nss_cmd_initgroups] (0x0100): Requesting info for [postfix] from [<ALL>]
(Wed Jun  5 12:21:59 2013) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [postfix@default]


sssd_pam.log:
(Wed Jun  5 12:21:59 2013) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Wed Jun  5 12:21:59 2013) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Wed Jun  5 12:21:59 2013) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_default].
(Wed Jun  5 12:21:59 2013) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Wed Jun  5 12:21:59 2013) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Wed Jun  5 12:22:01 2013) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Wed Jun  5 12:22:01 2013) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Wed Jun  5 12:22:01 2013) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_default].
(Wed Jun  5 12:22:01 2013) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Wed Jun  5 12:22:01 2013) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Wed Jun  5 12:22:05 2013) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Wed Jun  5 12:22:05 2013) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Wed Jun  5 12:22:05 2013) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_default].
(Wed Jun  5 12:22:05 2013) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Wed Jun  5 12:22:05 2013) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector


sssd.log:
(Wed Jun  5 12:22:05 2013) [sssd] [mt_svc_exit_handler] (0x0010): Process [pam], definitely stopped!


SSH login then fails:

/var/log/messages:
Jun  5 12:22:27 lucy sshd[26687]: pam_sss(sshd:session): Request to sssd failed. Connection refused

Comment 2 Ondrej Kos 2013-06-10 12:48:55 UTC
Hi Peter,

Do you have a corefile for the segfault? It would help us a lot.
If you're worried about the content of the corefile, you could check it with the "strings" utility first for any private data.
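
For example, something like this would do (the core file path is just a placeholder):

strings /path/to/coredump | less
# or grep for specific sensitive strings:
strings /path/to/coredump | grep -i -e password -e secret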

Comment 3 Jakub Hrozek 2013-06-10 12:53:18 UTC
To get the core file or the backtrace, you can simply install and start abrt; it should collect the crash data automatically.
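
On RHEL 6 that would be roughly the following (a sketch; the usual package and service names are assumed):

yum install abrt abrt-addon-ccpp
service abrtd start
chkconfig abrtd on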

As for the case where sssd does not appear to segfault, can you attach the logs from the period when the problem happened? Maybe we could at least see what was going on. I'm concerned about the SSSD not reconnecting after a problem; that seems quite bad.

Comment 4 Peter Åstrand 2013-06-11 06:43:24 UTC
Unfortunately, we do not have the core files from the earlier segfaults. If a segfault happens on the new system, I will give it to you. 

Will attach the logs.

Comment 5 Peter Åstrand 2013-06-11 06:48:57 UTC
Created attachment 759442 [details]
sssd_default.log-20130609.gz

Comment 6 Peter Åstrand 2013-06-11 06:49:39 UTC
Created attachment 759443 [details]
sssd.log-20130609.gz

Comment 7 Peter Åstrand 2013-06-11 06:50:23 UTC
Created attachment 759444 [details]
sssd_nss.log-20130609.gz

Comment 8 Peter Åstrand 2013-06-11 06:50:50 UTC
Created attachment 759445 [details]
sssd_pam.log-20130609.gz

Comment 9 Pavel Březina 2013-06-19 11:50:50 UTC
Hi,
I'm sorry for the late answer. Did you manage to get the core file? It would help a lot. The sssd_pam process is terminated abruptly, so there should be a core file generated by abrt (if it is enabled).

I see one scenario twice in the logs. SSSD gets requests for quite a lot of users; perhaps a /home directory is being listed? This enumeration is interrupted by an ssh login of the user "cenddrif", returning success to sssd_pam. Does this user have some special values in LDAP? Are you able to log in as this user?

Also, can you test whether setting 'client_idle_timeout = 600' in your sssd.conf domain section [domain/default] helps?
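
For clarity, the suggested change would look something like this in sssd.conf (section and value as described above):

[domain/default]
client_idle_timeout = 600

A restart of the sssd service is needed for the change to take effect.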

Thank you.

Comment 10 Peter Åstrand 2013-06-19 12:54:51 UTC
As I mentioned in comment #4, we do not have the core files from the earlier segfaults. This server is no longer part of our organization.

"cenddrif" is a pretty normal account, used by the Nagios checks. The login works fine. 

Strangely enough, we haven't seen this problem for a while. I've added 'client_idle_timeout = 600' anyway.

Comment 11 Jakub Hrozek 2013-06-20 11:21:48 UTC
I'll set the needinfo flag again to make it clear we need the core file in order to provide a fix and leave the bug open for a while longer.

Comment 12 Peter Åstrand 2013-07-01 09:09:58 UTC
sssd on usdemo.thinlinc.com died again this weekend. It's difficult to tell whether it actually crashed or not; no trace of any segfault was found in dmesg.

Historically, we have had some problems with missing core dumps when ABRT is installed. I would love to uninstall ABRT, but unfortunately this is not possible because sabayon (which we use) depends on it. But I have at least stopped the abrtd service. 

Also, I've added:

DAEMON_COREFILE_LIMIT=999999

...to /etc/init.d/sssd. I've verified that "kill -TRAP" on the sssd process generates a core file in /, so hopefully, we will get a core file the next time it crashes. If it *is* a segfault...
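
For reference, the change and the verification looked roughly like this (the PID is a placeholder for the actual sssd process):

# added to /etc/init.d/sssd
DAEMON_COREFILE_LIMIT=999999

# verification: SIGTRAP makes the process dump core
kill -TRAP <sssd PID>
ls -l /core*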

Comment 13 Jakub Hrozek 2013-07-08 08:32:01 UTC
(In reply to Peter Åstrand from comment #12)
> sssd on usdemo.thinlinc.com died again this weekend. It's difficult to tell
> if it actually crashed or not - no traces found in dmesg about any segfault. 
> 

Anything interesting in the sssd log files? Fatal errors would have been printed even without an explicit debug_level setting.
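
If nothing stands out, temporarily raising debug_level would give us more context; a minimal sketch (section names taken from this report, the level is just a suggestion):

[domain/default]
debug_level = 6

[pam]
debug_level = 6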

> Historically, we have had some problems with missing core dumps when ABRT is
> installed. I would love to uninstall ABRT, but unfortunately this is not
> possible because sabayon (which we use) depends on it. But I have at least
> stopped the abrtd service. 
> 
> Also, I've added:
> 
> DAEMON_COREFILE_LIMIT=999999
> 
> ...to /etc/init.d/sssd. I've verified that "kill -TRAP" on the sssd process
> generates a core file in /, so hopefully, we will get a core file the next
> time it crashes. If it *is* a segfault...

OK, sounds good, thank you for following up and for the patience.

Comment 14 Jakub Hrozek 2013-08-09 16:30:31 UTC
Hi Peter,

I think this bug is actually the same as https://bugzilla.redhat.com/show_bug.cgi?id=984814, for which we have a candidate patch. If you're interested and still seeing the problem, we can build you test packages.

Comment 15 Peter Åstrand 2013-08-12 08:04:00 UTC
Created attachment 785594 [details]
Core was generated by `/usr/libexec/sssd/sssd_pam --debug-to-files'.

sssd died last friday again. Attaching core.

Comment 16 Peter Åstrand 2013-08-12 08:06:22 UTC
(In reply to Jakub Hrozek from comment #14)
> Hi Peter,
> 
> I think this bug is actually the same as
> https://bugzilla.redhat.com/show_bug.cgi?id=984814 for which we have a

"You are not authorized to access bug #984814."

Comment 18 Lukas Slebodnik 2013-08-13 07:35:45 UTC
It is clear that there were two segfaults (https://bugzilla.redhat.com/show_bug.cgi?id=972699#c0): the first in sssd_pam and the second in sssd_nss.

>May  6 21:23:35 lola kernel: sssd_pam[1988]: segfault at 10 ip 0000000000410fc0 sp 00007fff60221660 error 4 in sssd_pam (deleted)[400000+75000]
>May  6 21:23:35 lola sssd[pam]: Starting up
>May  6 21:23:37 lola sssd[pam]: Starting up
>May  6 21:23:41 lola sssd[pam]: Starting up
>May  6 21:23:41 lola sssd[be[default]]: Shutting down
>May  6 21:23:41 lola kernel: sssd_nss[1987]: segfault at c4 ip 0000003252024fbc sp 00007ffff9ab8620 error 4 in libdbus-1.so.3.4.0[3252000000+40000]

Jakub thought the second crash might be a duplicate of bug #984814 (a private bug).
The upstream ticket for that bug is located here: https://fedorahosted.org/sssd/ticket/2018
We cannot be sure without a coredump whether it is the same problem.

The first crash looked very similar to upstream ticket https://fedorahosted.org/sssd/ticket/1869. After a deep analysis of the coredump, we found out that it is a different bug. The cause of the crash is a dereference of a NULL pointer.

Thank you very much for the coredump. It was very helpful, but I need to discuss this problem with more experienced sssd developers. I am not really sure how that pointer could end up being NULL.

Comment 19 Peter Åstrand 2013-09-04 06:32:15 UTC
sssd crashed again for us a few hours ago. I can see that both upstream bugs are now fixed, so what about releasing an updated RPM package?

Comment 20 Jakub Hrozek 2013-09-04 06:45:33 UTC
Hi,

There are two issues. One is a crash in the memory cache code, which has been fixed and will be released as an update no later than 6.5 (perhaps sooner, details TBD). I can give you a 6.5 candidate built for 6.4 now if you like; testing is always very much welcome.

The other issue is the NULL dereference in the PAM responder. Unfortunately, the coredump tells us where the SSSD crashed, but not how we got there. But you're right that we need to have some kind of resolution here. I will revive the discussion Lukas started on the upstream mailing list and suggest that we at least add a NULL check and several additional DEBUG statements, which would prevent the crash in the future and give us a better hint about what happened.

Comment 21 Jakub Hrozek 2013-09-04 06:47:56 UTC
By the way, given that you are hit by this issue in a production environment, I suspect that running the pam responder under valgrind would not be a viable option? There would be some performance penalty, but functionality should be retained.

Comment 22 Peter Åstrand 2013-09-04 07:38:23 UTC
(In reply to Jakub Hrozek from comment #20)
> Hi,
> 
> There are two issues. One is a crash in the memory cache code which has been
> fixed and will be released as an update no later than 6.5 (perhaps sooner,
> details tbd). I can give you a 6.5 candidate built for 6.4 now if you like,
> testing is always very much welcome.

Since it only crashes once a month or so right now, I think we can wait. 


>By the way given that you are hit by this issue in a production environment, I suspect that running the pam responder under valgrind would not be a viable option? There would be some performance penalty, but functionality should be retained.

I'd like to avoid this if possible. It's a remote system.

Comment 23 Lukas Slebodnik 2013-09-04 08:06:37 UTC
Could you also attach the last coredump? I think it can confirm our assumption
about the NULL pointer dereference and that it is not the same problem as
https://fedorahosted.org/sssd/ticket/1869
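
For reference, the backtrace can be extracted from such a core like this (paths are examples; the matching sssd debuginfo packages need to be installed first):

debuginfo-install sssd
gdb /usr/libexec/sssd/sssd_pam /path/to/core
(gdb) bt full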

Comment 24 Peter Åstrand 2013-09-04 08:46:56 UTC
Created attachment 793556 [details]
core.2013-09-03T15:03.gz

I actually got two core files within the hour, attaching both.

Comment 25 Peter Åstrand 2013-09-04 08:47:27 UTC
Created attachment 793562 [details]
core.2013-09-03T15:43.gz

Comment 26 Lukas Slebodnik 2013-09-04 10:16:36 UTC
Thank you very much for these coredumps.
I analysed them and it is the same problem as https://fedorahosted.org/sssd/ticket/1869 (which has already been fixed).
Jakub can provide a test package for you.

Comment 27 Jakub Hrozek 2013-09-04 14:32:42 UTC
Hi Peter,

you can download the test builds here:
http://jhrozek.fedorapeople.org/sssd-test-builds/sssd-pam-crash-rhbz972699/

It's the latest 6.4.z packages plus two fixes: one that fixes the crashes in the memcache and one that prevents the pam responder from crashing on a timeout.

Comment 28 Jakub Hrozek 2013-09-05 10:14:49 UTC
I'm going to close this bug as a duplicate of bug #951086, as Lukas' investigation shows they are the same. Feel free to use the test packages from comment #27.

Thank you for reporting the issue and working with us towards fixing it!

*** This bug has been marked as a duplicate of bug 951086 ***

Comment 29 Peter Åstrand 2013-09-05 10:33:34 UTC
(In reply to Jakub Hrozek from comment #28)
> I'm going to close this bug as duplicate of #951086 as Lukas' investigation
> shows they are the same. Feel free to use the test packages in comment #27.
> 
> Thank  you for reporting the issue and working with us towards fixing it!
> 
> *** This bug has been marked as a duplicate of bug 951086 ***

I think this is totally wrong, since bug 951086 is private. It is certainly very impractical, since it will be impossible, for example, to follow the progress of how and when updated RPMs become available.

IMO, DUPLICATE should be used for cases where two users accidentally reported the same bug twice, and where a public user can easily abandon their own bug in favour of another. In some rating systems, you will also actually get a penalty when reporting a bug which is later closed as DUPLICATE, because the reporter did not try hard enough to find a suitable existing report before creating another one. I don't want to look like a fool just because you have some internal information that you don't want to reveal. (Also, I don't understand why you do not use the built-in access features of Bugzilla, which make it possible to restrict comments/attachments/etc., so that you can use one single bug.)

Would you please reconsider this DUPLICATE decision? Why not close the internal report as DUPLICATE instead?

Comment 30 Jakub Hrozek 2013-09-05 17:04:27 UTC
(In reply to Peter Åstrand from comment #29)
> (In reply to Jakub Hrozek from comment #28)
> > I'm going to close this bug as duplicate of #951086 as Lukas' investigation
> > shows they are the same. Feel free to use the test packages in comment #27.
> > 
> > Thank  you for reporting the issue and working with us towards fixing it!
> > 
> > *** This bug has been marked as a duplicate of bug 951086 ***
> 

I'm very sorry about the negative experience you had, Peter. To be honest, at first I didn't realize the other bug was marked as private. Let me try to explain what's going on:

> I think this is totally wrong, since bug 951086 is private. It is certainly
> very impractical, since it will be, for example, impossible of following the
> progress of how and when updated RPMs are available.
> 

You're completely right. However, that bug came through our support systems, and for various reasons, mostly because content copied from the support system might include sensitive data, such bugs are marked private by default.

> IMO, DUPLICATE should be used for cases when two users accidently reported
> the same bug twice, and when a public user can easily abandon their own bug
> in favour of another. 

I agree.

> In some rating systems, you will also actually get a
> penalty when reporting a bug which was later closed as DUPLICATE, because
> the reporter did not try hard enough of finding a suitable existing report
> before creating another one. I don't want to look like a fool just because
> you have some internal information that you don't want to reveal.

I assure you that there is no penalty, and I very much value the fact that you took the time to report the bug and worked with us towards confirming it is a known issue. Actually, I don't think there was any other way, because, as you noted, you couldn't have known about the other bug. Also, the problem you hit was not a trivial one and required careful investigation of the core file.

>  (Also, I
> don't understand why you do not use the built in access features of
> Bugzilla, which makes it possible to restrict comments/attachements/etc, so
> that you can use one single bug.)
> 

I would like to make the bugzilla public and only mark certain comments as private, as you suggested, but I need to check whether our process allows that. I'm sorry about the inconvenience until this issue is sorted out.

> Would you please reconsider this DUPLICATE decision? Why not close the
> internal report as DUPLICATE instead?

Because that bug was already approved for inclusion in the next RHEL release, and marking that one as a duplicate of this one would mean the approval has to be redone.

Comment 31 Jakub Hrozek 2013-09-05 17:34:49 UTC
I marked bug 951086 as public. You should be able to see it now.

Comment 32 Peter Åstrand 2013-09-06 07:16:50 UTC
(In reply to Jakub Hrozek from comment #31)
> I marked bug 951086 as public. You should be able to see it now.

Thanks.