Bug 1261186

Summary: sssd_be general protection in libsss_idmap.so.0.4.0.
Product: Red Hat Enterprise Linux 7
Reporter: Todd H. Poole <toddhpoolework>
Component: sssd
Assignee: Jakub Hrozek <jhrozek>
Status: CLOSED WORKSFORME
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.1
CC: abokovoy, grajaiya, jgalipea, jhrozek, lslebodn, mkosek, mzidek, pbrezina, preichl, rharwood, sbose, ssorce, toddhpoolework
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-07 16:38:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Todd H. Poole 2015-09-08 19:59:34 UTC
Description of problem:
kernel: traps: sssd_be general protection in libsss_idmap.so.0.4.0.

After several days of light use (30 or so users logging in and out several times a day), sssd_be begins misbehaving with the following messages showing up in /var/log/messages immediately after each login attempt:

Sep  8 10:59:48 redacted systemd: Starting Session 698 of user toddhpoole.
Sep  8 10:59:48 redacted systemd: Started Session 698 of user toddhpoole.
Sep  8 10:59:48 redacted systemd-logind: New session 698 of user toddhpoole.
Sep  8 10:59:48 redacted systemd-logind: Removed session 698.
Sep  8 10:59:48 redacted kernel: traps: sssd_be[32615] general protection ip:7f97fd56d331 sp:7fff0906baf0 error:0 in libsss_idmap.so.0.4.0[7f97fd56a000+5000]

This prevents users from successfully logging in.

Neither restarting the sssd service (via 'systemctl restart sssd') nor purging the sss cache (via 'sss_cache -E') and then trying to reconnect via SSH appears to resolve the issue:

Sep  8 11:15:35 redacted systemd: Started System Security Services Daemon.
Sep  8 11:15:51 redacted kernel: traps: sssd_be[792] general protection ip:7fac644d2331 sp:7fff195e5c30 error:0 in libsss_idmap.so.0.4.0[7fac644cf000+5000]
Sep  8 11:15:51 redacted sssd[be[redacted.domain.local]]: Starting up
Sep  8 11:17:17 redacted kernel: traps: sssd_be[800] general protection ip:7ffa2b176331 sp:7fff60e65370 error:0 in libsss_idmap.so.0.4.0[7ffa2b173000+5000]
Sep  8 11:17:17 redacted sssd[be[redacted.domain.local]]: Starting up
Sep  8 11:17:53 redacted kernel: traps: sssd_be[814] general protection ip:7fb30dab4331 sp:7fff9038f100 error:0 in libsss_idmap.so.0.4.0[7fb30dab1000+5000]
Sep  8 11:17:53 redacted sssd[be[redacted.domain.local]]: Starting up
Sep  8 11:19:06 redacted kernel: traps: sssd_be[822] general protection ip:7f4030cc0331 sp:7fff4a949aa0 error:0 in libsss_idmap.so.0.4.0[7f4030cbd000+5000]
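For anyone triaging similar traps, the offset of the faulting instruction inside the library can be recovered from the kernel line: it is the ip value minus the mapping base address shown in brackets. A minimal sketch using the first trap line quoted above; the addr2line step at the end is an assumption that the matching sssd-debuginfo package is installed (it does appear in the package list below), and the library path is the usual el7 location:

```shell
# From: sssd_be[32615] general protection ip:7f97fd56d331 ... in
#       libsss_idmap.so.0.4.0[7f97fd56a000+5000]
# The faulting offset inside the library is ip minus the mapping base.
ip=0x7f97fd56d331
base=0x7f97fd56a000
offset=$(printf '0x%x' $((ip - base)))
echo "$offset"    # prints 0x3331
# With debuginfo installed, the offset can be mapped to a source line:
# addr2line -e /usr/lib64/libsss_idmap.so.0.4.0 "$offset"
```

Note that all four trap lines above resolve to the same offset (0x3331), which suggests the same code path is crashing each time.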

Deleting the cache database file (via 'rm -f /var/lib/sss/db/cache_DOMAIN_redacted.ldb') and then restarting sssd (via 'systemctl restart sssd') does appear to resolve the issue.

Version-Release number of selected component (if applicable):
[root@redacted ~]# sssd --version
1.12.2
[root@redacted ~]# rpm -qa | grep "sssd"
sssd-ldap-1.12.2-58.el7_1.6.x86_64
sssd-common-pac-1.12.2-58.el7_1.6.x86_64
sssd-ad-1.12.2-58.el7_1.6.x86_64
python-sssdconfig-1.12.2-58.el7_1.6.noarch
sssd-krb5-common-1.12.2-58.el7_1.6.x86_64
sssd-krb5-1.12.2-58.el7_1.6.x86_64
sssd-ipa-1.12.2-58.el7_1.6.x86_64
sssd-1.12.2-58.el7_1.6.x86_64
sssd-debuginfo-1.12.2-58.el7_1.6.x86_64
sssd-client-1.12.2-58.el7_1.6.x86_64
sssd-common-1.12.2-58.el7_1.6.x86_64
sssd-proxy-1.12.2-58.el7_1.6.x86_64

How reproducible:
The issue presents itself sporadically: in most cases after about 6 or 7 days of uptime, though I've witnessed failures starting to occur as early as 12 hours and as late as 16 days. Deleting the cache database file and then restarting sssd appears to reset the timer.

Steps to Reproduce:
1. Wait several days with a small number of users logging in and out of the system several times a day.
2. Observe that after several days sssd_be begins failing, preventing users from logging in.
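Since the failure window is so variable, affected hosts can at least be detected automatically by watching the system log for the crash signature. A small sketch, demonstrated here against one of the sample lines from this report rather than a live /var/log/messages:

```shell
# Match the kernel trap signature quoted in this report.
pattern='general protection .* in libsss_idmap\.so'
line='Sep  8 10:59:48 redacted kernel: traps: sssd_be[32615] general protection ip:7f97fd56d331 sp:7fff0906baf0 error:0 in libsss_idmap.so.0.4.0[7f97fd56a000+5000]'
if echo "$line" | grep -Eq "$pattern"; then
    echo "crash signature found"
fi
# On a real host: grep -E "$pattern" /var/log/messages
```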

Actual results:
sssd_be fails.

Expected results:
sssd_be does not fail.

Additional info:
Google yields a surprisingly small number of hits for this failure. I see there was a similar bug posted last year (https://bugzilla.redhat.com/show_bug.cgi?id=1079237), but that was purportedly fixed by the version I have. There's also this posting (http://freeipa-users.redhat.narkive.com/0Hl9a7R4/sssd-sssd-be-crashing-on-rhel-6-2), which references the same general protection fault, but that user's environment differs significantly from ours. Core dumps are not available for this machine, and /var/log/sssd/sssd.log is 0 B in size (as is /var/log/sssd/sssd_redacted.domain.local). If this is a true bug and not a DUPLICATE of a pre-existing one that I might have missed, I'd be happy to install abrt on this machine to capture core dumps, or to provide additional logs with sensitive information redacted if requested.
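Since the sssd logs are empty, one follow-up step would be raising the back end's verbosity so the per-domain log captures activity before the next crash. A minimal sketch, with the caveats that the domain section below is a stand-in for this report's redacted domain and that a temp file is used here instead of the live /etc/sssd/sssd.conf; debug_level itself is a standard sssd.conf option:

```shell
# Sketch: add debug_level to the domain section so sssd_be writes details
# to /var/log/sssd/sssd_<domain>.log. Demonstrated on a temp copy; the
# [domain/...] contents here are placeholders, not this machine's config.
conf=$(mktemp)
printf '[domain/redacted.domain.local]\nid_provider = ad\n' > "$conf"
printf 'debug_level = 6\n' >> "$conf"
cat "$conf"
rm -f "$conf"
# On the real host, add the same line under the domain section, then:
# systemctl restart sssd
```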

Thank you,
Todd H. Poole

Comment 1 Lukas Slebodnik 2015-09-09 08:21:29 UTC
(In reply to Todd H. Poole from comment #0)
> [...]
> Version-Release number of selected component (if applicable):
> [root@redacted ~]# sssd --version
> 1.12.2
> [root@redacted ~]# rpm -qa | grep "sssd"
> sssd-ldap-1.12.2-58.el7_1.6.x86_64
> sssd-common-pac-1.12.2-58.el7_1.6.x86_64
> sssd-ad-1.12.2-58.el7_1.6.x86_64
> python-sssdconfig-1.12.2-58.el7_1.6.noarch
> sssd-krb5-common-1.12.2-58.el7_1.6.x86_64
> sssd-krb5-1.12.2-58.el7_1.6.x86_64
> sssd-ipa-1.12.2-58.el7_1.6.x86_64
> sssd-1.12.2-58.el7_1.6.x86_64
> sssd-debuginfo-1.12.2-58.el7_1.6.x86_64
> sssd-client-1.12.2-58.el7_1.6.x86_64
> sssd-common-1.12.2-58.el7_1.6.x86_64
> sssd-proxy-1.12.2-58.el7_1.6.x86_64
> 

This BZ is filed against Fedora 22, which has sssd-1.12.5, so please try to test with that version.

If you meant to file against el7.1, it would be better to use a different component in Bugzilla. BTW, there have been some updates in el7; you might try to reproduce with 1.12.2-58.el7_1.14.

If the problem persists, you might also want to test with the backported version from Fedora 22:
https://copr.fedoraproject.org/coprs/lslebodn/sssd-1-12/
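A quick way to check whether a host already carries the suggested el7 build is a version comparison. A sketch using sort -V, which happens to order these particular version strings correctly; in practice the installed value would come from rpm -q rather than being hardcoded:

```shell
# Compare the installed sssd build against the el7 update suggested above.
fixed="1.12.2-58.el7_1.14"
installed="1.12.2-58.el7_1.6"   # e.g. rpm -q --qf '%{VERSION}-%{RELEASE}\n' sssd
lowest=$(printf '%s\n%s\n' "$fixed" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
    echo "update needed: yum update 'sssd*'"
else
    echo "already at or past the fixed build"
fi
```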

Comment 2 Todd H. Poole 2015-09-09 22:55:11 UTC
Thank you for the feedback Lukas, I'll give those a shot.

Unfortunately, given the relatively infrequent nature of these failures, it'll be several days (if not weeks) before I'm able to report back again. Expect an update no later than the 24th of September.

For the sake of correctness, I've also gone ahead and updated this report's components to more accurately reflect the actual environment and installed packages.

Comment 4 Jakub Hrozek 2015-09-10 07:57:10 UTC
Thank you for changing the bug version.

I'll set the needinfo flag to make it clear we need the corefile or logs (ideally with a totally up-to-date version) to help with the issue.

Comment 5 Todd H. Poole 2015-09-24 21:54:13 UTC
It's been 15 days since the latest patch/bugfix release was applied to one of our test/staging clusters, and I'm pleased to report we've not seen the issue return.

We've begun deploying these changes to our production environment, which will have a significantly higher number of users exercising these services. If these failures do not return after a few weeks of heavy load in our production environment, then I think we can consider this issue resolved.

I'll report back if anything changes. Thanks gentlemen.

Comment 6 Jakub Hrozek 2015-09-25 08:13:51 UTC
(In reply to Todd H. Poole from comment #5)
> It's been 15 days since the latest patch/bugfix release was applied to one
> of our test/staging clusters, and I'm pleased to report we've not seen the
> issue return.
> 
> We've begun deploying these changes to our production environment, which
> will have a significantly higher number of users exercising these services.
> If these failures do not return after a few weeks of heavy load in our
> production environment, then I think we can consider this issue resolved.
> 
> I'll report back if anything changes. Thanks gentlemen.

Thank you very much for coming back.

Please just close this bugzilla if the issue doesn't hit you in the production environment.

Comment 7 Jakub Hrozek 2016-01-07 16:38:27 UTC
We haven't heard back on this issue for quite some time; closing.