Bug 1576597

Summary:	Clearing SSSD cache is necessary after update from 1.15 to 1.16
Product:	Red Hat Enterprise Linux 7	Reporter:	Josip Vilicic <jvilicic>
Component:	sssd	Assignee:	SSSD Maintainers <sssd-maint>
Status:	CLOSED WORKSFORME	QA Contact:	sssd-qe <sssd-qe>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	grajaiya, jhrozek, jvilicic, lslebodn, mkosek, mzidek, pbrezina, tscherf
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Last Closed:	2018-05-23 10:54:15 UTC	Type:	Bug

Description Josip Vilicic 2018-05-09 21:59:14 UTC

Description of problem:
Updating SSSD from 1.15 to 1.16 caused very long authentication times in Ansible, which were only remedied by clearing the SSSD cache.
_____________________________________________


Version-Release number of selected component (if applicable):
sssd-1.16.0-19.el7.x86_64

_____________________________________________


How reproducible:
Consistent

_____________________________________________


Steps to Reproduce:
1) Set up authentication through SSSD on version 1.15
2) Update RHEL to 7.5 (which updates SSSD to 1.16)
3) Try to authenticate

_____________________________________________


Actual results:
- Authentication takes ~20 seconds after the update; it used to take ~10 seconds
- After downgrading, SSSD fails to start until the cache is cleared (which is difficult to do in an automated-deployment environment)

_____________________________________________


Expected results:
Authentication works just as quickly as it did before the SSSD update

_____________________________________________


Additional info:

-- Errors seen after downgrading from 1.16 to 1.15 (done to check whether the downgrade fixed the delay); these errors led the customer to try clearing the SSSD cache:

   May 07 10:05:31 ls0oegip100 systemd[1]: Starting System Security Services Daemon...
   May 07 10:05:31 ls0oegip100 sssd[47975]: Starting up
   May 07 10:05:31 ls0oegip100 sssd[47975]: Lower version of database is expected!
   May 07 10:05:31 ls0oegip100 sssd[47975]: Removing cache files in /var/lib/sss/db should fix the issue, but note that removing cache files will also remove all of your cached credentials.
   May 07 10:05:31 ls0oegip100 systemd[1]: sssd.service: main process exited, code=exited, status=3/NOTIMPLEMENTED
   May 07 10:05:31 ls0oegip100 systemd[1]: Failed to start System Security Services Daemon.
   May 07 10:05:31 ls0oegip100 systemd[1]: Unit sssd.service entered failed state.
   May 07 10:05:31 ls0oegip100 systemd[1]: sssd.service failed.
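
Note: the cache removal suggested in the log above can be sketched roughly as follows. This is a minimal sketch under assumptions, not a procedure verified on the customer's systems: clear_sssd_cache is a hypothetical helper name, its second argument is an illustrative dry-run switch, and the default directory is the one named in the sssd log message. As the log warns, removing the cache files also removes all cached credentials.

```shell
# Hypothetical helper: stop sssd, remove the cache files, start sssd again.
#   $1: cache directory (defaults to the path named in the sssd log message)
#   $2: optional command prefix; pass "echo" to print the commands
#       instead of running them (dry run)
clear_sssd_cache() {
    db_dir="${1:-/var/lib/sss/db}"
    run="${2:-}"
    $run systemctl stop sssd &&
    $run rm -f "$db_dir"/* &&
    $run systemctl start sssd
}

# Dry run: only prints the three commands, nothing is stopped or deleted.
clear_sssd_cache /var/lib/sss/db echo
```

Calling the function without the second argument would actually run the commands and needs root; in an automated deployment it would presumably be run once per host, right after the package change.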

Comment 2 Jakub Hrozek 2018-05-10 10:48:36 UTC

This bug report has no configuration or logs. Please see https://docs.pagure.org/SSSD.sssd/users/reporting_bugs.html to learn what is needed in a useful bug report.

As for having to remove the cache after a downgrade, that is expected.

Comment 3 Jakub Hrozek 2018-05-11 16:18:35 UTC

OK, I found some time to actually read the case and it is not clear to me what this bug is about:
 a) is it about the perceived performance regression? If yes, can we see log files that capture the login or id or whatever is slow? It would be best to be able to compare the old and new version

or ..

 b) is it about having to remove the cache when you downgrade? If yes, then that's expected: we changed the database layout and the indexes somewhat between 1.15 and 1.16, so the database must be upgraded, and we don't support database downgrades (often it's not even possible)
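
Since database downgrades are not supported, an automated deployment could check for a downgrade and clear the cache only in that case. A minimal sketch of such a check, assuming the deployment tooling knows the sssd version before and after the package transaction; version_lt and needs_cache_clear are hypothetical helper names, and only dotted numeric versions (no release suffix) are handled:

```shell
# Succeeds (exit status 0) when dotted version $1 is strictly lower than $2.
version_lt() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        # Compare component by component; missing components count as 0.
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1   # versions are equal
    }'
}

# Prints "yes" when going from version $1 to version $2 is a downgrade.
needs_cache_clear() {
    if version_lt "$2" "$1"; then echo yes; else echo no; fi
}

needs_cache_clear 1.16.0 1.15.2   # downgrade, as in this report
needs_cache_clear 1.15.2 1.16.0   # upgrade
```

This only covers the downgrade case described in b); it says nothing about the slowdown reported after the upgrade.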

Comment 4 Josip Vilicic 2018-05-16 18:32:39 UTC

1) This bug was opened because the customer reported having to clear SSSD's cache after upgrading from 1.15 to 1.16:

   ----------------------
   TEST #1: 
   My first test is with the actual environment in which the problem first occurred after the upgrade of the server.

   kernel: 3.10.0-862.el7.x86_64
   sssd: sssd-1.16.0-19.el7.x86_64

   Result: 
   The playbook failed with error messages: "Timeout (22s) waiting for privilege escalation prompt"
   
   Result time:
   real    0m23.801s
   user    0m2.091s
   sys     0m0.619s
   ----------------------



2) The customer ran an additional test in which they downgraded SSSD; SSSD then failed to start until they cleared the cache:

   ----------------------
   TEST #2: 
   I downgraded 'sssd' to the previous version and kept the same kernel.

   kernel: 3.10.0-862.el7.x86_64
   sssd: sssd-1.15.2-50.el7_4.11.x86_64

   Result:
   The playbook completed successfully.

   Result time:
   real    0m13.991s
   user    0m2.242s
   sys     0m0.627s

   As you can see, the execution took a little over half the time with version 1.15 (about 14 s) compared to version 1.16 (about 24 s).

   IMPORTANT:
   After the downgrade to version 1.15, when I restarted 'sssd', I got the following error:

   # systemctl status sssd.service
    sssd.service - System Security Services Daemon
      Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
     Drop-In: /etc/systemd/system/sssd.service.d
              journal.conf
      Active: failed (Result: exit-code) since Mon 2018-05-07 10:05:31 EDT; 9s ago
     Process: 47975 ExecStart=/usr/sbin/sssd -i -f (code=exited, status=3)
    Main PID: 47975 (code=exited, status=3)

   May 07 10:05:31 ls0oegip100 systemd[1]: Starting System Security Services Daemon...
   May 07 10:05:31 ls0oegip100 sssd[47975]: Starting up
   May 07 10:05:31 ls0oegip100 sssd[47975]: Lower version of database is expected!
   May 07 10:05:31 ls0oegip100 sssd[47975]: Removing cache files in /var/lib/sss/db should fix the issue, but note that removing cache files will also remove all of your cached credentials.
   May 07 10:05:31 ls0oegip100 systemd[1]: sssd.service: main process exited, code=exited, status=3/NOTIMPLEMENTED
   May 07 10:05:31 ls0oegip100 systemd[1]: Failed to start System Security Services Daemon.
   May 07 10:05:31 ls0oegip100 systemd[1]: Unit sssd.service entered failed state.
   May 07 10:05:31 ls0oegip100 systemd[1]: sssd.service failed.


   After I cleared the contents of the '/var/lib/sss/db' directory, 'sssd' started without problems.
   ----------------------

   I personally feel we can ignore TEST #2, but it does show that they do not hit the timeout with SSSD 1.15 the way they do with SSSD 1.16.



3) Two additional tests followed: they upgraded SSSD from 1.15 to 1.16 and hit the problem again (even though the cache had been cleared in TEST #2 above); only clearing SSSD's cache *after the upgrade* made things work properly:

   ----------------------
   TEST #3:
   I upgraded 'sssd' to version 1.16.

   kernel: 3.10.0-862.el7.x86_64
   sssd: sssd-1.16.0-19.el7.x86_64

   Result: 
   The playbook failed with error messages: "Timeout (22s) waiting for privilege escalation prompt"

   Result time:
   real    0m23.990s
   user    0m2.247s
   sys     0m0.702s


   TEST #4: 
   With the same versions of the kernel and sssd, I cleared the contents of the '/var/lib/sss/db' directory.

   kernel: 3.10.0-862.el7.x86_64
   sssd: sssd-1.16.0-19.el7.x86_64

   Result: 
   The playbook completed successfully.

   Result time:
   real    0m12.109s
   user    0m2.371s
   sys     0m0.630s
   ----------------------



4) Unfortunately, the customer has moved on after the "workaround" of clearing SSSD's cache after the upgrade and they do not have the resources to continue troubleshooting, so we do not have, and will not receive, SSSD debug logs of the failure.
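
For reference, the "real" times from the four tests can be compared with a small helper. This is only a sketch: real_to_seconds and slowdown are hypothetical helper names, and the input values are copied from the test results above.

```shell
# Convert the "real" field printed by `time` (for example 0m23.801s) to seconds.
real_to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two "real" values: how many times longer the first run took.
slowdown() {
    a="$(real_to_seconds "$1")"
    b="$(real_to_seconds "$2")"
    LC_ALL=C awk -v a="$a" -v b="$b" 'BEGIN { printf "%.2f\n", a / b }'
}

# "real" values copied from TEST #1 to TEST #4.
echo "TEST #1 (1.16, after upgrade) vs TEST #2 (1.15):         $(slowdown 0m23.801s 0m13.991s)"
echo "TEST #3 (1.16, old cache) vs TEST #4 (1.16, cache cleared): $(slowdown 0m23.990s 0m12.109s)"
```

This prints ratios of 1.70 and 1.98. Both slow runs ended with the 22 s privilege escalation timeout, so these ratios most likely understate how long authentication would have taken had the playbook kept waiting.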

Comment 5 Jakub Hrozek 2018-05-16 19:28:22 UTC

Thank you very much for clearing up the confusion.

Since the case is closed and none of our tests showed a performance regression, I think it makes sense to close the bug as WORKSFORME in a couple of days.

Comment 6 Jakub Hrozek 2018-05-23 10:54:15 UTC

Since there is no additional information to perform some kind of a post-mortem analysis, I'm going to close this bug.