Red Hat Bugzilla – Bug 1110226
Requests queued during transition from offline to online mode
Last modified: 2016-09-15 10:01:17 EDT
This bug is created as a clone of upstream ticket: https://fedorahosted.org/sssd/ticket/2355 Using SSSD with "cache_credentials = true", users may experience periodic blocking for up to 6 seconds if SSSD is switching from offline to online mode and the LDAP server is unreachable. The first request to SSSD after it has been offline for more than 60 seconds is immediately answered from the cache, but then triggers a reconnection trial to the LDAP server. All subsequent requests reaching SSSD during the connection phase are queued and answered once the connection succeeds or fails. In case the LDAP server is unreachable, SSSD waits for 6 seconds before the connection trial is aborted. This means that the user may experience a delay of up to 6 seconds every 60 seconds (in the worst case). See the following debug logs where the LDAP server is not responding, starting off in offline mode: ------------------------------------- The *first* request to SSSD (which is triggering the reconnection trial) gets answered right away in offline mode: >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] >> (0x0100): Got request for [4097][1][idnumber=10011] >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] >> (0x0100): Request processed. Returned 1,11,Fast reply - offline >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] >> (0x4000): beginning to connect ... SSSD is now trying to reconnect to the LDAP server. Only the *subsequent* requests that are received *while* SSSD is trying to (re-)connect to the LDAP server are queued until the connection times out (for at most 6 seconds). These pending requests are causing the system to block: ... >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] >> (0x0100): Got request for [4097][1][name=brauchle] >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] >> (0x4000): waiting for connection to complete ... >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] >> (0x0100): Got request for [4097][1][idnumber=10011] >> (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] >> (0x4000): waiting for connection to complete ... --> this is the time where the system may be unresponsive for 6 seconds <-- ... >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] >> (0x0020): Failed to connect, going offline (5 [Input/output error]) >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline] >> (0x2000): Going offline! >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb] >> (0x0080): Going offline. Running callbacks. >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] >> (0x4000): notify offline to op #1 >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] >> (0x4000): notify offline to op #2 >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] >> (0x0100): Request processed. Returned 1,11,Offline >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] >> (0x4000): notify offline to op #3 >> (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] >> (0x0100): Request processed. Returned 1,11,Offline ------------------------------------- After the connection times out, the queued request are answered with cached entries. So why not keep the "offline" flag set to "true" until the LDAP connection trial returns (positive or negative) and only if positive, switch to online mode? As the first request (triggering the reconnection) is answered from the cache anyway, there is no point to keep the subsequent ones pending until the connection is established successfully. Possibly one needs to consider that start up phase (with cold caches) as a special case and actually do queue incoming request in this case?
How to reproduce: - start SSSD with option 'cache_credentials = True', make sure the configured server for authentication is available (it does not matter which type of authentication) - authenticate a user to make sure that credentials are cached - switch off authentication server - authenticate the user again, this might take a few seconds because SSSD has to detect that the server is offline - the next authentications are fast because they are done offline - after a minute SSSD will try to go online again, if the server is still not available authentication will need a couple of seconds With the patch applied the only delay should be seen at the point where SSSD had to detect that the server is offline. While offline no delays should be seen, even when offline timeout (1 minute) is passed and SSSD tries to go online again.
master: 45eb92f4564c76d04535b5c8e85e769db3ea93a3 b39098de5bbb7513d5509fb30a3096a3bd9d43f2 ab0ab5a30379b84d6e05e1f2dc457bd1dd97401f
Pushed to sssd-1-11: f65efdab6cb62f80a9cfa30303d4c550047bc97e e552a21a8588a63c26791ad11281692527feebd1 e9ca61c12da6f16ce6269bf104ddf5b4c7c5760b
There is a change in behaviour due to the fix. This should be documented. Pasting Michal's reply to the discussion on sssd taking 96s for the first time on transitioning from offline to online. <snip> default base timeout is 60 seconds, longest random offset is 30 seconds.... so after max 90 seconds sssd will check if the server is available... this check may take up to 6 seconds (most likely it will be less than a second, but to be sure I am taking the longest possible interval)... so totaly 96 potential seconds of sssd being offline </snip>
Upstream ticket: https://fedorahosted.org/sssd/ticket/2401
(In reply to Kaushik Banerjee from comment #6) > There is a change in behaviour due to the fix. This should be documented. Hi, I realized yesterday during patch review of the man patch addition that in RHEL-6.6 we don't have the offline_timeout option at all yet, so there's no option to amend. Do you still think we should document the difference somewhere even though in 6.6 the offline timeout is not documented? Or is it OK to move the documentation addendum to 7.1 since the offline timeout is pretty much opaque in RHEL-6.
(In reply to Jakub Hrozek from comment #8) > Hi, I realized yesterday during patch review of the man patch addition that > in RHEL-6.6 we don't have the offline_timeout option at all yet, so there's > no option to amend. Do you still think we should document the difference > somewhere even though in 6.6 the offline timeout is not documented? I would like to have this documented in the manpage. Looking at the upstream patch for offline_timeout man page text, can we include the similar text(minus mentioning the offline_timeout option) somewhere? That would of course mean a separate patch for 1.11 If this is not possible, lets just release note this behaviour for 6.6.
Michal, can you prepare a similar patch for 6.6, maybe in the "NOTES" section or somewhere where we describe failover?
* sssd-1-11: ce16e214b09324b614c8014d3590000a740d68a4
Verified with 1.11.6-29.el6 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: :: [ LOG ] :: offline_misc_003: BZ 1110226 Requests queued during transition from offline to online mode :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: Stopping sssd: [ OK ] Starting sssd: [ OK ] :: [ PASS ] :: Running 'su_success puser1 Secret123' (Expected 0, got 0) :: [ PASS ] :: Running 'su_success puser1 Secret123' (Expected 0, got 0) :: [ PASS ] :: Running 'su_success puser1 Secret123' (Expected 0, got 0) :: [ PASS ] :: Running '(time su_success puser1 Secret123) > /tmp/output 2>&1' (Expected 0, got 0) :: [ PASS ] :: No delay seen during auth in offline mode after 90s The manpage of sssd also has the following in the Notes section: <snip> Amount of time SSSD spends in offline mode When SSSD switches to offline mode, the amount of time before it tries to go back online will increase based upon the time spent disconnected. This value is in seconds and calculated by the following: 60 + random_offset The random offset can increment up to 30 seconds. After each unsuccessful attempt to go online, the new interval is recalculated by the following: new_interval = old_interval*2 + random_offset Note that the maximum length of each interval is currently limited to one hour. If the calculated length of new_interval is greater than an hour, it will be forced to one hour. </snip>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1375.html