1110226 – Requests queued during transition from offline to online mode

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1110226 - Requests queued during transition from offline to online mode

Summary: Requests queued during transition from offline to online mode

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jakub Hrozek
QA Contact:	Kaushik Banerjee
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-06-17 09:28 UTC by Jakub Hrozek
Modified:	2020-05-02 17:46 UTC (History)
CC List:	10 users (show)
Fixed In Version:	sssd-1.11.6-23.el6
Doc Type:	Bug Fix
Doc Text:	No Documentation Needed
Clone Of:
Environment:
Last Closed:	2014-10-14 04:48:42 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	SSSD sssd issues 3397	None	None	None	2020-05-02 17:43:38 UTC
Github	SSSD sssd issues 3443	None	None	None	2020-05-02 17:46:32 UTC
Red Hat Product Errata	RHBA-2014:1375	normal	SHIPPED_LIVE	sssd bug fix and enhancement update	2014-10-14 01:06:25 UTC

Description Jakub Hrozek 2014-06-17 09:28:30 UTC

This bug is created as a clone of upstream ticket:
https://fedorahosted.org/sssd/ticket/2355

Using SSSD with "cache_credentials = true", users may experience periodic blocking for up to 6 seconds if SSSD is switching from offline to online mode and the LDAP server is unreachable. 

The first request to SSSD after it has been offline for more than 60 seconds is immediately answered from the cache, but then triggers a reconnection trial to the LDAP server.

All subsequent requests reaching SSSD during the connection phase are queued and answered once the connection succeeds or fails. In case the LDAP server is unreachable, SSSD waits for 6 seconds before the connection trial is aborted. This means that the user may experience a delay of up to 6 seconds every 60 seconds (in the worst case).

See the following debug logs where the LDAP server is not responding, starting off in offline mode:
-------------------------------------
The *first* request to SSSD (which is triggering the reconnection trial) gets answered right away in offline mode:
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][idnumber=10011]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Request processed. Returned 1,11,Fast reply - offline
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): beginning to connect
...
SSSD is now trying to reconnect to the LDAP server.

Only the *subsequent* requests that are received *while* SSSD is trying to (re-)connect to the LDAP server are queued until the connection times out (for at most 6 seconds). These pending requests are causing the system to block:

...
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][name=brauchle]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): waiting for connection to complete
...
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][idnumber=10011]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): waiting for connection to complete
...

--> this is the time where the system may be unresponsive for 6 seconds <--

...
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x0020): Failed to connect, going offline (5 [Input/output error])
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline]
>> (0x2000): Going offline!
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb]
>> (0x0080): Going offline. Running callbacks.
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #1
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #2
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
>> (0x0100): Request processed. Returned 1,11,Offline
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #3
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
>> (0x0100): Request processed. Returned 1,11,Offline 
-------------------------------------
After the connection times out, the queued request are answered with cached entries.

So why not keep the "offline" flag set to "true" until the LDAP connection trial returns (positive or negative) and only if positive, switch to online mode?

As the first request (triggering the reconnection) is answered from the cache anyway, there is no point to keep the subsequent ones pending until the connection is established successfully.


Possibly one needs to consider that start up phase (with cold caches) as a special case and actually do queue incoming request in this case?

Comment 2 Sumit Bose 2014-07-17 14:18:34 UTC

How to reproduce:
- start SSSD with option 'cache_credentials = True', make sure the configured server for authentication is available (it does not matter which type of authentication)
- authenticate a user to make sure that credentials are cached
- switch off authentication server
- authenticate the user again, this might take a few seconds because SSSD has to detect that the server is offline
- the next authentications are fast because they are done offline
- after a minute SSSD will try to go online again, if the server is still not available authentication will need a couple of seconds

With the patch applied the only delay should be seen at the point where SSSD had to detect that the server is offline. While offline no delays should be seen, even when offline timeout (1 minute) is passed and SSSD tries to go online again.

Comment 3 Jakub Hrozek 2014-07-31 09:56:17 UTC

master:
    45eb92f4564c76d04535b5c8e85e769db3ea93a3
    b39098de5bbb7513d5509fb30a3096a3bd9d43f2
    ab0ab5a30379b84d6e05e1f2dc457bd1dd97401f

Comment 4 Jakub Hrozek 2014-07-31 11:56:56 UTC

Pushed to sssd-1-11:
    f65efdab6cb62f80a9cfa30303d4c550047bc97e
    e552a21a8588a63c26791ad11281692527feebd1
    e9ca61c12da6f16ce6269bf104ddf5b4c7c5760b

Comment 6 Kaushik Banerjee 2014-08-08 10:15:56 UTC

There is a change in behaviour due to the fix. This should be documented.

Pasting Michal's reply to the discussion on sssd taking 96s for the first time on transitioning from offline to online.

<snip>
default base timeout is 60 seconds, longest random offset is 30 seconds....
so after max 90 seconds sssd will check if the server is available... this
check may take up to 6 seconds (most likely it will be less than a second,
but to be sure I am taking the longest possible interval)... so totaly 96
potential seconds of sssd being offline
</snip>

Comment 7 Jakub Hrozek 2014-08-08 22:09:01 UTC

Upstream ticket:
https://fedorahosted.org/sssd/ticket/2401

Comment 8 Jakub Hrozek 2014-08-13 06:35:20 UTC

(In reply to Kaushik Banerjee from comment #6)
> There is a change in behaviour due to the fix. This should be documented.

Hi, I realized yesterday during patch review of the man patch addition that in RHEL-6.6 we don't have the offline_timeout option at all yet, so there's no option to amend. Do you still think we should document the difference somewhere even though in 6.6 the offline timeout is not documented? Or is it OK to move the documentation addendum to 7.1 since the offline timeout is pretty much opaque in RHEL-6.

Comment 9 Kaushik Banerjee 2014-08-13 07:28:35 UTC

(In reply to Jakub Hrozek from comment #8)

> Hi, I realized yesterday during patch review of the man patch addition that
> in RHEL-6.6 we don't have the offline_timeout option at all yet, so there's
> no option to amend. Do you still think we should document the difference
> somewhere even though in 6.6 the offline timeout is not documented?

I would like to have this documented in the manpage.
Looking at the upstream patch for offline_timeout man page text, can we include the similar text(minus mentioning the offline_timeout option) somewhere? That would of course mean a separate patch for 1.11

If this is not possible, lets just release note this behaviour for 6.6.

Comment 10 Jakub Hrozek 2014-08-13 07:52:16 UTC

Michal, can you prepare a similar patch for 6.6, maybe in the "NOTES" section or somewhere where we describe failover?

Comment 12 Jakub Hrozek 2014-08-19 14:20:24 UTC

* sssd-1-11: ce16e214b09324b614c8014d3590000a740d68a4

Comment 14 Kaushik Banerjee 2014-09-05 13:41:15 UTC

Verified with 1.11.6-29.el6

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: offline_misc_003: BZ 1110226 Requests queued during transition from offline to online mode
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Stopping sssd:                                             [  OK  ]
Starting sssd:                                             [  OK  ]
:: [   PASS   ] :: Running 'su_success puser1 Secret123' (Expected 0, got 0)
:: [   PASS   ] :: Running 'su_success puser1 Secret123' (Expected 0, got 0)
:: [   PASS   ] :: Running 'su_success puser1 Secret123' (Expected 0, got 0)
:: [   PASS   ] :: Running '(time su_success puser1 Secret123) > /tmp/output 2>&1' (Expected 0, got 0)
:: [   PASS   ] :: No delay seen during auth in offline mode after 90s 



The manpage of sssd also has the following in the Notes section:

<snip>

Amount of time SSSD spends in offline mode
           When SSSD switches to offline mode, the amount of time before it tries to go back online will increase based upon the time spent disconnected. This value is in seconds and calculated by the following:

           60 + random_offset

           The random offset can increment up to 30 seconds. After each unsuccessful attempt to go online, the new interval is recalculated by the following:

           new_interval = old_interval*2 + random_offset

           Note that the maximum length of each interval is currently limited to one hour. If the calculated length of new_interval is greater than an hour, it will be forced to one hour.

</snip>

Comment 15 errata-xmlrpc 2014-10-14 04:48:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1375.html

Note You need to log in before you can comment on or make changes to this bug.