Bug 1401546

Summary: Please back-port fast failover from sssd 1.14 on RHEL 7 into sssd 1.13 on RHEL 6
Product: Red Hat Enterprise Linux 6
Component: sssd
Version: 6.8
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: 6.9
Reporter: Greg Scott <gscott>
Assignee: SSSD Maintainers <sssd-maint>
QA Contact: shridhar <sgadekar>
Docs Contact: Filip Hanzelka <fhanzelk>
CC: cww, enewland, fhanzelk, fidencio, grajaiya, gscott, jhrozek, jkurik, lslebodn, mkosek, mzidek, pbrezina, sgoveas, sssd-maint
Fixed In Version: sssd-1.13.3-60.el6
Doc Type: Bug Fix
Doc Text:
*SSSD* now correctly switches to online mode when prompted by *sssd_be*. Previously, a decision to go online made by the *sssd_be* subcomponent of the *System Security Services Daemon* (*SSSD*) was incorrectly followed by a check in the *libsss_ldap* subcomponent that canceled the original decision. As a consequence, it sometimes took a long time for *SSSD* to reconnect from offline mode. With this update, the incorrect check has been removed. As a result, if a request, in particular an authentication request, arrives while *SSSD* is offline and *sssd_be* decides that *SSSD* should reconnect, it now reconnects correctly.
Last Closed: 2018-06-19 05:13:47 UTC
Type: Bug
Bug Blocks: 1374441, 1461138, 1504542    

Description Greg Scott 2016-12-05 14:38:42 UTC
Description of problem:
Here is the scenario:

1)      We have a working machine talking to LDAP for user authentication. We use SSSD for this communication, with its nss and pam services.

2)      We use iptables to insert a rule that drops the LDAP traffic (example commands after this list).

3)      After an extended period of time (say 20 minutes), we remove the iptables rule and permit the LDAP traffic again.

4)      We used tcpdump to observe LDAP traffic during the outage.
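
For reference, a minimal sketch of steps 2-4; the port (plain LDAP on tcp/389), the OUTPUT chain, and the interface name are assumptions, so adjust for ldaps/636 and the actual NIC if needed:

# Block outbound LDAP traffic (assumes plain LDAP on tcp/389):
iptables -I OUTPUT -p tcp --dport 389 -j DROP

# ...wait roughly 20 minutes, then remove the same rule:
iptables -D OUTPUT -p tcp --dport 389 -j DROP

# Observe the LDAP traffic before, during, and after the outage
# (eth0 is assumed; pick the interface facing the LDAP server):
tcpdump -ni eth0 port 389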

When performing the above, it seems to take over 5 minutes for the system to wake up and re-establish a connection to LDAP.  The system we tested this on is:

Version-Release number of selected component (if applicable):
RHEL 6.n, sssd <= 1.13

How reproducible:
Always

Steps to Reproduce:
1. See above
2.
3.

Actual results:
It takes a long time (over 5 minutes) for SSSD to reconnect after the LDAP server becomes reachable again.

Expected results:
SSSD should reconnect to its LDAP server more quickly after the LDAP server becomes available again.

Additional info:

We evaluated the latest 7.3 release and noticed that sssd recovered almost instantaneously after a prolonged outage.

Looking at the version and changelog, it appears this issue was fixed in sssd 1.14.0; eagle-eyed Mike picked up on it:
[host.example.com ~]$ rpm --changelog -q sssd
# snippet
* Tue Jul 12 2016 Jakub Hrozek <jhrozek> - 1.14.0-3
- Sync a few minor patches from upstream
- Fix a failover issue
- Resolves: rhbz#1334749 - sssd fails to mark a connection as bad on
                           searches that time out

So the question becomes, can sssd 1.14.0 be made available for RHEL 6?
Or would it be possible to back-port this fix to sssd 1.13?

Comment 2 Lukas Slebodnik 2016-12-05 14:54:09 UTC
(In reply to Greg Scott from comment #0)
> So the question becomes, can sssd 1.14.0 be made available for RHEL 6?
> Or would it be possible to back-port this fix to sssd 1.13?

There isn't any technical problem to build sssd-1.14 on rhel6
https://copr.fedorainfracloud.org/coprs/g/sssd/sssd-1-14/

However, RHEL 6 is in a production phase that does not allow rebases.

Regarding BZ1334749: the fix is trivial (https://fedorahosted.org/sssd/ticket/3009). The question is whether it will actually help the customer or whether this is just a guess based on the RHEL 7.3 changelog.

Comment 3 Jakub Hrozek 2016-12-05 15:21:23 UTC
(In reply to Lukas Slebodnik from comment #2)
> Regarding to BZ1334749. The fix is trivial
> https://fedorahosted.org/sssd/ticket/3009. The question is whether it will
> help for customer or it is just guessing based on rhel7.3 changelog.

I talked to Greg over e-mail and they were guessing. But I asked for a bugzilla to be opened nonetheless, because otherwise we will just forget about this problem. So I built the test package as requested:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12198058

However, the problem as described here sounds a bit different from what I thought it was. The fix was about "failing fast": before the fix, if a server was marked as bad, it took SSSD too long to get into offline mode.

It seems that the customer's problem is "recovering fast", which is quite different. Moreover, as I already said in the case earlier, sssd mostly reacts to queries from outside; it rarely performs any lookups or reconnects on its own.

So I would like to ask for:
1) the test package to be tried out; it's the 6.9 candidate with the extra patch applied
2) if this doesn't help (and I'm not convinced anymore it would), then please attach sssd logs that capture the period between inserting and removing the iptables rules. The logs in the customer case do not capture any failover at all. Also, I see errors like "Unexpected result from ldap: Server is unwilling to perform(53), Rejecting the requested operation because the connection has not been authenticated", which suggests they are using an anonymous bind against a server that requires authentication (like AD).
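
For reference, one way to capture such logs on sssd 1.13 is to raise debug_level in /etc/sssd/sssd.conf and restart sssd; the domain name below is an assumption, so use the one from the customer's actual config:

# /etc/sssd/sssd.conf (excerpt)
[domain/example.com]
debug_level = 9

[nss]
debug_level = 9

[pam]
debug_level = 9

service sssd restart

The logs then land under /var/log/sssd/, one file per section (e.g. sssd_example.com.log for the domain).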

Comment 6 Greg Scott 2016-12-06 15:24:41 UTC
Jakub, I pasted most of your update and your link into the support case yesterday and asked the customer to test the package you built.

This thought may be too creative but it's bound to come up. If the customer were to grab a 1.14 sssd from a RHEL 7 repo and try to install that RPM onto RHEL 6, would the install fail with a bazillion dependency problems?

- Greg

Comment 7 Jakub Hrozek 2016-12-06 16:27:22 UTC
(In reply to Greg Scott from comment #6)
> Jakub, I pasted most of your update and your link into the support case
> yesterday and asked the customer to test the package you built.
> 

Thank you.

> This thought may be too creative but it's bound to come up. If the customer
> were to grab a 1.14 sssd from a RHEL 7 repo and try to install that RPM onto
> RHEL 6, would the install fail with a bazillion dependency problems?

Yes, this wouldn't work and wouldn't be supported.

Comment 8 Greg Scott 2016-12-08 15:48:18 UTC
I have this comment from the customer:

> Unfortunately I'm unable to get to the URL 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12198058 from 
> within PNC or even from my home system.  Is there an alternative way to grab
> the package?  Also, we are running rhel6.6, so do I need to upgrade to
> rhel6.9 for testing, or will this package work on 6.6?

My fault, it's probably private and I should have packaged this better for the customer.

I navigated through the URL above and found the RPMs.  It looks like there are a bunch of dependencies that also need to be installed.

This is what I think needs to happen.

First, do:

yum update sssd

to get to the stock 1.13 sssd and all its dependencies.  Then do:

rpm -i sssd-1.13.3-51.el6.i686.rpm

on top of the sssd 1.13 already in place.  Am I on solid ground?

thanks

- Greg

Comment 9 Lukas Slebodnik 2016-12-08 15:54:53 UTC
(In reply to Greg Scott from comment #8)
> This is what I think needs to happen.
> 
> First, do:
> 
> yum update sssd
> 
> to get to the stock 1.13 sssd and all its dependencies.  Then do:
> 
> rpm -i sssd-1.13.3-51.el6.i686.rpm
> 
> on top of the sssd 1.13 already in place.  Am I on solid ground?
> 
IIRC the only new dependencies which are not in RHEL 6.8 are the packages built
from ding-libs-0.4.0-12.el6.

Comment 10 Jakub Hrozek 2016-12-08 16:47:47 UTC
Yeah, I realized the dependency might be missing, so I'm building another test package with just that single patch cherry-picked atop 6.8, without requiring newer ding-libs:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12215039

This one should install cleanly on a RHEL 6 system. You should use rpm -Uvh *.rpm to upgrade to these packages. None of the -devel packages are needed for a test, and in general only packages that the customer already has installed should be upgraded (even upgrading only sssd-common should work).
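
For example, a sketch of that upgrade, assuming the downloaded x86_64 test packages sit in the current directory and the -devel RPMs have been deleted from it first:

# See which sssd subpackages are currently installed:
rpm -qa 'sssd*'

# Upgrade the matching ones from the downloaded test build:
rpm -Uvh sssd-*.rpm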

Comment 11 Greg Scott 2016-12-08 17:17:55 UTC
I may have messed up here.  I see references to src.rpm files in that link above, but I don't see any installable RPMs.  I clicked on some buttons and ended up at

https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=17261

where I saw a package named

sssd-1.13.3-51.el6.i686

That linked to:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=527920

where I saw installable RPMs, including:

sssd-1.13.3-51.el6.i686.rpm

built yesterday, which I attached to the support case. Did I get the wrong one?

- Greg

Comment 12 Jakub Hrozek 2016-12-09 08:38:40 UTC
(In reply to Greg Scott from comment #11)
> I may have messed up here.  I see references to src.rpm files in that link
> above, but I don't see any installable RPMs.  I clicked on some buttons and
> ended up at
> 
> https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=17261
> 
> where I saw a package named
> 
> sssd-1.13.3-51.el6.i686
> 
> That linked to:
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=527920
> 
> where I saw installable RPMs, including:
> 
> sssd-1.13.3-51.el6.i686.rpm
> 
> built yesterday, which I attached to the support case. Did I get the wrong
> one?
> 
> - Greg

You'll want to use the build I provided yesterday. Sorry about the first one; I didn't realize it depended on a newer libini package than the customer has through RHN, so the packages wouldn't install cleanly.

As for the arch specific packages, here are the ones for x86_64:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12215044
and here are the packages for i686:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12215046

Comment 13 Greg Scott 2016-12-09 15:51:28 UTC
Thanks Jakub.  I just now attached

sssd-1.13.3-22.el6_8.6.1.x86_64.rpm and
sssd-1.13.3-22.el6_8.6.1.i686.rpm

to the case.

The customer wants to stay at RHEL 6.6.  So I left instructions to do

yum update sssd

to get the stock sssd 1.13 and all its dependencies, and then install the appropriate RPM above by hand.
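
A quick sanity check after the manual step, assuming the x86_64 package from above, would be:

rpm -q sssd
# expected: sssd-1.13.3-22.el6_8.6.1.x86_64

# restart so the upgraded binaries are actually in use:
service sssd restart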

- Greg

Comment 14 Greg Scott 2016-12-12 18:51:58 UTC
After updating sssd to the stock 1.13 version, the customer had to install several of the patched RPMs by hand.  Here is feedback from the support case after installing and testing:

**********
"So was able to update the packages.  Helmuth ran a test and the issue still exists unfortunately.

I went ahead and attached a sosreport from our test system in case you would to take a look at our config.  Please let me know if you need anything else or what the next steps are."
**********

If it's easy to do and low risk, would it be possible to just build 1.14 on RHEL 6?

- Greg

Comment 15 Jakub Hrozek 2016-12-12 20:25:31 UTC
(In reply to Greg Scott from comment #14)
> After updating sssd to the stock 1.13 version, the customer had to install
> several of the patched RPMs by hand.  Here is feedback from the support case
> after installing and testing:
> 
> **********
> "So was able to update the packages.  Helmuth ran a test and the issue still
> exists unfortunately.
> 
> I went ahead and attached a sosreport from our test system in case you would
> to take a look at our config.  Please let me know if you need anything else
> or what the next steps are."
> **********
> 
> If it's easy to do and low risk, would it be possible to just build 1.14 on
> RHEL 6?
> 
> - Greg

Possible, yes; supported, no. In general, I'm wary of asking customers to run the upstream repositories, because we've had a couple of cases in the past when the customer then came back and asked for support on those.

Could we see two sets of debug logs, one from the working and one from the non-working system that capture the problem so that we can see what the issue might be?

Comment 16 Greg Scott 2016-12-12 20:40:15 UTC
Oh - sorry - I didn't mean for the customer to build 1.14. The customer is asking Red Hat to build and fully support sssd 1.14 on RHEL 6.

The customer attached an SOSreport from the problem system to the support case. Where's a good place to get you a copy?

- Greg

Comment 17 Jakub Hrozek 2016-12-13 09:43:46 UTC
(In reply to Greg Scott from comment #16)
> Oh - sorry - I didn't mean for the customer to build 1.14. The customer is
> asking Red Hat to build and fully support sssd 1.14 on RHEL 6.
> 

That won't happen, sorry. RHEL 6 is already in production phase 2, and only urgent and high-priority fixes are cherry-picked individually.

> The customer attached an SOSreport from the problem system to the support
> case. Where's a good place to get you a copy?

Attach the sssd debug logs to this bugzilla, please.

Comment 30 Jakub Hrozek 2016-12-22 17:07:53 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/3274

Comment 31 Greg Scott 2017-01-13 16:01:23 UTC
Hi Jakub - Since the customer tested the patch and it works for them, what should I tell them about plans, if any, to incorporate it into a RHEL 6 stream?

thanks

- Greg

Comment 32 Jakub Hrozek 2017-01-13 16:12:12 UTC
(In reply to Greg Scott from comment #31)
> Hi Jakub - Since the customer tested the patch and it works for them, what
> should I tell them about plans, if any, to incorporate it into a RHEL 6
> stream?
> 
> thanks
> 
> - Greg

Unfortunately, I think it's too late for 6.9 unless an exception is granted by a PM, since we are already in the snapshot phase. I think either 6.10 or 6.9.z is more realistic.

Comment 33 Greg Scott 2017-01-13 16:54:52 UTC
OK, thanks - Here is what I said in the support case:

******************
Created By: Greg Scott  (1/13/2017 10:53 AM)
It looks like the engineering team is planning to incorporate the patch you tested into a later RHEL 6 stream.  It's too late for 6.9, so hopefully 6.10.  It may make it into a later 6.9.z release stream.  I'll keep an eye on it and update here when new information is available.

- Greg
******************

- Greg

Comment 48 Greg Scott 2018-02-27 17:36:18 UTC
Groovy - thanks to both Jakub and Fabiano.

If you have to stand on your head to force-fit these fixes into SSSD 1.13, and if SSSD 1.14 saves your developer time, and if you guys think it's the best way to proceed, then I'll vote to bend the other rule and slip SSSD 1.14 into RHEL 6.10.z, or more likely 6.11.

- Greg

Comment 49 Jakub Hrozek 2018-02-27 18:51:00 UTC
(In reply to Greg Scott from comment #48)
> Groovy - thanks to both Jakub and Fabiano.
> 
> If you have to stand on your head to force-fit these fixes into SSSD 1.13,
> and if SSSD 1.14 saves your developer time, and if you guys think it's the
> best way to proceed, then I'll vote to bend the other rule and slip SSSD
> 1.14 into RHEL 6.10.z, or more likely 6.11.

Putting a major update into a z-stream is generally forbidden, and there won't be (AFAIK) a 6.11, so I think 6.10 is the last chance we've got.

I've been testing sssd master's reconnection logic lately and I realized we broke something else in 1.15. I'm not sure it's related to the code in 1.13 at all, but I would like to make sure that we're not backporting a bug.

If the customer has some time, it would be nice if they could test the 6.10 candidate, to make sure nothing else changed between the test build and the 6.10 candidate.

Comment 50 Greg Scott 2018-02-27 22:31:48 UTC
Thanks Jakub.  I just left a comment in the support case asking about it.

- Greg

Comment 51 Jakub Hrozek 2018-03-05 17:03:56 UTC
(In reply to Greg Scott from comment #50)
> Thanks Jakub.  I just left a comment in the support case asking about it.
> 
> - Greg

Did the customer agree to test? If yes, here is the link:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=657662

Comment 52 Greg Scott 2018-03-05 18:30:56 UTC
Thanks Jakub.  No answer yet from the customer.

- Greg

Comment 58 errata-xmlrpc 2018-06-19 05:13:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1877