Bug 1620315 - FreeIPA webUI user authentication fails with 389-ds-base 1.3.8.6+
Summary: FreeIPA webUI user authentication fails with 389-ds-base 1.3.8.6+
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: 389-ds-base
Version: 27
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: mreynolds
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-08-23 01:56 UTC by Adam Williamson
Modified: 2018-11-27 15:05 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-27 15:05:29 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
server logs (tarball of /var/log from the server) (1.71 MB, application/octet-stream)
2018-08-23 02:01 UTC, Adam Williamson
no flags Details
client logs (tarball of /var/log from the client) (1.58 MB, application/octet-stream)
2018-08-23 02:01 UTC, Adam Williamson
no flags Details
updated server logs (with dirsrv) (2.11 MB, application/octet-stream)
2018-08-23 20:11 UTC, Adam Williamson
no flags Details
server logs from a passed run (same test against a different update) (2.29 MB, application/octet-stream)
2018-08-23 21:01 UTC, Adam Williamson
no flags Details

Description Adam Williamson 2018-08-23 01:56:43 UTC
In automated (openQA) testing, FreeIPA client tests often seem to fail with 389-ds-base 1.3.8.6 and 1.3.8.7 (in Fedora 27 update testing). There are failures in Rawhide/F29 testing (with 1.4.0.15) which may *possibly* be the same, but I'm not 100% sure of that yet.

For F27, the test reliably affected is a complex one which does a lot of stuff. It starts by enrolling as a client via cockpit (which just runs realmd under the covers), then tests logging into the web UI (as admin) and doing some stuff in it, including creating a user called 'test3'. It creates an HBAC rule allowing test3 to login anywhere, then quits to console and kinits as test3 and sets a permanent password.

Next it tests it can log in as test3 at the console (which seems to always work). Then it tests logging into the web UI as test3 and changing its password. This test seems to always fail at this point, when it tries to log in to the web UI as test3, it fails. The web UI shows a dialog titled "Operations Error" which says "Some operations failed.", with a 'Show details' link. If I have the test click the 'Show details' link, it says "no matching entry found".

These results seem highly reproducible in testing of the 1.3.8.6 and 1.3.8.7 updates on F27 (testing on F29 and Rawhide is more of a mess ATM as other bugs frequently get in the way). Here's a recent run of the tests on the 1.3.8.6 F27 update:

https://openqa.stg.fedoraproject.org/tests/overview?version=27&groupid=2&build=Update-FEDORA-2018-3c876babb9&distri=fedora

You can see the failed 'realmd_join_cockpit' test: https://openqa.stg.fedoraproject.org/tests/343612

Note the red-outlined thumbnails - if you expand these you see that it failed on logging into the web UI during the password change test (as described above), and later (the password change test is marked as 'non-fatal' so the run continues after it fails). It also fails later when attempting to log in to the console as 'test1', which should also work.

We can see other recent results of the same test here:

https://openqa.fedoraproject.org/tests/264748?arch=x86_64&test=realmd_join_cockpit&flavor=updates-server&machine=64bit&distri=fedora&limit_previous=50&version=27#previous (prod)

https://openqa.stg.fedoraproject.org/tests/343612?flavor=updates-server&machine=64bit&distri=fedora&test=realmd_join_cockpit&arch=x86_64&limit_previous=50&version=27#previous (staging)

there are only failures from the 389-ds-base 1.3.8.6 and 1.3.8.7 updates, plus - on staging - a failure caused by a bug in a fedora-repos update (bad GPG key) and a transient failure which went away on restart (the fail for FEDORA-2018-e952086dda , note there's a pass for the same update right after it). This shows that the *same test* run against any other non-broken F27 update passes, indicating very strongly that the bug really is caused by the 389-ds-base update somehow. Also note that the test has been at least 4 times on each openQA instance (because we automatically re-run any failed test in case it was a blip, and there have been two updates, FEDORA-2018-a45016e03f (1.3.8.7) and FEDORA-2018-3c876babb9 (1.3.8.6) - there've been more than 4 tries on staging as I've re-run the test manually a few times.

I will attach logs from client and server, but there's nothing immediately outstanding in any of them, at least that I can see. Note the way ipa-dnskeysyncd crashes in a loop doesn't seem to be the cause of this bug - that also happens on all the other updates where the test *passes*.

The web UI failure message is somewhat interesting. FreeIPA 4.6.3 - the version in Fedora 27 at present - only raises that error in two places, ipalib/install/certstore.py `get_ca_certs` and ipapython/ipaldap.py `find_entries`. That's probably a useful avenue for further investigation (figure out which of those is likely getting used during web UI user login, and why a 389-ds-base update might cause that to break). I could probably re-run the test with a hack to FreeIPA to make it use a slightly different log message in those two places, so we know which one we're hitting...

Comment 1 Adam Williamson 2018-08-23 02:01:09 UTC
Created attachment 1478020 [details]
server logs (tarball of /var/log from the server)

Comment 2 Adam Williamson 2018-08-23 02:01:43 UTC
Created attachment 1478021 [details]
client logs (tarball of /var/log from the client)

Comment 3 Viktor Ashirov 2018-08-23 10:00:43 UTC
dirsrv logs seem to be missing from the tarball, without them it's hard to say why the operation failed. 
In 1.3.8.5 there was a filter optimization fix, 1.3.8.6 only backs out fix for https://pagure.io/389-ds-base/issue/49789, so I wonder if 1.3.8.5 failed too or it only started with 1.3.8.6.

Comment 4 Adam Williamson 2018-08-23 16:09:13 UTC
The log tarballs contain the whole of /var/log ; if there's nothing in /var/log/dirsrv in the tarball, there wasn't anything in it on the host. The logs in question may be in the journal; did you check the journal? (You can read it with journalctl --file , the files are under var/log/journal ).

The deal with 1.3.8.5 and 1.3.8.6 is that the thing backed out in 1.3.8.6 actually broke FreeIPA server deployment. So I can't really *prove* that this bug existed in 1.3.8.5 because the tests can't get that far, as the server deployment fails. However, since all 1.3.8.6 did was revert that problematic 'fix', logically speaking it seems almost certain that one of the *other* things in 1.3.8.5 (the ones that weren't reverted in 1.3.8.6) is what causes this bug.

Comment 5 mreynolds 2018-08-23 18:08:31 UTC
Very odd there is no logging for DS.  Do you know if the logging was disabled intentionally?

In the journal I do see the server's error log content but not the access log, and it's the access log that we need.  Is this a system I can look at right now?

Comment 6 Adam Williamson 2018-08-23 18:30:39 UTC
No, the systems are spun up on demand and destroyed right after the test. But I can tweak things and re-run them. It should also be possible to recreate the test manually, though I have a bit of trouble doing so here as https://bugzilla.redhat.com/show_bug.cgi?id=1484130 *still* does not appear to be fixed, so I cannot use disk images on a shared drive, and I don't really have enough space on my local system for disks for the two VMs I usually use to recreate this kind of issue :/

Where would ds logging be configured? If you tell me where to look, I can get the test to tell us what's there, and change it if necessary.

I'm on #freeipa on Freenode if you want to discuss this there...

Comment 7 Adam Williamson 2018-08-23 18:31:54 UTC
Oh, and the tests don't do anything to reduce logging detail (they do a few things to try and get *more* logs). If there aren't enough tests coming out of the logs, that's the way things are set up by default, apparently.

Comment 8 Adam Williamson 2018-08-23 20:11:00 UTC
Created attachment 1478337 [details]
updated server logs (with dirsrv)

Turns out the server logs being captured after FreeIPA decommissioning meant dirsrv logs were removed; here's an updated tarball with the logs captured before decommissioning, so dirsrv logs are present. The failed login as test3 happens at 19:46:32.0522 UTC (15:46 in the dirsrv logs) and the failed console login as test1 at 19:48:44.

Comment 9 Adam Williamson 2018-08-23 21:01:44 UTC
Created attachment 1478355 [details]
server logs from a passed run (same test against a different update)

Here's the same server log tarball from a case where the tests *passed* - this is the exact same test run against a different update, so 389-ds 1.3.8.4 is used. Comparing logs from passed and failed cases may help, I guess. This is with the tweak to ensure dirsrv logs are included. In these logs, the failed attempt to log in to the web UI occurs at 2018-08-23T20:37:44.0902 UTC .

Comment 10 Adam Williamson 2018-08-23 21:02:15 UTC
Err...I obviously meant that the *successful* attempt to log in to the web UI occurred at 2018-08-23T20:37:44.0902 UTC :)

Comment 11 mreynolds 2018-08-23 21:47:03 UTC
A quick look through the logs is not too revealing.  I see several successful binds as test3:

[23/Aug/2018:16:38:19.685947465 -0400] conn=101 op=0 BIND dn="" method=sasl version=3 mech=GSS-SPNEGO
[23/Aug/2018:16:38:19.690602686 -0400] conn=101 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0004827704 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"

I don't see any errors.  In the successful log I see "more" successful binds than in the failing log.  I also see a direct (non-sasl) bind of the user in the successful log which I do not see in the failure log:

[23/Aug/2018:16:38:19.291538409 -0400] conn=96 fd=122 slot=122 connection from local to /var/run/slapd-DOMAIN-LOCAL.socket
[23/Aug/2018:16:38:19.291973196 -0400] conn=96 op=0 BIND dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local" method=128 version=3
[23/Aug/2018:16:38:19.292841696 -0400] conn=96 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0001134752 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
[23/Aug/2018:16:38:19.293035197 -0400] conn=96 op=1 UNBIND
[23/Aug/2018:16:38:19.293068018 -0400] conn=96 op=1 fd=122 closed - U1


Perhaps there is an IPA log that might have more information about why its failing?


I will also do a more fine grained log analysis tomorrow.

Comment 12 Adam Williamson 2018-08-23 22:37:27 UTC
Yeah, that's more or less what I saw. Clearly more 'stuff' happens in the successful login case, but there's no clear reason why the failed one just apparently gives up - the transactions right before it gives up look just the same as in the successful case.

I'm currently fiddling with making the test hack up the relevant bits of FreeIPA to log harder; will report any findings from that.

Comment 13 Adam Williamson 2018-08-24 00:11:37 UTC
Welp, so far, I know for sure that the error is happening here:

https://github.com/freeipa/freeipa/blob/release-4-6-3/ipapython/ipaldap.py#L1467

Comment 14 Adam Williamson 2018-08-24 06:41:46 UTC
I hacked up the routines there with some debugging statements, you can see the adjusted copy of the file with the added logging lines at https://www.happyassassin.net/temp/ipaldap.py - search for 'XXX'. Here are logs from passed and failed attempts with those changes:

https://openqa.stg.fedoraproject.org/tests/344011/file/role_deploy_domain_controller_check-var_log.tar.gz (failed)
https://openqa.stg.fedoraproject.org/tests/344013/file/role_deploy_domain_controller_check-var_log.tar.gz (passed)

The attempts to log in to the web UI happened at 05:00:02 (pass) and 04:59:10 (fail). The log messages are in var/log/httpd/error.log (yeah, all IPA web UI log messages appear to go to error.log, even 'info' and 'debug' messages).

Comment 15 Adam Williamson 2018-08-24 06:47:01 UTC
Welp, that's fun - the 'success' case logs 'XXX raising EmptyResult error' (which is the log message I added right before the exception that seems to produce the "no matching entry found" error we see in the web UI) 9 times, the 'failure' case logs it 3 times. I'm done trying to figure out what the hell's going on here for tonight...

Comment 16 Adam Williamson 2018-08-24 06:53:20 UTC
I'm starting to wonder if this exception is a red herring, and what's really going on is the login process just gets...aborted, or cancelled, somehow, partway through, and *because* that happened, the web UI shows us some exception that was raised...but actually gets raised on the 'success' path too, we just don't get a message about it because the login succeeds. But really not sure. It really seems like the login process just kinda...stops, partway through. The dirsrv logs, httpd logs and even the system logs (there are way more occurrences of 'GSSAPI client step 1' in the journal around the time of the web UI login in the 'success' logs than the 'failure' logs) all look like that. But I can't find any smoking gun explaining *why*.

Comment 17 mreynolds 2018-08-24 13:18:45 UTC
We know there is an issue with one of the fixes in 1.3.8.5 around complex filters and access control.  So I backed out the filter fix and did a copr build:

https://copr-be.cloud.fedoraproject.org/results/@389ds/389-ds-base-nightly/fedora-27-x86_64/00790627-389-ds-base/

If it still fails there is one more candidate fix that might be related that I can back out as well, but let's see how this build reacts.

Comment 18 mreynolds 2018-08-24 13:43:35 UTC
Looking at the logs again I see that IPA is using GER (getEffectRights), and that behavior is different in the failure logs.  We did have a complex fix around GER that went into 1.3.8.5.  So I did a separate copr build with just the GER fix removed.  I suspect this might actually be the bug, and not the filter fix from the first copr build I did.

So please try this copr build first (GER fix removed):

https://copr-be.cloud.fedoraproject.org/results/@389ds/389-ds-base-nightly/fedora-27-x86_64/00790664-389-ds-base/

Then if there is still a problem try this one(from comment 17):

https://copr-be.cloud.fedoraproject.org/results/@389ds/389-ds-base-nightly/fedora-27-x86_64/00790627-389-ds-base/

Thanks!

Comment 19 Adam Williamson 2018-08-24 16:04:07 UTC
Will do! Backing the 1.3.8.5 commits out one at a time was gonna be my next play, but thanks for getting to it ahead of me and knowing which ones to try first :P

Comment 20 Adam Williamson 2018-08-24 20:00:00 UTC
OK. I did horrible things to openQA to make it test those builds, and it looks like the test *fails* with the first one (790664):

https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=27&build=RHBZ1620315-790664-NOREPORT&groupid=2

but *passes* with the second one (790627):

https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=27&build=RHBZ1620315-790267-NOREPORT&groupid=2

note I mistyped the ID as '790267' there, but fortunately I typo'ed it consistently but used the correct URLs, so the test did in fact download and use the build, as you can see:

https://openqa.stg.fedoraproject.org/tests/344487#step/_advisory_update/23

so, it looks like the change from #c17 fixes things.

Comment 21 thierry bordaz 2018-08-27 14:32:07 UTC
Here is an example of the failure/success taken from the attached logs. 

    FAILURE (role_deploy_domain_controller_check-var_log.tar.gz)
    [23/Aug/2018:15:46:32.800472178 -0400] conn=88 op=0 BIND dn="" method=sasl version=3 mech=GSS-SPNEGO
    [23/Aug/2018:15:46:32.807363447 -0400] conn=88 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0007212877 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
    ...
    [23/Aug/2018:15:46:32.825391581 -0400] conn=88 op=4 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"
    [23/Aug/2018:15:46:32.825653982 -0400] conn=88 op=4 RESULT err=0 tag=101 nentries=0 etime=0.0000346845
    [23/Aug/2018:15:46:32.826549921 -0400] conn=88 op=5 EXT oid="1.3.6.1.4.1.4203.1.11.3" name="whoami-plugin"                          
    [23/Aug/2018:15:46:32.826663836 -0400] conn=88 op=5 RESULT err=0 tag=120 nentries=0 etime=0.0000292653
     
     
     
    [Thu Aug 23 15:46:32.816961 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: DEBUG: config_show(rights=False, all=False, raw=False, version='2.229')
    [Thu Aug 23 15:46:32.825950 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: INFO: test3: batch: config_show(): EmptyResult            
    [Thu Aug 23 15:46:32.826114 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: DEBUG: raw: whoami(version='2.229')
    [Thu Aug 23 15:46:32.826234 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: DEBUG: whoami(version='2.229')
     
     
    -----------------------------------------------------------------------------------------
     
    SUCCESS (role_deploy_domain_controller_check-var_log.tar.gz)
    [23/Aug/2018:16:37:45.128176099 -0400] conn=88 op=0 BIND dn="" method=sasl version=3 mech=GSS-SPNEGO
    [23/Aug/2018:16:37:45.133772003 -0400] conn=88 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0005853464 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
    ...
    [23/Aug/2018:16:37:45.151352349 -0400] conn=88 op=4 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"                                                                                                  
    [23/Aug/2018:16:37:45.151769181 -0400] conn=88 op=4 RESULT err=0 tag=101 nentries=1 etime=0.0000526557
    ..
    [23/Aug/2018:16:37:45.155848326 -0400] conn=88 op=7 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"
    [23/Aug/2018:16:37:45.156191319 -0400] conn=88 op=7 RESULT err=0 tag=101 nentries=1 etime=0.0000444405
    ...
    [23/Aug/2018:16:37:45.159116377 -0400] conn=88 op=10 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"                                                                                                  
    [23/Aug/2018:16:37:45.159579732 -0400] conn=88 op=10 RESULT err=0 tag=101 nentries=1 etime=0.0000615801
    [23/Aug/2018:16:37:45.161628540 -0400] conn=88 op=11 EXT oid="1.3.6.1.4.1.4203.1.11.3" name="whoami-plugin"
    [23/Aug/2018:16:37:45.161700145 -0400] conn=88 op=11 RESULT err=0 tag=120 nentries=0 etime=0.0000154022
     
     
     
    [Thu Aug 23 16:37:45.143757 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: DEBUG: config_show(rights=False, all=False, raw=False, version='2.229')
    [Thu Aug 23 16:37:45.161018 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: INFO: test3: batch: config_show(): SUCCESS
    [Thu Aug 23 16:37:45.161220 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: DEBUG: raw: whoami(version='2.229')
    [Thu Aug 23 16:37:45.161358 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: DEBUG: whoami(version='2.229')


So the failure is that the following search returns no entry

ldapsearch -D "uid=test3,cn=users,cn=accounts,dc=domain,dc=local" -W -b "cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" -s one "(objectClass=ipaConfigObject)" cn

Comment 23 Adam Williamson 2018-08-27 17:20:12 UTC
Seems similar, but there's also some discussion in https://pagure.io/389-ds-base/issue/49443 specifically about how things are different between 'master' (at that time) and the 1.3.7 branch...

Comment 24 Ben Cotton 2018-11-27 14:44:55 UTC
This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30  Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora  'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 27 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 25 mreynolds 2018-11-27 15:05:29 UTC
This is fixed in the current release after backing out a "search filter enhancement", closing...


Note You need to log in before you can comment on or make changes to this bug.