In automated (openQA) testing, FreeIPA client tests often seem to fail with 389-ds-base 188.8.131.52 and 184.108.40.206 (in Fedora 27 update testing). There are failures in Rawhide/F29 testing (with 220.127.116.11) which may *possibly* be the same, but I'm not 100% sure of that yet.
For F27, the test reliably affected is a complex one which does a lot of stuff. It starts by enrolling as a client via cockpit (which just runs realmd under the covers), then tests logging into the web UI (as admin) and doing some stuff in it, including creating a user called 'test3'. It creates an HBAC rule allowing test3 to login anywhere, then quits to console and kinits as test3 and sets a permanent password.
Next it tests that it can log in as test3 at the console (which seems to always work). Then it tests logging into the web UI as test3 and changing its password. The test reliably fails at this point: the attempt to log in to the web UI as test3 fails. The web UI shows a dialog titled "Operations Error" which says "Some operations failed.", with a 'Show details' link. If I have the test click the 'Show details' link, it says "no matching entry found".
These results seem highly reproducible in testing of the 18.104.22.168 and 22.214.171.124 updates on F27 (testing on F29 and Rawhide is more of a mess ATM as other bugs frequently get in the way). Here's a recent run of the tests on the 126.96.36.199 F27 update:
You can see the failed 'realmd_join_cockpit' test: https://openqa.stg.fedoraproject.org/tests/343612
Note the red-outlined thumbnails - if you expand these you see that it failed on logging into the web UI during the password change test (as described above); the password change test is marked as 'non-fatal', so the run continues after it fails. It also fails later when attempting to log in to the console as 'test1', which should also work.
We can see other recent results of the same test here:
There are only failures from the 389-ds-base 188.8.131.52 and 184.108.40.206 updates, plus - on staging - a failure caused by a bug in a fedora-repos update (bad GPG key) and a transient failure which went away on restart (the fail for FEDORA-2018-e952086dda; note there's a pass for the same update right after it). This shows that the *same test* passes when run against any other non-broken F27 update, indicating very strongly that the bug really is caused by the 389-ds-base update somehow. Also note that the test has been run at least 4 times on each openQA instance (because we automatically re-run any failed test in case it was a blip, and there have been two updates, FEDORA-2018-a45016e03f (220.127.116.11) and FEDORA-2018-3c876babb9 (18.104.22.168)); there've been more than 4 tries on staging as I've re-run the test manually a few times.
I will attach logs from client and server, but there's nothing immediately outstanding in any of them, at least that I can see. Note the way ipa-dnskeysyncd crashes in a loop doesn't seem to be the cause of this bug - that also happens on all the other updates where the test *passes*.
The web UI failure message is somewhat interesting. FreeIPA 4.6.3 - the version in Fedora 27 at present - only raises that error in two places, ipalib/install/certstore.py `get_ca_certs` and ipapython/ipaldap.py `find_entries`. That's probably a useful avenue for further investigation (figure out which of those is likely getting used during web UI user login, and why a 389-ds-base update might cause that to break). I could probably re-run the test with a hack to FreeIPA to make it use a slightly different log message in those two places, so we know which one we're hitting...
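The disambiguation hack could be as simple as giving each raise site a distinct marker in the exception message. A minimal sketch of the idea - `EmptyResult` here is a stand-in for `ipalib.errors.EmptyResult`, and the two functions are hypothetical placeholders for the real raise sites, not FreeIPA's actual code:

```python
class EmptyResult(Exception):
    """Stand-in for ipalib.errors.EmptyResult (hypothetical simplification)."""

def get_ca_certs_site():
    # hypothetical tag for the ipalib/install/certstore.py raise site
    raise EmptyResult("no matching entry found [certstore.get_ca_certs]")

def find_entries_site():
    # hypothetical tag for the ipapython/ipaldap.py raise site
    raise EmptyResult("no matching entry found [ipaldap.find_entries]")

# whichever tag shows up in the httpd error log tells us which path we hit
for site in (get_ca_certs_site, find_entries_site):
    try:
        site()
    except EmptyResult as exc:
        print(exc)
```

The web UI would then surface the tagged message, so a single failed run identifies the code path without a debugger on the server.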
Created attachment 1478020 [details]
server logs (tarball of /var/log from the server)
Created attachment 1478021 [details]
client logs (tarball of /var/log from the client)
dirsrv logs seem to be missing from the tarball, without them it's hard to say why the operation failed.
In 22.214.171.124 there was a filter optimization fix; 126.96.36.199 only backs out the fix for https://pagure.io/389-ds-base/issue/49789, so I wonder if 188.8.131.52 failed too or whether it only started with 184.108.40.206.
The log tarballs contain the whole of /var/log ; if there's nothing in /var/log/dirsrv in the tarball, there wasn't anything in it on the host. The logs in question may be in the journal; did you check the journal? (You can read it with journalctl --file , the files are under var/log/journal ).
The deal with 220.127.116.11 and 18.104.22.168 is that the thing backed out in 22.214.171.124 actually broke FreeIPA server deployment. So I can't really *prove* that this bug existed in 126.96.36.199 because the tests can't get that far, as the server deployment fails. However, since all 188.8.131.52 did was revert that problematic 'fix', logically speaking it seems almost certain that one of the *other* things in 184.108.40.206 (the ones that weren't reverted in 220.127.116.11) is what causes this bug.
Very odd there is no logging for DS. Do you know if the logging was disabled intentionally?
In the journal I do see the server's error log content but not the access log, and it's the access log that we need. Is this a system I can look at right now?
No, the systems are spun up on demand and destroyed right after the test. But I can tweak things and re-run them. It should also be possible to recreate the test manually, though I have a bit of trouble doing so here as https://bugzilla.redhat.com/show_bug.cgi?id=1484130 *still* does not appear to be fixed, so I cannot use disk images on a shared drive, and I don't really have enough space on my local system for disks for the two VMs I usually use to recreate this kind of issue :/
Where would ds logging be configured? If you tell me where to look, I can get the test to tell us what's there, and change it if necessary.
I'm on #freeipa on Freenode if you want to discuss this there...
Oh, and the tests don't do anything to reduce logging detail (they do a few things to try and get *more* logs). If there isn't enough detail coming out of the logs, that's the way things are set up by default, apparently.
Created attachment 1478337 [details]
updated server logs (with dirsrv)
Turns out the server logs being captured after FreeIPA decommissioning meant dirsrv logs were removed; here's an updated tarball with the logs captured before decommissioning, so dirsrv logs are present. The failed login as test3 happens at 19:46:32.0522 UTC (15:46 in the dirsrv logs) and the failed console login as test1 at 19:48:44.
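For correlating those timestamps: the test harness reports UTC, while the dirsrv access log stamps carry a -0400 offset, so a quick conversion (assuming that offset holds throughout) lines them up:

```python
from datetime import datetime, timedelta, timezone

# failed web UI login as test3, per the openQA timestamps (UTC)
utc_fail = datetime(2018, 8, 23, 19, 46, 32, tzinfo=timezone.utc)

# the dirsrv access log stamps in the tarball carry a -0400 offset
dirsrv_tz = timezone(timedelta(hours=-4))

# render in the access log's own timestamp style
print(utc_fail.astimezone(dirsrv_tz).strftime("%d/%b/%Y:%H:%M:%S"))
# -> 23/Aug/2018:15:46:32
```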
Created attachment 1478355 [details]
server logs from a passed run (same test against a different update)
Here's the same server log tarball from a case where the tests *passed* - this is the exact same test run against a different update, so 389-ds 18.104.22.168 is used. Comparing logs from passed and failed cases may help, I guess. This is with the tweak to ensure dirsrv logs are included. In these logs, the failed attempt to log in to the web UI occurs at 2018-08-23T20:37:44.0902 UTC .
Err...I obviously meant that the *successful* attempt to log in to the web UI occurred at 2018-08-23T20:37:44.0902 UTC :)
A quick look through the logs is not too revealing. I see several successful binds as test3:
[23/Aug/2018:16:38:19.685947465 -0400] conn=101 op=0 BIND dn="" method=sasl version=3 mech=GSS-SPNEGO
[23/Aug/2018:16:38:19.690602686 -0400] conn=101 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0004827704 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
I don't see any errors. In the successful log I see "more" successful binds than in the failing log. I also see a direct (non-sasl) bind of the user in the successful log which I do not see in the failure log:
[23/Aug/2018:16:38:19.291538409 -0400] conn=96 fd=122 slot=122 connection from local to /var/run/slapd-DOMAIN-LOCAL.socket
[23/Aug/2018:16:38:19.291973196 -0400] conn=96 op=0 BIND dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local" method=128 version=3
[23/Aug/2018:16:38:19.292841696 -0400] conn=96 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0001134752 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
[23/Aug/2018:16:38:19.293035197 -0400] conn=96 op=1 UNBIND
[23/Aug/2018:16:38:19.293068018 -0400] conn=96 op=1 fd=122 closed - U1
Perhaps there is an IPA log that might have more information about why it's failing?
I will also do a more fine grained log analysis tomorrow.
Yeah, that's more or less what I saw. Clearly more 'stuff' happens in the successful login case, but there's no clear reason why the failed one just apparently gives up - the transactions right before it gives up look just the same as in the successful case.
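To make the pass/fail comparison less eyeball-driven, something like this could pull (conn, op, operation) tuples out of the two access logs for diffing - a rough sketch, and the regex only covers operation lines shaped like the excerpts above (connection-open lines without op= are skipped):

```python
import re

# matches 389-ds access log operation lines like:
# [23/Aug/2018:16:38:19.291973196 -0400] conn=96 op=0 BIND dn="..." method=128
LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] conn=(?P<conn>\d+) op=(?P<op>-?\d+) (?P<rest>.*)'
)

def parse_access_line(line):
    """Return (conn, op, operation-keyword) for an op line, or None."""
    m = LOG_RE.match(line)
    if not m:
        return None
    return int(m.group('conn')), int(m.group('op')), m.group('rest').split()[0]

sample = [
    '[23/Aug/2018:16:38:19.291973196 -0400] conn=96 op=0 BIND '
    'dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local" method=128 version=3',
    '[23/Aug/2018:16:38:19.293035197 -0400] conn=96 op=1 UNBIND',
]
print([parse_access_line(line) for line in sample])
# -> [(96, 0, 'BIND'), (96, 1, 'UNBIND')]
```

Running this over both tarballs and diffing the tuple sequences would show exactly where the failing run stops short.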
I'm currently fiddling with making the test hack up the relevant bits of FreeIPA to log harder; will report any findings from that.
Welp, so far, I know for sure that the error is happening here:
I hacked up the routines there with some debugging statements, you can see the adjusted copy of the file with the added logging lines at https://www.happyassassin.net/temp/ipaldap.py - search for 'XXX'. Here are logs from passed and failed attempts with those changes:
The attempts to log in to the web UI happened at 05:00:02 (pass) and 04:59:10 (fail). The log messages are in var/log/httpd/error.log (yeah, all IPA web UI log messages appear to go to error.log, even 'info' and 'debug' messages).
Welp, that's fun - the 'success' case logs 'XXX raising EmptyResult error' (which is the log message I added right before the exception that seems to produce the "no matching entry found" error we see in the web UI) 9 times, the 'failure' case logs it 3 times. I'm done trying to figure out what the hell's going on here for tonight...
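For the record, the 9-vs-3 numbers are just occurrence counts of the added marker line in httpd error.log; nothing fancier than this (the marker string is the one from the hacked-up ipaldap.py):

```python
def count_marker(log_text, marker="XXX raising EmptyResult error"):
    """Count lines in an httpd error.log dump containing the debug marker."""
    return sum(marker in line for line in log_text.splitlines())

# synthetic example, not real log content
fake_log = "\n".join(
    ["ipa: DEBUG: XXX raising EmptyResult error"] * 3
    + ["ipa: DEBUG: whoami(version='2.229')"]
)
print(count_marker(fake_log))
# -> 3
```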
I'm starting to wonder if this exception is a red herring, and what's really going on is the login process just gets...aborted, or cancelled, somehow, partway through, and *because* that happened, the web UI shows us some exception that was raised...but actually gets raised on the 'success' path too, we just don't get a message about it because the login succeeds. But really not sure. It really seems like the login process just kinda...stops, partway through. The dirsrv logs, httpd logs and even the system logs (there are way more occurrences of 'GSSAPI client step 1' in the journal around the time of the web UI login in the 'success' logs than the 'failure' logs) all look like that. But I can't find any smoking gun explaining *why*.
We know there is an issue with one of the fixes in 22.214.171.124 around complex filters and access control. So I backed out the filter fix and did a copr build:
If it still fails there is one more candidate fix that might be related that I can back out as well, but let's see how this build reacts.
Looking at the logs again I see that IPA is using GER (getEffectiveRights), and that behavior is different in the failure logs. We did have a complex fix around GER that went into 126.96.36.199. So I did a separate copr build with just the GER fix removed. I suspect this might actually be the bug, and not the filter fix from the first copr build I did.
So please try this copr build first (GER fix removed):
Then if there is still a problem try this one (from comment 17):
Will do! Backing the 188.8.131.52 commits out one at a time was gonna be my next play, but thanks for getting to it ahead of me and knowing which ones to try first :P
OK. I did horrible things to openQA to make it test those builds, and it looks like the test *fails* with the first one (790664):
but *passes* with the second one (790627):
note I mistyped the ID as '790267' there, but fortunately I typo'ed it consistently and used the correct URLs, so the test did in fact download and use the build, as you can see:
so, it looks like the change from #c17 fixes things.
Here is an example of the failure/success taken from the attached logs.
[23/Aug/2018:15:46:32.800472178 -0400] conn=88 op=0 BIND dn="" method=sasl version=3 mech=GSS-SPNEGO
[23/Aug/2018:15:46:32.807363447 -0400] conn=88 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0007212877 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
[23/Aug/2018:15:46:32.825391581 -0400] conn=88 op=4 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"
[23/Aug/2018:15:46:32.825653982 -0400] conn=88 op=4 RESULT err=0 tag=101 nentries=0 etime=0.0000346845
[23/Aug/2018:15:46:32.826549921 -0400] conn=88 op=5 EXT oid="1.3.6.1.4.1.4203.1.11.3" name="whoami-plugin"
[23/Aug/2018:15:46:32.826663836 -0400] conn=88 op=5 RESULT err=0 tag=120 nentries=0 etime=0.0000292653
[Thu Aug 23 15:46:32.816961 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: DEBUG: config_show(rights=False, all=False, raw=False, version='2.229')
[Thu Aug 23 15:46:32.825950 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: INFO: test3: batch: config_show(): EmptyResult
[Thu Aug 23 15:46:32.826114 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: DEBUG: raw: whoami(version='2.229')
[Thu Aug 23 15:46:32.826234 2018] [wsgi:error] [pid 7726:tid 139681378449152] [remote 10.0.2.102:37654] ipa: DEBUG: whoami(version='2.229')
[23/Aug/2018:16:37:45.128176099 -0400] conn=88 op=0 BIND dn="" method=sasl version=3 mech=GSS-SPNEGO
[23/Aug/2018:16:37:45.133772003 -0400] conn=88 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0005853464 dn="uid=test3,cn=users,cn=accounts,dc=domain,dc=local"
[23/Aug/2018:16:37:45.151352349 -0400] conn=88 op=4 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"
[23/Aug/2018:16:37:45.151769181 -0400] conn=88 op=4 RESULT err=0 tag=101 nentries=1 etime=0.0000526557
[23/Aug/2018:16:37:45.155848326 -0400] conn=88 op=7 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"
[23/Aug/2018:16:37:45.156191319 -0400] conn=88 op=7 RESULT err=0 tag=101 nentries=1 etime=0.0000444405
[23/Aug/2018:16:37:45.159116377 -0400] conn=88 op=10 SRCH base="cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" scope=1 filter="(objectClass=ipaConfigObject)" attrs="cn"
[23/Aug/2018:16:37:45.159579732 -0400] conn=88 op=10 RESULT err=0 tag=101 nentries=1 etime=0.0000615801
[23/Aug/2018:16:37:45.161628540 -0400] conn=88 op=11 EXT oid="1.3.6.1.4.1.4203.1.11.3" name="whoami-plugin"
[23/Aug/2018:16:37:45.161700145 -0400] conn=88 op=11 RESULT err=0 tag=120 nentries=0 etime=0.0000154022
[Thu Aug 23 16:37:45.143757 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: DEBUG: config_show(rights=False, all=False, raw=False, version='2.229')
[Thu Aug 23 16:37:45.161018 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: INFO: test3: batch: config_show(): SUCCESS
[Thu Aug 23 16:37:45.161220 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: DEBUG: raw: whoami(version='2.229')
[Thu Aug 23 16:37:45.161358 2018] [wsgi:error] [pid 7728:tid 140691395168000] [remote 10.0.2.102:51334] ipa: DEBUG: whoami(version='2.229')
So the failure is that the following search returns no entries:
ldapsearch -D "uid=test3,cn=users,cn=accounts,dc=domain,dc=local" -W -b "cn=masters,cn=ipa,cn=etc,dc=domain,dc=local" -s one "(objectClass=ipaConfigObject)" cn
https://bugzilla.redhat.com/show_bug.cgi?id=1620315#c21 seems to match https://pagure.io/389-ds-base/issue/49443 - https://bugzilla.redhat.com/show_bug.cgi?id=1514051
Seems similar, but there's also some discussion in https://pagure.io/389-ds-base/issue/49443 specifically about how things are different between 'master' (at that time) and the 1.3.7 branch...
This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30 Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '27'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 27 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version prior to this bug being closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
This is fixed in the current release after backing out a "search filter enhancement", closing...