Bug 2214399
| Summary: | regression: autofs fails to renew kerberos ticket | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Ian Collier <imc> |
| Component: | cyrus-sasl | Assignee: | Jakub Jelen <jjelen> |
| Status: | ASSIGNED --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 38 | CC: | anon.amish, crypto-team, ikent, jjelen, ssorce |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | --- | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Ian Collier
2023-06-12 20:37:45 UTC
This is a bit of a puzzle because, as far as I can see, there haven't been any changes to the ldap code between f36 and f38. To start with I could add the latest upstream ldap changes and see if that makes any difference ... Can you provide /etc/autofs_ldap_auth.conf and /etc/autofs.conf please. (In reply to Ian Kent from comment #2) > Can you provide /etc/autofs_ldap_auth.conf and /etc/autofs.conf please. Could you also provide a log of a successful authentication. Both krb5-libs and cyrus-sasl-gssapi have increased version numbers in F38 and I'm guessing the problem actually lies in one of those libraries. I'm going to poke around a bit and see if I can spot anything. A successful authentication when automount starts up starts like this, and basically all that's different is that it carries on where the above log just stops with an error. do_bind: lookup(ldap): auth_required: 2, sasl_mech GSSAPI sasl_do_kinit: initializing kerberos ticket: client principal host/xyz@realm sasl_do_kinit: calling krb5_parse_name on client principal host/xyz@realm sasl_do_kinit: Using tgs name krbtgt/realm@realm sasl_do_kinit: Kerberos authentication was successful! sasl_bind_mech: Attempting sasl bind with mechanism GSSAPI sasl_log_func: GSSAPI client step 1 getuser_func: called with context (nil), id 16385. sasl_log_func: GSSAPI client step 1 getuser_func: called with context (nil), id 16385. sasl_log_func: GSSAPI client step 2 sasl_bind_mech: sasl bind with mechanism GSSAPI succeeded do_bind: lookup(ldap): autofs_sasl_bind returned 0 I've unpacked the F36 package of krb5-libs-1.19.2 into a directory and pointed LD_LIBRARY_PATH at that directory, and now autofs works. So now the question is: bug in krb5-libs, or is it a deliberate API change? I don't understand what's going on under the covers well enough to know, but I do note there are a couple of ccache changes noted in the release notes. For example: https://github.com/krb5/krb5/commit/3f5a348287646d65700854650fe668b9c4249013 It might be worth asking the Kerberos maintainer, you can always send it back to me. Any idea what I might be doing wrong in autofs here Jakub? There were changes since Fedora 36, but I was not following the much. Most of the work was done by Simo including the rebase and some CVE fixes. Do you have idea what could have gone wrong? @imc.ac.uk did you manually kinit at some point? Auto-initiation generally requires the first ticket to be obtained through auto-initiation itself, if it is manually fetched, auto-initiation may fail later. Have you tried a full kdestroy of the ccache used and then letting autofs retry on its own? > did you manually kinit at some point?
Yes and no. I am required to have a valid TGT in order to be logged in to the machine and have access to NFS shares, but the credential cache for this is not shared with autofs.
I think this probably needs someone to read the sasl_do_kinit function from autofs's modules/cyrus-sasl.c source file and see if it's doing the right thing. The short version is that it is using a memory cache ("MEMORY:_autofstkt"); it gets credentials for principal host/myhost@MYREALM using the system's /etc/krb5.keytab, stores them to the cache and sets the environment variable KRB5CCNAME to this cache.
In Fedora 38 this works the first time and then something is failing for the next do_kinit after the first ticket expired. So possibly the question is: what happens when krb5_cc_store_cred is called to store credentials in a memory cache that already contains an (expired) ticket - and has this changed?
For debugging purposes I edited the cache name in autofs to "FILE:/tmp/_autofstkt" and this actually works. Now, since storing the creds in a file allows me to inspect the file to see what tickets are in there, I can see that what autofs is doing is appending the new ticket to the existing tickets in the file - I can still see the previously expired tickets in there.
So as far as autofs is concerned, maybe the correct behaviour would be to destroy the memory cache before putting the renewed ticket into it. If the previous expired tickets are being kept then that might constitute a memory leak. But if autofs is not doing anything *wrong* per se then maybe there's a krb5 bug with regard to how it treats a memory cache that contains both an expired and a valid ticket. (However, this is speculation on my part because I don't know what's actually in the memory cache after autofs renews its ticket.)
A memory ccache has 2 issues: 1. it is only available within the process, if any computation is deferred to another process it will be unavailable (or stale if fork happens before renew). 2. it is a cache collection, so if another library linked in the same process somehow uses it, the process may find that a cache for the target already exist and decide to use it. Not sure what krb5_cc_store_cred(0 will do, I would assume it overwrites. OTOH maybe sasl should stop doing kinit on its own, given GSSAPI is perfectly capable of doing initiation if it is provided with a client keytab. Could you please set the environment variable KRB5_TRACE to a file in the autofs process and then capture its output when the operation fails? This may give us a clue about what operation fails hopefully. May be useful for comparison to have a trace taken when it suceeds with a file cache. In another test I've edited the cache name to DIR:/tmp/_autotest and now autofs fails whenever it has to renew the ticket. DIR is of course a collection, but there is only ever one cache in the collection; this cache contains multiple tickets. The ticket lifetime on this system is set to 20 minutes for quicker testing.
So if I klist the cache I can see this sort of thing:
Valid starting Expires Service principal
06/14/23 12:18:04 06/14/23 12:38:04 krbtgt/MYREALM@MYREALM
06/14/23 12:18:04 06/14/23 12:38:04 ldap/ipaserver1@
Ticket server: ldap/ipaserver1@MYREALM
06/14/23 12:18:05 06/14/23 12:38:05 krbtgt/MYREALM@MYREALM
06/14/23 12:18:05 06/14/23 12:38:04 ldap/ipaserver2@
Ticket server: ldap/ipaserver2@MYREALM
06/14/23 12:35:42 06/14/23 12:55:42 krbtgt/MYREALM@MYREALM
06/14/23 12:39:50 06/14/23 12:59:50 krbtgt/MYREALM@MYREALM
06/14/23 12:43:29 06/14/23 13:03:29 krbtgt/MYREALM@MYREALM
06/14/23 13:33:43 06/14/23 13:53:43 krbtgt/MYREALM@MYREALM
06/14/23 14:31:21 06/14/23 14:51:21 krbtgt/MYREALM@MYREALM
And if I use this cache to do an ldapsearch at 14:40 while the last ticket is valid, it fails:
$ ldapsearch -Y GSSAPI
SASL/GSSAPI authentication started
ldap_sasl_interactive_bind: Local error (-2)
additional info: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Ticket expired)
This same copy of autofs running against krb5 version 1.19, which doesn't output the "Ticket expired" error, still records multiple tickets in the cache. This points to a difference with respect to retrieving the credentials from a cache collection that contains both valid and expired tickets. However, if I run ldapsearch with that cache then it still fails (ldapsearch is not linked against libkrb5.so.3).
Created attachment 1970873 [details]
KRB5_TRACE log of a failed lookup [mildly redacted]
Created attachment 1970874 [details]
KRB5_TRACE log of a successful lookup [mildly redacted]
This log is taken in the same circumstances as the failed log, except that LD_LIBRARY_PATH points to the libraries from krb5-libs-1.19.
To my mind the only significant difference is that this has extra lines at the end, beginning "Starting with TGT for client realm". Up to that point both logs show a successful request for a TGT, then storing this in the ccache, then later retrieving a TGT from the ccache. The failure log then stops abruptly while the successful log continues to request the service ticket.
My conclusion then is that the TGT that was successfully retrieved from the ccache in the failure case was an expired ticket.
So my proposed fix to autofs, regardless of whether krb5 is doing the right thing here, would be where it says:
if (krb5cc_in_use++ == 0)
/* tell the cache what the default principal is */
ret = krb5_cc_initialize(ctxt->krb5ctxt,
ctxt->krb5_ccache, krb5_client_princ);
to call krb5_cc_initialize every time and not just the first time.
(In reply to Simo Sorce from comment #9) > @imc.ac.uk did you manually kinit at some point? It's been quite a while since I've worked on this. But it's unavoidable that this must work (and has worked) without the admin doing a kinit. There are cases where, if an external credential cache is used, it needs to be setup with a key for the principal but that's not what's being used here. It's automount(8) that must setup what is needed itself. I did write some automated tests for this functionality some time ago and I can go through those to get details of what I did if that will be useful. In any case this is renewing an expired key that's previously authenticated ok. Be aware that, while I have tweaked the autofs Kerberos code from time to time I didn't write it initially myself, so I'm not intimately familiar with the way each of these functions work. @ikent would you be willing to patch the sasl_do_kinit function in autofs/modules/cyrus-sasl.c so that it calls krb5_cc_initialize every time it does a kinit and not just the first time? This does fix the problem. Rationale: 1. The documentation of this function admits the possibility of calling it on an existing cache and says that it destroys the contents. 2. If we are in sasl_do_kinit then it means our existing tickets are no longer valid, so there is no value in keeping them. Thus, clearing the current contents of the cache before storing the new ticket is a desirable outcome. 3. It would be consistent with the behaviour of the 'kinit' CLI program which does seem to destroy all existing tickets in the currently selected cache when used to create or renew a ticket. The alternatives are (a) refactoring this whole code so that it uses exclusively gss+sasl calls instead of low-level krb5 calls (which I don't think is necessary at all, though it might be theoretically cleaner), or (b) asking the krb5 developers whether the ccache is behaving correctly in this case. However, the change from krb5-1.19 to krb5-1.20 actually brings memory ccaches into line with other types, in the following way: I wrote a test program that mimics what autofs does in starting a sasl client using quickly-expiring tickets. It does this three times in such a way that the original ticket will have expired by the time the third connection is requested. Whether this worked can be summarised by the following table. cc-type F36 F38 MEMORY y n DIR n n FILE n n KEYRING n n Thus, the fact that it worked with memory ccaches in Fedora 36 was somewhat of an outlier. In all cases (except that the memory ccache isn't visible from outside the process and so can't be tested) the ccache stored all previous and current tickets. In the case of the keyring ccache, the kernel apparently deleted the ccache as soon as the first ticket expired, and so the error message for the third access was "No credentials cache found"; for all other failures the error message was "Ticket expired". (In reply to Ian Collier from comment #17) > @ikent would you be willing to patch the sasl_do_kinit function > in autofs/modules/cyrus-sasl.c so that it calls krb5_cc_initialize every > time it does a kinit and not just the first time? This does fix the problem. Yes, but I'm not sure how multiple concurrent clients will deal with the approach. > > Rationale: > 1. The documentation of this function admits the possibility of calling it > on an existing cache and says that it destroys the contents. > 2. If we are in sasl_do_kinit then it means our existing tickets are no > longer valid, so there is no value in keeping them. Thus, clearing the > current contents of the cache before storing the new ticket is a desirable > outcome. > 3. It would be consistent with the behaviour of the 'kinit' CLI program > which does seem to destroy all existing tickets in the currently selected > cache when used to create or renew a ticket. A single threaded program, not sure if that really does compare with the autofs use case. > > The alternatives are (a) refactoring this whole code so that it uses > exclusively gss+sasl calls instead of low-level krb5 calls (which I don't > think is necessary at all, though it might be theoretically cleaner), or (b) > asking the krb5 developers whether the ccache is behaving correctly in this > case. However, the change from krb5-1.19 to krb5-1.20 actually brings > memory ccaches into line with other types, in the following way: or (c) ask the krb5 developers how multiple threads each destroying the ccache could affect observed behaviour. Perhaps all we need to do is come up with a strategy to not destroy the ccache if it has a valid key in it ... The other thing is now that I think about it I'm not sure the code is properly handling multiple threads anyway ... I'm not entirely clear on how that is really meant to work so maybe there's something to be learned as well. > > I wrote a test program that mimics what autofs does in starting a sasl > client using quickly-expiring tickets. It does this three times in such a > way that the original ticket will have expired by the time the third > connection is requested. Whether this worked can be summarised by the > following table. With multiple threads concurrently doing this? At the very least the cache will be deleted more times than it needs to be and could it be possible for it to be deleted while another thread is using it? I don't think this could happen but maybe my understanding is wrong! > > cc-type F36 F38 > MEMORY y n > DIR n n > FILE n n > KEYRING n n > > Thus, the fact that it worked with memory ccaches in Fedora 36 was somewhat > of an outlier. In all cases (except that the memory ccache isn't visible > from outside the process and so can't be tested) the ccache stored all > previous and current tickets. In the case of the keyring ccache, the kernel > apparently deleted the ccache as soon as the first ticket expired, and so > the error message for the third access was "No credentials cache found"; for > all other failures the error message was "Ticket expired". This might be useful info. since we do need to make it work which means improved (different) error handling too. Ian Yes you are correct that I did not test in a highly multi-threaded environment. I can see that there might be a race condition where one thread might clear the cache while another is trying to look at it. One thing we can do (as you noted above) is try to pull a cred out of the cache, and if it's valid, don't destroy the cache. Another thing we could do is instead of destroying the cache, enumerate it and remove any creds that were found to be expired. I'm experimenting with that in the test program. Created attachment 1973430 [details]
proof-of-concept patch which clears the cache if an expired ticket can be found in it
Currently running with the attached patch which has fixed the problem (but
on a lightly-loaded workstation).
The following changes are not strictly relevant:
- Moved cache initialization to before acquiring the ticket (that might not
have been necessary, to be fair - I was going to reuse the my_creds struct
and then ended up not doing)
- Bracketed and indented the error handling for the initialization
- If cache failed to initialize, reverse the increment to krb5cc_in_use
which seems like it would be the right thing to do
Then this was added:
- If the cache is already in use, try to pull out a cred and then if that
was successful and the cred is expired, clear the cache.
So this fixes the behaviour I was seeing, since that was happening because
expired creds were being pulled out of the cache and in that case the patched
version clears the cache to remove the expired creds.
What sort of race conditions might happen here?
- If the function is called very late during the validity of a ticket, it
might expire after the decision not to clear the cache. In that case,
the behaviour is the same as the unpatched version, but this is highly
unlikely because do_kinit is not supposed to happen while there is a
valid ticket.
- If two or more threads decide to call do_kinit at about the same time:
it's protected by a mutex, so one of the calls will happen first; this
call will clear the cache and add a new ticket. When the others kick
in, the cache won't be cleared because it's only cleared if we can
find an expired ticket in the cache and any such ticket was removed
when the first do_kinit happened.
- If one thread does do_kinit while another thread is trying to do a lookup:
if the current ticket is expired then the lookup would have failed anyway;
if it's not expired then we won't clear the cache.
- If there is both an expired and a valid ticket in the cache:
this only happens if two or more do_kinits clashed and stored tickets
with different expiration times, and if the current time is between those
times. The current bug happens because krb5 cache retrieval is returning
the earliest (i.e. expired) ticket. When that's the case then do_kinit
will clear the cache because when it tests the cache it will pull the
expired cred - and it needs to do this because otherwise all lookups are
failing (that's the bug). In a case where krb5 cache retrieval returns
the valid ticket, it doesn't matter that the cache is not cleared because
any subsequent lookups will use that valid ticket.
Did I miss any?
(In reply to Ian Collier from comment #20) > > - If one thread does do_kinit while another thread is trying to do a lookup: > if the current ticket is expired then the lookup would have failed anyway; > if it's not expired then we won't clear the cache. This is the case I was wondering about. I don't know what happens in this case, if the initialisation could cause a problem it will likely result in a failed lookup. Anyway I'll turn this into a patch and post it for your review. |