Bug 2214399

Summary:

regression: autofs fails to renew kerberos ticket

Product:

[Fedora] Fedora

Reporter:

Ian Collier <imc>

Component:

autofs

Assignee:

Ian Kent <ikent>

Status:

CLOSED EOL

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

CC:

anon.amish, crypto-team, ikent, jjelen, ssorce

Target Milestone:

---

Keywords:

Regression, Reopened

Target Release:

---

Flags:

fedora-admin-xmlrpc: mirror+

Hardware:

Unspecified

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2024-11-27 21:13:20 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
KRB5_TRACE log of a failed lookup [mildly redacted]	none
KRB5_TRACE log of a successful lookup [mildly redacted]	none
proof-of-concept patch which clears the cache if an expired ticket can be found in it	none

Description Ian Collier 2023-06-12 20:37:45 UTC

Just upgraded a working Fedora 36 system to Fedora 38, and now autofs stops working every 8 hours (which coincidentally is the default ticket lifetime configured in /etc/krb5.conf).
The remedy so far has been to restart the service.

This happens:
do_bind: lookup(ldap): auth_required: 2, sasl_mech GSSAPI
sasl_do_kinit: initializing kerberos ticket: client principal host/xyz@realm
sasl_do_kinit: calling krb5_parse_name on client principal host/xyz@realm
sasl_do_kinit: Using tgs name krbtgt/realm@realm
sasl_do_kinit: Kerberos authentication was successful!
sasl_bind_mech: Attempting sasl bind with mechanism GSSAPI
sasl_log_func: GSSAPI client step 1
getuser_func: called with context (nil), id 16385.
sasl_log_func:128: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)
sasl_client_start failed for ipa-server
sasl_bind_mech: sasl_client_start: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)
do_bind: lookup(ldap): autofs_sasl_bind returned -1
lookup(ldap): couldn't connect to server ldap://ipa-server
do_reconnect: lookup(ldap): failed to find available server
lookup(ldap): lookup for imc failed: connection failed
key "imc" not found in map source(s).


Reproducible: Always

Comment 1 Ian Kent 2023-06-13 02:50:02 UTC

This is a bit of a puzzle because, as far as I can see, there haven't
been any changes to the ldap code between f36 and f38.

To start with I could add the latest upstream ldap changes and see
if that makes any difference ...

Comment 2 Ian Kent 2023-06-13 03:44:23 UTC

Can you provide /etc/autofs_ldap_auth.conf and /etc/autofs.conf please.

Comment 3 Ian Kent 2023-06-13 03:50:18 UTC

(In reply to Ian Kent from comment #2)
> Can you provide /etc/autofs_ldap_auth.conf and /etc/autofs.conf please.

Could you also provide a log of a successful authentication.

Comment 4 Ian Collier 2023-06-13 15:22:31 UTC

Both krb5-libs and cyrus-sasl-gssapi have increased version numbers in F38 and I'm guessing the problem actually lies in one of those libraries.  I'm going to poke around a bit and see if I can spot anything.

A successful authentication when automount starts up starts like this, and basically all that's different is that it carries on where the above log just stops with an error.

do_bind: lookup(ldap): auth_required: 2, sasl_mech GSSAPI
sasl_do_kinit: initializing kerberos ticket: client principal host/xyz@realm
sasl_do_kinit: calling krb5_parse_name on client principal host/xyz@realm
sasl_do_kinit: Using tgs name krbtgt/realm@realm
sasl_do_kinit: Kerberos authentication was successful!
sasl_bind_mech: Attempting sasl bind with mechanism GSSAPI
sasl_log_func: GSSAPI client step 1
getuser_func: called with context (nil), id 16385.
sasl_log_func: GSSAPI client step 1
getuser_func: called with context (nil), id 16385.
sasl_log_func: GSSAPI client step 2
sasl_bind_mech: sasl bind with mechanism GSSAPI succeeded
do_bind: lookup(ldap): autofs_sasl_bind returned 0

Comment 5 Ian Collier 2023-06-13 19:10:25 UTC

I've unpacked the F36 package of krb5-libs-1.19.2 into a directory and pointed LD_LIBRARY_PATH at that directory, and now autofs works.

So now the question is: bug in krb5-libs, or is it a deliberate API change?  I don't understand what's going on under the covers well enough to know, but I do note there are a couple of ccache changes noted in the release notes.  For example: https://github.com/krb5/krb5/commit/3f5a348287646d65700854650fe668b9c4249013

Comment 6 Ian Kent 2023-06-14 01:19:22 UTC

It might be worth asking the Kerberos maintainer, you can always send it back to me.

Comment 7 Ian Kent 2023-06-14 01:24:20 UTC

Any idea what I might be doing wrong in autofs here Jakub?

Comment 8 Jakub Jelen 2023-06-14 07:44:08 UTC

There were changes since Fedora 36, but I was not following the much. Most of the work was done by Simo including the rebase and some CVE fixes. Do you have idea what could have gone wrong?

Comment 9 Simo Sorce 2023-06-14 08:03:39 UTC

@imc.ac.uk did you manually kinit at some point?

Auto-initiation generally requires the first ticket to be obtained through auto-initiation itself, if it is manually fetched, auto-initiation may fail later. Have you tried a full kdestroy of the ccache used and then letting autofs retry on its own?

Comment 10 Ian Collier 2023-06-14 11:09:47 UTC

> did you manually kinit at some point?

Yes and no.  I am required to have a valid TGT in order to be logged in to the machine and have access to NFS shares, but the credential cache for this is not shared with autofs.

I think this probably needs someone to read the sasl_do_kinit function from autofs's modules/cyrus-sasl.c source file and see if it's doing the right thing.  The short version is that it is using a memory cache ("MEMORY:_autofstkt"); it gets credentials for principal host/myhost@MYREALM using the system's /etc/krb5.keytab, stores them to the cache and sets the environment variable KRB5CCNAME to this cache.

In Fedora 38 this works the first time and then something is failing for the next do_kinit after the first ticket expired.  So possibly the question is: what happens when krb5_cc_store_cred is called to store credentials in a memory cache that already contains an (expired) ticket - and has this changed?

For debugging purposes I edited the cache name in autofs to "FILE:/tmp/_autofstkt" and this actually works.  Now, since storing the creds in a file allows me to inspect the file to see what tickets are in there, I can see that what autofs is doing is appending the new ticket to the existing tickets in the file - I can still see the previously expired tickets in there.

So as far as autofs is concerned, maybe the correct behaviour would be to destroy the memory cache before putting the renewed ticket into it.  If the previous expired tickets are being kept then that might constitute a memory leak.  But if autofs is not doing anything *wrong* per se then maybe there's a krb5 bug with regard to how it treats a memory cache that contains both an expired and a valid ticket.  (However, this is speculation on my part because I don't know what's actually in the memory cache after autofs renews its ticket.)

Comment 11 Simo Sorce 2023-06-14 11:35:53 UTC

A memory ccache has 2 issues:

1. it is only available within the process, if any computation is deferred to another process it will be unavailable (or stale if fork happens before renew).

2. it is a cache collection, so if another library linked in the same process somehow uses it, the process may find that a cache for the target already exist and decide to use it.

Not sure what krb5_cc_store_cred(0 will do, I would assume it overwrites.

OTOH maybe sasl should stop doing kinit on its own, given GSSAPI is perfectly capable of doing initiation if it is provided with a client keytab.

Could you please set the environment variable KRB5_TRACE to a file in the autofs process and then capture its output when the operation fails?
This may give us a clue about what operation fails hopefully.
May be useful for comparison to have a trace taken when it suceeds with a file cache.

Comment 12 Ian Collier 2023-06-14 14:41:34 UTC

In another test I've edited the cache name to DIR:/tmp/_autotest and now autofs fails whenever it has to renew the ticket.  DIR is of course a collection, but there is only ever one cache in the collection; this cache contains multiple tickets.  The ticket lifetime on this system is set to 20 minutes for quicker testing.

So if I klist the cache I can see this sort of thing:

Valid starting     Expires            Service principal
06/14/23 12:18:04  06/14/23 12:38:04  krbtgt/MYREALM@MYREALM
06/14/23 12:18:04  06/14/23 12:38:04  ldap/ipaserver1@
        Ticket server: ldap/ipaserver1@MYREALM
06/14/23 12:18:05  06/14/23 12:38:05  krbtgt/MYREALM@MYREALM
06/14/23 12:18:05  06/14/23 12:38:04  ldap/ipaserver2@
        Ticket server: ldap/ipaserver2@MYREALM
06/14/23 12:35:42  06/14/23 12:55:42  krbtgt/MYREALM@MYREALM
06/14/23 12:39:50  06/14/23 12:59:50  krbtgt/MYREALM@MYREALM
06/14/23 12:43:29  06/14/23 13:03:29  krbtgt/MYREALM@MYREALM
06/14/23 13:33:43  06/14/23 13:53:43  krbtgt/MYREALM@MYREALM
06/14/23 14:31:21  06/14/23 14:51:21  krbtgt/MYREALM@MYREALM

And if I use this cache to do an ldapsearch at 14:40 while the last ticket is valid, it fails:

$ ldapsearch -Y GSSAPI 
SASL/GSSAPI authentication started
ldap_sasl_interactive_bind: Local error (-2)
        additional info: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)

This same copy of autofs running against krb5 version 1.19, which doesn't output the "Ticket expired" error, still records multiple tickets in the cache.  This points to a difference with respect to retrieving the credentials from a cache collection that contains both valid and expired tickets.  However, if I run ldapsearch with that cache then it still fails (ldapsearch is not linked against libkrb5.so.3).

Comment 13 Ian Collier 2023-06-14 16:32:02 UTC

Created attachment 1970873 [details]
KRB5_TRACE log of a failed lookup [mildly redacted]

Comment 14 Ian Collier 2023-06-14 16:42:40 UTC

Created attachment 1970874 [details]
KRB5_TRACE log of a successful lookup [mildly redacted]

This log is taken in the same circumstances as the failed log, except that LD_LIBRARY_PATH points to the libraries from krb5-libs-1.19.

To my mind the only significant difference is that this has extra lines at the end, beginning "Starting with TGT for client realm".  Up to that point both logs show a successful request for a TGT, then storing this in the ccache, then later retrieving a TGT from the ccache.  The failure log then stops abruptly while the successful log continues to request the service ticket.

My conclusion then is that the TGT that was successfully retrieved from the ccache in the failure case was an expired ticket.

Comment 15 Ian Collier 2023-06-14 21:21:18 UTC

So my proposed fix to autofs, regardless of whether krb5 is doing the right thing here, would be where it says:

        if (krb5cc_in_use++ == 0)
                /* tell the cache what the default principal is */
                ret = krb5_cc_initialize(ctxt->krb5ctxt,
                                 ctxt->krb5_ccache, krb5_client_princ);

to call krb5_cc_initialize every time and not just the first time.

Comment 16 Ian Kent 2023-06-16 00:50:43 UTC

(In reply to Simo Sorce from comment #9)
> @imc.ac.uk did you manually kinit at some point?

It's been quite a while since I've worked on this.

But it's unavoidable that this must work (and has worked) without the admin
doing a kinit. There are cases where, if an external credential cache is used,
it needs to be setup with a key for the principal but that's not what's being
used here. It's automount(8) that must setup what is needed itself.

I did write some automated tests for this functionality some time ago and I
can go through those to get details of what I did if that will be useful.

In any case this is renewing an expired key that's previously authenticated
ok.

Be aware that, while I have tweaked the autofs Kerberos code from time to
time I didn't write it initially myself, so I'm not intimately familiar
with the way each of these functions work.

Comment 17 Ian Collier 2023-06-20 16:57:21 UTC

@ikent would you be willing to patch the sasl_do_kinit function in autofs/modules/cyrus-sasl.c so that it calls krb5_cc_initialize every time it does a kinit and not just the first time? This does fix the problem.

Rationale:
1. The documentation of this function admits the possibility of calling it on an existing cache and says that it destroys the contents.
2. If we are in sasl_do_kinit then it means our existing tickets are no longer valid, so there is no value in keeping them. Thus, clearing the current contents of the cache before storing the new ticket is a desirable outcome.
3. It would be consistent with the behaviour of the 'kinit' CLI program which does seem to destroy all existing tickets in the currently selected cache when used to create or renew a ticket.

The alternatives are (a) refactoring this whole code so that it uses exclusively gss+sasl calls instead of low-level krb5 calls (which I don't think is necessary at all, though it might be theoretically cleaner), or (b) asking the krb5 developers whether the ccache is behaving correctly in this case. However, the change from krb5-1.19 to krb5-1.20 actually brings memory ccaches into line with other types, in the following way:

I wrote a test program that mimics what autofs does in starting a sasl client using quickly-expiring tickets. It does this three times in such a way that the original ticket will have expired by the time the third connection is requested. Whether this worked can be summarised by the following table.

cc-type F36 F38
MEMORY y n
DIR n n
FILE n n
KEYRING n n

Thus, the fact that it worked with memory ccaches in Fedora 36 was somewhat of an outlier. In all cases (except that the memory ccache isn't visible from outside the process and so can't be tested) the ccache stored all previous and current tickets. In the case of the keyring ccache, the kernel apparently deleted the ccache as soon as the first ticket expired, and so the error message for the third access was "No credentials cache found"; for all other failures the error message was "Ticket expired".

Comment 18 Ian Kent 2023-06-26 07:56:09 UTC

(In reply to Ian Collier from comment #17)
> @ikent would you be willing to patch the sasl_do_kinit function
> in autofs/modules/cyrus-sasl.c so that it calls krb5_cc_initialize every
> time it does a kinit and not just the first time?  This does fix the problem.

Yes, but I'm not sure how multiple concurrent clients will deal with the
approach.

> 
> Rationale:
> 1. The documentation of this function admits the possibility of calling it
> on an existing cache and says that it destroys the contents.
> 2. If we are in sasl_do_kinit then it means our existing tickets are no
> longer valid, so there is no value in keeping them.  Thus, clearing the
> current contents of the cache before storing the new ticket is a desirable
> outcome.
> 3. It would be consistent with the behaviour of the 'kinit' CLI program
> which does seem to destroy all existing tickets in the currently selected
> cache when used to create or renew a ticket.

A single threaded program, not sure if that really does compare with the
autofs use case.

> 
> The alternatives are (a) refactoring this whole code so that it uses
> exclusively gss+sasl calls instead of low-level krb5 calls (which I don't
> think is necessary at all, though it might be theoretically cleaner), or (b)
> asking the krb5 developers whether the ccache is behaving correctly in this
> case.  However, the change from krb5-1.19 to krb5-1.20 actually brings
> memory ccaches into line with other types, in the following way:

or (c) ask the krb5 developers how multiple threads each destroying the
ccache could affect observed behaviour.

Perhaps all we need to do is come up with a strategy to not destroy the
ccache if it has a valid key in it ...

The other thing is now that I think about it I'm not sure the code is
properly handling multiple threads anyway ... I'm not entirely clear
on how that is really meant to work so maybe there's something to be
learned as well.

> 
> I wrote a test program that mimics what autofs does in starting a sasl
> client using quickly-expiring tickets.  It does this three times in such a
> way that the original ticket will have expired by the time the third
> connection is requested.  Whether this worked can be summarised by the
> following table.

With multiple threads concurrently doing this?

At the very least the cache will be deleted more times than it needs
to be and could it be possible for it to be deleted while another
thread is using it? I don't think this could happen but maybe my
understanding is wrong!

> 
>  cc-type  F36 F38
>  MEMORY    y   n
>  DIR       n   n
>  FILE      n   n
>  KEYRING   n   n
> 
> Thus, the fact that it worked with memory ccaches in Fedora 36 was somewhat
> of an outlier.  In all cases (except that the memory ccache isn't visible
> from outside the process and so can't be tested) the ccache stored all
> previous and current tickets.  In the case of the keyring ccache, the kernel
> apparently deleted the ccache as soon as the first ticket expired, and so
> the error message for the third access was "No credentials cache found"; for
> all other failures the error message was "Ticket expired".

This might be useful info. since we do need to make it work which means improved
(different) error handling too.

Ian

Comment 19 Ian Collier 2023-06-26 17:31:17 UTC

Yes you are correct that I did not test in a highly multi-threaded environment.  I can see that there might be a race condition where one thread might clear the cache while another is trying to look at it.

One thing we can do (as you noted above) is try to pull a cred out of the cache, and if it's valid, don't destroy the cache.  Another thing we could do is instead of destroying the cache, enumerate it and remove any creds that were found to be expired.  I'm experimenting with that in the test program.

Comment 20 Ian Collier 2023-06-30 17:03:26 UTC

Created attachment 1973430 [details]
proof-of-concept patch which clears the cache if an expired ticket can be found in it

Currently running with the attached patch which has fixed the problem (but
on a lightly-loaded workstation).

The following changes are not strictly relevant:
 - Moved cache initialization to before acquiring the ticket (that might not
   have been necessary, to be fair - I was going to reuse the my_creds struct
   and then ended up not doing)
 - Bracketed and indented the error handling for the initialization
 - If cache failed to initialize, reverse the increment to krb5cc_in_use
   which seems like it would be the right thing to do

Then this was added:
 - If the cache is already in use, try to pull out a cred and then if that
   was successful and the cred is expired, clear the cache.

So this fixes the behaviour I was seeing, since that was happening because
expired creds were being pulled out of the cache and in that case the patched
version clears the cache to remove the expired creds.

What sort of race conditions might happen here?

 - If the function is called very late during the validity of a ticket, it
   might expire after the decision not to clear the cache.  In that case,
   the behaviour is the same as the unpatched version, but this is highly
   unlikely because do_kinit is not supposed to happen while there is a
   valid ticket.

 - If two or more threads decide to call do_kinit at about the same time:
   it's protected by a mutex, so one of the calls will happen first; this
   call will clear the cache and add a new ticket.  When the others kick
   in, the cache won't be cleared because it's only cleared if we can 
   find an expired ticket in the cache and any such ticket was removed
   when the first do_kinit happened.

 - If one thread does do_kinit while another thread is trying to do a lookup:
   if the current ticket is expired then the lookup would have failed anyway;
   if it's not expired then we won't clear the cache.

 - If there is both an expired and a valid ticket in the cache:
   this only happens if two or more do_kinits clashed and stored tickets
   with different expiration times, and if the current time is between those
   times.  The current bug happens because krb5 cache retrieval is returning
   the earliest (i.e. expired) ticket.  When that's the case then do_kinit
   will clear the cache because when it tests the cache it will pull the
   expired cred - and it needs to do this because otherwise all lookups are
   failing (that's the bug).  In a case where krb5 cache retrieval returns
   the valid ticket, it doesn't matter that the cache is not cleared because
   any subsequent lookups will use that valid ticket.

Did I miss any?

Comment 21 Ian Kent 2023-07-01 01:45:27 UTC

(In reply to Ian Collier from comment #20)
> 
>  - If one thread does do_kinit while another thread is trying to do a lookup:
>    if the current ticket is expired then the lookup would have failed anyway;
>    if it's not expired then we won't clear the cache.

This is the case I was wondering about.

I don't know what happens in this case, if the initialisation could cause
a problem it will likely result in a failed lookup.

Anyway I'll turn this into a patch and post it for your review.

Comment 22 Ian Collier 2024-05-07 10:53:45 UTC

(In reply to Ian Kent from comment #21)
> Anyway I'll turn this into a patch and post it for your review.

I didn't ever see one, but I notice the code has made it into the 5.1.9 release
that comes with Fedora 40.  But... it doesn't work.  Why not?

 time_t now = monotonic_time(NULL);

monotonic_time returns the system's uptime, but we need wall-clock time here
because we are comparing it to the expiry time of a Kerberos ticket.

Comment 23 Ian Kent 2024-05-08 01:18:11 UTC

(In reply to Ian Collier from comment #22)
> (In reply to Ian Kent from comment #21)
> > Anyway I'll turn this into a patch and post it for your review.
> 
> I didn't ever see one, but I notice the code has made it into the 5.1.9
> release
> that comes with Fedora 40.  But... it doesn't work.  Why not?

Sorry, my bad, I mast have been under time pressure.

> 
>  time_t now = monotonic_time(NULL);
> 
> monotonic_time returns the system's uptime, but we need wall-clock time here
> because we are comparing it to the expiry time of a Kerberos ticket.

Right, it looks like I just need to change it back to time(NULL) as you had it.
I'll do that.

Ian

Comment 24 Aoife Moloney 2024-05-22 14:05:39 UTC

Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 25 Aoife Moloney 2024-05-28 13:11:49 UTC

Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 26 Aoife Moloney 2024-05-31 08:23:08 UTC

Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 27 Aoife Moloney 2024-11-08 10:53:46 UTC

This message is a reminder that Fedora Linux 39 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '39'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 39 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 28 Aoife Moloney 2024-11-27 21:13:20 UTC

Fedora Linux 39 entered end-of-life (EOL) status on 2024-11-26.

Fedora Linux 39 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.