Bug 1313580

Summary: multiple concurrent bkr processes can trample over each other during credentials cache init, causing krbV.Krb5Error: (-1765328188, 'Internal credentials cache error')
Product: [Retired] Beaker
Component: command line
Version: 22
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Reporter: Dan Callaghan <dcallagh>
Assignee: Dan Callaghan <dcallagh>
QA Contact: tools-bugs <tools-bugs>
CC: dcallagh, dowang, mjia, rjoost
Target Milestone: 22.3
Keywords: Patch, Triaged
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-04-04 05:34:53 UTC

Description Dan Callaghan 2016-03-02 00:16:48 UTC
Description of problem:
When bkr is configured to use a Kerberos keytab for authentication, it handles the credentials cache initialisation itself (the step normally done by kinit, k5start, or some other tool). That initialisation is (apparently?) not safe to perform on the same ccache filename from multiple processes concurrently.
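
A rough sketch of the sequence involved (from memory of the python-krbV API; the principal and keytab names here are illustrative, not the exact bkr code):

import krbV

ctx = krbV.default_context()
# The default ccache is shared by every process running as the same
# user, unless KRB5CCNAME points each one somewhere unique.
ccache = ctx.default_ccache()
cprinc = krbV.Principal(name='host/client.example.com', context=ctx)
keytab = krbV.Keytab(name='FILE:/etc/krb5.keytab', context=ctx)
# init() wipes and re-creates the ccache file; this is the call that
# blows up when two processes race on the same file.
ccache.init(cprinc)
ccache.init_creds_keytab(keytab=keytab, principal=cprinc)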

Version-Release number of selected component (if applicable):
22.0 but affects all existing beaker-client versions

How reproducible:
not that easily...

Steps to Reproduce:
1. Set up bkr to use Kerberos keytab for authentication
2. Run multiple bkr commands concurrently
3. Get (un)lucky
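
To make step 3 less dependent on luck, the race can be distilled down to concurrent ccache initialisations, skipping bkr itself. A sketch (again python-krbV from memory; the principal name is made up):

import multiprocessing

import krbV

def worker(unused):
    ctx = krbV.default_context()
    ccache = ctx.default_ccache()
    cprinc = krbV.Principal(name='host/client.example.com', context=ctx)
    # Every init() wipes and re-creates the same ccache file, so
    # concurrent workers can trample each other and one of them gets
    # the 'Internal credentials cache error' back.
    for _ in range(1000):
        ccache.init(cprinc)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    pool.map(worker, range(4))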

Actual results:
Traceback (most recent call last):
[...]
   File "/usr/lib/python2.6/site-packages/bkr/common/hub.py", line 68, in __init__
     self._login()
   File "/usr/lib/python2.6/site-packages/bkr/common/hub.py", line 107, in _login
     login_method()
   File "/usr/lib/python2.6/site-packages/bkr/common/hub.py", line 175, in _login_krbv
     ccache.init(cprinc)
krbV.Krb5Error: (-1765328188, 'Internal credentials cache error')

Expected results:
Unsure.

Additional info:
There are many uncertainties...

Are we using the krbV module incorrectly? Probably not, but it's hard to tell because there are no docs. krbV is basically Red Hat only; it's not the same as the more widely used python-kerberos module. But krbV is a fairly thin wrapper around MIT krb5, and the krb5 API docs don't *explicitly* say that krb5_cc_initialize can't be used concurrently on the same credentials cache, although it's kind of implied by what it does...

If we switched to some other module or library or interface, such as GSSAPI, which seems to be preferred nowadays, would this problem magically go away? Probably not.

Should bkr even be initing the ccache? Probably not. When the user is acquiring their ticket with a password, we expect them to run kinit (or whatever else) as normal, and then bkr just relies on the ticket already existing. That's the usual convention. It *should* probably be the same when some script is acquiring its ticket using a keytab: bkr should leave it up to k5start or kinit to manage the ccache and just rely on the ticket existing. I suspect we have this ccache init code inside bkr purely as a convenience.

If bkr *did* just force the caller to use kinit to get its own ticket first, would the same issue exist? Does kinit have any kind of protection against this scenario or does it also just assume you won't try running it concurrently on the same ccache?

If bkr is doing this just as a convenience when using a keytab, does it make sense to reuse the default ccache, or should bkr always create (and then clean up) a separate, unique ccache file? I think the latter is probably the best solution.

Comment 2 Dan Callaghan 2016-03-02 00:20:22 UTC
Forgot to mention the workaround, which is to explicitly set a unique ccache filename before invoking bkr:

export KRB5CCNAME=$(mktemp /tmp/krb5cc_XXXXXXXX)

with some corresponding code to clean up afterwards.
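
For scripts driving bkr from Python, the equivalent would look something like this (just a sketch; the bkr subcommand and the ccache prefix are illustrative):

import os
import subprocess
import tempfile

# Give this invocation its own ccache so concurrent runs cannot collide.
fd, ccache_path = tempfile.mkstemp(prefix='krb5cc_bkr_')
os.close(fd)
env = dict(os.environ, KRB5CCNAME='FILE:' + ccache_path)
try:
    subprocess.check_call(['bkr', 'whoami'], env=env)
finally:
    # The corresponding cleanup.
    os.unlink(ccache_path)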

Comment 3 Dan Callaghan 2016-03-11 04:48:46 UTC
I was wondering if regular kinit suffers from the same problem (multiple concurrent kinits on the same ccache file can corrupt it) because it seems like it *shouldn't*. It's not mentioned anywhere in the krb5 docs that I could find, so I did some digging through the source.

The code for file-based credentials caches *does* lock the ccache file whenever it re-creates it or writes to it, as it should be doing. I also double-checked the behaviour at runtime with strace, and it looks right to me:

unlink("/tmp/krb5cc_0")                 = 0
[...]
open("/tmp/krb5cc_0", O_RDWR|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3
[...]
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
write(3, "\5\4", 2)                     = 2
[...]
fcntl(3, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
close(3)                                = 0
[...]
open("/tmp/krb5cc_0", O_RDWR)           = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
read(3, "\5\4", 2)                      = 2
[...]
fcntl(3, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
close(3)                                = 0
[...]

So now I am at a bit of a loss to explain why concurrent bkr processes using the same ccache file could actually trample over each other and cause the error shown in comment 0...

Comment 4 Dan Callaghan 2016-03-11 05:09:25 UTC
So I had a look for the exact error code, -1765328188. Its symbolic name is KRB5_FCC_INTERNAL.

There is only one API explicitly documented as returning that value, namely krb5_cc_default:

4351  * @retval
4352  * KRB5_FCC_INTERNAL       The name of the default credential cache cannot be
4353  *                         obtained

but it's also returned from krb5_fcc_interpret when mapping from several different POSIX errno values:

2486     case EINVAL:
2487     case EEXIST:                        /* XXX */
2488     case EFAULT:
2489     case EBADF:
2490 #ifdef ENAMETOOLONG
2491     case ENAMETOOLONG:
2492 #endif
2493 #ifdef EWOULDBLOCK
2494     case EWOULDBLOCK:
2495 #endif
2496         retval = KRB5_FCC_INTERNAL;
2497         break;

There are quite a few operations in krb5_fcc_initialize (and, indirectly, in krb5_fcc_open_file with mode FCC_OPEN_AND_ERASE) that could result in one of those errnos, which then gets mapped through krb5_fcc_interpret to the KRB5_FCC_INTERNAL error code.

But looking at it now, the most likely candidate would be when it unlink()s and then open()s the file with O_CREAT|O_EXCL. Presumably it is using O_CREAT|O_EXCL to be sure that some other process hasn't also created the file and started writing to it in between the unlink() and open().

But when the file already exists, open() with O_EXCL fails with EEXIST, which is one of the errnos listed above. And what do you know, it even has a big XXX next to it, as though somebody has noticed before that there is a window between unlink() and open() where two concurrent krb5_cc_initialize() calls can race with each other, and one of them will fail in open() with EEXIST.

It seems to me that the unlink() followed by open() is inherently racy, and that this is not the right way to wipe and recreate the cache file. Instead it should probably open() with O_CREAT and *no* O_EXCL, then acquire a write lock, and then ftruncate() and write out the cache.
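
Sketched in Python just to illustrate the pattern (the real fix would live in krb5's C code, so treat this purely as a sketch):

import fcntl
import os

def wipe_and_recreate(path, data):
    # O_CREAT without O_EXCL succeeds whether or not the file already
    # exists, so there is no unlink()/open() window to race in.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Take the exclusive write lock *before* destroying the old
        # contents, mirroring the locking the fcc code already does.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        os.ftruncate(fd, 0)
        os.write(fd, data)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)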

But regardless of whether it is a bug/misfeature in krb5, and regardless of whether it might eventually be fixed or not -- it seems like we *will* need to work around it in bkr by using a unique credential cache filename.

Comment 5 Dan Callaghan 2016-03-11 05:54:14 UTC
Here is a patch for the workaround in bkr:

http://gerrit.beaker-project.org/4729

Now the hard part will be reproducing the error to prove that the workaround is valid...

Comment 6 Dan Callaghan 2016-03-11 06:51:29 UTC
I also filed bug 1316798 against krb5 including a reproducer, so that this can be fixed properly at the root.

Comment 8 Dan Callaghan 2016-03-18 00:33:36 UTC
This bug fix is included in beaker-client-22.2-0.git.22.668a081 which is available for download here:

https://beaker-project.org/nightlies/release-22/

Comment 9 Dan Callaghan 2016-03-18 01:17:02 UTC
This fix is also in beaker-client-22.3-0.git.5.a4291ca, which might be simpler to use since it has a higher NVR than the already-published 22.2 release.

Comment 11 Roman Joost 2016-04-04 05:34:53 UTC
Beaker 22.3 has been released. Release Notes can be found here: https://beaker-project.org/docs/whats-new/release-22.html#beaker-22-3