Bug 1313580
Summary: | multiple concurrent bkr processes can trample over each other during credentials cache init, causing krbV.Krb5Error: (-1765328188, 'Internal credentials cache error') | ||
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Dan Callaghan <dcallagh> |
Component: | command line | Assignee: | Dan Callaghan <dcallagh> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 22 | CC: | dcallagh, dowang, mjia, rjoost |
Target Milestone: | 22.3 | Keywords: | Patch, Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-04-04 05:34:53 UTC | Type: | Bug |
Description
Dan Callaghan
2016-03-02 00:16:48 UTC
Forgot to mention the workaround, which is to explicitly set a unique ccache filename before invoking bkr:

    export KRB5CCNAME=$(mktemp /tmp/krb5cc_XXXXXXXX)

with some corresponding code to clean up afterwards.

I was wondering if regular kinit suffers from the same problem (multiple concurrent kinits on the same ccache file can corrupt it) because it seems like it *shouldn't*. It's not mentioned anywhere in the krb5 docs that I could find, so I did some digging through the source.

The code for file-based credentials caches *does* lock the ccache file whenever it re-creates it or writes to it, as it should be doing. I also double-checked the behaviour at runtime with strace, and it looks right to me:

    unlink("/tmp/krb5cc_0") = 0
    [...]
    open("/tmp/krb5cc_0", O_RDWR|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3
    [...]
    fcntl(3, F_SETFD, FD_CLOEXEC) = 0
    fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
    write(3, "\5\4", 2) = 2
    [...]
    fcntl(3, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
    close(3) = 0
    [...]
    open("/tmp/krb5cc_0", O_RDWR) = 3
    fcntl(3, F_SETFD, FD_CLOEXEC) = 0
    fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
    read(3, "\5\4", 2) = 2
    [...]
    fcntl(3, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
    close(3) = 0
    [...]

So now I am at a bit of a loss to explain why concurrent bkr processes using the same ccache file could actually trample over each other and cause the error shown in comment 0...

So I had a look for the exact error code, -1765328188. Its symbolic name is KRB5_FCC_INTERNAL.
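For callers that launch bkr from a script, the KRB5CCNAME workaround mentioned above could be wrapped roughly as follows. This is only a sketch under stated assumptions: `run_with_private_ccache` is a hypothetical helper name, it is not part of bkr, and it is not necessarily what the eventual bkr patch does. It creates a private cache file, passes it to the child process through the environment, and removes it afterwards.

```python
import os
import subprocess
import tempfile


def run_with_private_ccache(argv):
    """Run argv with KRB5CCNAME pointing at a fresh, private cache file.

    Hypothetical helper for illustration; returns the child's exit status.
    """
    # mkstemp creates the file atomically with mode 0600, so no two callers
    # can be handed the same name (the same idea as mktemp in the shell).
    fd, path = tempfile.mkstemp(prefix="krb5cc_")
    os.close(fd)
    # Only the child's environment is changed, not the caller's.
    env = dict(os.environ, KRB5CCNAME=path)
    try:
        return subprocess.call(argv, env=env)
    finally:
        # The "corresponding code to clean up afterwards".
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass
```

A caller would pass the full command line as a list, for example `run_with_private_ccache(["bkr", "whoami"])`, assuming that subcommand is available in the installed client.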
There is only one API which is explicitly documented as returning that value, which is krb5_cc_default:

     * @retval
     * KRB5_FCC_INTERNAL The name of the default credential cache cannot be
     * obtained

but it's also returned from krb5_fcc_interpret when mapping from several different POSIX errno values:

        case EINVAL:
        case EEXIST:                /* XXX */
        case EFAULT:
        case EBADF:
    #ifdef ENAMETOOLONG
        case ENAMETOOLONG:
    #endif
    #ifdef EWOULDBLOCK
        case EWOULDBLOCK:
    #endif
            retval = KRB5_FCC_INTERNAL;
            break;

There's quite a few operations which krb5_fcc_initialize (and indirectly, krb5_fcc_open_file with mode FCC_OPEN_AND_ERASE) does that could result in one of those errnos which then gets mapped back through krb5_fcc_interpret to the KRB5_FCC_INTERNAL error code.

But looking at it now, the most likely candidate would be when it unlink()s and then open()s the file with O_CREAT|O_EXCL. Presumably it is using O_CREAT|O_EXCL to be sure that some other process hasn't also created the file and started writing to it in between the unlink() and open(). But when O_EXCL *didn't* create the file it indicates that by returning EEXIST, which is one of the error codes above. And what do you know, it even has a big XXX as though someone before has noticed that there is a window between unlink() and open() where two concurrent krb5_cc_initialize() calls could race with each other and one would fail in open() with EEXIST.

It seems to me that the unlink() followed by open() is inherently racy and that that is not the right way to wipe and recreate the cache file. Instead it should probably be doing open() with O_CREAT and *no* O_EXCL, then acquiring a write lock, and then ftruncate() and writing out the cache.

But regardless of whether it is a bug/misfeature in krb5, and regardless of whether it might eventually be fixed or not -- it seems like we *will* need to work around it in bkr by using a unique credential cache filename.
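The suspected race and the suggested alternative can be illustrated with a small Python sketch. It is not the krb5 code: the function names are hypothetical, and the two "processes" are replayed in a single process in the unlucky order (A unlinks, B unlinks, B creates, A creates) so that the outcome is deterministic. `fcntl.lockf` is used because it takes the same kind of fcntl() record lock that appears in the strace output.

```python
import fcntl
import os


def racy_unlink(path):
    # Step 1 of the racy sequence: remove any existing cache file.
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass


def racy_create(path):
    # Step 2: O_EXCL makes open() fail with EEXIST if the file was
    # created by someone else after our unlink().
    flags = os.O_RDWR | os.O_CREAT | os.O_EXCL | os.O_TRUNC
    return os.open(path, flags, 0o600)


def demonstrate_race(path):
    """Replay the interleaving and return the errno seen by A, or None."""
    racy_unlink(path)          # process A
    racy_unlink(path)          # process B
    fd_b = racy_create(path)   # process B wins the window
    try:
        fd_a = racy_create(path)   # process A loses
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd_b)
    os.close(fd_a)
    return None


def locked_rewrite(path, data):
    """Wipe and rewrite the file without unlink() and without O_EXCL."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocking write lock on the whole file
        os.ftruncate(fd, 0)             # discard old contents under the lock
        os.write(fd, data)
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

On a POSIX system `demonstrate_race` returns `errno.EEXIST`, the value that krb5_fcc_interpret maps to KRB5_FCC_INTERNAL, while repeated calls to `locked_rewrite` on the same path never fail that way because concurrent callers queue on the lock instead.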
Here is a patch for the workaround in bkr: http://gerrit.beaker-project.org/4729

Now the hard part will be reproducing the error to prove that the workaround is valid...

I also filed bug 1316798 against krb5 including a reproducer, so that this can be fixed properly at the root.

This bug fix is included in beaker-client-22.2-0.git.22.668a081 which is available for download here: https://beaker-project.org/nightlies/release-22/

This fix is also in beaker-client-22.3-0.git.5.a4291ca which might be simpler to use since it has a higher NVR than the 22.2 release which is already published.

Beaker 22.3 has been released. Release Notes can be found here: https://beaker-project.org/docs/whats-new/release-22.html#beaker-22-3