Bug 1316798 - krb5_cc_initialize() can fail with KRB5_FCC_INTERNAL when multiple processes share the same ccache filename
Product: Fedora
Classification: Fedora
Component: krb5
Assigned To: Robbie Harwood
Reported: 2016-03-11 01:48 EST by Dan Callaghan
Modified: 2017-05-02 14:48 EDT (History)
Doc Type: Bug Fix
Last Closed: 2017-05-02 14:48:24 EDT
Type: Bug
Description Dan Callaghan 2016-03-11 01:48:43 EST
Description of problem:
Multiple processes concurrently invoking krb5_cc_initialize() with the same ccache filename will sometimes fail with KRB5_FCC_INTERNAL.

Version-Release number of selected component (if applicable):
Filed against Fedora rawhide, but this was originally observed on RHEL6, and it looks like it affects all krb5 releases.

How reproducible:
quite easily with a synthetic reproducer

Steps to Reproduce:
cat >krb5initrace.c <<EOF
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <krb5.h>

int main(int argc, char *argv[]) {
    krb5_context ctx;
    krb5_ccache ccache;
    /* The context must be initialized before any other krb5 call. */
    krb5_init_context(&ctx);
    krb5_cc_default(ctx, &ccache);
    krb5_principal princ;
    krb5_parse_name(ctx, "example@REDHAT.COM", &princ);

    /* Re-initialize the shared ccache in a tight loop until it fails. */
    unsigned int attempt = 0;
    while (1) {
        attempt++;
        int result = krb5_cc_initialize(ctx, ccache, princ);
        if (result != 0) {
            fprintf(stderr, "pid %d attempt %u failed: %d\n", getpid(), attempt, result);
            return 1;
        }
    }
}
EOF
gcc -o krb5initrace krb5initrace.c $(pkg-config krb5 --cflags --libs)
export KRB5CCNAME=/tmp/asdf
./krb5initrace & ./krb5initrace &

Actual results:
One of the two processes will quickly fail with KRB5_FCC_INTERNAL (-1765328188), as in:
pid 2415 attempt 6 failed: -1765328188

Expected results:
krb5_cc_initialize() should be safe to call on the same ccache filename from multiple processes concurrently, without spurious failures.

Additional info:
The API docs don't make any explicit mention about file-backed ccache safety when shared across multiple processes, but from poking around in the source I can see that cc_file.c does go to a lot of effort to ensure that a suitable POSIX file lock is held on the ccache filename whenever it's manipulated. However the ccache initialization routine is a bit of a special case in that it tries to first erase and then recreate the ccache filename. It does that by calling unlink() followed by open() with O_CREAT|O_EXCL, and then it acquires a lock and writes to the file. The problem is there is a race window between unlink() and open() where a racing process may already have created the same file, in which case open() will fail with EEXIST. That gets mapped to KRB5_FCC_INTERNAL which is what we are seeing.
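
To make the race window concrete, here is a minimal sketch that replays the interleaving deterministically inside a single process, using only plain POSIX calls. It is not the krb5 code itself: the function name replay_init_race and the idea of simulating "process B" with a second open() are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of krb5: replays the racy sequence
 * unlink() -> (other process creates file) -> open(O_CREAT|O_EXCL).
 * Returns the errno seen by the exclusive open(), or 0 if it succeeded. */
int replay_init_race(const char *path)
{
    int other, fd, err = 0;

    /* Process A: erase any existing ccache file. */
    if (unlink(path) != 0 && errno != ENOENT)
        return errno;

    /* Process B slips in here and creates the same file first. */
    other = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (other < 0)
        return errno;

    /* Process A: the exclusive create now fails because the file exists. */
    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        err = errno;
    else
        close(fd);

    /* Clean up the demo file. */
    close(other);
    unlink(path);
    return err;
}
```

Calling replay_init_race() on a scratch path should return EEXIST, which is the error that, per the analysis above, ends up reported as KRB5_FCC_INTERNAL.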

We originally saw this in the bkr command line client, which can be configured to initialize the ccache using a keytab before authenticating. Some users reported hitting this -1765328188 error at random when they had multiple Jenkins jobs running concurrently under the same user account sharing the same default ccache file on RHEL6. That's bug 1313580.

The race between unlink() and open() seems to exist in all versions of krb5, including back to RHEL6 where we originally hit this, but I've filed this bug against Fedora rawhide because the krb5 fix for this would probably be a bit too drastic to be backported to RHEL6.
Comment 1 Dan Callaghan 2016-03-11 01:51:08 EST
I feel like this could be fixed by *not* calling unlink() and not using O_CREAT|O_EXCL, but instead just open()ing with O_CREAT and then acquiring the write lock and then truncating. I haven't tried that approach though. And I'm not sure if there is some other reason why the unlink() might be desired, like in case the process has permissions on the directory but not the existing file, or it's a symlink instead of a real file etc.

If the unlink() *is* still needed then it seems like the only correct alternative is to handle EEXIST properly instead of letting it leak out as KRB5_FCC_INTERNAL to the caller.
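
As a rough sketch of what "handling EEXIST properly" could look like, the function below retries the unlink() and exclusive open() when another process wins the race. This is only an illustration in plain POSIX, not a proposed krb5 change; the name create_fresh_retrying and the retry limit are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_CREATE_ATTEMPTS 16  /* arbitrary bound, chosen for illustration */

/* Hypothetical helper, not part of krb5: keeps the unlink() + exclusive
 * open() sequence but treats EEXIST as "lost the race, try again".
 * Returns an open file descriptor, or -1 with errno set. */
int create_fresh_retrying(const char *path)
{
    int attempt, fd;

    for (attempt = 0; attempt < MAX_CREATE_ATTEMPTS; attempt++) {
        /* Remove whatever is there; a missing file is not an error. */
        if (unlink(path) != 0 && errno != ENOENT)
            return -1;

        fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd >= 0)
            return fd;          /* we created the file */
        if (errno != EEXIST)
            return -1;          /* a real error, report it */
        /* EEXIST: another process recreated the file; loop and retry. */
    }

    errno = EEXIST;
    return -1;
}
```

One caveat of this sketch: a retry unlinks the file the other process just created, so that process may end up writing to an inode that is no longer reachable by name. It avoids the spurious error but does not decide which writer should win.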
Comment 2 Dan Callaghan 2016-03-13 19:49:14 EDT
Here is a patch which leaves unlink() but removes O_EXCL and instead does ftruncate() after acquiring the write lock. It fixes the race with the reproducer above, and it seems to work correctly.
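
The patch itself is not reproduced here. The following is only a minimal sketch of the sequence described above (keep unlink(), open without O_EXCL, take the write lock, then ftruncate()), written against plain POSIX. The function name open_lock_truncate is hypothetical, and the use of an fcntl() F_SETLKW write lock is an assumption about what "the write lock" means, not a copy of cc_file.c.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of krb5: opens the ccache file without
 * O_EXCL, then locks and truncates it. Returns a locked, empty file
 * descriptor, or -1 with errno set. Closing the descriptor drops the lock. */
int open_lock_truncate(const char *path)
{
    struct flock lock = {0};
    int fd;

    /* unlink() is kept; a missing file is not an error. */
    if (unlink(path) != 0 && errno != ENOENT)
        return -1;

    /* No O_EXCL: if a racing process already recreated the file,
     * we simply open that file instead of failing with EEXIST. */
    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* Block until we hold an exclusive lock on the whole file. */
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;             /* 0 means "to end of file" */
    if (fcntl(fd, F_SETLKW, &lock) != 0) {
        close(fd);
        return -1;
    }

    /* Only now discard any contents another process may have written. */
    if (ftruncate(fd, 0) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because truncation happens only after the lock is held, a concurrent initializer either waits for the lock or has its partial contents discarded, rather than triggering an error at open() time.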

Comment 3 Robbie Harwood 2016-03-14 14:07:09 EDT
Hi, thanks for writing a patch!  I'll follow the upstream discussion and backport when something merges.
Comment 4 Jan Kurik 2016-07-26 00:58:40 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle.
Changing version to '25'.
Comment 5 Robbie Harwood 2017-05-02 14:48:24 EDT
Upstream PR has been inactive for a year; closing.
