Bug 586032
Summary: | kadmind fails to lock db during password change: Cannot lock database | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Martin Osvald 🛹 <mosvald> | ||||
Component: | krb5 | Assignee: | Nalin Dahyabhai <nalin> | ||||
Status: | CLOSED ERRATA | QA Contact: | Zbysek MRAZ <zmraz> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 5.4 | CC: | cww, dpal, ebenes, jplans, nico | ||||
Target Milestone: | rc | Keywords: | Reopened | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | krb5-1.6.1-60.el5 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2011-07-21 08:14:27 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
Martin Osvald 🛹
2010-04-26 16:47:46 UTC
forgot to mention that we were able to get corefiles from both running kadmind and krb5kdc server when the kadmind fails to lock database files: === snip === Core was generated by `/usr/kerberos/sbin/krb5kdc'. #0 0x00002b603889ee83 in __select_nocancel () from /lib64/libc.so.6 (gdb) thread apply all bt Thread 1 (Thread 9837): #0 0x00002b603889ee83 in __select_nocancel () from /lib64/libc.so.6 #1 0x00002b603774d0d3 in krb5int_cm_call_select (in=<value optimized out>, out=0x2b603681cde0, sret=0x7fff31d8e084) at sendto_kdc.c:544 #2 0x00002b603661432a in listen_and_process (prog=0x7fff31d8ff6b "krb5kdc") at network.c:1082 #3 0x00002b6036613fd1 in main (argc=<value optimized out>, argv=0x7fff31d8e1c8) at main.c:738 (gdb) === snip === === snip === Core was generated by `/usr/kerberos/sbin/kadmind'. #0 0x00002b347817de83 in __select_nocancel () from /lib64/libc.so.6 (gdb) thread apply all bt Thread 1 (Thread 9893): #0 0x00002b347817de83 in __select_nocancel () from /lib64/libc.so.6 #1 0x00002b3476102580 in kadm_svc_run (params=0x7fff956d4c20) at ovsec_kadmd.c:752 #2 0x00002b3476103609 in main (argc=<value optimized out>, argv=<value optimized out>) at ovsec_kadmd.c:646 (gdb) === snip === The above backtraces seem to be normal, no one of the processes are stuck somewhere - just waiting for another requests/connections, so it led me to think that some of daemons (probably krb5kdc) forgot to release the lock somewhere and in conjunction with fact that krb5kdc was still able to reply for AS/TGT requests. Hello, posting current status here to synchronize with IssueTracker and to inform you.. We were able to find out which process holds a lock on the database using a simple utility: === snip get_lock_info.c === #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> int main(int argc, char **argv) { if (argc != 2) { printf("Usage: %s lockfile\n", argv[0]); return 1; } int fd; char *lockfile = argv[1]; if ((fd = open(lockfile, O_RDWR)) == -1) { perror("open()"); return 1; } struct flock lock = { 0 }; lock.l_type = F_WRLCK; if (fcntl(fd, F_GETLK, &lock) == -1) { perror("fcntl()"); return 1; } if (lock.l_type & F_UNLCK) { printf("file: %s is not locked.\n", lockfile); } else { printf("file: %s is locked by PID: %d\n", lockfile, lock.l_pid); } return 0; } === snip === the output at time of problem: $ ./get_lock_info /var/kerberos/krb5kdc/principal.ok file: /var/kerberos/krb5kdc/principal.ok is locked by PID: 1744 $ $ ./get_lock_info /var/kerberos/krb5kdc/principal.kadm5.lock file: /var/kerberos/krb5kdc/principal.kadm5.lock is not locked. $ $ ps aux |grep 1744 root 1744 0.0 0.2 84336 23540 ? Ss Apr29 0:59 /usr/kerberos/sbin/krb5kdc root 7517 0.0 0.0 61188 724 pts/0 S+ 09:05 0:00 grep 1744 $ the above is strange because of two reasons: The first one is why only a file principal.ok is locked and principal.kadm5.lock isn't when these files should get locked and unlocked at once (in functions krb5_db2_db_lock() and krb5_db2_db_unlock() respectively). I wasn't able to find a function directly called by krb5kdc, which would lock only the file principal.ok (e.g. by directly calling a function krb5_lock_file()). This leads me to think that problem itself could be caused by the fucntion krb5_db2_db_unlock() as it is consisted of unlocking the file principal.kadm5.lock first (osa_adb_release_lock()) and if something bad happens during unlocking, the function returns without unlocking principal.ok and there is no checking for return value of krb5_db2_db_unlock() across the whole source code: === snip krb5-1.6.1/src/plugins/kdb/db2/kdb_db2.c === krb5_error_code krb5_db2_db_unlock(krb5_context context) { krb5_db2_context *db_ctx; kdb5_dal_handle *dal_handle; DB *db; krb5_error_code retval; if (!k5db2_inited(context)) return KRB5_KDB_DBNOTINITED; dal_handle = (kdb5_dal_handle *) context->db_context; db_ctx = (krb5_db2_context *) dal_handle->db_context; if ((retval = osa_adb_release_lock(db_ctx->policy_db))) { return retval; <<<--- this could lead to not unlocking principal.ok } if (!db_ctx->db_locks_held) /* lock already unlocked */ return KRB5_KDB_NOTLOCKED; db = db_ctx->db; if (--(db_ctx->db_locks_held) == 0) { (*db->close) (db); db_ctx->db = NULL; retval = krb5_lock_file(context, db_ctx->db_lf_file, KRB5_LOCKMODE_UNLOCK); db_ctx->db_lock_mode = 0; return (retval); } return 0; } === snip === The above could be true if there wasn't another strange thing.. The second strange thing is that the files shouldn't get locked exclusively by krb5kdc at all. As krb5kdc shipped with RHEL5 isn't compiled with KRBCONF_KDC_MODIFIES_KDB, which normally instructs kdc to update db with e.g. a number of failed attempts to get ticket for a specific user or time of last successful ticket creation. Again the problem disappears by restarting krb5kdc daemon. I will update when I have a new info.. Best regards, -Martin (In reply to comment #2) > === snip === > > the output at time of problem: > > $ ./get_lock_info /var/kerberos/krb5kdc/principal.ok > file: /var/kerberos/krb5kdc/principal.ok is locked by PID: 1744 > $ > > $ ./get_lock_info /var/kerberos/krb5kdc/principal.kadm5.lock > file: /var/kerberos/krb5kdc/principal.kadm5.lock is not locked. > $ > > $ ps aux |grep 1744 > root 1744 0.0 0.2 84336 23540 ? Ss Apr29 0:59 > /usr/kerberos/sbin/krb5kdc > root 7517 0.0 0.0 61188 724 pts/0 S+ 09:05 0:00 grep 1744 > $ > > the above is strange because of two reasons: > > The first one is why only a file principal.ok is locked and > principal.kadm5.lock isn't when these files should get locked and unlocked at > once (in functions krb5_db2_db_lock() and krb5_db2_db_unlock() respectively). > I wasn't able to find a function directly called by krb5kdc, which would lock > only the file principal.ok (e.g. by directly calling a function > krb5_lock_file()). This leads me to think that problem itself could be caused > by the fucntion krb5_db2_db_unlock() as it is consisted of unlocking the file > principal.kadm5.lock first (osa_adb_release_lock()) and if something bad > happens during unlocking, the function returns without unlocking principal.ok > and there is no checking for return value of krb5_db2_db_unlock() across the > whole source code: > > === snip krb5-1.6.1/src/plugins/kdb/db2/kdb_db2.c === > krb5_error_code > krb5_db2_db_unlock(krb5_context context) > { > krb5_db2_context *db_ctx; > kdb5_dal_handle *dal_handle; > DB *db; > krb5_error_code retval; > > if (!k5db2_inited(context)) > return KRB5_KDB_DBNOTINITED; > > dal_handle = (kdb5_dal_handle *) context->db_context; > db_ctx = (krb5_db2_context *) dal_handle->db_context; > > if ((retval = osa_adb_release_lock(db_ctx->policy_db))) { > return retval; <<<--- this could lead to not unlocking > principal.ok > } > > if (!db_ctx->db_locks_held) /* lock already unlocked */ > return KRB5_KDB_NOTLOCKED; > db = db_ctx->db; > if (--(db_ctx->db_locks_held) == 0) { > (*db->close) (db); > db_ctx->db = NULL; > > retval = krb5_lock_file(context, db_ctx->db_lf_file, > KRB5_LOCKMODE_UNLOCK); > db_ctx->db_lock_mode = 0; > return (retval); > } > return 0; > } > === snip === This looks interesting -- if the locking mode is somehow set to KRB5_LOCKMODE_PERMANENT, this would run afoul of the SELinux policy, which normally prevents the KDC from writing to the directory in osa_adb_release_lock(). But like you, I don't see that happening by inspecting the code. Created attachment 490820 [details]
patch which should fix the problem
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1031.html A fix is about to be pushed to master in the krb5/krb5 repo in github. The fix being pushed fixes a bug with these exact symptoms, but it may not be the same bug. In particular, the bug I'm referring to happens only when krb5kdc races with a kdb5_util _load_, and it affects kadmin.local, kadmind, and any libkadm5srv application. But loads typically only happen on slaves, and kadmind typically only runs on masters, yet this bugzilla report refers to kadmind... Is it possible that the symptoms in this report occur on KDCs that switch roles from slave to master? Or is kadmind just a red herring, that it's really kadmin.local that is affected but the OP felt that they needed to show that kadmind is affected? In any case, the upcoming commit comment: Fix lock inconsistency in ctx_unlock() The lock inconsistency fixed here is quite possibly the same as described in https://bugzilla.redhat.com/show_bug.cgi?id=586032 . The problem is that ctx_unlock() fails to unlock the principal DB if it fails to unlock the policy DB, and this happens when ctx_lock() fails to lock the policy DB (likely because the caller is racing against a kdb5_util load, which will be using a "permanent" lock, meaning that the lock file will be unlinked after acquiring the lock). The fix is to perform both unlock operations *then* handle any errors that either or both might have returned. The fix: diff --git a/src/plugins/kdb/db2/kdb_db2.c b/src/plugins/kdb/db2/kdb_db2.c index 87b6c18..62fbef7 100644 --- a/src/plugins/kdb/db2/kdb_db2.c +++ b/src/plugins/kdb/db2/kdb_db2.c @@ -383,12 +383,10 @@ done: static krb5_error_code ctx_unlock(krb5_context context, krb5_db2_context *dbc) { - krb5_error_code retval; + krb5_error_code retval, retval2; DB *db; retval = osa_adb_release_lock(dbc->policy_db); - if (retval) - return retval; if (!dbc->db_locks_held) /* lock already unlocked */ return KRB5_KDB_NOTLOCKED; @@ -399,9 +397,15 @@ ctx_unlock(krb5_context context, krb5_db2_context *dbc) dbc->db = NULL; dbc->db_lock_mode = 0; - retval = krb5_lock_file(context, dbc->db_lf_file, + retval2 = krb5_lock_file(context, dbc->db_lf_file, KRB5_LOCKMODE_UNLOCK); + if (retval2) + return retval2; } + + /* We may be unlocking because osa_adb_get_lock() failed. */ + if (retval == OSA_ADB_NOTLOCKED) + return 0; return retval; } The commit is 29ee39baa919361ae08e26caab896890d5cb3eb4. There's a test that fails on master prior to 29ee39 (though it requires 9e182bc) but passes after. |