Description of problem:
kadmind fails, for a currently unknown reason, to lock the principal/policy database.

Version-Release number of selected component (if applicable):
RHEL5 krb5-server-1.6.1-36.el5_4.1 x86_64

How reproducible:
No stable reproducer yet; it happens after some time (a couple of days/weeks).

Steps to Reproduce:
No reproducer yet, but it is ultimately triggered by a command similar to:
$ kadmin -p "<username>/admin@<REALM>" -q "cpw <some_user>"

Actual results:
kadmin: change_password <some_user>
Enter password for principal "<some_user>":
Re-enter password for principal "<some_user>":
change_password: Cannot lock database while changing password for "<some_user>@<REALM>".

kadmind.log:
Apr 07 07:25:03 <hostname> kadmind[9893](Notice): Request: kadm5_init, <admin_user>/admin@<REALM>, success, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>, flavor=300001
Apr 07 07:27:11 <hostname> kadmind[9893](Notice): Request: kadm5_get_principals, <some_user>, success, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>
Apr 07 07:27:17 <hostname> kadmind[9893](Notice): Request: kadm5_get_policy, default, success, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>
Apr 07 07:27:46 <hostname> kadmind[9893](Notice): Request: kadm5_create_principal, <some_user>@<REALM>, Cannot lock database, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>
Apr 07 07:28:16 <hostname> kadmind[9893](Notice): Request: kadm5_get_policy, default, success, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>
Apr 07 07:28:35 <hostname> kadmind[9893](Notice): Request: kadm5_create_principal, <some_user>@<REALM>, Password is too short, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>
Apr 07 07:28:43 <hostname> kadmind[9893](Notice): Request: kadm5_get_policy, default, success,
client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>
Apr 07 07:29:04 <hostname> kadmind[9893](Notice): Request: kadm5_create_principal, <some_user>@<REALM>, Cannot lock database, client=<admin_user>/admin@<REALM>, service=kadmin/admin@<REALM>, addr=<IP_ADDRESS>

The problem disappears after restarting the Kerberos daemons.

Expected results:
kadmind should succeed in changing the password without hitting the error "Cannot lock database".

Additional info:
The interesting thing is that, according to krb5kdc.log (which doesn't contain anything of interest), krb5kdc works correctly during the time kadmind returns "Cannot lock database" errors. krb5kdc successfully responds to TGT and AS requests, so it seems that the process that forgot to unlock those files was krb5kdc, because according to the fcntl() manual page (the relevant part is emphasized by "_____"):

The l_type field can be used to place a read (F_RDLCK) or a write (F_WRLCK) lock on a file. Any number of processes may hold a read lock (shared lock) on a file region, but only one process may hold a write lock (exclusive lock). An exclusive lock excludes all other locks, both shared and exclusive. A single process can hold only one type of lock on a file region; _____if a new lock is applied to an already-locked region, then the existing lock is converted to the new lock type.
(Such conversions may involve splitting, shrinking, or coalescing with an existing lock if the byte range specified by the new lock does not precisely coincide with the range of the existing lock.)_____

Code path leading to the "Cannot lock database" error:

kadm_svc_run() -> svc_getreqset() -> svc_do_xprt() -> kadm_1() ->
chpass_principal_2_svc() -> kadm5_chpass_principal() ->
kadm5_chpass_principal_3() -> kdb_put_entry() -> krb5_db_put_principal() ->
krb5_db2_db_put_principal() -> krb5_db2_db_lock() -> krb5_lock_file()

Source code related to the "Cannot lock database" error message:

$ grep "Cannot lock database" RHEL-5/krb5-1.6.1/src/lib/kdb/adb_err.et
error_code OSA_ADB_CANTLOCK_DB, "Cannot lock database"
$

=== snip src/plugins/kdb/db2/kdb_db2.c:556 ===
krb5_error_code
krb5_db2_db_lock(krb5_context context, int in_mode)
{
...<snip>
    krb5_lock_mode = mode | KRB5_LOCKMODE_DONTBLOCK;
    for (gotlock = tries = 0; tries < MAX_LOCK_TRIES; tries++) {
        retval = krb5_lock_file(context, db_ctx->db_lf_file, krb5_lock_mode);
        if (retval == 0) {
            gotlock++;
            break;
        } else if (retval == EBADF && mode == KRB5_DB_LOCKMODE_EXCLUSIVE)
            /* tried to exclusive-lock something we don't have */
            /* write access to */
            return KRB5_KDB_CANTLOCK_DB;
        sleep(1);
    }
    if (retval == EACCES)
        return KRB5_KDB_CANTLOCK_DB;
    else if (retval == EAGAIN || retval == EWOULDBLOCK)
        return OSA_ADB_CANTLOCK_DB;
    else if (retval != 0)
        return retval;
...<snip>
=== snip ===

=== snip src/lib/krb5/os/lock_file.c:65 ===
krb5_error_code
krb5_lock_file(krb5_context context, int fd, int mode)
{
...<snip>
#ifdef POSIX_FILE_LOCKS
    lock_arg.l_whence = 0;
    lock_arg.l_start = 0;
    lock_arg.l_len = 0;
    if (fcntl(fd, lock_cmd, &lock_arg) == -1) {
        if (errno == EACCES || errno == EAGAIN)
            /* see POSIX/IEEE 1003.1-1988, 6.5.2.4 */
            return(EAGAIN);
        if (errno != EINVAL)    /* Fall back to flock if we get EINVAL */
            return(errno);
        retval = errno;
    } else
        return 0;               /* We succeeded.  Yay. */
#endif
=== snip ===

I will update as soon as I have new info.
Forgot to mention that we were able to get corefiles from both the running kadmind and krb5kdc servers at the time kadmind fails to lock the database files:

=== snip ===
Core was generated by `/usr/kerberos/sbin/krb5kdc'.
#0  0x00002b603889ee83 in __select_nocancel () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 1 (Thread 9837):
#0  0x00002b603889ee83 in __select_nocancel () from /lib64/libc.so.6
#1  0x00002b603774d0d3 in krb5int_cm_call_select (in=<value optimized out>, out=0x2b603681cde0, sret=0x7fff31d8e084) at sendto_kdc.c:544
#2  0x00002b603661432a in listen_and_process (prog=0x7fff31d8ff6b "krb5kdc") at network.c:1082
#3  0x00002b6036613fd1 in main (argc=<value optimized out>, argv=0x7fff31d8e1c8) at main.c:738
(gdb)
=== snip ===

=== snip ===
Core was generated by `/usr/kerberos/sbin/kadmind'.
#0  0x00002b347817de83 in __select_nocancel () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 1 (Thread 9893):
#0  0x00002b347817de83 in __select_nocancel () from /lib64/libc.so.6
#1  0x00002b3476102580 in kadm_svc_run (params=0x7fff956d4c20) at ovsec_kadmd.c:752
#2  0x00002b3476103609 in main (argc=<value optimized out>, argv=<value optimized out>) at ovsec_kadmd.c:646
(gdb)
=== snip ===

The above backtraces look normal; neither process is stuck anywhere, both are just waiting for further requests/connections. That, in conjunction with the fact that krb5kdc was still able to reply to AS/TGT requests, led me to think that one of the daemons (probably krb5kdc) forgot to release the lock somewhere.
Hello, posting the current status here to synchronize with IssueTracker and keep you informed.

We were able to find out which process holds a lock on the database using a simple utility:

=== snip get_lock_info.c ===
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        printf("Usage: %s lockfile\n", argv[0]);
        return 1;
    }

    int fd;
    char *lockfile = argv[1];

    if ((fd = open(lockfile, O_RDWR)) == -1) {
        perror("open()");
        return 1;
    }

    struct flock lock = { 0 };
    lock.l_type = F_WRLCK;

    if (fcntl(fd, F_GETLK, &lock) == -1) {
        perror("fcntl()");
        return 1;
    }

    /* l_type is an enumeration, not a bit mask, so compare with ==
     * (testing "l_type & F_UNLCK" only works by accident on Linux). */
    if (lock.l_type == F_UNLCK) {
        printf("file: %s is not locked.\n", lockfile);
    } else {
        printf("file: %s is locked by PID: %d\n", lockfile, (int)lock.l_pid);
    }

    return 0;
}
=== snip ===

The output at the time of the problem:

$ ./get_lock_info /var/kerberos/krb5kdc/principal.ok
file: /var/kerberos/krb5kdc/principal.ok is locked by PID: 1744
$
$ ./get_lock_info /var/kerberos/krb5kdc/principal.kadm5.lock
file: /var/kerberos/krb5kdc/principal.kadm5.lock is not locked.
$
$ ps aux | grep 1744
root      1744  0.0  0.2  84336 23540 ?  Ss   Apr29  0:59 /usr/kerberos/sbin/krb5kdc
root      7517  0.0  0.0  61188   724 pts/0  S+  09:05  0:00 grep 1744
$

The above is strange for two reasons.

The first is why only the file principal.ok is locked while principal.kadm5.lock isn't, when these files should be locked and unlocked together (in the functions krb5_db2_db_lock() and krb5_db2_db_unlock() respectively). I wasn't able to find any function directly called by krb5kdc that would lock only the file principal.ok (e.g. by calling krb5_lock_file() directly).
This leads me to think that the problem could be caused by the function krb5_db2_db_unlock(): it unlocks the file principal.kadm5.lock first (via osa_adb_release_lock()), and if something goes wrong during that unlock, the function returns without unlocking principal.ok. Moreover, nothing in the whole source tree checks the return value of krb5_db2_db_unlock():

=== snip krb5-1.6.1/src/plugins/kdb/db2/kdb_db2.c ===
krb5_error_code
krb5_db2_db_unlock(krb5_context context)
{
    krb5_db2_context *db_ctx;
    kdb5_dal_handle *dal_handle;
    DB *db;
    krb5_error_code retval;

    if (!k5db2_inited(context))
        return KRB5_KDB_DBNOTINITED;

    dal_handle = (kdb5_dal_handle *) context->db_context;
    db_ctx = (krb5_db2_context *) dal_handle->db_context;

    if ((retval = osa_adb_release_lock(db_ctx->policy_db))) {
        return retval;   <<<--- this could lead to not unlocking principal.ok
    }

    if (!db_ctx->db_locks_held) /* lock already unlocked */
        return KRB5_KDB_NOTLOCKED;
    db = db_ctx->db;
    if (--(db_ctx->db_locks_held) == 0) {
        (*db->close) (db);
        db_ctx->db = NULL;

        retval = krb5_lock_file(context, db_ctx->db_lf_file,
                                KRB5_LOCKMODE_UNLOCK);
        db_ctx->db_lock_mode = 0;
        return (retval);
    }
    return 0;
}
=== snip ===

The above could be the whole explanation if there weren't another oddity. The second strange thing is that the files shouldn't get locked exclusively by krb5kdc at all: the krb5kdc shipped with RHEL5 isn't compiled with KRBCONF_KDC_MODIFIES_KDB, which normally instructs the KDC to update the database with e.g. the number of failed attempts to get a ticket for a specific user or the time of the last successful ticket creation.

Again, the problem disappears after restarting the krb5kdc daemon. I will update when I have new info.

Best regards,
-Martin
(In reply to comment #2)
> [...]
> The first one is why only a file principal.ok is locked and
> principal.kadm5.lock isn't when these files should get locked and unlocked
> at once (in functions krb5_db2_db_lock() and krb5_db2_db_unlock()
> respectively). [...] This leads me to think that problem itself could be
> caused by the function krb5_db2_db_unlock() [...]
>
>     if ((retval = osa_adb_release_lock(db_ctx->policy_db))) {
>         return retval;   <<<--- this could lead to not unlocking principal.ok
>     }
> [...]

This looks interesting -- if the locking mode is somehow set to KRB5_LOCKMODE_PERMANENT, this would run afoul of the SELinux policy, which normally prevents the KDC from writing to the directory in osa_adb_release_lock(). But like you, I don't see that happening by inspecting the code.
Created attachment 490820 [details] patch which should fix the problem
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1031.html
A fix is about to be pushed to master in the krb5/krb5 repo on GitHub. The fix being pushed fixes a bug with these exact symptoms, but it may not be the same bug. In particular, the bug I'm referring to happens only when krb5kdc races with a kdb5_util _load_, and it affects kadmin.local, kadmind, and any libkadm5srv application. But loads typically only happen on slaves, and kadmind typically only runs on masters, yet this bugzilla report refers to kadmind...

Is it possible that the symptoms in this report occur on KDCs that switch roles from slave to master? Or is kadmind just a red herring, and it's really kadmin.local that is affected, but the OP felt they needed to show that kadmind is affected?

In any case, the upcoming commit comment:

    Fix lock inconsistency in ctx_unlock()

    The lock inconsistency fixed here is quite possibly the same as
    described in https://bugzilla.redhat.com/show_bug.cgi?id=586032 .

    The problem is that ctx_unlock() fails to unlock the principal DB if
    it fails to unlock the policy DB, and this happens when ctx_lock()
    fails to lock the policy DB (likely because the caller is racing
    against a kdb5_util load, which will be using a "permanent" lock,
    meaning that the lock file will be unlinked after acquiring the
    lock).

    The fix is to perform both unlock operations *then* handle any
    errors that either or both might have returned.
The fix:

diff --git a/src/plugins/kdb/db2/kdb_db2.c b/src/plugins/kdb/db2/kdb_db2.c
index 87b6c18..62fbef7 100644
--- a/src/plugins/kdb/db2/kdb_db2.c
+++ b/src/plugins/kdb/db2/kdb_db2.c
@@ -383,12 +383,10 @@ done:
 static krb5_error_code
 ctx_unlock(krb5_context context, krb5_db2_context *dbc)
 {
-    krb5_error_code retval;
+    krb5_error_code retval, retval2;
     DB *db;
 
     retval = osa_adb_release_lock(dbc->policy_db);
-    if (retval)
-        return retval;
 
     if (!dbc->db_locks_held) /* lock already unlocked */
         return KRB5_KDB_NOTLOCKED;
@@ -399,9 +397,15 @@ ctx_unlock(krb5_context context, krb5_db2_context *dbc)
         dbc->db = NULL;
         dbc->db_lock_mode = 0;
 
-        retval = krb5_lock_file(context, dbc->db_lf_file,
+        retval2 = krb5_lock_file(context, dbc->db_lf_file,
                                 KRB5_LOCKMODE_UNLOCK);
+        if (retval2)
+            return retval2;
     }
+
+    /* We may be unlocking because osa_adb_get_lock() failed. */
+    if (retval == OSA_ADB_NOTLOCKED)
+        return 0;
     return retval;
 }
The commit is 29ee39baa919361ae08e26caab896890d5cb3eb4. There's a test that fails on master prior to 29ee39 (though it requires 9e182bc) but passes after.