Bug 1126841
Summary: named crashed due to a race condition when accessing the rbt

Product: Red Hat Enterprise Linux 6
Component: bind-dyndb-ldap
Version: 6.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: 6.6
Reporter: Mohit Agrawal <moagrawa>
Assignee: Petr Spacek <pspacek>
QA Contact: Namita Soman <nsoman>
CC: bhubbard, moagrawa, pspacek, sgadekar, thozza
Fixed In Version: bind-dyndb-ldap-2.3-7.el6
Doc Type: Bug Fix
Doc Text:
Prior to this update, the bind-dyndb-ldap plug-in incorrectly locked certain data structures. Consequently, a race condition during forwarder address reconfiguration could cause BIND to terminate unexpectedly. This bug has been fixed, bind-dyndb-ldap now locks data structures properly, and BIND no longer crashes in this scenario.
Cloned As: 1158763 (view as bug list)
Last Closed: 2015-07-22 05:38:56 UTC
Type: Bug
Bug Depends On: 1158763, 1175321
Description
Mohit Agrawal
2014-08-05 12:24:13 UTC
The backtrace pattern is given below:

#0  dns_rbt_findnode (rbt=0x7ff11bb97fd0, name=0x7ff11863f740, foundname=0x7ff118660d30, node=0x7ff11f09cfc8, chain=0x7ff11f09cfe8, options=1, callback=0x7ff1230ef090 <cache_zonecut_callback>, callback_arg=0x7ff11f09cfd0) at rbt.c:806
806		if (hash != HASHVAL(hnode))
(gdb) bt
#0  dns_rbt_findnode (rbt=0x7ff11bb97fd0, name=0x7ff11863f740, foundname=0x7ff118660d30, node=0x7ff11f09cfc8, chain=0x7ff11f09cfe8, options=1, callback=0x7ff1230ef090 <cache_zonecut_callback>, callback_arg=0x7ff11f09cfd0) at rbt.c:806
#1  0x00007ff1230f97a2 in cache_find (db=<value optimized out>, name=0x7ff11863f740, version=<value optimized out>, type=12, options=0, now=1405818422, nodep=0x7ff11f09e318, foundname=0x7ff118660d30, rdataset=0x7ff11866b780, sigrdataset=0x0) at rbtdb.c:4808
#2  0x00007ff12386d867 in query_find (client=<value optimized out>, event=0x0, qtype=12) at query.c:5476
#3  0x00007ff1238745dc in ns_query_start (client=0x7ff104497310) at query.c:7345
#4  0x00007ff123859b16 in client_request (task=<value optimized out>, event=<value optimized out>) at client.c:1961
#5  0x00007ff1221ec2f8 in dispatch (uap=0x7ff1237e7010) at task.c:1012
#6  run (uap=0x7ff1237e7010) at task.c:1157
#7  0x00007ff121ba19d1 in start_thread (arg=0x7ff11f0a0700) at pthread_create.c:301
#8  0x00007ff121102b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:122
#9  0x0000000000000000 in ?? ()
(gdb) x/i $pc
=> 0x7ff1230eadb5 <dns_rbt_findnode+1237>:	cmp 0x3c(%rbp),%ebx
(gdb) i r rbp
rbp            0xdededededededede	0xdededededededede
(gdb) p hnode
$1 = (dns_rbtnode_t *) 0xdededededededede
(gdb) l 805
800		for (hnode = rbt->hashtable[hash % rbt->hashsize];
801		     hnode != NULL;
802		     hnode = hnode->hashnext)
803		{
804			dns_name_t hnode_name;
805	
806			if (hash != HASHVAL(hnode))
807				continue;
808			if (find_up(hnode) != up_current)
809				continue;
(gdb) p rbt
$5 = (dns_rbt_t *) 0x7ff11bb97fd0
(gdb) p *rbt->hashtable[hash % rbt->hashsize]
$3 = {parent = 0x7ff10822dd38, left = 0xdededededededede, right = 0xdededededededede, down = 0xdededededededede, hashnext = 0xdededededededede, deadlink = {prev = 0xdededededededede, next = 0xdededededededede}, is_root = 0, color = 1, find_callback = 1, attributes = 3, nsec = 3, namelen = 222, offsetlen = 222, oldnamelen = 222, hashval = 3739147998, data = 0xdededededededede, dirty = 0, wild = 1, locknum = 951, references = {refs = -555819298}}

Why is hnode poisoned? Because thread 8, in frame 3, is deleting the same rbt pointer.
(gdb) t 8
[Switching to thread 8 (Thread 0x7ff11e69f700 (LWP 28297))]
#0  0x00007ff121ba33d1 in __pthread_mutex_lock (mutex=0x7ff1140d5480) at pthread_mutex_lock.c:61
61	  LLL_MUTEX_LOCK (mutex);
(gdb) bt
#0  0x00007ff121ba33d1 in __pthread_mutex_lock (mutex=0x7ff1140d5480) at pthread_mutex_lock.c:61
#1  0x00007ff1221dd489 in isc___mem_put (ctx0=0x7ff1140d5450, ptr=0x7ff10822c238, size=84, file=0x7ff1231cb3e8 "rbt.c", line=2104) at mem.c:1335
#2  0x00007ff1230ea17b in dns_rbt_deletetreeflat (rbtp=0x7ff11bb93288, quantum=<value optimized out>) at rbt.c:2104
#3  dns_rbt_destroy2 (rbtp=0x7ff11bb93288, quantum=<value optimized out>) at rbt.c:290
#4  0x00007ff1230fc36d in free_rbtdb (rbtdb=0x7ff11bb93010, log=isc_boolean_true, event=0x0) at rbtdb.c:900
#5  0x00007ff1230fcf52 in maybe_free_rbtdb (rbtdb=0x7ff11bb93010) at rbtdb.c:1036
#6  0x00007ff1230fd210 in detach (dbp=0x7ff0fc85b668) at rbtdb.c:1051
#7  0x00007ff1230ab1c0 in dns_db_detach (dbp=0x7ff0fc85b668) at db.c:182
#8  0x00007ff12317d2f9 in dns_view_flushcache2 (view=0x7ff0fc85b610, fixuponly=<value optimized out>) at view.c:1513
#9  0x00007ff11b8fd57d in ?? ()
#10 0x00007ff11e69ebd0 in ?? ()
#11 0x00007ff100000000 in ?? ()
#12 0x00007ff123404a60 in ?? () from /usr/lib64/libdns.so.81.4.1
#13 0x00000001fc055678 in ?? ()
#14 0x00007ff11b907898 in ?? ()
#15 0x0000000100000017 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb) f 3
#3  dns_rbt_destroy2 (rbtp=0x7ff11bb93288, quantum=<value optimized out>) at rbt.c:290
290		dns_rbt_deletetreeflat(rbt, quantum, &rbt->root);
(gdb) p rbt
$6 = (dns_rbt_t *) 0x7ff11bb97fd0

The race appears to be between these two functions.
(gdb) down
#1  0x00007ff1230f97a2 in cache_find (db=<value optimized out>, name=0x7ff11863f740, version=<value optimized out>, type=12, options=0, now=1405818422, nodep=0x7ff11f09e318, foundname=0x7ff118660d30, rdataset=0x7ff11866b780, sigrdataset=0x0) at rbtdb.c:4808
4808		result = dns_rbt_findnode(search.rbtdb->tree, name, foundname, &node,
(gdb) p search.rbtdb
$14 = (dns_rbtdb_t *) 0x7ff11bb93010
(gdb) l cache_find
4759	static isc_result_t
4760	cache_find(dns_db_t *db, dns_name_t *name, dns_dbversion_t *version,
4761	           dns_rdatatype_t type, unsigned int options, isc_stdtime_t now,
4762	           dns_dbnode_t **nodep, dns_name_t *foundname,
4763	           dns_rdataset_t *rdataset, dns_rdataset_t *sigrdataset)
4764	{
4765		dns_rbtnode_t *node = NULL;
4766		isc_result_t result;
4767		rbtdb_search_t search;
4768		isc_boolean_t cname_ok = ISC_TRUE;
4769		isc_boolean_t empty_node;
4770		nodelock_t *lock;
4771		isc_rwlocktype_t locktype;
4772		rdatasetheader_t *header, *header_prev, *header_next;
4773		rdatasetheader_t *found, *nsheader;
4774		rdatasetheader_t *foundsig, *nssig, *cnamesig;
4775		rdatasetheader_t *update, *updatesig;
4776		rbtdb_rdatatype_t sigtype, negtype;
4777	
4778		UNUSED(version);
4779	
4780		search.rbtdb = (dns_rbtdb_t *)db;
4781	
4782		REQUIRE(VALID_RBTDB(search.rbtdb));
4783		REQUIRE(version == NULL);
4784	
4785		if (now == 0)
4786			isc_stdtime_get(&now);
4787	
4788		search.rbtversion = NULL;
4789		search.serial = 1;
4790		search.options = options;
4791		search.copy_name = ISC_FALSE;
4792		search.need_cleanup = ISC_FALSE;
4793		search.wild = ISC_FALSE;
4794		search.zonecut = NULL;
4795		dns_fixedname_init(&search.zonecut_name);
4796		dns_rbtnodechain_init(&search.chain, search.rbtdb->common.mctx);
4797		search.now = now;
4798		update = NULL;
4799		updatesig = NULL;
4800	
4801		RWLOCK(&search.rbtdb->tree_lock, isc_rwlocktype_read);
4802	
4803		/*
4804		 * Search down from the root of the tree.  If, while going down, we
4805		 * encounter a callback node, cache_zonecut_callback() will search the
4806		 * rdatasets at the zone cut for a DNAME rdataset.
4807		 */
4808		result = dns_rbt_findnode(search.rbtdb->tree, name, foundname, &node,

We appear to take a read lock on the rbtdb tree above.

(gdb) t 8
[Switching to thread 8 (Thread 0x7ff11e69f700 (LWP 28297))]
#0  0x00007ff121ba33d1 in __pthread_mutex_lock (mutex=0x7ff1140d5480) at pthread_mutex_lock.c:61
61	  LLL_MUTEX_LOCK (mutex);
(gdb) f 5
#5  0x00007ff1230fcf52 in maybe_free_rbtdb (rbtdb=0x7ff11bb93010) at rbtdb.c:1036
1036			free_rbtdb(rbtdb, ISC_TRUE, NULL);
(gdb) p rbtdb
$15 = (dns_rbtdb_t *) 0x7ff11bb93010
(gdb) l maybe_free_rbtdb
989		isc_mem_putanddetach(&rbtdb->common.mctx, rbtdb, sizeof(*rbtdb));
990		isc_ondestroy_notify(&ondest, rbtdb);
991	}
992	
993	static inline void
994	maybe_free_rbtdb(dns_rbtdb_t *rbtdb) {
995		isc_boolean_t want_free = ISC_FALSE;
996		unsigned int i;
997		unsigned int inactive = 0;
998	
999		/* XXX check for open versions here */
1000	
1001		if (rbtdb->soanode != NULL)
1002			dns_db_detachnode((dns_db_t *)rbtdb, &rbtdb->soanode);
1003		if (rbtdb->nsnode != NULL)
1004			dns_db_detachnode((dns_db_t *)rbtdb, &rbtdb->nsnode);
1005	
1006		/*
1007		 * Even though there are no external direct references, there still
1008		 * may be nodes in use.
1009		 */
1010		for (i = 0; i < rbtdb->node_lock_count; i++) {
1011			NODE_LOCK(&rbtdb->node_locks[i].lock, isc_rwlocktype_write);
1012			rbtdb->node_locks[i].exiting = ISC_TRUE;
1013			NODE_UNLOCK(&rbtdb->node_locks[i].lock, isc_rwlocktype_write);
1014			if (isc_refcount_current(&rbtdb->node_locks[i].references)
1015			    == 0) {
1016				inactive++;
1017			}
1018		}
1019	
1020		if (inactive != 0) {
1021			RBTDB_LOCK(&rbtdb->lock, isc_rwlocktype_write);
1022			rbtdb->active -= inactive;
1023			if (rbtdb->active == 0)
1024				want_free = ISC_TRUE;
1025			RBTDB_UNLOCK(&rbtdb->lock, isc_rwlocktype_write);
1026			if (want_free) {
1027				char buf[DNS_NAME_FORMATSIZE];
1028				if (dns_name_dynamic(&rbtdb->common.origin))
1029					dns_name_format(&rbtdb->common.origin, buf,
1030							sizeof(buf));
1031				else
1032					strcpy(buf, "<UNKNOWN>");
1033				isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE,
1034					      DNS_LOGMODULE_CACHE, ISC_LOG_DEBUG(1),
1035					      "calling free_rbtdb(%s)", buf);
1036				free_rbtdb(rbtdb, ISC_TRUE, NULL);
1037			}
1038		}

In maybe_free_rbtdb() we take write locks but release them again before doing the actual free. So it seems we may be freeing elements of this structure while holding no lock on it, even though cache_find() still holds its read lock on the tree.

Hi Mohit. Thank you for the thorough investigation. From the coredump it is really obvious there is a race condition. I'm still working on the fix and will discuss it with upstream, since from what I can tell, the same issue is present in the latest BIND as well. I'll update the bug once I have new information.

I was able to obtain a more verbose backtrace from gdb. The cache flush is triggered from the bind-dyndb-ldap plugin while reconfiguring the global forwarders.

#0  0x00007ff121ba33d1 in ?? ()
#1  0x00007ff1221dd489 in isc___mem_put (ctx0=0x7ff1140d5450, ptr=0x7ff10822c238, size=84, file=0x7ff1231cb3e8 "rbt.c", line=2104) at mem.c:1335
#2  0x00007ff1230ea17b in dns_rbt_deletetreeflat (rbtp=0x7ff11bb93288, quantum=<value optimized out>) at rbt.c:2104
#3  dns_rbt_destroy2 (rbtp=0x7ff11bb93288, quantum=<value optimized out>) at rbt.c:290
#4  0x00007ff1230fc36d in free_rbtdb (rbtdb=0x7ff11bb93010, log=isc_boolean_true, event=0x0) at rbtdb.c:900
#5  0x00007ff1230fcf52 in maybe_free_rbtdb (rbtdb=0x7ff11bb93010) at rbtdb.c:1036
#6  0x00007ff1230fd210 in detach (dbp=0x7ff0fc85b668) at rbtdb.c:1051
#7  0x00007ff1230ab1c0 in dns_db_detach (dbp=0x7ff0fc85b668) at db.c:182
#8  0x00007ff12317d2f9 in dns_view_flushcache2 (view=0x7ff0fc85b610, fixuponly=<value optimized out>) at view.c:1513
#9  0x00007ff11b8fd57d in configure_zone_forwarders (entry=<value optimized out>, inst=0x7ff11bbde010, name=0x7ff123404a60) at ldap_helper.c:1155
#10 0x00007ff11b8fda16 in ldap_parse_configentry (entry=0x7ff10adfce08, inst=0x7ff11bbde010) at ldap_helper.c:1179
#11 0x00007ff11b90176b in update_config (task=<value optimized out>, event=0x7ff114a0b340) at ldap_helper.c:3396
#12 0x00007ff1221ec2f8 in dispatch (uap=0x7ff1237e7010) at task.c:1012
#13 run (uap=0x7ff1237e7010) at task.c:1157
#14 0x00007ff121ba19d1 in ?? ()
#15 0x00007ff11e69f700 in ?? ()
#16 0x0000000000000000 in ?? ()

In the named source, dns_view_flushcache() is always called in a section protected by isc_task_beginexclusive() and isc_task_endexclusive(). So it seems that in a rare situation, while the RBT database backing the cache is being freed, another thread may still try to search it. It appears the change needs to be made in the ldap plugin. However, we are reluctant to change anything in the plugin without a reproducer for testing the fix. Since I believe I know what happened, I'll try to reproduce the crash.

Thanks to Petr Spacek for help! I was able to reproduce the race condition.
The issue is definitely caused by the bind-dyndb-ldap plugin, specifically by the code responsible for updating the global forwarders. With the reproducer, named crashes in different places and the time until the crash varies, but it is always under 30 seconds. For example:

24-Oct-2014 13:45:37.875 rbtdb.c:684: REQUIRE(prev > 0) failed, back trace
24-Oct-2014 13:45:37.875 #0 0x7ffff7fa7eff in ??
24-Oct-2014 13:45:37.875 #1 0x7ffff691389a in ??
24-Oct-2014 13:45:37.875 #2 0x7ffff7831fcf in ??
24-Oct-2014 13:45:37.875 #3 0x7ffff77f0252 in ??
24-Oct-2014 13:45:37.875 #4 0x7ffff7faa26a in ??
24-Oct-2014 13:45:37.875 #5 0x7ffff7faafed in ??
24-Oct-2014 13:45:37.875 #6 0x7ffff7fb26dd in ??
24-Oct-2014 13:45:37.875 #7 0x7ffff7fb95dc in ??
24-Oct-2014 13:45:37.875 #8 0x7ffff7f9eb16 in ??
24-Oct-2014 13:45:37.875 #9 0x7ffff69322f8 in ??
24-Oct-2014 13:45:37.875 #10 0x7ffff62e79d1 in ??
24-Oct-2014 13:45:37.875 #11 0x7ffff5848b6d in ??
24-Oct-2014 13:45:37.875 exiting (due to assertion failure)

rbtdb.c:980: REQUIRE((&rbtdb->references)->refs == 0) failed, back trace
#0 0x7ffff69136c6 in ??
#1 0x7ffff691389a in ??
#2 0x7ffff7841a05 in ??
#3 0x7ffff7841f52 in ??
#4 0x7ffff7842210 in ??
#5 0x7ffff77f01c0 in ??
#6 0x7ffff78c22f9 in ??
#7 0x7fffe540357d in ??
#8 0x7fffe5403a16 in ??
#9 0x7fffe540776b in ??
#10 0x7ffff69322f8 in ??
#11 0x7ffff62e79d1 in ??
#12 0x7ffff5848b6d in ??

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffea1d7700 (LWP 32717)]
0x00007ffff5792925 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff5792925 in raise () from /lib64/libc.so.6
#1  0x00007ffff5794105 in abort () from /lib64/libc.so.6
#2  0x00007ffff691389f in isc_assertion_failed (file=<value optimized out>, line=<value optimized out>, type=<value optimized out>, cond=<value optimized out>) at assertions.c:58
#3  0x00007ffff7841a05 in free_rbtdb (rbtdb=0x7fffe5699010, log=isc_boolean_true, event=0x0) at rbtdb.c:980
#4  0x00007ffff7841f52 in maybe_free_rbtdb (rbtdb=0x7fffe5699010) at rbtdb.c:1036
#5  0x00007ffff7842210 in detach (dbp=0x7fffe0410b68) at rbtdb.c:1051
#6  0x00007ffff77f01c0 in dns_db_detach (dbp=0x7fffe0410b68) at db.c:182
#7  0x00007ffff78c22f9 in dns_view_flushcache2 (view=0x7fffe0410b10, fixuponly=<value optimized out>) at view.c:1513
#8  0x00007fffe540357d in configure_zone_forwarders (entry=<value optimized out>, inst=0x7fffe5932010, name=0x7ffff7b49a60) at ldap_helper.c:1155
#9  0x00007fffe5403a16 in ldap_parse_configentry (entry=0x7fffe59485f8, inst=0x7fffe5932010) at ldap_helper.c:1179
#10 0x00007fffe540776b in update_config (task=<value optimized out>, event=0x7fffe5949290) at ldap_helper.c:3396
#11 0x00007ffff69322f8 in dispatch (uap=0x7ffff7f2d010) at task.c:1012
#12 run (uap=0x7ffff7f2d010) at task.c:1157
#13 0x00007ffff62e79d1 in start_thread () from /lib64/libpthread.so.0
#14 0x00007ffff5848b6d in clone () from /lib64/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0x7ffff37e6700 (LWP 32702))]
#0  0x00007ffff5792925 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff5792925 in raise () from /lib64/libc.so.6
#1  0x00007ffff5794105 in abort () from /lib64/libc.so.6
#2  0x00007ffff7fa80e4 in assertion_failed (file=<value optimized out>, line=<value optimized out>, type=<value optimized out>, cond=<value optimized out>) at ./main.c:219
#3  0x00007ffff691389a in isc_assertion_failed (file=<value optimized out>, line=<value optimized out>, type=<value optimized out>, cond=<value optimized out>) at assertions.c:57
#4  0x00007ffff7831fcf in attach (source=<value optimized out>, targetp=<value optimized out>) at rbtdb.c:684
#5  0x00007ffff77f0252 in dns_db_attach (source=0x7fffe5699010, targetp=0x7ffff37e3998) at db.c:167
#6  0x00007ffff7faa26a in query_getcachedb (client=0x7fffe02c41c0, name=0x7fffe5cf5060, qtype=1, dbp=0x7ffff37e4328, options=<value optimized out>) at query.c:923
#7  0x00007ffff7faafed in query_getdb (client=0x7fffe02c41c0, name=0x7fffe5cf5060, qtype=1, options=0, zonep=0x7ffff37e42c0, dbp=0x7ffff37e4328, versionp=0x7ffff37e42c8, is_zonep=0x7ffff37e433c) at query.c:1084
#8  0x00007ffff7fb26dd in query_find (client=<value optimized out>, event=0x0, qtype=1) at query.c:5378
#9  0x00007ffff7fb95dc in ns_query_start (client=0x7fffe02c41c0) at query.c:7345
#10 0x00007ffff7f9eb16 in client_request (task=<value optimized out>, event=<value optimized out>) at client.c:1961
#11 0x00007ffff69322f8 in dispatch (uap=0x7ffff7f2d010) at task.c:1012
#12 run (uap=0x7ffff7f2d010) at task.c:1157
#13 0x00007ffff62e79d1 in start_thread () from /lib64/libpthread.so.0
#14 0x00007ffff5848b6d in clone () from /lib64/libc.so.6

Created attachment 950371 [details]
Script updating the global forwarders in LDAP
Note: the used domain may need to be changed in the script
Created attachment 950372 [details]
Script querying the server
Reproducer:
1. install ipa-server, bind-dyndb-ldap, bind
2. run $ ipa-server-install --setup-dns
3. stop named
4. start named - $ named -u named -fg -n 20
5. run both attached scripts in different terminals
6. wait for named to crash :)

Upstream ticket:
https://fedorahosted.org/bind-dyndb-ldap/ticket/142

Verified using bind-dyndb-ldap-2.3-7.el6.x86_64. Followed steps from #comment 14:
1. install ipa-server, bind-dyndb-ldap, bind
2. run $ ipa-server-install --setup-dns
   ipa-server-install --setup-dns --forwarder=1.2.3.4 --hostname=dell-pesc1425-02.testrelm.test -r TESTRELM.TEST -n testrelm.test -p xxx -P xxx -a xxx -U
3. stop named
   # service named stop
4. start named - $ named -u named -fg -n 20
   # export KRB5_KTNAME=/etc/named.keytab
   # named -u named -fg -n 20 &
5. run both attached scripts in different terminals
   updated update_server.py, ran it in one terminal, and ran query.py in two terminals for ~3 min
6. wait for named to crash :)
   No crash (per #comment 11, it should have crashed within 30 seconds)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1259.html

*** Bug 1272445 has been marked as a duplicate of this bug. ***