Bug 1070583
| Summary: | rhds91 389-ds-base-1.2.11.15-31.el6_5.x86_64 crash in db4 __dbc_get_pp env = 0x0 ? |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | 389-ds-base |
| Version: | 6.5 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Marc Sauton <msauton> |
| Assignee: | Rich Megginson <rmeggins> |
| QA Contact: | Sankar Ramalingam <sramling> |
| CC: | acavalla, amsharma, jgalipea, mreynolds, msauton, nhosoi, nkinder |
| Target Milestone: | rc |
| Target Release: | 6.6 |
| Fixed In Version: | 389-ds-base-1.2.11.15-34.el6 |
| Doc Type: | Bug Fix |
| Clones: | 1108902 (view as bug list) |
| Bug Blocks: | 1108902 |
| Type: | Bug |
| Last Closed: | 2014-10-14 07:53:05 UTC |

Doc Text:

Cause: Shutting down the server while a Total Replication Update/Initialization was occurring.
Consequence: The server could crash.
Fix: Do not allow the replication plugin to shut down while the total update is still running.
Result: The server does not crash.
db4-4.7.25-18.el6_4.x86_64

Created attachment 868344 [details]
stack trace notes

Looks like they are initiating a replica init of a consumer while the server is being shut down? If that is the case, can they simply avoid doing a replica init from the supplier while they are shutting it down? That is, can they avoid shutting down a supplier that is in the middle of a replica init?

Upstream ticket: https://fedorahosted.org/389/ticket/47729

Fixed upstream.

Verification steps:
[1] Import 100,000 entry ldif
[2] Set up 2-way MMR (Replica A + B)
[3] Perform an online initialization from Replica A to Replica B
[4] Once the initialization has started, stop Replica A
[5] The server should stop gracefully, and when it is restarted there should not be any "Detected disorderly shutdown" messages in the error log (which would indicate that the server had crashed).
[1] Import 100,000 entry ldif
[root@dhcp201-155 export]# ls -al testdb.ldif
-rw-r--r--. 1 root root 150808094 Jul 8 03:16 testdb.ldif
[root@dhcp201-155 ~]# service dirsrv stop
Shutting down dirsrv:
dhcp201-155... [ OK ]
M1... [ OK ]
M2... [ OK ]
M3... [ OK ]
M4... [ OK ]
[root@dhcp201-155 export]# /usr/lib64/dirsrv/slapd-M1/ldif2db -s "dc=example,dc=com" -i "/export/testdb.ldif"
importing data ...
[08/Jul/2014:03:20:42 -0400] - Backend Instance: userRoot
[08/Jul/2014:03:20:42 -0400] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[08/Jul/2014:03:20:42 -0400] - check_and_set_import_cache: pagesize: 4096, pages: 255092, procpages: 52144
[08/Jul/2014:03:20:42 -0400] - WARNING: After allocating import cache 408144KB, the available memory is 612224KB, which is less than the soft limit 1048576KB. You may want to decrease the import cache size and rerun import.
[08/Jul/2014:03:20:42 -0400] - Import allocates 408144KB import cache.
[08/Jul/2014:03:20:42 -0400] - import userRoot: Beginning import job...
[08/Jul/2014:03:20:42 -0400] - import userRoot: Index buffering enabled with bucket size 100
[08/Jul/2014:03:20:42 -0400] - import userRoot: Processing file "/export/testdb.ldif"
[08/Jul/2014:03:21:02 -0400] - import userRoot: Processed 27781 entries -- average rate 1389.0/sec, recent rate 1389.0/sec, hit ratio 0%
[08/Jul/2014:03:21:22 -0400] - import userRoot: Processed 53698 entries -- average rate 1342.5/sec, recent rate 1342.4/sec, hit ratio 100%
[08/Jul/2014:03:21:42 -0400] - import userRoot: Processed 80248 entries -- average rate 1337.5/sec, recent rate 1311.7/sec, hit ratio 100%
[08/Jul/2014:03:21:57 -0400] - import userRoot: Finished scanning file "/export/testdb.ldif" (100006 entries)
[08/Jul/2014:03:21:58 -0400] - import userRoot: Workers finished; cleaning up...
[08/Jul/2014:03:21:58 -0400] - import userRoot: Workers cleaned up.
[08/Jul/2014:03:21:58 -0400] - import userRoot: Cleaning up producer thread...
[08/Jul/2014:03:21:58 -0400] - import userRoot: Indexing complete. Post-processing...
[08/Jul/2014:03:21:58 -0400] - import userRoot: Generating numSubordinates complete.
[08/Jul/2014:03:21:59 -0400] - import userRoot: Flushing caches...
[08/Jul/2014:03:21:59 -0400] - import userRoot: Closing files...
[08/Jul/2014:03:22:04 -0400] - All database threads now stopped
[08/Jul/2014:03:22:04 -0400] - import userRoot: Import complete. Processed 100006 entries in 82 seconds. (1219.59 entries/sec)
[2] Set up 2-way MMR (Replica A + B)
[3] Perform an online initialization from Replica A to Replica B
ldapmodify -D "cn=directory manager" -w Secret123 -p 30100 -x -h localhost << EOF
dn: cn=M1_to_M2,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
changetype: modify
replace: nsds5beginreplicarefresh
nsds5beginreplicarefresh: start
EOF
[4] Once the initialization has started, stop Replica A
[root@dhcp201-155 export]# /usr/lib64/dirsrv/slapd-M1/stop-slapd
[5] The server should stop gracefully, and when it is restarted there should not be any "Detected disorderly shutdown" messages in the error log (which would indicate that the server had crashed).
[root@dhcp201-155 ~]# tail -f /var/log/dirsrv/slapd-M1/errors
[08/Jul/2014:03:23:15 -0400] - I'm resizing my cache now...cache was 417939456 and is now 8000000
[08/Jul/2014:03:23:16 -0400] - slapd started. Listening on All Interfaces port 30100 for LDAP requests
[08/Jul/2014:03:23:16 -0400] - Listening on All Interfaces port 30101 for LDAPS requests
[08/Jul/2014:03:24:35 -0400] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=M1_to_M2" (dhcp201-155:30103)".
[08/Jul/2014:03:26:57 -0400] - slapd shutting down - signaling operation threads
[08/Jul/2014:03:27:04 -0400] - slapd shutting down - closing down internal subsystems and plugins
[08/Jul/2014:03:27:15 -0400] NSMMReplicationPlugin - agmt="cn=M1_to_M2" (dhcp201-155:30103): Warning: unable to send endReplication extended operation (Timed out)
[08/Jul/2014:03:27:16 -0400] - Waiting for 4 database threads to stop
[08/Jul/2014:03:27:17 -0400] - All database threads now stopped
[08/Jul/2014:03:27:17 -0400] - slapd stopped.
No crash found. Hence Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1385.html
Description of problem:

rhds91 crash with 389-ds-base-1.2.11.15-31.el6_5.x86_64

The scenario may be:
- 4x MMR with SSL, 9 backends
- use ldclt to add 10,000 entries to all 9 backends (9x ldclt to one master)
- use ldclt to delete all of them (9x ldclt to the same master)
- repeat about 10 to 15 times

The data is similar to this: cn=rhds00002,ou=people,dc=dc1,dc=example,dc=com

export rmax=10000
export filter=xxxxx
export testname=test1
ldclt -b ou=people,dc=dc1,dc=example,dc=com -D "cn=directory manager" -w password -h $master -e add,incr,noloop,person,commoncounter,referral=off -r 1 -R $rmax -f "cn=rhds$filter" -v > $testname.txt &

Version-Release number of selected component (if applicable):
389-ds-base-1.2.11.15-31.el6_5.x86_64
redhat-ds-base-9.1.0-1.el6dsrv.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. not exactly sure yet

Actual results:

Core was generated by `/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-pool-ldapbsa-provexpresso-01 -i /var/ru'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007ff774fb58eb in __dbc_get_pp (dbc=0x1c2e0d0, key=0x7ff6a0df33b0, data=0x7ff6a0df3380, flags=26) at ../../db/db_iface.c:2091
2091            ENV_ENTER(env, ip);
(gdb) where
#0  0x00007ff774fb58eb in __dbc_get_pp (dbc=0x1c2e0d0, key=0x7ff6a0df33b0, data=0x7ff6a0df3380, flags=26) at ../../db/db_iface.c:2091
#1  0x00007ff7731c5c21 in entryrdn_lookup_dn (be=0x1c0dd50, rdn=<value optimized out>, id=<value optimized out>, dn=0x7ff6a0df34e8, txn=<value optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_entryrdn.c:1241
#2  0x00007ff7731a231d in id2entry (be=0x1c0dd50, id=115270, txn=0x7ff6a0df35c0, err=<value optimized out>) at ldap/servers/slapd/back-ldbm/id2entry.c:378
#3  0x00007ff7731d5602 in ldbm_back_next_search_entry_ext (pb=0x7ff6800b23c0, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1487
#4  0x0000003d90086251 in iterate (pb=0x7ff6800b23c0, be=<value optimized out>, pnentries=0x7ff6a0df77b4, pagesize=-1, pr_statp=0x7ff6a0df77a8, send_result=1) at ldap/servers/slapd/opshared.c:1281
#5  0x0000003d90086777 in send_results_ext (pb=0x7ff6800b23c0, nentries=0x7ff6a0df77b4, pagesize=-1, pr_stat=0x7ff6a0df77a8, send_result=1) at ldap/servers/slapd/opshared.c:1682
#6  0x0000003d900874cd in op_shared_search (pb=0x7ff6800b23c0, send_result=1) at ldap/servers/slapd/opshared.c:853
#7  0x0000003d90093a1c in search_internal_callback_pb (pb=0x7ff6800b23c0, callback_data=<value optimized out>, prc=0x7ff772f32460 <get_result>, psec=0x7ff772f32d60 <send_entry>, prec=0) at ldap/servers/slapd/plugin_internal_op.c:812
#8  0x00007ff772f32af4 in repl5_tot_run (prp=0x2955870) at ldap/servers/plugins/replication/repl5_tot_protocol.c:446
#9  0x00007ff772f20f9a in prot_thread_main (arg=0x2954c40) at ldap/servers/plugins/replication/repl5_protocol.c:355
#10 0x0000003bc8e299a6 in _pt_root (arg=0x2955a20) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#11 0x0000003c9a0079d1 in start_thread (arg=0x7ff6a0dfa700) at pthread_create.c:301
#12 0x00000038f22e8b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Source of the faulting function (db_iface.c):

2067 * __dbc_get_pp --
2068 *      DBC->get pre/post processing.
2069 *
2070 * PUBLIC: int __dbc_get_pp __P((DBC *, DBT *, DBT *, u_int32_t));
2071 */
2072 int
2073 __dbc_get_pp(dbc, key, data, flags)
2074         DBC *dbc;
2075         DBT *key, *data;
2076         u_int32_t flags;
2077 {
2078         DB *dbp;
2079         DB_THREAD_INFO *ip;
2080         ENV *env;
2081         int ignore_lease, ret;
2082
2083         dbp = dbc->dbp;
2084         env = dbp->env;
2085
2086         ignore_lease = LF_ISSET(DB_IGNORE_LEASE) ? 1 : 0;
2087         LF_CLR(DB_IGNORE_LEASE);
2088         if ((ret = __dbc_get_arg(dbc, key, data, flags)) != 0)
2089                 return (ret);
2090
2091         ENV_ENTER(env, ip);             <== faulting line
2092
2093         DEBUG_LREAD(dbc, dbc->txn, "DBcursor->get",
2094             flags == DB_SET || flags == DB_SET_RANGE ? key : NULL, NULL, flags);
2095         ret = __dbc_get(dbc, key, data, flags);
2096
2097         /*
2098          * Check for master leases.
2099          */
2100         if (ret == 0 &&
2101             IS_REP_MASTER(env) && IS_USING_LEASES(env) && !ignore_lease)
2102                 ret = __rep_lease_check(env, 1);
2103
2104         ENV_LEAVE(env, ip);
2105         __dbt_userfree(env, key, NULL, data);
2106         return (ret);
2107 }

(gdb) p ip
$1 = <value optimized out>
(gdb) p env
$2 = (ENV *) 0x0
(gdb) p dbp
$3 = <value optimized out>
(gdb) p *dbc
$5 = {dbp = 0x1c2cc60, dbenv = 0x1db93a0, env = 0x1f15f40, thread_info = 0x0, txn = 0x0, priority = DB_PRIORITY_UNCHANGED, links = {tqe_next = 0x0, tqe_prev = 0x1c6a500}, rskey = 0x1c2e128, rkey = 0x1c2e150, rdata = 0x1c2e178, my_rskey = {data = 0x0, size = 0, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, my_rkey = {data = 0x0, size = 0, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, my_rdata = {data = 0x0, size = 0, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, lref = 0x7ff75d095bd8, locker = 0x7ff75d095bd8, lock_dbt = {data = 0x1c2e1d8, size = 28, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, lock = {pgno = 35, fileid = "\336\367\003\000\001\375\000\000\206g\332\006\232\250\001\000\000\000\000", type = 3}, mylock = {off = 0, ndx = 0, gen = 0, mode = DB_LOCK_NG}, cl_id = 0, dbtype = DB_BTREE, internal = 0x1c2cc70, close = 0x7ff774fb5d50 <__dbc_close_pp>, count = 0x7ff774fb5c60 <__dbc_count_pp>, del = 0x7ff774fb5ac0 <__dbc_del_pp>, dup = 0x7ff774fb59e0 <__dbc_dup_pp>, get = 0x7ff774fb5870 <__dbc_get_pp>, get_priority = 0x7ff774fa02c0 <__dbc_get_priority>, pget = 0x7ff774fb5410 <__dbc_pget_pp>, put = 0x7ff774fb4d10 <__dbc_put_pp>, set_priority = 0x7ff774fa02b0 <__dbc_set_priority>, c_close = 0x7ff774fb5d50 <__dbc_close_pp>, c_count = 0x7ff774fb5c60 <__dbc_count_pp>, c_del = 0x7ff774fb5ac0 <__dbc_del_pp>, c_dup = 0x7ff774fb59e0 <__dbc_dup_pp>, c_get = 0x7ff774fb5870 <__dbc_get_pp>, c_pget = 0x7ff774fb5410 <__dbc_pget_pp>, c_put = 0x7ff774fb4d10 <__dbc_put_pp>, am_bulk = 0x7ff774f054c0 <__bam_bulk>, am_close = 0x7ff774f06db0 <__bamc_close>, am_del = 0x7ff774f04990 <__bamc_del>, am_destroy = 0x7ff774f04790 <__bamc_destroy>, am_get = 0x7ff774f03fb0 <__bamc_get>, am_put = 0x7ff774f06030 <__bamc_put>, am_writelock = 0x7ff774f03060 <__bamc_writelock>, flags = 64}

see the env = 0x0 in the DB handle the cursor points at:

(gdb) p *dbc->dbp
$6 = {pgsize = 0, priority = DB_PRIORITY_UNCHANGED, db_append_recno = 0x171, db_feedback = 0x1c6cb70, dup_compare = 0x1c2e0c0, app_private = 0x1ded00000001, dbenv = 0x10, env = 0x0, type = 10848, mpf = 0x1, mutex = 0, fname = 0x1c2ccc8 "\240ýým\367\177", dname = 0x1c2ccc8 "\240┐m\367\177", open_flags = 29543864, fileid = "\000\000\000\000\240┐m\367\177\000\000\020\000\066\001\000\000\000", adj_fileid = 3173888, log_filename = 0x15c7b00002a60, meta_pgno = 1, locker = 0x1, cur_locker = 0x0, cur_txn = 0x0, associate_locker = 0x0, handle_lock = {off = 0, ndx = 0, gen = 0, mode = DB_LOCK_NG}, cl_id = 0, timestamp = 0, fid_gen = 0, my_rskey = {data = 0x0, size = 0, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, my_rkey = {data = 0x0, size = 0, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, my_rdata = {data = 0x0, size = 0, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}, saved_open_fhp = 0x7ef, dblistlinks = {tqe_next = 0x0, tqe_prev = 0x1c6aab0}, free_queue = {tqh_first = 0x170, tqh_last = 0x50}, active_queue = {tqh_first = 0x100000008, tqh_last = 0x1dbc310}, join_queue = {tqh_first = 0x0, tqh_last = 0x1db4c00}, s_secondaries = {lh_first = 0x1c8f800}, s_links = {le_next = 0x7ff7731cd600, le_prev = 0x1c05ed0}, s_refcnt = 29889472, s_callback = 0x6769666e6f, s_primary = 0xb1, s_assoc_flags = 29544160, f_primaries = {lh_first = 0x2}, s_foreign = 0x0, api_internal = 0x0, bt_internal = 0x0, h_internal = 0x1c2cf00, q_internal = 0x0, xa_internal = 0x0, associate = 0, associate_foreign = 0, close = 0x1c2c0a0, compact = 0, cursor = 0x5, del = 0x1800bd0, err = 0, errx = 0, exists = 0, fd = 0, get = 0, get_bt_minkey = 0, get_byteswapped = 0, get_cachesize = 0x21, get_dbname = 0x657571696e55746e, get_encrypt_flags = 0x38f2006449, get_env = 0x6700353239, get_errcall = 0x21, get_errfile = 0x186a0, get_errpfx = 0, get_flags = 0, get_h_ffactor = 0x31, get_h_nelem = 0x1c2ce30, get_lorder = 0x38f258fe00 <_IO_stdfile_0_lock>, get_mpf = 0x1c93580, get_msgfile = 0x1c2cfb0, get_multiple = 0x6769666e6f, get_open_flags = 0x21, get_pagesize = 0x1c8b860, get_priority = 0x62642e7972746e65, get_q_extentsize = 0x34, get_re_delim = 0x41, get_re_len = 0x72726573753d6e63, get_re_pad = 0x6c3d6e632c746f6f, get_re_source = 0x61746164206d6264, get_transactional = 0x3d6e632c65736162, get_type = 0x2c736e6967756c70, join = 0x69666e6f633d6e63, key_range = 0x67, open = 0x31, pget = 0x1c2d000, put = 0x7373616c00, remove = 0, rename = 0, set_alloc = 0x1de94d0, set_append_recno = 0x21, set_bt_compare = 0x6f4472657355746e, set_bt_minkey = 0x64496e69616d, set_bt_prefix = 0x1c2d120, set_cachesize = 0xb1, set_dup_compare = 0x1c2d0b0, set_encrypt = 0x2, set_errcall = 0, set_errfile = 0, set_errpfx = 0, set_feedback = 0x1c2d0d0, set_flags = 0, set_h_compare = 0, set_h_ffactor = 0, set_h_hash = 0, set_h_nelem = 0x1c2cfe0, set_lorder = 0, set_msgcall = 0x5, set_msgfile = 0x1800bd0, set_pagesize = 0, set_paniccall = 0, set_priority = 0, set_q_extentsize = 0, set_re_delim = 0, set_re_len = 0, set_re_pad = 0, set_re_source = 0x21, stat = 0x6f4472657355746e, stat_print = 0x64496e69616d, sync = 0x1c72020, truncate = 0x21, upgrade = 0x186a0, verify = 0, dump = 0, db_am_remove = 0x21, db_am_rename = 0x637463656a626f28, stored_get = 0x292a3d7373616c, stored_close = 0, am_ok = 33, preserve_fid = 0, orig_flags = 1784835880, flags = 1668572005}

Expected results:

Additional info: