Bug 621008
Summary: parsing purge RUV from changelog at startup fails
Product: [Retired] 389
Component: Replication - General
Version: 1.2.6
Hardware: All
OS: All
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: Ulf Weltman <ulf.weltman>
Assignee: Nathan Kinder <nkinder>
QA Contact: Viktor Ashirov <vashirov>
CC: amsharma, jgalipea
Doc Type: Bug Fix
Last Closed: 2015-12-07 17:14:28 UTC
Bug Blocks: 576869, 639035
Description
Ulf Weltman
2010-08-04 00:05:17 UTC
I've prototyped a couple of fixes, but they're both awkward. Any preference, or further suggestions?

One fix would be to update ruv_init_from_bervals() and get_ruvelement_from_berval() to take an additional flag indicating that they're being passed a purge RUV and are thus allowed to accept just a single CSN. Another fix would be to insert a dummy minimum CSN into each vector in the purge RUV when we flush to disk at shutdown, and leave the parsing code alone. E.g.:

{replica 1 ldap://host:389} 00000000000000010000 4c570c71000000010000

Created attachment 436649 [details]
patch proposal
This patch causes the purge RUV to be flushed to the changelog in a proper, parsable format, with both a min and a max CSN. Note that this only happens at shutdown.
Formerly the purge RUV would be flushed looking like this (single CSN):
dbid: 000000de000000000000
purge ruv:
{replicageneration} 4c59afeb000000010000
{replica 1 ldap://host:389} 4c59b1fa000000010000
{replica 2 ldap://host:389} 4c59b1d3000000020000
With the patch, the max CSN is copied to the min CSN:
dbid: 000000de000000000000
purge ruv:
{replicageneration} 4c59afeb000000010000
{replica 1 ldap://host:389} 4c59b1fa000000010000 4c59b9c3000000010000
{replica 2 ldap://host:389} 4c59b1d3000000020000 4c59b200000100020000
Comment on attachment 436649 [details]
patch proposal
Looks like the code that uses the purge RUV only needs the CSN anyway, so this should make parsing work better.
Created attachment 436660 [details]
better patch
Rich, sorry, here is a better patch: it inserts a dummy CSN with just the RID set instead of copying the max CSN. The reason is to make it more generic. Insertion of the min CSN into the purge RUV doesn't need to happen at every shutdown, only the first time we shut down after changelog trimming has been enabled. It gets read at the next startup, held in memory, and flushed as-is again at the next shutdown. Making it a zeroed-out dummy CSN also makes it clearer that it is not used for anything.
dbid: 000000de000000000000
purge ruv:
{replicageneration} 4c59afeb000000010000
{replica 1 ldap://host:389} 00000000000000010000 4c59c9ff000000010000
{replica 2 ldap://host:389} 00000000000000020000 4c59ca06000000020000
Created attachment 468488 [details]
Patch
This is the same as Ulf's second patch, but in git patch format.
I have tested this patch, and it seems to behave correctly. In the scenario mentioned in the initial description, the master halts replication with the consumer instead of continuing and missing updates that were trimmed from the changelog.

Pushed to master. Thanks to Ulf for the patch, and to Rich for his review!

Counting objects: 17, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.99 KiB, done.
Total 9 (delta 7), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   2d80c40..23c5c1b  master -> master

s1-Master - 34202
c1-Slave - 34206
==========
[root@rhel61-ds90-amita slapd-c1]# /usr/lib64/dirsrv/slapd-c1/stop-slapd
[root@rhel61-ds90-amita slapd-c1]# service dirsrv status
dirsrv c1 is stopped
dirsrv c2 (pid 6028) is running...
dirsrv rhel61-ds90-amita (pid 6097) is running...
dirsrv s1 (pid 6167) is running...
dirsrv s2 (pid 6240) is running...

./db2ldif -n userRoot -E -a /outputc1.ldif -s "o=airius.com"
Exported ldif file: /outputc1.ldif
ldiffile: /outputc1.ldif
[02/Jun/2011:14:48:56 +051800] - export userRoot: Processed 281 entries (100%).
[02/Jun/2011:14:48:56 +051800] - All database threads now stopped
==============
[root@rhel61-ds90-amita changelog]# ls -l /var/lib/dirsrv/slapd-s1/db/changelog
total 220
-rw-------. 1 amsharma amsharma 221184 Jun  2 14:19 8ce36983-878c11e0-8623cf4a-2c436d47_4dde3b91000000010000.db4
-rw-rw-r--. 1 amsharma amsharma      0 Jun  2 14:18 8ce36983-878c11e0-8623cf4a-2c436d47.sema
-rw-------. 1 amsharma amsharma     30 May 26 17:07 DBVERSION

[root@rhel61-ds90-amita changelog]# ldapadd -x -h localhost -p 34202 -D "cn=Directory Manager" -w Secret123 << EOF
> dn: uid=amsharma1,o=airius.com
> cn: ams
> sn: ams
> givenname: ams
> objectclass: top
> objectclass: person
> objectclass: organizationalPerson
> objectclass: inetOrgPerson
> uid: ams
> mail: ams
> userpassword: amsamsams
> EOF
adding new entry "uid=amsharma1,o=airius.com"

[root@rhel61-ds90-amita changelog]# ldapadd -x -h localhost -p 34202 -D "cn=Directory Manager" -w Secret123 << EOF
> dn: uid=amsharma11,o=airius.com
> cn: ams11
> sn: ams11
> givenname: ams11
> objectclass: top
> objectclass: person
> objectclass: organizationalPerson
> objectclass: inetOrgPerson
> uid: ams11
> mail: ams11
> userpassword: amsamsams11
> EOF
adding new entry "uid=amsharma11,o=airius.com"

[root@rhel61-ds90-amita changelog]# ls -l /var/lib/dirsrv/slapd-s1/db/changelog
total 220
-rw-------. 1 amsharma amsharma 221184 Jun  2 14:19 8ce36983-878c11e0-8623cf4a-2c436d47_4dde3b91000000010000.db4
-rw-rw-r--. 1 amsharma amsharma      0 Jun  2 14:18 8ce36983-878c11e0-8623cf4a-2c436d47.sema
-rw-------. 1 amsharma amsharma     30 May 26 17:07 DBVERSION
[root@rhel61-ds90-amita changelog]#

I am getting a segmentation fault, and I'm not sure why. Any suggestions?

[root@rhel61-ds90-amita slapd-c1]# ./ldif2db -n userRoot -E -i /outputc1.ldif
importing data ...
[02/Jun/2011:15:14:08 +051800] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[02/Jun/2011:15:14:08 +051800] - check_and_set_import_cache: pagesize: 4096, pages: 255396, procpages: 49890
[02/Jun/2011:15:14:08 +051800] - WARNING: After allocating import cache 408632KB, the available memory is 612952KB, which is less than the soft limit 1048576KB. You may want to decrease the import cache size and rerun import.
[02/Jun/2011:15:14:08 +051800] - Import allocates 408632KB import cache.
[02/Jun/2011:15:14:08 +051800] - import userRoot: Beginning import job...
[02/Jun/2011:15:14:08 +051800] - import userRoot: Index buffering enabled with bucket size 100
[02/Jun/2011:15:14:08 +051800] - import userRoot: Processing file "/outputc1.ldif"
./ldif2db: line 32:  6700 Segmentation fault      ./ns-slapd ldif2db -D /etc/dirsrv/slapd-c1 "$@" 2>&1

Here is the stack trace from the crash:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0022700 (LWP 7818)]
slapi_be_is_flag_set (be=0x0, flag=1) at ldap/servers/slapd/backend.c:561
561     {
(gdb) bt
#0  slapi_be_is_flag_set (be=0x0, flag=1) at ldap/servers/slapd/backend.c:561
#1  0x00007ffff3310aa6 in repl_chain_on_update () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#2  0x00000033c66754a3 in mtn_get_be_distributed (pb=<value optimized out>, target_node=0x74da40, target_sdn=0x7fffd80012f0, flag_stop=0x7ffff001fbac) at ldap/servers/slapd/mapping_tree.c:2430
#3  0x00000033c66766ab in slapi_mapping_tree_find_backend_for_sdn (sdn=0x7fffd80012f0) at ldap/servers/slapd/mapping_tree.c:1923
#4  0x00007ffff3577aa7 in import_entry_belongs_here (e=<value optimized out>, be=0x8738c0) at ldap/servers/slapd/back-ldbm/import.c:411
#5  0x00007ffff357dd26 in import_producer (param=<value optimized out>) at ldap/servers/slapd/back-ldbm/import-threads.c:564
#6  0x000000389be29633 in ?? () from /lib64/libnspr4.so
#7  0x0000003897a077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000038972e68ed in clone () from /lib64/libc.so.6

Here is a better stack trace (ds-replication debuginfo is installed):

(gdb) bt
#0  slapi_be_is_flag_set (be=0x0, flag=1) at ldap/servers/slapd/backend.c:561
#1  0x00007ffff3310aa6 in repl_chain_on_update (pb=0x7fffd8003e80, target_dn=<value optimized out>, mtn_be_names=<value optimized out>, be_count=2, node_dn=<value optimized out>, mtn_be_states=<value optimized out>) at ldap/servers/plugins/replication/replutil.c:853
#2  0x00000033c66754a3 in mtn_get_be_distributed (pb=<value optimized out>, target_node=0x74da40, target_sdn=0x7fffd80012f0, flag_stop=0x7ffff001fbac) at ldap/servers/slapd/mapping_tree.c:2430
#3  0x00000033c66766ab in slapi_mapping_tree_find_backend_for_sdn (sdn=0x7fffd80012f0) at ldap/servers/slapd/mapping_tree.c:1923
#4  0x00007ffff3577aa7 in import_entry_belongs_here (e=<value optimized out>, be=0x8738c0) at ldap/servers/slapd/back-ldbm/import.c:411
#5  0x00007ffff357dd26 in import_producer (param=<value optimized out>) at ldap/servers/slapd/back-ldbm/import-threads.c:564
#6  0x000000389be29633 in ?? () from /lib64/libnspr4.so
#7  0x0000003897a077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000038972e68ed in clone () from /lib64/libc.so.6

The problem seems to be that be is NULL, and slapi_be_is_flag_set() dereferences it. The code in repl_chain_on_update() that sets be is as follows:

850     for (ii = 0; ii < be_count; ++ii)
851     {
852         Slapi_Backend *be = slapi_be_select_by_instance_name(mtn_be_names[ii]);
853         if (slapi_be_is_flag_set(be, SLAPI_BE_FLAG_REMOTE_DATA))
854         {
855             chaining_backend = ii;

We should make line 853 check if be is NULL before calling slapi_be_is_flag_set(), but I'm still not sure why be is NULL in this test. This looks like a new, unrelated bug that has to do with chain on update being used.
Please file a new bug and put the details of your replication/chaining setup in the bug report. Also include the stack trace detail that I provided.

OK. Marking this bug as VERIFIED. Opened new bug https://bugzilla.redhat.com/show_bug.cgi?id=710377.