388021 – MMR breaks from master that has been reinited

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	389
Classification:	Retired
Component:	Replication - General
Sub Component:
Version:	1.0.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Rich Megginson
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	436695
Depends On:
Blocks:	240316 FDS1.1.0 436832

Reported:	2007-11-17 01:53 UTC by Rich Megginson
Modified:	2018-10-19 20:06 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-07 16:46:13 UTC
Embargoed:

Attachments
diffs (3.97 KB, patch) 2007-11-19 15:11 UTC, Rich Megginson
cvs commit log (200 bytes, text/plain) 2007-11-19 17:24 UTC, Rich Megginson

Description Rich Megginson 2007-11-17 01:53:48 UTC

If a master has both received updates from and sent updates to other masters,
and you then reinit that master, it will no longer be able to send updates.  You
will see errors like the following in that master's error log:
[14/Nov/2007:15:42:22 +0100] agmt="cn=master-to-other-replica" (master:389) -
Can't locate CSN 6639d5a5000000010000 in the changelog (DB rc=-30989). The
consumer may need to be reinitialized.

The problem is that after being reinitialized, the RUV for this master contains
a CSN that does not exist in the changelog.  When the master attempts to
position the changelog db cursor, it cannot find this record, so the cursor is
invalid, and no changes can be sent.
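
To make the failure concrete, here is a minimal Python sketch. It is an illustration only, not the server's code. It assumes the usual 20-hex-digit CSN layout (8 digits of timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number) and models the changelog as a plain sorted list of CSN strings. The names parse_csn and position_cursor are hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Optional


class CSN(NamedTuple):
    timestamp: int   # seconds since the epoch
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields (assumed layout)."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(text)}")
    return CSN(int(text[0:8], 16), int(text[8:12], 16),
               int(text[12:16], 16), int(text[16:20], 16))


def position_cursor(changelog: List[str], csn: str) -> Optional[int]:
    """Return the index of `csn` in a sorted changelog, or None if it is absent."""
    i = bisect_left(changelog, csn)
    if i < len(changelog) and changelog[i] == csn:
        return i
    return None


# The CSN from the error message above.
stale = "6639d5a5000000010000"
print(parse_csn(stale))

# In this model a reinit leaves the changelog empty, so an exact-match
# lookup for a CSN that is still listed in the RUV finds nothing.
changelog_after_reinit: List[str] = []
print(position_cursor(changelog_after_reinit, stale))  # None
```

With no valid cursor position there is nowhere to start replaying changes from, which is the state the error message reports.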

Comment 1 Rich Megginson 2007-11-17 02:11:40 UTC

A database export/import on a good master followed by a reinit of the other
masters will clear up this problem.  But make sure you have no pending changes
first.

Comment 2 Rich Megginson 2007-11-19 15:11:00 UTC

Created attachment 263531
diffs

Comment 3 Dael Maselli 2007-11-19 17:03:08 UTC

I think I found a workaround: instead of deleting the changelog, I changed its
max records setting. After sending 2 or 3 updates from a "good" server, the
errors disappear and all updates work well.

Comment 4 Dael Maselli 2007-11-19 17:06:05 UTC

The value I set in Max changelog records was "1"

Comment 5 Rich Megginson 2007-11-19 17:17:40 UTC

Can you reset the max changelog records back to the default?  If you use "1",
you may cause replicas to get out of sync with this one and require reinit.

Comment 6 Rich Megginson 2007-11-19 17:24:32 UTC

Created attachment 263701
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: This problem occurs when you have two or more masters, and you
have updates that have originated at a master that have been sent to other
masters (so that the other masters have a valid min/max csn for that replica in
the ruv).  If that master needs to be reinitialized for some reason (crash,
etc.) the reinit will erase the changelog.  The RUV for that master will now
contain CSNs that are not in the changelog.  If that master attempts to update
another master, it will first look at the RUV from the consumer, which will
contain the old CSNs, and it will look for those CSNs in the changelog, fail,
and abort the update process, meaning this master can no longer send updates to
other servers.
The solution is for the master to just use the min CSN in its own RUV as the
new starting point, if it has not been purged.  In the case of purging, if the
CSN is not found, this means the consumer is too far behind and must be
reinitialized.
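
The selection logic described above can be sketched roughly as follows. This is an illustrative Python model, not the actual patch: the changelog is a list of CSN strings, and pick_start, consumer_csn, own_min_csn and ConsumerNeedsReinit are hypothetical names. As a simplification, the master's own min CSN is taken to be the oldest change it still holds.

```python
from typing import Optional, Sequence


class ConsumerNeedsReinit(Exception):
    """The consumer is too far behind to be updated incrementally."""


def pick_start(changelog: Sequence[str], consumer_csn: str,
               own_min_csn: Optional[str]) -> str:
    """Choose the CSN at which to start replaying changes to a consumer."""
    if consumer_csn in changelog:
        # Normal case: the CSN the consumer last saw is still in the changelog.
        return consumer_csn
    if own_min_csn is not None and own_min_csn in changelog:
        # Fallback: start from the min CSN in the master's own RUV,
        # provided that record has not been purged.
        return own_min_csn
    # Neither CSN can be found: the consumer must be reinitialized.
    raise ConsumerNeedsReinit(consumer_csn)


# A stale CSN known to the consumer, and a changelog that only holds
# changes made after the reinit (values are made up for the example).
stale = "6639d5a5000000010000"
changelog = ["6639d600000000010000", "6639d601000000010000"]
print(pick_start(changelog, stale, changelog[0]))  # 6639d600000000010000
```

Before the fix, the model would stop at the first failed lookup. With the fallback, only a purged min CSN leads to a required reinit.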
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no

Comment 7 Dael Maselli 2007-11-19 18:40:48 UTC

(In reply to comment #5)
> Can you reset the max changelog records back to the default?  If you use "1",
> you may cause replicas to get out of sync with this one and require reinit.

Sure. I did that as soon as I saw everything working. It still works fine.

Comment 9 reinhard nappert 2008-02-07 17:43:46 UTC

Hi,

I was wondering if you have some scripts to reproduce this bug. Once in a while,
I come across this issue, but I cannot reproduce it.
Also, do you see any issues with back-porting this fix to 1.0.4?

Thanks,
-Reinhard

Comment 10 Rich Megginson 2008-02-07 17:54:40 UTC

(In reply to comment #9)
> Hi,
> 
> I was wondering if you have some scripts to reproduce this bug. Once in a while,
> I come across this issue, but I can not reproduce it. 

I don't think we have any script specifically for this test.  We used some
scripts to create masters, different scripts to create ldif files, different
scripts to add entries, and different scripts to reinit the master.  These
scripts are unfortunately not open source yet, but we are working on it.

Here are the basic steps:

- Set up 3 instances of slapd (M1, M2, M3), replicating like this:

  M1  ---  M2
   \      /
    \    /
      M3
- Initialized M1 with 100,000 entries generated using dbgen.pl.
- Initialized M2 from M1. Initialized M3 from M2.
- Used ldclt to add 10,000 different entries to M1.
- Used ldclt to add 10,000 different entries to M2.
- Used ldclt to add 10,000 different entries to M3.

- Re-initialized M2 from M1.
- Used ldclt to add 10,000 more entries to M2.

No errors were seen.  dbgen.pl and ldclt are included with the Fedora DS software.
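
If dbgen.pl is not at hand, distinct test entries can also be produced with a few lines of Python. This is a stand-in for illustration, not what was used in the test above. The suffix dc=example,dc=com, the ou=People container, the uid naming scheme and the function names are all assumptions.

```python
from typing import Iterator


def gen_entries(count: int, start: int = 0,
                suffix: str = "dc=example,dc=com") -> Iterator[str]:
    """Yield `count` distinct inetOrgPerson entries in LDIF form."""
    for n in range(start, start + count):
        uid = f"user{n:06d}"
        yield "\n".join([
            f"dn: uid={uid},ou=People,{suffix}",
            "objectClass: top",
            "objectClass: person",
            "objectClass: organizationalPerson",
            "objectClass: inetOrgPerson",
            f"uid: {uid}",
            f"cn: Test User {n}",
            f"sn: User{n}",
            "",
        ])


def write_ldif(path: str, count: int, start: int = 0) -> None:
    """Write the generated entries to `path`, separated by blank lines."""
    with open(path, "w") as out:
        for entry in gen_entries(count, start):
            out.write(entry + "\n")
```

Using a different start value per master (for example 0, 10000 and 20000) keeps the entries added to M1, M2 and M3 from colliding.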

> Also, do you see issues back-porting this fix to 104.

We don't have any plans to make any patch release RPMs of the 1.0.x line.  All
new development work is focused on Fedora DS 1.1.x.

> 
> Thanks,
> -Reinhard

Comment 11 Chandrasekar Kannan 2008-04-07 21:35:03 UTC

*** Bug 436695 has been marked as a duplicate of this bug. ***

Comment 18 Juan 2009-08-12 14:48:39 UTC

Hi

I have exactly the same problem described above. I have 4 servers in multimaster mode. These are the package versions, on a CentOS 5 installation:

# rpm -qa | grep -i fedora
fedora-ds-console-1.1.0-5.fc6
fedora-ds-base-1.1.0-3.fc6
fedora-admin-console-1.1.0-4.fc6
fedora-idm-console-1.1.0-5.fc6
fedora-ds-admin-1.1.1-1.fc6
fedora-ds-1.1.0-3.fc6

# uname -a
Linux XXXXXXXXXXXXXXX 2.6.18-8.el5PAE #1 SMP Thu Mar 15 20:29:51 EDT 2007 i686 i686 i386 GNU/Linux

Which version of FDS has this bug fixed? Could it be dangerous to apply the workaround described in comments #c3 and #c4? What is the best value for max changelog records (I currently have it set to unlimited)?

Regards and thanks in advance.

Comment 19 Rich Megginson 2009-08-12 14:55:04 UTC

(In reply to comment #18)
> Hi
> 
> I have exactly the same problem described above. I have 4 servers in
> multimaster mode. The version of the packages are these, on a Centos 5
> installation:
> 
> # rpm -qa | grep -i fedora
> fedora-ds-console-1.1.0-5.fc6
> fedora-ds-base-1.1.0-3.fc6
> fedora-admin-console-1.1.0-4.fc6
> fedora-idm-console-1.1.0-5.fc6
> fedora-ds-admin-1.1.1-1.fc6
> fedora-ds-1.1.0-3.fc6
> 
> # uname -a
> Linux XXXXXXXXXXXXXXX 2.6.18-8.el5PAE #1 SMP Thu Mar 15 20:29:51 EDT 2007 i686
> i686 i386 GNU/Linux
> 
> Which version of FDS has this bug fixed?

The fix is in fedora-ds-base-1.2.0

> Could be dangerous to apply the
> solution described in comments #c3 and #c4?

I'm not sure - the original poster reported success.

> What should be the best value for
> max changelog records (now i have set it to unlimited)?

If you use the method in #c3 and #c4, set the max changelog records to 1, verify the error is gone after a couple of updates, then set it back to unlimited.
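
For anyone who prefers ldapmodify over the console, the two changes could be generated with a small Python helper like the one below. This is a sketch under assumptions: it presumes that "max changelog records" corresponds to the nsslapd-changelogmaxentries attribute on cn=changelog5,cn=config and that a value of 0 means unlimited. Check both against the documentation for your version before using it. The function name is hypothetical.

```python
def changelog_limit_ldif(max_entries: int) -> str:
    """Build an LDIF modify record that sets the changelog entry limit."""
    if max_entries < 0:
        raise ValueError("max_entries must be >= 0")
    return "\n".join([
        "dn: cn=changelog5,cn=config",            # assumed config entry
        "changetype: modify",
        "replace: nsslapd-changelogmaxentries",   # assumed attribute name
        f"nsslapd-changelogmaxentries: {max_entries}",
        "",
    ])


# Temporary limit of 1, as in #c3 and #c4.
print(changelog_limit_ldif(1))
# Back to no limit once the errors are gone (0 is assumed to mean unlimited).
print(changelog_limit_ldif(0))
```

Each record can be saved to a file and applied with ldapmodify while bound as the directory manager. Keep the window with the limit set to 1 as short as possible, for the reason given in comment 5.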

> 
> Regards and thanks in advance.

Comment 20 Thomas 2010-06-12 15:19:51 UTC

It occurred on Red Hat DS 7.1 SP3. Is it fixed in Red Hat DS 8.1?

Comment 21 Rich Megginson 2010-06-14 13:38:04 UTC

(In reply to comment #20)
> It occured on RedHat DS 7.1 SP3. Does it be fixed on RedHat DS 8.1?    

Yes.  This is fixed in Red Hat DS 8.1
