Red Hat Bugzilla – Bug 388021
MMR breaks from master that has been reinited
Last modified: 2015-12-07 11:46:13 EST
If you have a master that has received and sent updates to other masters, then
you reinit that master, that master will no longer be able to send updates. You
will see errors like the following in that master's error log:
[14/Nov/2007:15:42:22 +0100] agmt="cn=master-to-other-replica" (master:389) -
Can't locate CSN 6639d5a5000000010000 in the changelog (DB rc=-30989). The
consumer may need to be reinitialized.
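For anyone unfamiliar with the CSN in that message, here is a small sketch that splits it into its fields. It assumes the usual Directory Server CSN layout of 20 hex digits (8 for the timestamp, 4 for the sequence number, 4 for the replica ID, 4 for the sub-sequence number). The CSN class and the parse_csn function are names made up for this illustration, not part of the server.

```python
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the epoch
    seqnum: int      # sequence number within that second
    replica_id: int  # ID of the replica where the change originated
    subseqnum: int   # sub-sequence number


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields (assumed layout)."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(text)}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


csn = parse_csn("6639d5a5000000010000")
print(csn.replica_id)
```

Under that assumed layout, the CSN in the log above originated at replica ID 1.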
The problem is that after being reinitialized, the RUV for this master contains
a CSN that does not exist in the changelog. When the master attempts to
position the changelog db cursor, it cannot find this record, so the cursor is
invalid, and no changes can be sent.
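A toy model of that failure, not the server's actual code: the changelog is represented as a sorted list of CSN strings, and position_cursor and CSNNotFound are hypothetical names. Every CSN other than the one from the log above is invented for the example.

```python
import bisect


class CSNNotFound(LookupError):
    """Raised when the requested CSN has no record in the changelog."""


def position_cursor(changelog, start_csn):
    """Return the index of start_csn in a sorted list of CSN strings."""
    i = bisect.bisect_left(changelog, start_csn)
    if i == len(changelog) or changelog[i] != start_csn:
        raise CSNNotFound(f"Can't locate CSN {start_csn} in the changelog")
    return i


# Before the reinit, the changelog holds the record the RUV points at.
old_changelog = ["6639d5a5000000010000", "6639d5a6000000010000"]
ruv_csn = "6639d5a5000000010000"
print(position_cursor(old_changelog, ruv_csn))

# After the reinit, the changelog only holds changes made since then,
# but the RUV still names the old CSN, so the lookup fails.
new_changelog = ["6639d600000000010000"]
try:
    position_cursor(new_changelog, ruv_csn)
except CSNNotFound as err:
    print(err)
```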
A database export/import on a good master followed by a reinit of the other
masters will clear up this problem. But make sure you have no pending changes
Created attachment 263531
I think I found a workaround: instead of deleting the changelog, I changed
the max records in it. After sending 2 or 3 updates from a "good" server, the
errors disappeared and all updates worked well.
The value I set in Max changelog records was "1".
Can you reset the max changelog records back to the default? If you use "1",
you may cause replicas to get out of sync with this one and require reinit.
Created attachment 263701
cvs commit log
Reviewed by: nkinder (Thanks!)
Fix Description: This problem occurs when you have two or more masters, and you
have updates that have originated at a master that have been sent to other
masters (so that the other masters have a valid min/max csn for that replica in
the ruv). If that master needs to be reinitialized for some reason (crash,
etc.) the reinit will erase the changelog. The RUV for that master will now
contain CSNs that are not in the changelog. If that master attempts to update
another master, it will first look at the RUV from the consumer, which will
contain the old CSNs, and it will look for those CSNs in the changelog, fail,
and abort the update process, meaning this master can no longer send updates to
The solution is for the master to just use the min CSN in its own RUV as the
new starting point, if it has not been purged. In the case of purging, if the
CSN is not found, this means the consumer is too far behind and must be
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no
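A rough sketch of the starting-point selection described in the fix description above, as I read it. This is an illustration only, not the committed patch: the changelog is modeled as a set of CSN strings, "purged" is modeled as the master's own min CSN being absent from that set, and choose_start_csn and all CSN values are hypothetical.

```python
from typing import Optional, Set


def choose_start_csn(changelog: Set[str], consumer_csn: str,
                     own_min_csn: str) -> Optional[str]:
    """Pick the CSN from which to start sending changes, or None."""
    if consumer_csn in changelog:
        # Normal case: the CSN from the consumer's RUV is still in the changelog.
        return consumer_csn
    if own_min_csn in changelog:
        # After a reinit the consumer's CSN is gone, so fall back to the
        # min CSN in this master's own RUV, as long as it was not purged.
        return own_min_csn
    # Neither CSN is available: no usable starting point exists.
    return None


changelog = {"6639d600000000010000", "6639d601000000010000"}
print(choose_start_csn(changelog, "6639d5a5000000010000", "6639d600000000010000"))
print(choose_start_csn(changelog, "6639d5a5000000010000", "6639d5ff000000010000"))
```

The first call falls back to the master's own min CSN; the second finds neither CSN and returns None, which corresponds to the case where the consumer is too far behind.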
(In reply to comment #5)
> Can you reset the max changelog records back to the default? If you use "1",
> you may cause replicas to get out of sync with this one and require reinit.
Sure. I did that right after I saw everything working. It still works fine.
I was wondering if you have some scripts to reproduce this bug. Once in a while,
I come across this issue, but I can not reproduce it.
Also, do you see any issues with back-porting this fix to 1.0.4?
(In reply to comment #9)
> I was wondering if you have some scripts to reproduce this bug. Once in a while,
> I come across this issue, but I can not reproduce it.
I don't think we have any script specifically for this test. We used some
scripts to create masters, different scripts to create ldif files, different
scripts to add entries, and different scripts to reinit the master. These
scripts are unfortunately not open source yet, but we are working on it.
Here are the basic steps:
- Set up 3 instances of slapd (M1, M2, M3). They replicated like this:
M1 --- M2
- Initialized M1 with 100,000 entries generated using dbgen.pl
- Initialized M2 from M1. Initialized M3 from M2.
- Used LDCLT to add 10000 different entries to M1.
- Used LDCLT to add 10000 different entries to M2.
- Used LDCLT to add 10000 different entries to M3.
- Re-initialized M2 from M1.
- Used LDCLT to add 10000 more entries to M2.
No errors were seen. dbgen.pl and ldclt are included with the Fedora DS software.
> Also, do you see any issues with back-porting this fix to 1.0.4?
We don't have any plans to make any patch release RPMs of the 1.0.x line. All
new development work is focused on Fedora DS 1.1.x.
*** Bug 436695 has been marked as a duplicate of this bug. ***
I have exactly the same problem described above. I have 4 servers in multimaster mode. These are the package versions, on a CentOS 5 installation:
# rpm -qa | grep -i fedora
# uname -a
Linux XXXXXXXXXXXXXXX 2.6.18-8.el5PAE #1 SMP Thu Mar 15 20:29:51 EDT 2007 i686 i686 i386 GNU/Linux
Which version of FDS has this bug fixed? Could it be dangerous to apply the solution described in comments #c3 and #c4? What is the best value for max changelog records (I currently have it set to unlimited)?
Regards and thanks in advance.
(In reply to comment #18)
> I have exactly the same problem described above. I have 4 servers in
> multimaster mode. The version of the packages are these, on a Centos 5
> # rpm -qa | grep -i fedora
> # uname -a
> Linux XXXXXXXXXXXXXXX 2.6.18-8.el5PAE #1 SMP Thu Mar 15 20:29:51 EDT 2007 i686
> i686 i386 GNU/Linux
> Which version of FDS has this bug fixed?
The fix is in fedora-ds-base-1.2.0
> Could be dangerous to apply the
> solution described in comments #c3 and #c4?
I'm not sure - the original poster reported success.
> What should be the best value for
> max changelog records (now i have set it to unlimited)?
If you use the method in #c3 and #c4, set the max changelog records to 1, verify that the error is gone after a couple of updates, then set it back to unlimited.
> Regards and thanks in advance.
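As a sketch of what that workaround might look like as LDAP modify operations, the snippet below only builds LDIF text and does not contact a server. It assumes the changelog settings live in the cn=changelog5,cn=config entry, that the "max changelog records" setting corresponds to the nsslapd-changelogmaxentries attribute, and that 0 means unlimited; check these against the documentation for your version before using them. The function name is made up.

```python
def max_entries_ldif(value: int) -> str:
    """Build an LDIF modify that sets the max changelog records (assumed names)."""
    return (
        "dn: cn=changelog5,cn=config\n"
        "changetype: modify\n"
        "replace: nsslapd-changelogmaxentries\n"
        f"nsslapd-changelogmaxentries: {value}\n"
    )


# Step 1: temporarily limit the changelog to a single record.
print(max_entries_ldif(1))
# Step 2, after verifying the error is gone: back to unlimited (assumed to be 0).
print(max_entries_ldif(0))
```

Each printed block could be saved to a file and applied with an LDAP modify tool. As noted earlier in this bug, leaving the value at 1 may cause replicas to get out of sync, so the second step should follow promptly.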
It occurred on Red Hat DS 7.1 SP3. Is it fixed in Red Hat DS 8.1?
(In reply to comment #20)
> It occurred on Red Hat DS 7.1 SP3. Is it fixed in Red Hat DS 8.1?
Yes. This is fixed in Red Hat DS 8.1.