Bug 1309963 - keep alive entries can break replication
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.3
Hardware: All Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Noriko Hosoi
QA Contact: Viktor Ashirov
Keywords: ZStream
Depends On: 1307151
Blocks:
Reported: 2016-02-19 01:05 EST by Jan Kurik
Modified: 2016-03-31 18:04 EDT
CC List: 8 users

See Also:
Fixed In Version: 389-ds-base-1.3.4.0-27.el7
Doc Type: Bug Fix
Doc Text:
Previously, a keep alive entry was being created at too many opportunities during replication, potentially causing a race condition when adding the entry to the replica changelog and resulting in operations being dropped from the replication. With this update, unnecessary keep alive entry creation has been eliminated, and missing replication no longer occurs.
Story Points: ---
Clone Of: 1307151
Environment:
Last Closed: 2016-03-31 18:04:39 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Jan Kurik 2016-02-19 01:05:08 EST
This bug has been copied from bug #1307151 and has been proposed
to be backported to 7.2 z-stream (EUS).
Comment 4 Sankar Ramalingam 2016-03-01 08:48:56 EST
Hi Noriko, to verify this bug, I understand we need to set up replication with a "cn=repl keep alive" entry under the suffix. However, it's not clear how the setup should look, so could you please provide the steps to reproduce this?
Comment 5 Noriko Hosoi 2016-03-01 12:10:36 EST
Hi Thierry, 

You put a reproducer in this comment which requires IPA:
https://fedorahosted.org/389/ticket/48445#comment:1

Could there be easy steps that reproduce this scenario?  It seems a total update triggered this bug.  Is just repeating the total updates and checking that no data loss occurred good enough?

Or is it safer to run the IPA install and check?
Comment 6 thierry bordaz 2016-03-01 12:29:20 EST
Hi Noriko,

The bug silently cleared (from the CL) the ADD of the replica keep alive entry.
This ADD occurred (on the consumer side) at the end of the total init.
It was not detected until the replica needed to update the keep alive entry. Those updates hit err=32 when they were replicated.
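
One rough way to spot this symptom is to count err=32 (no such object) results in the access logs of the servers receiving the updates. The helper below is only a sketch: the function name is made up, and it counts every err=32 result, not only those caused by replicated keep alive updates, so a non-zero count is a hint to look closer rather than proof of the bug.

```shell
# Hypothetical helper: count RESULT lines with err=32 in one or more
# access log files passed as arguments.
count_err32() {
    # grep -c prints the number of matching lines (0 if there are none)
    cat -- "$@" | grep -c 'RESULT err=32 '
}
```

It could be run against the access logs next to the errors logs, for example count_err32 /var/log/dirsrv/slapd-*/access (the instance names depend on the setup).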

So the reproducer steps are:

  - Create a Master and a Replica

  - Configure the Master with fractional replication so that it skips sending updates on 'description'

  - Do a total init of the replica
    (with the bug fix, it does not create the keep alive entry)

  - On the Master, loop doing more than 100 updates of the 'description' attribute (skipped by the replica agreement)

  - Check that the keep alive entry (ADD) is created and sent to the replica

  - Check that the update of the keep alive entry is replicated
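
The last two checks could be scripted roughly as follows. This is a sketch under assumptions: the helper names are made up, the host, port, bind DN and the BIND_PW variable are placeholders for your own setup, and the DN pattern "cn=repl keep alive <replica id>,<suffix>" is the one seen in the verification comments of this bug.

```shell
# Hypothetical helpers; nothing here is part of 389-ds-base itself.

# Build the keep alive entry DN.
#   $1 = replica ID, $2 = replicated suffix
keep_alive_dn() {
    printf 'cn=repl keep alive %s,%s\n' "$1" "$2"
}

# Succeed if the keep alive entry exists on the server listening on a port.
#   $1 = port, $2 = replica ID, $3 = suffix
# BIND_PW must hold the Directory Manager password (placeholder variable).
has_keep_alive() {
    ldapsearch -x -h localhost -p "$1" -D "cn=Directory Manager" -w "$BIND_PW" \
        -b "$(keep_alive_dn "$2" "$3")" -s base 1.1 2>/dev/null |
        grep -qi '^dn: cn=repl keep alive'
}
```

For example, has_keep_alive 1289 2211 "dc=passsync,dc=com" && echo "entry present" would check one replica; comparing the entry's modifyTimestamp before and after the loop of updates is one possible way to confirm that the later update was replicated too.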
Comment 8 Sankar Ramalingam 2016-03-07 01:04:43 EST
1. Created a 6-master replication setup.
2. Created a few users and groups on M1.
3. Total init completed from M1. No fractional replication configured.
4. Configured fractional replication for the telePhoneNumber attribute.
5. Ran a total init to check whether telePhoneNumber is removed from the masters and consumers other than M1.
6. It was successful: the telePhoneNumber attribute was removed on all masters except M1.
7. Ran about 100 modify operations on the telePhoneNumber attribute on M1.

[root@vm-idm-004 ~]# no=1 ; while [ $no -lt 101 ]; do ldapmodify -x -p 1189 -h localhost -D "cn=Directory Manager" -w Secret123 << EOF > /dev/null
dn: uid=users1189users7,ou=People,dc=passsync,dc=com
replace: telephoneNumber
telephoneNumber: 999999$no
EOF
 sleep 0.3; no=`expr $no + 1`; done

8. Checked whether telePhoneNumber attribute values exist on the other masters.

[root@vm-idm-004 ~]# for PORT in `echo "1189 1289 1389 1489 2189 2289"`; do ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -b "uid=users1189users7,ou=People,dc=passsync,dc=com" |grep -i tele ; RC=$? ; if [ $RC -eq 0 ]; then echo "telePhoneNumber fractional attribute only on PORT-$PORT"; fi ; done
telephoneNumber: 999999100
telePhoneNumber fractional attribute only on PORT-1189

9. Checking "cn=repl keep alive" entry on all servers.
[root@vm-idm-004 ~]# grep -li "cn=repl keep" /var/log/dirsrv/slapd-*/errors
/var/log/dirsrv/slapd-M1/errors
/var/log/dirsrv/slapd-M4/errors
/var/log/dirsrv/slapd-M5/errors

I will redo the testing with fractional replication configured from the start and update the bug with my findings.
Comment 9 Sankar Ramalingam 2016-03-08 09:44:44 EST
This time on a fresh setup.

1. Created a 6-master replication setup with fractional replication for the description and telephoneNumber attributes.
2. Created a few users and groups on M1.
3. Total init completed from M1.
4. Checked whether "cn=repl keep alive $replica_id,dc=passsync,dc=com" is created.
5. "cn=repl keep alive 2211,dc=passsync,dc=com" is created on all masters.


for PORT in `echo "1189 1289 1389 1489 2189 2289 3189 3289"`; do ldapsearch -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 -b "cn=repl keep alive 2211,dc=passsync,dc=com" -s base |grep -i "dn: cn=repl keep alive 2211" ; RC=$? ; if [ $RC -eq 0 ]; then echo "cn=repl keep alive entry created on PORT-$PORT"; fi ; done
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-1189
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-1289
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-1389
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-1489
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-2189
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-2289
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-3189
dn: cn=repl keep alive 2211,dc=passsync,dc=com
cn=repl keep alive entry created on PORT-3289


Note: I haven't yet run ldapmodify for the fractional attributes on M1. As per your comment, the entry "cn=repl keep alive 2211,dc=passsync,dc=com" should be created only after the ldapmodify operations. I would appreciate it if you could give us more instructions to proceed with the bug verification.

Thanks.
Comment 10 thierry bordaz 2016-03-08 10:15:47 EST
Hi Sankar,

The keep alive entry is created in two cases:

* If a supplier does a total init of a consumer, it first checks that its own keep alive entry exists.

* If any instance needs to update its keep alive entry (for example, it skips more than 100 updates), then if the keep alive entry does not exist it creates it first.

In your case, the keep alive entry of M1 is created (although no updates were done on M1) because M1 did a total update of the other replicas.
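
The two cases can be written down as a small decision function. This is a toy model for discussion only, not the server's code: the function name and the event labels are made up, and the threshold of 100 skipped updates is taken from the second case above.

```shell
# Toy model of the two creation cases; NOT the 389-ds-base implementation.
# Usage: keep_alive_action EVENT ENTRY_EXISTS [SKIPPED_UPDATES]
#   EVENT            total_init | update   (hypothetical labels)
#   ENTRY_EXISTS     yes | no              (does the instance's own entry exist?)
#   SKIPPED_UPDATES  updates skipped by the agreement (default 0)
# Prints "create" if the instance would first ADD its own keep alive entry,
# "none" otherwise.
keep_alive_action() {
    event=$1
    exists=$2
    skipped=${3:-0}

    # An existing entry never needs to be created again.
    if [ "$exists" = "yes" ]; then
        echo none
        return 0
    fi

    case $event in
        total_init)
            # Case 1: a supplier starting a total init makes sure its own entry exists.
            echo create ;;
        update)
            # Case 2: an update of the entry is only needed after more than 100 skipped updates.
            if [ "$skipped" -gt 100 ]; then echo create; else echo none; fi ;;
        *)
            echo none ;;
    esac
}
```

With this model, keep_alive_action total_init no prints "create", which matches M1's entry showing up right after the total init even though M1 received no updates.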

thanks
thierry
Comment 11 Sankar Ramalingam 2016-03-09 09:16:47 EST
Thanks, Thierry, for the answers :). So, as per your comment #10, the feature seems to be working. Anyway, I modified the fractional replication attributes more than 100 times and checked the cn=repl keep alive entry and the replica states. Everything seems to be working fine. Hence, marking the bug as Verified.

[root@vm-idm-004 ~]# rpm -qa |grep -i 389-ds
389-ds-base-debuginfo-1.3.4.0-27.el7_2.x86_64
389-ds-base-libs-1.3.4.0-27.el7_2.x86_64
389-ds-base-devel-1.3.4.0-27.el7_2.x86_64
389-ds-base-1.3.4.0-27.el7_2.x86_64
Comment 13 errata-xmlrpc 2016-03-31 18:04:39 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0550.html
