Bug 1183655 - IPA replica missing data after master upgraded
Summary: IPA replica missing data after master upgraded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.1
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: thierry bordaz
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks: 1176995
TreeView+ depends on / blocked
 
Reported: 2015-01-19 11:57 UTC by Martin Kosek
Modified: 2015-03-05 09:40 UTC (History)
12 users (show)

Fixed In Version: 389-ds-base-1.3.3.1-13.el7
Doc Type: Bug Fix
Doc Text:
Clone Of: 1176995
Environment:
Last Closed: 2015-03-05 09:40:25 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0416 normal SHIPPED_LIVE Important: 389-ds-base security, bug fix, and enhancement update 2015-03-05 14:26:33 UTC

Comment 1 Martin Kosek 2015-01-19 11:58:42 UTC
This is the DS part of the IPA fix (https://fedorahosted.org/freeipa/ticket/4833). DS upstream ticket:

https://fedorahosted.org/389/ticket/47988

Comment 4 Noriko Hosoi 2015-01-19 19:39:08 UTC
Thanks, Scott!

Comment 5 Scott Poore 2015-01-21 19:15:57 UTC
How can I test this one?  Is there a log message I should see that indicates the schema replication problem?

Would this be an indicator here of the problem on the server with the newer version?

# grep "unable to replicate schema" /var/log/dirsrv/slapd-TESTRELM-TEST/errors
....
[21/Jan/2015:12:14:01 -0600] NSMMReplicationPlugin - agmt="cn=meTovm2.testrelm.test" (vm2:389): Warning: unable to replicate schema: rc=2


Thanks,
Scott

Comment 6 thierry bordaz 2015-01-21 20:09:26 UTC
Hi Scott,

Yes having 'unable to replicate schema' is showing a problem.
It can be transient error, but if regularly you are seeing this error that mean the supplier is not able to learn/build a superset schema and so does not want to send its schema.

Fixing ipatokenTOTP could hide the problem of replication of the schema.
This problem is dynamic and depends on which Replica Agreement (master->replica or replica->master) is starting the replication session first.

thanks
thierry

Comment 7 Scott Poore 2015-01-22 14:16:18 UTC
Thierry,

Is there a clear entry in the log indicating which starts the replication?

And does the issue occur if the server running the older version starts the replication session first?

Thanks,
Scott

Comment 8 thierry bordaz 2015-01-22 15:01:47 UTC

Here is the test case I am using to reproduce:

	Step 1
	------

	Create MMR with 2 suppliers M1 and M2
	(enable on both replication error logging: nsslapd-errorlog-level: 8192)
	(enable on both internal operation logging: nsslapd-accesslog-level: 260)

	Check update on M1 are replicated to M2 and the opposite

	Stop M1 and M2. 

	Step 2
	------

	Update the schema of M1 with the IPA 3.3 schema files
	Update the schema of M2 with the IPA 4.1 schema files
	(Checks that ipatokenTOTP contains only MAY (https://bugzilla.redhat.com/show_bug.cgi?id=1176995))

	Check that the schema files (especially 99user.ldif) do not contain 'nsSchemaCSN' attribute.

	Start M1 and M2

	#
	# at this point the server will not try to replicate the schema because there is no 'nsSchemaCSN' values
	# Regarding the updates both servers should be in sync
	#

	Step 3
	------

	Disable replica agreement M2 to M1 (nsds5ReplicaEnabled: off)

	#
	# At this point M2 (that contains a superset schema) will not update schema on M1
	#

	Do a dummy update of the schema on M1. For example
ldapmodify -h ... <<EOF
dn: cn=schema
changetype: modify
add: objectclasses
objectclasses: ( scott-oid NAME 'scott' DESC 'test for 1183655'
 STRUCTURAL MUST ( cn ) MAY sn X-ORIGIN ( 'user defined' ) )
EOF

	#
	# At this poing nsSchemaCSN on M1 > M2
	#

	Step 4
	------

	Do a dummy update on M1 on any entry

	#
	# This will trigger a replication session M1->M2, that start with a comparison of the schema
	# As Schema of M2 > M1, M1 will learn/extend the extra definition
	#
	In M1 errors logs you will see:
NSMMReplicationPlugin - [S] Schema agmt="M1->M2" must not be overwritten (set replication log for additional info)
NSMMReplicationPlugin - agmt="M1->M2" : Warning: unable to replicate schema: rc=1

	In M1 access logs you will see (at the same timestamp)
conn=Internal op=-1 MOD dn="cn=schema"
conn=Internal op=-1 RESULT err=20 tag=48 nentries=0 etime=0
conn=Internal op=-1 MOD dn="cn=schema"
conn=Internal op=-1 RESULT err=20 tag=48 nentries=0 etime=1


	This last error (err=20) has two consequences
		- M1 is not able to learn some schema definitions.
		- M1 nsschemaCSN will increase its nsSchemaCSN, preventing M2 to send its schema 
		  in case M2->M1 is re-enabled


	To reproduce you need the following conditions:
		- The supplier having the older schema must has the higher nsSchemaCSN
		- The supplier having the older schema must start a replication session before
		  the other supplier
		- The newer schema must contains extended definitions (not only new definitions)




I hope I do not miss any steps and it is clear enough. Please do not hesitate to ping me in any doubt.

Comment 9 thierry bordaz 2015-01-23 15:13:31 UTC
I made a fix that is currently under review (upstream ticket https://fedorahosted.org/389/ticket/47988).

Comment 13 thierry bordaz 2015-01-27 21:06:53 UTC
Hello Scott, 

So frustrating..

Please would you also provide the access logs from 7.0 and 7.1.

Looking at errors logs from 7.1 master, we can see that the schema learning phase was successful but then when the master decided to send (push) the schema the connection was closed:

[27/Jan/2015:14:54:49 -0500] NSMMReplicationPlugin - agmt="cn=meTocloud-qe-15.testrelm.test" (cloud-qe-15:389): Replica was successfully acquired.
[27/Jan/2015:14:54:49 -0500] NSMMReplicationPlugin - agmt="cn=meTocloud-qe-15.testrelm.test" (cloud-qe-15:389): State: backoff -> sending_updates


        <<< Schema checking >>>
[27/Jan/2015:14:54:49 -0500] NSMMReplicationPlugin - [S] Checking consumer schema localcsn:54c7ed07000000000000 / remotecsn:54c7ea11000000000000


[27/Jan/2015:14:54:50 -0500] NSMMReplicationPlugin - [S] Reread remotecsn:54c7ea11000000000000

       <<< master ready to send its schema >>>
[27/Jan/2015:14:54:50 -0500] NSMMReplicationPlugin - Schema checking successful: ok to push the schema (agmt="cn=meTocloud-qe-15.testrelm.test" (cloud-qe-15:389))

[27/Jan/2015:14:54:50 -0500] NSMMReplicationPlugin - agmt="cn=meTocloud-qe-15.testrelm.test" (cloud-qe-15:389): Disconnected from the consumer
[27/Jan/2015:14:54:51 -0500] NSMMReplicationPlugin - agmt="cn=meTocloud-qe-15.testrelm.test" (cloud-qe-15:389): Warning: unable to replicate schema: rc=2


As the time between the replica acquire and connection close is only 2sec, and that same closure seems to occur again and again, I do not think it is related to timeout.

Comment 14 thierry bordaz 2015-01-27 21:14:56 UTC

Hello Scott,


Looking at replica logs, I see

[27/Jan/2015:14:54:50 -0500] - sasl_io_recv failed to decode packet for connection 127

The schema is a quite big entry and possibly hit the size limit of a sasl PDU.
Just to clarify it is not a problem of tuning of sasl size, would you please increase it and try again...

thanks
thierry

Comment 15 Scott Poore 2015-01-27 21:31:21 UTC
Thierry,

I changed the tests to fix a few things to help here:

- set access log level to 260 as you described above.
- upped the nsslapd-sasl-max-buffer-size to 2621440
- fixed upgrade to make sure we also upgrade the 389-ds-base-debuginfo rpm

So, I'm running a new test now with these updates and I'll see what happens.

Thanks,
Scott

Comment 16 Scott Poore 2015-01-28 02:30:07 UTC
I've got to run my tests a few more times to be sure but, it looks like increasing the sasl max buffer size may have fixed my latest issue.

I'll kick off some tests tonight and check them in the morning.

Thanks,
Scott

Comment 17 Scott Poore 2015-01-28 15:11:18 UTC
ok, I saw some other strange behavior on one of my runs when I had set it that high.

I'm seeing this for several different ipa commands:

ipa: ERROR: limits exceeded for this query

I'm guessing setting sasl max buffer too high can have an adverse affect on a server if running without enough memory?

I tuned it down to 1048576 to see if that still occurs and if that still helps the replication.

I'll keep you posted.

Comment 18 Scott Poore 2015-01-29 01:49:11 UTC
ok, I think the limits error was related to the higher buffer size.  I haven't seen it again since reducing this.  

I think I am close to verifying this.

Comment 19 Sankar Ramalingam 2015-01-30 09:21:53 UTC
1. Created 2 way MMR setup. M1 - 1189; M2 - 1289
2. Replication works fine. Creation/Deletion from M1->M2 and vice versa works.
3. Copied Schema files from RHEL7.1 IPA master to M1. Works just fine.
    - Removed the complete set of schema files from /etc/dirsrv/slapd-M1/schema/ and copied from RHEL7.1 IPA master.
4. Copied Schema files from RHEL7.0 IPA master to M2. It doesn't work by default. It throws error as 

[30/Jan/2015:03:12:42 -0500] - Entry "cn=encryption,cn=config" -- attribute "sslversionmin" not allowed
[30/Jan/2015:03:12:42 -0500] - Entry "cn=replSchema,cn=config" has unknown object class "nsSchemaPolicy"
[30/Jan/2015:03:12:42 -0500] - Entry "cn=consumerUpdatePolicy,cn=replSchema,cn=config" has unknown object class "nsSchemaPolicy"
[30/Jan/2015:03:12:42 -0500] - Entry "cn=supplierUpdatePolicy,cn=replSchema,cn=config" has unknown object class "nsSchemaPolicy"

5. Hence, I copied the original 01core389.ldif file from RHEL7.1 backed up schema folder. Then, the server started working.
6. Then, checked whether IPA 3.3 schema files contain nsSchemaCSN attribute. Its not present in IPA 3.3 schema.
7. Checked the presence of the same attribute in RHEL7.1. Its present in IPA 4.1 schema.
8. Successfully restarted/started servers. Both the servers are in sync also.
9. Disabled replication on M1 and a dummy update of schema as mentioned in step #3 in comment #8
10. Updated CN attribute for one of the user in M2. Dummy update of user in M2 as per comment #8.



[30/Jan/2015:04:01:29 -0500] NSMMReplicationPlugin - agmt="cn=1289_to_1389_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1389): Incremental update failed and requires administrator action
[30/Jan/2015:04:01:29 -0500] NSMMReplicationPlugin - agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189): Incremental update failed and requires administrator action
[30/Jan/2015:04:18:49 -0500] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189)".
[30/Jan/2015:04:18:55 -0500] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189)". Sent 110 entries.
[30/Jan/2015:04:18:57 -0500] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189)".
[30/Jan/2015:04:19:02 -0500] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189)". Sent 110 entries.
[30/Jan/2015:04:19:03 -0500] NSMMReplicationPlugin - changelog program - _cl5AddThread: invalid changelog state - 2
[30/Jan/2015:04:19:03 -0500] NSMMReplicationPlugin - agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189): Changelog database was in an incorrect state
[30/Jan/2015:04:19:03 -0500] NSMMReplicationPlugin - agmt="cn=1289_to_1189_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1189): Incremental update failed and requires administrator action



Incremental update failed from M2 -> M1. Where as the full update succeeds from M2 -> M1.

Comment 20 thierry bordaz 2015-01-30 19:59:29 UTC
Sankar,

I think you updated(copy) the schema files being root. so the files are not readable. Then the behavior is very strange, checking of the schema is done on invalid definitions.

From replication point of view. The replica agreement exist on both side. But M1 database does not contain RUV, as well as M2. This although I did an update on M2.
This is a really weird situation.

Would you try to reproduce especially making sure replication is working before updating the schema files. Then enabling replicatin logging level. Then Changing the schema files with the appropriate rights.

Comment 21 thierry bordaz 2015-02-02 09:49:59 UTC
(In reply to Scott Poore from comment #17)
> ok, I saw some other strange behavior on one of my runs when I had set it
> that high.
> 
> I'm seeing this for several different ipa commands:
> 
> ipa: ERROR: limits exceeded for this query
> 
> I'm guessing setting sasl max buffer too high can have an adverse affect on
> a server if running without enough memory?
> 
> I tuned it down to 1048576 to see if that still occurs and if that still
> helps the replication.
> 
> I'll keep you posted.

Scott,

In an other thread I also get this ipa error: 'limits exceeded for the query'.
This was related to ipa request being timeout. This can be check looking for 'err=3' in DS access logs.
If it is the case, it could/should be avoided with modification of ipaSearchtimelimit in ipaconfig.
In an extrem limit, it also happened while retrieving the ipaconfig entry. In that case, it needs to be fixed in ldap2.py (find_entry).

thanks
thierry

Comment 22 Scott Poore 2015-02-02 16:43:39 UTC
(In reply to thierry bordaz from comment #21)
...
> 
> In an other thread I also get this ipa error: 'limits exceeded for the
> query'.
> This was related to ipa request being timeout. This can be check looking for
> 'err=3' in DS access logs.
> If it is the case, it could/should be avoided with modification of
> ipaSearchtimelimit in ipaconfig.

So, are you saying this is a tuning change that IPA needs to make?  Should I open a separate bug for that?

> In an extrem limit, it also happened while retrieving the ipaconfig entry.
> In that case, it needs to be fixed in ldap2.py (find_entry).

Sounds like this one is a bug?  Or are we simply hitting extreme cases where tuning is necessary?

Comment 23 thierry bordaz 2015-02-02 17:06:27 UTC
Scott,

We should first confirm if there is timeout (err=3), then testing where it needs to be fixed.


thnaks
thierry

Comment 24 Sankar Ramalingam 2015-02-03 11:13:02 UTC
(In reply to thierry bordaz from comment #20)
> Sankar,
> 
> I think you updated(copy) the schema files being root. so the files are not
> readable. Then the behavior is very strange, checking of the schema is done
> on invalid definitions.
> 
> From replication point of view. The replica agreement exist on both side.
> But M1 database does not contain RUV, as well as M2. This although I did an
> update on M2.
> This is a really weird situation.

There was an issue with the replication setup. I fixed that and recreated the setup.

Things are working fine. I see this in the error logs.

==> /var/log/dirsrv/slapd-M2/errors <==
[03/Feb/2015:05:52:27 -0500] NSMMReplicationPlugin - [S] Schema agmt="cn=1289_to_1489_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1489) must not be overwritten (set replication log for additional info)
[03/Feb/2015:05:52:28 -0500] NSMMReplicationPlugin - agmt="cn=1289_to_1489_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1489): Warning: unable to replicate schema: rc=1
[03/Feb/2015:05:52:28 -0500] NSMMReplicationPlugin - [S] Schema agmt="cn=1289_to_1389_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1389) must not be overwritten (set replication log for additional info)
[03/Feb/2015:05:52:28 -0500] NSMMReplicationPlugin - agmt="cn=1289_to_1389_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1389): Warning: unable to replicate schema: rc=1

==> /var/log/dirsrv/slapd-M1/errors <==
[03/Feb/2015:05:46:05 -0500] NSMMReplicationPlugin - [S] Schema agmt="cn=1189_to_1389_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1389) must not be overwritten (set replication log for additional info)
[03/Feb/2015:05:46:05 -0500] NSMMReplicationPlugin - [S] Schema agmt="cn=1189_to_1489_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1489) must not be overwritten (set replication log for additional info)
[03/Feb/2015:05:46:05 -0500] NSMMReplicationPlugin - agmt="cn=1189_to_1389_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1389): Warning: unable to replicate schema: rc=1
[03/Feb/2015:05:46:05 -0500] NSMMReplicationPlugin - agmt="cn=1189_to_1489_on_intel-s3eb1-03.lab.bos.redhat.com" (intel-s3eb1-03:1489): Warning: unable to replicate schema: rc=1


However, I don't see any errors in the access log as "err=20". I guess, this shows that the schema definition is learnt by M2.

The replication from M2 -> M1 and vice versa works fine once I re-enabled the replication from M1 to M2.

Can this be marked as Verified?
> 
> Would you try to reproduce especially making sure replication is working
> before updating the schema files. Then enabling replicatin logging level.
> Then Changing the schema files with the appropriate rights.

Comment 25 thierry bordaz 2015-02-03 13:07:21 UTC
Hello Sankar,

You deployment hits a different issue than this bug. 
the definitions of 'automount' and 'automountMap' would need a MERGE

This is something that schema comparison/learning can not handle.

So the schema are replicated M1 <-> M2 but are not pushed on C1 and C2

/etc/dirsrv/slapd-M1/schema/99user.ldif

objectClasses: ( 1.3.6.1.1.1.2.17 NAME 'automount' DESC 'Automount information
 ' SUP top STRUCTURAL MUST ( automountKey $ automountInformation ) MAY descrip
 tion X-ORIGIN ( 'RFC 2307bis' 'user defined' ) )
objectClasses: ( 1.3.6.1.1.1.2.16 NAME 'automountMap' DESC 'Automount Map info
 rmation' SUP top STRUCTURAL MUST automountMapName MAY description X-ORIGIN (
 'RFC 2307bis' 'user defined' ) )

/etc/dirsrv/slapd-M2/schema/99user.ldif

objectClasses: ( 1.3.6.1.1.1.2.17 NAME 'automount' DESC 'Automount information
 ' SUP top STRUCTURAL MUST ( automountKey $ automountInformation ) MAY descrip
 tion X-ORIGIN ( 'RFC 2307bis' 'user defined' ) )
objectClasses: ( 1.3.6.1.1.1.2.16 NAME 'automountMap' DESC 'Automount Map info
 rmation' SUP top STRUCTURAL MUST automountMapName MAY description X-ORIGIN (
 'RFC 2307bis' 'user defined' ) )

/etc/dirsrv/slapd-C1/schema/60autofs.ldif
objectClasses: (
  1.3.6.1.1.1.2.17
  NAME 'automount'
  DESC 'An entry in an automounter map'
  SUP top
  STRUCTURAL
  MUST ( cn $ automountInformation )
  MAY ( description )
  X-ORIGIN 'draft-howard-rfc2307bis'
  )

objectClasses: (
  1.3.6.1.1.1.2.16
  NAME 'automountMap'
  DESC 'An group of related automount objects'
  SUP top
  STRUCTURAL
  MUST ( ou )
  X-ORIGIN 'draft-howard-rfc2307bis'
  )


Would you clarify what versions of schema are under each instances.

Comment 26 Sankar Ramalingam 2015-02-03 13:29:46 UTC
M1 - RHEL7.1 IPA schema
M2 - RHEL7.0 IPA, but tweaked a little. Copied 01core389.ldif back from RHEL7.1 389-ds-base schema to work around - unknown object class "nsSchemaPolicy" error.
C1 - RHEL7.1 389-ds-base schema
C2 - RHEL7.1 389-ds-base schema

Comment 27 thierry bordaz 2015-02-03 14:14:37 UTC
Shankar,

Those two objectclasses are defined in 60autofs.ldif.

On M1 and M2 these definitions are overwritten by 60basev2.ldif
On C1 and C2, 60basev2.ldif do not exist. So it keeps the 60autofs.ldif definitions.

The problem is that those definitons on these two files are very different and would require a merge if managed during replicatoin of the schema.
60basev2.ldif does not exist under /etc/dirsrv/schema, do you know where this file is coming from on M1-M2 and why it is missing in C1-C2?

Comment 28 Sankar Ramalingam 2015-02-03 15:11:35 UTC
(In reply to thierry bordaz from comment #27)
> Shankar,
> 
> Those two objectclasses are defined in 60autofs.ldif.
> 
> On M1 and M2 these definitions are overwritten by 60basev2.ldif
> On C1 and C2, 60basev2.ldif do not exist. So it keeps the 60autofs.ldif
> definitions.
> 
> The problem is that those definitons on these two files are very different
> and would require a merge if managed during replicatoin of the schema.
> 60basev2.ldif does not exist under /etc/dirsrv/schema, do you know where
> this file is coming from on M1-M2
I copied it directly to /etc/dirsrv/slapd-M1(M2)/schema directory from IPA7.1 and RHEL7.0 schema directories respectively. And I kept C1 and C2 untouched. 
I am not sure this testing is in the scope of Directory Server. 
If you think the original is fixed not any more, then we could mark the bug as Verified. If needed, we can open a new bug for the other issue.
 and why it is missing in C1-C2?

Comment 29 thierry bordaz 2015-02-03 15:48:32 UTC
Sankar,

I agree, ability to merge schema definitions is out of the scope of this bug and we could mark that bug to verified with the tests Scott already did.

Regarding your test, consumers definitions (automount, automountMap) are not compatible with supplier definitions. Making automatically (schema learning) them compatible is looking quite dangerous, for example the MUST attribute list would only keep the smallest set of common attribute. Making this silently could be risk for applications relying on the schema.

thanks
thierry

Comment 30 Scott Poore 2015-02-04 01:25:04 UTC
Thierry,

Per comment #23:

I haven't seen the limits exceeded errors again to confirm that yet.

Per comment #29:

> I agree, ability to merge schema definitions is out of the scope of this bug
> and we could mark that bug to verified with the tests Scott already did.

Which tests are you talking about?

I'm still having problems with some IPA functionality after an upgrade and I'm seeing a lot of named LDAP errors:

Feb  4 02:59:35 vm-idm-023 named[16726]: LDAP instance 'ipa' is being synchronized, please ignore message 'all zones loaded'
Feb  4 02:59:35 vm-idm-023 named[16726]: LDAP error: Critical extension is unavailable: unable to start SyncRepl session: is RFC 4533 supported by LDAP server?
Feb  4 02:59:35 vm-idm-023 named[16726]: ldap_syncrepl will reconnect in 60 seconds

Also, I just noticed that during the upgrade I see this:

Add failure missing required attribute "objectclass"

Looking in /var/log/ipaupgrade.log:

2015-02-03T20:52:12Z INFO New entry: cn=Content Synchronization,cn=plugins,cn=config
2015-02-03T20:52:12Z DEBUG ---------------------------------------------
2015-02-03T20:52:12Z DEBUG Initial value
2015-02-03T20:52:12Z DEBUG dn: cn=Content Synchronization,cn=plugins,cn=config
2015-02-03T20:52:12Z DEBUG only: set nsslapd-pluginEnabled to 'on', current value []
2015-02-03T20:52:12Z DEBUG only: updated value ['on']
2015-02-03T20:52:12Z DEBUG ---------------------------------------------
2015-02-03T20:52:12Z DEBUG Final value after applying updates
2015-02-03T20:52:12Z DEBUG dn: cn=Content Synchronization,cn=plugins,cn=config
2015-02-03T20:52:12Z DEBUG nsslapd-pluginEnabled:
2015-02-03T20:52:12Z DEBUG      on
2015-02-03T20:52:12Z ERROR Add failure missing required attribute "objectclass"
2015-02-03T20:52:12Z INFO Updating existing entry: cn=kerberos,dc=testrelm,dc=test

I'm guessing this is a problem?  But, if so, would it relate to the issues I'm still seeing (mostly with DNS/named I think).

I'm going to open another bug related to those issues and will link to it here.
d it relate to the issues I'm still seeing (mostly with DNS/named I think).

I'll add logs for what I'm seeing now.

Thanks,
Scott

Comment 33 thierry bordaz 2015-02-04 09:45:43 UTC
(In reply to Scott Poore from comment #30)
> Thierry,
> 
> Per comment #23:
> 
> I haven't seen the limits exceeded errors again to confirm that yet.
> 
> Per comment #29:
> 
> > I agree, ability to merge schema definitions is out of the scope of this bug
> > and we could mark that bug to verified with the tests Scott already did.
> 
> Which tests are you talking about?

Hi scott,
  
  This was about the tests https://bugzilla.redhat.com/show_bug.cgi?id=1183655#c28 that Sankar was doing. This is a separated test/issue from what you are doing.

> 
> I'm still having problems with some IPA functionality after an upgrade and
> I'm seeing a lot of named LDAP errors:
> 
> Feb  4 02:59:35 vm-idm-023 named[16726]: LDAP instance 'ipa' is being
> synchronized, please ignore message 'all zones loaded'
> Feb  4 02:59:35 vm-idm-023 named[16726]: LDAP error: Critical extension is
> unavailable: unable to start SyncRepl session: is RFC 4533 supported by LDAP
> server?
> Feb  4 02:59:35 vm-idm-023 named[16726]: ldap_syncrepl will reconnect in 60
> seconds
> 
> Also, I just noticed that during the upgrade I see this:
> 
> Add failure missing required attribute "objectclass"
> 
> Looking in /var/log/ipaupgrade.log:
> 
> 2015-02-03T20:52:12Z INFO New entry: cn=Content
> Synchronization,cn=plugins,cn=config
> 2015-02-03T20:52:12Z DEBUG ---------------------------------------------
> 2015-02-03T20:52:12Z DEBUG Initial value
> 2015-02-03T20:52:12Z DEBUG dn: cn=Content
> Synchronization,cn=plugins,cn=config
> 2015-02-03T20:52:12Z DEBUG only: set nsslapd-pluginEnabled to 'on', current
> value []
> 2015-02-03T20:52:12Z DEBUG only: updated value ['on']
> 2015-02-03T20:52:12Z DEBUG ---------------------------------------------
> 2015-02-03T20:52:12Z DEBUG Final value after applying updates
> 2015-02-03T20:52:12Z DEBUG dn: cn=Content
> Synchronization,cn=plugins,cn=config
> 2015-02-03T20:52:12Z DEBUG nsslapd-pluginEnabled:
> 2015-02-03T20:52:12Z DEBUG      on
> 2015-02-03T20:52:12Z ERROR Add failure missing required attribute
> "objectclass"
> 2015-02-03T20:52:12Z INFO Updating existing entry:
> cn=kerberos,dc=testrelm,dc=test
> 
> I'm guessing this is a problem?  But, if so, would it relate to the issues
> I'm still seeing (mostly with DNS/named I think).
> 
> I'm going to open another bug related to those issues and will link to it
> here.
> d it relate to the issues I'm still seeing (mostly with DNS/named I think).
> 

Both error messages (are related to the same problem). There is a failure to ADD the ReplSync config entry.
[04/Feb/2015:02:22:11 +051800] conn=12 op=65 RESULT err=0 tag=101 nentries=1 etime=0
[04/Feb/2015:02:22:11 +051800] conn=12 op=66 SRCH base="cn=Content Synchronization,cn=plugins,cn=config" scope=0 filter="(objectClass=*)" attrs="aci * attributeTypes objectClasses"
[04/Feb/2015:02:22:11 +051800] conn=12 op=66 RESULT err=32 tag=101 nentries=0 etime=0
[04/Feb/2015:02:22:11 +051800] conn=12 op=67 ADD dn="cn=Content Synchronization,cn=plugins,cn=config"
[04/Feb/2015:02:22:12 +051800] conn=12 op=67 RESULT err=65 tag=105 nentries=0 etime=1

According to the returned message, the added entry was missing the 'objectclass' attribute and so the ADD failed.

The later in ipaupgrade errors were reported related to DNS/named, because it requires ReplSync support. But as ReplSync was not configured/enabled, 389-ds was failing to decode ReplSync controls

[04/Feb/2015:02:59:34 +051800] conn=100 op=3 SRCH base="(null)" scope=2 filter="(|(objectClass=idnsConfigObject)(objectClass=idnsZone)(objectClass=idnsForwardZone)(objectClass=idnsRecord))", failed to decode LDAP controls
[04/Feb/2015:02:59:34 +051800] conn=100 op=3 RESULT err=12 tag=101 nentries=0 etime=0


So this DNS errors are a different bug (not related to schema update). This new bug seems to be that the ADD 'content synchronization' plugin entry was missing objectclass

thanks
thierry

Comment 34 Scott Poore 2015-02-04 16:59:59 UTC
Thanks.  FYI, I added bug #1189154 for the DNS ReplSync issue.

As for the tests you were referring to in comment #29, I meant which tests that I did already?  Tests I did for this bug?

So, do the current logs show this bug is fixed?

How can I confirm that for a normal IPA environment like this with two servers?  Is this what I can look for in the logs?

* errors log on master:

Schema checking successful: ok to push the schema
...
Then will see a series of messages about sending updates:
...
replay_update: Sending modify operation
replay_update: Consumer successfully sent operation with csn
...

See the session summary:
...
session end: state=5 load=1 sent=7 skipped=0 skipped_new_rid=0 skipped_csn_gt_cons_maxcsn=0 skipped_up_to_date=0 skipped_csn_gt_ruv=0 skipped_csn_covered=0
...

Then see state return to normal:

Successfully released consumer
Beginning linger on the connection
State: sending_updates -> wait_for_changes


^^^^^^^^^^^^^^
Is that right or do I have something wrong above or am missing something to confirm this is fixed?

Thanks,
Scott

Comment 35 thierry bordaz 2015-02-04 17:17:19 UTC
Hi Scott,

You are correct and did not miss anything.
The message 'ok to push' is a very good indication that the schema converge on the topology.

You may also do :ldapsearch -h -p -D "cn=DM" -w xx -b "cn=schema" nsSchemaCSN

on both servers. Eventually the nsSchemaCSN should converge to the same value

Thnaks
thierry

Comment 36 Scott Poore 2015-02-04 17:26:20 UTC
Thierry,

How long should it take for thensSchemaCSN to converge?

This has been a while since the upgrade (~20hrs) and I still see a difference:

Checking master:

[root@vm-idm-023 ~]# ldapsearch -h 10.65.206.157 -p 389 -D "cn=Directory Manager" -w Secret123 -b "cn=schema" nsSchemaCSN | grep nsSchemaCSN:
nsSchemaCSN: 54d134f1000000000000

Checking replica:

[root@vm-idm-023 ~]# ldapsearch -h 10.65.206.143 -p 389 -D "cn=Directory Manager" -w Secret123 -b "cn=schema" nsSchemaCSN | grep nsSchemaCSN:
nsSchemaCSN: 54d141e7000000000000

Is this a concern?  Could there be a different issue causing that?

Thanks,
Scott

Comment 37 Scott Poore 2015-02-05 20:13:54 UTC
(In reply to thierry bordaz from comment #21)
> (In reply to Scott Poore from comment #17)
> > ok, I saw some other strange behavior on one of my runs when I had set it
> > that high.
> > 
> > I'm seeing this for several different ipa commands:
> > 
> > ipa: ERROR: limits exceeded for this query
> > 
> > I'm guessing setting sasl max buffer too high can have an adverse affect on
> > a server if running without enough memory?
> > 
> > I tuned it down to 1048576 to see if that still occurs and if that still
> > helps the replication.
> > 
> > I'll keep you posted.
> 
> Scott,
> 
> In an other thread I also get this ipa error: 'limits exceeded for the
> query'.
> This was related to ipa request being timeout. This can be check looking for
> 'err=3' in DS access logs.
> If it is the case, it could/should be avoided with modification of
> ipaSearchtimelimit in ipaconfig.
> In an extrem limit, it also happened while retrieving the ipaconfig entry.
> In that case, it needs to be fixed in ldap2.py (find_entry).
> 
> thanks
> thierry

I think I confirmed this:  

[05/Feb/2015:22:31:05 +051800] conn=55 op=3 SRCH base="cn=ipaconfig,cn=etc,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs=ALL
[05/Feb/2015:22:31:05 +051800] conn=55 op=3 RESULT err=3 tag=101 nentries=0 etime=6

...

[05/Feb/2015:22:33:02 +051800] conn=56 op=3 SRCH base="cn=ipaconfig,cn=etc,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs=ALL
[05/Feb/2015:22:33:02 +051800] conn=56 op=3 RESULT err=3 tag=101 nentries=0 etime=114

So should I open another bug for that issue for IPA?


FYI, I'm still trying to reproduce the nsSchemaCSN issue above.

Comment 38 thierry bordaz 2015-02-06 17:29:03 UTC
(In reply to Scott Poore from comment #37)
> (In reply to thierry bordaz from comment #21)
> > (In reply to Scott Poore from comment #17)
> > > ok, I saw some other strange behavior on one of my runs when I had set it
> > > that high.
> > > 
> > > I'm seeing this for several different ipa commands:
> > > 
> > > ipa: ERROR: limits exceeded for this query
> > > 
> > > I'm guessing setting sasl max buffer too high can have an adverse affect on
> > > a server if running without enough memory?
> > > 
> > > I tuned it down to 1048576 to see if that still occurs and if that still
> > > helps the replication.
> > > 
> > > I'll keep you posted.
> > 
> > Scott,
> > 
> > In an other thread I also get this ipa error: 'limits exceeded for the
> > query'.
> > This was related to ipa request being timeout. This can be check looking for
> > 'err=3' in DS access logs.
> > If it is the case, it could/should be avoided with modification of
> > ipaSearchtimelimit in ipaconfig.
> > In an extrem limit, it also happened while retrieving the ipaconfig entry.
> > In that case, it needs to be fixed in ldap2.py (find_entry).
> > 
> > thanks
> > thierry
> 
> I think I confirmed this:  
> 
> [05/Feb/2015:22:31:05 +051800] conn=55 op=3 SRCH
> base="cn=ipaconfig,cn=etc,dc=testrelm,dc=test" scope=0
> filter="(objectClass=*)" attrs=ALL
> [05/Feb/2015:22:31:05 +051800] conn=55 op=3 RESULT err=3 tag=101 nentries=0
> etime=6
> 
> ...
> 
> [05/Feb/2015:22:33:02 +051800] conn=56 op=3 SRCH
> base="cn=ipaconfig,cn=etc,dc=testrelm,dc=test" scope=0
> filter="(objectClass=*)" attrs=ALL
> [05/Feb/2015:22:33:02 +051800] conn=56 op=3 RESULT err=3 tag=101 nentries=0
> etime=114
> 
> So should I open another bug for that issue for IPA?

Yes please this is a different issue.
I do not understand why a base search would take more more that 114 second. This is likely a problem at 389-ds level.
Test case should be described and possible workaround in freeipa (https://bugzilla.redhat.com/show_bug.cgi?id=1183655#c21) also

> 
> 
> FYI, I'm still trying to reproduce the nsSchemaCSN issue above.

Comment 39 Noriko Hosoi 2015-02-06 18:06:04 UTC
(In reply to thierry bordaz from comment #38)
> 
> Yes please this is a different issue.
> I do not understand why a base search would take more more that 114 second.
> This is likely a problem at 389-ds level.
> Test case should be described and possible workaround in freeipa
> (https://bugzilla.redhat.com/show_bug.cgi?id=1183655#c21) also

Sorry to jump in...  I wonder if any task could be invoked at the same time when the search was running?  Are there any activities on the other threads reported in the access log and error log?

> [05/Feb/2015:22:33:02 +051800] conn=56 op=3 SRCH
> base="cn=ipaconfig,cn=etc,dc=testrelm,dc=test" scope=0
> filter="(objectClass=*)" attrs=ALL
> [05/Feb/2015:22:33:02 +051800] conn=56 op=3 RESULT err=3 tag=101 nentries=0
> etime=114

The search with scope=0 should be quite fast -- getting the OID from the entryrdn index, then retrieving the entry from the id2entry.  In this case, traversing the entryrdn should have returned none, which should not take THAT long time...

Also, could you examine the entryrdn index file is not corrupted?  E.g., by ...
# dbscan -f /path/to/server_instance/db/backend_instance/entrydn.db

Thanks!

Comment 42 thierry bordaz 2015-02-11 17:11:09 UTC
Replication of the schema was working successfully, we can see that the master was able to send its schema (2015:21:16:33) to the replica. This occurred after a transient failure ('unable to replicate schema' on 2015:21:16:31):

	master log
[10/Feb/2015:21:16:31 -0500] NSMMReplicationPlugin - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): Warning: unable to replicate schema: rc=1
[10/Feb/2015:21:16:33 -0500] NSMMReplicationPlugin - [S] Checking consumer schema localcsn:54dabb80000000000000 / remotecsn:NULL
[10/Feb/2015:21:16:34 -0500] NSMMReplicationPlugin - [S] Reread remotecsn:54dab76a000000000000
[10/Feb/2015:21:16:34 -0500] NSMMReplicationPlugin - Schema checking successful: ok to push the schema (agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389))

	replica log
[10/Feb/2015:21:16:33 -0500] conn=119 op=12 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[10/Feb/2015:21:16:33 -0500] conn=Internal op=-1 SRCH base="cn=dc\3Dtestrelm\2Cdc\3Dtest,cn=mapping tree,cn=config" scope=0 filter="objectclass=nsMappingTree" attrs="nsslapd-referral"
[10/Feb/2015:21:16:33 -0500] conn=Internal op=-1 RESULT err=0 tag=48 nentries=1 etime=0
[10/Feb/2015:21:16:33 -0500] conn=Internal op=-1 SRCH base="cn=dc\3Dtestrelm\2Cdc\3Dtest,cn=mapping tree,cn=config" scope=0 filter="objectclass=nsMappingTree" attrs="nsslapd-state"
[10/Feb/2015:21:16:33 -0500] conn=Internal op=-1 RESULT err=0 tag=48 nentries=1 etime=0
[10/Feb/2015:21:16:33 -0500] conn=119 op=12 RESULT err=0 tag=120 nentries=0 etime=0
[10/Feb/2015:21:16:33 -0500] conn=119 op=13 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="objectClasses"
[10/Feb/2015:21:16:33 -0500] conn=119 op=13 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:16:33 -0500] conn=119 op=14 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="attributeTypes"
[10/Feb/2015:21:16:33 -0500] conn=119 op=14 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:16:33 -0500] conn=119 op=15 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[10/Feb/2015:21:16:33 -0500] conn=119 op=15 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:16:33 -0500] conn=119 op=16 MOD dn="cn=schema"
[10/Feb/2015:21:16:34 -0500] conn=119 op=16 RESULT err=0 tag=103 nentries=0 etime=1

Later we can see that the replica also succeeded to update the master with its schema (2015:21:37:34)

	replica log
[10/Feb/2015:21:37:31 -0500] NSMMReplicationPlugin - [S] Checking consumer schema localcsn:54dac06b000000000000 / remotecsn:NULL
[10/Feb/2015:21:37:33 -0500] NSMMReplicationPlugin - [S] Reread remotecsn:54dabb96000000000000
10/Feb/2015:21:37:33 -0500] NSMMReplicationPlugin - Schema checking successful: ok to push the schema (agmt="cn=meToipaqavmg.testrelm.test" (ipaqavmg:389))

	master log
[10/Feb/2015:21:37:31 -0500] conn=238 fd=108 slot=108 connection from 10.16.46.51 to 10.16.98.192
[10/Feb/2015:21:37:31 -0500] conn=238 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[10/Feb/2015:21:37:31 -0500] conn=238 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[10/Feb/2015:21:37:31 -0500] conn=238 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[10/Feb/2015:21:37:31 -0500] conn=238 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[10/Feb/2015:21:37:31 -0500] conn=238 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[10/Feb/2015:21:37:31 -0500] conn=238 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="krbprincipalname=ldap/hp-dl380pgen8-02-vm-2.testrelm.test@testrelm.test,cn=services,cn=accounts,dc=testrelm,dc=test"
[10/Feb/2015:21:37:31 -0500] conn=238 op=3 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[10/Feb/2015:21:37:31 -0500] conn=238 op=3 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:37:31 -0500] conn=238 op=4 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[10/Feb/2015:21:37:31 -0500] conn=238 op=4 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:37:31 -0500] conn=238 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[10/Feb/2015:21:37:31 -0500] conn=238 op=5 RESULT err=0 tag=120 nentries=0 etime=0
[10/Feb/2015:21:37:31 -0500] conn=238 op=6 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="objectClasses"
[10/Feb/2015:21:37:31 -0500] conn=238 op=6 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:37:32 -0500] conn=238 op=7 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="attributeTypes"
[10/Feb/2015:21:37:32 -0500] conn=238 op=7 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:37:33 -0500] conn=238 op=8 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="objectClasses"
[10/Feb/2015:21:37:33 -0500] conn=238 op=8 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:37:33 -0500] conn=238 op=9 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="attributeTypes"
[10/Feb/2015:21:37:33 -0500] conn=238 op=9 RESULT err=0 tag=101 nentries=1 etime=0
[10/Feb/2015:21:37:33 -0500] conn=238 op=10 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[10/Feb/2015:21:37:34 -0500] conn=238 op=10 RESULT err=0 tag=101 nentries=1 etime=1
[10/Feb/2015:21:37:34 -0500] conn=238 op=11 MOD dn="cn=schema"
[10/Feb/2015:21:37:34 -0500] conn=238 op=11 RESULT err=0 tag=103 nentries=0 etime=0

Comment 43 thierry bordaz 2015-02-12 15:18:03 UTC
The problem reported in 
	https://bugzilla.redhat.com/show_bug.cgi?id=1183655#c40 
	https://bugzilla.redhat.com/show_bug.cgi?id=1183655#c41 
is a different problem than the replication of the schema.


Here are my findings regarding this last issue:


	On Master a first DNS entry was added then later deleted
	and a second entry (with a '.' at the end of the RDN) was added

[10/Feb/2015:21:04:03 -0500] conn=202 op=7 ADD dn="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:04:03 -0500] conn=202 op=7 RESULT err=0 tag=105 nentries=0 etime=0 csn=54dab894000100040000
...
[10/Feb/2015:21:18:02 -0500] conn=2 op=1155 DEL dn="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:18:02 -0500] conn=2 op=1155 RESULT err=0 tag=107 nentries=0 etime=0 csn=54dabbdb001100040000
...
[10/Feb/2015:21:18:02 -0500] conn=2 op=1160 SRCH base="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs=""
[10/Feb/2015:21:18:02 -0500] conn=2 op=1160 RESULT err=32 tag=101 nentries=0 etime=0
..
[10/Feb/2015:21:18:02 -0500] conn=2 op=1163 ADD dn="idnsname=ipaad2012r2.test.,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:18:02 -0500] conn=2 op=1163 RESULT err=0 tag=105 nentries=0 etime=0 csn=54dabbdb001200040000

	Later we can see that first DSN entry was no longer present
	and the second one can be found

[10/Feb/2015:22:06:41 -0500] conn=590 op=4 SRCH base="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs=""
[10/Feb/2015:22:06:41 -0500] conn=590 op=4 RESULT err=32 tag=101 nentries=0 etime=0
[10/Feb/2015:22:06:41 -0500] conn=590 op=5 SRCH base="idnsname=ipaad2012r2.test.,cn=dns,dc=testrelm,dc=test" scope=0 filter="(objectClass=*)" attrs="objectClass"
[10/Feb/2015:22:06:41 -0500] conn=590 op=5 RESULT err=0 tag=101 nentries=1 etime=0
	



	The problem is that the replica did not get the DEL operation of the first entry

[10/Feb/2015:21:04:03 -0500] conn=22 op=328 ADD dn="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:04:03 -0500] conn=22 op=328 RESULT err=0 tag=105 nentries=0 etime=1 csn=54dab894000100040000


[10/Feb/2015:21:38:14 -0500] conn=6 op=8 MOD dn="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test"
[10/Feb/2015:21:38:14 -0500] conn=Internal op=-1 SRCH base="idnsname=ipaad2012r2.test,cn=dns,dc=testrelm,dc=test" scope=0 filter="(|(objectclass=*)(objectclass=ldapsubentry))" attrs=ALL
[10/Feb/2015:21:38:14 -0500] conn=Internal op=-1 RESULT err=0 tag=48 nentries=1 etime=0
[10/Feb/2015:21:38:14 -0500] conn=6 op=8 RESULT err=0 tag=103 nentries=0 etime=0 csn=54dac097000700030000


	When the master was supposed to replicate the DEL, several stop/start occured.
	I do not know it it can explain why the DEL got not replicated
[10/Feb/2015:21:18:04 -0500] - slapd stopped.
[10/Feb/2015:21:18:08 -0500] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[10/Feb/2015:21:18:44 -0500] - slapd stopped.
[10/Feb/2015:21:18:47 -0500] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[10/Feb/2015:21:19:44 -0500] - slapd stopped.
[10/Feb/2015:21:19:51 -0500] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[10/Feb/2015:21:22:24 -0500] - slapd stopped.
[10/Feb/2015:21:22:28 -0500] - slapd started.  Listening on All Interfaces port 389 for LDAP requests

	Looking at the master logs there is an unexpected
	RA wants to start replication at update 54dabbd4001300040000
	But access to the changelog via a cursor reports an error (DB_DELETED : -30988 like if changelog db file was deleted)
	On this errors, the server takes a "backup" starting csn 54daba2c000300030000 and
	starts sending update from csn=54dabc4a000000040000. 
		So it skipped more that 100 updates !

[10/Feb/2015:21:20:23 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - session start: anchorcsn=54dabbd4001300040000
[10/Feb/2015:21:20:23 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - clcache_load_buffer: rc=-30988
[10/Feb/2015:21:20:23 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): CSN 54dabbd4001300040000 not found and no purging, probably a reinit
[10/Feb/2015:21:20:24 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): Will try to use supplier min CSN 54daba2c000300030000 to load changelog
[10/Feb/2015:21:20:24 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - session start: anchorcsn=54daba2c000300030000
[10/Feb/2015:21:20:24 -0500] NSMMReplicationPlugin - changelog program - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): CSN 54daba2c000300030000 found, position set for replay
[10/Feb/2015:21:20:24 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - Skipping update because the changelog buffer current csn [54daba2c000300030000] is less than or equal to the consumer max csn [54dabc68000000030000]
[10/Feb/2015:21:20:24 -0500] agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389) - load=1 rec=2 csn=54dabc4a000000040000
[10/Feb/2015:21:20:24 -0500] NSMMReplicationPlugin - agmt="cn=meTohp-dl380pgen8-02-vm-2.testrelm.test" (hp-dl380pgen8-02-vm-2:389): replay_update: Sending modify operation (dn="fqdn=ipaqavmg.testrelm.test,cn=computers,cn=accounts,dc=testrelm,dc=test" csn=54dabc4a000000040000)


	In conclusion: 
                Scott, please open a separated bug for this last issue.
		We need to find why changelog cursor returned DB_DELETED.
		Also determine how replication should handle this error because in that 
		case is silently skipped tons of updates

Comment 44 Sankar Ramalingam 2015-02-20 11:05:04 UTC
Based on Thierry's comments(#25, #27, #29 and #43), it looks like the original issue is fixed. The problems reported by Scott should be filed as separate bugs.
Hence, marking the bug as Verified.

Comment 45 Scott Poore 2015-02-20 15:58:21 UTC
Yes, I've got separate bugs for the other issues that are separate from the schema copy issue here.

I've also added a quick check to the upgrade tests that confirms a couple things and have a run from yesterday:

:: [   PASS   ] :: File '/var/log/dirsrv/slapd-TESTRELM-TEST/errors' should contain 'Schema checking successful' 
:: [  BEGIN   ] :: Running 'scp -o StrictHostKeyChecking=no qe-blade-14.<DOMAINREPLACED>:/var/log/dirsrv/slapd-TESTRELM-TEST/access /tmp/qe-blade-14.<DOMAINREPLACED>.dirsrv.access'
:: [   PASS   ] :: Command 'scp -o StrictHostKeyChecking=no qe-blade-14.<DOMAINREPLACED>:/var/log/dirsrv/slapd-TESTRELM-TEST/access /tmp/qe-blade-14.<DOMAINREPLACED>.dirsrv.access' (Expected 0, got 0)
:: [   PASS   ] :: File '/tmp/qe-blade-14.<DOMAINREPLACED>.dirsrv.access' should contain 'MOD.*cn=schema' 
:: [   PASS   ] :: File 'ipa_upgrade_bz1183655.JoLij' should not contain 'err=[1-9_-]' 
:: [ 16:36:12 ] :: SRCHOST=54e65554000000000000
:: [ 16:36:13 ] :: TGTHOST=54e65554000000000000
:: [   PASS   ] :: nsSchemaCSN is the same on both servers 

Version:

389-ds-base-1.3.3.1-14.el7.x86_64

Comment 47 errata-xmlrpc 2015-03-05 09:40:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0416.html


Note You need to log in before you can comment on or make changes to this bug.