Bug 720461

Summary: Need TLS/SSL error messages in repl status and errors log
Product: Red Hat Enterprise Linux 6 Reporter: Rich Megginson <rmeggins>
Component: 389-ds-baseAssignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2CC: amsharma, jgalipea, mbarper, nkinder, shaines
Target Milestone: rc   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: 389-ds-base-1.2.9.11-1.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 691313 Environment:
Last Closed: 2011-12-06 17:55:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 691313    
Bug Blocks: 690318    

Description Rich Megginson 2011-07-11 17:03:22 UTC
+++ This bug was initially created as a clone of Bug #691313 +++

Description of problem:
My directory server have a permanent 'incremental update' failure witch stay throught after a 'send updates now' from the administration console.

When I set the replication log to see whats happening I get this:

[24/Mar/2011:17:31:49 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_fatal_error -> stop_fatal_error
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_fatal_error -> stop_normal_termination
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_normal_termination -> stop_normal_termination
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): No linger to cancel on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Disconnected from the consumer
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): No linger to cancel on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Disconnected from the consumer
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: start -> ready_to_acquire_replica
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Trying secure slapi_ldap_init_ext
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): binddn = cn=Replication Manager, cn=config,  passwd = {DES}xxxxxxxxxxxxxxxxxxx==
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): No linger to cancel on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Beginning linger on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: ready_to_acquire_replica -> stop_fatal_error
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Incremental update failed and requires administrator action
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_fatal_error -> stop_fatal_error

The log is not very clear about the problem, I tried to look for the error in the code but there was many ways for that problem. Finally I try to do a ldapsearch between the two servers in the replication getting this error (from the supplier to the consumer):

'hostname does not match CN in peer certificate'

To solve the problem I have deleted the entry for the consumer server into the /etc/hosts file of the supplier server. (I don't know why but that worked)

Version-Release number of selected component (if applicable):
CentOS 5.5
389DS 1.2.5

How reproducible:
Actually I don't know. I have get this error for first time and all my servers have about the same configuration.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

--- Additional comment from rmeggins on 2011-03-28 09:29:47 EDT ---

I think the problem here is that the replication status and the errors log should contain the 'hostname does not match CN in peer certificate' message, or any other relevant message that could help diagnose connection problems.

--- Additional comment from mbarperoi on 2011-03-29 02:21:47 EDT ---

I think so too, because the 'stop_fatal_error' is very general to get the problem easy.

--- Additional comment from nkinder on 2011-07-08 14:09:47 EDT ---

With replication level logging, I see the following in the errors log on my system the first time the certificate hostname mismatch error is encountered:

---------------------------------------------------------------------------
[08/Jul/2011:11:03:21 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:2636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS: hostname does not match CN in peer certificate)
---------------------------------------------------------------------------

After this first error, replication goes into backoff mode.  This message is not logged a second time.  This is a different state that "fatal stop", so the original scenario must be a bit different than my test case.  I'm not sure why my test doesn't exactly match up with the original reported case though.

--- Additional comment from nkinder on 2011-07-08 14:18:45 EDT ---

When we encounter an error during the bind stage in a replication session, we only log the error the first time we encounter it.  We do this to prevent filling the logs with the same error over and over.

From looking at the code, I believe that the certificate hostname error was likely logged on the reporters system when it first tried to start the replication session.  The message is then suppressed until we get a different return code (or the DS instance is restarted).

--- Additional comment from nkinder on 2011-07-08 16:08:01 EDT ---

Created attachment 511991 [details]
Patch

--- Additional comment from nkinder on 2011-07-08 16:18:27 EDT ---

Pushed to master.  Thanks to Noriko for her review!

Counting objects: 15, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.08 KiB, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   354249c..a24d9cc  master -> master

Comment 3 Chandrasekar Kannan 2011-09-16 21:33:48 UTC
ds-replication is no longer a component of rhel. folding back to 389-ds-base.

Comment 5 Jenny Severance 2011-09-27 18:46:33 UTC
Please add steps to reproduce / verify this bug.  Thanks

Comment 6 Rich Megginson 2011-09-27 19:24:03 UTC
(In reply to comment #5)
> Please add steps to reproduce / verify this bug.  Thanks

see https://bugzilla.redhat.com/show_bug.cgi?id=691313#c8

Comment 7 Amita Sharma 2011-11-08 12:57:41 UTC
[08/Nov/2011:17:03:40 +051800] NSMMReplicationPlugin - agmt="cn=mmr2" (snmaptest:636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS: hostname does not match CN in peer certificate)

Hence marking as VERIFIED

Comment 8 errata-xmlrpc 2011-12-06 17:55:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1711.html