691313 – Need TLS/SSL error messages in repl status and errors log

Bug 691313 - Need TLS/SSL error messages in repl status and errors log

Summary: Need TLS/SSL error messages in repl status and errors log

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	389
Classification:	Retired
Component:	Replication - General
Sub Component:
Version:	1.2.5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Nathan Kinder
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	690318 389_1.2.9 720461
TreeView+	depends on / blocked

Reported:	2011-03-28 07:53 UTC by Moisés Barba Pérez
Modified:	2015-12-07 17:06 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Clones:	720461 (view as bug list)
Environment:
Last Closed:	2015-12-07 17:06:00 UTC
Embargoed:

Attachments	(Terms of Use)
Patch (3.52 KB, patch) 2011-07-08 20:08 UTC, Nathan Kinder	nhosoi: review+	Details \| Diff
View All

Description Moisés Barba Pérez 2011-03-28 07:53:16 UTC

Description of problem:
My directory server have a permanent 'incremental update' failure witch stay throught after a 'send updates now' from the administration console.

When I set the replication log to see whats happening I get this:

[24/Mar/2011:17:31:49 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_fatal_error -> stop_fatal_error
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_fatal_error -> stop_normal_termination
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_normal_termination -> stop_normal_termination
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): No linger to cancel on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Disconnected from the consumer
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): No linger to cancel on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Disconnected from the consumer
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: start -> ready_to_acquire_replica
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Trying secure slapi_ldap_init_ext
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): binddn = cn=Replication Manager, cn=config,  passwd = {DES}xxxxxxxxxxxxxxxxxxx==
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): No linger to cancel on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Beginning linger on the connection
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: ready_to_acquire_replica -> stop_fatal_error
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): Incremental update failed and requires administrator action
[24/Mar/2011:17:31:50 +0100] NSMMReplicationPlugin - agmt="cn=x32" (server:636): State: stop_fatal_error -> stop_fatal_error

The log is not very clear about the problem, I tried to look for the error in the code but there was many ways for that problem. Finally I try to do a ldapsearch between the two servers in the replication getting this error (from the supplier to the consumer):

'hostname does not match CN in peer certificate'

To solve the problem I have deleted the entry for the consumer server into the /etc/hosts file of the supplier server. (I don't know why but that worked)

Version-Release number of selected component (if applicable):
CentOS 5.5
389DS 1.2.5

How reproducible:
Actually I don't know. I have get this error for first time and all my servers have about the same configuration.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Rich Megginson 2011-03-28 13:29:47 UTC

I think the problem here is that the replication status and the errors log should contain the 'hostname does not match CN in peer certificate' message, or any other relevant message that could help diagnose connection problems.

Comment 2 Moisés Barba Pérez 2011-03-29 06:21:47 UTC

I think so too, because the 'stop_fatal_error' is very general to get the problem easy.

Comment 3 Nathan Kinder 2011-07-08 18:09:47 UTC

With replication level logging, I see the following in the errors log on my system the first time the certificate hostname mismatch error is encountered:

---------------------------------------------------------------------------
[08/Jul/2011:11:03:21 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:2636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS: hostname does not match CN in peer certificate)
---------------------------------------------------------------------------

After this first error, replication goes into backoff mode.  This message is not logged a second time.  This is a different state that "fatal stop", so the original scenario must be a bit different than my test case.  I'm not sure why my test doesn't exactly match up with the original reported case though.

Comment 4 Nathan Kinder 2011-07-08 18:18:45 UTC

When we encounter an error during the bind stage in a replication session, we only log the error the first time we encounter it.  We do this to prevent filling the logs with the same error over and over.

From looking at the code, I believe that the certificate hostname error was likely logged on the reporters system when it first tried to start the replication session.  The message is then suppressed until we get a different return code (or the DS instance is restarted).

Comment 5 Nathan Kinder 2011-07-08 20:08:01 UTC

Created attachment 511991 [details]
Patch

Comment 6 Nathan Kinder 2011-07-08 20:18:27 UTC

Pushed to master.  Thanks to Noriko for her review!

Counting objects: 15, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.08 KiB, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   354249c..a24d9cc  master -> master

Comment 9 Amita Sharma 2011-11-08 12:41:45 UTC

Followed steps in comment#8 and I got :
08/Nov/2011:17:41:06 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)
[08/Nov/2011:17:41:06 +051800] NSMMReplicationPlugin - agmt="cn=mmr2" (snmaptest:636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS: hostname does not match CN in peer certificate)
[08/Nov/2011:17:41:10 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)
[08/Nov/2011:17:41:13 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)
[08/Nov/2011:17:41:13 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)
[08/Nov/2011:17:41:16 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)
[08/Nov/2011:17:41:17 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)
[08/Nov/2011:17:41:17 +051800] slapi_ldap_bind - Error: could not send bind request for id [cn=replication manager,cn=config] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8157 (Certificate extension not found.) 115 (Operation now in progress)

Hence marking as VERIFIED.

Note You need to log in before you can comment on or make changes to this bug.