Bug 821176
| Summary: | ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0 |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | 389-ds-base |
| Version: | 6.3 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Scott Poore <spoore> |
| Assignee: | Rich Megginson <rmeggins> |
| QA Contact: | IDM QE LIST <seceng-idm-qe-list> |
| CC: | jgalipea, mkosek, nhosoi, nkinder, rmeggins |
| Target Milestone: | rc |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | 389-ds-base-1.2.10.2-12.el6 |
| Doc Type: | Bug Fix |
| Doc Text: | This bug was introduced by the fix for Bug 819643, "Database RUV could mismatch the one in changelog under the stress", which is in the same errata. |
| Type: | Bug |
| Last Closed: | 2012-06-20 07:15:36 UTC |
Description
Scott Poore
2012-05-12 19:13:51 UTC
Are you getting a core dump? Can you get a backtrace of the crashed 389-ds instance?

Created attachment 584409 [details]
Stack trace

I ran into a similar problem. This patch is supposed to fix the bug: the current llist code fails to update list->tail at the right place. I'm going to rebuild 389-ds-base with this patch and the others in 1.2.10.2-12 once our reliability test passes.
```diff
diff --git a/ldap/servers/plugins/replication/llist.c b/ldap/servers/plugins/rep
index e80f532..05cfa48 100644
--- a/ldap/servers/plugins/replication/llist.c
+++ b/ldap/servers/plugins/replication/llist.c
@@ -165,14 +165,14 @@ void* llistRemoveCurrentAndGetNext (LList *list, void **it
     if (node)
     {
         prevNode->next = node->next;
+        if (list->tail == node) {
+            list->tail = prevNode;
+        }
         _llistDestroyNode (&node, NULL);
         node = prevNode->next;
         if (node) {
             return node->data;
         } else {
-            if (list->head->next == NULL) {
-                list->tail = NULL;
-            }
             return NULL;
         }
     }
```
```
Thread 1 (Thread 0x7fc84e1fc700 (LWP 18031)):
#0 0x00007fc86b0e8276 in csnplInsert (csnpl=0x7fc838008090, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/csnpl.c:155
rc = <value optimized out>
csnplnode = 0x30000
csn_str = "\000\000\000\000\000\000\000\000\246A\020k\310\177\000\000p\344\t\001"
#1 0x00007fc86b1051ac in ruv_add_csn_inprogress (ruv=0x147f5f0, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/repl5_ruv.c:1438
replica = 0x7fc8380044e0
csn_str = "\024\000\000\000\000\000\000\000\300\067C\001\000\000\000\000p\203M\001"
rc = 0
#2 0x00007fc86b0fa08c in process_operation (pb=<value optimized out>, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/repl5_plugins.c:1316
r_obj = 0x1442c30
r = <value optimized out>
ruv_obj = 0x11e8120
ruv = <value optimized out>
rc = <value optimized out>
#3 0x00007fc86b0fa683 in multimaster_preop_modify (pb=0x14d8370) at ldap/servers/plugins/replication/repl5_plugins.c:452
csn = 0x7fc8280014a0
target_uuid = 0x7fc828000e40 "d30b348a-9c4c11e1-b596ca1b-778d212c"
drc = <value optimized out>
ctrlp = 0x7fc828002b90
sessionid = "conn=19 op=9", '\000' <repeats 12 times>, " j\0
[...]
```
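For reference, a backtrace like the one above is typically captured from the core file with gdb. A sketch of the procedure, assuming an el6 389-ds install (the binary path and core file location are examples; adjust to your system):

```shell
# let ns-slapd dump core, then reproduce the crash
ulimit -c unlimited

# install matching debug symbols so frames resolve to source lines
debuginfo-install 389-ds-base

# open the core against the ns-slapd binary and dump every thread
gdb /usr/sbin/ns-slapd /path/to/core
(gdb) thread apply all bt full
```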
Upstream ticket: https://fedorahosted.org/389/ticket/359

Verified.
Version ::
ipa-server-2.2.0-14.el6.x86_64
389-ds-base-1.2.10.2-12.el6.x86_64
Automated Test Results ::
Automation not yet run from Beaker; this was executed manually.
On MASTER:
```
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: upgrade_bz_821176: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [15:34:05] :: Machine in recipe is MASTER
:: [15:34:06] :: Restarting IPA services
Restarting Directory Service
Shutting down dirsrv:
PKI-IPA... [ OK ]
TESTRELM-COM... [ OK ]
Starting dirsrv:
PKI-IPA... [ OK ]
TESTRELM-COM... [ OK ]
Restarting KDC Service
Stopping Kerberos 5 KDC: [ OK ]
Starting Kerberos 5 KDC: [ OK ]
Restarting KPASSWD Service
Stopping Kerberos 5 Admin Server: [ OK ]
Starting Kerberos 5 Admin Server: [ OK ]
Restarting DNS Service
Stopping named: . [ OK ]
Starting named: [ OK ]
Restarting MEMCACHE Service
Stopping ipa_memcached: [ OK ]
Starting ipa_memcached: [ OK ]
Restarting HTTP Service
Stopping httpd: [ OK ]
Starting httpd: [Thu May 17 15:34:23 2012] [warn] worker ajp://localhost:9447/ already used by another worker
[Thu May 17 15:34:23 2012] [warn] worker ajp://localhost:9447/ already used by another worker
[ OK ]
Restarting CA Service
Stopping pki-ca: [ OK ]
Starting pki-ca: [ OK ]
:: [ PASS ] :: Running 'ipactl restart'
result_server not set, assuming developer mode.
Setting 192.168.122.101 to state upgrade_bz_821176.18.1
:: [ PASS ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.1' -m 192.168.122.101'
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines
['192.168.122.102']
are in one of these states: ['upgrade_bz_821176.18.2']
:: [ PASS ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.2' 192.168.122.102'
:: [15:36:17] :: Checking /var/log/messages for ns-slapd segfault
:: [ PASS ] :: BZ 821176 not found. No ns-slapd segfault found in /var/log/messages
:: [15:36:17] :: Checking /var/log/dirsrv/slapd-TESTRELM-COM/errors for LDAP error
:: [ PASS ] :: BZ 821176 not found...didn't find LDAP error in dirsrv log
result_server not set, assuming developer mode.
Setting 192.168.122.101 to state upgrade_bz_821176.18.3
:: [ PASS ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.3' -m 192.168.122.101'
```
On REPLICA:
```
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: upgrade_bz_821176: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [15:34:14] :: Machine in recipe is SLAVE
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines
['192.168.122.101']
are in one of these states: ['upgrade_bz_821176.18.1']
:: [ PASS ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.1' 192.168.122.101'
:: [15:36:04] :: Running ipa-replica-manage force-sync to make sure that works
ipa: INFO: Setting agreement cn=meTospoore-dvm2.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meTospoore-dvm2.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config
:: [ PASS ] :: Running 'ipa-replica-manage force-sync --from=spoore-dvm1.testrelm.com --password=Secret123'
result_server not set, assuming developer mode.
Setting 192.168.122.102 to state upgrade_bz_821176.18.2
:: [ PASS ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.2' -m 192.168.122.102'
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines
['192.168.122.101']
are in one of these states: ['upgrade_bz_821176.18.3']
:: [ PASS ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.3' 192.168.122.101'
```
Manual Test Results ::
```
# grep -i segfault /var/log/messages
#
# grep -i "NSMMReplicationPlugin.*Warning: unable to send endReplication extended operation.*Can't contact LDAP server" /var/log/dirsrv/slapd-TESTRELM-COM/errors
#
```
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
This bug was introduced by the fix for Bug 819643 - "Database RUV could mismatch the one in changelog under the stress" which is in the same errata.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0813.html