821176 – ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0

Summary: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	389-ds-base
Sub Component:
Version:	6.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Rich Megginson
QA Contact:	IDM QE LIST
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-05-12 19:13 UTC by Scott Poore
Modified:	2020-09-13 20:10 UTC
CC List:	5 users
Fixed In Version:	389-ds-base.1.2.10.2-12.el6
Doc Type:	Bug Fix
Doc Text:	This bug was introduced by the fix for Bug 819643 - "Database RUV could mismatch the one in changelog under the stress" which is in the same errata.
Clone Of:
Environment:
Last Closed:	2012-06-20 07:15:36 UTC
Target Upstream Version:
Embargoed:

Attachments
Stack trace (4.14 KB, application/x-gzip) 2012-05-14 16:43 UTC, Scott Poore

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	389ds 389-ds-base issues 359	0	None	None	None	2020-09-13 20:10:16 UTC
Red Hat Product Errata	RHSA-2012:0813	0	normal	SHIPPED_LIVE	Low: 389-ds-base security, bug fix, and enhancement update	2012-06-19 19:29:15 UTC

Description Scott Poore 2012-05-12 19:13:51 UTC

Description of problem:

Dirsrv seems to be crashing.  Initially this was seen after running the following on an IPA replica:

ipa-replica-manage force-sync --from=$MASTER --password="$ROOTPWD"

Closer inspection shows that the crash is not limited to force-sync; it also happens outside of it, possibly any time replication is attempted.

Version-Release number of selected component (if applicable):

ipa-server-2.2.0-13.el6.x86_64
389-ds-base-1.2.10.2-11.el6.x86_64

How reproducible:
very.

Steps to Reproduce:
1. <setup rhel6.2 IPA Master>
2. <setup rhel6.2 IPA Replica>
3. <point servers to yum repos with rhel6.3>
4. on both run:  yum -y update 'ipa*' 
5. on replica run:  ipa-replica-manage force-sync --from=$MASTER --password="$ROOTPWD"
  
Actual results:

Shortly after, if not during the force-sync, dirsrv on the master is stopped.  Looking at the log, I see an ns-slapd segfault in /var/log/messages:

May 12 15:03:57 qe-blade-12 kernel: ns-slapd[11425]: segfault at 30008 ip 00007f1c1b54d276 sp 00007f1bfaff0840 error 4 in libreplication-plugin.so[7f1c1b526000+7d000]
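
As a worked example of reading that kernel line: the faulting instruction pointer minus the mapping base printed in brackets gives the offset inside libreplication-plugin.so, here 0x7f1c1b54d276 - 0x7f1c1b526000 = 0x27276. The following is a minimal C sketch of that arithmetic. The function name segfault_offset is hypothetical, and the parsing assumes the message layout shown above ("ip <hex>" followed later by "[<base>+<size>]").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the instruction pointer and the
 * "[base+size]" mapping from a kernel segfault message and compute
 * the offset of the fault inside the mapped library.
 * Returns 0 on success, -1 if the line does not have the assumed layout
 * or the ip lies outside the mapping.
 */
int segfault_offset(const char *line, unsigned long long *offset)
{
    const char *ip_field;
    const char *bracket;
    unsigned long long ip, base, size;

    assert(line != NULL && offset != NULL);

    ip_field = strstr(line, " ip ");   /* e.g. " ip 00007f1c1b54d276" */
    bracket = strrchr(line, '[');      /* last '[' starts "[base+size]" */
    if (ip_field == NULL || bracket == NULL) {
        return -1;
    }
    if (sscanf(ip_field, " ip %llx", &ip) != 1) {
        return -1;
    }
    if (sscanf(bracket, "[%llx+%llx]", &base, &size) != 2) {
        return -1;
    }
    if (ip < base || ip - base >= size) {
        return -1;                     /* ip is not inside this mapping */
    }
    *offset = ip - base;
    return 0;
}
```

With matching debuginfo installed, an offset obtained this way can usually be resolved to a source line with a tool such as addr2line run against the library.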

Right around the force-sync and segfault, on the replica, I see this in /var/log/dirsrv/slapd-$INSTANCE/errors:

[12/May/2012:15:03:54 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Replication bind with GSSAPI auth resumed
[12/May/2012:15:03:54 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Consumer failed to replay change (uniqueid 97259e87-9c4f11e1-b596ca1b-778d212c, CSN 4fae8fef000100030000): Operations error. Will retry later.
[12/May/2012:15:03:55 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Consumer failed to replay change (uniqueid d30b348a-9c4c11e1-b596ca1b-778d212c, CSN 4fae8ff9000000030000): Can't contact LDAP server. Will retry later.
[12/May/2012:15:03:56 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Warning: unable to send endReplication extended operation (Can't contact LDAP server)

Expected results:

replica sync'd without failing the Master dirsrv.

Additional info:

Comment 2 Rob Crittenden 2012-05-14 13:55:06 UTC

Are you getting a core dump? Can you get a backtrace of the crashed 389-ds instance?

Comment 3 Scott Poore 2012-05-14 16:43:52 UTC

Created attachment 584409
Stack trace

Comment 4 Noriko Hosoi 2012-05-14 17:20:31 UTC

I ran into a similar problem.  This patch is supposed to fix the bug.  The current llist code fails to set list->tail to NULL at the right place.  I'm going to rebuild 389-ds-base with this patch and others in 1.2.10.2-12 once our reliability test has passed.

diff --git a/ldap/servers/plugins/replication/llist.c b/ldap/servers/plugins/rep
index e80f532..05cfa48 100644
--- a/ldap/servers/plugins/replication/llist.c
+++ b/ldap/servers/plugins/replication/llist.c
@@ -165,14 +165,14 @@ void* llistRemoveCurrentAndGetNext (LList *list, void **it
        if (node)
        {
                prevNode->next = node->next;    
+               if (list->tail == node) {
+                       list->tail = prevNode;
+               }
                _llistDestroyNode (&node, NULL);
                node = prevNode->next;
                if (node) {
                        return node->data;
                } else {
-                       if (list->head->next == NULL) {
-                               list->tail = NULL;
-                       }
                        return NULL;
                }
        }
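
The idea behind the hunk above, repairing the tail pointer at the moment the tail node is unlinked and before it is freed, can be illustrated with a small standalone list. This is only a sketch under stated assumptions: the types and functions (node, list, list_init, list_append, list_remove_after, list_clear) are hypothetical and are not the 389-ds-base LList API, and the sketch lets tail point at the sentinel head when the list is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified singly linked list (not the 389-ds LList). */
typedef struct node {
    int value;
    struct node *next;
} node;

typedef struct {
    node head;   /* sentinel: head.next is the first real element */
    node *tail;  /* last element, or &head when the list is empty */
} list;

void list_init(list *l)
{
    l->head.value = 0;
    l->head.next = NULL;
    l->tail = &l->head;
}

/* Append at the tail. Returns 0 on success, -1 on allocation failure. */
int list_append(list *l, int value)
{
    node *n = malloc(sizeof *n);
    if (n == NULL) {
        return -1;
    }
    n->value = value;
    n->next = NULL;
    l->tail->next = n;   /* a stale tail here would write to freed memory */
    l->tail = n;
    return 0;
}

/*
 * Remove the node that follows prev and return the node that now
 * follows prev (NULL if none). If the removed node was the tail,
 * tail is moved back to prev before the node is freed, so tail never
 * refers to freed memory.
 */
node *list_remove_after(list *l, node *prev)
{
    node *victim;

    assert(l != NULL && prev != NULL);
    victim = prev->next;
    if (victim == NULL) {
        return NULL;
    }
    prev->next = victim->next;
    if (l->tail == victim) {
        l->tail = prev;
    }
    free(victim);
    return prev->next;
}

/* Free every element and reset the list to the empty state. */
void list_clear(list *l)
{
    while (l->head.next != NULL) {
        list_remove_after(l, &l->head);
    }
}
```

Appending after the last element has been removed is the case that exercises the fix: with a stale tail, the append would link the new node behind a freed one.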


Thread 1 (Thread 0x7fc84e1fc700 (LWP 18031)):
#0  0x00007fc86b0e8276 in csnplInsert (csnpl=0x7fc838008090, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/csnpl.c:155
        rc = <value optimized out>
        csnplnode = 0x30000
        csn_str = "\000\000\000\000\000\000\000\000\246A\020k\310\177\000\000p\344\t\001"
#1  0x00007fc86b1051ac in ruv_add_csn_inprogress (ruv=0x147f5f0, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/repl5_ruv.c:1438
        replica = 0x7fc8380044e0
        csn_str = "\024\000\000\000\000\000\000\000\300\067C\001\000\000\000\000p\203M\001"
        rc = 0
#2  0x00007fc86b0fa08c in process_operation (pb=<value optimized out>, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/repl5_plugins.c:1316
        r_obj = 0x1442c30
        r = <value optimized out>
        ruv_obj = 0x11e8120
        ruv = <value optimized out>
        rc = <value optimized out>
#3  0x00007fc86b0fa683 in multimaster_preop_modify (pb=0x14d8370) at ldap/servers/plugins/replication/repl5_plugins.c:452
        csn = 0x7fc8280014a0
        target_uuid = 0x7fc828000e40 "d30b348a-9c4c11e1-b596ca1b-778d212c"
        drc = <value optimized out>
        ctrlp = 0x7fc828002b90
        sessionid = "conn=19 op=9", '\000' <repeats 12 times>, " j\0
[...]

Comment 5 Nathan Kinder 2012-05-14 20:46:08 UTC

Upstream ticket:
https://fedorahosted.org/389/ticket/359

Comment 10 Scott Poore 2012-05-17 20:44:13 UTC

Verified.

Version ::

ipa-server-2.2.0-14.el6.x86_64
389-ds-base-1.2.10.2-12.el6.x86_64

Automated Test Results ::

automation not yet run from beaker.  This was manually executed...


On MASTER:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: upgrade_bz_821176: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [15:34:05] ::  Machine in recipe is MASTER
:: [15:34:06] ::  Restarting IPA services
Restarting Directory Service
Shutting down dirsrv: 
    PKI-IPA...                                             [  OK  ]
    TESTRELM-COM...                                        [  OK  ]
Starting dirsrv: 
    PKI-IPA...                                             [  OK  ]
    TESTRELM-COM...                                        [  OK  ]
Restarting KDC Service
Stopping Kerberos 5 KDC:                                   [  OK  ]
Starting Kerberos 5 KDC:                                   [  OK  ]
Restarting KPASSWD Service
Stopping Kerberos 5 Admin Server:                          [  OK  ]
Starting Kerberos 5 Admin Server:                          [  OK  ]
Restarting DNS Service
Stopping named: .                                          [  OK  ]
Starting named:                                            [  OK  ]
Restarting MEMCACHE Service
Stopping ipa_memcached:                                    [  OK  ]
Starting ipa_memcached:                                    [  OK  ]
Restarting HTTP Service
Stopping httpd:                                            [  OK  ]
Starting httpd: [Thu May 17 15:34:23 2012] [warn] worker ajp://localhost:9447/ already used by another worker
[Thu May 17 15:34:23 2012] [warn] worker ajp://localhost:9447/ already used by another worker
                                                           [  OK  ]
Restarting CA Service
Stopping pki-ca:                                           [  OK  ]
Starting pki-ca:                                           [  OK  ]
:: [   PASS   ] :: Running 'ipactl restart'
result_server not set, assuming developer mode.
Setting 192.168.122.101 to state upgrade_bz_821176.18.1
:: [   PASS   ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.1' -m 192.168.122.101'
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines
 ['192.168.122.102']
are in one of these states: ['upgrade_bz_821176.18.2']

:: [   PASS   ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.2' 192.168.122.102'
:: [15:36:17] ::  Checking /var/log/messages for ns-slapd segfault
:: [   PASS   ] :: BZ 821176 not found.  No ns-slapd segfault found in /var/log/messages
:: [15:36:17] ::  Checking /var/log/dirsrv/slapd-TESTRELM-COM/errors for LDAP error
:: [   PASS   ] :: BZ 821176 not found...didn't find LDAP error in dirsrv log
result_server not set, assuming developer mode.
Setting 192.168.122.101 to state upgrade_bz_821176.18.3
:: [   PASS   ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.3' -m 192.168.122.101'


On REPLICA:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: upgrade_bz_821176: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [15:34:14] ::  Machine in recipe is SLAVE
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines
 ['192.168.122.101']
are in one of these states: ['upgrade_bz_821176.18.1']

:: [   PASS   ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.1' 192.168.122.101'
:: [15:36:04] ::  Running ipa-replica-manage force-sync to make sure that works
ipa: INFO: Setting agreement cn=meTospoore-dvm2.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meTospoore-dvm2.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config
:: [   PASS   ] :: Running 'ipa-replica-manage force-sync --from=spoore-dvm1.testrelm.com --password=Secret123'
result_server not set, assuming developer mode.
Setting 192.168.122.102 to state upgrade_bz_821176.18.2
:: [   PASS   ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.2' -m 192.168.122.102'
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines
 ['192.168.122.101']
are in one of these states: ['upgrade_bz_821176.18.3']

:: [   PASS   ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.3' 192.168.122.101'

Manual Test Results ::

# grep -i segfault /var/log/messages
#

 
# grep -i "NSMMReplicationPlugin.*Warning: unable to send endReplication extended operation.*Can't contact LDAP server" /var/log/dirsrv/slapd-TESTRELM-COM/errors
#

Comment 11 Noriko Hosoi 2012-05-25 01:15:17 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
This bug was introduced by the fix for Bug 819643 - "Database RUV could mismatch the one in changelog under the stress" which is in the same errata.

Comment 12 errata-xmlrpc 2012-06-20 07:15:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0813.html
