Red Hat Bugzilla – Bug 821176
ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
Last modified: 2013-10-07 14:56:05 EDT
Description of problem:
Dirsrv seems to be crashing. Initially this was seen after running the following on an IPA replica:

    ipa-replica-manage force-sync --from=$MASTER --password="$ROOTPWD"

Closer inspection shows that the force-sync is not the only trigger; the crash also happens outside of it, possibly any time replication is attempted.

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-13.el6.x86_64
389-ds-base-1.2.10.2-11.el6.x86_64

How reproducible:
very.

Steps to Reproduce:
1. <setup rhel6.2 IPA Master>
2. <setup rhel6.2 IPA Replica>
3. <point servers to yum repos with rhel6.3>
4. on both run: yum -y update 'ipa*'
5. on replica run: ipa-replica-manage force-sync --from=$MASTER --password="$ROOTPWD"

Actual results:
Shortly after, if not during, the force-sync, dirsrv on the master is stopped. Looking at the log, I see an ns-slapd segfault in /var/log/messages:

May 12 15:03:57 qe-blade-12 kernel: ns-slapd[11425]: segfault at 30008 ip 00007f1c1b54d276 sp 00007f1bfaff0840 error 4 in libreplication-plugin.so[7f1c1b526000+7d000]

Right around the force-sync and segfault, on the replica, I see this in /var/log/dirsrv/slapd-$INSTANCE/errors:

[12/May/2012:15:03:54 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Replication bind with GSSAPI auth resumed
[12/May/2012:15:03:54 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Consumer failed to replay change (uniqueid 97259e87-9c4f11e1-b596ca1b-778d212c, CSN 4fae8fef000100030000): Operations error. Will retry later.
[12/May/2012:15:03:55 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Consumer failed to replay change (uniqueid d30b348a-9c4c11e1-b596ca1b-778d212c, CSN 4fae8ff9000000030000): Can't contact LDAP server. Will retry later.
[12/May/2012:15:03:56 -0400] NSMMReplicationPlugin - agmt="cn=meToqe-blade-12.testrelm.com" (qe-blade-12:389): Warning: unable to send endReplication extended operation (Can't contact LDAP server)

Expected results:
The replica syncs without crashing dirsrv on the master.

Additional info:
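The kernel segfault line in the description can be decoded to find where inside libreplication-plugin.so the crash happened. The following is a minimal sketch, not part of the original report: the struct and function names (segv_info, parse_segv, segv_offset) are hypothetical, and it assumes the usual x86 line layout shown in /var/log/messages above. Subtracting the mapping base from the instruction pointer gives the offset inside the library, and the "error" value is the x86 page-fault error code.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fields of a kernel "segfault at ..." line (all numbers are hex).
 * Hypothetical helper type, for illustration only. */
struct segv_info {
    uint64_t addr;   /* faulting address */
    uint64_t ip;     /* instruction pointer at the fault */
    uint64_t sp;     /* stack pointer at the fault */
    unsigned error;  /* x86 page-fault error code */
    char lib[64];    /* mapped object that contains ip */
    uint64_t base;   /* start address of that mapping */
    uint64_t size;   /* length of that mapping */
};

/* Bits of the x86 page-fault error code. */
enum {
    PF_PROT  = 1 << 0, /* 0: page not present, 1: protection violation */
    PF_WRITE = 1 << 1, /* 0: read access,      1: write access */
    PF_USER  = 1 << 2  /* 0: kernel mode,      1: user mode */
};

/* Parse one log line; returns 0 on success, -1 if the line does not match. */
static int parse_segv(const char *line, struct segv_info *out)
{
    const char *p = strstr(line, "segfault at ");
    if (p == NULL)
        return -1;
    int n = sscanf(p,
                   "segfault at %" SCNx64 " ip %" SCNx64 " sp %" SCNx64
                   " error %x in %63[^[][%" SCNx64 "+%" SCNx64,
                   &out->addr, &out->ip, &out->sp, &out->error,
                   out->lib, &out->base, &out->size);
    return n == 7 ? 0 : -1;
}

/* Offset of the faulting instruction inside the mapped library. */
static uint64_t segv_offset(const struct segv_info *s)
{
    return s->ip - s->base;
}
```

For the line above, the offset is 0x7f1c1b54d276 - 0x7f1c1b526000 = 0x27276, and error 4 means a user-mode read of an unmapped page. With matching debuginfo installed, an offset like this can be passed to a tool such as addr2line (-e with the library path) to get a source location.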
Are you getting a core dump? Can you get a backtrace of the crashed 389-ds instance?
Created attachment 584409: Stack trace
I ran into a similar problem. This patch is supposed to fix the bug. The current llist code does not update list->tail correctly when the removed node is the tail; it only resets list->tail to NULL once the list has become empty. I'm going to rebuild 389-ds-base with this patch and others in 1.2.10.2-12 once our reliability test passes.

diff --git a/ldap/servers/plugins/replication/llist.c b/ldap/servers/plugins/rep
index e80f532..05cfa48 100644
--- a/ldap/servers/plugins/replication/llist.c
+++ b/ldap/servers/plugins/replication/llist.c
@@ -165,14 +165,14 @@ void* llistRemoveCurrentAndGetNext (LList *list, void **it
     if (node) {
         prevNode->next = node->next;
+        if (list->tail == node) {
+            list->tail = prevNode;
+        }
         _llistDestroyNode (&node, NULL);
         node = prevNode->next;
         if (node) {
             return node->data;
         } else {
-            if (list->head->next == NULL) {
-                list->tail = NULL;
-            }
             return NULL;
         }
     }

Thread 1 (Thread 0x7fc84e1fc700 (LWP 18031)):
#0  0x00007fc86b0e8276 in csnplInsert (csnpl=0x7fc838008090, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/csnpl.c:155
        rc = <value optimized out>
        csnplnode = 0x30000
        csn_str = "\000\000\000\000\000\000\000\000\246A\020k\310\177\000\000p\344\t\001"
#1  0x00007fc86b1051ac in ruv_add_csn_inprogress (ruv=0x147f5f0, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/repl5_ruv.c:1438
        replica = 0x7fc8380044e0
        csn_str = "\024\000\000\000\000\000\000\000\300\067C\001\000\000\000\000p\203M\001"
        rc = 0
#2  0x00007fc86b0fa08c in process_operation (pb=<value optimized out>, csn=0x7fc8280014a0) at ldap/servers/plugins/replication/repl5_plugins.c:1316
        r_obj = 0x1442c30
        r = <value optimized out>
        ruv_obj = 0x11e8120
        ruv = <value optimized out>
        rc = <value optimized out>
#3  0x00007fc86b0fa683 in multimaster_preop_modify (pb=0x14d8370) at ldap/servers/plugins/replication/repl5_plugins.c:452
        csn = 0x7fc8280014a0
        target_uuid = 0x7fc828000e40 "d30b348a-9c4c11e1-b596ca1b-778d212c"
        drc = <value optimized out>
        ctrlp = 0x7fc828002b90
        sessionid = "conn=19 op=9", '\000' <repeats 12 times>, " j\0 [...]
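To illustrate the kind of bug the patch above addresses, here is a small self-contained sketch of a singly linked list with a sentinel head and a tail pointer. It is not the 389-ds llist code: the type and function names (List, Node, list_new, list_append, list_remove_after) are hypothetical, and the append logic is an assumption made for the example. Only the tail update in list_remove_after mirrors the patched lines. If that update is skipped when the last node is removed, list->tail keeps pointing at freed memory and the next append writes through it.

```c
#include <stdlib.h>

/* Hypothetical list types, for illustration only. */
typedef struct Node {
    void *data;
    struct Node *next;
} Node;

typedef struct List {
    Node *head; /* sentinel node, carries no data */
    Node *tail; /* last node in the chain; the sentinel when the list is empty */
} List;

/* Create an empty list; returns NULL on allocation failure. */
static List *list_new(void)
{
    List *l = calloc(1, sizeof *l);
    if (l == NULL)
        return NULL;
    l->head = calloc(1, sizeof *l->head);
    if (l->head == NULL) {
        free(l);
        return NULL;
    }
    l->tail = l->head;
    return l;
}

/* Append data at the tail; returns 0 on success, -1 on allocation failure. */
static int list_append(List *l, void *data)
{
    Node *n = calloc(1, sizeof *n);
    if (n == NULL)
        return -1;
    n->data = data;
    l->tail->next = n; /* relies on tail being valid: this is what a stale tail breaks */
    l->tail = n;
    return 0;
}

/* Remove the node that follows prev and return the data of the node that
 * takes its place, or NULL if there is none. */
static void *list_remove_after(List *l, Node *prev)
{
    Node *node = prev->next;
    if (node == NULL)
        return NULL;
    prev->next = node->next;
    if (l->tail == node) {
        l->tail = prev; /* same idea as the patch: fix the tail before freeing */
    }
    free(node);
    return prev->next ? prev->next->data : NULL;
}
```

In this sketch the tail is repaired before the node is destroyed, regardless of whether other nodes remain. The removed code in the diff only reset the tail once the list was completely empty, which leaves the case of removing the last node of a non-empty list uncovered.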
Upstream ticket: https://fedorahosted.org/389/ticket/359
Verified.

Version ::
ipa-server-2.2.0-14.el6.x86_64
389-ds-base-1.2.10.2-12.el6.x86_64

Automated Test Results ::
automation not yet run from beaker. This was manually executed...

On MASTER:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: upgrade_bz_821176: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [15:34:05] :: Machine in recipe is MASTER
:: [15:34:06] :: Restarting IPA services
Restarting Directory Service
Shutting down dirsrv:
    PKI-IPA... [ OK ]
    TESTRELM-COM... [ OK ]
Starting dirsrv:
    PKI-IPA... [ OK ]
    TESTRELM-COM... [ OK ]
Restarting KDC Service
Stopping Kerberos 5 KDC: [ OK ]
Starting Kerberos 5 KDC: [ OK ]
Restarting KPASSWD Service
Stopping Kerberos 5 Admin Server: [ OK ]
Starting Kerberos 5 Admin Server: [ OK ]
Restarting DNS Service
Stopping named: . [ OK ]
Starting named: [ OK ]
Restarting MEMCACHE Service
Stopping ipa_memcached: [ OK ]
Starting ipa_memcached: [ OK ]
Restarting HTTP Service
Stopping httpd: [ OK ]
Starting httpd: [Thu May 17 15:34:23 2012] [warn] worker ajp://localhost:9447/ already used by another worker
[Thu May 17 15:34:23 2012] [warn] worker ajp://localhost:9447/ already used by another worker
[ OK ]
Restarting CA Service
Stopping pki-ca: [ OK ]
Starting pki-ca: [ OK ]
:: [ PASS ] :: Running 'ipactl restart'
result_server not set, assuming developer mode.
Setting 192.168.122.101 to state upgrade_bz_821176.18.1
:: [ PASS ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.1' -m 192.168.122.101'
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines ['192.168.122.102'] are in one of these states: ['upgrade_bz_821176.18.2']
:: [ PASS ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.2' 192.168.122.102'
:: [15:36:17] :: Checking /var/log/messages for ns-slapd segfault
:: [ PASS ] :: BZ 821176 not found. No ns-slapd segfault found in /var/log/messages
:: [15:36:17] :: Checking /var/log/dirsrv/slapd-TESTRELM-COM/errors for LDAP error
:: [ PASS ] :: BZ 821176 not found...didn't find LDAP error in dirsrv log
result_server not set, assuming developer mode.
Setting 192.168.122.101 to state upgrade_bz_821176.18.3
:: [ PASS ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.3' -m 192.168.122.101'

On REPLICA:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ LOG ] :: upgrade_bz_821176: ns-slapd segfault in libreplication-plugin after IPA upgrade from 2.1.3 to 2.2.0
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [15:34:14] :: Machine in recipe is SLAVE
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines ['192.168.122.101'] are in one of these states: ['upgrade_bz_821176.18.1']
:: [ PASS ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.1' 192.168.122.101'
:: [15:36:04] :: Running ipa-replica-manage force-sync to make sure that works
ipa: INFO: Setting agreement cn=meTospoore-dvm2.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meTospoore-dvm2.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config
:: [ PASS ] :: Running 'ipa-replica-manage force-sync --from=spoore-dvm1.testrelm.com --password=Secret123'
result_server not set, assuming developer mode.
Setting 192.168.122.102 to state upgrade_bz_821176.18.2
:: [ PASS ] :: Running 'rhts-sync-set -s 'upgrade_bz_821176.18.2' -m 192.168.122.102'
result_server not set, assuming developer mode.
Enter STATE:STATE:etc. when the following machines ['192.168.122.101'] are in one of these states: ['upgrade_bz_821176.18.3']
:: [ PASS ] :: Running 'rhts-sync-block -s 'upgrade_bz_821176.18.3' 192.168.122.101'

Manual Test Results ::

# grep -i segfault /var/log/messages
#
# grep -i "NSMMReplicationPlugin.*Warning: unable to send endReplication extended operation.*Can't contact LDAP server" /var/log/dirsrv/slapd-TESTRELM-COM/errors
#
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
This bug was introduced by the fix for Bug 819643 - "Database RUV could mismatch the one in changelog under the stress" which is in the same errata.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0813.html