Bug 870158

Summary: slapd entered to infinite loop during new index addition
Product: Red Hat Enterprise Linux 6 Reporter: Noriko Hosoi <nhosoi>
Component: 389-ds-baseAssignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact: Sankar Ramalingam <sramling>
Severity: unspecified Docs Contact:
Priority: high    
Version: 6.4CC: jgalipea, nkinder, nsoman, pspacek, syeghiay
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.2.11.15-3.el6 Doc Type: Bug Fix
Doc Text:
Cause: When MMR is under the heavy stress, there was a chance to delete an entry already deleted. Consequence: It breaks the entry's parent/child relationship. Fix: This fix adds an missing check if the retrieved entry is already deleted or not. Result: There is no more chance to break the entry's parent/child relationship.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-21 08:21:16 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Noriko Hosoi 2012-10-25 16:40:42 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/494

During IPA 2.2.0->3.0.0 upgrade various indexes are added. During index addition 389 locks in a infinite loop. Stacktraces etc. attached.

Comment 1 Noriko Hosoi 2012-10-26 17:09:14 UTC
Committed to the external 389-ds-base-1.2.11 branch.

Comment 8 Namita Soman 2012-11-29 20:16:14 UTC
Started to take steps below to verify:
1. install ipa-server 2.2 on a machine using hostname mgmt13.e.org
2. from https://fedorahosted.org/389/ticket/494, downloaded E-ORG-userRoot-2012_10_15_184158.ldif​
3. import this ldif into the ipa-server

# hostname
mgmt13.e.org

# ipactl stop
<all services stopped successfully>

# /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot
Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_11_29_151133.ldif
[29/Nov/2012:15:11:33 -0500] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit).  Server will use a setting of 4096.
[29/Nov/2012:15:11:33 -0500] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit).  Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_11_29_151133.ldif
[29/Nov/2012:15:11:33 -0500] - export userRoot: Processed 224 entries (100%).
[29/Nov/2012:15:11:33 -0500] - All database threads now stopped

Ques: Should I be running the command, and if yes - should it be as above? 
The steps I plan to take to verify this:
After importing the data, upgrade IPA to 3.0, and verify it upgrades successfully.

Comment 9 Petr Spacek 2012-11-30 10:48:52 UTC
First of all - the problem is/was a race condition, so any test results can't be taken as definitive ...

This problem can potentially be reproduced with following files:

- Import following DNS zone to IPA:
https://fedorahosted.org/389/attachment/ticket/494/e.org.ldif

- Download following scripts to some directory:
https://fedorahosted.org/389/attachment/ticket/494/loop.sh
https://fedorahosted.org/389/attachment/ticket/494/nsupdate_add_del_gen.py

- Modify NPROC value in loop.sh to 10 or so

- kinit admin
- run `bash loop.sh`
- let it running for several hours or before the script finishes

These scripts try to do chaotically 10000*NPROC DNS updates in DNS zone e.org. The whole thing runs in NPROC processes in parallel. It should put 389 and BIND under heavy load and hopefully it will triger the right condition. Nothing is sure...

I would recommend to use latest version of IPA and bind-dyndb-ldap (ideally version which goes to RHEL 6.4). Persistent search and serial_autoincrement should be enabled in /etc/named.conf before the test. Both of them should be enabled in default IPA 3.0 install, please, check it to be sure.

Expected results:
- zone e.org contain some DNS names like c<number>
- associated TXT records have randomly generated values
- BIND not crashed during the whole test (please let me know if it thows some error messages or somethig suspicious)
- 389 is able to export the database

Comment 10 Namita Soman 2012-12-03 04:05:07 UTC
Ran the script. 
Didn't see any hangs or errors. /var/log/messages had messages like:
Nov 30 16:40:36 cloud-qe-10 named[1634]: update_record (psearch) failed, dn 'idnsName=c55,idnsName=e.org,cn=dns,dc=testrelm,dc=com' change type 0x1. Records can be outdated, run `rndc reload`: not found

Went through list of expected results as listed above - 
Zone e.org contained  DNS names like c<number>
Associated TXT records had randomly generated values
There were 100 entries, and an example:
  Record name: c96
  TXT record: "19"
# service named status 
indicated it was still running

todo: 389 is able to export db

Comment 11 Petr Spacek 2012-12-03 08:26:53 UTC
(In reply to comment #10)
> Didn't see any hangs or errors. /var/log/messages had messages like:
> Nov 30 16:40:36 cloud-qe-10 named[1634]: update_record (psearch) failed, dn
> 'idnsName=c55,idnsName=e.org,cn=dns,dc=testrelm,dc=com' change type 0x1.
> Records can be outdated, run `rndc reload`: not found
This particular error message is acceptable. Records are added and deleted too quickly and BIND is not able to process them at full speed.

To be more specific:
1) Some record is added to LDAP
2) BIND receives change notification and saves this notification to internal "to be processed" queue
3) The same record is deleted from LDAP
4) BIND pulls the notification from internal queue and tries to load new data from LDAP - but there are no data anymore


> todo: 389 is able to export db
This will break the tie.

Comment 12 Namita Soman 2012-12-03 13:28:11 UTC
was able to export the db:
# /var/lib/dirsrv/scripts-TESTRELM-COM/db2ldif -a /tmp/output.ldif -s "dc=testrelm,dc=com"
Exported ldif file: /tmp/output.ldif
[03/Dec/2012:08:24:40 -0500] - /etc/dirsrv/slapd-TESTRELM-COM/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit).  Server will use a setting of 4096.
[03/Dec/2012:08:24:40 -0500] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit).  Server will use a setting of 4096.
[03/Dec/2012:08:24:40 -0500] - Backend Instance(s): 
[03/Dec/2012:08:24:40 -0500] - 	userRoot
ldiffile: /tmp/output.ldif
[03/Dec/2012:08:24:43 -0500] - export userRoot: Processed 365 entries (100%).
[03/Dec/2012:08:24:44 -0500] - All database threads now stopped

# ls -l /tmp/output.ldif 
-rw------- 1 dirsrv dirsrv 239692 Dec  3 08:24 /tmp/output.ldif

Comment 13 Namita Soman 2012-12-03 13:44:51 UTC
verfied using:
ipa-server-3.0.0-8.el6.x86_64
389-ds-base-1.2.11.15-4.el6.x86_64
bind-dyndb-ldap-2.3-1.el6.x86_64

Comment 14 errata-xmlrpc 2013-02-21 08:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html