Bug 1257543

Summary: slapd crash in do_search
Product: Red Hat Enterprise Linux 6
Reporter: Dennis van Dok <dennisvd>
Component: openldap
Assignee: Matus Honek <mhonek>
Status: CLOSED ERRATA
QA Contact: Patrik Kis <pkis>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.7
CC: andrea.manzi, andreas.haupt, anrussel, dennisvd, mhonek, nkinder, oxana.smirnova, pkis, sgadekar, simon.fayer05, skremen
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openldap-2.4.40-8.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1316450
Environment:
Last Closed: 2016-05-11 00:59:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1272422, 1316450, 1665441
Attachments:
slapd conf (Flags: none)
another slapd conf file generating crashes (Flags: none)

Description Dennis van Dok 2015-08-27 09:55:54 UTC
Description of problem:

slapd crashes under certain high load conditions.

Version-Release number of selected component (if applicable):

openldap-servers-2.4.40-5.el6

How reproducible:

The crashes started with the introduction of this version of openldap-servers; the version shipped with 6.6 did not have this issue. Reproducing it, however, requires setting up a top-level BDII, which is not trivial.

http://www.eu-emi.eu/releases/emi-3-montebianco/products/-/asset_publisher/5dKm/content/bdii-top-2

Steps to Reproduce:
1. Install a top BDII on CentOS 6.7
2. Wait for it to crash


Additional info:

An ABRT report was created, but it's too big to attach here. Here is the URL:

http://www.nikhef.nl/~dennisvd/ccpp-2015-08-27-11:10:14-27199.tar.gz

Comment 2 Andrea 2015-10-06 08:47:27 UTC
A new version of openldap-servers,

openldap-servers-2.4.40-6.el6

has now been released as a security fix. Analysing this issue is therefore becoming quite urgent, because the new version of openldap-servers is going to be installed automatically as a security update on all our installations.

Could you please check this problem ASAP?

Comment 3 Andreas Haupt 2015-10-06 10:18:56 UTC
A recent (bzipped) core dump (with openldap-servers-2.4.40-6.el6):

https://desycloud.desy.de/public.php?service=files&t=4bef5af0a89f1d9d73011fbc44907bb9&download

Comment 4 Matus Honek 2015-10-06 10:37:59 UTC
I went through the ABRT report and the openldap config file is missing (I presume because it is in a non-standard location); according to the command that was run, it should be in /etc/bdii.
What is more, I am not sure what the software you mention (BDII-top) is or does. The crash quite likely also depends on the data being transferred, which in turn depends on the software's own configuration (which I cannot find in the sosreport).

Comment 5 Andrea 2015-10-06 13:15:09 UTC
Created attachment 1080241 [details]
slapd conf

This is a standard slapd configuration that we have in our top BDII.

Comment 6 Dennis van Dok 2015-10-06 13:34:53 UTC
To elaborate a bit: the BDII is a software package used in the scientific grid computing communities to collect information about available grid resources around the world. The underlying implementation is an openldap directory. There is a hierarchy of BDIIs holding information from local resources, to local sites, to global. The last category is called the 'top BDII' and is the largest, as it holds all the information. There are dozens of top BDIIs around the world that contain the same information.

More background here: http://gridinfo.web.cern.ch/information-system-sys-admins

One of the grid admins has already narrowed down the problem to the use of the relay backend, with the o=shadow vs. o=grid massaging. Removing that from the config seems to result in no more crashes.

Comment 7 Andrea 2015-10-08 14:16:53 UTC
Hi,
we have reports of slapd crashes only when the slapd conf includes relay databases with overlay rwm, so it looks like it does not depend on the suffixmassage itself. I'm going to attach the other configuration file that leads to a slapd crash.
thanks
Andrea
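
For anyone who wants to check whether a given slapd.conf matches the pattern described above (a relay database that also loads the rwm overlay), here is a minimal Python sketch. It is only an illustration: the function names and the sample configuration are made up for this example and are not taken from the attached files, and the parser handles only the basic slapd.conf line rules (comments, blank lines, continuation lines).

```python
import shlex


def logical_lines(conf_text):
    """Join slapd.conf continuation lines and drop comments and blank lines."""
    lines = []
    in_comment = False
    for raw in conf_text.splitlines():
        if not raw.strip():
            continue  # blank lines are ignored
        if raw[0].isspace():
            # A line starting with whitespace continues the previous line,
            # even when that previous line was a comment.
            if not in_comment and lines:
                lines[-1] += " " + raw.strip()
            continue
        in_comment = raw.startswith("#")
        if not in_comment:
            lines.append(raw.strip())
    return lines


def relay_rwm_suffixes(conf_text):
    """Return the suffixes of 'database relay' sections that use 'overlay rwm'."""
    sections = []
    for line in logical_lines(conf_text):
        words = shlex.split(line)
        if len(words) < 2:
            continue
        key = words[0].lower()
        if key == "database":
            # Each 'database' directive starts a new section.
            sections.append({"type": words[1].lower(), "suffixes": [], "rwm": False})
        elif sections and key == "suffix":
            sections[-1]["suffixes"].append(words[1])
        elif sections and key == "overlay" and words[1].lower() == "rwm":
            sections[-1]["rwm"] = True
    return [
        suffix
        for section in sections
        if section["type"] == "relay" and section["rwm"]
        for suffix in section["suffixes"]
    ]


# Hypothetical sample, NOT the attached configuration.
SAMPLE = '''
database hdb
suffix "o=grid"

database relay
suffix "o=shadow"
overlay rwm
suffixmassage "o=shadow" "o=grid"
'''

print(relay_rwm_suffixes(SAMPLE))
```

A non-empty result means the configuration contains at least one relay database with the rwm overlay, which is the combination reported to crash here.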

Comment 8 Andrea 2015-10-08 14:19:39 UTC
Created attachment 1081029 [details]
another slapd conf file generating crashes

Comment 9 Andrea 2015-10-14 14:27:57 UTC
Hi,
have you had a chance to look at this problem since we attached the config file?
thanks
Andrea

Comment 10 Matus Honek 2015-10-14 18:30:08 UTC
Hi Andrea,

It looks like I have found something. Thanks to the hint about back-relay, and notably thanks to the core dumps above, I have found an upstream commit between versions 2.4.39 and 2.4.40 that introduces a function which is called in the core dumps (both of which show it being called from back-relay). In some cases this function seems to call a pointer as a function, which results in SIGSEGV (note that it is not a null pointer, so it somehow got mangled).

If I built packages without this function, would you be willing to test them for me? As this only concerns searches, it should not be dangerous, but please be aware that it might break something. It would be best to use a testing environment of yours.

Thank you.

Comment 11 Andrea 2015-10-14 21:01:59 UTC
Hi Matus, 
thanks a lot for your effort. Yes, if you provide us with the packages we will be really happy to try them on our testing instances.

cheers
Andrea

Comment 12 Dennis van Dok 2015-10-14 21:13:13 UTC
Hi Matus,

yes, we can test this in a testbed setup. It would take some time to determine the stability, as the crash usually takes some time (between a few hours and a few days) to manifest.

Thanks,

Dennis

Comment 13 Andreas Haupt 2015-10-15 06:22:55 UTC
Hi Dennis,

I'm quite surprised it takes so long to get your BDII segfaulting. I can easily make it coredump by running a search query - like lcg-infosites (that's how I produced the core dump).

Cheers,
Andreas
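
A plain LDAP search against the BDII could be scripted roughly as in the Python sketch below. This is an assumption-laden illustration, not the query that lcg-infosites actually issues: the host name is a placeholder, port 2170 and the o=grid base are assumed defaults for a BDII, and the helper names are invented for this example. It relies on the OpenLDAP `ldapsearch` client being installed.

```python
import subprocess


def build_search_cmd(host, port=2170, base="o=grid", ldap_filter="(objectClass=*)"):
    """Build an anonymous ldapsearch command line (nothing is executed here).

    host, port and base are assumptions; adjust them to the BDII under test.
    """
    return [
        "ldapsearch",
        "-x",                              # simple (anonymous) bind
        "-LLL",                            # plain LDIF output, no comments
        "-H", f"ldap://{host}:{port}",     # server URI
        "-b", base,                        # search base
        ldap_filter,
    ]


def run_search(host, **kwargs):
    """Run the search once and return the exit status of ldapsearch."""
    result = subprocess.run(
        build_search_cmd(host, **kwargs),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# Only prints the command; call run_search() against a test instance to execute it.
print(" ".join(build_search_cmd("bdii.example.org")))
```

A non-zero return value from run_search() after earlier successful runs would be a hint that slapd has gone away, but it is no proof of a segfault; the core dump or the slapd log still has to be checked.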

Comment 14 Dennis van Dok 2015-10-15 14:01:37 UTC
(In reply to Andreas Haupt from comment #13)
> Hi Dennis,
> 
> I'm quite surprised it takes so long to get your BDII segfaulting. I can
> easily make it coredump by running a search query.

OK, I didn't try that. I have only witnessed 'spontaneous' crashes, but they are probably due to queries run from the outside.

Comment 17 Oxana Smirnova 2015-10-23 17:14:42 UTC
(In reply to Matus Honek from comment #10)

Hi, I've got two questions/requests:

The bug severely affects our software (see http://bugzilla.nordugrid.org/show_bug.cgi?id=3504) so we'd love to test the potential fix as well. Where can one find the test build?

And another question: is there a reason why it is not reported to upstream? Or is it? (Couldn't find anything resembling it in the OpenLDAP ITS).

Cheers,
Oxana

Comment 19 Andrea 2015-11-09 13:02:23 UTC
Hi Matus,
do you have an update on this? We would like to know whether it is possible to get an rpm with the fix you mentioned for testing. We are evaluating possible workarounds, but we are still stuck and we cannot move our installations to openldap v2.4.40.
thanks
cheers
Andrea

Comment 20 Andrea 2015-12-04 09:35:12 UTC
Hi all,
we privately received from Matus a new build of openldap with a workaround/fix for this problem.
We have installed it at CERN on a TopBDII and it looks fine.

Dennis, Andreas, if you have time, could you also test the new rpms on your testing nodes?

https://drive.google.com/file/d/0B0VkVqWTkgPjblczMkM4dWZPZkE/view?usp=sharing

thanks!
cheers
Andrea

Comment 21 Andreas Haupt 2015-12-04 13:54:05 UTC
Hi Andrea,

I updated the packages on our test bdii node. I also reverted to the original bdii configuration (with the "o=shadow" relay db enabled again).

It looks really promising so far! I ran a couple of 'lcg-infosites' requests against the patched top-level bdii without any crashes. These resulted in slapd segfaults with the broken version.

Cheers,
Andreas

Comment 22 Andrea 2015-12-04 15:15:03 UTC
Hi Andreas,
thanks a lot!
I also got confirmation from ARC that this build of openldap fixes the crash on ARC-CE installations.

Matus, do you think this patch can be released? And how long will it take?

thanks!
cheers
Andrea

Comment 23 Andrea 2015-12-14 08:51:30 UTC
Hi Matus,
we have been testing the new rpms quite intensively these days, and we can definitely say that our services are working fine with that build of openldap.

Can we do something in order to speed up the integration and the release of that version by Red Hat?
thanks
cheers
Andrea

Comment 24 Andrea 2016-01-19 12:48:36 UTC
Hello, 
sorry to bother you again. Is there any news regarding the integration and release of this change in openldap?
Related to this, I have also seen that a new version of openldap for RHEL 7 (2.4.40-8.el7) is in CentOS 7 now (we are starting to support this OS as well). I haven't tested it yet, so I don't know whether this problem will also appear there, but this change may need to be applied to the openldap released in RHEL 7 as well.
thanks
cheers
Andrea

Comment 25 Matus Honek 2016-01-19 13:46:58 UTC
Hello Andrea,

I am sorry for not answering sooner. The fix is proposed for rhel-6.8 and should be included with its release.
Also, I should clone this bugzilla for rhel-7.

Regards,
Matus

Comment 33 errata-xmlrpc 2016-05-11 00:59:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0943.html

Comment 34 Matus Honek 2016-05-30 12:31:40 UTC
*** Bug 1318904 has been marked as a duplicate of this bug. ***