Red Hat Bugzilla – Bug 1257543
slapd crash in do_search
Last modified: 2016-05-30 08:31:40 EDT
Description of problem:
slapd crashes under certain high load conditions.
Version-Release number of selected component (if applicable):
The crashes coincide with the introduction of this version of openldap-servers; the previous version, from 6.6, did not have this issue. Reproducing it, however, requires setting up a top-level BDII, which is not trivial.
Steps to Reproduce:
1. Install a top BDII on CentOS 6.7
2. Wait for it to crash
An ABRT report was created, but it's too big to attach here. Here is the URL:
Now that a new version of openldap-servers
has been released as a security fix, analysing this issue is becoming quite urgent, because the new version of openldap-servers will be installed automatically as a security update on all our installations.
Could you please check this problem ASAP?
A recent (bzipped) core dump (with openldap-servers-2.4.40-6.el6):
I went through the ABRT report and there is an openldap config file missing (I presume this is due to its non-standard location); it should be in /etc/bdii according to the command that was run.
What is more, I am not sure what the software you mention (BDII-top) is or does. The crash quite likely also depends on the data being transferred, which in turn depends on the software's configuration (which I cannot find in the sosreport).
Created attachment 1080241 [details]
This is a standard slapd configuration that we have in our top bdii
To elaborate a bit: the BDII is a software package used in the scientific grid computing communities to collect information about available grid resources around the world. The underlying implementation is an openldap directory. There is a hierarchy of BDIIs holding information from local resources, through local sites, up to the global level. The last category is called the 'top BDII' and is the largest, as it holds all the information. There are dozens of top BDIIs around the world that contain the same information.
More background here: http://gridinfo.web.cern.ch/information-system-sys-admins
One of the grid admins has already narrowed down the problem to the use of the relay backend, with the o=shadow vs. o=grid massaging. Removing that from the config seems to result in no more crashes.
We have reports of slapd crashes only when the slapd conf includes relay databases with overlay rwm, so it looks like the crash does not depend on the suffixmassage. I'm going to attach the other configuration file which leads to a slapd crash.
Created attachment 1081029 [details]
another slapd conf file generating crashes
Have you had a chance to look at this problem since we attached the config file?
It looks like I have found something. Thanks to the hint about back-relay, and notably thanks to the core dumps above, I have found an upstream commit between versions 2.4.39 and 2.4.40 that introduces a function which is called in the core dumps (both show it called from back-relay). In some cases this function seems to call a pointer as a function, which results in SIGSEGV (note that it is not a null pointer, so it got mangled somehow).
If I built packages without this particular function, would you be willing to test them for me? As this affects searches only, it should not be dangerous, but please be aware that it might break something. It would be best to use one of your testing environments.
Thanks a lot for your effort. Yes, if you provide us with the packages, we will be really happy to try them on our testing instances.
Yes, we can test this in a testbed setup. It would take some time to determine the stability, as the crash usually takes some time (between a few hours and a few days) to manifest.
I'm quite surprised it takes so long to get your BDII segfaulting. I can easily make it coredump by running a search query - like lcg-infosites (that's how I produced the core dump).
(In reply to Andreas Haupt from comment #13)
> Hi Dennis,
> I'm quite surprised it takes so long to get your BDII segfaulting. I can
> easily make it coredump by running a search query.
OK, I didn't try that. I have only witnessed 'spontaneous' crashes, but they are probably caused by queries run from the outside.
(In reply to Matus Honek from comment #10)
Hi, I've got two questions/requests:
The bug severely affects our software (see http://bugzilla.nordugrid.org/show_bug.cgi?id=3504) so we'd love to test the potential fix as well. Where can one find the test build?
And another question: is there a reason why it has not been reported upstream? Or has it? (I couldn't find anything resembling it in the OpenLDAP ITS.)
Do you have an update on this? We would like to know whether it is possible to get an rpm with the fix you mentioned for testing. We are evaluating possible workarounds, but we are still stuck and cannot move our installations to openldap v2.4.40.
We privately received from Matus a new build of openldap with a workaround/fix for this problem.
We have installed it at CERN on a TopBDII and it looks fine.
Dennis, Andreas, if you have time, could you also test the new rpms on your testing nodes?
Updated the packages on our test bdii node. Also reverted to the original bdii configuration (with the "o=shadow" relay db enabled again).
It looks really promising so far! I ran a couple of 'lcg-infosites' requests against the patched top-level bdii without any crashes. These resulted in slapd segfaults with the broken version.
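For anyone wanting to repeat this kind of check without lcg-infosites, here is a minimal sketch that fires plain anonymous LDAP searches at a BDII and records the exit codes. It is not the procedure used above. It assumes that `ldapsearch` from openldap-clients is installed, that the BDII listens on port 2170, and that `o=grid` is a valid search base; the host name is a placeholder and the helper names `build_search_command` and `run_searches` are made up for illustration.

```python
import subprocess


def build_search_command(host, port=2170, base="o=grid"):
    """Return an ldapsearch argv for an anonymous search.

    Assumptions: port 2170 and base "o=grid" match the BDII under test.
    """
    return [
        "ldapsearch",
        "-x",                           # simple (anonymous) bind
        "-LLL",                         # terse LDIF output
        "-H", f"ldap://{host}:{port}",  # server URI
        "-b", base,                     # search base
    ]


def run_searches(host, count=10, runner=subprocess.run):
    """Run the search `count` times and return the list of exit codes.

    `runner` is injectable so the loop can be exercised without a server.
    """
    codes = []
    for _ in range(count):
        result = runner(
            build_search_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes.append(result.returncode)
    return codes
```

Calling `run_searches("bdii.example.org", count=20)` against a test node and looking for non-zero exit codes gives a rough signal that slapd stopped answering; whether that was a segfault still has to be confirmed from the core dump or the system logs.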
thanks a lot!
I also got confirmation from ARC that this build of openldap fixes the crash on ARC-CEs.
Matus, do you think this patch can be released? And how long will it take?
We have been testing the new rpms quite intensively these days, and we can definitely say that our services are working fine with that build of openldap.
Can we do something in order to speed up the integration and the release of that version by Red Hat?
Sorry to bother you again; is there any news regarding the integration and release of this change in openldap?
Related to this, I have also seen that a new version of openldap for RHEL 7 (2.4.40-8.el7) is now in CentOS 7 (we are starting to support this OS as well). I haven't tested it yet, so I don't know whether this problem also appears there, but this change could also be applied to the openldap released in RHEL 7.
I am sorry for not answering sooner. The fix is proposed for rhel-6.8 and should be included with its release.
Also, I should clone this bugzilla for rhel-7.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1318904 has been marked as a duplicate of this bug. ***