Bug 1257543 - slapd crash in do_search
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: openldap
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Matus Honek
QA Contact: Patrik Kis
Duplicates: 1318904
Depends On:
Blocks: 1272422 1316450
Reported: 2015-08-27 05:55 EDT by Dennis van Dok
Modified: 2016-05-30 08:31 EDT (History)

See Also:
Fixed In Version: openldap-2.4.40-8.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1316450
Environment:
Last Closed: 2016-05-10 20:59:16 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
slapd conf (4.06 KB, text/plain), 2015-10-06 09:15 EDT, Andrea
another slapd conf file generating crashes (2.57 KB, text/plain), 2015-10-08 10:19 EDT, Andrea

Description Dennis van Dok 2015-08-27 05:55:54 EDT
Description of problem:

slapd crashes under certain high load conditions.

Version-Release number of selected component (if applicable):

openldap-servers-2.4.40-5.el6

How reproducible:

The crash has appeared consistently since this version of openldap-servers was introduced; the previous version, shipped with 6.6, did not have this issue. Reproducing it, however, requires setting up a top-level BDII, which is not trivial.

http://www.eu-emi.eu/releases/emi-3-montebianco/products/-/asset_publisher/5dKm/content/bdii-top-2

Steps to Reproduce:
1. Install a top BDII on CentOS 6.7
2. Wait for it to crash


Additional info:

An ABRT report was created, but it's too big to attach here. Here is the URL:

http://www.nikhef.nl/~dennisvd/ccpp-2015-08-27-11:10:14-27199.tar.gz
Comment 2 Andrea 2015-10-06 04:47:27 EDT
Now that a new version of openldap-servers,

openldap-servers-2.4.40-6.el6

has been released as a security fix, analysing this issue is becoming quite urgent, because the new version is going to be installed automatically as a security update on all our installations.

Could you please check this problem ASAP?
Comment 3 Andreas Haupt 2015-10-06 06:18:56 EDT
A recent (bzipped) core dump (with openldap-servers-2.4.40-6.el6):

https://desycloud.desy.de/public.php?service=files&t=4bef5af0a89f1d9d73011fbc44907bb9&download
Comment 4 Matus Honek 2015-10-06 06:37:59 EDT
I went through the ABRT report and the openldap config file is missing (I presume this is due to its non-standard location); according to the command that was run, it should be in /etc/bdii.
What is more, I am quite confused about what the software you mention (BDII-top) is or does. The crash might also, quite likely, depend on the data being transferred, which in turn depends on the software's own configuration (which I have not been able to find in the sosreport).
Comment 5 Andrea 2015-10-06 09:15 EDT
Created attachment 1080241 [details]
slapd conf

This is a standard slapd configuration that we have in our top bdii
Comment 6 Dennis van Dok 2015-10-06 09:34:53 EDT
To elaborate a bit: the BDII is a software package used in the scientific grid computing communities to collect information about available grid resources around the world. The underlying implementation is an openldap directory. There is a hierarchy of BDIIs holding information from local resources, through local sites, up to the global level. The last category is called the 'top BDII' and is the largest, as it holds all information. There are dozens of top BDIIs around the world that contain the same information.

More background here: http://gridinfo.web.cern.ch/information-system-sys-admins

One of the grid admins has already narrowed down the problem to the use of the relay backend, with the o=shadow vs. o=grid massaging. Removing that from the config seems to result in no more crashes.
Comment 7 Andrea 2015-10-08 10:16:53 EDT
Hi,
we have reports of slapd crashes only when the slapd conf includes relay databases with overlay rwm, so it looks like it does not depend on the suffixmassage. I'm going to attach the other configuration file which leads to a slapd crash.
thanks
Andrea
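
For anyone wanting to check whether their own configuration matches the pattern described above (a relay database that loads overlay rwm), here is a minimal Python sketch. It is not taken from the attached files: the function name and the sample configuration are made up for illustration, and the parsing is deliberately simplified (it ignores slapd.conf continuation lines and include directives).

```python
def relay_databases_with_rwm(conf_text):
    """Return the suffixes of 'database relay' sections that load 'overlay rwm'.

    Simplified parser: one directive per line, no continuation lines,
    no 'include' handling.
    """
    sections = []
    current = None
    for raw in conf_text.splitlines():
        words = raw.split(None, 1)
        if not words or words[0].startswith("#"):
            continue  # blank line or comment
        keyword = words[0].lower()
        arg = words[1].strip() if len(words) > 1 else ""
        if keyword == "database" and arg:
            # A 'database' directive starts a new section.
            current = {"backend": arg.split()[0].lower(), "suffix": None, "rwm": False}
            sections.append(current)
        elif current is None:
            continue  # global directives before the first database
        elif keyword == "suffix" and arg:
            current["suffix"] = arg.strip('"')
        elif keyword == "overlay" and arg and arg.split()[0].lower() == "rwm":
            current["rwm"] = True
    return [s["suffix"] for s in sections if s["backend"] == "relay" and s["rwm"]]


# Hypothetical configuration, NOT the file attached to this bug.
SAMPLE_CONF = '''
database hdb
suffix "o=grid"

database relay
suffix "o=shadow"
overlay rwm
rwm-suffixmassage "o=grid"
'''

print(relay_databases_with_rwm(SAMPLE_CONF))  # prints ['o=shadow']
```

An empty result only means this simplified check found no such section; it does not prove the installation is unaffected.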
Comment 8 Andrea 2015-10-08 10:19 EDT
Created attachment 1081029 [details]
another slapd conf file generating crashes
Comment 9 Andrea 2015-10-14 10:27:57 EDT
Hi,
did you have the possibility to look at this problem after we attached the config file?
thanks
Andrea
Comment 10 Matus Honek 2015-10-14 14:30:08 EDT
Hi Andrea,

it looks like I have found something. Thanks to the hint about back-relay, and notably thanks to the core dumps above, I have found an upstream commit between versions 2.4.39 and 2.4.40 that introduces a function which is called in the core dumps (both of which show it called from back-relay). In some cases this function seems to call a pointer as a function, which results in SIGSEGV (note that it is not a null pointer, so it got mangled somehow).

If I built packages that lack this function, would you be willing to test them for me? As this concerns searches only, it should not be dangerous, but please be aware that it might break something. It would be best to use a testing environment of yours.

Thank you.
Comment 11 Andrea 2015-10-14 17:01:59 EDT
Hi Matus, 
thanks a lot for your effort. Yes, if you provide us with the packages we will be really happy to try them on our testing instances.

cheers
Andrea
Comment 12 Dennis van Dok 2015-10-14 17:13:13 EDT
Hi Matus,

yes, we can test this in a testbed setup. It would take some time to determine the stability, as the crash usually takes some time (between a few hours and a few days) to manifest.

Thanks,

Dennis
Comment 13 Andreas Haupt 2015-10-15 02:22:55 EDT
Hi Dennis,

I'm quite surprised it takes so long to get your BDII segfaulting. I can easily make it coredump by running a search query - like lcg-infosites (that's how I produced the core dump).

Cheers,
Andreas
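
For the longer-running stability tests mentioned above, a small watchdog that periodically sends a search and reports when the server stops answering may help. The following Python sketch assumes the OpenLDAP client tool ldapsearch is installed, that the BDII listens on port 2170 (an assumption, not stated in this report), and uses a made-up host name. The probe is a cheap base-scope search that only checks whether slapd is alive; it is not claimed to trigger the crash the way lcg-infosites does.

```python
import subprocess
import time


def build_search_command(host, port=2170, base="o=grid"):
    """Build an anonymous ldapsearch command line for one probe query.

    Port 2170 and the host name passed in are assumptions for illustration.
    """
    return [
        "ldapsearch",
        "-x",                           # simple (anonymous) bind
        "-LLL",                         # plain LDIF output without comments
        "-H", f"ldap://{host}:{port}",  # server URI
        "-b", base,                     # search base
        "-s", "base",                   # look at the base entry only
        "1.1",                          # request no attributes
    ]


def server_answers(host, timeout=30):
    """Return True if one probe search completes successfully."""
    try:
        result = subprocess.run(
            build_search_command(host), capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def watch(host, interval=60):
    """Probe the server every `interval` seconds until it stops answering."""
    while server_answers(host):
        time.sleep(interval)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "slapd stopped answering on", host)


# Example (hypothetical host name): watch("bdii.example.org")
```

A failed probe can also mean a network problem, so the timestamp is best compared against the slapd logs or ABRT reports.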
Comment 14 Dennis van Dok 2015-10-15 10:01:37 EDT
(In reply to Andreas Haupt from comment #13)
> Hi Dennis,
> 
> I'm quite surprised it takes so long to get your BDII segfaulting. I can
> easily make it coredump by running a search query.

OK, I didn't try that. I have only witnessed 'spontaneous' crashes, but they are probably due to queries run from the outside.
Comment 17 Oxana Smirnova 2015-10-23 13:14:42 EDT
(In reply to Matus Honek from comment #10)

Hi, I've got two questions/requests:

The bug severely affects our software (see http://bugzilla.nordugrid.org/show_bug.cgi?id=3504) so we'd love to test the potential fix as well. Where can one find the test build?

And another question: is there a reason why it is not reported to upstream? Or is it? (Couldn't find anything resembling it in the OpenLDAP ITS).

Cheers,
Oxana
Comment 19 Andrea 2015-11-09 08:02:23 EST
Hi Matus,
do you have an update on this? We would like to understand whether it's possible to get an rpm with the fix you mentioned for testing. We are evaluating possible workarounds, but we are still stuck and we cannot move our installations to openldap v2.4.40.
thanks
cheers
Andrea
Comment 20 Andrea 2015-12-04 04:35:12 EST
Hi all,
we privately received from Matus a new build of openldap with a workaround/fix for this problem.
We have installed it at CERN on a TopBDII and it looks fine.

Dennis, Andreas, if you have time, could you also test the new rpms on your testing nodes?

https://drive.google.com/file/d/0B0VkVqWTkgPjblczMkM4dWZPZkE/view?usp=sharing

thanks!
cheers
Andrea
Comment 21 Andreas Haupt 2015-12-04 08:54:05 EST
Hi Andrea,

updated the packages on our test bdii node. Also reverted back to the original bdii configuration (with "o=shadow" relay db enabled again).

It looks really promising so far! I ran a couple of 'lcg-infosites' requests against the patched top-level bdii without any crashes. These resulted in slapd segfaults with the broken version.

Cheers,
Andreas
Comment 22 Andrea 2015-12-04 10:15:03 EST
Hi Andreas,
thanks a lot!
I also got confirmation from ARC that this build of openldap fixes the crash on ARC-CE installations.

Matus, do you think this patch can be released? And how long will it take?

thanks!
cheers
Andrea
Comment 23 Andrea 2015-12-14 03:51:30 EST
Hi Matus,
we have been testing the new rpms quite intensively these days, and we can definitely say that our services are working fine with that build of openldap.

Can we do something to speed up the integration and release of that version by Red Hat?
thanks
cheers
Andrea
Comment 24 Andrea 2016-01-19 07:48:36 EST
Hello, 
sorry to bother you again. Is there any news regarding the integration and release of this change in openldap?
Related to this, I have also seen that a new version of openldap for RHEL 7 (2.4.40-8.el7) is in CentOS 7 now (we are starting to support this OS as well). I haven't tested it yet, so I don't know whether this problem appears there too, but this change may also need to be applied to the openldap released in RHEL 7.
thanks
cheers
Andrea
Comment 25 Matus Honek 2016-01-19 08:46:58 EST
Hello Andrea,

I am sorry for not answering sooner. The fix is proposed for rhel-6.8 and should be included with its release.
Also, I should clone this bugzilla for rhel-7.

Regards,
Matus
Comment 33 errata-xmlrpc 2016-05-10 20:59:16 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0943.html
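
To tell whether an installed el6 package is at or beyond the build listed in "Fixed In Version" (openldap-2.4.40-8.el6), a simplified comparison can be sketched in Python as follows. The function names are made up for illustration, and the comparison only handles plain numeric version and release fields; it is not a replacement for RPM's own version comparison. Note that a result of False for a 2.4.39 build does not mean that build crashes, since this report only concerns 2.4.40.

```python
import re

# Build named in this report's "Fixed In Version" field: openldap-2.4.40-8.el6
FIXED_BUILD = ((2, 4, 40), 8)


def parse_el6_build(package):
    """Split e.g. 'openldap-servers-2.4.40-5.el6' into ((2, 4, 40), 5)."""
    match = re.search(r"-(\d+(?:\.\d+)*)-(\d+)\.el6(?:[_.].*)?$", package)
    if match is None:
        raise ValueError(f"not a recognised el6 package string: {package!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2))


def at_least_fixed_build(package):
    """Return True if the package is at or beyond the fixed build."""
    return parse_el6_build(package) >= FIXED_BUILD


print(at_least_fixed_build("openldap-servers-2.4.40-5.el6"))  # prints False
print(at_least_fixed_build("openldap-servers-2.4.40-8.el6"))  # prints True
```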
Comment 34 Matus Honek 2016-05-30 08:31:40 EDT
*** Bug 1318904 has been marked as a duplicate of this bug. ***
