Bug 1418202

Summary: ns-slapd dies under heavy load
Product: Red Hat Enterprise Linux 7
Reporter: Sami Silén <sami.silen>
Component: 389-ds-base
Assignee: mreynolds
Status: CLOSED ERRATA
QA Contact: Viktor Ashirov <vashirov>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.3
CC: mreynolds, msauton, nkinder, rmeggins, sami.silen, sramling
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A bug in calculating the size of the string-format entry produced a size shorter than the real size when an entry contained many deleted attributes. Consequence: Writing the longer entry overflowed the allocated memory and crashed the server. Fix: The size calculation was corrected. Result: The overflow no longer occurs and the server no longer crashes.
Story Points: ---
Clone Of:
Cloned As: 1429495 (view as bug list)
Environment:
Last Closed: 2017-08-01 21:14:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1429495
Attachments (flags: none for all):
  Stacktrace
  08:37 Valgrind output
  08:37 access log
  08:37 errors log
  Valgrind output before patch
  Access log output before patch
  errors log after patch
  Valgrind output after patch

Description Sami Silén 2017-02-01 09:32:32 UTC
Created attachment 1246596 [details]
Stacktrace

Description of problem:

Our multi-master environment with four nodes is attached to the Novell IDM.
When a large dataset is migrated to 389-ds, the server just dies.

Version-Release number of selected component (if applicable):

Name        : 389-ds-base
Arch        : x86_64
Version     : 1.3.5.10
Release     : 15.el7_3

How reproducible:

I just enable the driver on the Novell side to start syncing data to the multi-master cluster.

Actual results:
* [4373826.739387] ns-slapd[2666]: segfault at 7f68b4000000 ip 00007f68b29513ac sp 00007f68807eef98 error 4 in libc-2.17.so[7f68b28d1000+1b6000]
* [4720607.858694] traps: ns-slapd[21687] general protection ip:7f3545b5cdab sp:7f35107e6240 error:0 in libc-2.17.so[7f3545ade000+1b6000]

Also, in the errors log I can see a huge amount of messages like the following:
* dn2entry_ext: Failed to get id for ,ou=<node>,dc=<domain>,dc=<com> from entryrdn index (34)

And after a restart:
* Detected Disorderly Shutdown last time Directory Server was running, recovering database.
* NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-<INSTANCE>/changelogdb/731ede0d-cdc611e6-801cd48d-418e285b.sema; NSPR error - -5943

* NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=<domain>,dc=<com>. Check if DB RUV needs to be updated

Expected results:

Data should be synced.

Comment 2 Noriko Hosoi 2017-02-01 17:20:27 UTC
Thank you for your report, Sami.

Looking at the stacktraces, it seems there was memory corruption prior to the crash point.  Would it be possible to run the Directory Server (ns-slapd) with valgrind and share the valgrind output with us?  Here is the How To section for running valgrind [1].

Before doing so, please delete this .sema file.

* NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-<INSTANCE>/changelogdb/731ede0d-cdc611e6-801cd48d-418e285b.sema; NSPR error - -5943

You may also need to delete /dev/shm/sem.slapd-<INSTANCE>.stats
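
For example, with the instance stopped first, the cleanup could look roughly like this (just a sketch using the paths from the messages above; the exact service name may differ on your system):

systemctl stop dirsrv@<INSTANCE>
rm -f /var/lib/dirsrv/slapd-<INSTANCE>/changelogdb/731ede0d-cdc611e6-801cd48d-418e285b.sema
rm -f /dev/shm/sem.slapd-<INSTANCE>.stats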

[1] - www.port389.org/docs/389ds/FAQ/faq.html#debugging-memory-growthinvalid-access-with-valgrind
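
For reference, a minimal sketch of such a valgrind run (assuming ns-slapd can be started in the foreground with -d 0; the options recommended in [1] are more complete):

valgrind --tool=memcheck --leak-check=yes --num-callers=40 \
  --log-file=/var/tmp/slapd.vg.%p \
  /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-<INSTANCE> -d 0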

Also, could you please share the error and access log: /var/log/dirsrv/slapd-<INSTANCE>/{errors,access}?

It might be helpful if you could enable the audit log as described in [2] and provide us /var/log/dirsrv/slapd-<INSTANCE>/audit.

[2] - https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html/Administration_Guide/Configuring_Logs.html#enabling-logs
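
For example, enabling the audit log over LDAP could look roughly like this (a sketch; it sets the nsslapd-auditlog-logging-enabled attribute on cn=config):

ldapmodify -x -h <host> -p <port> -D "cn=directory manager" -W << EOF
dn: cn=config
changetype: modify
replace: nsslapd-auditlog-logging-enabled
nsslapd-auditlog-logging-enabled: on
EOF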

Comment 3 Sami Silén 2017-02-02 08:41:25 UTC
Created attachment 1247033 [details]
08:37 Valgrind output

Comment 4 Sami Silén 2017-02-02 08:42:04 UTC
Created attachment 1247034 [details]
08:37 access log

Comment 5 Sami Silén 2017-02-02 08:42:43 UTC
Created attachment 1247035 [details]
08:37 errors log

Comment 6 Sami Silén 2017-02-02 08:46:58 UTC
After some fighting and a few modifications I managed to run ns-slapd.sh standalone; I couldn't make it work through systemd, but...

Here are the valgrind output and the last lines from the access and errors logs. Of course I forgot to enable the audit log, but let me know if it's needed or if any of these gives some idea why our server can't survive with Novell IdM.

Comment 7 Noriko Hosoi 2017-02-02 17:31:48 UTC
(In reply to Sami Silén from comment #6)
> After some fighting and a few modifications I managed to run ns-slapd.sh
> standalone; I couldn't make it work through systemd, but...
> 
> Here are the valgrind output and the last lines from the access and errors
> logs. Of course I forgot to enable the audit log, but let me know if it's
> needed or if any of these gives some idea why our server can't survive with
> Novell IdM.

Sami, your inputs are excellent.  I believe you captured the cause of the crash.

If you could tell us what this crashed entry looks like:
    cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>
and what the modify operation that caused the crash was trying to change, that will certainly help our debugging.

Actually, we are interested in this error message in the error log.
[02/Feb/2017:08:34:04.888362392 +0200] Entry "CN=<USER>,ou=<NODES multiple ou's>,dc=<DOMAIN>,dc=<COM>" -- attribute "logindisabled" not allowed

I believe the attribute "logindisabled" is used for account control on the Novell server, which we do not support.  RHDS has its own account policy and management.

Please take a look at this chapter of our documentation, which covers account availability.
https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html/Administration_Guide/User_Account_Management.html
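
For illustration only, disabling an account on the RHDS side would typically go through the nsAccountLock operational attribute, roughly like this (a sketch; see the documentation above for the supported tooling):

ldapmodify -x -h <host> -p <port> -D "cn=directory manager" -W << EOF
dn: CN=<USER>,ou=<NODES multiple ou's>,dc=<DOMAIN>,dc=<COM>
changetype: modify
replace: nsAccountLock
nsAccountLock: true
EOF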

That being said, I'm afraid you cannot sync such internal entries/attributes between two servers that are implemented differently.

But that does not mean it's OK to crash.  What RHDS should do is return an error reporting "not supported" and continue.

A couple more questions...
Have you configured RHDS with multi-master replication?
Have you enabled database encryption?
Are there any other special configurations you have added?

Thanks for your help.

Comment 8 Sami Silén 2017-02-02 18:13:55 UTC
Actually, we have had this problem for some years. The situation always comes up when we do large data migrations or when we modify something (enable some plugin or similar in our 389 cluster). After such modifications the cluster becomes unstable but starts to stabilize little by little.
 
We already prevent the logindisabled attribute from reaching 389-ds because I also noticed that it shouldn't end up there. I haven't yet studied how we could disable a user via Novell; my plan was to write 'accountUnlockTime' via Novell. It's also an operational attribute, so I guess that's not doable either.

I also believe modifications like these are the main reason.
The following is an ADD operation which tries to add an object with attributes that are not covered by its objectclasses: only groupOfNames is defined, and it describes neither postalCode nor postOfficeBox (a possible schema-valid variant is sketched after the LDIF). Before I doubled the memory for 389-ds, the node receiving these entries died quite fast; now it stays up much longer.

dn: CN=<NUMBER>,OU=Organizations,ou=<node>,dc=<domain>,dc=<com>
postalCode: 00023
postOfficeBox: 12
OrgNameFi: Organization name
physicalDeliveryOfficeName: Helsinki
street: Erottajakatu
cn: <NUMBER>
objectclass: groupOfNames
objectclass: top
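
For comparison, a schema-valid variant of such an entry might look roughly like this (only a sketch; extensibleObject is used purely for illustration, and extending the schema with a proper auxiliary objectclass would be the cleaner fix):

dn: CN=<NUMBER>,OU=Organizations,ou=<node>,dc=<domain>,dc=<com>
objectclass: top
objectclass: groupOfNames
objectclass: extensibleObject
cn: <NUMBER>
postalCode: 00023
postOfficeBox: 12
street: Erottajakatu
physicalDeliveryOfficeName: Helsinki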

Of course we try to fix these on the Novell side. Sometimes someone makes modifications there and these kinds of situations occur again. I would be quite glad if it didn't kill 389-ds.

The situation with logindisabled is the same as above, I think: it tries to add an attribute which is not defined in any of the object's objectclasses?

And yes, I know we're doing nasty stuff. But like you said, that's no reason for 389-ds to give up. ;)

Yes, we have a multi-master replication environment. There aren't any special configurations: a generic installation with TLS/SSL and the Account Policy plugin enabled. As for data encryption, no, I don't recall doing anything like that.
The only special thing, I guess, is that we have a really large dataset which we maintain in the LDAP.

Comment 9 Sami Silén 2017-02-03 07:13:11 UTC
I have now 'fixed' all ADD/MODIFY operations coming from the Novell IdM so that there are no errors in the errors log.

Still, it looks like the server doesn't stay up for more than 5 minutes, so I think it's something else (replication?).

So now the reason can't even be the heavy load or the numerous operations with incomplete objects.

Comment 10 Noriko Hosoi 2017-02-03 07:21:21 UTC
Thank you for your research, Sami.  I think I've found a bug in calculating the entry size.  I'm going to make a patch and create a test build tomorrow.

When it's done, would it be possible for you to test it in your environment? It would be a test build, and we would have to ask you to update when the official one is released, though...

Thanks.

Comment 11 Sami Silén 2017-02-03 09:04:45 UTC
No problem,

This cluster is not in production, and it's going to be built again from scratch anyway.

I am also still running under valgrind to see if I can provide more information from the next crash. Right now it seems to stay up for hours (starting to stabilize little by little, the same effect as in the earlier 1.2 cluster...).

Comment 12 Noriko Hosoi 2017-02-03 16:39:13 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/49121

Comment 13 Sami Silén 2017-02-06 15:31:11 UTC
The server hasn't died during the weekend; of course it's now getting stable when I want it to crash. Same effect as with the 1.2 branch. I'll try to remove some fixes from the Novell side to see if I can reproduce the error before I apply the fixed build.

Comment 14 Noriko Hosoi 2017-02-06 17:19:50 UTC
(In reply to Sami Silén from comment #13)
> The server hasn't died during the weekend; of course it's now getting stable
> when I want it to crash. Same effect as with the 1.2 branch. I'll try to
> remove some fixes from the Novell side to see if I can reproduce the error
> before I apply the fixed build.

Hi Sami,

Your comment means the server does not crash even before replacing the Directory Server with the debug build 389-ds-base-1.3.5.10-17.49121.el7_3, right?

I'd imagine the crash is data-dependent.  Once the problematic data is somehow updated or skipped, the crash might not happen again until the next time the entry is modified.

We still haven't fully understood the direct cause, but this entry likely has some anomaly.  I'm hoping to figure out what it is.

    cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>

It might be interesting to modify each attribute in the entry one by one and see what happens.  The command line to do so is:
ldapmodify -x -h <host> -p <port> -D "cn=directory manager" -W << EOF
dn: cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>
changetype: modify
replace: <ATTRIBUTE>
<ATTRIBUTE>: <NEW VALUE>
EOF

And if you could run "dbscan -f /var/lib/dirsrv/slapd-<INSTANCE>/db/<BACKEND>/id2entry.db" and find the entry with "rdn: cn=00154", it may give us some more info.  If you see an obvious character in that entry which differs from the other entries and can share it with us, that might be a big help, too.
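
For example, something like the following could narrow the dbscan output down to that entry (just a sketch, assuming the entry fits within the grepped context):

dbscan -f /var/lib/dirsrv/slapd-<INSTANCE>/db/<BACKEND>/id2entry.db | grep -A 30 "rdn: cn=00154"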

BTW, do you use memberOf plug-in or automember plug-in with the Directory Server?

Comment 15 Sami Silén 2017-02-06 17:26:38 UTC
Created attachment 1248091 [details]
Valgrind output before patch

Comment 16 Sami Silén 2017-02-06 17:27:17 UTC
Created attachment 1248092 [details]
Access log output before patch

Comment 17 Sami Silén 2017-02-06 17:30:17 UTC
After removing the Novell fixes and starting the migration, I managed to kill the server again.
Here is the output from this occurrence.

Now I will update the node connected to Novell; the other nodes in the cluster will stay at the current RPM versions.

Comment 18 Sami Silén 2017-02-06 17:58:59 UTC
Created attachment 1248109 [details]
errors log after patch

Comment 19 Sami Silén 2017-02-06 18:00:01 UTC
Created attachment 1248110 [details]
Valgrind output after patch

Comment 20 Sami Silén 2017-02-06 18:15:14 UTC
After the patch the server died immediately; for the first occurrence I didn't have debugging enabled, so all I have for it is the following error in the errors log.

[06/Feb/2017:19:34:40.968303179 +0200] entry2str_internal_ext: array boundary wrote: bufsize=6158 wrote=6230

I haven't increased the entry cache size yet (the migration is not finished, and the amount of data grows all the time).
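
(For reference, raising the entry cache for a backend would be a cn=config change roughly like the following; just a sketch that assumes the backend is named userRoot and uses an example value in bytes.)

ldapmodify -x -h <host> -p <port> -D "cn=directory manager" -W << EOF
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 2147483648
EOF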

For the logs before the patch, I didn't upload the errors log because there wasn't any change before the crash. Now, after the patch, I didn't upload the access log because there wasn't any change there.
It now looks like I am not even able to get the server started with the patched RPMs. But it could also be that I am at the worst moment of this anomaly, and after multiple restarts the server gets more stable over time (so the situation could be the same without the patched versions).
I didn't install 389-ds-base-snmp, because no previous version of it was installed on the system.

I did a short check with
"dbscan -f /var/lib/dirsrv/slapd-<INSTANCE>/db/<BACKEND>/id2entry.db". I didn't notice anything unusual, but I will continue tomorrow and try to look at it with fresh eyes. I am just afraid that I will not be able to get 389-ds up and running with the patched versions.

Yes, I also think the direct cause has not been found. I still believe it's not any particular entry but the amount of a certain type of entries.

Comment 21 Noriko Hosoi 2017-02-06 18:30:57 UTC
Thanks, Sami.  

Apparently, the patch did not solve the problem.  Let me keep investigating...

> I still believe it's not any particular entry but the amount of a certain type of entries.
Regarding your input, did you happen to get a core file from the crash?  If so, could you find which entry caused the crash?  Is there any similarity to the previous one?  Thanks...

Comment 24 Sankar Ramalingam 2017-05-15 16:37:30 UTC
Hi Mark, should this be sanity-only verification, or do we have steps to reproduce the crash?

Comment 25 mreynolds 2017-05-15 19:14:23 UTC
(In reply to Sankar Ramalingam from comment #24)
> Hi Mark, should this be sanity-only verification, or do we have steps to
> reproduce the crash?

Actually there is a test case for this:

dirsrvtests/tests/tickets/ticket49121_test.py

Comment 26 Sankar Ramalingam 2017-05-16 00:25:48 UTC
Thanks, Mark, for pointing this out!

[0 root@qeos-38 tests]# py.test -s -v tickets/ticket49121_test.py 
===================================================== test session starts =====================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache

INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user5,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user6,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user7,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user8,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user9,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:Valgrind is happy!
INFO:lib389.utils:Valgrind is now disabled.
INFO:dirsrvtests.tests.tickets.ticket49121_test:Testcase PASSED
PASSED
Instance slapd-master_1 removed.
Instance slapd-master_2 removed.


============== 1 passed in 68.92 seconds ===============

Marking the bug as Verified since the above test passed with the latest 389-ds-base on RHEL 7.4.
[0 root@qeos-38 tests]# rpm -qa |grep -i 389-ds
389-ds-base-snmp-1.3.6.1-13.el7.x86_64
389-ds-base-1.3.6.1-13.el7.x86_64
389-ds-base-debuginfo-1.3.6.1-13.el7.x86_64
389-ds-base-libs-1.3.6.1-13.el7.x86_64

Comment 27 errata-xmlrpc 2017-08-01 21:14:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086