Bug 1418202
Summary: | ns-slapd dies under heavy load
---|---
Product: | Red Hat Enterprise Linux 7
Reporter: | Sami Silén <sami.silen>
Component: | 389-ds-base
Assignee: | mreynolds
Status: | CLOSED ERRATA
QA Contact: | Viktor Ashirov <vashirov>
Severity: | urgent
Priority: | urgent
Version: | 7.3
CC: | mreynolds, msauton, nkinder, rmeggins, sami.silen, sramling
Target Milestone: | rc
Keywords: | ZStream
Target Release: | ---
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Doc Text: |
Cause: The size of the string-format entry was calculated incorrectly and came out shorter than the real size when the entry contained many deleted attributes.
Consequence: Writing the longer entry overflowed the allocated memory and crashed the server.
Fix: The size calculation was corrected.
Result: The overflow no longer occurs and the server no longer crashes.
|
Story Points: | ---
Last Closed: | 2017-08-01 21:14:10 UTC
Type: | Bug
Regression: | ---
Bug Blocks: | 1429495 (view as bug list)
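Editor's illustration of the failure class described in the Doc Text above: the entry's serialized size is estimated from one view of its attributes, while the write path also emits the deleted attribute values, so the write overruns the allocation. The following is a toy Python model only; the data and the helper function are invented for the example and this is not ns-slapd code.

```python
# Toy model of an undersized serialization buffer (not ns-slapd code).
live = {"cn": ["00154"], "description": ["current"]}
deleted = {"description": ["old-1", "old-2", "old-3"]}

def attr_len(attrs):
    # crude "name: value\n" length per attribute value
    return sum(len(name) + len(value) + 2
               for name, values in attrs.items() for value in values)

bufsize = attr_len(live)                    # size estimate ignores deleted values
wrote = attr_len(live) + attr_len(deleted)  # the write path serializes them anyway

print("bufsize=%d wrote=%d" % (bufsize, wrote))
if wrote > bufsize:
    print("overflow: wrote %d bytes into a %d-byte buffer" % (wrote, bufsize))
```

This mirrors the shape of the "array boundary wrote: bufsize=... wrote=..." error that appears later in the thread: the fix is to make the size estimate account for everything the writer emits.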
Thank you for your report, Sami. Looking at the stacktraces, it seems there was memory corruption prior to the crash point. Could you run the Directory Server (ns-slapd) with valgrind and share the valgrind output with us? Here is the How To section for running valgrind [1].

Before doing so, please delete this .sema file:

* NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-<INSTANCE>/changelogdb/731ede0d-cdc611e6-801cd48d-418e285b.sema; NSPR error - -5943

You may also need to delete /dev/shm/sem.slapd-<INSTANCE>.stats

[1] - www.port389.org/docs/389ds/FAQ/faq.html#debugging-memory-growthinvalid-access-with-valgrind

Also, could you please share the error and access logs: /var/log/dirsrv/slapd-<INSTANCE>/{errors,access}? It might be helpful if you could enable the audit log as described in [2] (a minimal script for this is sketched after the attachment list below) and provide /var/log/dirsrv/slapd-<INSTANCE>/audit.

[2] - https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html/Administration_Guide/Configuring_Logs.html#enabling-logs

Created attachment 1247033 [details]
08:37 Valgrind output
Created attachment 1247034 [details]
08:37 access log
Created attachment 1247035 [details]
08:37 errors log
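Following up on the audit-log suggestion in the first reply above: enabling it amounts to setting nsslapd-auditlog-logging-enabled to "on" under cn=config. A minimal python-ldap sketch, assuming a local instance and placeholder Directory Manager credentials:

```python
# Hedged sketch: turn on the 389-ds audit log over LDAP. URI and credentials
# are placeholders; adjust for the real instance.
import ldap

conn = ldap.initialize("ldap://localhost:389")
conn.simple_bind_s("cn=Directory Manager", "password")

# The audit log switch lives on cn=config.
conn.modify_s("cn=config", [
    (ldap.MOD_REPLACE, "nsslapd-auditlog-logging-enabled", [b"on"]),
])
conn.unbind_s()
```

The audit log then appears under /var/log/dirsrv/slapd-<INSTANCE>/audit, as requested above.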
After some fighting and modifications I managed to run ns-slapd.sh standalone; I couldn't make it work through systemd, but... Here are the valgrind output and the last lines from the access and error logs. Of course I forgot to enable the audit log, but let me know if it's needed or if any of these gives some idea why our server can't survive with Novell IdM.

(In reply to Sami Silén from comment #6)
> After some fighting and modifications I managed to run ns-slapd.sh standalone;
> I couldn't make it work through systemd, but... Here are the valgrind output
> and the last lines from the access and error logs. Of course I forgot to
> enable the audit log, but let me know if it's needed or if any of these gives
> some idea why our server can't survive with Novell IdM.

Sami, your inputs are excellent. I believe you captured the cause of the crash. If you could tell us what this crashed entry looks like:

cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>

and what the modify operation was trying to change when it caused the crash, that will certainly help our debugging.

Actually, we are interested in this error message in the error log:

[02/Feb/2017:08:34:04.888362392 +0200] Entry "CN=<USER>,ou=<NODES multiple ou's>,dc=<DOMAIN>,dc=<COM>" -- attribute "logindisabled" not allowed

I believe the attribute "logindisabled" is used for account control on the Novell server, which we do not support. RHDS has its own account policy and management. Please take a look at this chapter of our documentation, which covers account availability:
https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html/Administration_Guide/User_Account_Management.html

That being said, I'm afraid you cannot sync such internal entries/attributes between the two servers, which are implemented differently. But that does not mean it's OK to crash. What RHDS can do is return an error reporting "not supported" and continue.

A couple more questions... Have you configured RHDS with multi-master replication? Have you enabled database encryption? Could there be any other special configurations you added? Thanks for your help.

Actually, we have had this problem for some years. This situation always comes up when we do large data migrations or when we modify something (enable some plugin or similar in our 389 cluster). After modifications the cluster becomes unstable but starts to stabilize little by little.

We already prevented the logindisabled attribute from reaching 389-ds because I also noticed that it shouldn't go there. I haven't yet studied how we could disable a user via Novell; my plan was to write 'accountUnlockTime' via Novell. It's also an operational attribute, so I guess that's not doable either.

I also believe that these kinds of modifications are the main reason. This one is an ADD operation which tries to add an object with attributes for which the needed objectclasses are not set; only groupOfNames is defined, and it describes neither postalCode nor postOfficeBox. Before I doubled the memory for 389-ds, the node receiving these entries died quite fast; now it stays up much longer.

dn: CN=<NUMBER>,OU=Organizations,ou=<node>,dc=<domain>,dc=<com>
postalCode: 00023
postOfficeBox: 12
OrgNameFi: Organization name
physicalDeliveryOfficeName: Helsinki
street: Erottajakatu
cn: <NUMBER>
objectclass: groupOfNames
objectclass: top

Of course we try to fix these from the Novell side. Sometimes someone does modifications there and these kinds of situations occur again. I would be quite glad if it wouldn't kill 389-ds.
The situation with the logindisabled attribute is the same as above, I think: it tries to add an attribute which is not defined in any of the object's objectclasses. And yes, I know we're doing nasty stuff. But like you said, it's not a reason for giving up on the 389-ds side. ;)

Yes, we have a multi-master replication environment. There are no special configurations: a generic installation with TLS/SSL and the Account Policy plugin enabled. Encryption for the data, no, I don't recall doing anything like that. The only special thing, I guess, is that we have a really large dataset which we maintain in the LDAP.

I have now 'fixed' all ADD/MODIFY operations coming from the Novell IdM so that there are no errors in the error log. Still, it looks like the server doesn't stay up for more than 5 minutes. So I think it's something else (replication?). So now the reason can't even be heavy load or numerous operations with incomplete objects.

Thank you for your research, Sami. I think I've found a bug in calculating the entry size. I'm going to make a patch and create a test build tomorrow. When it's done, would it be possible for you to test it in your environment? It would be a test build, and we would have to ask you to update it when the official one is released, though... Thanks.

No problem. This cluster is not in production and it's going to be built again from scratch anyway. I am also waiting with valgrind in case I can provide more information from the latest crash. Now it seems that it stays up for hours (starting to stabilize little by little, the same effect as in the earlier 1.2 cluster...).

Upstream ticket: https://fedorahosted.org/389/ticket/49121

The server hasn't died during the weekend; of course it's getting stable now that I want it to crash. Same effect as with the 1.2 branch. I'll try to remove some fixes from Novell to see if I can reproduce the error before I update the fixes.

(In reply to Sami Silén from comment #13)
> The server hasn't died during the weekend; of course it's getting stable now
> that I want it to crash. Same effect as with the 1.2 branch. I'll try to
> remove some fixes from Novell to see if I can reproduce the error before I
> update the fixes.

Hi Sami,

Your comment means the server does not crash even before replacing the Directory Server with the debug build 389-ds-base-1.3.5.10-17.49121.el7_3, right? I'd imagine the crash is data dependent. Once the problematic data is somehow updated or skipped, it might not happen until the next occasion when the entry is modified.

Still, we haven't fully understood the direct cause, but this entry likely has some anomaly. I'm hoping to figure out what it is:

cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>

It might be interesting to modify each attribute in the entry one by one and see what happens (a scripted variant is sketched after the attachment list below). The command line to do so is:

ldapmodify -x -h <host> -p <port> -D "cn=directory manager" -W << EOF
dn: cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>
changetype: modify
replace: <ATTRIBUTE>
<ATTRIBUTE>: <NEW VALUE>
EOF

And if you could run "dbscan -f /var/lib/dirsrv/slapd-<INSTANCE>/db/<BACKEND>/id2entry.db" and find "rdn: cn=00154", it may give us some more info. If you see an obvious character in the entry which is different from the other entries and share it with us, that might be a big help, too.

BTW, do you use the memberOf plug-in or the automember plug-in with the Directory Server?

Created attachment 1248091 [details]
Valgrind output before patch
Created attachment 1248092 [details]
Access log output before patch
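Expanding on the one-attribute-at-a-time ldapmodify suggestion a few comments above, here is a minimal python-ldap sketch that reads the suspect entry and writes each attribute back onto itself, so a crash (if any) can be pinned to a single attribute. The connection details are placeholders and the DN is the one quoted in the comment; operational attributes that cannot be written will simply be reported as failures.

```python
# Hedged sketch: replay each attribute of one entry back onto itself.
import ldap

DN = "cn=00154,ou=organizations,ou=<node>,dc=<domain>,dc=<com>"

conn = ldap.initialize("ldap://<host>:<port>")  # placeholder host/port
conn.simple_bind_s("cn=directory manager", "<password>")

# Read the entry once, then replace each attribute with its current values.
dn, attrs = conn.search_s(DN, ldap.SCOPE_BASE)[0]
for name, values in attrs.items():
    try:
        conn.modify_s(DN, [(ldap.MOD_REPLACE, name, values)])
        print("ok:", name)
    except ldap.LDAPError as exc:
        print("failed:", name, exc)

conn.unbind_s()
```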
After removing the Novell fixes and starting the migration, I managed to kill the server again. Here is the output from this occurrence. Now I'll update the node connected to Novell; the other nodes in the cluster will stay at the current RPM versions.

Created attachment 1248109 [details]
errors log after patch
Created attachment 1248110 [details]
Valgrind output after patch
After the patch the server died immediately; for the first occurrence I didn't have debugging on, so all I have for it is the following error in the error log:

[06/Feb/2017:19:34:40.968303179 +0200] entry2str_internal_ext: array boundary wrote: bufsize=6158 wrote=6230

I haven't increased the entry cache size yet (the migration is not finished, and the amount grows all the time).

For the logs before the patch, I didn't upload the errors log because there wasn't any change before the crash. After the patch, I didn't upload the access log because there wasn't any change there.

Now it looks like I am not even able to get the server started with the patched RPMs. But it could also be that I am at the worst moment of this anomaly, and after multiple restarts the server gets more stable over time (so the situation could be the same without the patched versions). I didn't install 389-ds-base-snmp, because there weren't any previous versions installed on the system.

I did a short check with "dbscan -f /var/lib/dirsrv/slapd-<INSTANCE>/db/<BACKEND>/id2entry.db". I didn't notice anything unusual, but I'll continue tomorrow and try to look with fresh eyes. I am just afraid that I am not able to get 389-ds up and running with the patched versions.

Yes, I also think the direct cause has not been found. I still believe it's not any particular entry but the amount of a certain type of entries.

Thanks, Sami.
Apparently, the patch did not solve the problem. Let me keep investigating...
> I still believe it's not any particular entry but the amount of a certain type of entries.
Regarding your input, did you happen to get a core file from the crash? If so, could you find out which entry caused the crash? Are there any similarities to the previous one? Thanks...
Hi Mark, should this be sanity-only verification, or do we have steps to reproduce the crash?

(In reply to Sankar Ramalingam from comment #24)
> Hi Mark, should this be sanity-only verification, or do we have steps to
> reproduce the crash?

Actually there is a test case for this: dirsrvtests/tests/tickets/ticket49121_test.py

Thanks, Mark, for pointing this out!

[0 root@qeos-38 tests]# py.test -s -v tickets/ticket49121_test.py
===================================================== test session starts =====================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user5,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user6,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user7,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user8,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:[2] modify user CN=user9,ou=People,dc=example,dc=com - replacing attrs...
INFO:dirsrvtests.tests.tickets.ticket49121_test:Valgrind is happy!
INFO:lib389.utils:Valgrind is now disabled.
INFO:dirsrvtests.tests.tickets.ticket49121_test:Testcase PASSED
PASSED
Instance slapd-master_1 removed.
Instance slapd-master_2 removed.
============== 1 passed in 68.92 seconds ===============

Marking the bug as Verified since the above tests PASSed with the latest 389-ds-base on RHEL-7.4.

[0 root@qeos-38 tests]# rpm -qa | grep -i 389-ds
389-ds-base-snmp-1.3.6.1-13.el7.x86_64
389-ds-base-1.3.6.1-13.el7.x86_64
389-ds-base-debuginfo-1.3.6.1-13.el7.x86_64
389-ds-base-libs-1.3.6.1-13.el7.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086
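For readers without the lib389 suite at hand, here is a rough, hedged sketch of the kind of workload the upstream test (ticket49121_test.py, referenced in the verification above) exercises: repeated attribute replacements on replicated entries, which leave deleted values behind in the entry, the condition the Doc Text ties to the undersized buffer. The server address, credentials, suffix, and loop counts are illustrative only; this is not the actual test.

```python
# Hedged reproduction-style sketch, not the upstream ticket49121_test.py.
import ldap
import ldap.modlist

SUFFIX = "dc=example,dc=com"

conn = ldap.initialize("ldap://localhost:389")          # placeholder server
conn.simple_bind_s("cn=Directory Manager", "password")  # placeholder credentials

for i in range(10):
    dn = "cn=user%d,ou=People,%s" % (i, SUFFIX)
    entry = {
        "objectClass": [b"top", b"person", b"organizationalPerson", b"inetOrgPerson"],
        "cn": [("user%d" % i).encode()],
        "sn": [("user%d" % i).encode()],
        "description": [b"initial"],
    }
    conn.add_s(dn, ldap.modlist.addModlist(entry))

    # Replace the same attributes over and over; on a replicated entry the old
    # values linger as deleted values inside the entry.
    for round_no in range(50):
        conn.modify_s(dn, [
            (ldap.MOD_REPLACE, "description", [("round-%d" % round_no).encode()]),
            (ldap.MOD_REPLACE, "sn", [("user%d-%d" % (i, round_no)).encode()]),
        ])

conn.unbind_s()
```

Running a loop like this under valgrind is the sort of check the upstream test performs; on the fixed build the test reports "Valgrind is happy!" as shown above.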
Created attachment 1246596 [details]
Stacktrace

Description of problem:
Our multi-master environment with four nodes is attached to the Novell IDM. When a large dataset is migrated to the 389-ds, it just dies.

Version-Release number of selected component (if applicable):
Name    : 389-ds-base
Arch    : x86_64
Version : 1.3.5.10
Release : 15.el7_3

How reproducible:
I just enable the driver on the Novell side to start syncing data to the multi-master cluster.

Actual results:
* [4373826.739387] ns-slapd[2666]: segfault at 7f68b4000000 ip 00007f68b29513ac sp 00007f68807eef98 error 4 in libc-2.17.so[7f68b28d1000+1b6000]
* [4720607.858694] traps: ns-slapd[21687] general protection ip:7f3545b5cdab sp:7f35107e6240 error:0 in libc-2.17.so[7f3545ade000+1b6000]

Also, in the errors log I can see a huge number of the following:
* dn2entry_ext: Failed to get id for ,ou=<node>,dc=<domain>,dc=<com> from entryrdn index (34)

And after a new start:
* Detected Disorderly Shutdown last time Directory Server was running, recovering database.
* NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-<INSTANCE>/changelogdb/731ede0d-cdc611e6-801cd48d-418e285b.sema; NSPR error - -5943
* NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=<domain>,dc=<com>. Check if DB RUV needs to be updated

Expected results:
Data should be synced.