Bug 1773114
| Summary: | ns-slapd hangs during CleanAllRUV tests | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Viktor Ashirov <vashirov> | ||||||||
| Component: | 389-ds-base | Assignee: | Ludwig <lkrispen> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 8.2 | CC: | lkrispen, mreynolds, nkinder, pasik, spichugi, tbordaz, tmihinto, vashirov | ||||||||
| Target Milestone: | rc | Keywords: | TestBlocker | ||||||||
| Target Release: | 8.0 | Flags: | pm-rhel:
mirror+
|
||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | 389-ds-1.4-8020020200120164339.bf00efc9 | Doc Type: | If docs needed, set a value | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 1837477 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2020-04-28 16:01:22 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1837477 | ||||||||||
| Attachments: |
|
||||||||||
Created attachment 1636636 [details]
pstack at shutdown
This could be related to issue reported in https://pagure.io/389-ds-base/issue/50659 - related to the new bdb code changes.... Victor would it be possible to collect db_stat at the same time of the pstack. (db_stat -N -C A -h <db_homedir> . ? I did reproduce the hang with cc8bfec02d2ae6e80f48ce8a505e312471225f00 the version I based the fix for 50659 on. But only in one of 4 runs. Created attachment 1637322 [details]
pstacks and dbstat for all masters
(In reply to Ludwig from comment #4) > I did reproduce the hang with cc8bfec02d2ae6e80f48ce8a505e312471225f00 the > version I based the fix for 50659 on. But only in one of 4 runs. Last successful nightly test was with 0493b01, every run after had a hang and the test was terminated by a timeout. Maybe b77f04a exposed a bug that was introduced earlier? This is very weird.
First I have no clear idea when this was introduced, With b77f04a I did several successful runs, and after enabling DEBUGGING I did get the hang. And I had reproduced the hang with an older commit
Next, the hang itself. The db_stat output shows one thread waiting for a lock and another one holding the lock, but when looking for the thread in the pstack, the thread holding it is idle:
39 dd= 2 locks held 1 write locks 0 pid/thread 19680/140696342230784 flags 10 priority 100
39 READ 1 HELD /var/lib/dirsrv/slapd-master4/changelogdb/6cd2ec83-0a0811ea-abd3ae47-1bf83628_5dd29e5d000000010000.db handle 0
80004eaf dd= 0 locks held 0 write locks 0 pid/thread 19680/140696396748544 flags 0 priority 100
80004eaf WRITE 1 WAIT /var/lib/dirsrv/slapd-master4/changelogdb/6cd2ec83-0a0811ea-abd3ae47-1bf83628_5dd29e5d000000010000.db handle
and 140696342230784 == 0x7ff66c7ff700
Thread 14 (Thread 0x7ff66c7ff700 (LWP 19694)):
#0 0x00007ff69307147c in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00007ff6936ca828 in PR_WaitCondVar () at /lib64/libnspr4.so
#2 0x000055e87720c88e in [IDLE THREAD] connection_wait_for_new_work (pb=pb@entry=0x7ff669200000, interval=interval@entry=4294967295) at ldap/servers/slapd/connection.c:973
#3 0x000055e87720de3b in connection_threadmain () at ldap/servers/slapd/connection.c:1515
#4 0x00007ff6936d0568 in _pt_root () at /lib64/libnspr4.so
#5 0x00007ff69306b2de in start_thread () at /lib64/libpthread.so.0
#6 0x00007ff6927fee53 in clone () at /lib64/libc.so.6
so a worker thread did have a lock, but didn't realease it.
Another strange thing is that the replica is deleted, but the following mods are still written to the changelog, but with the replicaid of the local replica. Looks like removing a replica is or always was not clean
Here is some more info, inthe access log we see the client connection conn=8 doing ADD operations In between there is conn=1 op=63 deleting the replica, which is not completed. [19/Nov/2019:09:23:35.363896713 +0100] conn=8 op=1379 ADD dn="uid=testuser1820,ou=People,dc=example,dc=com" [19/Nov/2019:09:23:35.378015556 +0100] conn=8 op=1379 RESULT err=68 tag=105 nentries=0 etime=0.018316882 [19/Nov/2019:09:23:35.379906886 +0100] conn=8 op=1380 ADD dn="uid=testuser1367,ou=People,dc=example,dc=com" [19/Nov/2019:09:23:35.383262370 +0100] conn=1 op=61 RESULT err=0 tag=107 nentries=0 etime=0.141737531 [19/Nov/2019:09:23:35.384074334 +0100] conn=1 op=62 SRCH base="cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config" scope=2 filter="(&(objectClass=nsDSWindowsReplicationAgreement))" attrs="distinguishedName" [19/Nov/2019:09:23:35.384212503 +0100] conn=1 op=62 RESULT err=0 tag=101 nentries=0 etime=0.000862241 [19/Nov/2019:09:23:35.385312862 +0100] conn=1 op=63 DEL dn="cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config" [19/Nov/2019:09:23:35.395442234 +0100] conn=8 op=1380 RESULT err=68 tag=105 nentries=0 etime=0.017294672 [19/Nov/2019:09:23:35.397646870 +0100] conn=8 op=1381 ADD dn="uid=testuser944,ou=People,dc=example,dc=com" [19/Nov/2019:09:23:35.410632844 +0100] conn=8 op=1381 RESULT err=0 tag=105 nentries=0 etime=0.015069664 csn=5dd3a687000900040000 [19/Nov/2019:09:23:35.414389193 +0100] conn=8 op=1382 ADD dn="uid=testuser1421,ou=People,dc=example,dc=com" [19/Nov/2019:09:23:36.888953882 +0100] conn=8 op=1382 RESULT err=0 tag=105 nentries=0 etime=1.477650371 csn=5dd3a687000a00040000 But in the error log, we see that after deletion of the replica [19/Nov/2019:09:23:35.385520219 +0100] - WARN - NSMMReplicationPlugin - replica_config_delete - The changelog for replica dc=example,dc=com is no longer valid since the replica config is being deleted. Removing the changelog. [19/Nov/2019:09:23:35.416396543 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes... [19/Nov/2019:09:23:36.154042390 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete. Result 0 (Success) [19/Nov/2019:09:23:36.166430924 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes... [19/Nov/2019:09:23:36.861871184 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete. Result 0 (Success) the changelog ruv is rebuilt, probably in the context of "conn=8 op=1381" But this should not happen, it indicates that the deletion was partially done. but then the ruv was recreated and the delete is still pending/hanging. I just reproduced the hang with 1.4.1 Latest news: I also did reproduce with 1.4.0 - so I think it is NOT a regression. I have a potential fix in mind, by setting the replica in the mapping tree node to NULL before deleting the changelog (in replica_config_delete), but this area seems to be a bit "fragile", so more investigation is needed Thank you Ludwig for the investigation, removing the Regression keyword. Upstream ticket: https://pagure.io/389-ds-base/issue/50745 Fix pushed upstream (https://pagure.io/389-ds-base/pull-request/50748) => POST Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1703 |
Created attachment 1636635 [details] pstack during test Description of problem: I'm observing a hang during suites/replication/cleanallruv_test.py::test_stress_clean Version-Release number of selected component (if applicable): 389-ds-base-1.4.2.4-3.module+el8.2.0+4718+1b57dca2.x86_64.rpm How reproducible: always Steps to Reproduce: 1. Run suites/replication/cleanallruv_test.py::test_stress_clean 2. 3. Actual results: Server hangs, pstacks during test and at the shutdown are attached. Expected results: Server should not hang. Additional info: