Bug 1773114 - ns-slapd hangs during CleanAllRUV tests
Summary: ns-slapd hangs during CleanAllRUV tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: 389-ds-base
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Ludwig
QA Contact: RHDS QE
URL:
Whiteboard:
Depends On:
Blocks: 1837477
 
Reported: 2019-11-15 22:30 UTC by Viktor Ashirov
Modified: 2023-02-18 04:28 UTC
CC: 8 users

Fixed In Version: 389-ds-1.4-8020020200120164339.bf00efc9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1837477
Environment:
Last Closed: 2020-04-28 16:01:22 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
pstack during test (18.13 KB, text/plain), 2019-11-15 22:30 UTC, Viktor Ashirov
pstack at shutdown (8.14 KB, text/plain), 2019-11-15 22:31 UTC, Viktor Ashirov
pstacks and dbstat for all masters (12.82 KB, application/gzip), 2019-11-18 14:03 UTC, Viktor Ashirov


Links
System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 3714 0 None closed AddressSanitizer: SEGV ldap/servers/slapd/back-ldbm/db-bdb/bdb_layer.c:1990 in bdb_pre_close 2021-02-03 19:36:23 UTC
Github 389ds 389-ds-base issues 3786 0 None closed server hangs in cleanallruv test 2021-02-03 19:36:24 UTC
Github 389ds 389-ds-base issues 3800 0 None closed ns-slapd hangs during CleanAllRUV tests 2021-02-03 19:36:24 UTC
Red Hat Issue Tracker RHELPLAN-32581 0 None None None 2023-02-18 04:28:11 UTC
Red Hat Product Errata RHBA-2020:1703 0 None None None 2020-04-28 16:01:43 UTC

Description Viktor Ashirov 2019-11-15 22:30:36 UTC
Created attachment 1636635 [details]
pstack during test

Description of problem:
I'm observing a hang during suites/replication/cleanallruv_test.py::test_stress_clean


Version-Release number of selected component (if applicable):
389-ds-base-1.4.2.4-3.module+el8.2.0+4718+1b57dca2.x86_64.rpm

How reproducible:
always

Steps to Reproduce:
1. Run suites/replication/cleanallruv_test.py::test_stress_clean
2.
3.

Actual results:
Server hangs; pstacks taken during the test and at shutdown are attached.

Expected results:
Server should not hang.

Additional info:
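For reference, the failing test can be driven directly with pytest (a minimal sketch; it assumes the 389-ds-base dirsrvtests tree is the working directory and that lib389 is installed):

import sys
import pytest

# Select just the hanging test; when it hangs, attach pstack/db_stat to the
# ns-slapd processes of the topology before killing the run.
sys.exit(pytest.main([
    "suites/replication/cleanallruv_test.py::test_stress_clean",
    "-v",
]))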

Comment 1 Viktor Ashirov 2019-11-15 22:31:02 UTC
Created attachment 1636636 [details]
pstack at shutdown

Comment 2 mreynolds 2019-11-16 02:20:14 UTC
This could be related to the issue reported in https://pagure.io/389-ds-base/issue/50659, related to the new bdb code changes.

Comment 3 thierry bordaz 2019-11-18 08:30:16 UTC
Viktor, would it be possible to collect db_stat output at the same time as the pstack? (db_stat -N -C A -h <db_homedir>)
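A minimal way to capture both at once (a sketch only; the instance name, db home directory, and use of pidof are assumptions, adjust them to the hanging topology):

# Illustrative helper: take a pstack and the db_stat lock listing at the
# same moment, so lock owners can be matched against thread stacks.
import subprocess
import time

def snapshot(pid, db_homedir, tag):
    with open(f"pstack.{tag}.txt", "w") as out:
        subprocess.run(["pstack", str(pid)], stdout=out, check=False)
    with open(f"db_stat.{tag}.txt", "w") as out:
        subprocess.run(["db_stat", "-N", "-C", "A", "-h", db_homedir],
                       stdout=out, check=False)

if __name__ == "__main__":
    pid = int(subprocess.check_output(["pidof", "ns-slapd"]).split()[0])
    snapshot(pid, "/var/lib/dirsrv/slapd-master4/db", time.strftime("%H%M%S"))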

Comment 4 Ludwig 2019-11-18 09:13:50 UTC
I did reproduce the hang with cc8bfec02d2ae6e80f48ce8a505e312471225f00 the version I based the fix for 50659 on. But only in one of 4 runs.

Comment 5 Viktor Ashirov 2019-11-18 14:03:02 UTC
Created attachment 1637322 [details]
pstacks and dbstat for all masters

Comment 7 Viktor Ashirov 2019-11-18 14:24:00 UTC
(In reply to Ludwig from comment #4)
> I did reproduce the hang with cc8bfec02d2ae6e80f48ce8a505e312471225f00 the
> version I based the fix for 50659 on. But only in one of 4 runs.

The last successful nightly test was with 0493b01; every run after that hung and was terminated by a timeout. Maybe b77f04a exposed a bug that was introduced earlier?

Comment 8 Ludwig 2019-11-19 10:08:50 UTC
This is very weird.
First, I have no clear idea when this was introduced. With b77f04a I did several successful runs, but after enabling DEBUGGING I did get the hang. And I had reproduced the hang with an older commit.

Next, the hang itself. The db_stat output shows one thread waiting for a lock and another one holding the lock, but when looking for the thread in the pstack, the thread holding it is idle:

      39 dd= 2 locks held 1    write locks 0    pid/thread 19680/140696342230784 flags 10   priority 100
      39 READ          1 HELD    /var/lib/dirsrv/slapd-master4/changelogdb/6cd2ec83-0a0811ea-abd3ae47-1bf83628_5dd29e5d000000010000.db handle        0
     
80004eaf dd= 0 locks held 0    write locks 0    pid/thread 19680/140696396748544 flags 0    priority 100       
80004eaf WRITE         1 WAIT    /var/lib/dirsrv/slapd-master4/changelogdb/6cd2ec83-0a0811ea-abd3ae47-1bf83628_5dd29e5d000000010000.db handle 

and 140696342230784 == 0x7ff66c7ff700

Thread 14 (Thread 0x7ff66c7ff700 (LWP 19694)):
#0  0x00007ff69307147c in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007ff6936ca828 in PR_WaitCondVar () at /lib64/libnspr4.so
#2  0x000055e87720c88e in [IDLE THREAD] connection_wait_for_new_work (pb=pb@entry=0x7ff669200000, interval=interval@entry=4294967295) at ldap/servers/slapd/connection.c:973
#3  0x000055e87720de3b in connection_threadmain () at ldap/servers/slapd/connection.c:1515
#4  0x00007ff6936d0568 in _pt_root () at /lib64/libnspr4.so
#5  0x00007ff69306b2de in start_thread () at /lib64/libpthread.so.0
#6  0x00007ff6927fee53 in clone () at /lib64/libc.so.6
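
The decimal thread id reported by db_stat maps to the pthread address used by pstack (a quick check in Python):

# db_stat names the lock owner as pid/thread 19680/140696342230784;
# pstack identifies threads by pthread address, so convert to hex to match.
print(hex(140696342230784))  # 0x7ff66c7ff700 -> Thread 14 above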


So a worker thread did hold a lock, but didn't release it.

Another strange thing is that the replica is deleted, but subsequent mods are still written to the changelog, with the replica ID of the local replica. It looks like removing a replica is not (and maybe never was) done cleanly.

Comment 9 Ludwig 2019-11-19 13:22:44 UTC
Here is some more info. In the access log we see the client connection conn=8 doing ADD operations.
In between there is conn=1 op=63 deleting the replica, which is not completed.

[19/Nov/2019:09:23:35.363896713 +0100] conn=8 op=1379 ADD dn="uid=testuser1820,ou=People,dc=example,dc=com"
[19/Nov/2019:09:23:35.378015556 +0100] conn=8 op=1379 RESULT err=68 tag=105 nentries=0 etime=0.018316882
[19/Nov/2019:09:23:35.379906886 +0100] conn=8 op=1380 ADD dn="uid=testuser1367,ou=People,dc=example,dc=com"
[19/Nov/2019:09:23:35.383262370 +0100] conn=1 op=61 RESULT err=0 tag=107 nentries=0 etime=0.141737531
[19/Nov/2019:09:23:35.384074334 +0100] conn=1 op=62 SRCH base="cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config" scope=2 filter="(&(objectClass=nsDSWindowsReplicationAgreement))" attrs="distinguishedName"
[19/Nov/2019:09:23:35.384212503 +0100] conn=1 op=62 RESULT err=0 tag=101 nentries=0 etime=0.000862241
[19/Nov/2019:09:23:35.385312862 +0100] conn=1 op=63 DEL dn="cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config"
[19/Nov/2019:09:23:35.395442234 +0100] conn=8 op=1380 RESULT err=68 tag=105 nentries=0 etime=0.017294672
[19/Nov/2019:09:23:35.397646870 +0100] conn=8 op=1381 ADD dn="uid=testuser944,ou=People,dc=example,dc=com"
[19/Nov/2019:09:23:35.410632844 +0100] conn=8 op=1381 RESULT err=0 tag=105 nentries=0 etime=0.015069664 csn=5dd3a687000900040000
[19/Nov/2019:09:23:35.414389193 +0100] conn=8 op=1382 ADD dn="uid=testuser1421,ou=People,dc=example,dc=com"
[19/Nov/2019:09:23:36.888953882 +0100] conn=8 op=1382 RESULT err=0 tag=105 nentries=0 etime=1.477650371 csn=5dd3a687000a00040000

But in the error log, we see that after deletion of the replica
[19/Nov/2019:09:23:35.385520219 +0100] - WARN - NSMMReplicationPlugin - replica_config_delete - The changelog for replica dc=example,dc=com is no longer valid since the replica config is being deleted.  Removing the changelog.
[19/Nov/2019:09:23:35.416396543 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
[19/Nov/2019:09:23:36.154042390 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
[19/Nov/2019:09:23:36.166430924 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
[19/Nov/2019:09:23:36.861871184 +0100] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)

the changelog RUV is rebuilt, probably in the context of "conn=8 op=1381".

But this should not happen; it indicates that the deletion was only partially done, yet the RUV was then recreated and the delete is still pending/hanging.
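
To isolate this interleaving from the access log, a quick filter along these lines can help (a minimal sketch; the log path is an assumption, point it at the hanging instance):

# Print only the operations of the deleting connection (conn=1) and the
# client doing the ADDs (conn=8).
import re

ACCESS_LOG = "/var/log/dirsrv/slapd-master4/access"

with open(ACCESS_LOG) as log:
    for line in log:
        if re.search(r"\bconn=(1|8) op=\d+ ", line):
            print(line.rstrip())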

Comment 10 Ludwig 2019-11-19 13:56:20 UTC
I just reproduced the hang with 1.4.1

Comment 11 Ludwig 2019-11-20 15:19:51 UTC
Latest news: I also did reproduce with 1.4.0 - so I think it is NOT a regression.

I have a potential fix in mind: setting the replica in the mapping tree node to NULL before deleting the changelog (in replica_config_delete). But this area seems to be a bit "fragile", so more investigation is needed.

Comment 12 Viktor Ashirov 2019-11-20 15:47:07 UTC
Thank you, Ludwig, for the investigation; removing the Regression keyword.

Comment 13 thierry bordaz 2019-11-27 14:28:49 UTC
Upstream ticket:
https://pagure.io/389-ds-base/issue/50745

Comment 14 thierry bordaz 2019-12-04 10:20:55 UTC
Fix pushed upstream (https://pagure.io/389-ds-base/pull-request/50748) => POST

Comment 18 errata-xmlrpc 2020-04-28 16:01:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1703

