Bug 1976906

Summary: Instance crash at restart after changelog configuration
Product: Red Hat Enterprise Linux 8
Component: 389-ds-base
Version: 8.5
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Target Milestone: beta
Target Release: ---
Reporter: sgouvern
Assignee: thierry bordaz <tbordaz>
QA Contact: RHDS QE <ds-qe-bugs>
Docs Contact:
CC: ldap-maint, mreynolds, tbordaz
Keywords: Triaged
Flags: pm-rhel: mirror+
Whiteboard:
Fixed In Version: 389-ds-1.4-8050020210629182703.1a75f91c
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-09 18:12:23 UTC
Type: Bug

Description sgouvern 2021-06-28 14:13:48 UTC
Description of problem:

The server crashes when restarting the instance after enabling changelog encryption with AES.


Version-Release number of selected component (if applicable):
389-ds-base-1.4.3.23-3.module+el8.5.0+11463+c476656c.x86_64.rpm

How reproducible:
Run dirsrvtests/tests/suites/replication/encryption_cl5_test.py


Actual results:
The server crashes and the test fails


Expected results:
The test passes with no crash.

Additional info:

Investigation done by Thierry:
Ticket #4778 allows configuring the scheduling of the database compaction. One of the configuration attributes is "nsslapd-changelogcompactdb-time" (which is "23:55" by default).

For the changelog compaction, the data structure that stores this attribute is initialized in cl5ConfigTrimming.

The spawned thread responsible for the changelog compaction, which uses this data structure, is started in cl5Open.

The problem is that, during replication plugin init, cl5Open is called before cl5ConfigTrimming. There is a risk that the changelog compaction thread (_cl5TrimMain) uses s_cl5Desc.dbTrim.compactTime before it is initialized.

The crash is not systematic, as it depends on thread scheduling.
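
To make the ordering hazard concrete, here is a minimal, self-contained C sketch of the same pattern: a worker thread is started before the configuration it reads has been populated. The identifiers (trim_config, compaction_worker, ...) are hypothetical and do not come from the 389-ds-base sources; the sketch only mirrors the cl5Open-before-cl5ConfigTrimming ordering described above.

/*
 * Hypothetical, simplified illustration of the ordering hazard described
 * above; the identifiers are not the real 389-ds-base names.
 * Build with: gcc -pthread -o race race_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct trim_config {
    char *compact_time;   /* e.g. "23:55"; NULL until configured */
};

static struct trim_config g_trim = { NULL };

/* Mimics _cl5TrimMain: the worker reads the configured time-of-day string.
 * If it runs before the configuration step, compact_time is still NULL and
 * strchr() dereferences a NULL pointer, i.e. the crash seen in thread 8. */
static void *compaction_worker(void *arg)
{
    (void)arg;
    const char *t = g_trim.compact_time;
    const char *colon = strchr(t, ':');          /* crashes when t == NULL */
    printf("compaction at %s (minutes: %s)\n", t, colon ? colon + 1 : "?");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* Bug pattern: the worker is spawned first (cl5Open in the real code)... */
    pthread_create(&tid, NULL, compaction_worker, NULL);

    /* ...and only afterwards is the configuration populated
     * (cl5ConfigTrimming). The sleep() widens the race window so the sketch
     * fails reliably; in the server the outcome depends on thread
     * scheduling, which is why the crash is intermittent. */
    sleep(1);
    g_trim.compact_time = strdup("23:55");

    pthread_join(tid, NULL);
    free(g_trim.compact_time);
    return 0;
}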


The crash backtrace is:

Thread 8 (Thread 0x7f7e278cd700 (LWP 15645)):
#0  0x00007f7e17c9db51 in _cl5_get_tod_expiration (expire_time=0x0) at ldap/servers/plugins/replication/cl5_api.c:3107
#1  0x00007f7e17c9db51 in _cl5TrimMain (param=<optimized out>) at ldap/servers/plugins/replication/cl5_api.c:3145
#2  0x00007f7e2393f5d8 in _pt_root () at target:/lib64/libnspr4.so
#3  0x00007f7e232da14a in start_thread () at target:/lib64/libpthread.so.0
#4  0x00007f7e2564adc3 in clone () at target:/lib64/libc.so.6

...

Thread 1 (Thread 0x7f7e278d2240 (LWP 15630)):
#8  0x00007f7e17caf37a in add_new_agreement (e=e@entry=0x7f7e1a6b41c0) at ldap/servers/plugins/replication/repl5_agmtlist.c:157
...
#18 0x00007f7e17cb077f in agmtlist_config_init () at ldap/servers/plugins/replication/repl5_agmtlist.c:758
#19 0x00007f7e17cb9ae9 in multimaster_start (pb=0x7f7e1a617f00) at ldap/servers/plugins/replication/repl5_init.c:807
#20 0x00007f7e17cb9ae9 in multimaster_start (pb=0x7f7e1a617f00) at ldap/servers/plugins/replication/repl5_init.c:757
#21 0x00007f7e26f14787 in plugin_call_func (list=0x7f7e1a4d6900, operation=operation@entry=212, pb=0x7f7e1a617f00, call_one=call_one@entry=1)
    at ldap/servers/slapd/plugin.c:2064
#22 0x00007f7e26f17789 in plugin_call_one (pb=<optimized out>, operation=212, list=<optimized out>) at ldap/servers/slapd/plugin.c:1979
#23 0x00007f7e26f17789 in plugin_dependency_startall
    (operation=212, errmsg=<synthetic pointer>, plugin_list=<optimized out>, argv=<optimized out>, argc=<optimized out>) at ldap/servers/slapd/plugin.c:1733
#24 0x00007f7e26f17789 in plugin_startall (argc=<optimized out>, argv=<optimized out>, plugin_list=<optimized out>) at ldap/servers/slapd/plugin.c:1951
#25 0x000055e20502383c in main (argc=5, argv=0x7ffdeaeb6ea8) at ldap/servers/slapd/main.c:1034
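
For illustration only (the change that actually shipped in 389-ds-base-1.4.3.23-4 may differ), two obvious ways to close this window are to call cl5ConfigTrimming before cl5Open spawns the thread, or to make the time-of-day helper defensive about a not-yet-initialized value. Below is a sketch of the latter, assuming a hypothetical helper tod_expiration_seconds() and the documented "23:55" default; it is not the actual upstream patch.

/*
 * Sketch of a defensive variant of a time-of-day helper; the name and the
 * fallback behaviour are illustrative only. Build with: gcc -o tod tod.c
 */
#include <stdio.h>
#include <time.h>

#define DEFAULT_COMPACT_TIME "23:55"   /* documented default of
                                          nsslapd-changelogcompactdb-time */

/* Return the number of seconds from now until the next occurrence of the
 * "HH:MM" time of day in expire_time. A NULL or malformed value falls back
 * to the default instead of being dereferenced. */
static long tod_expiration_seconds(const char *expire_time)
{
    int hour = 0, min = 0;

    if (expire_time == NULL ||
        sscanf(expire_time, "%d:%d", &hour, &min) != 2 ||
        hour < 0 || hour > 23 || min < 0 || min > 59) {
        sscanf(DEFAULT_COMPACT_TIME, "%d:%d", &hour, &min);
    }

    time_t now = time(NULL);
    struct tm tm_target;
    localtime_r(&now, &tm_target);
    tm_target.tm_hour = hour;
    tm_target.tm_min = min;
    tm_target.tm_sec = 0;

    time_t target = mktime(&tm_target);
    if (target <= now) {
        target += 24 * 60 * 60;   /* already past today: schedule tomorrow */
    }
    return (long)(target - now);
}

int main(void)
{
    /* NULL mirrors the crashing call _cl5_get_tod_expiration(expire_time=0x0);
     * here it silently uses the default instead of segfaulting. */
    printf("seconds until next compaction: %ld\n", tod_expiration_seconds(NULL));
    printf("seconds until 03:30: %ld\n", tod_expiration_seconds("03:30"));
    return 0;
}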

Comment 4 sgouvern 2021-07-05 15:01:55 UTC
With 389-ds-base-1.4.3.23-4.module+el8.5.0+11627+16aed726.x86_64

]# PYTHONPATH=src/lib389/ py.test -sv dirsrvtests/tests/suites/replication/encryption_cl5_test.py 
re-exec with libfaketime dependencies
============================================== test session starts ==============================================
platform linux -- Python 3.6.8, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3.6
cachedir: .pytest_cache
metadata: {'Python': '3.6.8', 'Platform': 'Linux-4.18.0-318.el8.x86_64-x86_64-with-redhat-8.5-Ootpa', 'Packages': {'pytest': '6.2.4', 'py': '1.10.0', 'pluggy': '0.13.1'}, 'Plugins': {'metadata': '1.11.0', 'html': '3.1.1', 'libfaketime': '0.1.2', 'flaky': '3.7.0'}}
389-ds-base: 1.4.3.23-4.module+el8.5.0+11627+16aed726
nss: 3.67.0-2.el8_4
nspr: 4.31.0-1.el8_4
openldap: 2.4.46-17.el8
cyrus-sasl: 2.1.27-5.el8
FIPS: disabled
rootdir: /mnt/tests/rhds/tests/upstream/ds/dirsrvtests, configfile: pytest.ini
plugins: metadata-1.11.0, html-3.1.1, libfaketime-0.1.2, flaky-3.7.0
collected 1 item                                                                                                

========================================= 1 passed in 233.48s (0:03:53) =========================================

Marking as verified:tested

Comment 7 sgouvern 2021-07-12 08:00:08 UTC
As per comment 4, marking as VERIFIED

Comment 9 errata-xmlrpc 2021-11-09 18:12:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4203