Bug 1942060

Summary:

Directory Server is unavailable after a restart with nsslapd-readonly=on and consumes 100% CPU

Product:

Red Hat Directory Server

Reporter:

Marc Muehlfeld <mmuehlfe>

Component:

389-ds-base

Assignee:

LDAP Maintainers <idm-ds-dev-bugs>

Status:

CLOSED MIGRATED

QA Contact:

LDAP QA Team <idm-ds-qe-bugs>

Severity:

unspecified

Docs Contact:

Evgenia Martynyuk <emartyny>

Priority:

medium

Version:

11.2

CC:

idm-ds-dev-bugs, mreynolds, pasik, progier, tbordaz

Target Milestone:

DS12.5

Keywords:

Triaged, UserExperience

Target Release:

dirsrv-12.5

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

sync-to-jira

Last Closed:

2024-06-26 13:46:36 UTC

Type:

Bug

Attachments:

Description	Flags
error log	none
ASAN file showing a double free	none

Description Marc Muehlfeld 2021-03-23 14:51:43 UTC

Created attachment 1765573
error log

Description of problem:
If you set nsslapd-readonly=on and restart the instance, ns-slapd consumes 100% CPU, no ports are opened, and the service is unavailable.


Version-Release number of selected component (if applicable):
389-ds-base-1.4.3.13-1.module+el8dsrv+8334+69a46a2e.x86_64


How reproducible:
Always.


Steps to Reproduce:
1. dsconf -D "cn=Directory Manager" ldap://server.example.com config replace nsslapd-readonly=on
2. systemctl restart dirsrv@instance_name


Actual results:
The systemctl command never ends, ns-slapd consumes 100% CPU, no ports are opened. The service is not available, and no meaningful error is logged.


Expected results:
If it's expected that the instance can't be (re-)started in read-only mode, then starting the service should fail and a meaningful error should be logged.

If it's expected that the instance starts, then ns-slapd should not hang and should start successfully in read-only mode.

Comment 1 Pierre Rogier 2021-03-23 17:39:16 UTC

Created attachment 1765639
ASAN file showing a double free

FYI: I reproduced the problem on the master branch with an ASAN build, and it shows a data access on a freed buffer.

Comment 2 Zuzana Lena Ansorgova 2021-05-17 14:40:28 UTC

@progier Is this going to be improved in RHDS v12?

Comment 3 Pierre Rogier 2021-05-20 09:15:31 UTC

@zansorgo I do not think so (as I do not see this bug in the JIRA backlog)

BTW, although the double free issue I noticed is very simple to fix (an error case is mishandled), the original issue is much more complex and is likely due to the replication plugin failing to start.
The more I think about it, the more I suspect that the global read-only feature is incompatible with the replication feature: there is no way to receive or send updates because the replication state (such as the RUV and the CSN generator) cannot be updated. Furthermore, the replication bootstrap fails because the schema cannot be updated.

We could disable the replication plugin in that case, but IMHO it would be unsafe: the read-only flag can be changed dynamically, but that will not restart the replication plugin, and if we run the server in read-write mode with the replication plugin disabled, there is a risk of corrupting the data ...

Note: The error log does record that there was trouble starting replication and that the server stopped.
Also, when using the dsctl command to restart the server, the command finishes, but only after a long timeout (1 minute or so). That may be why systemctl appears to hang, especially if there are some retries.

The reason is that lib389 waits until the pid file is created to determine that the server has started correctly.
The issue is that the server generates the pid file after it starts listening for connections (i.e. when the server is fully up), so if the process dies before reaching that point, lib389 waits until the timeout expires.

That said, there may be a way to improve the start failure detection (to avoid having to wait for the timeout).
The idea is to have 2 pid files: the current one plus another one written by the parent process once the child pid is known (i.e. when forking in the detach function).
So in lib389 the detach.pid file should exist when the parent dies, and we could check that the child process is still alive while waiting for the "server started" pid file. (If the child is dead, we should report that the server failed to start and advise looking in the 389 error log.)
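
A minimal Python sketch of the wait loop described above, to make the idea concrete. It is only an illustration under stated assumptions, not lib389 code: the function names (`wait_for_start`, `read_pid`, `pid_is_alive`), the return values, and the polling interval are hypothetical, and the two file paths are passed in by the caller rather than taken from any real server layout.

```python
import os
import time


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pid(path):
    """Return the pid stored in path, or None if the file is missing or invalid."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def wait_for_start(started_pid_file, detach_pid_file, timeout=60.0, poll=0.1):
    """Wait for the "server started" pid file, giving up early if the child died.

    Hypothetical helper: returns "started", "died", or "timeout".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # The server writes this file only once it is fully up.
        if read_pid(started_pid_file) is not None:
            return "started"
        # The parent would write this file right after forking (assumption).
        child = read_pid(detach_pid_file)
        if child is not None and not pid_is_alive(child):
            return "died"  # caller should point the user at the error log
        time.sleep(poll)
    return "timeout"
```

With such a loop, a caller could report a start failure as soon as "died" is returned instead of waiting out the whole timeout. One caveat: `os.kill(pid, 0)` also succeeds for a zombie, so the check only helps once the dead child has been reaped, which should be the case after the parent has exited and the child has been reparented.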


So in conclusion: we cannot change the fact that the server fails to start when set to read-only mode, but we could report that the server failed to start before the user gets impatient!

I will open a ticket upstream to propose that enhancement.

Comment 8 Viktor Ashirov 2024-06-26 13:46:36 UTC

This BZ has been automatically migrated to Red Hat Issue Tracker https://issues.redhat.com/browse/DIRSRV-34. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.