997742 – systemd crashes during automated removal of 389-ds-base instances



Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	systemd
Sub Component:
Version:	7.0
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	systemd-maint
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1006323
Depends On:
Blocks:

Reported:	2013-08-16 06:30 UTC by Sankar Ramalingam
Modified:	2014-06-13 11:20 UTC
CC List:	6 users
Fixed In Version:	systemd-206-7.el7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-06-13 11:20:30 UTC
Target Upstream Version:
Embargoed:

Attachments
stack trace (full) (5.22 KB, text/plain) 2013-08-22 16:37 UTC, Nathan Kinder

Description Sankar Ramalingam 2013-08-16 06:30:16 UTC

Description of problem: Creating a directory server instance fails on RHEL7. Running setup-ds.pl prints the following error:
Error: command '/bin/systemctl --system daemon-reload' failed - output [Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused
] error []Error: Could not create directory server instance 'testinst11'.
Exiting . . .
Log file is '/tmp/setupzitRJD.log'



Version-Release number of selected component (if applicable): 389-ds-base-1.3.1.6


How reproducible: Not consistently. 


Steps to Reproduce:
1. Install 389-ds-base-1.3.1.6 on RHEL7
2. Create multiple directory server instances and configure replication, or run the fourwaymmr test suite from TET.
3. After the fourwaymmr setup completes, remove all the masters by running remove-ds.pl.
4. Then try to create a new instance to check whether it works.

Actual results: Creating a DS instance fails.

Error: command '/bin/systemctl --system daemon-reload' failed - output [Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused
] error []Error: Could not create directory server instance 'testinst11'.
Exiting . . .
Log file is '/tmp/setupzitRJD.log'


Expected results: The instance should be created successfully.


Additional info:

The remove-ds.pl command takes a long time to complete and, in the end, fails to remove the DS instance with the same error.

A few lines from the fourwaymmr cleanup tests:

RemoveInstance /usr/lib64/dirsrv/slapd-M4 30106
The following errors occurred during removal:
Error: command '/bin/systemctl --system daemon-reload' failed - output [Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused
] error []Error: could not remove directory server M4
TestCase [fourwaymmr_cleanup] result-> [PASS]

Also, ns-slapd leaves "defunct" processes on the system:

ps -eaf |grep -i slapd
sramling  2100     1  0 Aug15 ?        00:00:06 [ns-slapd] <defunct>
sramling 14511     1  0 Aug15 ?        00:00:00 [ns-slapd] <defunct>
sramling 14663     1  0 Aug15 ?        00:00:00 [ns-slapd] <defunct>
sramling 14827     1  0 Aug15 ?        00:00:01 [ns-slapd] <defunct>
sramling 15003     1  0 Aug15 ?        00:00:01 [ns-slapd] <defunct>
sramling 15119     1  0 Aug15 ?        00:00:01 [ns-slapd] <defunct>
sramling 15247     1  0 Aug15 ?        00:00:01 [ns-slapd] <defunct>
sramling 15339     1  0 Aug15 ?        00:00:02 [ns-slapd] <defunct>
sramling 15443     1  0 Aug15 ?        00:00:02 [ns-slapd] <defunct>
root     21619 21598  0 02:21 pts/2    00:00:00 grep --color=auto -i slapd
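
For anyone re-running this check from automation, a minimal sketch of the same test as the ps/grep above. The function names are made up for illustration, and it assumes the `ps -eaf` column layout shown in this report (UID PID PPID C STIME TTY TIME CMD). A defunct process stays in the table until its parent reaps it, so a PPID of 1 here means the entries are waiting on PID 1.

```python
import subprocess


def find_defunct(ps_output, name="ns-slapd"):
    """Return (pid, ppid) pairs for defunct processes whose command mentions `name`.

    Assumes `ps -eaf` style columns: UID PID PPID C STIME TTY TIME CMD.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        # Skip the header line and anything that does not look like a process row.
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        cmd = fields[7]
        if "<defunct>" in cmd and name in cmd:
            found.append((int(fields[1]), int(fields[2])))
    return found


def defunct_on_this_host(name="ns-slapd"):
    """Run ps locally and report defunct processes (hypothetical helper)."""
    out = subprocess.run(["ps", "-eaf"], capture_output=True, text=True).stdout
    return find_defunct(out, name)
```

Calling `defunct_on_this_host()` on the affected machine should list the same PIDs as the ps output above; the parsing is kept separate so it can be exercised on saved output.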

Comment 2 Nathan Kinder 2013-08-19 15:16:29 UTC

Does this happen outside of TET automation?  Are you able to successfully create and remove instances manually by running setup-ds.pl/remove-ds.pl?

Comment 3 Nathan Kinder 2013-08-19 15:27:53 UTC

Please check the following as well:

- Are there any AVC messages logged?
- Do other systemctl commands not related to DS work?
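
A rough sketch of the second check, in case it helps to script it. It runs a systemctl command unrelated to DS and tells apart "the manager cannot be reached at all" from an ordinary command failure. The function names and the choice of `list-units` as the probe command are assumptions for illustration; the error strings are the ones quoted in the description.

```python
import subprocess


def manager_unreachable(output):
    """True if systemctl output shows it could not connect to the systemd manager."""
    return ("Failed to get D-Bus connection" in output
            and "/run/systemd/private" in output)


def probe(cmd=("systemctl", "--system", "list-units", "--no-pager")):
    """Run a systemctl command not tied to DS and summarize the outcome."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "could not run {}: {}".format(cmd[0], exc)
    if manager_unreachable(proc.stdout + proc.stderr):
        return "manager unreachable"
    if proc.returncode == 0:
        return "ok"
    return "failed with exit code {}".format(proc.returncode)
```

If `probe()` reports "manager unreachable", the failure is not specific to the DS setup scripts.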

Comment 4 Nathan Kinder 2013-08-22 16:36:36 UTC

The problem is that systemd is crashing.  I suspect that our test automation is cleaning something up out from under systemd, and it doesn't handle it well.

When the system first gets into this broken state, the following is logged to /var/log/messages:

-----------------------------------------------------------------------------
Aug 22 01:05:11 dell-pe2950-01 systemd: Assertion 'path' failed at src/shared/cgroup-util.c:866, function cg_is_empty_recursive(). Aborting.
Aug 22 01:05:11 dell-pe2950-01 systemd: Caught <ABRT>, dumped core as pid 24154.
Aug 22 01:05:11 dell-pe2950-01 systemd: Freezing execution. 
-----------------------------------------------------------------------------
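
To spot the same condition on other test machines, a small sketch that scans syslog text for systemd assertion failures. The regular expression is written against the message format shown above and is an assumption beyond that; the function name is made up.

```python
import re

# Matches lines such as:
#   systemd: Assertion 'path' failed at src/shared/cgroup-util.c:866, function cg_is_empty_recursive(). Aborting.
ASSERT_RE = re.compile(
    r"systemd(?:\[\d+\])?: Assertion '(?P<expr>[^']+)' failed at "
    r"(?P<file>[^:\s]+):(?P<line>\d+), function (?P<func>\w+)\(\)"
)


def find_assertions(log_text):
    """Return one dict per systemd assertion failure found in the log text."""
    hits = []
    for m in ASSERT_RE.finditer(log_text):
        hits.append({
            "expr": m.group("expr"),
            "file": m.group("file"),
            "line": int(m.group("line")),
            "func": m.group("func"),
        })
    return hits
```

Feeding it the contents of /var/log/messages gives the failed expression, source location and function for each hit.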

I generated the following stack trace from the core file (a full stack trace will be attached to this bug shortly):


-----------------------------------------------------------------------------
Core was generated by `/usr/lib/systemd/systemd --switched-root --system --deserialize 22'.
Program terminated with signal 6, Aborted.
#0  0x00007f2454e5bffb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
37	  return INLINE_SYSCALL (tgkill, 3, pid, THREAD_GETMEM (THREAD_SELF, tid),
Missing separate debuginfos, use: debuginfo-install libattr-2.4.46-10.el7.x86_64 pcre-8.32-7.el7.x86_64 zlib-1.2.7-10.el7.x86_64
(gdb) bt
#0  0x00007f2454e5bffb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f24569a395e in crash (sig=6) at src/core/main.c:144
#2  <signal handler called>
#3  0x00007f2454ac1999 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#4  0x00007f2454ac30a8 in __GI_abort () at abort.c:90
#5  0x00007f24569fc563 in log_assert (text=<optimized out>, file=0x7f2456a55503 "src/shared/cgroup-util.c", line=866, 
    func=0x7f2456a556c0 <__PRETTY_FUNCTION__.7913> "cg_is_empty_recursive", 
    format=format@entry=0x7f2456a56ab0 "Assertion '%s' failed at %s:%u, function %s(). Aborting.") at src/shared/log.c:699
#6  0x00007f24569fce10 in log_assert_failed (text=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>)
    at src/shared/log.c:704
#7  0x00007f24569f33f3 in cg_is_empty_recursive (controller=controller@entry=0x7f2456a4dda7 "name=systemd", path=0x0, 
    ignore_self=ignore_self@entry=true) at src/shared/cgroup-util.c:866
#8  0x00007f24569e4890 in manager_notify_cgroup_empty (m=m@entry=0x7f245828cba0, cgroup=<optimized out>) at src/core/cgroup.c:736
#9  0x00007f24569d538d in private_bus_message_filter (connection=0x7f245828d820, message=0x7f245858ce50, data=0x7f245828cba0)
    at src/core/dbus.c:491
#10 0x00007f24554969e6 in dbus_connection_dispatch (connection=connection@entry=0x7f245828d820) at dbus-connection.c:4631
#11 0x00007f24569d5dda in bus_dispatch (m=m@entry=0x7f245828cba0) at src/core/dbus.c:525
#12 0x00007f24569a969f in manager_loop (m=0x7f245828cba0) at src/core/manager.c:1816
#13 0x00007f24569a0fb6 in main (argc=5, argv=0x7ffff5ada0c8) at src/core/main.c:1705
-----------------------------------------------------------------------------

Comment 5 Nathan Kinder 2013-08-22 16:37:20 UTC

Created attachment 789271
stack trace (full)

Comment 6 Harald Hoyer 2013-08-23 16:19:53 UTC

same as https://bugzilla.redhat.com/show_bug.cgi?id=995197#c15

Comment 7 Harald Hoyer 2013-08-23 16:55:44 UTC

next try:
systemd-206-6.el7

Comment 8 Harald Hoyer 2013-08-28 14:34:21 UTC

Found the culprit in a 12-hour debug session: hashmap keys were free()'d, which corrupted the hashmap.

systemd-206-7.el7
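
For readers unfamiliar with this class of bug, an analogy only, not the systemd code. A hashmap that stores a pointer to its key depends on that memory staying valid and unchanged. Python cannot free memory under a dict, so the sketch below emulates it by mutating a key object after insertion; the `Key` class, `demo` function and the path string are all hypothetical.

```python
class Key:
    """Mutable key; stands in for a C string that the hashmap only points to."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return hash(self.text)

    def __eq__(self, other):
        return isinstance(other, Key) and self.text == other.text


def demo():
    path = "/system/example.service"  # hypothetical path, for illustration only
    table = {}
    key = Key(path)
    table[key] = "unit M4"

    before = table.get(Key(path))  # found while the key is intact

    # Emulates the key's memory being freed and reused for other data.
    key.text = "garbage"

    after = table.get(Key(path))  # stored key no longer compares equal
    return {"before": before, "after": after, "entries": len(table)}
```

After the key changes, the entry is still counted in the table but can no longer be found under its original name, so a lookup that used to succeed comes back empty.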

Comment 9 Michal Schmidt 2013-09-10 12:45:31 UTC

*** Bug 1006323 has been marked as a duplicate of this bug. ***

Comment 10 Sankar Ramalingam 2013-09-16 11:21:42 UTC

With systemd-207-1.el7, the issue is not reproducible. Hence, marking the bug as Verified.

Comment 11 Ludek Smid 2014-06-13 11:20:30 UTC

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
