Bug 995197 - RHEL7 systemd segfault during ipa-server-install
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Assigned To: systemd-maint
QA Contact: Tomas Dolezal
Whiteboard: TestBlocker
Duplicates: 1000369
Depends On:
Blocks:
Reported: 2013-08-08 14:16 EDT by Scott Poore
Modified: 2014-06-13 08:03 EDT
CC: 10 users

Fixed In Version: systemd-206-7.el7
Doc Type: Bug Fix
Last Closed: 2014-06-13 08:03:40 EDT
Type: Bug


Attachments
abrt crash dump dir from systemd segfault (294.23 KB, application/x-xz)
2013-08-08 14:24 EDT, Scott Poore
backtrace from failed job (7.96 KB, text/plain)
2013-08-22 09:20 EDT, Scott Poore
new systemd crash abrt dir (387.93 KB, application/x-tar)
2013-08-22 11:48 EDT, Scott Poore
new systemd-logind abrt dir (81.28 KB, application/x-tar)
2013-08-22 11:49 EDT, Scott Poore

Description Scott Poore 2013-08-08 14:16:10 EDT
Description of problem:
I've seen systemd segfault twice now in RHEL 7 during or after an IPA install.  From the logs, it looks like it fails during an httpd restart.

2013-08-08T16:03:58Z DEBUG args=/bin/systemctl restart httpd.service
2013-08-08T16:03:59Z DEBUG Process finished, return code=1
2013-08-08T16:03:59Z DEBUG stdout=
2013-08-08T16:03:59Z DEBUG stderr=Warning! D-Bus connection terminated.

From journalctl:

Aug 08 12:03:59 qe-blade-10.testrelm.com kernel: systemd[1]: segfault at 7fbb4ba3d858 ip 00007fb82372cd4f sp 00007fffc1aae400 error 4 in systemd[7fb82370b000+102000]
Aug 08 12:03:59 qe-blade-10.testrelm.com abrtd[4950]: Directory 'ccpp-2013-08-08-12:03:59-9214' creation detected
Aug 08 12:03:59 qe-blade-10.testrelm.com abrt[9215]: Saved core dump of pid 9214 (/usr/lib/systemd/systemd) to /var/tmp/abrt/ccpp-2013-08-08-12:03:59-9214 (4521984 bytes)
Aug 08 12:03:59 qe-blade-10.testrelm.com systemd[1]: Caught <SEGV>, dumped core as pid 9214.
Aug 08 12:03:59 qe-blade-10.testrelm.com systemd[1]: Freezing execution.
Aug 08 12:04:00 qe-blade-10.testrelm.com abrtd[4950]: Generating core_backtrace
Aug 08 12:04:00 qe-blade-10.testrelm.com abrtd[4950]: Generating backtrace
Aug 08 12:04:03 qe-blade-10.testrelm.com abrtd[4950]: New problem directory /var/tmp/abrt/ccpp-2013-08-08-12:03:59-9214, processing
Aug 08 12:04:03 qe-blade-10.testrelm.com abrtd[4950]: Sending an email...
...

Aug 08 12:04:04 qe-blade-10.testrelm.com systemd-cgroups-agent[9310]: Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused


Version-Release number of selected component (if applicable):
systemd-206-2.el7.x86_64

How reproducible:
Unknown, but I have seen this twice in automated testing.


Steps to Reproduce:
1.  ipa-server-install
2.  systemctl -l list-units

Actual results:
systemd segfaults and subsequent systemctl commands fail.  

systemctl -l list-units
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused

Expected results:
httpd restarts cleanly and systemd and systemctl commands continue working as expected.


Additional info:
Comment 1 Scott Poore 2013-08-08 14:24:19 EDT
Created attachment 784548 [details]
abrt crash dump dir from systemd segfault
Comment 4 Harald Hoyer 2013-08-09 09:21:50 EDT
(gdb) bt
#0  0x00007fb821be3e7b in raise () from /lib64/libpthread.so.0
#1  0x00007fb82372b95e in crash (sig=11) at src/core/main.c:144
#2  <signal handler called>
#3  0x00007fb82372cd4f in manager_dispatch_sigchld (m=m@entry=0x7fb824498800) at src/core/manager.c:1380
#4  0x00007fb8237321a1 in manager_process_signal_fd (m=<optimized out>) at src/core/manager.c:1623
#5  process_event (ev=0x7fffc1aae5e0, m=0x7fb824498800) at src/core/manager.c:1648
#6  manager_loop (m=0x7fb824498800) at src/core/manager.c:1845
#7  0x00007fb823728fb6 in main (argc=5, argv=0x7fffc1aaee58) at src/core/main.c:1705
...

#3  0x00007fb82372cd4f in manager_dispatch_sigchld (m=m@entry=0x7fb824498800) at src/core/manager.c:1380
1380	                UNIT_VTABLE(u)->sigchld_event(u, si.si_pid, si.si_code, si.si_status);
(gdb) print u
$1 = (struct Unit *) 0x7fb82478caf0
(gdb) print u->type 
$2 = 1694511699
(gdb) print *u
$3 = {manager = 0x7974742f7974742f, type = 1694511699, load_state = 1701013878, merged_into = 0x0, id = 0x21 <Address 0x21 out of bounds>, 
  instance = 0x537974742d766564 <Address 0x537974742d766564 out of bounds>, names = 0x6563697665642e32, dependencies = {0x0, 0x21, 0x7974742f7665642f, 
    0x6563697665003253, 0x0, 0x41, 0x7369642f7665642f, 0x2f64692d79622f6b, 0x2d656d616e2d6d64, 0x2d65715f6c656872, 0x2d2d6564616c622d, 0x706177732d3031, 
    0x0, 0x31, 0x2e6d65747379732f, 0x65642f6563696c73, 0x6170656775682d76, 0x6e756f6d2e736567, 0x7fb824580074, 0x31, 0x697463656c6c6f43, 0x616461657220676e, 
    0x7461642064616568}, requires_mounts_for = 0x2f64692d79620061, description = 0x3035332d69736373 <Address 0x3035332d69736373 out of bounds>, 
  documentation = 0x31, fragment_path = 0x2e6d65747379732f <Address 0x2e6d65747379732f out of bounds>, 
  source_path = 0x72692f6563696c73 <Address 0x72692f6563696c73 out of bounds>, dropin_paths = 0x65636e616c616271, fragment_mtime = 7305798977971385134, 
  source_mtime = 140428862606336, dropin_mtime = 49, job = 0x2e6d65747379732f, nop_job = 0x65632f6563696c73, job_timeout = 8243108416985068658, 
  refs = 0x656369767265732e, conditions = 0x0, condition_timestamp = {realtime = 65, monotonic = 8316288341928928303}, inactive_exit_timestamp = {
    realtime = 7599108784428035947, monotonic = 7233735817404100452}, active_enter_timestamp = {realtime = 3269952198971045688, 
    monotonic = 7219941098984190516}, active_exit_timestamp = {realtime = 3546365046461443429, monotonic = 54297919306039}, inactive_enter_timestamp = {
    realtime = 65, monotonic = 3345441649034097455}, cgroup_path = 0x79732f6563696c73 <Address 0x79732f6563696c73 out of bounds>, cgroup_realized = 115, 
  cgroup_mask = (CGROUP_CPU | CGROUP_BLKIO | CGROUP_MEMORY | unknown: 1937339168), slice = {unit = 0x6432785c646d6574, refs_next = 0x696c732e6b637366, 
    refs_prev = 0x752f3a6e00006563}, units_by_type_next = 0x6e69622f7273, units_by_type_prev = 0x41, has_requires_mounts_for_next = 0x7665642f7379732f, 
  has_requires_mounts_for_prev = 0x616c702f73656369, load_queue_next = 0x65732f6d726f6674, load_queue_prev = 0x303532386c616972, 
  dbus_queue_next = 0x7974742f7974742f, dbus_queue_prev = 0x6563697665003353, cleanup_queue_next = 0x0, cleanup_queue_prev = 0x21, 
  gc_queue_next = 0x6d722f6e69622f, gc_queue_prev = 0x7fb824590f00, cgroup_queue_next = 0x0, cgroup_queue_prev = 0x61, gc_marker = 611896592, 
  deserialized_job = 32696, load_error = 611895392, unit_file_state = 32696, stop_when_unneeded = false, default_dependencies = false, 
  refuse_manual_start = false, refuse_manual_stop = false, allow_isolate = false, on_failure_isolate = false, ignore_on_isolate = false, 
  ignore_on_snapshot = false, condition_result = false, transient = false, in_load_queue = false, in_dbus_queue = false, in_cleanup_queue = false, 
  in_gc_queue = false, in_cgroup_queue = false, sent_dbus_new_signal = false, no_gc = false, in_audit = false}
(gdb) print si
$4 = {si_signo = 17, si_errno = 0, si_code = 2, _sifields = {_pad = {8508, 0, 15, 0 <repeats 25 times>}, _kill = {si_pid = 8508, si_uid = 0}, _timer = {
      si_tid = 8508, si_overrun = 0, si_sigval = {sival_int = 15, sival_ptr = 0xf}}, _rt = {si_pid = 8508, si_uid = 0, si_sigval = {sival_int = 15, 
        sival_ptr = 0xf}}, _sigchld = {si_pid = 8508, si_uid = 0, si_status = 15, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x213c}, _sigpoll = {
      si_band = 8508, si_fd = 15}, _sigsys = {_call_addr = 0x213c, _syscall = 15, _arch = 0}}}
Comment 5 Michal Schmidt 2013-08-09 09:38:19 EDT
Perhaps we forgot to remove an item from m->watch_pids somewhere earlier? So now we're accessing a Unit that's not there anymore.
Comment 6 Lennart Poettering 2013-08-09 10:40:20 EDT
What is the contents of the httpd.service file used here?
Comment 7 Lennart Poettering 2013-08-09 10:45:43 EDT
Michal is probably right, though I see no obvious place where we might forget to unwatch a PID... I have now made commit 41efeaec037678ac790e2a02df9020f83cc3a359, which explicitly unwatches the PID in a few cases. That shouldn't hurt, but I am really not sure whether it touches this problem. Would be cool if somebody could test the patch and see if it has any effect on the issue in question.
Comment 8 Scott Poore 2013-08-09 11:31:09 EDT
Lennart,

Here's the httpd.service:

# cat /usr/lib/systemd/system/httpd.service 
[Unit]
Description=The Apache HTTP Server
After=network.target remote-fs.target nss-lookup.target

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/httpd
ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND
ExecReload=/usr/sbin/httpd $OPTIONS -k graceful
ExecStop=/usr/sbin/httpd $OPTIONS -k graceful-stop
# We want systemd to give httpd some time to finish gracefully, but still want
# it to kill httpd after TimeoutStopSec if something went wrong during the
# graceful stop. Normally, Systemd sends SIGTERM signal right after the
# ExecStop, which would kill httpd. We are sending useless SIGCONT here to give
# httpd time to finish.
KillSignal=SIGCONT
PrivateTmp=true

[Install]
WantedBy=multi-user.target

That's from a server where it failed this morning.  Four other servers didn't see the failure in the same test, so it doesn't appear to be consistent, but I think running many tests would show whether it's fixed.  And since I run a lot of IPA install/reinstall tests, I'd expect to see it pretty quickly if it's not fixed.

Any chance I can get an rpm that I could test with?

Thanks,
Scott
Comment 9 Scott Poore 2013-08-12 20:58:12 EDT
Lennart, 

Any chance of getting an rpm I can test with?

Thanks,
Scott
Comment 11 Scott Poore 2013-08-19 13:38:36 EDT
Can someone please let me know when a fix might be available?

Thanks,
Scott
Comment 12 Paul W. Frields 2013-08-21 19:39:34 EDT
Here, I built scratch RPMs for you: https://brewweb.devel.redhat.com/taskinfo?taskID=6197476
Comment 13 Scott Poore 2013-08-21 20:11:34 EDT
Cool, I'm checking it out now.  Thanks
Comment 14 Scott Poore 2013-08-22 09:19:26 EDT
Hmm...I'm still seeing it fail.  Unfortunately, I ran jobs that terminated on completion instead of pausing to allow investigation.  I'm running new ones this morning to see if I hit the same thing.  

This time it failed after dirsrv instead of httpd:

2013-08-22T00:24:32Z DEBUG args=/bin/systemctl restart dirsrv.target
2013-08-22T00:24:34Z DEBUG Process finished, return code=1
2013-08-22T00:24:34Z DEBUG stdout=
2013-08-22T00:24:34Z DEBUG stderr=Warning! D-Bus connection terminated.
Disconnected from bus.

2013-08-22T00:24:34Z CRITICAL Failed to restart the directory server (Command '/bin/systemctl restart dirsrv.target' returned non-zero exit status 1). See the installation log for details.

I'll also upload the backtrace from the failure.
Comment 15 Scott Poore 2013-08-22 09:20:14 EDT
Created attachment 789191 [details]
backtrace from failed job
Comment 16 Scott Poore 2013-08-22 11:48:37 EDT
Created attachment 789261 [details]
new systemd crash abrt dir
Comment 17 Scott Poore 2013-08-22 11:49:27 EDT
Created attachment 789262 [details]
new systemd-logind abrt dir

I'm not sure if this will help too, but I thought I'd include it since it failed shortly after the first abrt crash.
Comment 19 Lukáš Nykrýn 2013-08-23 06:03:05 EDT
*** Bug 1000369 has been marked as a duplicate of this bug. ***
Comment 20 Harald Hoyer 2013-08-23 12:20:18 EDT
Also seen in https://bugzilla.redhat.com/show_bug.cgi?id=997742
Comment 21 Harald Hoyer 2013-08-23 12:55:30 EDT
next try:
systemd-206-6.el7
Comment 22 Scott Poore 2013-08-23 15:55:44 EDT
ok, testing this one now.
Comment 23 Scott Poore 2013-08-23 20:00:24 EDT
Ok, so I'm no longer seeing a core dump, but I am now seeing an almost (though not 100%) consistent failure starting up one of the IPA services:

[root@qe-blade-08 log]# systemctl status dirsrv@TESTRELM-COM.service
dirsrv@TESTRELM-COM.service - 389 Directory Server TESTRELM-COM.
   Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
   Active: failed (Result: exit-code) since Fri 2013-08-23 17:29:36 EDT; 7s ago
  Process: 12140 ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid (code=exited, status=0/SUCCESS)
  Process: 32401 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid (code=exited, status=1/FAILURE)
 Main PID: 12051 (code=exited, status=0/SUCCESS)

Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: Starting 389 Directory Server TESTRELM-COM....
Aug 23 17:29:36 qe-blade-08.testrelm.com ns-slapd[32401]: [23/Aug/2013:17:29:36 -0400] createprlistensockets - PR_Bind() on All Interfaces port 389 failed: Netscape Portable Runtime error -5982 (...s is in use.)
Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: dirsrv@TESTRELM-COM.service: control process exited, code=exited status=1
Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: Failed to start 389 Directory Server TESTRELM-COM..
Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: Unit dirsrv@TESTRELM-COM.service entered failed state.

And it now appears that it's just not stopping dirsrv:


[root@qe-blade-08 ~]# kill 28191

[root@qe-blade-08 ~]# systemctl start dirsrv@TESTRELM-COM.service

[root@qe-blade-08 ~]# ps -ef|grep slapd
dirsrv   28692     1  1 17:52 ?        00:00:00 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM.pid -w /var/run/dirsrv/slapd-TESTRELM-COM.startpid
root     28782  9609  0 17:52 pts/0    00:00:00 grep --color=auto slapd

[root@qe-blade-08 ~]# systemctl stop dirsrv@TESTRELM-COM.service

[root@qe-blade-08 ~]# ps -ef|grep slapd
dirsrv   28692     1  0 17:52 ?        00:00:00 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM.pid -w /var/run/dirsrv/slapd-TESTRELM-COM.startpid
root     28858  9609  0 17:52 pts/0    00:00:00 grep --color=auto slapd

This does not happen with the -4.el7 or the other test rpm I tried last.

If it matters, here is the systemd file:

[root@qe-blade-08 system]# cat dirsrv@.service
# you usually do not want to edit this file - instead, edit the
# /etc/sysconfig/dirsrv.systemd file instead - otherwise,
# do not edit this file in /lib/systemd/system - instead, do the following:
# cp /lib/systemd/system/dirsrv\@.service /etc/systemd/system/dirsrv\@.service
# mkdir -p /etc/systemd/system/dirsrv.target.wants
# edit /etc/systemd/system/dirsrv\@.service - uncomment the LimitNOFILE=8192 line
# where %i is the name of the instance
# you may already have a symlink in
# /etc/systemd/system/dirsrv.target.wants/dirsrv@%i.service pointing to
# /lib/systemd/system/dirsrv\@.service - you will have to change it to link
# to /etc/systemd/system/dirsrv\@.service instead
# ln -s /etc/systemd/system/dirsrv\@.service /etc/systemd/system/dirsrv.target.wants/dirsrv@%i.service
# systemctl daemon-reload 
# systemctl (re)start dirsrv.target
[Unit]
Description=389 Directory Server %i.
BindTo=dirsrv.target
After=dirsrv.target

[Service]
Type=forking
EnvironmentFile=/etc/sysconfig/dirsrv
EnvironmentFile=/etc/sysconfig/dirsrv-%i
ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid
ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid
# if you need to set other directives e.g. LimitNOFILE=8192
# set them in this file
.include /etc/sysconfig/dirsrv.systemd
Comment 24 Harald Hoyer 2013-08-24 07:19:46 EDT
(In reply to Scott Poore from comment #23)
> Ok, so I'm no longer seeing a core dump but, I am now seeing an almost (not
> 100%) consistent failure starting up one of the IPA services:
> 
> [root@qe-blade-08 log]# systemctl status dirsrv@TESTRELM-COM.service
> dirsrv@TESTRELM-COM.service - 389 Directory Server TESTRELM-COM.
>    Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
>    Active: failed (Result: exit-code) since Fri 2013-08-23 17:29:36 EDT; 7s
> ago
>   Process: 12140 ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid
> (code=exited, status=0/SUCCESS)
>   Process: 32401 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i
> /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid
> (code=exited, status=1/FAILURE)
>  Main PID: 12051 (code=exited, status=0/SUCCESS)

What is this Main PID 12051? 

What are you killing with:
[root@qe-blade-08 ~]# kill 28191
?

Please set the systemd log level to debug. Either by booting with "systemd.log_level=debug" on the kernel command line or by issuing:

# kill -s 56 1

What is the output of (after starting):
# systemd-cgls
# systemctl show dirsrv@TESTRELM-COM.service

Also please attach the output of the journal (best with systemd log level debug)

# journalctl -ab -o short-monotonic
Comment 25 Scott Poore 2013-08-26 10:39:39 EDT
(In reply to Harald Hoyer from comment #24)
> (In reply to Scott Poore from comment #23)
...
> >  Main PID: 12051 (code=exited, status=0/SUCCESS)
> 
> What is this Main PID 12051? 

I think it was an ns-slapd (main controlling process maybe?)

From the full journal, I see similar:

[root@ibm-x3650m4-01-vm-12 ~]# systemctl status dirsrv@TESTRELM-COM.service 
dirsrv@TESTRELM-COM.service - 389 Directory Server TESTRELM-COM.
   Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
   Active: inactive (dead) since Mon 2013-08-26 09:52:19 EDT; 41min ago
  Process: 10963 ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid (code=exited, status=0/SUCCESS)
  Process: 10961 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid (code=exited, status=0/SUCCESS)
 Main PID: 10840 (code=exited, status=0/SUCCESS)
...

Then from journalctl command below:

[  300.061316] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Received SIGCHLD from PID 10840 (ns-slapd).
[  300.061696] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Got SIGCHLD for process 10840 (ns-slapd)
[  300.062090] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Child 10840 died (code=exited, status=0/SUCCESS)
[  300.062403] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Child 10840 belongs to dirsrv@TESTRELM-COM.service

So, it looks like it's ns-slapd.

> 
> What are you killing with:
> [root@qe-blade-08 ~]# kill 28191
> ?

That was a left-behind ns-slapd process that was preventing me from even starting the service again because it held the port.  So I killed it to see if I could start/stop at all.

> 
> Please set the systemd log level to debug. Either by booting with
> "systemd.log_level=debug" on the kernel command line or by issuing:
> 
> # kill -s 56 1
> 
> What is the output of (after starting):
> # systemd-cgls

[root@ibm-x3650m4-01-vm-12 ~]# systemd-cgls --no-pager
└─system.slice
  ├─    1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
  ├─10962 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM.pid -w /var/run/dirsrv/slapd-TESTRELM-COM.startpid
  ├─abrtd.service
  │ └─10349 /usr/sbin/abrtd -d -s
  ├─rsyslog.service
  │ └─7029 /sbin/rsyslogd -n
  ├─polkit.service
  │ └─482 /usr/lib/polkit-1/polkitd --no-debug
  ├─systemd-journald.service
  │ └─299 /usr/lib/systemd/systemd-journald
  ├─dbus.service
  │ └─439 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
  ├─systemd-logind.service
  │ └─438 /usr/lib/systemd/systemd-logind
  ├─beah-fwd-backend.service
  │ └─6873 /usr/bin/python /usr/bin/beah-fwd-backend
  ├─beah-beaker-backend.service
  │ └─6872 /usr/bin/python /usr/bin/beah-beaker-backend
  ├─beah-srv.service
  │ ├─ 6871 /usr/bin/python /usr/bin/beah-srv
  │ ├─15592 sleep 500
  │ ├─17778 sleep 1
  │ ├─18123 /usr/bin/python /usr/bin/beah-rhts-task
  │ ├─18151 /bin/sh -x /var/lib/beah/tortilla/wrappers.d/runtest
  │ ├─18192 /bin/bash /usr/bin/rhts-test-runner.sh
  │ ├─18221 make run
  │ └─18228 /bin/bash ./runtest.sh
  ├─ntpd.service
  │ └─10511 /usr/sbin/ntpd -u ntp:ntp -g -x
  ├─sshd.service
  │ ├─  799 /usr/sbin/sshd -D
  │ ├─13011 sshd: root@pts/0
  │ ├─13079 -bash
  │ └─17779 systemd-cgls --no-pager
  ├─postfix.service
  │ ├─1024 /usr/libexec/postfix/master -w
  │ ├─1061 pickup -l -t unix -u
  │ └─1062 qmgr -l -t unix -u
  ├─irqbalance.service
  │ └─430 /usr/sbin/irqbalance --foreground
  ├─rhsmcertd.service
  │ └─497 /usr/bin/rhsmcertd
  ├─NetworkManager.service
  │ ├─429 /usr/sbin/NetworkManager --no-daemon
  │ └─496 /sbin/dhclient -d -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-0ec0038f-c8ef-42ee-b790-7ebf7cdee179-eth0.lease -cf /var/lib/NetworkManager/dhcl...
  ├─avahi-daemon.service
  │ ├─428 avahi-daemon: running [ibm-x3650m4-01-vm-12.local]
  │ └─450 avahi-daemon: chroot helper
  ├─nfs-lock.service
  │ └─520 /sbin/rpc.statd
  ├─atd.service
  │ └─442 /usr/sbin/atd -f
  ├─rpcbind.service
  │ └─500 /sbin/rpcbind -w
  ├─crond.service
  │ ├─ 6870 /usr/sbin/crond -n
  │ └─15304 /usr/sbin/anacron -s
  ├─iprinit.service
  │ └─459 /sbin/iprinit --daemon
  ├─iprdump.service
  │ └─473 /sbin/iprdump --daemon
  ├─iprupdate.service
  │ └─457 /sbin/iprupdate --daemon
  ├─system-getty.slice
  │ └─getty@tty1.service
  │   └─447 /sbin/agetty --noclear tty1
  └─system-serial\x2dgetty.slice
    └─serial-getty@ttyS0.service
      └─446 /sbin/agetty --keep-baud ttyS0 115200 38400 9600



> # systemctl show dirsrv@TESTRELM-COM.service


[root@ibm-x3650m4-01-vm-12 ~]# systemctl show dirsrv@TESTRELM-COM.service --no-pager
Id=dirsrv@TESTRELM-COM.service
Names=dirsrv@TESTRELM-COM.service
Requires=basic.target
Wants=system-dirsrv.slice
BindsTo=dirsrv.target
WantedBy=dirsrv.target
Conflicts=shutdown.target
Before=shutdown.target
After=dirsrv.target systemd-journald.socket basic.target system-dirsrv.slice
Description=389 Directory Server TESTRELM-COM.
LoadState=loaded
ActiveState=inactive
SubState=dead
FragmentPath=/usr/lib/systemd/system/dirsrv@.service
UnitFileState=enabled
InactiveExitTimestamp=Mon 2013-08-26 09:52:19 EDT
InactiveExitTimestampMonotonic=300067958
ActiveEnterTimestamp=Mon 2013-08-26 09:52:13 EDT
ActiveEnterTimestampMonotonic=294906310
ActiveExitTimestamp=Mon 2013-08-26 09:52:17 EDT
ActiveExitTimestampMonotonic=298800886
InactiveEnterTimestamp=Mon 2013-08-26 09:52:19 EDT
InactiveEnterTimestampMonotonic=300231044
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureIsolate=no
IgnoreOnIsolate=no
IgnoreOnSnapshot=no
NeedDaemonReload=no
JobTimeoutUSec=0
ConditionTimestamp=Mon 2013-08-26 09:52:19 EDT
ConditionTimestampMonotonic=300066662
ConditionResult=yes
Transient=no
Slice=system-dirsrv.slice
Type=forking
Restart=no
NotifyAccess=none
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
WatchdogUSec=0
WatchdogTimestampMonotonic=0
StartLimitInterval=10000000
StartLimitBurst=5
StartLimitAction=none
ExecStart={ path=/usr/sbin/ns-slapd ; argv[]=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid ; ignore_errors=no ; start_time=[Mon 2013-08-26 09:52:19 EDT] ; stop_time=[Mon 2013-08-26 09:52:19 EDT] ; pid=10961 ; code=exited ; status=0 }
ExecStopPost={ path=/bin/rm ; argv[]=/bin/rm -f /var/run/dirsrv/slapd-%i.pid ; ignore_errors=no ; start_time=[Mon 2013-08-26 09:52:19 EDT] ; stop_time=[Mon 2013-08-26 09:52:19 EDT] ; pid=10963 ; code=exited ; status=0 }
PermissionsStartOnly=no
EnvironmentFile=/etc/sysconfig/dirsrv (ignore_errors=no)
EnvironmentFile=/etc/sysconfig/dirsrv-TESTRELM-COM (ignore_errors=no)
UMask=0022
LimitCPU=18446744073709551615
LimitFSIZE=18446744073709551615
LimitDATA=18446744073709551615
LimitSTACK=18446744073709551615
LimitCORE=18446744073709551615
LimitRSS=18446744073709551615
LimitNOFILE=8192
LimitAS=18446744073709551615
LimitNPROC=30509
LimitMEMLOCK=65536
LimitLOCKS=18446744073709551615
LimitSIGPENDING=30509
LimitMSGQUEUE=819200
LimitNICE=0
LimitRTPRIO=0
LimitRTTIME=18446744073709551615
OOMScoreAdjust=0
Nice=0
IOScheduling=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SecureBits=0
CapabilityBoundingSet=18446744073709551615
MountFlags=0
PrivateTmp=no
PrivateNetwork=no
SameProcessGroup=no
IgnoreSIGPIPE=yes
NoNewPrivileges=no
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
CPUAccounting=no
CPUShares=1024
BlockIOAccounting=no
BlockIOWeight=1000
MemoryAccounting=no
MemoryLimit=18446744073709551615
MemorySoftLimit=18446744073709551615
DevicePolicy=auto
ExecMainStartTimestamp=Mon 2013-08-26 09:52:13 EDT
ExecMainStartTimestampMonotonic=294906265
ExecMainExitTimestamp=Mon 2013-08-26 09:52:19 EDT
ExecMainExitTimestampMonotonic=300058790
ExecMainPID=10840
ExecMainCode=1
ExecMainStatus=0

> 
> Also please attach the output of the journal (best with systemd log level
> debug)
> 
> # journalctl -ab -o short-monotonic

Attaching shortly.
Comment 27 Nathan Kinder 2013-08-26 16:07:31 EDT
I see something similar testing on RHEL 7.0 with systemd-206-6.el7.  The problem seems to be that the dirsrv (ns-slapd) processes are not started in the proper cgroup:

--------------------------------------------------------------------------------
[root@dell-pe2950-01 dirsrv.target.wants]# systemd-cgls
├─user.slice
│ └─user-0.slice
│   └─session-1.scope
│     ├─ 897 sshd: root@pts/0
│     ├─ 903 -bash
│     ├─2267 systemd-cgls
│     └─2268 less
└─system.slice
  ├─   1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
  ├─1692 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-C2 -i /var/run/dirsrv/slapd-C2.pid -w /var/run/dirsrv/slapd-C2.startpid
  ├─1818 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-M2 -i /var/run/dirsrv/slapd-M2.pid -w /var/run/dirsrv/slapd-M2.startpid
  ├─1819 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-C1 -i /var/run/dirsrv/slapd-C1.pid -w /var/run/dirsrv/slapd-C1.startpid
  ├─1830 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-M1 -i /var/run/dirsrv/slapd-M1.pid -w /var/run/dirsrv/slapd-M1.startpid
...
--------------------------------------------------------------------------------

If you try this same command on a working system (like my F19 system), you will see that the DS instances are associated with the dirsrv@.service group:

--------------------------------------------------------------------------------
[root@localhost ~]# systemd-cgls
...
└─system
  ├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 23
  ├─dirsrv@.service
  │ └─localhost
  │   └─11375 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-localhost -i /var/run/dirsrv/slapd-localhost.pid -w /var/r
...
--------------------------------------------------------------------------------

I am starting my DS instances using 'systemctl start dirsrv.target', but they are being put into the system.slice cgroup for some reason.
Comment 28 Nathan Kinder 2013-08-26 16:12:24 EDT
I've also noticed some of these "failed to realize cgroup" messages in /var/log/messages, which might be related to the problem:

-----------------------------------------------------------------------------
Aug 26 14:48:36 dell-pe2950-01 systemd: Reached target 389 Directory Server.
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server M2....
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server C2....
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server M1....
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server C1....
Aug 26 14:48:37 dell-pe2950-01 systemd: Started 389 Directory Server M2..
Aug 26 14:48:37 dell-pe2950-01 systemd: Started 389 Directory Server C1..
Aug 26 14:48:37 dell-pe2950-01 ns-slapd: [26/Aug/2013:14:48:37 -0400] createprlistensockets - PR_Bind() on All Interfaces port 1489 failed: Netscape Portable Runtime error -5982 (Local Network address is in use.)
Aug 26 14:48:37 dell-pe2950-01 systemd: dirsrv@C2.service: control process exited, code=exited status=1
Aug 26 14:48:37 dell-pe2950-01 systemd: Failed to start 389 Directory Server C2..
Aug 26 14:48:37 dell-pe2950-01 systemd: Unit dirsrv@C2.service entered failed state.
Aug 26 14:48:37 dell-pe2950-01 systemd: Started 389 Directory Server M1..
Aug 26 14:50:01 dell-pe2950-01 systemd: Starting Session 2 of user root.
Aug 26 14:50:01 dell-pe2950-01 systemd: Failed to realize cgroup: File exists
-----------------------------------------------------------------------------

The failure to start C2 was due to an already running ns-slapd process that was bound to the same port.
Comment 29 Scott Poore 2013-08-27 11:13:59 EDT
Moving this back to ASSIGNED since we're still having problems.

Harald, does the output above and attached show what could be causing dirsrv to not be in the proper control group?
Comment 30 Nathan Kinder 2013-08-27 11:39:45 EDT
Is this related to the introduction of slices in systemd?  If I do a 'systemctl show' on one of my dirsrv instances, I see that it refers to a system-dirsrv.slice:

-----------------------------------------------------------------
[root@dell-pe2950-01 system]# systemctl show dirsrv@M1.service
Id=dirsrv@M1.service
Names=dirsrv@M1.service
Requires=basic.target
Wants=system-dirsrv.slice
BindsTo=dirsrv.target
WantedBy=dirsrv.target
Conflicts=shutdown.target
Before=shutdown.target
After=dirsrv.target systemd-journald.socket basic.target system-dirsrv.slice
...
Slice=system-dirsrv.slice
...
-----------------------------------------------------------------

We haven't made any changes in 389-ds-base for systemd since the introduction of slices.  Is there something we need to do on our side to work with slices?
Comment 31 Scott Poore 2013-08-27 12:48:59 EDT
I just tested with the previous test version of systemd and I do see the slice for dirsrv:

  ├─system-dirsrv.slice
  │ └─dirsrv@TESTRELM-COM.service
  │   └─3911 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM
Comment 32 Ann Marie Rubin 2013-08-28 10:15:23 EDT
This bug blocks IPA integration testing on RHEL 7.0.
Comment 33 Harald Hoyer 2013-08-28 10:34:05 EDT
Found the culprit in a 12-hour debug session: hashmap keys were free()'d, which corrupted the hashmap.

systemd-206-7.el7
Comment 34 Scott Poore 2013-08-28 12:02:09 EDT
Cool, that looks much better.  Thank you!

Done configuring DNS (named).

Global DNS configuration in LDAP server is empty
You can use 'dnsconfig-mod' command to set global DNS options that
would override settings in local named.conf files

Restarting the web server
==============================================================================
Setup complete

Next steps:
	1. You must make sure these network ports are open:
		TCP Ports:
		  * 80, 443: HTTP/HTTPS
		  * 389, 636: LDAP/LDAPS
		  * 88, 464: kerberos
		  * 53: bind
		UDP Ports:
		  * 88, 464: kerberos
		  * 53: bind
		  * 123: ntp

	2. You can now obtain a kerberos ticket using the command: 'kinit admin'
	   This ticket will allow you to use the IPA tools (e.g., ipa user-add)
	   and the web user interface.

Be sure to back up the CA certificate stored in /root/cacert.p12
This file is required to create replicas. The password for this
file is the Directory Manager password
:: [   PASS   ] :: Running ' /usr/sbin/ipa-server-install --setup-dns --forwarder=192.168.122.1 --hostname=rhel7-1.testrelm.com -r TESTRELM.COM -n testrelm.com -p Secret123 -P Secret123 -a Secret123 -U' (Expected 0, got 0)

[root@rhel7-1 quickinstall]# systemd-cgls
...
  ├─system-dirsrv.slice
  │ └─dirsrv@TESTRELM-COM.service
  │   └─10041 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-CO

[root@rhel7-1 quickinstall]# systemctl restart dirsrv@TESTRELM-COM.service

[root@rhel7-1 quickinstall]# systemd-cgls
...
  ├─system-dirsrv.slice
  │ └─dirsrv@TESTRELM-COM.service
  │   └─10482 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-CO
Comment 36 Ludek Smid 2014-06-13 08:03:40 EDT
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
