Bug 995197
| Summary: | RHEL7 systemd segfault during ipa-server-install | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Scott Poore <spoore> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Tomas Dolezal <todoleza> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.0 | CC: | arubin, harald, jfilak, jgalipea, lpoetter, mschmidt, nkinder, spoore, systemd-maint-list, todoleza |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | systemd-206-7.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-13 12:03:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Scott Poore
2013-08-08 18:16:10 UTC
Created attachment 784548 [details]
abrt crash dump dir from systemd segfault
(gdb) bt
#0 0x00007fb821be3e7b in raise () from /lib64/libpthread.so.0
#1 0x00007fb82372b95e in crash (sig=11) at src/core/main.c:144
#2 <signal handler called>
#3 0x00007fb82372cd4f in manager_dispatch_sigchld (m=m@entry=0x7fb824498800) at src/core/manager.c:1380
#4 0x00007fb8237321a1 in manager_process_signal_fd (m=<optimized out>) at src/core/manager.c:1623
#5 process_event (ev=0x7fffc1aae5e0, m=0x7fb824498800) at src/core/manager.c:1648
#6 manager_loop (m=0x7fb824498800) at src/core/manager.c:1845
#7 0x00007fb823728fb6 in main (argc=5, argv=0x7fffc1aaee58) at src/core/main.c:1705
...
#3 0x00007fb82372cd4f in manager_dispatch_sigchld (m=m@entry=0x7fb824498800) at src/core/manager.c:1380
1380 UNIT_VTABLE(u)->sigchld_event(u, si.si_pid, si.si_code, si.si_status);
(gdb) print u
$1 = (struct Unit *) 0x7fb82478caf0
(gdb) print u->type
$2 = 1694511699
(gdb) print *u
$3 = {manager = 0x7974742f7974742f, type = 1694511699, load_state = 1701013878, merged_into = 0x0, id = 0x21 <Address 0x21 out of bounds>,
instance = 0x537974742d766564 <Address 0x537974742d766564 out of bounds>, names = 0x6563697665642e32, dependencies = {0x0, 0x21, 0x7974742f7665642f,
0x6563697665003253, 0x0, 0x41, 0x7369642f7665642f, 0x2f64692d79622f6b, 0x2d656d616e2d6d64, 0x2d65715f6c656872, 0x2d2d6564616c622d, 0x706177732d3031,
0x0, 0x31, 0x2e6d65747379732f, 0x65642f6563696c73, 0x6170656775682d76, 0x6e756f6d2e736567, 0x7fb824580074, 0x31, 0x697463656c6c6f43, 0x616461657220676e,
0x7461642064616568}, requires_mounts_for = 0x2f64692d79620061, description = 0x3035332d69736373 <Address 0x3035332d69736373 out of bounds>,
documentation = 0x31, fragment_path = 0x2e6d65747379732f <Address 0x2e6d65747379732f out of bounds>,
source_path = 0x72692f6563696c73 <Address 0x72692f6563696c73 out of bounds>, dropin_paths = 0x65636e616c616271, fragment_mtime = 7305798977971385134,
source_mtime = 140428862606336, dropin_mtime = 49, job = 0x2e6d65747379732f, nop_job = 0x65632f6563696c73, job_timeout = 8243108416985068658,
refs = 0x656369767265732e, conditions = 0x0, condition_timestamp = {realtime = 65, monotonic = 8316288341928928303}, inactive_exit_timestamp = {
realtime = 7599108784428035947, monotonic = 7233735817404100452}, active_enter_timestamp = {realtime = 3269952198971045688,
monotonic = 7219941098984190516}, active_exit_timestamp = {realtime = 3546365046461443429, monotonic = 54297919306039}, inactive_enter_timestamp = {
realtime = 65, monotonic = 3345441649034097455}, cgroup_path = 0x79732f6563696c73 <Address 0x79732f6563696c73 out of bounds>, cgroup_realized = 115,
cgroup_mask = (CGROUP_CPU | CGROUP_BLKIO | CGROUP_MEMORY | unknown: 1937339168), slice = {unit = 0x6432785c646d6574, refs_next = 0x696c732e6b637366,
refs_prev = 0x752f3a6e00006563}, units_by_type_next = 0x6e69622f7273, units_by_type_prev = 0x41, has_requires_mounts_for_next = 0x7665642f7379732f,
has_requires_mounts_for_prev = 0x616c702f73656369, load_queue_next = 0x65732f6d726f6674, load_queue_prev = 0x303532386c616972,
dbus_queue_next = 0x7974742f7974742f, dbus_queue_prev = 0x6563697665003353, cleanup_queue_next = 0x0, cleanup_queue_prev = 0x21,
gc_queue_next = 0x6d722f6e69622f, gc_queue_prev = 0x7fb824590f00, cgroup_queue_next = 0x0, cgroup_queue_prev = 0x61, gc_marker = 611896592,
deserialized_job = 32696, load_error = 611895392, unit_file_state = 32696, stop_when_unneeded = false, default_dependencies = false,
refuse_manual_start = false, refuse_manual_stop = false, allow_isolate = false, on_failure_isolate = false, ignore_on_isolate = false,
ignore_on_snapshot = false, condition_result = false, transient = false, in_load_queue = false, in_dbus_queue = false, in_cleanup_queue = false,
in_gc_queue = false, in_cgroup_queue = false, sent_dbus_new_signal = false, no_gc = false, in_audit = false}
(gdb) print si
$4 = {si_signo = 17, si_errno = 0, si_code = 2, _sifields = {_pad = {8508, 0, 15, 0 <repeats 25 times>}, _kill = {si_pid = 8508, si_uid = 0}, _timer = {
si_tid = 8508, si_overrun = 0, si_sigval = {sival_int = 15, sival_ptr = 0xf}}, _rt = {si_pid = 8508, si_uid = 0, si_sigval = {sival_int = 15,
sival_ptr = 0xf}}, _sigchld = {si_pid = 8508, si_uid = 0, si_status = 15, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x213c}, _sigpoll = {
si_band = 8508, si_fd = 15}, _sigsys = {_call_addr = 0x213c, _syscall = 15, _arch = 0}}}
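One way to read the `print *u` dump above: several of the 64-bit "pointer" values are printable ASCII once their bytes are taken in little-endian order, which is consistent with the Unit's memory having been released and reused for string data. A minimal Python sketch, assuming a little-endian machine and using values copied from the dump (the helper name `decode_word` is made up for illustration):

```python
def decode_word(value: int) -> str:
    """Interpret a 64-bit value as 8 little-endian bytes and show printable ASCII."""
    raw = value.to_bytes(8, "little")
    # Non-printable bytes are shown as '.' so the output stays readable.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Field values copied from the gdb output of `print *u`.
fields = {
    "manager": 0x7974742F7974742F,
    "instance": 0x537974742D766564,
    "names": 0x6563697665642E32,
    "fragment_path": 0x2E6D65747379732F,
}

for name, value in fields.items():
    print(f"{name:14s} {value:#018x} -> {decode_word(value)!r}")
```

Read this way, `instance` followed by `names` spells `dev-ttyS2.device`, which looks like a unit name rather than a pair of pointers.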
Perhaps we forgot to remove an item from m->watch_pids somewhere earlier? So now we're accessing a Unit that's not there anymore.

What are the contents of the httpd.service file used here?

Michal is probably right, though I see no obvious place where we might forget to unwatch a PID... I have now made commit 41efeaec037678ac790e2a02df9020f83cc3a359, which explicitly unwatches the PID in a few cases. That shouldn't hurt, but I am really not sure whether it touches the problem. It would be cool if somebody could test the patch and see if it has any effect on the issue in question.

Lennart,

Here's the httpd.service:

# cat /usr/lib/systemd/system/httpd.service
[Unit]
Description=The Apache HTTP Server
After=network.target remote-fs.target nss-lookup.target

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/httpd
ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND
ExecReload=/usr/sbin/httpd $OPTIONS -k graceful
ExecStop=/usr/sbin/httpd $OPTIONS -k graceful-stop
# We want systemd to give httpd some time to finish gracefully, but still want
# it to kill httpd after TimeoutStopSec if something went wrong during the
# graceful stop. Normally, Systemd sends SIGTERM signal right after the
# ExecStop, which would kill httpd. We are sending useless SIGCONT here to give
# httpd time to finish.
KillSignal=SIGCONT
PrivateTmp=true

[Install]
WantedBy=multi-user.target

That's from a server where it failed this morning. 4 other servers didn't see the failure in the same test. So it doesn't appear to be consistent, but I think running many tests would show whether it's fixed. And since I run a lot of IPA install/reinstall tests, I'd expect to see it pretty quickly if it's not fixed.

Any chance I can get an rpm that I could test with?

Thanks,
Scott

Lennart,

Any chance of getting an rpm I can test with?

Thanks,
Scott

Can someone please let me know when a fix might be available?

Thanks,
Scott

Here, I built scratch RPMs for you:
https://brewweb.devel.redhat.com/taskinfo?taskID=6197476

Cool, I'm checking it out now. Thanks

Hmm...I'm still seeing it fail. Unfortunately, I ran jobs that terminated on completion instead of pausing to allow investigation. I'm running new ones this morning to see if I see the same thing.

This time it failed after dirsrv instead of httpd:

2013-08-22T00:24:32Z DEBUG args=/bin/systemctl restart dirsrv.target
2013-08-22T00:24:34Z DEBUG Process finished, return code=1
2013-08-22T00:24:34Z DEBUG stdout=
2013-08-22T00:24:34Z DEBUG stderr=Warning! D-Bus connection terminated.
Disconnected from bus.
2013-08-22T00:24:34Z CRITICAL Failed to restart the directory server (Command '/bin/systemctl restart dirsrv.target' returned non-zero exit status 1). See the installation log for details.

I'll also upload the backtrace from the failure.

Created attachment 789191 [details]
backtrace from failed job
Created attachment 789261 [details]
new systemd crash abrt dir
Created attachment 789262 [details]
new systemd-logind abrt dir
I'm not sure if this will help too but, thought I'd include it as it failed shortly after the first abrt crash
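The hypothesis raised earlier in this thread, that an entry was left in m->watch_pids after its Unit was gone, can be illustrated with a toy model. The sketch below is Python, not systemd code: the `Manager` and `Unit` classes, their methods, and the unit name are hypothetical stand-ins, and the PID and status are simply the values from the siginfo dump.

```python
class Unit:
    def __init__(self, name):
        self.name = name
        self.alive = True  # stands in for "this memory is still valid"

    def sigchld_event(self, pid, status):
        if not self.alive:
            # In C this would be a use-after-free; the model just raises.
            raise RuntimeError(f"dispatch to destroyed unit for PID {pid}")
        return f"{self.name}: child {pid} exited with status {status}"


class Manager:
    def __init__(self):
        self.watch_pids = {}  # PID -> Unit, modelled on m->watch_pids

    def watch(self, pid, unit):
        self.watch_pids[pid] = unit

    def destroy_unit(self, unit, unwatch):
        if unwatch:
            # Drop every PID entry that still points at this unit.
            for pid in [p for p, u in self.watch_pids.items() if u is unit]:
                del self.watch_pids[pid]
        unit.alive = False

    def dispatch_sigchld(self, pid, status):
        unit = self.watch_pids.get(pid)
        if unit is None:
            return None  # nobody is watching this PID
        return unit.sigchld_event(pid, status)


m = Manager()
unit = Unit("example.service")
m.watch(8508, unit)
m.destroy_unit(unit, unwatch=False)  # the suspected mistake: entry left behind
try:
    m.dispatch_sigchld(8508, 15)
except RuntimeError as exc:
    print(exc)
```

With `unwatch=True` the stale entry is removed and the later SIGCHLD is simply ignored, which is the behaviour the "explicitly unwatch the PID" patch was aiming for.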
*** Bug 1000369 has been marked as a duplicate of this bug. ***

Also seen in https://bugzilla.redhat.com/show_bug.cgi?id=997742

next try: systemd-206-6.el7

ok, testing this one now.

Ok, so I'm no longer seeing a core dump but, I am now seeing an almost (not 100%) consistent failure starting up one of the IPA services:

[root@qe-blade-08 log]# systemctl status dirsrv
dirsrv - 389 Directory Server TESTRELM-COM.
Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
Active: failed (Result: exit-code) since Fri 2013-08-23 17:29:36 EDT; 7s ago
Process: 12140 ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid (code=exited, status=0/SUCCESS)
Process: 32401 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid (code=exited, status=1/FAILURE)
Main PID: 12051 (code=exited, status=0/SUCCESS)

Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: Starting 389 Directory Server TESTRELM-COM....
Aug 23 17:29:36 qe-blade-08.testrelm.com ns-slapd[32401]: [23/Aug/2013:17:29:36 -0400] createprlistensockets - PR_Bind() on All Interfaces port 389 failed: Netscape Portable Runtime error -5982 (...s is in use.)
Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: dirsrv: control process exited, code=exited status=1
Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: Failed to start 389 Directory Server TESTRELM-COM..
Aug 23 17:29:36 qe-blade-08.testrelm.com systemd[1]: Unit dirsrv entered failed state.

And it now appears that it's just not stopping dirsrv:

[root@qe-blade-08 ~]# kill 28191
[root@qe-blade-08 ~]# systemctl start dirsrv
[root@qe-blade-08 ~]# ps -ef|grep slapd
dirsrv 28692 1 1 17:52 ? 00:00:00 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM.pid -w /var/run/dirsrv/slapd-TESTRELM-COM.startpid
root 28782 9609 0 17:52 pts/0 00:00:00 grep --color=auto slapd
[root@qe-blade-08 ~]# systemctl stop dirsrv
[root@qe-blade-08 ~]# ps -ef|grep slapd
dirsrv 28692 1 0 17:52 ? 00:00:00 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM.pid -w /var/run/dirsrv/slapd-TESTRELM-COM.startpid
root 28858 9609 0 17:52 pts/0 00:00:00 grep --color=auto slapd

This does not happen with the -4.el7 or the other test rpm I tried last.

If it matters, here is the systemd file:

[root@qe-blade-08 system]# cat dirsrv@.service
# you usually do not want to edit this file - instead, edit the
# /etc/sysconfig/dirsrv.systemd file instead - otherwise,
# do not edit this file in /lib/systemd/system - instead, do the following:
# cp /lib/systemd/system/dirsrv\@.service /etc/systemd/system/dirsrv\@.service
# mkdir -p /etc/systemd/system/dirsrv.target.wants
# edit /etc/systemd/system/dirsrv\@.service - uncomment the LimitNOFILE=8192 line
# where %i is the name of the instance
# you may already have a symlink in
# /etc/systemd/system/dirsrv.target.wants/dirsrv@%i.service pointing to
# /lib/systemd/system/dirsrv\@.service - you will have to change it to link
# to /etc/systemd/system/dirsrv\@.service instead
# ln -s /etc/systemd/system/dirsrv\@.service /etc/systemd/system/dirsrv.target.wants/dirsrv@%i.service
# systemctl daemon-reload
# systemctl (re)start dirsrv.target
[Unit]
Description=389 Directory Server %i.
BindTo=dirsrv.target
After=dirsrv.target
[Service]
Type=forking
EnvironmentFile=/etc/sysconfig/dirsrv
EnvironmentFile=/etc/sysconfig/dirsrv-%i
ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid
ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid
# if you need to set other directives e.g. LimitNOFILE=8192
# set them in this file
.include /etc/sysconfig/dirsrv.systemd

(In reply to Scott Poore from comment #23)
> Ok, so I'm no longer seeing a core dump but, I am now seeing an almost (not
> 100%) consistent failure starting up one of the IPA services:
>
> [root@qe-blade-08 log]# systemctl status dirsrv
> dirsrv - 389 Directory Server TESTRELM-COM.
> Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
> Active: failed (Result: exit-code) since Fri 2013-08-23 17:29:36 EDT; 7s
> ago
> Process: 12140 ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid
> (code=exited, status=0/SUCCESS)
> Process: 32401 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i
> /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid
> (code=exited, status=1/FAILURE)
> Main PID: 12051 (code=exited, status=0/SUCCESS)

What is this Main PID 12051?

What are you killing with:
[root@qe-blade-08 ~]# kill 28191
?

Please set the systemd log level to debug. Either by booting with "systemd.log_level=debug" on the kernel command line or by issuing:

# kill -s 56 1

What is the output of (after starting):
# systemd-cgls
# systemctl show dirsrv

Also please attach the output of the journal (best with systemd log level debug)
# journalctl -ab -o short-monotonic

(In reply to Harald Hoyer from comment #24)
> (In reply to Scott Poore from comment #23)
...
> > Main PID: 12051 (code=exited, status=0/SUCCESS)
>
> What is this Main PID 12051?

I think it was an ns-slapd (main controlling process maybe?)

From the full journal, I see similar:

[root@ibm-x3650m4-01-vm-12 ~]# systemctl status dirsrv
dirsrv - 389 Directory Server TESTRELM-COM.
Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled)
Active: inactive (dead) since Mon 2013-08-26 09:52:19 EDT; 41min ago
Process: 10963 ExecStopPost=/bin/rm -f /var/run/dirsrv/slapd-%i.pid (code=exited, status=0/SUCCESS)
Process: 10961 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid (code=exited, status=0/SUCCESS)
Main PID: 10840 (code=exited, status=0/SUCCESS)
...

Then from journalctl command below:

[ 300.061316] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Received SIGCHLD from PID 10840 (ns-slapd).
[ 300.061696] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Got SIGCHLD for process 10840 (ns-slapd)
[ 300.062090] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Child 10840 died (code=exited, status=0/SUCCESS)
[ 300.062403] ibm-x3650m4-01-vm-12.testrelm.com systemd[1]: Child 10840 belongs to dirsrv

So, it looks like it's ns-slapd.

>
> What are you killing with:
> [root@qe-blade-08 ~]# kill 28191
> ?

That was a left behind ns-slapd process that was preventing me from even starting the service again because it had the port. So, killed it to see if I could even start/stop at all.

>
> Please set the systemd log level to debug. Either by booting with
> "systemd.log_level=debug" on the kernel command line or by issuing:
>
> # kill -s 56 1
>
> What is the output of (after starting):
> # systemd-cgls

[root@ibm-x3650m4-01-vm-12 ~]# systemd-cgls --no-pager
└─system.slice
  ├─ 1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
  ├─10962 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM.pid -w /var/run/dirsrv/slapd-TESTRELM-COM.startpid
  ├─abrtd.service
  │ └─10349 /usr/sbin/abrtd -d -s
  ├─rsyslog.service
  │ └─7029 /sbin/rsyslogd -n
  ├─polkit.service
  │ └─482 /usr/lib/polkit-1/polkitd --no-debug
  ├─systemd-journald.service
  │ └─299 /usr/lib/systemd/systemd-journald
  ├─dbus.service
  │ └─439 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
  ├─systemd-logind.service
  │ └─438 /usr/lib/systemd/systemd-logind
  ├─beah-fwd-backend.service
  │ └─6873 /usr/bin/python /usr/bin/beah-fwd-backend
  ├─beah-beaker-backend.service
  │ └─6872 /usr/bin/python /usr/bin/beah-beaker-backend
  ├─beah-srv.service
  │ ├─ 6871 /usr/bin/python /usr/bin/beah-srv
  │ ├─15592 sleep 500
  │ ├─17778 sleep 1
  │ ├─18123 /usr/bin/python /usr/bin/beah-rhts-task
  │ ├─18151 /bin/sh -x /var/lib/beah/tortilla/wrappers.d/runtest
  │ ├─18192 /bin/bash /usr/bin/rhts-test-runner.sh
  │ ├─18221 make run
  │ └─18228 /bin/bash ./runtest.sh
  ├─ntpd.service
  │ └─10511 /usr/sbin/ntpd -u ntp:ntp -g -x
  ├─sshd.service
  │ ├─ 799 /usr/sbin/sshd -D
  │ ├─13011 sshd: root@pts/0
  │ ├─13079 -bash
  │ └─17779 systemd-cgls --no-pager
  ├─postfix.service
  │ ├─1024 /usr/libexec/postfix/master -w
  │ ├─1061 pickup -l -t unix -u
  │ └─1062 qmgr -l -t unix -u
  ├─irqbalance.service
  │ └─430 /usr/sbin/irqbalance --foreground
  ├─rhsmcertd.service
  │ └─497 /usr/bin/rhsmcertd
  ├─NetworkManager.service
  │ ├─429 /usr/sbin/NetworkManager --no-daemon
  │ └─496 /sbin/dhclient -d -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-0ec0038f-c8ef-42ee-b790-7ebf7cdee179-eth0.lease -cf /var/lib/NetworkManager/dhcl...
  ├─avahi-daemon.service
  │ ├─428 avahi-daemon: running [ibm-x3650m4-01-vm-12.local]
  │ └─450 avahi-daemon: chroot helper
  ├─nfs-lock.service
  │ └─520 /sbin/rpc.statd
  ├─atd.service
  │ └─442 /usr/sbin/atd -f
  ├─rpcbind.service
  │ └─500 /sbin/rpcbind -w
  ├─crond.service
  │ ├─ 6870 /usr/sbin/crond -n
  │ └─15304 /usr/sbin/anacron -s
  ├─iprinit.service
  │ └─459 /sbin/iprinit --daemon
  ├─iprdump.service
  │ └─473 /sbin/iprdump --daemon
  ├─iprupdate.service
  │ └─457 /sbin/iprupdate --daemon
  ├─system-getty.slice
  │ └─getty
  │   └─447 /sbin/agetty --noclear tty1
  └─system-serial\x2dgetty.slice
    └─serial-getty
      └─446 /sbin/agetty --keep-baud ttyS0 115200 38400 9600

> # systemctl show dirsrv

[root@ibm-x3650m4-01-vm-12 ~]# systemctl show dirsrv --no-pager
Id=dirsrv
Names=dirsrv
Requires=basic.target
Wants=system-dirsrv.slice
BindsTo=dirsrv.target
WantedBy=dirsrv.target
Conflicts=shutdown.target
Before=shutdown.target
After=dirsrv.target systemd-journald.socket basic.target system-dirsrv.slice
Description=389 Directory Server TESTRELM-COM.
LoadState=loaded
ActiveState=inactive
SubState=dead
FragmentPath=/usr/lib/systemd/system/dirsrv@.service
UnitFileState=enabled
InactiveExitTimestamp=Mon 2013-08-26 09:52:19 EDT
InactiveExitTimestampMonotonic=300067958
ActiveEnterTimestamp=Mon 2013-08-26 09:52:13 EDT
ActiveEnterTimestampMonotonic=294906310
ActiveExitTimestamp=Mon 2013-08-26 09:52:17 EDT
ActiveExitTimestampMonotonic=298800886
InactiveEnterTimestamp=Mon 2013-08-26 09:52:19 EDT
InactiveEnterTimestampMonotonic=300231044
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureIsolate=no
IgnoreOnIsolate=no
IgnoreOnSnapshot=no
NeedDaemonReload=no
JobTimeoutUSec=0
ConditionTimestamp=Mon 2013-08-26 09:52:19 EDT
ConditionTimestampMonotonic=300066662
ConditionResult=yes
Transient=no
Slice=system-dirsrv.slice
Type=forking
Restart=no
NotifyAccess=none
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
WatchdogUSec=0
WatchdogTimestampMonotonic=0
StartLimitInterval=10000000
StartLimitBurst=5
StartLimitAction=none
ExecStart={ path=/usr/sbin/ns-slapd ; argv[]=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /var/run/dirsrv/slapd-%i.pid -w /var/run/dirsrv/slapd-%i.startpid ; ignore_errors=no ; start_time=[Mon 2013-08-26 09:52:19 EDT] ; stop_time=[Mon 2013-08-26 09:52:19 EDT] ; pid=10961 ; code=exited ; status=0 }
ExecStopPost={ path=/bin/rm ; argv[]=/bin/rm -f /var/run/dirsrv/slapd-%i.pid ; ignore_errors=no ; start_time=[Mon 2013-08-26 09:52:19 EDT] ; stop_time=[Mon 2013-08-26 09:52:19 EDT] ; pid=10963 ; code=exited ; status=0 }
PermissionsStartOnly=no
EnvironmentFile=/etc/sysconfig/dirsrv (ignore_errors=no)
EnvironmentFile=/etc/sysconfig/dirsrv-TESTRELM-COM (ignore_errors=no)
UMask=0022
LimitCPU=18446744073709551615
LimitFSIZE=18446744073709551615
LimitDATA=18446744073709551615
LimitSTACK=18446744073709551615
LimitCORE=18446744073709551615
LimitRSS=18446744073709551615
LimitNOFILE=8192
LimitAS=18446744073709551615
LimitNPROC=30509
LimitMEMLOCK=65536
LimitLOCKS=18446744073709551615
LimitSIGPENDING=30509
LimitMSGQUEUE=819200
LimitNICE=0
LimitRTPRIO=0
LimitRTTIME=18446744073709551615
OOMScoreAdjust=0
Nice=0
IOScheduling=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SecureBits=0
CapabilityBoundingSet=18446744073709551615
MountFlags=0
PrivateTmp=no
PrivateNetwork=no
SameProcessGroup=no
IgnoreSIGPIPE=yes
NoNewPrivileges=no
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
CPUAccounting=no
CPUShares=1024
BlockIOAccounting=no
BlockIOWeight=1000
MemoryAccounting=no
MemoryLimit=18446744073709551615
MemorySoftLimit=18446744073709551615
DevicePolicy=auto
ExecMainStartTimestamp=Mon 2013-08-26 09:52:13 EDT
ExecMainStartTimestampMonotonic=294906265
ExecMainExitTimestamp=Mon 2013-08-26 09:52:19 EDT
ExecMainExitTimestampMonotonic=300058790
ExecMainPID=10840
ExecMainCode=1
ExecMainStatus=0

>
> Also please attach the output of the journal (best with systemd log level
> debug)
>
> # journalctl -ab -o short-monotonic

Attaching shortly.

I see something similar testing on RHEL 7.0 with systemd-206-6.el7.
The problem seems to be that the dirsrv (ns-slapd) processes are not started in the proper cgroup:

--------------------------------------------------------------------------------
[root@dell-pe2950-01 dirsrv.target.wants]# systemd-cgls
├─user.slice
│ └─user-0.slice
│   └─session-1.scope
│     ├─ 897 sshd: root@pts/0
│     ├─ 903 -bash
│     ├─2267 systemd-cgls
│     └─2268 less
└─system.slice
  ├─ 1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
  ├─1692 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-C2 -i /var/run/dirsrv/slapd-C2.pid -w /var/run/dirsrv/slapd-C2.startpid
  ├─1818 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-M2 -i /var/run/dirsrv/slapd-M2.pid -w /var/run/dirsrv/slapd-M2.startpid
  ├─1819 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-C1 -i /var/run/dirsrv/slapd-C1.pid -w /var/run/dirsrv/slapd-C1.startpid
  ├─1830 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-M1 -i /var/run/dirsrv/slapd-M1.pid -w /var/run/dirsrv/slapd-M1.startpid
...
--------------------------------------------------------------------------------

If you try this same command on a working system (like my F19 system), you will see that the DS instances are associated with the dirsrv@.service group:

--------------------------------------------------------------------------------
[root@localhost ~]# systemd-cgls
...
└─system
  ├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 23
  ├─dirsrv@.service
  │ └─localhost
  │   └─11375 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-localhost -i /var/run/dirsrv/slapd-localhost.pid -w /var/r
...
--------------------------------------------------------------------------------

I am starting my DS instances using 'systemctl start dirsrv.target', but they are being put into the system.slice cgroup for some reason.
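To check where a single process landed without reading the whole systemd-cgls tree, one option is /proc/<pid>/cgroup, whose lines have the form hierarchy-ID:controller-list:cgroup-path. The Python sketch below assumes a cgroup v1 layout with a `name=systemd` hierarchy; the helper names are hypothetical and the two sample strings are illustrative, not captured from the machines in this report.

```python
def systemd_cgroup_path(proc_cgroup_text):
    """Return the path in the name=systemd hierarchy, or None if absent."""
    for line in proc_cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1] == "name=systemd":
            return parts[2]
    return None


def in_own_unit(path):
    """True if the last path component is a service or scope, not a bare slice."""
    last = path.rstrip("/").rsplit("/", 1)[-1]
    return last.endswith((".service", ".scope"))


# Illustrative samples (assumed values, see the note above).
misplaced = "4:cpu,cpuacct:/system.slice\n1:name=systemd:/system.slice\n"
expected = (
    "1:name=systemd:/system.slice/system-dirsrv.slice/"
    "dirsrv@TESTRELM-COM.service\n"
)

for label, text in (("misplaced", misplaced), ("expected", expected)):
    path = systemd_cgroup_path(text)
    print(label, path, in_own_unit(path))
```

On a live system the input would come from reading `/proc/<pid>/cgroup` for the ns-slapd PID in question.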
I've also noticed some of these "failed to realize cgroup" messages in /var/log/messages, which might be related to the problem:

-----------------------------------------------------------------------------
Aug 26 14:48:36 dell-pe2950-01 systemd: Reached target 389 Directory Server.
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server M2....
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server C2....
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server M1....
Aug 26 14:48:36 dell-pe2950-01 systemd: Starting 389 Directory Server C1....
Aug 26 14:48:37 dell-pe2950-01 systemd: Started 389 Directory Server M2..
Aug 26 14:48:37 dell-pe2950-01 systemd: Started 389 Directory Server C1..
Aug 26 14:48:37 dell-pe2950-01 ns-slapd: [26/Aug/2013:14:48:37 -0400] createprlistensockets - PR_Bind() on All Interfaces port 1489 failed: Netscape Portable Runtime error -5982 (Local Network address is in use.)
Aug 26 14:48:37 dell-pe2950-01 systemd: dirsrv: control process exited, code=exited status=1
Aug 26 14:48:37 dell-pe2950-01 systemd: Failed to start 389 Directory Server C2..
Aug 26 14:48:37 dell-pe2950-01 systemd: Unit dirsrv entered failed state.
Aug 26 14:48:37 dell-pe2950-01 systemd: Started 389 Directory Server M1..
Aug 26 14:50:01 dell-pe2950-01 systemd: Starting Session 2 of user root.
Aug 26 14:50:01 dell-pe2950-01 systemd: Failed to realize cgroup: File exists
-----------------------------------------------------------------------------

The failure to start C2 was due to an already running ns-slapd process that was bound to the same port.

Moving this back to ASSIGNED since we're still having problems.

Harald, does the output above and attached show what could be causing dirsrv to not be in the proper control group?

Is this related to the introduction of slices in systemd?
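The PR_Bind() failure in the log above is the ordinary "address already in use" condition: a leftover process still holds the port, so a new instance cannot bind it. A small Python sketch of the same condition with plain TCP sockets on loopback; nothing here is 389-ds or NSPR code, and `try_bind` is a hypothetical helper.

```python
import errno
import socket


def try_bind(port):
    """Try to listen on 127.0.0.1:port; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


# The first "server" takes a free port (port 0 lets the kernel choose one).
first, first_err = try_bind(0)
port = first.getsockname()[1]

# A second instance on the same port fails while the first is still running.
second, second_err = try_bind(port)
print(port, second_err == errno.EADDRINUSE)

first.close()
```

This is why killing the stray ns-slapd by hand made `systemctl start dirsrv` work again: once the old listener is gone, the port is free.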
If I do a 'systemctl show' on one of my dirsrv instances, I see that it refers to a system-dirsrv.slice:

-----------------------------------------------------------------
[root@dell-pe2950-01 system]# systemctl show dirsrv
Id=dirsrv
Names=dirsrv
Requires=basic.target
Wants=system-dirsrv.slice
BindsTo=dirsrv.target
WantedBy=dirsrv.target
Conflicts=shutdown.target
Before=shutdown.target
After=dirsrv.target systemd-journald.socket basic.target system-dirsrv.slice
...
Slice=system-dirsrv.slice
...
-----------------------------------------------------------------

We haven't made any changes in 389-ds-base for systemd since the introduction of slices. Is there something we need to do on our side to work with slices?

I just tested with the previous test version of systemd and I do see the slice for dirsrv:

├─system-dirsrv.slice
│ └─dirsrv
│   └─3911 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-COM

This bug blocks IPA integration testing on RHEL 7.0.

Found the culprit in a 12h debug session. Hashmap keys were free()'d and caused a hashmap corruption.

systemd-206-7.el7

Cool, that looks much better. Thank you!

Done configuring DNS (named).

Global DNS configuration in LDAP server is empty
You can use 'dnsconfig-mod' command to set global DNS options that would override settings in local named.conf files

Restarting the web server
==============================================================================
Setup complete

Next steps:
1. You must make sure these network ports are open:
TCP Ports:
* 80, 443: HTTP/HTTPS
* 389, 636: LDAP/LDAPS
* 88, 464: kerberos
* 53: bind
UDP Ports:
* 88, 464: kerberos
* 53: bind
* 123: ntp

2. You can now obtain a kerberos ticket using the command: 'kinit admin'
This ticket will allow you to use the IPA tools (e.g., ipa user-add) and the web user interface.

Be sure to back up the CA certificate stored in /root/cacert.p12
This file is required to create replicas. The password for this file is the Directory Manager password
:: [ PASS ] :: Running ' /usr/sbin/ipa-server-install --setup-dns --forwarder=192.168.122.1 --hostname=rhel7-1.testrelm.com -r TESTRELM.COM -n testrelm.com -p Secret123 -P Secret123 -a Secret123 -U' (Expected 0, got 0)

[root@rhel7-1 quickinstall]# systemd-cgls
...
├─system-dirsrv.slice
│ └─dirsrv
│   └─10041 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-CO
[root@rhel7-1 quickinstall]# systemctl restart dirsrv
[root@rhel7-1 quickinstall]# systemd-cgls
...
├─system-dirsrv.slice
│ └─dirsrv
│   └─10482 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-COM -i /var/run/dirsrv/slapd-TESTRELM-CO

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.