Created attachment 742272 [details]
/proc/2641/status

Description of problem:
Occasionally I find a stale process which cannot be killed even with kill -9 as root.

Version-Release number of selected component (if applicable):
systemd-197-1.fc18.2.x86_64

How reproducible:
It comes from running the GDB testsuite hundreds of times, so it is very rare to reproduce.

Steps to Reproduce:
1. Run the GDB testsuite many hundreds of times.
2. ps axwf

Actual results:
 2641 ?        ZN     0:00 [clone-new-threa] <defunct>

Expected results:
No such stale process; it did not happen with former inits.

Additional info:
Code of the stale process is most probably: http://sourceware.org/git/?p=gdb.git;a=blob;f=gdb/testsuite/gdb.threads/clone-new-thread-event.c
(In reply to comment #0)
> Occasionally I find a stale process which cannot be even kill -9 as root.

Because zombies are already dead and just waiting to be reaped.

> Actual results:
> 2641 ? ZN 0:00 [clone-new-threa] <defunct>

Is PID 1 the current parent of this zombie? Does the zombie exist forever, or does it eventually go away?

Try restarting some random service via systemctl. That will surely cause some SIGCHLD signals to be delivered to systemd, which should then react by reaping all of its zombie children.

(In reply to comment #1)
> Code of the stale process is most probably:
> http://sourceware.org/git/?p=gdb.git;a=blob;f=gdb/testsuite/gdb.threads/
> clone-new-thread-event.c

I tried running this program in a tight loop, but it did not reproduce the bug. We need a more reliable way to reproduce it, or there's not much we can do here.
(In reply to comment #2)
> Is PID 1 the current parent of this zombie?

Yes; this is why you have the /proc/PID/status attached here, and why I have filed this against systemd.

> Does the zombie exist forever or does it eventually go away?

The zombie PID from comment #0 still exists on that box.

> Try restarting some random service via systemctl. That will surely cause
> some SIGCHLD signals to be delivered to systemd. systemd should react by
> reaping all its children zombies then.

But that is only a workaround for a systemd bug (or it could also be a kernel bug...).

> We need a more reliable way to reproduce or there's not much we can do here.

Maybe I could patch my local systemd with some debug patch, so we know more when it happens again.
If this happens, is systemd itself still responsive? I.e., do "systemctl" and friends still work correctly? Usually, if processes aren't being reaped anymore, it is because PID 1 is dead. There's not some weird namespace stuff involved, is there?
(In reply to comment #4)
> If this happens, is systemd itself still responsive? i.e. does "systemctl" and
> friends still work correctly?

I do not know how to answer that; systemd has never worked. "systemctl restart ..." sometimes does not work (Bug 924581), but sometimes it does, so systemd works at least partially.

> Usually, if processes aren't reaped anymore then this happens because PID1
> is dead.

Definitely thousands of processes have been successfully reaped afterwards.

> There's not some weird namespace stuff involved, is there?

As I do not know what a "namespace" is, probably not.
(In reply to comment #5)
> (In reply to comment #4)
> > If this happens, is systemd itself still responsive? i.e. does "systemctl" and
> > friends still work correctly?
>
> I do not know how to answer it, systemd has never worked.

Hmm, this might be the first thing to investigate then. What does "never worked" mean? I am mostly interested in whether "systemctl status" on some service succeeds quickly or whether it hangs.

> As I do not know what is "namespace" then probably it is not.

For example, processes added with "nsenter" into an existing container namespace, or something like that.
(In reply to comment #6)
> Hmm, this might be the first thing to investigate then. What does "never
> worked" mean?

I belong to the crowd of people with a negative opinion of systemd.

> I am mostly interested if "systemctl status" on some service succeeds
> quickly or whether it hangs.

The kernel crashed during the weekend, so currently I do not have such a zombie anymore. I will reopen this bug when a new one appears.

> For example processes added with "nsenter" into an existing container
> namespace or something like that.

No.
To be prepared for when it happens again, please set systemd's log level to debug, either by:
1. booting with "systemd.log_level=debug" on the kernel command line, or
2. running "kill -56 1" at runtime (that's SIGRTMIN+22).
Using systemd debug mode:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.9.4-200.fc18.x86_64 root=UUID=393b7ea1-33fa-4d85-a05b-9ade4f2b8798 ro rd.luks.uuid=luks-28e0d1d7-c2b6-4c67-8b32-2efa2824011c rd.luks.uuid=luks-0fc6128b-b90f-477a-825a-e51da8919949 nomodeset systemd.log_level=debug LANG=en_US.UTF-8

Having the zombie task again:
28907 ?        ZN     0:00 [clone-new-threa] <defunct>

kernel-3.9.4-200.fc18.x86_64
systemd-201-2.fc18.7.x86_64

But I do not see anything interesting in:
grep -w systemd /var/log/messages
Created attachment 758391 [details]
/var/log/messages
Your syslog daemon is likely configured to throw away messages logged at debug level. Instead of /var/log/messages, let's see the output of "journalctl _PID=1".
Created attachment 758857 [details]
.xz of journalctl _PID=1

28907 ?        ZN     0:00 [clone-new-threa] <defunct>

# journalctl _PID=1 | grep 28907
Jun 07 06:09:23 systemd[1]: Received SIGCHLD from PID 28907 (gdb).
Jun 07 06:09:23 systemd[1]: Got SIGCHLD for process 28907 (gdb)
Jun 07 06:09:23 systemd[1]: Child 28907 died (code=killed, status=9/KILL)
Jun 08 03:41:41 systemd[1]: Received SIGCHLD from PID 28907 (gdb).
Jun 08 03:41:41 systemd[1]: Got SIGCHLD for process 28907 (gdb)
Jun 08 03:41:41 systemd[1]: Child 28907 died (code=killed, status=9/KILL)

Start time of the zombie process:
# cat /proc/28907/stat
28907 (clone-new-threa) Z 1 28870 28804 0 -1 50252 19 0 0 0 0 0 0 0 39 19 1 0 91680111 0 0 18446744073709551615 0 0 0 0 0 256 0 16781312 0 18446744071579258842 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 9
# cat /proc/uptime; date +%s
1055334.02 7237680.05
1370794440
# perl -MPOSIX -le 'print strftime("%F %T", localtime(1370794440 - 1055334.02 + 91680111/100));'
2013-06-08 03:45:07

So both grep entries seem to be unrelated; you may find more in the log.
This message is a reminder that Fedora 18 is nearing its end of life. Approximately four weeks from now Fedora will stop maintaining and issuing updates for Fedora 18. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue; we are sorry that we may not be able to fix it before Fedora 18 reaches end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Bug is still valid:
kernel-3.12.8-300.dpath.fc20.x86_64 (".dpath" = Bug 1050964 Comment 15 patch)
systemd-208-9.fc20.x86_64
I was finally able to reproduce this. Here's how to run the test from gdb's testsuite:

git clone git://sourceware.org/git/binutils-gdb.git
cd binutils-gdb
./configure
make -j8
cd gdb/testsuite
while :; do make check RUNTESTFLAGS="gdb.threads/clone-new-thread-event.exp"; done

I am now seeing the weird state, where several zombies, children of PID 1, do not get reaped:

$ ps -eo state,pid,ppid,comm | grep ^Z
Z  1575     1 clone-new-threa <defunct>
Z  4860     1 clone-new-threa <defunct>
Z 10530     1 clone-new-threa <defunct>
Z 20292     1 clone-new-threa <defunct>
Z 27840     1 clone-new-threa <defunct>
Z 28442     1 clone-new-threa <defunct>

I can send a SIGCHLD to PID 1 and see in strace that it tries to find dead child processes:

waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0

I attached gdb to systemd and stepped through manager_dispatch_sigchld(). After the call to:

waitid(P_ALL, 0, &si, WEXITED|WNOHANG|WNOWAIT)

the "siginfo_t si" structure contains only zeros, so systemd is not getting any information about those zombies.

I am running kernel-3.13.3-201.fc20.x86_64.

Using systemtap I looked into what's going on when systemd calls waitid.
Here's the stap script:

probe kernel.function("eligible_child") {
    if (pid() == 1) {
        printf("eligible_child: pid %d, p->exit_signal=%d, wo->wo_flags=0x%x\n",
               task_pid($p), $p->exit_signal, $wo->wo_flags);
    }
}
probe kernel.function("do_wait_thread") {
    if (pid() == 1) {
        printf("do_wait_thread pid=%d\n", pid());
    }
}

It gives me:

do_wait_thread pid=1
eligible_child: pid 459, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 480, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 486, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 509, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 521, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 674, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 691, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 692, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 709, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 714, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 715, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 719, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 725, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 733, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 737, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 738, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 739, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 741, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 746, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 748, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 847, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 888, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 900, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1030, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1268, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1278, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1350, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1349, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1365, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1369, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1466, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1474, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1514, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1516, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1497, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1521, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1529, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1540, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1542, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1560, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1554, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1589, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1583, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1600, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1605, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1788, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1772, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1770, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1963, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 2252, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 2306, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1575, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 28442, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 10530, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 20292, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 27840, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 4860, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 23638, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 28930, p->exit_signal=17, wo->wo_flags=0x1000005

Notice that the weird zombies all have p->exit_signal=0. According to the comment in eligible_child(), this makes them "clone" children. wo->wo_flags=0x1000005 means WNOHANG|WEXITED|WNOWAIT. None of the clone flags are set, so the clone children are not eligible. So systemd will never learn about them.

I suspect it's a kernel bug (something went wrong when reparenting the processes to PID 1?).
Great, thanks for the reproducer!
(In reply to Michal Schmidt from comment #15)
> I was finally able to reproduce this. Here's how to run the test from gdb's
> testsuite:
>
> git clone git://sourceware.org/git/binutils-gdb.git
> cd binutils-gdb
> ./configure
> make -j8
> cd gdb/testsuite
> while :; do make check RUNTESTFLAGS="gdb.threads/clone-new-thread-event.exp"; done

Thanks a lot, Michal. I guess I need to reserve a testing machine and reproduce this, but perhaps you already have one I can use?

> eligible_child: pid 28442, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 10530, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 20292, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 27840, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 4860, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 23638, p->exit_signal=17, wo->wo_flags=0x1000005
> eligible_child: pid 28930, p->exit_signal=17, wo->wo_flags=0x1000005
>
> Notice that the weird zombies all have p->exit_signal=0.

Yes. This is strange. This should only be possible if, say, 28442 was cloned by systemd/init with clone_flags & CSIGNAL == 0, or another exit_signal==0 child did clone(CLONE_PARENT).

> I suspect it's a kernel bug (Something went wrong when reparenting the
> processes to PID 1?).

I agree very much; this certainly looks like a bug in reparenting. The only problem is, this looks impossible ;) reparent_leader() always sets exit_signal = SIGCHLD.
(In reply to Oleg Nesterov from comment #17)
> Thanks a lot Michal. I guess I need to reserve the testing machine
> and reproduce, but perhaps you already have the one I can use?

Oleg, I did the test on my local workstation. If you want to come and have a look, you can. I have a couple of these zombies here right now.
I'll recheck, but it seems I have found the race which can perfectly explain this problem.

wait_task_zombie() sets exit_state = EXIT_DEAD, but then changes it back to EXIT_ZOMBIE if the child was ptrace-reparented and we are the tracer. If the original parent reparents this child in between, reparent_leader() sees EXIT_DEAD and doesn't update ->exit_signal, which stays == 0. This is nasty.

Ironically, I already found a similar race in 2009, but I didn't realize the problem is even worse. See https://www.redhat.com/archives/utrace-devel/2009-April/msg00053.html

I do not see a simple fix; need to think more...
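The interleaving described above can be sketched roughly as follows. This is pseudocode paraphrasing the comment (the function and field names follow the kernel, but it is not the actual kernel source):

```c
/* Suspected race, as described above (pseudocode, not kernel source):
 *
 * Tracer (e.g. gdb)                      Original parent (exiting)
 * -----------------                      -------------------------
 * wait_task_zombie(p):
 *   p->exit_state = EXIT_DEAD;
 *                                        reparent_leader(p):
 *                                          if (p->exit_state == EXIT_DEAD)
 *                                            return;  // skips the fixup:
 *                                                     // p->exit_signal is
 *                                                     // NOT set to SIGCHLD
 *   // child was ptrace-reparented and
 *   // we are only the tracer, so undo:
 *   p->exit_state = EXIT_ZOMBIE;
 *
 * Result: a zombie reparented to PID 1 with exit_signal == 0, which
 * eligible_child() treats as a "clone" child that a plain waitid()
 * from PID 1 will never report.
 */
```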
Created attachment 864639 [details]
may be patch

> I do not see a simple fix, need to think more...

Although perhaps this simple fix makes sense; it should close the most serious implications.
(In reply to Oleg Nesterov from comment #20)

I tested the patch and it really helps. Previously my system generated about 1 zombie per 500 iterations of the test. With the patch applied I'm now at 12000 iterations and no zombies.
(In reply to Michal Schmidt from comment #21)
> I tested the patch and it really helps. Previously my system generated 1
> zombie per about 500 iterations of the test. With the patch applied I'm now
> at 12000 iterations and no zombies.

Great! Thanks a lot. This confirms the theory.

I am working on the "right" fix(es), but perhaps this change makes sense as a minimal fix for stable. We will see.
Created attachment 865176 [details]
[PATCH] kill DEAD->ZOMBIE transition

This is what I am going to send to lkml, except that I need to split this change, and I am going to make other cleanups in wait_consider_task().

Michal, Jan, I'd appreciate it very much if you can test this patch as well ;)
(In reply to Oleg Nesterov from comment #23)
> Created attachment 865176 [details]
> [PATCH] kill DEAD->ZOMBIE transition

Oleg, the patch seems fine. 25000 test iterations and still no zombies. Thanks!
(In reply to Michal Schmidt from comment #24)
> the patch seems fine. 25000 test iterations and still no zombies.

Thanks a lot again! I sent the patches upstream.

I too tried to make a test case and exercise this fix; it seems to work.
Should be fixed by v3.14-10445-gabd50b39e7. I'll close this since 3.15 is in F20.