958505 – Stale zombie/<defunct> process

Bug 958505 - Stale zombie/<defunct> process

Summary: Stale zombie/<defunct> process

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1535715
TreeView+	depends on / blocked

Reported:	2013-05-01 17:01 UTC by Jan Kratochvil
Modified:	2018-01-17 23:24 UTC (History)
CC List:	17 users (show)
Fixed In Version:
Clone Of:
Clones:	1535715 (view as bug list)
Environment:
Last Closed:	2014-08-15 03:01:13 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
/proc/2641/status (693 bytes, text/plain) 2013-05-01 17:01 UTC, Jan Kratochvil	no flags	Details
/var/log/messages (44.42 KB, text/plain) 2013-06-08 02:31 UTC, Jan Kratochvil	no flags	Details
.xz of journalctl _PID=1 (818.93 KB, application/octet-stream) 2013-06-09 16:17 UTC, Jan Kratochvil	no flags	Details
may be patch (785 bytes, patch) 2014-02-18 16:04 UTC, Oleg Nesterov	no flags	Details \| Diff
[PATCH] kill DEAD->ZOMBIE transition (3.42 KB, patch) 2014-02-19 18:29 UTC, Oleg Nesterov	no flags	Details \| Diff
View All

Description Jan Kratochvil 2013-05-01 17:01:38 UTC

Created attachment 742272 [details]
/proc/2641/status

Description of problem:
Occasionally I find a stale process which cannot be even kill -9 as root.

Version-Release number of selected component (if applicable):
systemd-197-1.fc18.2.x86_64

How reproducible:
It comes from the GDB testsuite running hundreds times so it is very rare to reproduce.

Steps to Reproduce:
Run GDB testsuite for many hundreds times.
ps axwf

Actual results:
 2641 ?        ZN     0:00 [clone-new-threa] <defunct>

Expected results:
No such stale process, it did not happen with former inits.

Additional info:

Comment 1 Jan Kratochvil 2013-05-01 17:02:32 UTC

Code of the stale process is most probably:
http://sourceware.org/git/?p=gdb.git;a=blob;f=gdb/testsuite/gdb.threads/clone-new-thread-event.c

Comment 2 Michal Schmidt 2013-05-03 22:16:30 UTC

(In reply to comment #0)
> Occasionally I find a stale process which cannot be even kill -9 as root.

Because zombies are already dead and just waiting to be reaped.

> Actual results:
>  2641 ?        ZN     0:00 [clone-new-threa] <defunct>

Is PID 1 the current parent of this zombie?
Does the zombie exist forever or does it eventually go away?
Try restarting some random service via systemctl. That will surely cause some SIGCHLD signals to be delivered to systemd. systemd should react by reaping all its children zombies then.

(In reply to comment #1)
> Code of the stale process is most probably:
> http://sourceware.org/git/?p=gdb.git;a=blob;f=gdb/testsuite/gdb.threads/
> clone-new-thread-event.c

I tried running this program in a tight loop, but it did not reproduce the bug.

We need a more reliable way to reproduce or there's not much we can do here.

Comment 3 Jan Kratochvil 2013-05-04 05:00:55 UTC

(In reply to comment #2)
> Is PID 1 the current parent of this zombie?

Yes, this is why you have the /proc/PID/status attached here and why I have filed it for systemd.


> Does the zombie exist forever or does it eventually go away?

The Comment #0 zombie PID still exists on that box.


> Try restarting some random service via systemctl. That will surely cause
> some SIGCHLD signals to be delivered to systemd. systemd should react by
> reaping all its children zombies then.

But this is a workaround of a systemd bug (or it could be also kernel bug...).


> We need a more reliable way to reproduce or there's not much we can do here.

Maybe I could patch local systemd with some debug patch to know more when it happens again.

Comment 4 Lennart Poettering 2013-05-06 15:05:26 UTC

If this happens is systemditself still responsive? i.e. does "systemctl" and friends still work correctly?

Usually, if processes aren't reaped anymore then this happens because PID1 is dead.

There's not some weird namespace stuff involved, is there?

Comment 5 Jan Kratochvil 2013-05-06 15:18:44 UTC

(In reply to comment #4)
> If this happens is systemditself still responsive? i.e. does "systemctl" and
> friends still work correctly?

I do not know how to answer it, systemd has never worked.
"systemctl restart ..." sometimes does not (Bug 924581) but sometimes it works so at least partially systemd works.


> Usually, if processes aren't reaped anymore then this happens because PID1
> is dead.

Definitely thousands of processes have been successfully reaped afterwards.


> There's not some weird namespace stuff involved, is there?

As I do not know what is "namespace" then probably it is not.

Comment 6 Lennart Poettering 2013-05-06 15:42:41 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > If this happens is systemditself still responsive? i.e. does "systemctl" and
> > friends still work correctly?
> 
> I do not know how to answer it, systemd has never worked.

Hmm, this might be the first thing to investigate then. What does "never worked" mean?

I am mostly interested if "systemctl status" on some service succeeds quickly or whether it hangs.

> As I do not know what is "namespace" then probably it is not.

For example processes added with "nsenter" into an existing container namespace or something like that.

Comment 7 Jan Kratochvil 2013-05-06 16:28:42 UTC

(In reply to comment #6)
> Hmm, this might be the first thing to investigate then. What does "never
> worked" mean?

I belong to the crowd of people with negative opinion on systemd.


> I am mostly interested if "systemctl status" on some service succeeds
> quickly or whether it hangs.

The kernel crashed during weekend so currently I do not have such zombie anymore.  I will reopen this Bug when a new one appears.


> For example processes added with "nsenter" into an existing container
> namespace or something like that.

No.

Comment 8 Michal Schmidt 2013-05-07 16:28:18 UTC

To be prepared for when it happens again, please set systemd's log level to debug. Either by:
1. Booting with "systemd.log_level=debug" on the kernel command line, or
2. "kill -56 1" at runtime (that's SIGRTMIN+22).

Comment 9 Jan Kratochvil 2013-06-08 02:30:38 UTC

Using systemd debug mode:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.9.4-200.fc18.x86_64 root=UUID=393b7ea1-33fa-4d85-a05b-9ade4f2b8798 ro rd.luks.uuid=luks-28e0d1d7-c2b6-4c67-8b32-2efa2824011c rd.luks.uuid=luks-0fc6128b-b90f-477a-825a-e51da8919949 nomodeset systemd.log_level=debug LANG=en_US.UTF-8

Having the zombie task again:
28907 ?        ZN     0:00 [clone-new-threa] <defunct>

kernel-3.9.4-200.fc18.x86_64
systemd-201-2.fc18.7.x86_64

But I do not see anything interesting in: grep -w systemd /var/log/messages

Comment 10 Jan Kratochvil 2013-06-08 02:31:13 UTC

Created attachment 758391 [details]
/var/log/messages

Comment 11 Michal Schmidt 2013-06-09 07:38:29 UTC

Your syslog daemon is likely configured to throw away messages logged at debug level.
Instead of /var/log/messages, let's see the output of "journalctl _PID=1".

Comment 12 Jan Kratochvil 2013-06-09 16:17:46 UTC

Created attachment 758857 [details]
.xz of journalctl _PID=1

28907 ?        ZN     0:00 [clone-new-threa] <defunct>
# journalctl _PID=1|grep 28907
Jun 07 06:09:23 systemd[1]: Received SIGCHLD from PID 28907 (gdb).
Jun 07 06:09:23 systemd[1]: Got SIGCHLD for process 28907 (gdb)
Jun 07 06:09:23 systemd[1]: Child 28907 died (code=killed, status=9/KILL)
Jun 08 03:41:41 systemd[1]: Received SIGCHLD from PID 28907 (gdb).
Jun 08 03:41:41 systemd[1]: Got SIGCHLD for process 28907 (gdb)
Jun 08 03:41:41 systemd[1]: Child 28907 died (code=killed, status=9/KILL)

start time of the zombie process:
# cat /proc/28907/stat
28907 (clone-new-threa) Z 1 28870 28804 0 -1 50252 19 0 0 0 0 0 0 0 39 19 1 0 91680111 0 0 18446744073709551615 0 0 0 0 0 256 0 16781312 0 18446744071579258842 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 9
# cat /proc/uptime;date +%s
1055334.02 7237680.05
1370794440
# perl -MPOSIX -le 'print strftime("%F %T",localtime( 1370794440-1055334.02+91680111/100 ));'
2013-06-08 03:45:07

So both grep entries seem to be unrelated, you may find more in the log.

Comment 13 Fedora End Of Life 2013-12-21 15:30:22 UTC

This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 14 Jan Kratochvil 2014-01-26 11:04:55 UTC

Bug is still valid:
kernel-3.12.8-300.dpath.fc20.x86_64 (".dpath" = Bug 1050964 Comment 15 patch)
systemd-208-9.fc20.x86_64

Comment 15 Michal Schmidt 2014-02-17 20:39:28 UTC

I was finally able to reproduce this. Here's how to run the test from gdb's testsuite:

git clone git://sourceware.org/git/binutils-gdb.git
cd binutils-gdb
./configure
make -j8
cd gdb/testsuite
while :; do make check RUNTESTFLAGS="gdb.threads/clone-new-thread-event.exp"; done


I am now seeing the weird state, where several zombies, children of PID 1, do not get reaped:

$ ps -eo state,pid,ppid,comm | grep ^Z
Z  1575     1 clone-new-threa <defunct>
Z  4860     1 clone-new-threa <defunct>
Z 10530     1 clone-new-threa <defunct>
Z 20292     1 clone-new-threa <defunct>
Z 27840     1 clone-new-threa <defunct>
Z 28442     1 clone-new-threa <defunct>

I can send a SIGCHLD to PID 1 and see in strace that it tries to find dead children processes:
waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0

I attached gdb to systemd and stepped through manager_dispatch_sigchld(). 
After the call to:
   waitid(P_ALL, 0, &si, WEXITED|WNOHANG|WNOWAIT)
the "siginfo_t si" structure contains only zeros, so systemd is not getting any information about those zombies.

I am running kernel-3.13.3-201.fc20.x86_64. Using systemtap I looked into what's going on when systemd calls waitid. Here's the stap script:

probe kernel.function("eligible_child") {
	if (pid() == 1) {
		printf("eligible_child: pid %d, p->exit_signal=%d, wo->wo_flags=0x%x\n", task_pid($p), $p->exit_signal, $wo->wo_flags);
	}
}

probe kernel.function("do_wait_thread") {
	if (pid() == 1) {
		printf("do_wait_thread pid=%d\n", pid());
	}
}


It gives me:
do_wait_thread pid=1
eligible_child: pid 459, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 480, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 486, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 509, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 521, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 674, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 691, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 692, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 709, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 714, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 715, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 719, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 725, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 733, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 737, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 738, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 739, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 741, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 746, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 748, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 847, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 888, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 900, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1030, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1268, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1278, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1350, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1349, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1365, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1369, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1466, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1474, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1514, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1516, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1497, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1521, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1529, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1540, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1542, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1560, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1554, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1589, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1583, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1600, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1605, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1788, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1772, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1770, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1963, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 2252, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 2306, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 1575, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 28442, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 10530, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 20292, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 27840, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 4860, p->exit_signal=0, wo->wo_flags=0x1000005
eligible_child: pid 23638, p->exit_signal=17, wo->wo_flags=0x1000005
eligible_child: pid 28930, p->exit_signal=17, wo->wo_flags=0x1000005

Notice that the weird zombies all have p->exit_signal=0. According to the comment in eligible_child(), this makes them "clone" children.
wo->wo_flags=0x1000005 means WNOHANG|WEXITED|WNOWAIT. None of the clone flags are set, so the clone children are not eligible. So systemd will never learn about them.

I suspect it's a kernel bug (Something went wrong when reparenting the processes to PID 1?).

Comment 16 Jan Kratochvil 2014-02-17 20:46:31 UTC

Great, thanks for the reproducer!

Comment 17 Oleg Nesterov 2014-02-18 14:51:04 UTC

(In reply to Michal Schmidt from comment #15)
>
> I was finally able to reproduce this. Here's how to run the test from gdb's
> testsuite:
> 
> git clone git://sourceware.org/git/binutils-gdb.git
> cd binutils-gdb
> ./configure
> make -j8
> cd gdb/testsuite
> while :; do make check
> RUNTESTFLAGS="gdb.threads/clone-new-thread-event.exp"; done

Thanks a lot Michal. I guess I need to reserve the testing machine
and reproduce, but perhaps you already have the one I can use?
 
> eligible_child: pid 28442, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 10530, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 20292, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 27840, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 4860, p->exit_signal=0, wo->wo_flags=0x1000005
> eligible_child: pid 23638, p->exit_signal=17, wo->wo_flags=0x1000005
> eligible_child: pid 28930, p->exit_signal=17, wo->wo_flags=0x1000005
> 
> Notice that the weird zombies all have p->exit_signal=0.

Yes. This is strange. This should be only possible if, say, 28442
was cloned by systemd/init with clone_flags & CSIGNAL == 0. Or
another exit_signal==0 child did clone(CLONE_PARENT).

> I suspect it's a kernel bug (Something went wrong when reparenting the
> processes to PID 1?).

I agree very much, this certainly looks like a bug in reparenting.
The only problem is, this looks impossible ;)  reparent_leader()\
always sets exit_signal = SIGCHLD.

Comment 18 Michal Schmidt 2014-02-18 14:56:47 UTC

(In reply to Oleg Nesterov from comment #17)
> Thanks a lot Michal. I guess I need to reserve the testing machine
> and reproduce, but perhaps you already have the one I can use?

Oleg, I did the test on my local workstation. If you want to come and have a look, you can. I have a couple of these zombies here right now.

Comment 19 Oleg Nesterov 2014-02-18 15:58:24 UTC

I'll recheck, but it seems I have found the race which can perfectly
explain this problem.

wait_task_zombie() sets exit_state = EXIT_DEAD but then changes it back
to EXIT_ZOMBIE if the child was ptrace-reparented and we are the tracer.

If the original parent reparents this child in between, reparent_leader()
sees EXIT_DEAD and doesn't update ->exit_signal == 0.

This is nasty. Ironically, I already found the similar race in 2009,
but I didn't realize this problem is even worse. See
https://www.redhat.com/archives/utrace-devel/2009-April/msg00053.html

I do not see a simple fix, need to think more...

Comment 20 Oleg Nesterov 2014-02-18 16:04:34 UTC

Created attachment 864639 [details]
may be patch

> I do not see a simple fix, need to think more...

although perhaps this simple fix makes sense, it should close
the most serious implications.

Comment 21 Michal Schmidt 2014-02-19 11:41:20 UTC

(In reply to Oleg Nesterov from comment #20)
I tested the patch and it really helps. Previously my system generated 1 zombie per about 500 iterations of the test. With the patch applied I'm now at 12000 iterations and no zombies.

Comment 22 Oleg Nesterov 2014-02-19 15:09:17 UTC

(In reply to Michal Schmidt from comment #21)
> (In reply to Oleg Nesterov from comment #20)
> I tested the patch and it really helps. Previously my system generated 1
> zombie per about 500 iterations of the test. With the patch applied I'm now
> at 12000 iterations and no zombies.

Great! Thanks a lot. This confirms the theory.

I am working on the "right" fix(es), but perhaps this change makes sense
as a minimal fix for stable. We will see.

Comment 23 Oleg Nesterov 2014-02-19 18:29:54 UTC

Created attachment 865176 [details]
[PATCH] kill DEAD->ZOMBIE transition

This is what I am going to send to lkml. Except I need to split this
change. And I am going to make other cleanups in wait_consider_task().

Michal, Jan, I'll appreciate very much if you can test this patch as
well ;)

Comment 24 Michal Schmidt 2014-02-20 13:11:11 UTC

(In reply to Oleg Nesterov from comment #23)
> Created attachment 865176 [details]
> [PATCH] kill DEAD->ZOMBIE transition

Oleg,
the patch seems fine. 25000 test iterations and still no zombies.
Thanks!

Comment 25 Oleg Nesterov 2014-02-20 17:50:20 UTC

(In reply to Michal Schmidt from comment #24)
>
> the patch seems fine. 25000 test iterations and still no zombies.

Thanks a lot again! I sent the patches upstream.

I too tried to make a test-case and test this fix, seems to work.

Comment 26 Zbigniew Jędrzejewski-Szmek 2014-08-15 03:01:13 UTC

Should be fixed by v3.14-10445-gabd50b39e7. I'll close this since 3.15 is in F20.

Note You need to log in before you can comment on or make changes to this bug.