Bug 217112 - PTRACE_SETOPTIONS mysterious behavior
Status: CLOSED WORKSFORME
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Assignee: Roland McGrath
QA Contact: Brian Brock
Reported: 2006-11-24 03:46 UTC by Tom Horsley
Modified: 2007-11-30 22:11 UTC

Doc Type: Bug Fix
Last Closed: 2007-09-26 11:08:39 UTC

Attachments
test-setoptions.c program (c++ source code). (8.53 KB, text/plain)
2006-11-24 05:01 UTC, Tom Horsley
transcript of test runs with commentary (5.35 KB, text/plain)
2006-11-24 05:17 UTC, Tom Horsley
improved test-setoptions.c test program (11.12 KB, text/plain)
2006-11-24 16:49 UTC, Tom Horsley

Description Tom Horsley 2006-11-24 03:46:12 UTC
Description of problem:

Back when ptrace was ptrace (instead of a layer on utrace), there were two
PTRACE_SETOPTIONS function codes. The "old" code was 21 and did almost
nothing useful, the "new" code was 0x4200 and implemented all sorts of
fabulous features for following forks and clones and wot-not.

I notice in Fedora Core 6, the "old" code of 21 now implements all the
fabulous new features, and the "new" code of 0x4200 doesn't work at all.

Just wondering if this is intentional or an oversight in the new ptrace
layer (certainly the "new" version of setoptions was never documented
anywhere other than the kernel source code).

Version-Release number of selected component (if applicable):
2.6.18-1.2849.fc6

How reproducible:
Every time.

Steps to Reproduce:
1. Try to use ptrace function code 0x4200.
2. Try again with 21
  
Actual results:
0x4200 no longer works in FC6.

Expected results:
Was hoping for backward compatibility (not that it matters a lot, I long
ago produced an insanely complex test program to probe the ptrace service
call and describe to my debugger how it works on the current kernel.
With remarkable foresight (or luck :-), it already happens to be
checking both versions of PTRACE_SETOPTIONS to see which one works,
so I shouldn't have a problem, just thought someone might want to know).

Additional info:

Since PTRACE_SETOPTIONS is actually documented in the ptrace(2) man page
now, I'm guessing the change is intentional.
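
For readers who want to check the two request codes on their own kernel, here is a minimal sketch of such a probe. It is not the attached test-setoptions.c; the function name probe_setoptions and the REQ_* macro names are made up for illustration, and only the numeric values 21 and 0x4200 come from the report. It assumes a Linux libc whose <sys/ptrace.h> defines PTRACE_O_TRACEFORK.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Request numbers quoted in the report; the macro names are illustrative. */
#define REQ_SETOPTIONS_OLD 21
#define REQ_SETOPTIONS_NEW 0x4200

/*
 * Fork a child that asks to be traced and stops itself, then issue the
 * given raw request against it with PTRACE_O_TRACEFORK as the option mask.
 * Returns 0 if the kernel accepted the request and -1 otherwise; after a
 * rejected request, errno holds the value set by ptrace().
 */
int probe_setoptions(int request)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: exit rather than stop untraced if tracing is refused. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status)) {
        /* Child exited before reaching its traced stop; already reaped. */
        errno = ECHILD;
        return -1;
    }

    long rc = ptrace(request, pid, NULL, (void *)(long)PTRACE_O_TRACEFORK);
    int saved_errno = errno;

    /* Clean up the stopped child whatever the outcome. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);

    errno = saved_errno;
    return rc < 0 ? -1 : 0;
}
```

Calling probe_setoptions(REQ_SETOPTIONS_NEW) and probe_setoptions(REQ_SETOPTIONS_OLD) in turn and printing strerror(errno) on failure shows which code a given kernel accepts. Acceptance alone does not prove that the fork-following options actually take effect.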

Comment 1 Tom Horsley 2006-11-24 03:51:02 UTC
Or maybe it isn't intentional. I just noticed this in the <sys/ptrace.h> file:

  PTRACE_SETOPTIONS = 0x4200,

That's the definition that no longer works.


Comment 2 Tom Horsley 2006-11-24 04:59:41 UTC
Things are more mysterious than I at first assumed. It appears as though the
PTRACE_SETOPTIONS behavior is more random than anything else. I'm attaching a
test program and a transcript of running it several times with different
results each time.


Comment 3 Tom Horsley 2006-11-24 05:01:56 UTC
Created attachment 142030
test-setoptions.c program (c++ source code).

I compiled this program on both FC5 and FC6 (x86_64), and running both versions
on FC6 produces very strange behavior (see next attachment).

Comment 4 Tom Horsley 2006-11-24 05:17:54 UTC
Created attachment 142031
transcript of test runs with commentary

Looks like PTRACE_SETOPTIONS behaves in completely random fashion.

Comment 5 Tom Horsley 2006-11-24 13:14:30 UTC
I think I cut too much stuff out of the test program trying to make a smaller
example. I'll be replacing the test later today with one that provides more
info about exactly what is going on.

Comment 6 Tom Horsley 2006-11-24 16:49:43 UTC
Created attachment 142076
improved test-setoptions.c test program

OK. I admit it! I have no idea if this is a bug or not. The new test program
shows that the source of my confusion is that the fork event message for the
parent is sometimes (randomly) delivered after the initial status for the
child has already shown up. This certainly feels like a bug: the point of
getting the fork event at all is surely to warn you that the child is coming
soon, so you can avoid modifying the parent until you have control of the
child and are sure of what was in the image that forked.

However, running this on FC5 seems to indicate that the same random behavior
exists there as well (but maybe the latest FC5 kernel also has the utrace
code?)

Comment 7 Tom Horsley 2006-11-24 16:56:57 UTC
On a RHEL 4 system at work (4 cpus, x86_64 arch), the test program never
once showed the random behavior, so I'm suspecting this really is a bug,
and somehow the random status delivery order got introduced by recent
ptrace changes.


Comment 8 Tom Horsley 2006-12-01 13:27:03 UTC
Even more interesting information: I ran this test program at work on
a Fedora Core 5 machine with the 2.6.18-1.2239.fc5smp kernel (dual
Xeon system hyperthreaded to look like 4 cpus). In a shell script loop
like so:

while true
do
./test-setoptions 2>&1 | fgrep ERR
./test-setoptions new 2>&1 | fgrep ERR
done

It ran for a second with no errors, then spewed a block of about 7 ERR
lines all at once, then the system was dead :-(. Completely froze up. No
response to keyboard. Had to hit the reset button.


Comment 9 Roland McGrath 2006-12-05 04:32:51 UTC
There has always been a race between the clone event report in the parent and
the initial SIGSTOP report in the child.  It was exceedingly rare in the old
implementation, probably seen only with lots of preemption or on really fast
SMP.  It is much less rare now.

Is there in fact any SETOPTIONS issue, or just the report order?

Comment 10 Tom Horsley 2006-12-05 10:52:53 UTC
Up until comment #8, I would have said the report order being wrong was
the issue (and I'd still call that a bug - I don't see any value to
stopping the parent at all if it isn't going to stop first),
but in #8 the test program crashed my system when running in a loop
over and over, so that is definitely a bug (though perhaps a hard
one to reproduce).

So far I've only seen the crash on the one machine, but I'm pretty sure the
attached test program is what crashed it.

Comment 11 Tom Horsley 2006-12-05 12:32:07 UTC
I just ran the loop from comment #8 again on the same machine and it crashed
again after running for a bit then getting an ERR output line from one of the
test runs. It seems to be a pretty reliable crash on this machine, but I haven't
been able to crash my home system.

The machine that crashes is running Fedora Core 5, kernel 2.6.18-1.2239.fc5smp
with dual Intel(R) XEON(TM) CPU 2.20GHz CPUs (hyperthreading enabled, so it
looks like 4 cpus to the kernel). 1 gig of rambus (bleh :-) memory on a
Super P4DC6 or P4DC6+ (not sure which) motherboard. In two tries it has only
taken about a minute to crash each time. (And the machine normally stays up
forever - it has been very stable under normal use).

The machine that doesn't seem to crash is an AMD Athlon 64 X2 4400+ dual core
cpu, 2 gig of memory, and a BIOSTAR TForce4U socket 939 motherboard. I've got
lots of different boot partitions on it and have tried both 32 and 64 bit
Fedora Core 6 and Core 5, and no crash on this machine with any of those
kernels (at least not in the time I was willing to wait).


Comment 12 Chuck Ebbert 2007-06-22 15:07:43 UTC
Tom, can you retest this? Fedora 5/6 are now on kernel 2.6.20, and I can't reproduce
any kind of hang on a 2 x dual-core Xeon on Fedora 6 with kernel 2.6.20-1.2962.fc6.


Comment 13 Tom Horsley 2007-06-22 16:01:10 UTC
The machine I previously ran this on has just been regenned to Fedora 7,
so I tested it there, and the crash no longer happens (though the out
of sequence parent/child status certainly does, which I'd still call
a bug :-).

uname -a gives:
Linux tweety 2.6.21-1.3228.fc7 #1 SMP Tue Jun 12 15:37:31 EDT 2007 i686 i686 i386 GNU/Linux

Ran a few minutes without crashing and since it only took a few seconds
before, I assume it is working now.


Comment 14 Roland McGrath 2007-06-22 17:28:29 UTC
Such ordering has never been guaranteed and it is only luck that you have never
seen it with a vanilla kernel.
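
Since the ordering is not guaranteed, a tracer presumably has to accept either report first. The following is a minimal sketch, not the attached test program, that traces a single fork() and records which stop arrived first. The names observe_fork_report_order and kill_and_reap are hypothetical. It assumes the calling process has no other children, because it uses waitpid(-1, ...), and it assumes a libc that defines PTRACE_O_TRACEFORK and PTRACE_EVENT_FORK in <sys/ptrace.h>.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Kill and reap a process we are still responsible for. */
static void kill_and_reap(pid_t pid)
{
    if (pid > 0) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
}

/*
 * Trace one fork() and report which stop the tracer saw first.
 * Returns 0 if the parent's fork event came first, 1 if the new child's
 * initial stop came first, and -1 if the experiment could not be run.
 */
int observe_fork_report_order(void)
{
    pid_t tracee = fork();
    if (tracee < 0)
        return -1;
    if (tracee == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);   /* give the tracer a chance to set options */
        (void)fork();     /* the event under test; both copies just exit */
        _exit(0);
    }

    int status;
    if (waitpid(tracee, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    if (ptrace(PTRACE_SETOPTIONS, tracee, NULL,
               (void *)(long)PTRACE_O_TRACEFORK) < 0 ||
        ptrace(PTRACE_CONT, tracee, NULL, NULL) < 0) {
        kill_and_reap(tracee);
        return -1;
    }

    pid_t child = -1;
    int saw_event = 0, saw_child = 0, first = -1;
    while (!(saw_event && saw_child)) {
        pid_t w = waitpid(-1, &status, 0);
        if (w < 0)
            break;                        /* nothing left to wait for */
        if (!WIFSTOPPED(status)) {        /* an exit: that pid is reaped */
            if (w == tracee)
                tracee = -1;
            else if (w == child)
                child = -1;
            continue;
        }
        if (w != tracee) {                /* initial stop of the new child */
            child = w;
            saw_child = 1;
            if (first < 0)
                first = 1;
        } else if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_FORK << 8))) {
            saw_event = 1;                /* fork event in the parent */
            if (first < 0)
                first = 0;
        } else {
            /* Unrelated signal stop: pass the signal on and keep waiting. */
            ptrace(PTRACE_CONT, tracee, NULL,
                   (void *)(long)WSTOPSIG(status));
        }
    }

    kill_and_reap(child);
    kill_and_reap(tracee);
    return (saw_event && saw_child) ? first : -1;
}
```

Running this in a loop and counting the 0 and 1 results gives a rough picture of how often each order occurs on a particular kernel and machine. A real debugger would keep a list of not-yet-announced children rather than a single pid.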

