Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 150803

Summary:	64 bit debugger can't ptrace a 32 bit program
Product:	Red Hat Enterprise Linux 4	Reporter:	Michael Waite <mwaite>
Component:	kernel	Assignee:	Rik van Riel <riel>
Status:	CLOSED WORKSFORME	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.0	CC:	davej, riel, roland, tao
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-03-11 20:20:14 UTC	Type:	---

Description Michael Waite 2005-03-10 19:20:51 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
The problem only affects
32 bit processes running on 64 bit hardware
with the 64 bit kernel. We have seen
the problem with the RHEL 4.0 2.6.9-5.EL
kernel version. It appears that a
vanilla kernel.org 2.6.9 kernel doesn't
exhibit this problem so we believe it to
be something introduced by a Red Hat kernel
patch.

The issue occurs when a user tries to 
debug any program which does a fork()
call to create a new process. 

What the user sees is that they run their
program and when it gets to the fork call 
the kernel suddenly terminates the TotalView 
process. 

Looking at what TotalView is doing and
at system traces reveals that TotalView
tries to use ptrace(PTRACE_PEEKDATA, ...)
to read the vDSO page of the newly created process;
the kernel then encounters some sort of problem
and terminates TotalView with a SIGSEGV.

TotalView should be able to read the vDSO page.
This is something that TotalView does during normal operation 
and the trace indicates that TV was able to successfully read the vDSO
page in the parent process just a moment before the fatal attempt
to read the vDSO page on the child process. 
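
As a rough illustration of the read described above: PTRACE_PEEKDATA returns one word per call, so reading part of a page means looping over word-sized peeks. The sketch below is not TotalView's code. `peek_word` is a hypothetical callable standing in for the real ptrace(PTRACE_PEEKDATA, pid, addr, 0) call, and `make_fake_peek` is an in-memory stand-in so that the loop can be run without a tracee. The 8-byte little-endian word assumes a 64 bit tracer on x86_64.

```python
import struct

WORD = 8  # assumed sizeof(long) in a 64 bit tracer on x86_64


def read_range(peek_word, start, length):
    """Assemble `length` bytes starting at `start` from word-sized peeks.

    `peek_word(addr)` is a hypothetical callable that returns one word of
    tracee memory, as a wrapper around PTRACE_PEEKDATA would.
    """
    out = bytearray()
    addr = start
    while len(out) < length:
        word = peek_word(addr)
        # Pack as an unsigned little-endian 64-bit value.
        out += struct.pack("<Q", word & 0xFFFFFFFFFFFFFFFF)
        addr += WORD
    return bytes(out[:length])


def make_fake_peek(base, memory):
    """Build an in-memory stand-in for a tracee (illustration only)."""
    def peek(addr):
        off = addr - base
        if off < 0 or off + WORD > len(memory):
            raise OSError("address not readable in fake tracee")
        return struct.unpack_from("<Q", memory, off)[0]
    return peek


# Fake 4 KiB page placed at the vDSO address quoted in this report.
fake_page = bytes(range(256)) * 16
peek = make_fake_peek(0xFFFFE000, fake_page)
first_bytes = read_range(peek, 0xFFFFE400, 4)
```

In the failure reported here the debugger does not get an error return from the peek at all; the kernel kills the debugger with SIGSEGV instead, so a loop like this has no chance to handle the problem.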

Version-Release number of selected component (if applicable):
kernel-2.6.9-5.EL

How reproducible:
Always

Steps to Reproduce:
1. Under TotalView, debug any program which does a fork()
call to create a new process.
2. Run the program until it gets to the fork call;
the kernel then suddenly terminates the TotalView
process.

Actual Results:  Looking at what TotalView is doing and
at system traces reveals that TotalView
tries to use ptrace(PTRACE_PEEKDATA, ...)
to read the vDSO page of the newly created process;
the kernel then encounters some sort of problem
and terminates TotalView with a SIGSEGV.

Expected Results:  TotalView should be able to read the vDSO page of the child process without being terminated.

Additional info:

Comment 1 Rik van Riel 2005-03-10 19:59:07 UTC

Just to narrow things down, is TotalView a 32 bit or a 64 bit program?

If the bug happens with a 64 bit TotalView, could you also try with a 32 bit
version of the program?

Comment 2 Michael Waite 2005-03-10 20:08:30 UTC

Waiting on feedback from Etnus specific to your question.

Comment 3 Michael Waite 2005-03-10 20:34:45 UTC

The TotalView main process (the one doing the ptrace calls) is 64
bit on the x86-64. TotalView is an example of a product with
mixed 32 and 64 bit pieces -- if you graph an array on an x86-64
Linux system TV launches a 32 bit process to do the visualization.

They will try it with their 32 bit TotalView (which isn't intended
for use on the 64 bit Linux kernels and is unable to debug
64 bit targets).

Comment 4 Chris Gottbrath 2005-03-10 21:21:43 UTC

Mike -- thanks for filing this! 

Just a point of clarification wrt the title. With the 
exception of this specific instance it appears that 
ptrace()ing a 32 bit process from within a 64 bit process 
is working just fine. Using ptrace() to look at the vDSO 
page (the exact same address) in the parent process seems
to work fine. The parent is also 32 bit and the same debugger
(64 bit) is doing the call. It appears that there may be something
different about the state of this process which is in the late
stages of being cloned.

Comment 5 Michael Waite 2005-03-10 21:39:05 UTC

Some background for the benefit of the Red Hat folks:

- On Linux x86_64 TotalView is a 64-bit process, and it is capable of
  debugging both 64-bit and 32-bit processes from the same debugger
  process.  The Linux x86_64 kernel has supported this from day one, and
  I expect that to continue to be the case.

- The thing that has never worked is a 32-bit debugger process
  debugging a 64-bit target process.  We tried it at one time, it
  didn't work, so we made TV a 64-bit process.  We don't care that
  this doesn't work.

Now, for the bug at hand:

- In order to follow fork() system calls in a target process, the
  debugger traces sys_clone system calls and sets the CLONE_PTRACE bit
  while the target is stopped in the sys_clone system call.  That
  causes the debugger to become attached to the "newborn" process
  (child of the target process).

- The newborn process stops with a SIGSTOP on exit from the sys_clone
  system call.  The debugger gets the wait() event, and reads the
  registers of the newborn process.  At this point, the PC (rip
  register) of the newborn points into the vDSO (syscall page).

- We can see from the /proc/<pid>/maps file that the vDSO page is
  mapped into the process.  The mapping looks like this:

  ffffe000-fffff000 ---p 00000000 00:00 0

  The PC of the newborn process is at 0xffffe403.

- One of two things can now happen:

  1) If the debugger attempts to read from the vDSO page in the
     newborn process at this point in the newborn's life (remember,
     it's stopped on exit from the sys_clone system call), the kernel
     takes an exception, and defensively kills the debugger process.
     Game over.

  2) If instead, the debugger allows the newborn process to continue
     to a breakpoint, and then reads the vDSO page in the newborn
     process, it works OK.

So, I think the problem is that if the debugger reads from the
vDSO page too soon in a newborn process's life, then ptrace triggers
an exception.
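
The check against /proc/<pid>/maps described in the list above can be sketched as follows. This is an illustration only, not the debugger's code: `parse_maps_line` and `mapping_contains` are hypothetical helper names, and the regular expression assumes the usual "start-end perms offset dev inode [path]" layout of a maps line. The mapping line and the PC value are the ones quoted in this comment.

```python
import re

# Assumed layout of one /proc/<pid>/maps line:
#   start-end perms offset dev inode [path]
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+([0-9a-f]+)\s+(\S+)\s+(\d+)\s*(.*)$"
)


def parse_maps_line(line):
    """Parse one maps line into a small dict (hypothetical helper)."""
    m = MAPS_LINE.match(line.strip())
    if m is None:
        raise ValueError("not a maps line: %r" % line)
    start, end, perms, _offset, _dev, _inode, path = m.groups()
    return {
        "start": int(start, 16),
        "end": int(end, 16),  # end address is exclusive
        "perms": perms,
        "path": path,
    }


def mapping_contains(mapping, addr):
    """True if addr lies inside the half-open range [start, end)."""
    return mapping["start"] <= addr < mapping["end"]


# Values quoted in this comment.
vdso = parse_maps_line("ffffe000-fffff000 ---p 00000000 00:00 0")
newborn_pc = 0xFFFFE403
pc_in_vdso = mapping_contains(vdso, newborn_pc)
```

Note that the permissions in the quoted line are `---p`, so a check based only on the maps file can confirm that a page is mapped at the PC, but it says nothing about whether a ptrace read of that page will succeed.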

Comment 6 Chris Gottbrath 2005-03-10 22:01:50 UTC

I tried the test with our version of TotalView intended for 32 bit
linux. It generated all kinds of errors earlier in the debugging
process. I'd been afraid that would happen since the x86-64 version of
TV is the one that knows how to deal with the 64 bit kernel.

As pointed out in the subsequent comments -- there is more to this
than simply the bit depth of the debugger and the target.

Comment 8 Rik van Riel 2005-03-10 22:18:02 UTC

If this is what we think it is, the bug was already fixed in
kernel-2.6.9-5.0.3.EL. Could you please verify whether the problem still occurs
with the errata kernel that went out right after RHEL4 GA?

Comment 9 Michael Waite 2005-03-11 20:18:45 UTC

Great News!  The updated kernel does fix the problem.

Is there any clever mechanism (other than using uname)
to check and see if the kernel is going to be broken 
in this way?

If not, are there any other Red Hat issued kernels that
are likely to have the problem?

I'm asking because we would like to 
add some logic in TV to watch for broken
kernels. Could we just look at uname -a and
if it matches 2.6.9-5.EL we have a broken 
kernel and if it is anything else we are ok?
Was there a 2.6.9-5.0.1 or 2.6.9-5.0.2? Are there
any versions < 2.6.9-5.EL that might be in circulation
which might have the bug?
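
A minimal sketch of the kind of uname-based check asked about above. It only encodes what this report establishes: 2.6.9-5.EL shows the problem and 2.6.9-5.0.3.EL does not. Everything else is reported as "unknown" rather than assumed to be OK, because the questions about other releases are not answered here. `classify_release` is a hypothetical helper name, and the handling of a trailing flavor suffix after "EL" (for example "smp") is an assumption about how release strings are formed.

```python
import platform
import re

# Only the two releases discussed in this report are classified.
KNOWN_BROKEN = {"2.6.9-5"}      # kernel-2.6.9-5.EL, reported broken here
KNOWN_FIXED = {"2.6.9-5.0.3"}   # kernel-2.6.9-5.0.3.EL, confirmed fixed here

# version-release.EL, optionally followed by a flavor suffix (assumption).
RELEASE = re.compile(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.EL([A-Za-z]*)$")


def classify_release(release):
    """Return "broken", "fixed", or "unknown" for a kernel release string."""
    m = RELEASE.match(release)
    if m is None:
        return "unknown"
    key = m.group(1) + "-" + m.group(2)
    if key in KNOWN_BROKEN:
        return "broken"
    if key in KNOWN_FIXED:
        return "fixed"
    return "unknown"


# platform.release() gives the same string as `uname -r`.
status_of_running_kernel = classify_release(platform.release())
```

Treating "unknown" as its own state leaves the debugger free to fall back to the safer behaviour (for example, delaying the vDSO read until the newborn process has reached a breakpoint) instead of guessing.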

Comment 10 Michael Waite 2005-03-11 20:20:14 UTC

Closed

Comment 11 Chris Gottbrath 2005-04-18 16:09:19 UTC

Speculation on the part of engineers here at Etnus:

Was this fixed by the following patch?

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111074

The reason we are looking at this is that there is a similar problem on
ia64, and we note that this patch is just for x86-64.

Does anyone know if the same sort of problems might occur on ia64?