Bug 237096 - Watchpoints missed after inferior's fork()
Summary: Watchpoints missed after inferior's fork()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gdb
Version: 5.0
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Jan Kratochvil
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 219688 242981 349451 427011
Reported: 2007-04-19 12:58 UTC by Holger Hopp
Modified: 2008-05-21 16:55 UTC

Fixed In Version: RHBA-2008-0332
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 16:55:06 UTC
Target Upstream Version:
Embargoed:


Attachments
gdb.threads/watchthreads.exp testcase fix. (3.29 KB, patch)
2007-04-20 19:25 UTC, Jan Kratochvil
testcase (930 bytes, application/x-bzip)
2007-06-12 16:08 UTC, Holger Hopp
More automatized testcase from the Comment 7 (1004 bytes, application/octet-stream)
2007-06-25 15:19 UTC, Jan Kratochvil
Testcase which would be silent if there would be no bug. (1.31 KB, application/octet-stream)
2007-06-26 15:19 UTC, Jan Kratochvil


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0332 0 normal SHIPPED_LIVE gdb bug fix and enhancement update 2008-05-20 17:58:35 UTC

Description Holger Hopp 2007-04-19 12:58:07 UTC
Description of problem:

gdb has problems with watchpoints in multithreaded
applications. This is a known problem. The gdb documentation says:

"Warning: In multi-thread programs, watchpoints have only limited
usefulness. With the current watchpoint implementation, GDB can only
watch the value of an expression in a single thread. If you are
confident that the expression can only change due to the current
thread's activity (and if you are also confident that no other thread
can become current), then you can use watchpoints as usual. However,
GDB may not notice when a non-current thread's activity changes the
expression."

The question is: is there currently any effort to fix this
misbehavior of gdb? The (hw) watchpoint feature is a key feature of
a debugger, and it is really bad if it does not work. So we would
like to have a fix soon.

The problem occurs with all gdb versions tried (RHEL4, RHEL5, newest GNU
release 6.6). The problem was detected on x86_64, but it probably also
occurs on other platforms.

Steps to Reproduce:

I was not able to create a short and simple example, but since the
problem is known, the gdb developers should have such an example.
In my simple attempts the watchpoint works (the watchpoint is not fully
caught by gdb, but at least program execution stops with
trace/breakpoint trap). The real application is too huge to attach
here.

Comment 1 Jan Kratochvil 2007-04-20 19:25:41 UTC
Created attachment 153204 [details]
gdb.threads/watchthreads.exp testcase fix.

There may be multiple reasons:

(1)
First please check that in `info break' you have listed all the watchpoints as
`hw watchpoint':
Num Type	   Disp Enb Address	       What
2   hw watchpoint  keep y		       var1

Then also check there were no error messages from GDB like:
  Could not insert hardware watchpoint 6.
or
  warning: Could not remove hardware watchpoint 6.
with the result
  You may have requested too many hardware breakpoints/watchpoints.
as the hardware has a limited number of slots for watching memory.

I am not dealing with non-hardware (software) watchpoints here; they are not
expected to be reliable for non-current threads.

(2)
While I was trying to reproduce the behavior I found out there is a bug in the
ptrace(2) kernel communication (occurs on both Red Hat and upstream kernels,
verified on 2.6.20-1.2944.fc6.x86_64 and 2.6.20.4.x86_64).

The attached patch fixes the testcase now exhibiting the behavior described in
the previous paragraph.

With problem (2) the program still stops, just with an inappropriate reason:
  Program received signal SIGTRAP, Trace/breakpoint trap.
instead of the expected
  Hardware watchpoint NUMBER: EXPRESSION

Still this probably does not match your described problem in Comment 0.


Please clarify whether reason (1) or (2) applies to your problem.
Also please formally resubmit this bugfix request, referring to this Bug 237096,
using your RHEL subscription support contract.

Comment 3 Holger Hopp 2007-04-23 13:10:39 UTC
Regarding (1):
Yes, we are talking about hw watchpoints.
I was watching the memory content directly:
Num Type           Disp Enb Address            What
1   hw watchpoint  keep y                      *12410208

And it's only one (hw) watchpoint.


Regarding (2):
In my small example, that I have created to break down the problem to
a small and simple case, I have observed the same behavior, i.e.
  Program received signal SIGTRAP, Trace/breakpoint trap.
instead of the expected
  Hardware watchpoint NUMBER: EXPRESSION
But this is not the real problem, because the watchpoint "works",
although the gdb message was insufficient. So in my (and your) simple example
the watchpoint in multithreaded application works.
In the real problem (huge application) I set the watchpoint on
*12410208 when *12410208 == 1. Then I continue and wait and wait
(several minutes) and stop the program in the debugger with ctrl-c.
Then I look at *12410208 and see that its value is 2, not 1. But the
watchpoint was never hit. The question is: who on earth sets this
value to 2? That is what I want to find out with a debugger, but the gdb
watchpoint does not work in this case.

Then I saw this gdb documentation that shows that gdb has known
problems with watchpoints in multithreaded applications.
Sorry, that I have no small example up to now.


Regarding support contract:
SAP is a technology partner of Red Hat. We detected this misbehavior during
our testing / developing / debugging.

Comment 4 Jan Kratochvil 2007-04-25 20:32:36 UTC
Watchpoints for threads are being implemented by Jeff Johnston's patches
  gdb-6.3-threaded-watchpoints-20041213.patch
  gdb-6.3-threaded-watchpoints2-20050225.patch

Unfortunately I cannot suggest a fix now; the patches should work.
Please try the software watchpoints if the performance is acceptable:
  set can-use-hw-watchpoints 0
I look forward to a reproducer testcase if you have one.

While I am going to fix the kernel compatibility, I do not believe it will catch
more data changes; if you do not get the error messages, nothing should change.

Sorry for the response delay.


Comment 5 Jan Kratochvil 2007-04-29 18:36:43 UTC
A reproducer for the
  Program received signal SIGTRAP, Trace/breakpoint trap.
case is welcome; I have tried many ways but can no longer reproduce it.
It was certainly reproducible for me before.


Comment 6 RHEL Program Management 2007-05-03 14:49:24 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Holger Hopp 2007-06-12 16:08:58 UTC
Created attachment 156804 [details]
testcase

Comment 8 Holger Hopp 2007-06-12 16:09:56 UTC
Sorry for the long response delay. Now I have time to look into
this issue a little more.

I have attached my testcase that can reproduce the "Program received
signal SIGTRAP, Trace/breakpoint trap" issue, but it can not reproduce
the real issue that there is no stop at the watchpoint when the
corresponding value changes.

Maybe it is possible to create a reproducer for the second issue, but
before I start, I would like to confirm that this is really
necessary, because I'm sure that this is not so simple ...


I've tested both static and shared data locations, but there is no
fundamental difference regarding the watchpoints. To reproduce (the
first issue), set a watchpoint on any 'arr' value:

$ make testwpe
$ LD_LIBRARY_PATH=. gdb testwpe
(gdb) b main
(gdb) r
(gdb) p arr[8000]
(gdb) watch arr[8000]
(gdb) c
[...]
Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 184628648288 (LWP 23228)]
start (arg=0x7fbfffe00c) at testwp.c:31
(gdb) p arr[8000]


Comment 9 Russell Doty 2007-06-25 14:57:52 UTC
Not resolved in time for 5.1; changed proposed flag to 5.2.

Comment 10 Jan Kratochvil 2007-06-25 15:19:22 UTC
Created attachment 157764 [details]
More automatized testcase from the Comment 7

So far I was unable to reproduce it on RHEL-5 Server x86_64.
Could you please specify your exact kernel version used?
Attaching your testcase modified so that:
$ make
$ gdb &>1
runs forever for me producing correctly caught watchpoints:

[Switching to Thread 46914921773376 (LWP 27186)]
Hardware watchpoint 2: arr[8000]

Old value = 0
New value = 1
start (arg=0x7fffab6a738c) at testwp.c:31
31	      sleep (1);
...

With your original testcase I always just hit `continue' on each correctly
caught watchpoint (no general SIGTRAP seen).

Comment 11 Holger Hopp 2007-06-26 13:28:01 UTC
You're right.
I also can not reproduce the "Program received signal SIGTRAP,
Trace/breakpoint trap" on RHEL5 with the RHEL5 gdb (6.5-16.el5rh).

I've tested the GNU gdb-6.6 on RHEL5 and there I have detected this
SIGTRAP error, but with the standard RHEL5 gdb it works fine (i.e. gdb
stops correctly, printing Old value and New value).
Sorry for the confusion. I've tested several other systems and
detected the SIGTRAP error also with the installed gdb, but not on
RHEL5.

But, as mentioned, this is not the main issue. More critical is that
gdb does not stop at the watchpoint if the watched value changes. And
for this case I still have no small reproducer.
Maybe it is possible to create a reproducer for this issue, but, as
mentioned, before I start, I would like to confirm that this is really
necessary, because I'm sure that this is not so simple ...

BTW, the kernel was
Linux 2.6.18-8.1.3.el5xen #1 SMP Mon Apr 16 16:19:37 EDT 2007 x86_64 GNU/Linux



Comment 12 Jan Kratochvil 2007-06-26 15:19:37 UTC
Created attachment 157908 [details]
Testcase which would be silent if there would be no bug.

The watchpoints in threads do not work in upstream GDB, please see the Comment
4.
Fortunately the SIGTRAPs are now explained.

There is some race; I am attaching a testcase showing that some of the variable
changes are lost.  Thanks, bugreport accepted, to be fixed.

Comment 14 Jan Kratochvil 2007-07-04 17:31:36 UTC
Comment on attachment 157908 [details]
Testcase which would be silent if there would be no bug.

There is a bug in the testcase: the increment is not atomic, so a call such as
g_atomic_int_inc() is needed. Still, this Bug is valid, and such a testcase fix
does not change the output much.

Comment 15 Jan Kratochvil 2007-07-20 16:23:57 UTC
There is a RHEL-5.0-derived test GDB version at:
http://www.jankratochvil.net/priv/bz237096/

There is a new command `wwatch' - write watchpoint.
Please use it instead of the original `watch' command.
It will catch any write - even one which does not modify anything (twice hit
for: `a=1; a=1;').

The testcase from Comments above should be recoded as:
  http://www.jankratochvil.net/priv/bz237096/bz237096-test0.tar.gz
(self-contained, no need to install the rpms to run it)

Unfortunately I have no other ideas about what could cause missed watchpoints;
debug registers should be properly set for any newly spawned thread.  I
recommend using RHEL-4 kernels or upstream (kernel.org) kernels for this test,
due to some known ptrace(2) issues in RHEL-5.0 (the kernel debugging issues are
fixed for the upcoming RHEL-5.1).  Feel free to `rpmbuild --rebuild
gdb-*.src.rpm' on RHEL-4 if needed.

Thanks for the bugreport, it already brought new fixes for GDB.


Comment 16 Holger Hopp 2007-07-30 10:01:21 UTC
Thank you for your help and for providing the test gdb.
Unfortunately, it does not really fix our issues.

Now we have the problem that when gdb runs with 'watch' or
'wwatch' watchpoints enabled, gdb uses 95% or more CPU and
floods the output with "warning: Could not remove hardware watchpoint 8.",
so that we are unable to work properly with gdb to find our bugs.

Do you know what could have happened?

Is it possible that the Linux kernel is responsible for these issues?
Do we need the newer RHEL5.1 kernel? Or are older kernels RHEL4 and
SLES9 sufficient and the problem is only on gdb's side?

(Our tests are running under SLES9 SP3. We only used the recompiled
test gdb from you/Red Hat. Since our complete environment is huge, we
have not yet transferred the complete issue to a RHEL system. My colleague
who originally detected this issue in his part of the code is so busy with
other tasks that we still have no small reproducer and no RHEL
tests.)

As far as I have understood, the gdb fix runs also on RHEL4, i.e. also
without these kernel fixes. But if we need the kernel fixes, can you
give us a pointer to these fixes, so that we could rebuild a SLES
kernel with these patches included?

Comment 17 Jan Kratochvil 2007-08-18 12:04:13 UTC
The "warning: Could not remove hardware watchpoint 8." message occurs only with
the patched GDB from Comment 15?

I failed to find a way to download the SLES9 SP3 kernel sources to check
whether some ptrace(2)/threading patches may affect GDB.  I know nothing
about the SLES9 kernel/codebase.
All the RHEL kernels (or even upstream kernels) should be fine for this GDB
debugging, except RHEL-5.0 to be completely on the safe side.

Running the GDB testsuite on SLES9 may give some hint whether GDB can be
affected by the SLES kernel there.  The testsuite is run as part of the GDB
build; its results are in the build output, which could be attached here:
rpmbuild --rebuild gdb-*.src.rpm 2>&1|tee build.log

While I appreciate your bugreport, as there is neither a reproducer nor a remote
login I will have to close this Bug.


Comment 18 Helge Deller 2007-08-20 08:22:28 UTC
Thanks for your efforts Jan!
Holger is on vacation until September 3rd. He will answer your questions as 
soon as he is back in the office again.
Helge

Comment 19 Jan Kratochvil 2007-10-15 15:26:56 UTC
I guess I found the problem during an unrelated gdb-on-gdb debugging.

The problem is the debugged program calling fork().
fork() there disarms the hardware watchpoints (they still appear as active but
they get unset from the hardware registers).
It affects:
RHEL-4.6beta
RHEL-5.1beta
gdb-6.6 upstream
gdb-6.7 upstream
gdb cvs upstream

Fedora8/development binary + source rpm:
  http://koji.fedoraproject.org/koji/buildinfo?buildID=20928
.src.rpm there is easily `rpm --rebuild'-able at least on RHEL-5.
Patch posted upstream:
  http://sources.redhat.com/ml/gdb-patches/2007-10/msg00367.html

It should (no guarantees can be made) get fixed in RHEL-4.7 and RHEL-5.2.


Comment 20 Holger Hopp 2007-10-23 14:41:58 UTC
Sorry for my response delays. 
I think that you have found it. We have forks, and that was one reason why we
failed to create a proper reproducer quickly.
Thank you very much, Jan. I think you have done more than we could expect without
a proper reproducer and with only little feedback.
A fix in RHEL-4.7 and RHEL-5.2 is ok.


Comment 21 Jan Kratochvil 2007-10-23 20:21:37 UTC
Fixed in Rawhide (and F8):
* Fri Oct 19 2007 Jan Kratochvil <jan.kratochvil> - 6.6-37
- Fix hiding unexpected breakpoints on intentional step/next commands.


Comment 22 Andrew Cagney 2007-12-04 17:38:22 UTC
This statement in our documentation is wrong:

"Warning: In multi-thread programs, watchpoints have only limited
usefulness. With the current watchpoint implementation, GDB can only
watch the value of an expression in a single thread. If you are
confident that the expression can only change due to the current
thread's activity (and if you are also confident that no other thread
can become current), then you can use watchpoints as usual. However,
GDB may not notice when a non-current thread's activity changes the
expression."



Comment 23 Andrew Cagney 2007-12-04 17:42:15 UTC
dev-ack+

Partial fix available (missing fixes to local documentation).

Change driven by customer issue.

Comment 27 Jan Kratochvil 2008-01-12 16:21:22 UTC
Committed to Rawhide:
* Sat Jan 12 2008 Jan Kratochvil <jan.kratochvil> - 6.7.1-9
- Fix also threaded inferiors for hardware watchpoints after the fork call.


Comment 32 errata-xmlrpc 2008-05-21 16:55:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0332.html


