Description of problem:
gdb has problems with watchpoints in multithreaded applications. This is a known problem. The gdb documentation says:

"Warning: In multi-thread programs, watchpoints have only limited usefulness. With the current watchpoint implementation, GDB can only watch the value of an expression in a single thread. If you are confident that the expression can only change due to the current thread's activity (and if you are also confident that no other thread can become current), then you can use watchpoints as usual. However, GDB may not notice when a non-current thread's activity changes the expression."

The question is: Is there currently any effort to fix this misbehavior of gdb? The (hw) watchpoint feature is a key feature of a debugger, and it is a serious limitation if it does not work, so we would like to have a fix soon.

The problem occurs with every gdb we tried (RHEL4, RHEL5, newest GNU release 6.6). It was detected on x86_64, but it probably also occurs on other platforms.

Steps to Reproduce:
I was not able to create a short and simple example, but since the problem is known, the gdb developers should have such an example. In my simple attempts the watchpoint works (gdb does not fully catch the watchpoint, but at least program execution stops with a trace/breakpoint trap). The real application is too large to attach here.
Created attachment 153204 [details]
gdb.threads/watchthreads.exp testcase fix.

There may be multiple reasons:

(1) First please check that `info break' lists all the watchpoints as `hw watchpoint':

    Num Type           Disp Enb Address    What
    2   hw watchpoint  keep y              var1

Then also check that there were no error messages from GDB like:
    Could not insert hardware watchpoint 6.
or
    warning: Could not remove hardware watchpoint 6.
with the result
    You may have requested too many hardware breakpoints/watchpoints.
as the hardware has a limited number of slots for memory watching. So far I do not deal with non-hardware watchpoints here; they are not expected to be reliable for non-current threads.

(2) While I was trying to reproduce the behavior I found out there is a bug in the ptrace(2) kernel communication (it occurs on both Red Hat and upstream kernels, verified on 2.6.20-1.2944.fc6.x86_64 and 2.6.20.4.x86_64). The attached patch fixes the testcase, which now exhibits the behavior described in the previous paragraph. With problem (2) the program will still stop, just with an inappropriate reason:
    Program received signal SIGTRAP, Trace/breakpoint trap.
instead of the expected
    Hardware watchpoint NUMBER: EXPRESSION

Still, this probably does not match the problem you described in Comment 0. Please clarify whether reason (1) or (2) applies to your problem.

Also please formally resubmit this bugfix request, referring to this Bug 237096, using your RHEL subscription support contract.
Regarding (1): Yes, we are talking about hw watchpoints. I was watching the memory content directly:

    Num Type           Disp Enb Address    What
    1   hw watchpoint  keep y              *12410208

And it's only one (hw) watchpoint.

Regarding (2): In the small example that I created to break the problem down to a simple case, I have observed the same behavior, i.e.
    Program received signal SIGTRAP, Trace/breakpoint trap.
instead of the expected
    Hardware watchpoint NUMBER: EXPRESSION
But this is not the real problem, because the watchpoint "works", although the gdb message was insufficient. So in my (and your) simple example the watchpoint in a multithreaded application works.

In the real problem (huge application) I set the watchpoint on *12410208 when *12410208 == 1. Then I continue, wait and wait and wait (several minutes), and stop the program in the debugger with ctrl-c. Then I look at *12410208 and see that its value is 2, not 1. But the watchpoint was never hit. The problem is: Who on earth sets this value to 2? That's what I want to find out with a debugger, but the gdb watchpoint does not work in this case. Then I saw the gdb documentation stating that gdb has known problems with watchpoints in multithreaded applications.

Sorry that I have no small example up to now.

Regarding the support contract: SAP is a technology partner of Red Hat. We detected this misbehavior during our testing / developing / debugging.
Watchpoints for threads are being implemented by Jeff Johnston's patches
    gdb-6.3-threaded-watchpoints-20041213.patch
    gdb-6.3-threaded-watchpoints2-20050225.patch

Unfortunately I cannot suggest a fix now; the patches should work. Please try the software watchpoints, if their performance is acceptable:
    set can-use-hw-watchpoints 0

I am looking forward to a reproducer testcase if you have one. While I am going to fix the kernel compatibility, I do not believe it will catch more data changes: if you do not get the error messages, nothing should change.

Sorry for the response delay.
A reproducer for the
    Program received signal SIGTRAP, Trace/breakpoint trap.
case is welcome. So far I have tried many ways but I can no longer reproduce it, although it certainly was reproducible for me before.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 156804 [details] testcase
Sorry for the long response delay. Now I have time to look a little more into this issue.

I have attached my testcase that can reproduce the "Program received signal SIGTRAP, Trace/breakpoint trap" issue, but it cannot reproduce the real issue, namely that there is no stop at the watchpoint when the corresponding value changes. Maybe it is possible to create a reproducer for the second issue, but before I start, I would like to confirm that this is really necessary, because I'm sure that this is not so simple ...

I've tested both static and shared data locations, but there is no fundamental difference regarding the watchpoints.

To reproduce (the first issue), set a watchpoint on any 'arr' value:

    $ make testwpe
    $ LD_LIBRARY_PATH=. gdb testwpe
    (gdb) b main
    (gdb) r
    (gdb) p arr[8000]
    (gdb) watch arr[8000]
    (gdb) c
    [...]
    Program received signal SIGTRAP, Trace/breakpoint trap.
    [Switching to Thread 184628648288 (LWP 23228)]
    start (arg=0x7fbfffe00c) at testwp.c:31
    (gdb) p arr[8000]
Not resolved in time for 5.1; changed proposed flag to 5.2.
Created attachment 157764 [details]
More automated testcase from Comment 7

So far I was unable to reproduce it on RHEL-5 Server x86_64. Could you please specify the exact kernel version you used?

I am attaching your testcase modified so that:

    $ make
    $ gdb &>1

runs forever for me, producing correctly caught watchpoints:

    [Switching to Thread 46914921773376 (LWP 27186)]
    Hardware watchpoint 2: arr[8000]

    Old value = 0
    New value = 1
    start (arg=0x7fffab6a738c) at testwp.c:31
    31        sleep (1);
    ...

With your original testcase I always just hit `continue' on each correctly caught watchpoint (no general SIGTRAP seen).
You're right. I also cannot reproduce the "Program received signal SIGTRAP, Trace/breakpoint trap" on RHEL5 with the RHEL5 gdb (6.5-16.el5rh). I've tested the GNU gdb-6.6 on RHEL5 and there I did see this SIGTRAP error, but with the standard RHEL5 gdb it works fine (i.e. gdb stops correctly, printing Old value and New value). Sorry for the confusion. I've tested several other systems and saw the SIGTRAP error with their installed gdb as well, but not on RHEL5.

But, as mentioned, this is not the main issue. More critical is that gdb does not stop at the watchpoint when the watched value changes. And for this case I still have no small reproducer. Maybe it is possible to create a reproducer for this issue, but, as mentioned, before I start, I would like to confirm that this is really necessary, because I'm sure that this is not so simple ...

BTW, the kernel was
    Linux 2.6.18-8.1.3.el5xen #1 SMP Mon Apr 16 16:19:37 EDT 2007 x86_64 GNU/Linux
Created attachment 157908 [details]
Testcase which would be silent if there would be no bug.

The watchpoints in threads do not work in upstream GDB, please see Comment 4. Fortunately the SIGTRAPs are now explained.

There is some race; I am attaching a testcase showing that some of the variable changes are lost.

Thanks, bugreport accepted, to be fixed.
Comment on attachment 157908 [details]
Testcase which would be silent if there would be no bug.

There is a bug in the testcase: the increment is not atomic, so a call like g_atomic_int_inc() is needed. This Bug is still valid, though, and fixing the testcase that way does not change the output much.
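For illustration, here is a minimal sketch of an atomic shared increment of the kind the comment asks for. It is not taken from attachment 157908: the names `shared_counter`, `worker` and `run_workers` are hypothetical, and it uses C11 `atomic_fetch_add` from `<stdatomic.h>` in place of GLib's g_atomic_int_inc() so that it needs nothing beyond pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16

/* Hypothetical shared variable; the real testcase's names are not known here. */
static atomic_int shared_counter;

static void *worker(void *arg)
{
    int iterations = *(const int *)arg;

    for (int i = 0; i < iterations; i++) {
        /* Atomic read-modify-write: concurrent increments are not lost. */
        atomic_fetch_add(&shared_counter, 1);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value,
 * or -1 if a thread could not be created. */
int run_workers(int nthreads, int iterations)
{
    pthread_t tids[MAX_THREADS];
    int started = 0;

    assert(nthreads >= 0 && nthreads <= MAX_THREADS);
    atomic_store(&shared_counter, 0);

    /* &iterations stays valid because all threads are joined before returning. */
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, &iterations) == 0)
        started++;

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return started == nthreads ? atomic_load(&shared_counter) : -1;
}
```

With the atomic increment, any mismatch between the number of watchpoint hits gdb reports and the final counter value can be attributed to the debugger rather than to lost updates in the program itself. With a plain `shared_counter++` on a non-atomic int, the program could lose increments on its own.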
There is a RHEL-5.0-derived test GDB version at:
    http://www.jankratochvil.net/priv/bz237096/

There is a new command `wwatch' - write watchpoint. Please use it instead of the original `watch' command. It will catch any write, even one which does not modify anything (it is hit twice for: `a=1; a=1;').

The testcase from the Comments above should be recoded as:
    http://www.jankratochvil.net/priv/bz237096/bz237096-test0.tar.gz
(self-contained, no need to install the rpms to run it)

Unfortunately I have no other ideas about any missed watchpoints; the debug registers should be properly set for any newly spawned thread.

I have to recommend using RHEL-4 kernels or upstream (kernel.org) kernels for this test, due to some known ptrace(2) issues in RHEL-5.0 (the kernel debugging issues are fixed for the upcoming RHEL-5.1). Feel free to `rpmbuild --rebuild gdb-*.src.rpm' on RHEL-4 if needed.

Thanks for the bugreport, it has already brought new fixes for GDB.
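The difference between a value-change watchpoint and a write watchpoint for `a=1; a=1;' can be modelled in plain C. This is only a model of the two hit-counting rules described above, not GDB code; `traced_store`, `hit_counts` and `store_same_value_twice` are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* How often each kind of watchpoint would report a hit (model only). */
struct hit_counts {
    int on_change;  /* reports only when the stored value differs from the old one */
    int on_write;   /* reports on every store, as described for `wwatch' */
};

/* Performs one store and records which kinds of watchpoint would fire. */
static void traced_store(volatile int *target, int value, struct hit_counts *hits)
{
    assert(target != NULL && hits != NULL);

    int old = *target;
    *target = value;

    hits->on_write++;
    if (old != value)
        hits->on_change++;
}

/* Models `a=1; a=1;' starting from a == 0. */
struct hit_counts store_same_value_twice(void)
{
    volatile int a = 0;
    struct hit_counts hits = {0, 0};

    traced_store(&a, 1, &hits);  /* 0 -> 1: value changes */
    traced_store(&a, 1, &hits);  /* 1 -> 1: a write, but no change */
    return hits;
}
```

In this model the second store counts as a write but not as a change, which is why a write watchpoint is hit twice for this sequence.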
Thank you for your help and for providing the test gdb. Unfortunately, it does not really fix our issues.

Now we have the problem that when gdb runs with 'watch' or 'wwatch' watchpoints enabled, gdb uses about 95% CPU or more and massively displays "warning: Could not remove hardware watchpoint 8.", so that we are unable to work properly with gdb to find our bugs.

Do you know what could have happened? Is it possible that the Linux kernel is responsible for these issues? Do we need the newer RHEL5.1 kernel? Or are the older RHEL4 and SLES9 kernels sufficient and the problem is only on gdb's side?

(Our tests are running under SLES9 SP3. We only used the recompiled test gdb from you/Red Hat. Since our complete environment is huge, we have not yet transferred the complete issue to a RHEL system. My colleague who originally detected this issue in his part of the code is so busy with other tasks that we still have no small reproducer and no RHEL tests.)

As far as I have understood, the gdb fix also runs on RHEL4, i.e. also without these kernel fixes. But if we need the kernel fixes, can you give us a pointer to them, so that we could rebuild a SLES kernel with these patches included?
The "warning: Could not remove hardware watchpoint 8." message occurs only with the patched GDB from Comment 15?

I failed to find a way to download the SLES9 SP3 kernel sources to check whether some ptrace(2)/threading patches may affect GDB. I know nothing about the SLES9 kernel/codebase. All the RHEL kernels (or even upstream kernels) should be fine for this GDB debugging, except RHEL-5.0, to be completely on the safe side.

Running the GDB testsuite on SLES9 may give some hint whether GDB can be affected by the SLES kernel there. It is run as a part of the GDB build, and its results can be retrieved from the build output, which could be attached here:
    rpmbuild --rebuild gdb-*.src.rpm 2>&1|tee build.log

While I appreciate your bugreport, as there is neither a reproducer nor a remote login I will have to close this Bug.
Thanks for your efforts Jan! Holger is on vacation until September 3rd. He will answer your questions as soon as he is back in the office again. Helge
I guess I found the problem during an unrelated gdb-on-gdb debugging session. The problem is fork()ing of the debugged program. fork() there disarms the hardware watchpoints (they still appear as active but they get unset from the hardware registers).

It affects:
    RHEL-4.6beta
    RHEL-5.1beta
    gdb-6.6 upstream
    gdb-6.7 upstream
    gdb cvs upstream

Fedora8/development binary + source rpm:
    http://koji.fedoraproject.org/koji/buildinfo?buildID=20928
The .src.rpm there is easily `rpm --rebuild'-able at least on RHEL-5.

Patch posted upstream:
    http://sources.redhat.com/ml/gdb-patches/2007-10/msg00367.html

It should (no guarantees can be made) get fixed in RHEL-4.7 and RHEL-5.2.
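The scenario described above (a fork() in the debugged program, followed by a write from another thread) can be sketched as a small C function. This is a hypothetical reproducer outline, not the testcase used for the upstream patch; the names `watched`, `writer` and `fork_then_write` are assumptions, and whether a given GDB build actually misses the write has not been verified here.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical watch target: in gdb one would use `watch watched'. */
volatile int watched;

static void *writer(void *unused)
{
    (void)unused;
    watched = 2;  /* write from a non-main thread, after the fork */
    return NULL;
}

/* Forks a short-lived child, then changes `watched' from a new thread.
 * Returns the final value of `watched', or -1 on failure. */
int fork_then_write(void)
{
    pthread_t tid;
    int status;
    pid_t pid;

    watched = 1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);  /* the child does nothing and exits immediately */

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    if (pthread_create(&tid, NULL, writer, NULL) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;

    assert(watched == 2);  /* the value did change, whether or not gdb noticed */
    return watched;
}
```

Calling fork_then_write() from main under gdb, with a hardware watchpoint set on `watched' before the fork, would be one way to check whether the 1 to 2 transition is reported. On an affected version the expectation from this comment is that the watchpoint stays listed in `info break' but the write goes unreported.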
Sorry for my response delays. I think that you have found it. We do have forks, and that was one reason why we failed to create a proper reproducer quickly.

Thank you very much, Jan. I think you have done more than we could expect without a proper reproducer and with only little feedback. A fix in RHEL-4.7 and RHEL-5.2 is ok.
Fixed in Rawhide (and F8): * Fri Oct 19 2007 Jan Kratochvil <jan.kratochvil> - 6.6-37 - Fix hiding unexpected breakpoints on intentional step/next commands.
This statement in our documentation is wrong: "Warning: In multi-thread programs, watchpoints have only limited usefulness. With the current watchpoint implementation, GDB can only watch the value of an expression in a single thread. If you are confident that the expression can only change due to the current thread's activity (and if you are also confident that no other thread can become current), then you can use watchpoints as usual. However, GDB may not notice when a non-current thread's activity changes the expression."
dev-ack+ Partial fix available (missing fixes to local documentation). Change driven by customer issue.
Committed to Rawhide: * Sat Jan 12 2008 Jan Kratochvil <jan.kratochvil> - 6.7.1-9 - Fix also threaded inferiors for hardware watchpoints after the fork call.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0332.html