Bug 1184724

| Summary: | gdb output a internal-error: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Luyao Huang <lhuang> |
| Component: | gdb |
| Assignee: | Sergio Durigan Junior <sergiodj> |
| Status: | CLOSED ERRATA |
| QA Contact: | Miroslav Franc <mfranc> |
| Severity: | medium |
| Docs Contact: | |
| Priority: | medium |
| Version: | 7.1 |
| CC: | dyuan, gdb-bugs, jan.kratochvil, lhuang, mcermak, mfranc, mzhan, ohudlick, palves |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | gdb-7.6.1-72.el7 |
| Doc Type: | Bug Fix |
| Doc Text: | The ptrace system call requires resuming threads individually. While resuming threads of a process, for example, with the "continue" command, if an already-resumed thread N causes a process exit, for example, by calling the _exit() function, the resumption of the remaining threads fails with the ESRCH (No such process) error because the process no longer exists. Previously, GDB did not expect this scenario and aborted the resumption command with a "ptrace: No such process." error message. Consequently, a subsequent "detach" command then failed an internal assertion due to inconsistent internal state. With this update, if a thread disappears while being resumed, GDB proceeds to collect the thread's exit status. As a result, the resumption command ends with the expected "Inferior exited normally" message and the subsequent "detach" command no longer crashes in the described scenario. |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2015-11-19 13:02:50 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1493675 |
| Attachments: | |
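
The behavior described in the Doc Text above can be modeled with a small simulation. This is only a sketch, not GDB code: `FakeInferior`, `resume_lwp`, `collect_status` and `resume_all` are hypothetical names invented for illustration, and the per-thread resume request (a `ptrace` call in a real debugger) is replaced by a plain Python method. It shows the idea of the fix: when resuming a thread fails with ESRCH, stop resuming and go collect the exit status instead of aborting the command.

```python
import errno


class FakeInferior:
    """Hypothetical stand-in for a traced process; none of this is GDB code."""

    def __init__(self, lwps, exit_after):
        self.lwps = list(lwps)
        self.exit_after = exit_after  # process exits once this many LWPs run
        self.resumed = []
        self.alive = True

    def resume_lwp(self, lwp):
        """Model one per-thread resume request (a ptrace call in reality)."""
        if not self.alive:
            # What the kernel reports once the whole process is gone.
            raise OSError(errno.ESRCH, "No such process")
        self.resumed.append(lwp)
        if len(self.resumed) == self.exit_after:
            self.alive = False  # an already-resumed thread called _exit()

    def collect_status(self):
        """Model the wait step: report what happened to the process."""
        return "running" if self.alive else "exited normally"


def resume_all(inferior):
    """Resume every LWP; treat ESRCH as 'process exited', not as a failure."""
    for lwp in inferior.lwps:
        try:
            inferior.resume_lwp(lwp)
        except OSError as exc:
            if exc.errno != errno.ESRCH:
                raise
            break  # nothing left to resume; go collect the exit status
    return inferior.collect_status()


if __name__ == "__main__":
    inferior = FakeInferior(lwps=[11205, 11206, 11207], exit_after=1)
    print(resume_all(inferior))
```

In this model, the pre-fix behavior corresponds to letting the `OSError` propagate out of `resume_all`, which skips the status collection and leaves the caller's bookkeeping out of date.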
Hi Luyao,

I tried to reproduce this bug but failed. I followed your instructions, but when I run "systemctl restart libvirtd" the command does not return, so I cannot actually restart libvirtd while it is being debugged by GDB. I also tried many variations of your instructions, but none of them reproduced the bug. Can you please review the instructions and make sure that they really reproduce the issue?

Hi Sergio Durigan Junior,

I cannot reproduce this issue every time. Fortunately, I can still reproduce it very easily on my machine, and I found some key points for reproducing it:
1.
# gdb libvirtd `pidof libvirtd`
(gdb) br netcfStateCleanup
Breakpoint 1 at 0x7f1ea46cf8f0: file interface/interface_backend_netcf.c, line 102.
(gdb) c
Continuing.
2. Open another terminal and restart libvirtd. This command will not return; just move on to the next step:
# service libvirtd restart
Redirecting to /bin/systemctl restart libvirtd.service
<----- the command blocks here
3. The first terminal will show:
Program received signal SIGTERM, Terminated.
4. Then continue:
Program received signal SIGTERM, Terminated.
0x00007fefdb9a3b7d in poll () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) c
Continuing.
[Thread 0x7fefcc470700 (LWP 11213) exited]
[Thread 0x7fefcfc77700 (LWP 11206) exited]
[Thread 0x7fefcb46e700 (LWP 11215) exited]
[Thread 0x7fefcf476700 (LWP 11207) exited]
[Thread 0x7fefce474700 (LWP 11209) exited]
[Thread 0x7fefccc71700 (LWP 11212) exited]
[Thread 0x7fefcdc73700 (LWP 11210) exited]
[Thread 0x7fefcec75700 (LWP 11208) exited]
[Thread 0x7fefcd472700 (LWP 11211) exited]
[Thread 0x7fefcbc6f700 (LWP 11214) exited]
5. Spend some time stepping until you hit the "ptrace: No such process." error:
Breakpoint 1, netcfStateCleanup () at interface/interface_backend_netcf.c:105
105 if (!driver)
(gdb) n
...
264 if (klass->dispose)
(gdb) n
265 klass->dispose(obj);
(gdb)
266 klass = klass->parent;
(gdb)
263 while (klass) {
(gdb)
264 if (klass->dispose)
(gdb)
266 klass = klass->parent;
(gdb)
263 while (klass) {
(gdb)
270 memset(obj, 0, obj->klass->objectSize);
(gdb)
271 obj->u.s.magic = 0xDEADBEEF;
(gdb) n
272 obj->klass = (void*)0xDEADBEEF;
(gdb)
273 VIR_FREE(obj);
(gdb) n
271 obj->u.s.magic = 0xDEADBEEF;
(gdb)
272 obj->klass = (void*)0xDEADBEEF;
(gdb)
273 VIR_FREE(obj);
(gdb)
276 return !lastRef;
(gdb) n
277 }
(gdb)
netcfStateCleanup () at interface/interface_backend_netcf.c:114
114 driver = NULL;
(gdb)
115 return 0;
(gdb)
116 }
(gdb) c
Continuing.
netcfStateCleanup () at interface/interface_backend_netcf.c:116
116 }
ptrace: No such process.
(gdb) ^CQuit
(gdb) quit
A debugging session is active.
Inferior 1 [process 11205] will be detached.
Quit anyway? (y or n) y
[Thread 0x7fefdf322880 (LWP 11205) exited]
../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y
../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) y
Aborted (core dumped)
I will attach the full steps.
Created attachment 1013784 [details]
full steps of gdb
This patch probably fixes it:
https://sourceware.org/ml/gdb-patches/2015-03/msg00597.html

From:

~~~
(gdb) c
Continuing.
[Thread 0x7fefcc470700 (LWP 11213) exited]
[Thread 0x7fefcfc77700 (LWP 11206) exited]
[Thread 0x7fefcb46e700 (LWP 11215) exited]
[Thread 0x7fefcf476700 (LWP 11207) exited]
[Thread 0x7fefce474700 (LWP 11209) exited]
[Thread 0x7fefccc71700 (LWP 11212) exited]
[Thread 0x7fefcdc73700 (LWP 11210) exited]
[Thread 0x7fefcec75700 (LWP 11208) exited]
[Thread 0x7fefcd472700 (LWP 11211) exited]
[Thread 0x7fefcbc6f700 (LWP 11214) exited]
5. waste sometime to meet "ptrace: No such process." this error.
~~~

It all indicates that this is the bug fixed by the patch at the URL pasted above. That is, just while GDB is resuming all threads for your "continue", some thread exits the whole process:

~~~
(gdb) c
Continuing.
netcfStateCleanup () at interface/interface_backend_netcf.c:116
116 }
ptrace: No such process.
~~~

That is, GDB resumes threads 1, 2, 3, 4, 5, and while it is resuming thread 5, thread 1 exits the process (exit/_exit, etc.).

The internal error is more a consequence of the "No such process" mishandling than the real bug.

Any chance you could try current upstream mainline? Also, a gdb log with "set debug infrun 1 + set debug lin-lwp 1" would probably help.

(In reply to Pedro Alves from comment #5)
> The internal error is more a consequence of the "No such process"
> mishandling, than the real bug.
>
> Any chance you could try current upstream mainline? Also a gdb log with
> "set debug infrun 1 + set debug lin-lwp 1" would probably help.

I will attach the log taken with set debug infrun 1 + set debug lin-lwp 1.

I also tested with upstream gdb and did not hit the gdb crash:

0x00007f4092e23b7d in poll () from /lib64/libc.so.6
(gdb) br netcfStateCleanup
Breakpoint 1 at 0x7f4081edfb30
(gdb) c
Continuing.
Program received signal SIGTERM, Terminated.
[Switching to Thread 0x7f4087327700 (LWP 12125)]
0x00007f409310b705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) c
Continuing.
Program received signal SIGCONT, Continued.
[Switching to Thread 0x7f40969e8880 (LWP 12124)]
0x00007f4092e23b7d in poll () from /lib64/libc.so.6
(gdb) c
Continuing.
[Thread 0x7f408331f700 (LWP 12133) exited]
[Thread 0x7f4083b20700 (LWP 12132) exited]
[Thread 0x7f4084b22700 (LWP 12130) exited]
[Thread 0x7f4085323700 (LWP 12129) exited]
[Thread 0x7f4085b24700 (LWP 12128) exited]
[Thread 0x7f4086325700 (LWP 12127) exited]
[Thread 0x7f4086b26700 (LWP 12126) exited]
[Thread 0x7f4087327700 (LWP 12125) exited]
[Thread 0x7f4082b1e700 (LWP 12134) exited]
[Thread 0x7f4084321700 (LWP 12131) exited]
Breakpoint 1, 0x00007f4081edfb30 in netcfStateCleanup () from /usr/lib64/libvirt/connection-driver/libvirt_driver_interface.so
(gdb) l
No symbol table is loaded. Use the "file" command.
(gdb) n
Single stepping until exit from function netcfStateCleanup,
which has no line number information.
0x00007f4095f8c2a8 in virStateCleanup () from /lib64/libvirt.so.0
(gdb) n
Single stepping until exit from function virStateCleanup,
which has no line number information.
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) n
The program is not being run.
(gdb) c
The program is not being run.
(gdb) c
The program is not being run.
(gdb) ^CQuit
(gdb) quit

Created attachment 1028521 [details]
gdb log with debug 1
On 05/22/2015 04:43 AM, bugzilla wrote:
>
> I also tested with upstream gdb and did not hit the gdb crash:

Thanks for testing that!

> Program received signal SIGTERM, Terminated.
> [Switching to Thread 0x7f4087327700 (LWP 12125)]
> 0x00007f409310b705 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> (gdb) c
> Continuing.
>
> Program received signal SIGCONT, Continued.
> [Switching to Thread 0x7f40969e8880 (LWP 12124)]
> 0x00007f4092e23b7d in poll () from /lib64/libc.so.6

...

> (gdb) n
> Single stepping until exit from function netcfStateCleanup,
> which has no line number information.
> 0x00007f4095f8c2a8 in virStateCleanup () from /lib64/libvirt.so.0
> (gdb) n
> Single stepping until exit from function virStateCleanup,
> which has no line number information.
>
> Program terminated with signal SIGKILL, Killed.
> The program no longer exists.
> (gdb) n

So this shows that something outside GDB is killing the process with SIGKILL. On Linux, a ptracer cannot intercept that signal before it kills the process. We also see that something is sending SIGTERM and SIGCONT to the process, but those gdb can intercept.

We don't see the SIGKILL in the debug logs with the crashy gdb, but I think that it's the same. We see:

(gdb) sigchld
      ^^^^^^^
n
infrun: clear_proceed_status_thread (Thread 0x7fa2889eb880 (LWP 18511))
infrun: proceed (addr=0xffffffffffffffff, signal=144, step=1)
infrun: resume (step=RESUME_STEP_USER, signal=0), trap_expected=0, current thread [Thread 0x7fa2889eb880 (LWP 18511)] at 0x7fa287ef20ac
LLR: Preparing to step Thread 0x7fa2889eb880 (LWP 18511), 0, inferior_ptid Thread 0x7fa2889eb880 (LWP 18511)
virObjectUnref (anyobj=<error reading variable: Cannot access memory at address 0x7fffacf83308>) at util/virobject.c:270
270 memset(obj, 0, obj->klass->objectSize);
ptrace: No such process.
(gdb) infrun

That "sigchld" indicates a ptracee changed state (and so gdb's SIGCHLD handler is called). That must have been the SIGKILL, and then when gdb tries to step LWP 18511, that fails with ESRCH, because the whole process is gone now (killed by SIGKILL), hence the "No such process." That error/exception prevents GDB from ever reaching the waitpid call again, so the logs don't get to show SIGKILL's wait status.

This is exactly the sort of scenario that the upstream patch addresses (though more could be done).

Thanks,
Pedro Alves

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2089.html
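
Two points from Pedro Alves's analysis in this thread can be observed outside GDB: a SIGKILL only becomes visible to the parent through the wait status, and a request aimed at a process that no longer exists fails with ESRCH. The following is a rough sketch under stated assumptions: it uses plain fork and signals rather than ptrace, so it is an analogy and not a reproduction of the GDB failure, and `kill_and_probe` is a hypothetical helper name. It assumes a Unix-like system where `os.fork` is available.

```python
import errno
import os
import signal
import time


def kill_and_probe():
    """Fork a child, SIGKILL it, reap it, then signal the stale PID."""
    pid = os.fork()
    if pid == 0:
        # Child: idle until the parent kills it.
        while True:
            time.sleep(1)

    os.kill(pid, signal.SIGKILL)    # the child cannot catch or block this
    _, status = os.waitpid(pid, 0)  # the wait status is where SIGKILL shows up
    killed_by = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None

    try:
        os.kill(pid, 0)             # probe the PID again after reaping
        probe_errno = None
    except OSError as exc:
        probe_errno = exc.errno     # ESRCH means "No such process"
    return killed_by, probe_errno


if __name__ == "__main__":
    killed_by, probe_errno = kill_and_probe()
    print("terminated by signal:", killed_by)
    print("probe failed with ESRCH:", probe_errno == errno.ESRCH)
```

One difference from the GDB case is worth noting: here the probe only fails after the child has been reaped, whereas in the logs above the failing ptrace request came before GDB got back to waitpid, which is why the SIGKILL wait status never appears there.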
Description of problem:
gdb output a internal-error: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed

Version-Release number of selected component (if applicable):
gdb-7.6.1-64.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1.
# gdb libvirtd `pidof libvirtd`
(gdb) br netcfStateCleanup
Breakpoint 1 at 0x7f1ea46cf8f0: file interface/interface_backend_netcf.c, line 102.
(gdb) c
Continuing.
2. Use another terminal to restart libvirtd:
# service libvirtd restart
Redirecting to /bin/systemctl restart libvirtd.service
3. In the first terminal, spend some time debugging until you hit the error "ptrace: No such process.", then exit gdb with Ctrl+D:
Breakpoint 1, netcfStateCleanup () at interface/interface_backend_netcf.c:102
102 {
(gdb) n
103 if (!driverState) {
(gdb)
109 if (virObjectUnref(driverState)) {
(gdb) s
virObjectUnref (anyobj=0x7f1e9c0942c0) at util/virobject.c:249
....
(debug the code)
....
(gdb)
netcfStateCleanup () at interface/interface_backend_netcf.c:115
115 driverState = NULL;
(gdb)
116 return 0;
(gdb)
117 }
(gdb) c
Continuing.
netcfStateCleanup () at interface/interface_backend_netcf.c:117
117 }
ptrace: No such process.
(gdb) ^CQuit
(gdb) quit
A debugging session is active.
Inferior 1 [process 25484] will be detached.
Quit anyway? (y or n) y
[Thread 0x7f1ebab4c880 (LWP 25484) exited]
../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) y

Actual results:
gdb appears to crash when exiting a debugging session.

Expected results:
gdb exits successfully.

Additional info:
backtrace:
(gdb) bt
#0 0x00007f91332ed5d7 in raise () from /lib64/libc.so.6
#1 0x00007f91332eecc8 in abort () from /lib64/libc.so.6
#2 0x0000000000691a46 in dump_core () at ../../gdb/utils.c:761
#3 0x0000000000694255 in internal_vproblem (problem=0xc12330 <internal_error_problem>, file=<optimized out>, line=1869, fmt=0x7aa2b0 "%s: Assertion `%s' failed.", ap=0x7fffb33fa990) at ../../gdb/utils.c:919
#4 0x00000000006942c9 in internal_verror (file=<optimized out>, line=<optimized out>, fmt=<optimized out>, ap=ap@entry=0x7fffb33fa990) at ../../gdb/utils.c:944
#5 0x000000000069436f in internal_error (file=file@entry=0x7b4595 "../../gdb/linux-nat.c", line=line@entry=1869, string=<optimized out>) at ../../gdb/utils.c:954
#6 0x00000000004cb89d in linux_nat_detach (ops=0x14636a0, args=0x0, from_tty=1) at ../../gdb/linux-nat.c:1869
#7 0x00000000004d3385 in thread_db_detach (ops=<optimized out>, args=0x0, from_tty=1) at ../../gdb/linux-thread-db.c:1367
#8 0x00000000005f7ad1 in target_detach (args=0x0, from_tty=1) at ../../gdb/target.c:2605
#9 0x000000000068fe1c in kill_or_detach (inf=0x14e1590, args=0x7fffb33face0) at ../../gdb/top.c:1217
#10 0x00000000006b0f54 in iterate_over_inferiors (callback=callback@entry=0x68fda0 <kill_or_detach>, data=data@entry=0x7fffb33face0) at ../../gdb/inferior.c:395
#11 0x000000000068fb21 in quit_target (arg=arg@entry=0x7fffb33face0) at ../../gdb/top.c:1298
#12 0x00000000005cee0a in catch_errors (func=func@entry=0x68fb10 <quit_target>, func_args=func_args@entry=0x7fffb33face0, errstring=errstring@entry=0x84979a "Quitting: ", mask=mask@entry=6) at ../../gdb/exceptions.c:546
#13 0x0000000000690732 in quit_force (args=0x0, from_tty=1) at ../../gdb/top.c:1336
#14 0x00000000006901ba in execute_command (p=0x7c52ba "", p@entry=0x7c52b6 "quit", from_tty=1) at ../../gdb/top.c:487
#15 0x00000000005d8622 in command_handler (command=command@entry=0x0) at ../../gdb/event-top.c:431
#16 0x00000000005d8bef in command_line_handler (rl=0x0) at ../../gdb/event-top.c:505
#17 0x00007f91352b2c6e in rl_callback_read_char () at ../callback.c:220
#18 0x00000000005d8639 in rl_callback_read_char_wrapper (client_data=<optimized out>) at ../../gdb/event-top.c:164
#19 0x00000000005d71f4 in process_event () at ../../gdb/event-loop.c:342
#20 0x00000000005d7587 in gdb_do_one_event () at ../../gdb/event-loop.c:406
#21 0x00000000005d77b7 in start_event_loop () at ../../gdb/event-loop.c:431
#22 0x00000000005d0623 in captured_command_loop (data=data@entry=0x0) at ../../gdb/main.c:259
#23 0x00000000005cee0a in catch_errors (func=func@entry=0x5d0610 <captured_command_loop>, func_args=func_args@entry=0x0, errstring=errstring@entry=0x7b91db "", mask=mask@entry=6) at ../../gdb/exceptions.c:546
#24 0x00000000005d12d6 in captured_main (data=data@entry=0x7fffb33fb070) at ../../gdb/main.c:1134
#25 0x00000000005cee0a in catch_errors (func=func@entry=0x5d0a40 <captured_main>, func_args=func_args@entry=0x7fffb33fb070, errstring=errstring@entry=0x7b91db "", mask=mask@entry=6) at ../../gdb/exceptions.c:546
#26 0x00000000005d1f04 in gdb_main (args=args@entry=0x7fffb33fb070) at ../../gdb/main.c:1144
#27 0x00000000004572ee in main (argc=<optimized out>, argv=<optimized out>) at ../../gdb/gdb.c:34