1184724 – gdb output a internal-error: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed

Summary: gdb output a internal-error: Assertion `num_lwps (GET_PID (inferior_ptid)) ==...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	gdb
Sub Component:
Version:	7.1
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Sergio Durigan Junior
QA Contact:	Miroslav Franc
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1493675

Reported:	2015-01-22 06:51 UTC by Luyao Huang
Modified:	2017-09-20 19:23 UTC
CC List:	9 users
Fixed In Version:	gdb-7.6.1-72.el7
Doc Type:	Bug Fix
Doc Text:	The ptrace system call requires resuming threads individually. While resuming threads of a process, for example, with the "continue" command, if an already-resumed thread N causes a process exit, for example, by calling the _exit() function, the resumption of the remaining threads fails with the ESRCH (No such process) error because the process no longer exists. Previously, GDB did not expect this scenario and aborted the resumption command with a "ptrace: No such process." error message. Consequently, a subsequent "detach" command then failed an internal assertion due to inconsistent internal state. With this update, if a thread disappears while being resumed, GDB proceeds to collect the thread's exit status. As a result, the resumption command ends with the expected "Inferior exited normally" message and the subsequent "detach" command no longer crashes in the described scenario.
Clone Of:
Environment:
Last Closed:	2015-11-19 13:02:50 UTC
Target Upstream Version:
Embargoed:

Attachments
full steps of gdb (5.35 KB, text/plain) 2015-04-13 01:46 UTC, Luyao Huang	no flags
gdb log with debug 1 (138.15 KB, text/plain) 2015-05-22 03:44 UTC, Luyao Huang	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:2089	0	normal	SHIPPED_LIVE	gdb bug fix and enhancement update	2015-11-19 11:24:00 UTC

Description Luyao Huang 2015-01-22 06:51:45 UTC

Description of problem:
GDB reports an internal error: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed

Version-Release number of selected component (if applicable):
gdb-7.6.1-64.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1.
# gdb libvirtd `pidof libvirtd`
(gdb) br netcfStateCleanup
Breakpoint 1 at 0x7f1ea46cf8f0: file interface/interface_backend_netcf.c, line 102.
(gdb) c
Continuing.

2. Use another terminal to restart libvirtd:
# service libvirtd restart
Redirecting to /bin/systemctl restart  libvirtd.service

3. In the first terminal, step through the code for a while until GDB
reports "ptrace: No such process.", then exit GDB with Ctrl+D:

Breakpoint 1, netcfStateCleanup () at interface/interface_backend_netcf.c:102
102	{
(gdb) n
103	    if (!driverState) {
(gdb) 
109	    if (virObjectUnref(driverState)) {
(gdb) s
virObjectUnref (anyobj=0x7f1e9c0942c0) at util/virobject.c:249
....
(debug the code)
....
(gdb) 
netcfStateCleanup () at interface/interface_backend_netcf.c:115
115	    driverState = NULL;
(gdb) 
116	    return 0;
(gdb) 
117	}
(gdb) c
Continuing.
netcfStateCleanup () at interface/interface_backend_netcf.c:117
117	}
ptrace: No such process.
(gdb) ^CQuit
(gdb) quit
A debugging session is active.

	Inferior 1 [process 25484] will be detached.

Quit anyway? (y or n) y
[Thread 0x7f1ebab4c880 (LWP 25484) exited]

../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) y


Actual results:
GDB appears to crash with an internal error when quitting while it is debugging a process.

Expected results:
GDB exits successfully.

Additional info:

backtrace:

(gdb) bt
#0  0x00007f91332ed5d7 in raise () from /lib64/libc.so.6
#1  0x00007f91332eecc8 in abort () from /lib64/libc.so.6
#2  0x0000000000691a46 in dump_core () at ../../gdb/utils.c:761
#3  0x0000000000694255 in internal_vproblem (problem=0xc12330 <internal_error_problem>, file=<optimized out>, line=1869, fmt=0x7aa2b0 "%s: Assertion `%s' failed.", ap=0x7fffb33fa990) at ../../gdb/utils.c:919
#4  0x00000000006942c9 in internal_verror (file=<optimized out>, line=<optimized out>, fmt=<optimized out>, ap=ap@entry=0x7fffb33fa990) at ../../gdb/utils.c:944
#5  0x000000000069436f in internal_error (file=file@entry=0x7b4595 "../../gdb/linux-nat.c", line=line@entry=1869, string=<optimized out>) at ../../gdb/utils.c:954
#6  0x00000000004cb89d in linux_nat_detach (ops=0x14636a0, args=0x0, from_tty=1) at ../../gdb/linux-nat.c:1869
#7  0x00000000004d3385 in thread_db_detach (ops=<optimized out>, args=0x0, from_tty=1) at ../../gdb/linux-thread-db.c:1367
#8  0x00000000005f7ad1 in target_detach (args=0x0, from_tty=1) at ../../gdb/target.c:2605
#9  0x000000000068fe1c in kill_or_detach (inf=0x14e1590, args=0x7fffb33face0) at ../../gdb/top.c:1217
#10 0x00000000006b0f54 in iterate_over_inferiors (callback=callback@entry=0x68fda0 <kill_or_detach>, data=data@entry=0x7fffb33face0) at ../../gdb/inferior.c:395
#11 0x000000000068fb21 in quit_target (arg=arg@entry=0x7fffb33face0) at ../../gdb/top.c:1298
#12 0x00000000005cee0a in catch_errors (func=func@entry=0x68fb10 <quit_target>, func_args=func_args@entry=0x7fffb33face0, errstring=errstring@entry=0x84979a "Quitting: ", mask=mask@entry=6)
    at ../../gdb/exceptions.c:546
#13 0x0000000000690732 in quit_force (args=0x0, from_tty=1) at ../../gdb/top.c:1336
#14 0x00000000006901ba in execute_command (p=0x7c52ba "", p@entry=0x7c52b6 "quit", from_tty=1) at ../../gdb/top.c:487
#15 0x00000000005d8622 in command_handler (command=command@entry=0x0) at ../../gdb/event-top.c:431
#16 0x00000000005d8bef in command_line_handler (rl=0x0) at ../../gdb/event-top.c:505
#17 0x00007f91352b2c6e in rl_callback_read_char () at ../callback.c:220
#18 0x00000000005d8639 in rl_callback_read_char_wrapper (client_data=<optimized out>) at ../../gdb/event-top.c:164
#19 0x00000000005d71f4 in process_event () at ../../gdb/event-loop.c:342
#20 0x00000000005d7587 in gdb_do_one_event () at ../../gdb/event-loop.c:406
#21 0x00000000005d77b7 in start_event_loop () at ../../gdb/event-loop.c:431
#22 0x00000000005d0623 in captured_command_loop (data=data@entry=0x0) at ../../gdb/main.c:259
#23 0x00000000005cee0a in catch_errors (func=func@entry=0x5d0610 <captured_command_loop>, func_args=func_args@entry=0x0, errstring=errstring@entry=0x7b91db "", mask=mask@entry=6) at ../../gdb/exceptions.c:546
#24 0x00000000005d12d6 in captured_main (data=data@entry=0x7fffb33fb070) at ../../gdb/main.c:1134
#25 0x00000000005cee0a in catch_errors (func=func@entry=0x5d0a40 <captured_main>, func_args=func_args@entry=0x7fffb33fb070, errstring=errstring@entry=0x7b91db "", mask=mask@entry=6)
    at ../../gdb/exceptions.c:546
#26 0x00000000005d1f04 in gdb_main (args=args@entry=0x7fffb33fb070) at ../../gdb/main.c:1144
#27 0x00000000004572ee in main (argc=<optimized out>, argv=<optimized out>) at ../../gdb/gdb.c:34

Comment 1 Sergio Durigan Junior 2015-04-10 21:22:46 UTC

Hi Luyao,

I tried to reproduce this bug but I failed.  I followed your instructions, but when I do a "systemctl restart libvirtd" the command does not return, so I cannot really restart libvirtd when it is being debugged by GDB.  I also tried many variations of your instructions, but nothing "worked" (in the sense that I could not reproduce the bug).

Can you please review the instructions and make sure that they are really able to reproduce the issue?

Comment 2 Luyao Huang 2015-04-13 01:45:58 UTC

Hi Sergio Durigan Junior,

I cannot reproduce this issue every time. Fortunately, I can still reproduce it very easily on my machine, and I found some key points for reproducing it:

1.
# gdb libvirtd `pidof libvirtd`
(gdb) br netcfStateCleanup
Breakpoint 1 at 0x7f1ea46cf8f0: file interface/interface_backend_netcf.c, line 102.
(gdb) c
Continuing.

2. Open another terminal and restart libvirtd. This command will not return; just go on to the next step:

# service libvirtd restart
Redirecting to /bin/systemctl restart  libvirtd.service
                                                         <----- the command blocks here
3. The first terminal will show:

Program received signal SIGTERM, Terminated.

4. Then continue:

Program received signal SIGTERM, Terminated.
0x00007fefdb9a3b7d in poll () at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) c
Continuing.
[Thread 0x7fefcc470700 (LWP 11213) exited]
[Thread 0x7fefcfc77700 (LWP 11206) exited]
[Thread 0x7fefcb46e700 (LWP 11215) exited]
[Thread 0x7fefcf476700 (LWP 11207) exited]
[Thread 0x7fefce474700 (LWP 11209) exited]
[Thread 0x7fefccc71700 (LWP 11212) exited]
[Thread 0x7fefcdc73700 (LWP 11210) exited]
[Thread 0x7fefcec75700 (LWP 11208) exited]
[Thread 0x7fefcd472700 (LWP 11211) exited]
[Thread 0x7fefcbc6f700 (LWP 11214) exited]


5. Spend some time stepping until you hit the "ptrace: No such process." error:

Breakpoint 1, netcfStateCleanup () at interface/interface_backend_netcf.c:105

105	    if (!driver)
(gdb) n

...

264	            if (klass->dispose)
(gdb) n
265	                klass->dispose(obj);
(gdb) 
266	            klass = klass->parent;
(gdb) 
263	        while (klass) {
(gdb) 
264	            if (klass->dispose)
(gdb) 
266	            klass = klass->parent;
(gdb) 
263	        while (klass) {
(gdb) 
270	        memset(obj, 0, obj->klass->objectSize);
(gdb) 
271	        obj->u.s.magic = 0xDEADBEEF;
(gdb) n
272	        obj->klass = (void*)0xDEADBEEF;
(gdb) 
273	        VIR_FREE(obj);
(gdb) n
271	        obj->u.s.magic = 0xDEADBEEF;
(gdb) 
272	        obj->klass = (void*)0xDEADBEEF;
(gdb) 
273	        VIR_FREE(obj);
(gdb) 
276	    return !lastRef;
(gdb) n
277	}
(gdb) 
netcfStateCleanup () at interface/interface_backend_netcf.c:114
114	    driver = NULL;
(gdb) 
115	    return 0;
(gdb) 
116	}
(gdb) c
Continuing.
netcfStateCleanup () at interface/interface_backend_netcf.c:116
116	}
ptrace: No such process.
(gdb) ^CQuit
(gdb) quit
A debugging session is active.

	Inferior 1 [process 11205] will be detached.

Quit anyway? (y or n) y
[Thread 0x7fefdf322880 (LWP 11205) exited]

../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y
../../gdb/linux-nat.c:1869: internal-error: linux_nat_detach: Assertion `num_lwps (GET_PID (inferior_ptid)) == 1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) y
Aborted (core dumped)

I will also attach the full steps.

Comment 3 Luyao Huang 2015-04-13 01:46:40 UTC

Created attachment 1013784
full steps of gdb

Comment 4 Pedro Alves 2015-04-13 09:28:22 UTC

This patch probably fixes it:

  https://sourceware.org/ml/gdb-patches/2015-03/msg00597.html

Comment 5 Pedro Alves 2015-05-21 18:56:17 UTC

From:

~~~

(gdb) c
Continuing.
[Thread 0x7fefcc470700 (LWP 11213) exited]
[Thread 0x7fefcfc77700 (LWP 11206) exited]
[Thread 0x7fefcb46e700 (LWP 11215) exited]
[Thread 0x7fefcf476700 (LWP 11207) exited]
[Thread 0x7fefce474700 (LWP 11209) exited]
[Thread 0x7fefccc71700 (LWP 11212) exited]
[Thread 0x7fefcdc73700 (LWP 11210) exited]
[Thread 0x7fefcec75700 (LWP 11208) exited]
[Thread 0x7fefcd472700 (LWP 11211) exited]
[Thread 0x7fefcbc6f700 (LWP 11214) exited]


5. waste sometime to meet "ptrace: No such process." this error.
~~~

Everything indicates that this is the bug fixed by the patch at the URL pasted above.  That is, right while GDB is resuming all threads for your "continue", some thread exits the whole process:

~~~
(gdb) c
Continuing.
netcfStateCleanup () at interface/interface_backend_netcf.c:116
116	}
ptrace: No such process.
~~~

That is, GDB resumes threads 1, 2, 3, 4 and 5 in turn, and while it is resuming thread 5, thread 1 exits the whole process (via exit, _exit, etc.).

The internal error is a consequence of the "No such process" mishandling rather than the real bug itself.

Any chance you could try current upstream mainline?  A gdb log with "set debug infrun 1" and "set debug lin-lwp 1" would probably help too.
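To make the race above concrete, here is a minimal C sketch for Linux (not GDB's actual code) of a resume loop that handles ESRCH the way the Doc Text describes the fixed GDB behaving: a vanished LWP is counted and skipped, and the tracer then collects the exit status with waitpid() instead of aborting. The helper resume_lwps_tolerating_esrch() and the driver demo_resume_race() are hypothetical names. Because tracing a real multi-threaded child would need PTRACE_O_TRACECLONE bookkeeping, the demo assumes a single-threaded child whose pid is listed twice, so the second resume request stands in for a sibling LWP that is already gone.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: resume each LWP in LWPS with PTRACE_CONT.  ESRCH
   means the LWP (or its whole process) is already gone, so count it and
   keep going instead of aborting the resume.  Returns the number of
   vanished LWPs, or -1 on any other ptrace error. */
static int resume_lwps_tolerating_esrch(const pid_t *lwps, size_t n)
{
    int vanished = 0;

    for (size_t i = 0; i < n; i++) {
        errno = 0;
        if (ptrace(PTRACE_CONT, lwps[i], NULL, NULL) == 0)
            continue;
        if (errno == ESRCH) {
            vanished++;             /* its exit status is collected later */
            continue;
        }
        return -1;
    }
    return vanished;
}

/* Hypothetical demo: a single-threaded tracee whose pid is listed twice.
   The first PTRACE_CONT lets it run to _exit(); the second request then
   plays the role of a sibling LWP that is found gone.  Returns the exit
   code collected by waitpid(), or -1 if anything unexpected happens. */
static int demo_resume_race(void)
{
    int status = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);             /* stop so the parent can take over */
        _exit(0);                   /* the "thread" that exits the process */
    }

    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        return -1;

    pid_t lwps[2] = { child, child };
    int vanished = resume_lwps_tolerating_esrch(lwps, 2);
    assert(vanished == 1);          /* the second resume hit ESRCH */

    /* Rather than erroring out, wait for the process and report how it
       exited, as the Doc Text describes for the fixed GDB. */
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling demo_resume_race() should return 0, the child's exit code, which corresponds to ending the "continue" with "Inferior exited normally" rather than with "ptrace: No such process.".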

Comment 6 Luyao Huang 2015-05-22 03:43:49 UTC

(In reply to Pedro Alves from comment #5)
> 
> The internal error is more a consequence of the "No such process"
> mishandling, than the real bug.
> 
> Any change you could try current upstream mainline?  Also a gdb log with
> "set debug infrun 1 + set debug lin-lwp 1" would probably help.

I will attach the log taken with "set debug infrun 1" and "set debug lin-lwp 1".

I also tested with upstream gdb and did not hit the gdb crash:

0x00007f4092e23b7d in poll () from /lib64/libc.so.6
(gdb) br netcfStateCleanup
Breakpoint 1 at 0x7f4081edfb30
(gdb) c
Continuing.

Program received signal SIGTERM, Terminated.
[Switching to Thread 0x7f4087327700 (LWP 12125)]
0x00007f409310b705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) c
Continuing.

Program received signal SIGCONT, Continued.
[Switching to Thread 0x7f40969e8880 (LWP 12124)]
0x00007f4092e23b7d in poll () from /lib64/libc.so.6
(gdb) c
Continuing.
[Thread 0x7f408331f700 (LWP 12133) exited]
[Thread 0x7f4083b20700 (LWP 12132) exited]
[Thread 0x7f4084b22700 (LWP 12130) exited]
[Thread 0x7f4085323700 (LWP 12129) exited]
[Thread 0x7f4085b24700 (LWP 12128) exited]
[Thread 0x7f4086325700 (LWP 12127) exited]
[Thread 0x7f4086b26700 (LWP 12126) exited]
[Thread 0x7f4087327700 (LWP 12125) exited]
[Thread 0x7f4082b1e700 (LWP 12134) exited]
[Thread 0x7f4084321700 (LWP 12131) exited]

Breakpoint 1, 0x00007f4081edfb30 in netcfStateCleanup () from /usr/lib64/libvirt/connection-driver/libvirt_driver_interface.so
(gdb) l
No symbol table is loaded.  Use the "file" command.
(gdb) n
Single stepping until exit from function netcfStateCleanup,
which has no line number information.
0x00007f4095f8c2a8 in virStateCleanup () from /lib64/libvirt.so.0
(gdb) n
Single stepping until exit from function virStateCleanup,
which has no line number information.

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) n
The program is not being run.
(gdb) c
The program is not being run.
(gdb) c
The program is not being run.
(gdb) ^CQuit
(gdb) quit

Comment 7 Luyao Huang 2015-05-22 03:44:33 UTC

Created attachment 1028521
gdb log with debug 1

Comment 8 Pedro Alves 2015-05-22 15:39:43 UTC

On 05/22/2015 04:43 AM, bugzilla wrote:
>
> And test with upstream gdb and didn't meet the gdb crash:

Thanks for testing that!

> Program received signal SIGTERM, Terminated.
> [Switching to Thread 0x7f4087327700 (LWP 12125)]
> 0x00007f409310b705 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> (gdb) c
> Continuing.
>
> Program received signal SIGCONT, Continued.
> [Switching to Thread 0x7f40969e8880 (LWP 12124)]
> 0x00007f4092e23b7d in poll () from /lib64/libc.so.6

...

> (gdb) n
> Single stepping until exit from function netcfStateCleanup,
> which has no line number information.
> 0x00007f4095f8c2a8 in virStateCleanup () from /lib64/libvirt.so.0
> (gdb) n
> Single stepping until exit from function virStateCleanup,
> which has no line number information.
>
> Program terminated with signal SIGKILL, Killed.
> The program no longer exists.
> (gdb) n

So this shows that something outside GDB is killing the process
with SIGKILL.  On Linux, a ptracer cannot intercept that signal
before it kills the process.  We also see that something is sending
SIGTERM and SIGCONT to the process, but GDB can intercept those.
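The difference between SIGTERM and SIGKILL under ptrace can be seen in a small C sketch for Linux. demo_sigterm_vs_sigkill() is a hypothetical name, and the tracer sends both signals itself as a stand-in for the outside process that sent them in this report. SIGTERM shows up in waitpid() as a stop that the tracer can suppress, while SIGKILL produces no stop at all: the next wait status is the termination itself.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the signal that finally terminated the
   tracee (expected: SIGKILL), or -1.  The asserts record what the tracer
   is told along the way. */
static int demo_sigterm_vs_sigkill(void)
{
    int status = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);             /* hand control to the tracer */
        for (;;)
            pause();                /* idle until signalled from outside */
    }

    if (waitpid(child, &status, 0) != child)
        return -1;
    assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
    ptrace(PTRACE_CONT, child, NULL, NULL);   /* signal 0: drop SIGSTOP */

    /* SIGTERM: the tracee enters a signal-delivery-stop first, so the
       tracer sees it and may discard it by resuming with signal 0. */
    kill(child, SIGTERM);
    if (waitpid(child, &status, 0) != child)
        return -1;
    assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTERM);
    ptrace(PTRACE_CONT, child, NULL, NULL);

    /* SIGKILL: no stop is reported; the next wait status is the death. */
    kill(child, SIGKILL);
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```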

We don't see the SIGKILL in the debug logs with the crashy gdb,
but I think that it's the same.  We see:

(gdb) sigchld
      ^^^^^^^
n
infrun: clear_proceed_status_thread (Thread 0x7fa2889eb880 (LWP 18511))
infrun: proceed (addr=0xffffffffffffffff, signal=144, step=1)
infrun: resume (step=RESUME_STEP_USER, signal=0), trap_expected=0, current thread [Thread 0x7fa2889eb880 (LWP 18511)] at 0x7fa287ef20ac
LLR: Preparing to step Thread 0x7fa2889eb880 (LWP 18511), 0, inferior_ptid Thread 0x7fa2889eb880 (LWP 18511)
virObjectUnref (anyobj=<error reading variable: Cannot access memory at address 0x7fffacf83308>) at util/virobject.c:270
270	        memset(obj, 0, obj->klass->objectSize);
ptrace: No such process.
(gdb)
infrun

That "sigchld" indicates that a ptracee changed state (and so gdb's SIGCHLD
handler was called).  That must have been the SIGKILL.  When gdb then tries
to step LWP 18511, the request fails with ESRCH because the whole process
is gone (killed by SIGKILL), hence the "No such process."  That
error/exception means GDB never reaches the waitpid call again, so the logs
never show SIGKILL's wait status.

This is exactly the sort of scenario that that patch upstream
addresses (though more could be done).

Thanks,
Pedro Alves
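The sequence described in Comment 8 (tracee stopped, external SIGKILL, step request fails with ESRCH) can be reproduced with the following C sketch for Linux. demo_step_after_sigkill() is a hypothetical name, and kill() from the tracer stands in for whatever killed libvirtd. It shows that PTRACE_SINGLESTEP fails with ESRCH, yet a following waitpid() still reports the SIGKILL, which is the wait status the crashing GDB never got to collect.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the tracee is stopped, a SIGKILL arrives, and the
   tracer's next step request fails with ESRCH.  Returns the signal
   reported by waitpid() afterwards (expected: SIGKILL), or -1. */
static int demo_step_after_sigkill(void)
{
    int status = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);
        for (;;)
            pause();
    }

    if (waitpid(child, &status, 0) != child)
        return -1;
    assert(WIFSTOPPED(status));     /* stopped, as if ready for "next" */

    kill(child, SIGKILL);           /* stand-in for the external killer */

    errno = 0;
    long rc = ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
    assert(rc == -1 && errno == ESRCH);   /* "ptrace: No such process." */

    /* Treating ESRCH as "go and wait" rather than as a fatal error means
       the SIGKILL wait status is still collected. */
    if (waitpid(child, &status, 0) != child || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```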

Comment 14 errata-xmlrpc 2015-11-19 13:02:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2089.html
