Bug 1737552

Summary: Async unsafe call in batch mode can cause top to crash
Product: Red Hat Enterprise Linux 7 Reporter: Paulo Andrade <pandrade>
Component: procps-ngAssignee: Jan Rybar <jrybar>
Status: CLOSED ERRATA QA Contact: Karel Volný <kvolny>
Severity: high Docs Contact:
Priority: high    
Version: 7.5CC: fkrska, jrybar, kvolny, nbansal
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: procps-ng-3.3.10-27.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1739370 1739371 1739372 (view as bug list) Environment:
Last Closed: 2020-03-31 19:38:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1739370, 1739371, 1739372    

Description Paulo Andrade 2019-08-05 15:55:24 UTC
Backtrace looks like this:

#0  0x00007f71cdd6d8f1 in __GI__IO_putc (c=10, fp=0x7f71ce0bd400 <_IO_2_1_stdout_>) at putc.c:31
#1  0x00007f71ce2dd30a in tputs () from /lib64/libtinfo.so.5
#2  0x00000000004055c2 in bye_bye (str=str@entry=0x0) at top.c:563
#3  0x0000000000406b3f in sig_endpgm (dont_care_sig=<optimized out>) at top.c:625
#4  <signal handler called>
#5  0x00007f71cddef087 in munmap () at ../sysdeps/unix/syscall-template.S:81
#6  0x00007f71cdd72b62 in __GI__IO_setb (f=f@entry=0x7f71ce0bd400 <_IO_2_1_stdout_>, b=b@entry=0x0, eb=eb@entry=0x0, a=a@entry=0) at genops.c:402
#7  0x00007f71cdd70f40 in _IO_new_file_close_it (fp=fp@entry=0x7f71ce0bd400 <_IO_2_1_stdout_>) at fileops.c:194
#8  0x00007f71cdd64018 in _IO_new_fclose (fp=0x7f71ce0bd400 <_IO_2_1_stdout_>) at iofclose.c:59
#9  0x0000000000411131 in close_stream (stream=0x7f71ce0bd400 <_IO_2_1_stdout_>) at ../lib/fileutils.c:25
#10 0x00000000004111ad in close_stdout () at ../lib/fileutils.c:37
#11 0x00007f71cdd2fb69 in __run_exit_handlers (status=status@entry=0, listp=0x7f71ce0bc6c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#12 0x00007f71cdd2fbb7 in __GI_exit (status=status@entry=0) at exit.c:99
#13 0x00000000004055b8 in bye_bye (str=str@entry=0x0) at top.c:564
#14 0x0000000000402f2c in do_key (ch=<optimized out>) at top.c:4987
#15 main (dont_care_argc=<optimized out>, argv=<optimized out>) at top.c:5721

  The problem is that top apparently is being killed if it takes too much time to
execute, on a loaded system.

  Code is spwaning top from a loop, in batch mode, with a small timeout (-d option),
and one iteration (-n option), from third party code, to fetch some information
from running processes.

  The most likely solution with as few as possible chances of unexpected side
effects probably would be the pseudo-patch:

-   if (Batch) putp("\n");
+   if (Batch) {
+      if (Frames_signal == BREAK_sig)
+         write(STDOUT_FILENO, "\n", 1);
+      else
+         putp("\n");
+   }
    exit(EXIT_SUCCESS);
 } // end: bye_bye

in top/top.c:bye_bye()

Comment 4 Paulo Andrade 2019-08-06 12:30:00 UTC
  Procedures to reproduce are very close to a similar bug report at
https://bugzilla.redhat.com/show_bug.cgi?id=1403971

  The root cause of the problem should have been, for some reason, a slow procfs
read, and the top command being killed with a signal handled by sig_endpgm:

"""
         case SIGALRM: case SIGHUP:  case SIGINT:
         case SIGPIPE: case SIGQUIT: case SIGTERM:
         case SIGUSR1: case SIGUSR2:
            sa.sa_handler = sig_endpgm;
"""

  Need two terminals and run under gdb. For example, call #1 one terminal
and #2 another terminal:

<<< Start under gdb, with debuginfo packages installed, and stop on the
    close_stdout function >>>

#1 gdb -q --args top -b -n 1
[...]
(gdb) b close_stdout
Breakpoint 1 at 0x411260: file ../lib/fileutils.c, line 37.
(gdb) r
Starting program: /usr/bin/top -b -n 1
[...]

Breakpoint 1, close_stdout () at ../lib/fileutils.c:37
37		if (close_stream(stdout) != 0 && !(errno == EPIPE)) {

<<< Now stop in the next munmap >>>

(gdb) b munmap
Breakpoint 2 at 0x7ffff70593f0: file ../sysdeps/unix/syscall-template.S, line 81.
(gdb) c
Continuing.

Breakpoint 2, munmap () at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

<<< On munmap, step instructions a bit until returning from the syscall >>>

(gdb) si
0x00007ffff70593f5	81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) si
0x00007ffff70593f7	81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) 
0x00007ffff70593fd	81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) disassemble 
Dump of assembler code for function munmap:
   0x00007ffff70593f0 <+0>:	mov    $0xb,%eax
   0x00007ffff70593f5 <+5>:	syscall 
   0x00007ffff70593f7 <+7>:	cmp    $0xfffffffffffff001,%rax
=> 0x00007ffff70593fd <+13>:	jae    0x7ffff7059400 <munmap+16>
[...]

<<< get the top pid >>>

(gdb) call getpid()
$1 = 7520

<<< tell gdb just to be sure, to pass SIGTERM, and not stop on it >>

(gdb) handle SIGTERM nostop pass
Signal        Stop	Print	Pass to program	Description
SIGTERM       No	Yes	Yes		Terminated


<<< Now the important part to reproduce, in the second terminal, send the signal >>>

#2 kill -TERM 7520


<<< And on gdb, let it continue executing >>>

(gdb) c
Continuing.

Program received signal SIGTERM, Terminated.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6fdcd19 in _IO_new_file_overflow (
    f=0x7ffff7328400 <_IO_2_1_stdout_>, ch=10) at fileops.c:851
851	  *f->_IO_write_ptr++ = ch;
(gdb) bt
#0  0x00007ffff6fdcd19 in _IO_new_file_overflow (
    f=0x7ffff7328400 <_IO_2_1_stdout_>, ch=10) at fileops.c:851
#1  0x00007ffff6fd8999 in __GI__IO_putc (c=<optimized out>, 
    fp=0x7ffff7328400 <_IO_2_1_stdout_>) at putc.c:29
#2  0x00007ffff754830a in tputs (string=string@entry=0x411433 "\n", 
    affcnt=affcnt@entry=1, outc=outc@entry=0x7ffff7548020 <_nc_putchar>)
    at ../../ncurses/tinfo/lib_tputs.c:337
#3  0x00007ffff75485b1 in putp (string=string@entry=0x411433 "\n")
    at ../../ncurses/tinfo/lib_tputs.c:208
#4  0x0000000000405532 in bye_bye (str=str@entry=0x0) at top.c:563
#5  0x0000000000406aaf in sig_endpgm (dont_care_sig=<optimized out>)
    at top.c:625
#6  <signal handler called>
[...]

  Because the munmap, it is unlikely it will work by accident.

  Note that because the 'bye_bye' function is called a second time, it might
not be required to write the '\n'. The suggested patch is just to have the
same behaviour, and not crash, but most likely it will write it twice, once
when normally exiting, and again when receiving the signal, after returning
from main.

Comment 7 Jan Rybar 2019-08-07 11:58:38 UTC
(In reply to Paulo Andrade from comment #4)

>   Note that because the 'bye_bye' function is called a second time, it might
> not be required to write the '\n'. The suggested patch is just to have the
> same behaviour, and not crash, but most likely it will write it twice, once
> when normally exiting, and again when receiving the signal, after returning
> from main.

Maybe just using 'static volatile sig_atomic_t in_exit' and asking for it before calling exit() might prevent calling exit() twice and outputting '\n' twice also.

Comment 8 Paulo Andrade 2019-08-07 20:28:11 UTC
(In reply to Jan Rybar from comment #7)
> (In reply to Paulo Andrade from comment #4)
> 
> >   Note that because the 'bye_bye' function is called a second time, it might
> > not be required to write the '\n'. The suggested patch is just to have the
> > same behaviour, and not crash, but most likely it will write it twice, once
> > when normally exiting, and again when receiving the signal, after returning
> > from main.
> 
> Maybe just using 'static volatile sig_atomic_t in_exit' and asking for it
> before calling exit() might prevent calling exit() twice and outputting '\n'
> twice also.

  Yes. That probably would better match the expected behaviour. The suggested
patch was just to prevent the crash, as depending on when the signal is received,
while exiting, it might not crash, and still output twice '\n'.

Comment 19 errata-xmlrpc 2020-03-31 19:38:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1018