Bug 60998

Summary: Top dumps core with floating point exception w/ long uptime
Product: [Retired] Red Hat Linux
Component: procps
Version: 7.0
Hardware: i686
OS: Linux
Reporter: Basil Hussain <basil.hussain>
Assignee: Alexander Larsson <alexl>
QA Contact: Aaron Brown <abrown>
CC: benjamin.weigand, tao
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Doc Type: Bug Fix
Last Closed: 2002-08-08 08:39:53 UTC

Attachments: Core Dump (gzipped)

Description Basil Hussain 2002-03-11 17:45:52 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)

Description of problem:
When top is run, after displaying the uptime, number of processes, and the first
CPU's state (on an SMP system, anyway), it crashes with a floating point
exception and dumps a core file.

Version-Release number of selected component (if applicable):
procps-2.0.7-3

How reproducible:
Always

Steps to Reproduce:
1. Login as any user.
2. Issue the command 'top'.	

Actual Results:  Here's an example of console output:

  5:22pm  up 259 days,  1:38,  1 user,  load average: 0.00, 0.00, 0.00
66 processes: 65 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user, 100.0% system,  0.0% nice,  0.0% idle
Floating point exception (core dumped)

Expected Results:  Top should run as normal.

Additional info:

Strangely, it works perfectly every time when run under strace. For 
example: 'strace -o somefile.txt top'.

Also, I think this may possibly be related to uptime. This problem only started 
occurring at around 250 days uptime.

Finally, ps works as normal and I have core files and trace logs if needed.

Comment 1 Basil Hussain 2002-03-11 17:50:48 UTC
Created attachment 48164 [details]
Core Dump (gzipped)

Comment 2 Benjamin Weigand 2002-04-25 18:15:12 UTC
I have experienced the same problem on several RedHat 7.1 servers. 
Kernel: 2.4.2-2smp
Uptime: 200+ days (a reboot cleared the problem on all)
Summary: 'top' exits with a floating point exception after displaying its 
header. When run under 'strace', 'top' does not fail. No other apps or 
utilities seem to be experiencing problems. Following a reboot, 'top' runs 
fine.
 


Comment 3 John Stubbs 2002-06-24 14:22:17 UTC
Have seen the same thing with Red Hat 7.0, kernel 2.2.16-22smp, procps-2.0.7-3
with an uptime of 364 days.  Have not yet tried rebooting.

Comment 4 Basil Hussain 2002-08-02 10:43:34 UTC
Problem still evident at 400+ days (yay!) of uptime. Rebooted (boo!), which 
obviously did indeed cure the problem.

I've just remembered an old e-mail I had received from Chad Schmutzer 
(schmutze.edu) noting the same problem:

> I noticed your post on the dell linux-poweredge mailing list from March. 
> (I pasted it below). The reason I am writing you is that I just discovered 
> the same problem with my Dell Poweredge server. When I run 'top' it does 
> this:
> 
> 12:00am  up 271 days,  8:13,  1 user,  load average: 0.59, 0.21, 0.12
> 58 processes: 56 sleeping, 1 running, 1 zombie, 0 stopped
> CPU0 states: 50.0% user, 50.0% system,  0.0% nice,  0.0% idle
> Floating point exception (core dumped)
> 
> Everything else seems normal as far as I can tell, but it is concerning me 
> a bit.
> 
> Anyhow, I have 2 identical servers, with pretty much the same OS levels 
> and kernels installed. My kernel is 2.2.19, which I compiled myself.
> 
> Do you have multiple processors?
> 
> I did notice that besides both of us running a 2.2 kernel, we both have 
> very long uptimes. Mine is 271 days, your was 252 days. My second 
> identical server only has an uptime of 178 days. On this server top runs 
> fine. Perhaps there is some sort of a bug with our specific hardware and 
> the 2.2 kernel after so many days of uptime?
> 
> I am curious if your problem went away, and if so, how did you fix it? My 
> first thought is to reboot, but if this is a sign of a failing processor, 
> I need to be prepared if the system does not come back up.

Comment 5 Phil Copeland 2002-08-07 11:49:22 UTC
From my SMP x86 running on a 2.2.19 kernel (7.1 install, upgraded on the fly to 7.2)


285 processes: 282 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 72.0% user, 27.0% system,  0.0% nice,  0.0% idle
CPU1 states: 96.0% user,  3.1% system,  0.0% nice,  0.0% idle
Mem:   386136K av,  383132K used,    3004K free,  928164K shrd,    7844K buff
Swap:  248968K av,   15540K used,  233428K free                   93876K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 6615 jlbec     11   0 30952  30M 28308 R    96.5  8.0 657:20 infobot
26370 root       5   0  1148 1092   928 R    49.0  0.2   0:00 sshd
26369 root      10   0  1216 1216   840 R    22.5  0.3   0:00 top
27640 root       0   0   440  308   240 S     1.5  0.0  51:32 sshd
16511 apache     0   0 23660  23M 22812 S     1.5  6.1   0:08 httpd
20277 ftp       10  10   912  912   716 S N   0.7  0.2   8:21 vsftpd
...etc

Phil
=--=


Comment 6 Alexander Larsson 2002-08-07 11:51:03 UTC
Any chance someone with the problem could upgrade to procps from 7.2 or 7.3 and
test?


Comment 7 Basil Hussain 2002-08-07 13:19:17 UTC
I downloaded procps-2.0.7-12 from the 7.3 distribution and installed it on one 
server that hadn't been rebooted (i.e. still 400+ days uptime).

It seems to have had some effect, but not a cure. Whereas before top would dump 
core with a floating point exception without fail, now it fails only 
intermittently. I can't seem to find a pattern in its working/non-working 
behaviour. Also, it now sometimes doesn't even get as far as showing CPU zero's 
state before crashing - for example:

  2:16pm  up 407 days, 21:24,  1 user,  load average: 0.00, 0.00, 0.00
26 processes: 25 sleeping, 1 running, 0 zombie, 0 stopped
Floating point exception (core dumped)

Comment 8 Basil Hussain 2002-08-07 13:30:23 UTC
Actually, I have noticed just one thing right now that seems fairly consistent. 
When top does manage to run, CPU usage is reported as being high. For example:

  2:08pm  up 407 days, 21:15,  1 user,  load average: 0.00, 0.02, 0.00
25 processes: 24 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 20.0% user, 80.0% system,  0.0% nice,  0.0% idle
CPU1 states:  0.0% user, 100.0% system,  0.0% nice,  0.0% idle
Mem:   516996K av,  310164K used,  206832K free,    6552K shrd,  239788K buff
Swap:  530104K av,    1452K used,  528652K free                   45156K cached

Compare and contrast with one identical server (still with standard procps-
2.0.7-3) that I have recently rebooted and thus has a short uptime:

  2:22pm  up 5 days, 23:33,  1 user,  load average: 0.00, 0.00, 0.00
50 processes: 49 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user,  0.1% system,  0.0% nice, 99.4% idle
CPU1 states:  0.1% user,  0.2% system,  0.0% nice, 99.2% idle
Mem:   516996K av,  444424K used,   72572K free,  106256K shrd,  256228K buff
Swap:  530104K av,    1596K used,  528508K free                  132824K cached

Comment 9 Alexander Larsson 2002-08-07 14:20:05 UTC
Can you try to install:
http://people.redhat.com/alexl/RPMS/procps-2.0.7-12.3test.i386.rpm
It is built with debug info.

Then run it in gdb to give a backtrace:
gdb top
<wait for segfault>
bt

and paste the gdb output here.


Comment 10 Basil Hussain 2002-08-07 15:10:53 UTC
You missed a step! I was slightly perplexed until I realised I needed to 
type "run" from within gdb... Anyway, here's the output:

#0  0x804cefe in do_stats (p=0x805e488, elapsed_time=1.01438904, pass=1) at 
top.c:1488
#1  0x804c3e1 in show_procs () at top.c:1180
#2  0x804a078 in main (argc=1, argv=0xbffffa84) at top.c:502
#3  0x4008db65 in __libc_start_main (main=0x8049b74 <main>, argc=1, 
ubp_av=0xbffffa84, init=0x8049078 <_init>, fini=0x804ed60 <_fini>, 
    rtld_fini=0x4000df24 <_dl_fini>, stack_end=0xbffffa7c) 
at ../sysdeps/generic/libc-start.c:111

Also, I presume "kill"-ing top before exiting gdb is the right thing to do, yes?

Comment 11 Alexander Larsson 2002-08-07 16:26:17 UTC
What does gdb say when it crashes?

Can you redo the same thing, but after getting the backtrace (assuming it is the
same) give me the output of the following commands:

p i
p t_ticks
p u_ticks
p s_ticks
p n_ticks
p i_ticks
p u_ticks_o[i]
p s_ticks_o[i]
p n_ticks_o[i]
p i_ticks_o[i]

Please also give me the contents of /proc/stat at the same time.

You don't have to kill top. Just exit gdb with "quit" and it will kill the
process. Sorry about missing the run part.


Comment 12 Basil Hussain 2002-08-07 16:35:36 UTC
Okay, here is the complete output from gdb:

  5:30pm  up 408 days, 37 min,  1 user,  load average: 0.06, 0.01, 0.00
27 processes: 26 sleeping, 1 running, 0 zombie, 0 stopped

Program received signal SIGFPE, Arithmetic exception.
0x804cefe in do_stats (p=0x805e488, elapsed_time=1.01345396, pass=1) at 
top.c:1488
1488    top.c: No such file or directory.
(gdb) bt
#0  0x804cefe in do_stats (p=0x805e488, elapsed_time=1.01345396, pass=1) at 
top.c:1488
#1  0x804c3e1 in show_procs () at top.c:1180
#2  0x804a078 in main (argc=1, argv=0xbffffa84) at top.c:502
#3  0x4008db65 in __libc_start_main (main=0x8049b74 <main>, argc=1, 
ubp_av=0xbffffa84, init=0x8049078 <_init>, fini=0x804ed60 <_fini>, 
    rtld_fini=0x4000df24 <_dl_fini>, stack_end=0xbffffa7c) 
at ../sysdeps/generic/libc-start.c:111
(gdb) p i
$1 = 0
(gdb) p t_ticks
$2 = 0
(gdb) p u_ticks
$3 = 6431537
(gdb) p s_ticks
$4 = 25386489
(gdb) p n_ticks
$5 = 1
(gdb) p i_ticks
$6 = 2147483647
(gdb) p u_ticks_o[i]
$7 = 6431537
(gdb) p s_ticks_o[i]
$8 = 25386489
(gdb) p n_ticks_o[i]
$9 = 1
(gdb) p i_ticks_o[i]
$10 = 2147483647

Here is the contents of /proc/stat:

cpu  12819927 1 50595496 2692321490
cpu0 6431537 1 25386489 3493534078
cpu1 6388390 0 25209007 3493754708
disk 26968499 0 0 0
disk_rio 512950 0 0 0
disk_wio 26455549 0 0 0
disk_rblk 4090616 0 0 0
disk_wblk 211580702 0 0 0
page 1056600 34430144
swap 1230 2090
intr 2438576004 3525352105 3543 0 0 3 0 3 3116362360 1 0 0 0 0 1 6 0 64880821 0 
0 0 0 0 0 0 0 26944367 0 45 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0
ctxt 493584508
btime 993484360
processes 80051651

Comment 13 Alexander Larsson 2002-08-07 16:53:36 UTC
Oh, I see. It's a signed vs unsigned bug. I will spin new packages tomorrow that
should fix this.


Comment 14 Alexander Larsson 2002-08-08 08:15:41 UTC
Can you try:
http://people.redhat.com/alexl/RPMS/procps-2.0.7-12.4test.i386.rpm

Comment 15 Basil Hussain 2002-08-08 08:39:48 UTC
Success! Top now runs every time. Plus, reported CPU usage isn't all over the 
place, but as one would expect:

  9:31am  up 408 days, 16:38,  1 user,  load average: 0.05, 0.01, 0.00
26 processes: 25 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user,  0.1% system,  0.0% nice, 99.4% idle
CPU1 states:  0.0% user,  0.2% system,  0.0% nice, 99.3% idle
Mem:   516996K av,  314588K used,  202408K free,    8224K shrd,  239788K buff
Swap:  530104K av,    1424K used,  528680K free                   49064K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1368 root      17   0  1008 1008   820 R     0.3  0.1   0:00 top
 1288 root       2   0  1908 1836  1452 S     0.1  0.3   0:00 sshd
[etc...]

Comment 16 Alexander Larsson 2002-08-08 11:28:45 UTC
Should be fixed in 2.0.7-23 in rawhide.