Bug 60998
Summary: | Top dumps core with floating point exception w/ long uptime
---|---
Product: | [Retired] Red Hat Linux
Reporter: | Basil Hussain <basil.hussain>
Component: | procps
Assignee: | Alexander Larsson <alexl>
Status: | CLOSED RAWHIDE
QA Contact: | Aaron Brown <abrown>
Severity: | medium
Priority: | medium
Version: | 7.0
CC: | benjamin.weigand, tao
Hardware: | i686
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2002-08-08 08:39:53 UTC
Description
Basil Hussain
2002-03-11 17:45:52 UTC
Created attachment 48164
Core Dump (gzipped)
I have experienced the same problem on several RedHat 7.1 servers.
Kernel: 2.4.2-2smp
Uptime: 200+ days (a reboot cleared the problem on all)
Summary: 'top' exits with a segmentation fault (floating point exception) after displaying header. When run under 'strace', 'top' does not fail. No other apps or utilities seem to be experiencing problems. Following reboot, 'top' runs fine.

Have seen the same thing with Redhat 7.0, kernel 2.2.16-22smp, procps-2.0.7-3 with an uptime of 364 days. Have not yet tried rebooting.

Problem still evident at 400+ days (yay!) of uptime. Rebooted (boo!), which obviously did indeed cure the problem.
I've just remembered an old e-mail I had received from Chad Schmutzer
(schmutze.edu) noting the same problem:
> I noticed your post on the dell linux-poweredge mailing list from March.
> (I pasted it below). The reason I am writing you is that I just discovered
> the same problem with my Dell Poweredge server. When I run 'top' it does
> this:
>
> 12:00am up 271 days, 8:13, 1 user, load average: 0.59, 0.21, 0.12
> 58 processes: 56 sleeping, 1 running, 1 zombie, 0 stopped
> CPU0 states: 50.0% user, 50.0% system, 0.0% nice, 0.0% idle
> Floating point exception (core dumped)
>
> Everything else seems normal as far as I can tell, but it is concerning me
> a bit.
>
> Anyhow, I have 2 identical servers, with pretty much the same OS levels
> and kernels installed. My kernel is 2.2.19, which I compiled myself.
>
> Do you have multiple processors?
>
> I did notice that besides both of us running a 2.2 kernel, we both have
> very long uptimes. Mine is 271 days, yours was 252 days. My second
> identical server only has an uptime of 178 days. On this server top runs
> fine. Perhaps there is some sort of a bug with our specific hardware and
> the 2.2 kernel after so many days of uptime?
>
> I am curious if your problem went away, and if so, how did you fix it? My
> first thought is to reboot, but if this is a sign of a failing processor,
> I need to be prepared if the system does not come back up.
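
The uptimes quoted in this e-mail (271 and 252 days failing, 178 days fine) line up with a simple overflow threshold, given the diagnosis further down this report where the idle tick counter is seen pinned at 2147483647. The sketch below is only a back-of-the-envelope check, not anything from procps. It assumes a timer rate of 100 ticks per second (`HZ = 100`, the usual value on i386 with 2.2/2.4 kernels) and a machine that is idle almost all the time, so that idle ticks grow roughly with uptime. The function names are made up for illustration.

```python
# Back-of-the-envelope check: when does a signed 32-bit tick counter run out?
HZ = 100                  # ASSUMPTION: timer ticks per second (i386, 2.2/2.4 kernels)
INT32_MAX = 2**31 - 1     # largest value a signed 32-bit int can hold
SECONDS_PER_DAY = 86400


def days_until_overflow(hz=HZ, limit=INT32_MAX):
    """Days of continuous idle time before the tick count exceeds `limit`."""
    return limit / hz / SECONDS_PER_DAY


def idle_counter_overflows(uptime_days, hz=HZ, limit=INT32_MAX):
    """True if a (nearly) always-idle CPU's idle-tick count would exceed `limit`."""
    return uptime_days * SECONDS_PER_DAY * hz > limit


if __name__ == "__main__":
    print(f"threshold: {days_until_overflow():.1f} days")
    for days in (178, 252, 271):   # uptimes quoted in the e-mail above
        print(days, "days:", "overflows" if idle_counter_overflows(days) else "ok")
```

Under these assumptions the threshold falls a little under 250 days, which would explain why the 178-day server still runs top normally. A busy machine accumulates idle ticks more slowly and would hit the limit later.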
From my SMP x86 running on a 2.2.19 kernel (7.1 install, upgraded on the fly to 7.2)

285 processes: 282 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 72.0% user, 27.0% system, 0.0% nice, 0.0% idle
CPU1 states: 96.0% user, 3.1% system, 0.0% nice, 0.0% idle
Mem: 386136K av, 383132K used, 3004K free, 928164K shrd, 7844K buff
Swap: 248968K av, 15540K used, 233428K free 93876K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 6615 jlbec     11   0 30952  30M 28308 R    96.5  8.0 657:20 infobot
26370 root       5   0  1148 1092   928 R    49.0  0.2   0:00 sshd
26369 root      10   0  1216 1216   840 R    22.5  0.3   0:00 top
27640 root       0   0   440  308   240 S     1.5  0.0  51:32 sshd
16511 apache     0   0 23660  23M 22812 S     1.5  6.1   0:08 httpd
20277 ftp       10  10   912  912   716 S N   0.7  0.2   8:21 vsftpd
...etc

Phil
=--=

Any chance someone with the problem could upgrade to procps from 7.2 or 7.3 and test?

I downloaded procps-2.0.7-12 from the 7.3 distribution and installed it on one server that hadn't been rebooted (i.e. still 400+ days uptime). It seems to have had some effect, but not a cure. Whereas before top would dump with a floating point exception without fail, now it is just intermittent. I can't seem to find a pattern with its working/non-working behaviour. Also, it now sometimes doesn't even get to showing CPU zero's state before crashing - for example:

2:16pm up 407 days, 21:24, 1 user, load average: 0.00, 0.00, 0.00
26 processes: 25 sleeping, 1 running, 0 zombie, 0 stopped
Floating point exception (core dumped)

Actually, I have noticed just one thing right now that seems fairly consistent. When top does manage to run, CPU usage is reported as being high.
For example:

2:08pm up 407 days, 21:15, 1 user, load average: 0.00, 0.02, 0.00
25 processes: 24 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 20.0% user, 80.0% system, 0.0% nice, 0.0% idle
CPU1 states: 0.0% user, 100.0% system, 0.0% nice, 0.0% idle
Mem: 516996K av, 310164K used, 206832K free, 6552K shrd, 239788K buff
Swap: 530104K av, 1452K used, 528652K free 45156K cached

Compare and contrast with one identical server (still with standard procps-2.0.7-3) that I have recently rebooted and thus has a short uptime:

2:22pm up 5 days, 23:33, 1 user, load average: 0.00, 0.00, 0.00
50 processes: 49 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 0.0% user, 0.1% system, 0.0% nice, 99.4% idle
CPU1 states: 0.1% user, 0.2% system, 0.0% nice, 99.2% idle
Mem: 516996K av, 444424K used, 72572K free, 106256K shrd, 256228K buff
Swap: 530104K av, 1596K used, 528508K free 132824K cached

Can you try to install:
http://people.redhat.com/alexl/RPMS/procps-2.0.7-12.3test.i386.rpm

It is built with debug info. Then run it in gdb to give a backtrace:

gdb top
<wait for segfault>
bt

and paste the gdb output here.

You missed a step! I was slightly perplexed until I realised I needed to type "run" from within gdb... Anyway, here's the output:

#0 0x804cefe in do_stats (p=0x805e488, elapsed_time=1.01438904, pass=1) at top.c:1488
#1 0x804c3e1 in show_procs () at top.c:1180
#2 0x804a078 in main (argc=1, argv=0xbffffa84) at top.c:502
#3 0x4008db65 in __libc_start_main (main=0x8049b74 <main>, argc=1, ubp_av=0xbffffa84, init=0x8049078 <_init>, fini=0x804ed60 <_fini>, rtld_fini=0x4000df24 <_dl_fini>, stack_end=0xbffffa7c) at ../sysdeps/generic/libc-start.c:111

Also, I presume "kill"-ing top before exiting gdb is the right thing to do, yes?

What does gdb say when it crashes?
Can you redo the same thing, but after getting the backtrace (assuming it is the same) give me the output of the following commands: p i p t_ticks p u_ticks p s_ticks p n_ticks p i_ticks p u_ticks_o[i] p s_ticks_o[i] p n_ticks_o[i] p i_ticks_o[i] Please also give me the contents of /proc/stat at the same time. You don't have to kill top. Just exit gdb with "quit" and it will kill the process. Sorry about missing the run part. Okay, here is the complete output from gdb: 5:30pm up 408 days, 37 min, 1 user, load average: 0.06, 0.01, 0.00 27 processes: 26 sleeping, 1 running, 0 zombie, 0 stopped Program received signal SIGFPE, Arithmetic exception. 0x804cefe in do_stats (p=0x805e488, elapsed_time=1.01345396, pass=1) at top.c:1488 1488 top.c: No such file or directory. (gdb) bt #0 0x804cefe in do_stats (p=0x805e488, elapsed_time=1.01345396, pass=1) at top.c:1488 #1 0x804c3e1 in show_procs () at top.c:1180 #2 0x804a078 in main (argc=1, argv=0xbffffa84) at top.c:502 #3 0x4008db65 in __libc_start_main (main=0x8049b74 <main>, argc=1, ubp_av=0xbffffa84, init=0x8049078 <_init>, fini=0x804ed60 <_fini>, rtld_fini=0x4000df24 <_dl_fini>, stack_end=0xbffffa7c) at ../sysdeps/generic/libc-start.c:111 (gdb) p i $1 = 0 (gdb) p t_ticks $2 = 0 (gdb) p u_ticks $3 = 6431537 (gdb) p s_ticks $4 = 25386489 (gdb) p n_ticks $5 = 1 (gdb) p i_ticks $6 = 2147483647 (gdb) p u_ticks_o[i] $7 = 6431537 (gdb) p s_ticks_o[i] $8 = 25386489 (gdb) p n_ticks_o[i] $9 = 1 (gdb) p i_ticks_o[i] $10 = 2147483647 Here is the contents of /proc/stat: cpu 12819927 1 50595496 2692321490 cpu0 6431537 1 25386489 3493534078 cpu1 6388390 0 25209007 3493754708 disk 26968499 0 0 0 disk_rio 512950 0 0 0 disk_wio 26455549 0 0 0 disk_rblk 4090616 0 0 0 disk_wblk 211580702 0 0 0 page 1056600 34430144 swap 1230 2090 intr 2438576004 3525352105 3543 0 0 3 0 3 3116362360 1 0 0 0 0 1 6 0 64880821 0 0 0 0 0 0 0 0 26944367 0 45 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 493584508 btime 993484360 processes 80051651 Oh, i see. It's a signed vs unsigned bug. I will spin new packages tomorrow that should fix this. Success! Top now runs every time. Plus, reported CPU usage isn't all over the place, but as one would expect: 9:31am up 408 days, 16:38, 1 user, load average: 0.05, 0.01, 0.00 26 processes: 25 sleeping, 1 running, 0 zombie, 0 stopped CPU0 states: 0.0% user, 0.1% system, 0.0% nice, 99.4% idle CPU1 states: 0.0% user, 0.2% system, 0.0% nice, 99.3% idle Mem: 516996K av, 314588K used, 202408K free, 8224K shrd, 239788K buff Swap: 530104K av, 1424K used, 528680K free 49064K cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 1368 root 17 0 1008 1008 820 R 0.3 0.1 0:00 top 1288 root 2 0 1908 1836 1452 S 0.1 0.3 0:00 sshd [etc...] Should be fixed in 2.0.7-23 in rawhide. |
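
For anyone reading this later, the signed vs unsigned diagnosis can be illustrated with the cpu0 figures from this report. The gdb session shows i_ticks stuck at 2147483647 while /proc/stat reports 3493534078 idle ticks, which is consistent with the value saturating at the signed 32-bit maximum. Once that happens the idle delta between two samples is always zero. On a quiet CPU every delta is then zero, so t_ticks is 0 and the percentage division faults. The sketch below is a Python model of that arithmetic and not the procps code or the actual patch. The helper names, the `signed32` switch and the two "later" samples are hypothetical and made up for illustration.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the value gdb printed for i_ticks


def parse_cpu_line(line, signed32=False):
    """Return (user, nice, system, idle) ticks from a /proc/stat cpuN line.

    With signed32=True, values are clamped at INT32_MAX to model the
    saturated counter seen in gdb (a model, not the real parsing code).
    """
    fields = [int(x) for x in line.split()[1:5]]
    if signed32:
        fields = [min(v, INT32_MAX) for v in fields]
    return tuple(fields)


def cpu_percentages(prev, curr):
    """Percentages (user, nice, system, idle) between two samples.

    Returns None when no ticks elapsed, which is the case where top
    divided by zero and received SIGFPE.
    """
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total == 0:
        return None
    return [100.0 * d / total for d in deltas]


# First sample: the cpu0 line from the /proc/stat dump in this report.
BEFORE = "cpu0 6431537 1 25386489 3493534078"
# HYPOTHETICAL later samples, one second on at an assumed 100 ticks/s.
AFTER_IDLE = "cpu0 6431537 1 25386489 3493534178"   # 100 idle ticks
AFTER_BUSY = "cpu0 6431537 1 25386490 3493534177"   # 1 system + 99 idle ticks

if __name__ == "__main__":
    for label, after in (("idle", AFTER_IDLE), ("busy", AFTER_BUSY)):
        for signed32 in (True, False):
            prev = parse_cpu_line(BEFORE, signed32)
            curr = parse_cpu_line(after, signed32)
            print(label, "signed32" if signed32 else "full width",
                  cpu_percentages(prev, curr))
```

In this model the clamped counters give either no elapsed ticks at all (the crash) or 100% system time for a single system tick, which matches the intermittent failures and the inflated CPU figures reported above. With full-width values the same samples come out as 99% to 100% idle.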