Description of Problem:
System log on one of our AMD Athlon based hosts is somewhat unstable; it will frequently get into a state where it fails to respond to most input. It is not possible to log into it from gdm or the text console. The host responds to ping as expected, but rsh and ssh give no feedback; telnet displays the welcome message, then stops. On at least one of the occasions when this happened, the system log contained a "kernel BUG" message; the full output is included below.

Version-Release number of selected component (if applicable):
kernel-2.4.18-10 (athlon)

How Reproducible:
Happens frequently, perhaps most often when the system is idle (but possibly running cron jobs etc.)

Steps to Reproduce:
1. Run the system for a while...
2. Leave it on overnight
3. Try to log in the next morning

Actual Results:
See above. I'm not sure if this always happens, but like I said, the problem seems most likely to occur when nobody is working actively on the host.

Expected Results:
Obvious, isn't it.

Additional Information:
I've seen similar behaviour on other systems, but that was a long time ago; we have other systems with the same type of mainboard and CPU, and they work just fine.

Syslog messages:
Sep 25 04:06:18 ringerike kernel: kernel BUG at dcache.c:362!
Sep 25 04:06:18 ringerike kernel: invalid operand: 0000
Sep 25 04:06:18 ringerike kernel: ide-cd cdrom es1371 ac97_codec gameport soundcore parport_pc lp parport binfmt
Sep 25 04:06:18 ringerike kernel: CPU: 0
Sep 25 04:06:18 ringerike kernel: EIP: 0010:[<c014b79c>] Not tainted
Sep 25 04:06:18 ringerike kernel: EFLAGS: 00010282
Sep 25 04:06:18 ringerike kernel:
Sep 25 04:06:18 ringerike kernel: EIP is at prune_dcache [kernel] 0xac (2.4.18-10)
Sep 25 04:06:18 ringerike kernel: eax: 0000001c ebx: c2762c58 ecx: 00000001 edx: 000021df
Sep 25 04:06:18 ringerike kernel: esi: c2762c40 edi: c1387130 ebp: 00000061 esp: c13bdf64
Sep 25 04:06:18 ringerike kernel: ds: 0018 es: 0018 ss: 0018
Sep 25 04:06:18 ringerike kernel: Process kswapd (pid: 5, stackpage=c13bd000)
Sep 25 04:06:18 ringerike kernel: Stack: c022e3ba 0000016a c13bc000 00000000 00000000 ffffffff c02d03c8 00000000
Sep 25 04:06:18 ringerike kernel: 00000000 000001c3 c0131863 000001d0 000001c3 00000000 00000000 c014bb60
Sep 25 04:06:18 ringerike kernel: 000005e6 c013203c 00000006 000001d0 000001d0 c13bc000 00000000 00000000
Sep 25 04:06:18 ringerike kernel: Call Trace: [<c0131863>] page_launder [kernel] 0x2b3
Sep 25 04:06:19 ringerike kernel: [<c014bb60>] shrink_dcache_memory [kernel] 0x20
Sep 25 04:06:19 ringerike kernel: [<c013203c>] do_try_to_free_pages [kernel] 0x1c
Sep 25 04:06:19 ringerike kernel: [<c0132331>] kswapd [kernel] 0x101
Sep 25 04:06:19 ringerike kernel: [<c0105000>] stext [kernel] 0x0
Sep 25 04:06:19 ringerike kernel: [<c0107136>] kernel_thread [kernel] 0x26
Sep 25 04:06:19 ringerike kernel: [<c0132230>] kswapd [kernel] 0x0
Sep 25 04:06:19 ringerike kernel:
Sep 25 04:06:19 ringerike kernel:
Sep 25 04:06:19 ringerike kernel: Code: 0f 0b 5f 58 8d 4e 10 8b 51 04 8b 46 10 89 50 04 89 02 89 4e
Created attachment 77184 [details] Output of 'dmidecode'
Same kernel BUG() here. In the middle of the night, when the machine had nothing else to do except run the daily cron jobs (the Red Hat default ones), our server crashed.

System info:
RedHat 7.3 with latest patches (as of 1-Sep-2002)
Kernel 2.4.18-10
UP Pentium II (Klamath) 266, stepping 4

The kernel was updated from 2.4.18-5 on 20-Sep-2002, and had been running stable for a long time.

After the crash, I could still log in remotely using ssh (the X console was hung) and issue some commands (dmesg, less /var/log/messages), but then using piped output (ps ax | less) hung the session, and after that new ssh sessions hung too. Here's what I managed to gather before the total hang:

dmesg
=====
kernel BUG at dcache.c:362!
invalid operand: 0000
nls_iso8859-1 sb sb_lib uart401 sound soundcore parport_pc lp parport autofs n
CPU: 0
EIP: 0010:[<c014a04c>] Not tainted
EFLAGS: 00010282
EIP is at prune_dcache [kernel] 0xac (2.4.18-10)
eax: 0000001c ebx: cd8be8b8 ecx: 00000001 edx: 00003343
esi: cd8be8a0 edi: c1387130 ebp: 0000005b esp: c1393f64
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c1393000)
Stack: c022961a 0000016a c1392000 00000000 00000000 ffffffff c02c7ae8 00000000
       00000000 00000161 c0130133 000001d0 00000161 00000000 00000000 c014a410
       00000502 c013090c 00000006 000001d0 000001d0 c1392000 00000000 00000000
Call Trace: [<c0130133>] page_launder [kernel] 0x2b3
[<c014a410>] shrink_dcache_memory [kernel] 0x20
[<c013090c>] do_try_to_free_pages [kernel] 0x1c
[<c0130c01>] kswapd [kernel] 0x101
[<c0105000>] stext [kernel] 0x0
[<c0107136>] kernel_thread [kernel] 0x26
[<c0130b00>] kswapd [kernel] 0x0
Code: 0f 0b 5f 58 8d 4e 10 8b 51 04 8b 46 10 89 50 04 89 02 89 4e

ps axl
======
F    UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY       TIME COMMAND
100    0     1     0  15   0   1368   432 wakeup D    ?        0:07 init
040    0     2     1  15   0      0     0 contex SW   ?        0:00 [keventd]
040    0     3     1  15   0      0     0 schedu SW   ?        0:02 [kapmd]
040    0     4     1  34  19      0     0 ksofti SWN  ?        0:01 [ksoftirqd_
144    0     5     1  16   0      0     0 do_exi Z    ?        1:24 [kswapd <de
040    0     6     1  15   0      0     0 bdflus SW   ?        0:00 [bdflush]
040    0     7     1  15   0      0     0 schedu SW   ?        0:02 [kupdated]
040    0     8     1  25   0      0     0 md_thr SW   ?        0:00 [mdrecovery
040    0    14     1  25   0      0     0 down_i SW   ?        0:00 [scsi_eh_0]
040    0    17     1  15   0      0     0 end    SW   ?        0:51 [kjournald]
040    0    96     1  15   0      0     0 end    SW   ?        0:00 [khubd]
040    0   222     1  15   0      0     0 end    SW   ?        0:00 [kjournald]
040    0   640     1  15   0   1428   488 wakeup D    ?        0:02 syslogd -m
140    0   645     1  15   0   1364   440 do_sys S    ?        0:00 klogd -x
140   32   665     1  15   0   1508   424 schedu S    ?        0:00 portmap
140   29   693     1  18   0   1560   572 schedu S    ?        0:00 rpc.statd
040    0   790     1  15   0      0     0 end    SW   ?        1:01 [rpciod]
040    0   791     1  18   0      0     0 schedu SW   ?        0:00 [lockd]
140    0   812     1  15   0   1360   404 schedu S    ?        0:00 /usr/sbin/a
140   38   832     1  15   0   1884  1876 wakeup DL   ?        0:18 ntpd -U ntp
140    0   884     1  15   0   2624   824 schedu S    ?        0:21 /usr/sbin/s
140    0   917     1  15   0   2200   584 schedu S    ?        0:00 xinetd -sta
140    0   927     1  15   0   2080   824 wakeup D    ?        0:00 /usr/knox/b
040    4   952     1  15   0   7200   836 wakeup D    ?        0:15 lpd Waiting
140    0   983     1  15   0   4608   984 wakeup D    ?        0:16 sendmail: a
140    0  1002     1  15   0   1400   380 schedu S    ?        0:19 gpm -t ps/2
040    1  1021     1  15   0   1896   504 wakeup D    ?        0:00 cannaserver
040    0  1039     1  15   0   1536   544 wakeup D    ?        0:01 crond
140   43  1091     1  15   0   5840  2088 wakeup D    ?        0:08 xfs -droppr
040    2  1127     1  15   0   1404   456 schedu S    ?        0:00 /usr/sbin/a
100    0  1150     1  15   0   1344   328 schedu S    tty1     0:00 /sbin/minge
100    0  1151     1  16   0   1344   328 schedu S    tty2     0:00 /sbin/minge
100    0  1152     1  16   0   1344   328 schedu S    tty3     0:00 /sbin/minge
100    0  1153     1  16   0   1344   328 schedu S    tty4     0:00 /sbin/minge
100    0  1154     1  16   0   1344   328 schedu S    tty5     0:00 /sbin/minge
100    0  1155     1  16   0   1344   328 schedu S    tty6     0:00 /sbin/minge
100    0  1156     1  15   0   2628   780 schedu S    ?        0:00 /usr/bin/kd
100    0  1164  1156  15   0  31780  6504 wakeup D    ?     6731:35 /usr/X11R6
140    0  1165  1156  16   0   3268   712 wait4  S    ?        0:00 -:0
100  864  1177  1165  15   0   2216   836 wait4  S    ?        0:00 /bin/sh /us
040  864  1261     1  15   0  19468  3400 schedu S    ?        0:00 kdeinit: Ru
040  864  1264     1  15   0  19392  3800 schedu S    ?        0:03 kdeinit: dc
040  864  1267     1  15   0  20440  4364 schedu S    ?        0:00 kdeinit: kl
040  864  1269     1  15   0  20672  4824 wakeup D    ?       81:32 kdeinit: kd
040  864  1289     1  15   0  23252  4652 schedu S    ?        0:27 kdeinit: kn
000  864  1290  1177  15   0   1416   272 schedu S    ?        0:08 kwrapper ks
040  864  1292     1  15   0  20432  4300 schedu S    ?        0:01 kdeinit: ks
040  864  1312  1261  15   0  21996  6232 schedu S    ?        0:55 kdeinit: kw
040  864  1314     1  15   0  23068  7484 wakeup D    ?        5:37 kdeinit: kd
040  864  1317     1  15   0  23440  7444 wakeup D    ?       18:06 kdeinit: ki
000  864  1319  1261  15   0   1804   424 schedu S    ?       18:37 autorun -l
040  864  1322     1  15   0  20784  4848 wakeup D    ?        7:04 kdeinit: kl
040  864  1325     1  15   0  21188  4420 schedu S    ?        0:01 kdeinit: kw
040  864  1329     1  15   0  20528  4716 schedu S    ?        0:01 korgac --mi
040  864  1332     1  15   0  20260  4232 schedu S    ?        0:01 kalarmd --l
040  864  8938     1  15   0  20488  4364 schedu S    ?        0:01 kdeinit: kc
040  864  8940     1  15   0  13664  2336 wakeup D    ?        0:01 kdesud
140    0 24115     1  15   0   4512   900 schedu S    ?        0:00 smbd -D
140    0 24120     1  15   0   3612   904 wakeup D    ?        0:50 nmbd -D
040  864 25329     1  15   0  21068  6160 schedu S    ?        0:00 kdeinit: ki
000  864 25341     1  15   0  10076  3272 wakeup D    ?       90:07 artsd -F 10
040  864 25342 25341  15   0  10076  3272 wakeup D    ?        0:01 artsd -F 10
000  864  1188  1261  15   0  33664  5900 wakeup D    ?       17:46 xmms
040  864  1189  1188  15   0  33664  5900 wakeup D    ?        0:01 xmms
040  864  1190  1189  15   0  33664  5900 wakeup D    ?        0:00 xmms
040  864  1191  1189  15   0  33664  5900 schedu S    ?        0:15 xmms
000  987 20002     1  18   0   1772   496 wait4  S    ?        0:00 make -C tes
000  987 20005 20002  19   0   2200   804 wait4  S    ?        0:00 /bin/sh -c
000  987 20006 20005  15   0 211744 22592 wakeup D    ?        0:12 java -class
040  987 20007 20006  15   0 211744 22592 wakeup D    ?        0:00 java -class
040  987 20008 20007  15   0 211744 22592 schedu S    ?        0:04 java -class
040  987 20009 20007  15   0 211744 22592 rt_sig S    ?        0:00 java -class
040  987 20010 20007  15   0 211744 22592 rt_sig S    ?        0:00 java -class
040  987 20011 20007  15   0 211744 22592 schedu S    ?        0:44 java -class
040  987 20012 20007  20   0 211744 22592 rt_sig S    ?        0:00 java -class
040  987 20013 20007  20   0 211744 22592 rt_sig S    ?        0:00 java -class
040  987 20014 20007  15   0 211744 22592 rt_sig S    ?        0:05 java -class
040  987 20018 20007  15   0 211744 22592 wakeup D    ?        0:01 java -class
040  987 20021 20007  15   0 211744 22592 wakeup D    ?        0:02 java -class
040  987 20022 20007  15   0 211744 22592 wakeup D    ?        0:07 java -class
140    0 22150 24115  15   0   4920  1716 wakeup D    ?        0:00 smbd -D
000  864 22163  1314  15   0  20636  9808 schedu S    ?        0:01 /usr/bin/kd
000  864 22164 22163  34  19  17780  8372 wakeup DN   ?        5:12 kspace.kss
040    0 22622  1039  16   0   1548   644 pipe_w S    ?        0:00 CROND
100    0 22623 22622  15   0   1944   880 wait4  S    ?        0:00 /bin/bash /
000    0 22973 22623  16   0   1940   864 wait4  S    ?        0:00 /bin/sh /et
000    0 22974 22623  15   0   1740   616 pipe_w S    ?        0:00 awk -v prog
000    0 22976 22973  25   0   8388  5768 wakeup D    ?        0:51 /usr/sbin/t
100    0 22977   927  16   0   3424  2016 wakeup D    ?        0:00 /usr/knox/b
140    0 23026   884  15   0   3488  1708 schedu S    ?        0:00 /usr/sbin/s
100    0 23027 23026  15   0   2492  1316 wait4  S    pts/1    0:00 -bash
100    0 23069 23027  16   0   3128  1172 -      R    pts/1    0:00 ps axl
"System log on one of our AMD Athlon based hosts is somewhat unstable" - I can't believe I said that... What I meant was obviously that the host is unstable, and the system log contains error messages. Anyhow, I just wanted to point out that virtually all hangs, crashes etc. we've had on our Red Hat systems have occurred while the system was idle. Any idea why?
We've fixed a few bugs in this area and are working on an erratum, but QA takes time...
Any news on this? Would upgrading to kernel-2.4.18-18.7.x help?
A possibly related problem: The filesystem cache sometimes seems to get corrupted on this host - files appear to change for no apparent reason, but after 'sync + re-read' they look OK. Files are usually read from an NFS volume when this happens.
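
To help pin down when files "change for no apparent reason", here is a minimal sketch that repeatedly hashes a file and records every time the digest differs from the previous read. The function names (file_digest, watch_for_changes) and the polling parameters are hypothetical and not taken from any existing tool. The script only shows that two consecutive reads disagreed; it cannot tell which read was the correct one.

```python
import hashlib
import os
import time


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file's current contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch_for_changes(path, rounds=10, delay=1.0):
    """Hash `path` once per round; return (round, old, new) for each change.

    Hypothetical helper: a file that nobody writes to should yield an
    empty list. After a mismatch, os.sync() is called (the equivalent of
    running 'sync') before the next read.
    """
    changes = []
    previous = file_digest(path)
    for i in range(1, rounds + 1):
        time.sleep(delay)
        current = file_digest(path)
        if current != previous:
            changes.append((i, previous, current))
            os.sync()  # flush dirty buffers before the next re-read
        previous = current
    return changes
```

Calling watch_for_changes on a file on the NFS volume that is known not to be modified, with a larger rounds value, should return an empty list on a healthy host; any entries give the round at which the content appeared to change.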