Description of problem:
I'm seeing this consistently, one to at most about five hours after every reboot. It started with the .185?_FC5 kernels. The FC5t2 kernel didn't show the problem; it let me run out of memory after 3 days due to a leak.

When the problem starts, not all processes freeze at the same time. One time I was able to start a new process under strace. It was just dd, which would have created a file in /tmp. strace showed the process stuck in the open(O_CREAT) call. I could stop the process with Ctrl-C and retry, with the same result. It might be interesting to know that /tmp has its own partition on my system, mounted with noexec. I wiped the partition and re-ran mkfs, which neither found any problems with the disk nor cured the problem.

After the first process gets stuck the load sky-rockets, though top doesn't show any running processes. After a few minutes all processes are affected. I cannot say whether this is due to the load or because they all want to create files. The machine becomes unusable.

Version-Release number of selected component (if applicable):
2.6.15-1.185?_FC5 up to 2.6.15-1.1863_FC5

How reproducible:
not on demand, but reliably

Steps to Reproduce:
1. get my system:
   Intel ICH6 925X chipset, two SATA drives
   two RAID volumes: one RAID0, one RAID1 (see below)
   NVidia card which requires the binary driver due to twinview
2. work a bit
3. see system come to a halt

Actual results:
nothing works after some time

Expected results:
400+ days uptime (instead of on average 3 hours)

Additional info:

/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Dec 7 15:49:17 2004
     Raid Level : raid1
     Array Size : 51199040 (48.83 GiB 52.43 GB)
    Device Size : 51199040 (48.83 GiB 52.43 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Jan 21 09:42:55 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : cdaf33a1:d6834d33:92004050:4d985a0b
         Events : 0.10907486

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       17        1      active sync   /dev/sdb1

/dev/md1:
        Version : 00.90.03
  Creation Time : Tue Dec 7 15:55:10 2004
     Raid Level : raid0
     Array Size : 117876480 (112.42 GiB 120.71 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Dec 7 15:55:10 2004
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 256K

           UUID : 4a608e21:71b9722f:452d727c:17e95e5d
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8        8        0      active sync   /dev/sda8
       1       8       22        1      active sync   /dev/sdb6

rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
none /selinux selinuxfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
none /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw,noexec,data=ordered 0 0
none /dev/shm tmpfs rw,noexec 0 0
/dev/md0 /home ext3 rw,data=ordered 0 0
/dev/sdb2 /tmp ext3 rw,noexec,data=ordered 0 0
/dev/sda5 /usr ext3 rw,data=ordered 0 0
/dev/sda6 /var ext3 rw,data=ordered 0 0
/dev/sdb3 /var/tmp ext3 rw,noexec,data=ordered 0 0
/dev/md1 /work ext3 rw,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
automount(pid2537) /misc autofs rw 0 0
automount(pid2539) /net autofs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0

00:00.0 Host bridge: Intel Corporation 925X/XE Memory Controller Hub (rev 04)
00:01.0 PCI bridge: Intel Corporation 925X/XE PCI Express Root Port (rev 04)
00:1c.0 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 2 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #3 (rev 03)
00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #4 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB2 EHCI Controller (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d3)
00:1e.2 Multimedia audio controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio Controller (rev 03)
00:1f.0 ISA bridge: Intel Corporation 82801FB/FR (ICH6/ICH6R) LPC Interface Bridge (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801FR/FRW (ICH6R/ICH6RW) SATA Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: nVidia Corporation NV37GL [Quadro FX 330/Quadro NVS280] (rev a2)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751 Gigabit Ethernet PCI Express (rev 01)
A few more bits of information:

- the machine remains ping-able
- I can switch to the text console (at least the times I tried it)
- logins are never possible, neither locally nor via ssh
- sometimes top can start, even multiple times. Sometimes it hangs before printing anything
- the strace-a-dd-process test I mentioned in the report is also not reliable. Sometimes nothing is shown at all (no message from strace)
- often the processes cannot be killed (stuck in D state). No Ctrl-C, no Ctrl-Z, no kill command
- I may simply not have noticed this before, since my machine never crashed this often in recent years, but the RAID code seems slow. My md0 device (RAID1, for /home, 50G in size of which only about 900M are filled) recovers very slowly. The reconstruction in the background takes 30 mins or more. During this time the load is > 2 and the system practically crawls. I haven't noticed this before the .185? kernels. smartd isn't running because it cannot work with SATA, but the drives do not report any problems.
- I cannot really go back to an older kernel because I need to compile the nvidia module, which in that case would require gcc 4.0. That's too much to download over my slow connection.
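
For anyone who wants to confirm the "stuck in D state" observation without relying on top, here is a minimal sketch that scans /proc/<pid>/stat and lists processes in uninterruptible sleep. The function names are made up for illustration, and the script only helps if it is already running (or can still be started) when the hang sets in.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unparsable
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_processes() every few seconds from a shell that was opened before the hang should show which process gets stuck first.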
On another machine (x86, P4 HT) I experience hangs as well. I cannot say whether they are the same, but from the outside they sure look like it. There is even a possibility that I can provide a reproducer (will try this later). Anyhow, on that machine I got a backtrace (I don't remember which kernel version it was):

BUG: write_lock lockup
_raw_write_lock+0x7d
eax A5F88A00 EBX C13EDAF8 ECX 00000001 EDX CFFB1988
ESI 09F3B049 EDI 00000000 EBP CFFB1988
c017b466 set_fd_pwd+0x18
c016f1fd permission+0x8e
c0162df2 sys_fchdir+0x5f
c0103d21 syscall_call+0x7

The reproducer uses some of the new *at syscalls. Hopefully more later.
That's of course set_fs_pwd in the backtrace.
Created attachment 124171
add missing unlock

This patch should fix at least my second machine's hangs. A lock wasn't unlocked in case of an error. The patch is against the current upstream kernel.
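
The attachment itself is not reproduced in this thread, so the following is only an illustration of the class of bug described (an error path that returns while still holding a lock), written in Python rather than kernel C. The function names and the "negative value is an error" rule are invented for the example.

```python
import threading

lock = threading.Lock()

def update_buggy(value):
    """Error path returns early and leaves the lock held."""
    lock.acquire()
    if value < 0:
        return -1  # BUG: missing lock.release() before returning
    lock.release()
    return 0

def update_fixed(value):
    """Same logic, but the lock is released on every exit path."""
    lock.acquire()
    try:
        if value < 0:
            return -1
        return 0
    finally:
        lock.release()
```

After update_buggy(-1) the lock stays held, so the next caller that tries to take it blocks forever, which from the outside looks much like the lockup in the backtrace above.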
As expected, the 1909 kernel which includes the patch I attached here fixes the problem for my x86 machines. I'll now investigate whether it fixes my x86-64 machines as well.
It didn't take long to find out this patch does not fix the original problem. I.e., the x86 freeze was a separate problem. Not really surprising as I never thought the x86-64 machine was using the new syscalls at the time the machine was freezing. Anyway, I got the null modem cable connected now.
Well, when the machine gets stuck there is no output on the serial console. The machine is still pingable but not even the screen saver password widget comes up when I press a key. Switching to the text console is also not working. I couldn't get the sysrq magic to work over the serial console, even when the machine was working normally. Maybe the problem is that I'm using minicom at the other end? Any hints on what to do?
I have a similar (same?) problem with the latest kernels on x86_64. It looks like all disk I/O on the machine gets stuck. No new commands succeed for me, and even dmesg fails, so it might be a different issue. Existing ssh connections to remote systems keep working, so the system isn't totally dead. I am using an md device for my LVM volume group (see below). My other x86_64 system (without an md device) seems quite stable so far, which is weird. I wonder if it's md causing the problem.

minicom should be OK AFAIK; you are using CTRL-a f sysrqcommand, right? Unfortunately I cannot try with a serial connection, but I'll try to reproduce the problem while on the console to see if I can get a backtrace. BTW, you can use smartd on SATA drives with recent kernels now.

/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
none /selinux selinuxfs rw 0 0
/dev/devpts /dev/pts devpts rw 0 0
/dev/md1 /boot ext3 rw,data=ordered 0 0
/dev/shm /dev/shm tmpfs rw 0 0
/dev/rootvg/optvol /opt ext3 rw,data=ordered 0 0
/dev/rootvg/tmpvol /tmp ext3 rw,data=ordered 0 0
/dev/rootvg/usrvol /usr ext3 rw,data=ordered 0 0
/dev/rootvg/varvol /var ext3 rw,data=ordered 0 0
/dev/datavg/scratchvol /srv/data/scratch ext3 rw,noatime,data=writeback 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
automount(pid2380) /home autofs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0
nfssrv.example.com:/home/username /home/username nfs4 rw,nosuid,v4,rsize=32768,wsize=32768,hard,intr,lock,proto=tcp,addr=example.com 0 0
The problem in comment 8 certainly sounds like the same one. Whether ssh works or not probably depends on the devices used. If it has to access a file on a device which blocks, it's game over. The commonality is x86-64 (probably 64-bit-ness). I'm not using LVM, but two partitions are RAID. And yes, I used the minicom sequence to send break, but it has no effect. No idea why. I surely enabled sysrq handling. As for smartd, no, it does not seem to work.
My LVM's physical volume is on RAID as well, which is why I suspect the md code. My CPU is an Athlon 64 with an ATI graphics card (I use the xorg driver), so I think it's unlikely that this is a hardware driver problem. I am back on the 1863 kernel at the moment and the machine has been up for 13 hours without any problems, btw. I really have no idea why sysrq over serial isn't working, I am afraid :( I cannot find the original bugzilla entry about the SMART passthru in libata, but have a look at #174095. smartctl -d ata -a /dev/sd[ab] works fine for me.
Created attachment 124297
sysrq-t output after the system hangs

I managed to get sysrq working over the serial console. The attached file is the output of the entire system run.
I'm on
The same thing happens on my system (i386, Pentium-D "dual processor" system). Without looking very closely, I too have suspected the md driver. Reading /proc/mdstat always hangs when it's happening.
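
Since a blocked read of /proc/mdstat seems to be a reliable symptom, it could be used as a probe. Below is a rough sketch that reads a file in a child process and gives up after a timeout, so the caller does not get stuck as well. The function name and the 5 second default are arbitrary choices for illustration, and a child that is blocked in D state cannot actually be killed, so it is simply left behind.

```python
import subprocess
import sys

# One-liner run in the child: copy the file named in argv[1] to stdout.
_READER = "import sys; sys.stdout.write(open(sys.argv[1]).read())"

def read_with_timeout(path, seconds=5.0):
    """Return the file contents, or None if the read fails or times out."""
    child = subprocess.Popen(
        [sys.executable, "-c", _READER, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = child.communicate(timeout=seconds)
    except subprocess.TimeoutExpired:
        # Best effort only: the signal has no effect while the child is in
        # D state, and we deliberately do not wait for it afterwards.
        child.kill()
        return None
    return out if child.returncode == 0 else None
```

A monitoring loop that is started right after boot could call read_with_timeout("/proc/mdstat") once a minute and log the first time it returns None, which would at least timestamp the start of the hang.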
I have the 1928 kernel running on my machine for almost 2 days now. Seems the problem is fixed. Ingo mentioned some RAID1 problem which has been fixed upstream recently. My bet is that this is the solution.
FWIW 1928 fixed it for me too. (Also running RAID1)
1939+ seems fine for me as well, time to close the bug I guess.