Description of problem:
We use MySQL with big files (up to 130 GB) and many small text files (about 10000 files of up to 10 KB each). Under high load the kernel reports problems:
"
Oct 1 03:47:48 XXXX kernel: 08:11: rw=0, want=1357438156, limit=1144856128
Oct 1 03:47:48 XXXX kernel: attempt to access beyond end of device
Oct 1 03:47:48 XXXX kernel: 08:11: rw=0, want=1556136140, limit=1144856128
Oct 1 03:47:48 XXXX kernel: attempt to access beyond end of device
Oct 1 03:47:48 XXXX kernel: 08:11: rw=0, want=1422444748, limit=1144856128
Oct 1 03:47:48 XXXX kernel: attempt to access beyond end of device
Oct 1 03:47:48 XXXX kernel: 08:11: rw=0, want=1353769156, limit=1144856128
Oct 1 03:47:48 XXXX kernel: attempt to access beyond end of device
"
We were not running applications like 'dd' when these messages appeared, so it is very strange to see them. Later we found data corruption in our files. The next time, the kernel printed these lines:
"
Oct 19 04:03:04 hostdb kernel: EXT3-fs error (device sd(8,5)): ext3_readdir: bad entry in directory #48110: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Oct 19 10:20:50 hostdb kernel: EXT3-fs error (device sd(8,5)): ext3_readdir: bad entry in directory #48110: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Oct 19 11:18:03 hostdb kernel: GDT: Unknown SCSI command 0x4d to cache service !
Oct 19 11:18:03 hostdb last message repeated 4 times
Oct 19 11:18:33 hostdb kernel: GDT: Unknown SCSI command 0x4d to cache service !
Oct 19 11:18:33 hostdb last message repeated 4 times
Oct 19 11:23:55 hostdb kernel: GDT: Unknown SCSI command 0x4d to cache service !
Oct 19 11:23:55 hostdb last message repeated 4 times
Oct 19 11:32:16 hostdb kernel: EXT3-fs error (device sd(8,5)): ext3_readdir: bad entry in directory #48110: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
"
We don't know how to reproduce this problem, but we have had it twice in the last year.
Usually we can't recover our storage and just make a fresh ext3 filesystem.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 3 (Update 3)
kernel-smp-2.4.21-20.EL.i686
qlogic 7.07.06

How reproducible:
n/a

Steps to Reproduce:
1. n/a
2.
3.

Actual results:
data corruption
filesystem corruption

Expected results:
A normally working system without internal filesystem problems or data loss.

Additional info:
Our system is based on 4 Intel Xeon 2.4 GHz CPUs and 2 GB RAM. For data storage we use a RAID on a qlogic 2x00 Fiber Controller, and the system itself is installed on a RAID:
GDT: Storage RAID Controller Driver. Version: 2.05
GDT: Found 1 PCI Storage RAID Controllers
GDT CTR0: Configuring GDT-PCI HA at 4/8 IRQ 48
scsi0 : SRCZCR
Vendor: Intel Model: Host Drive #00 Rev:
Type: Direct-Access ANSI SCSI revision: 02
Vendor: ESG-TSD Model: SCA HSBP M23 Rev: 1.05
Type: Processor ANSI SCSI revision: 02
Additional info:
major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq
8 0 143331930 sda 1872107 19531040 171139390 14335430 4663839 35925827 324919600 3514019 0 20869670 17958709
8 1 152586 sda1 538 9401 19878 1950 77 67 288 4650 0 4430 6600
8 2 5116702 sda2 1163371 17126186 146315778 9829230 896466 22965737 191014184 28593164 0 6263820 38430804
8 3 4610655 sda3 128015 294559 3380162 720500 286994 232944 4159568 5437350 0 2162110 6157730
8 4 1 sda4 3 0 6 10 0 0 0 0 0 10 10
8 5 4096543 sda5 17919 59512 619066 205500 81609 97295 1431240 2240840 0 2020740 2446320
8 6 3582463 sda6 41009 164807 1645850 1066920 191747 667659 6921520 7832120 0 2570280 8900060
8 7 3582463 sda7 14912 63858 629482 219230 686481 1076459 14107176 24504980 0 5006930 24724140
8 8 2096451 sda8 287339 125 2299424 1144290 283704 291429 4614664 42280490 0 9861480 568347
8 9 2048256 sda9 36446 36630 584642 186280 446021 1683332 17036272 8172990 0 5074990 8359200
8 10 1534176 sda10 65680 121989 1500882 334020 404936 991110 11170984 6968230 0 5252400 7302190
8 11 4891761 sda11 5269 37066 338002 35130 655381 511099 9335608 39476730 0 2063580 39511880
8 12 9775521 sda12 110836 1614095 13798770 589460 730423 7408696 65128096 9801507 0 6681620 10393727
8 16 1144860672 sdb 7436866 77894686 642758068 25197464 13030831 79000796 628373272 36912771 0 24654780 19191062
8 17 1144856128 sdb1 7436852 77894652 642757972 25198124 13030831 79000796 628373272 36913051 0 24654810 19195572
1144856128

Our server uses fixed devices and we did not change any devices on the running system. I don't know why our external RAID moved from 8:17 -> 8:11. Maybe something forced a revalidation of the partitions? But how?
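A note on the 8:17 -> 8:11 question: 2.4 kernels print the device in these messages as two hexadecimal fields, so "08:11" most likely means major 8, minor 0x11 = 17, which is sdb1 and not sda11. That reading fits the data in this report, since limit=1144856128 equals the #blocks value of sdb1. The sketch below decodes one such line under that assumption; the function name `decode_beyond_end` and the regular expression are illustrative, not part of any kernel tool.

```python
import re

# Matches the 2.4-style message body, e.g.
#   "08:11: rw=0, want=1357438156, limit=1144856128"
# Assumption: the two device fields are two-digit hexadecimal numbers.
MSG_RE = re.compile(
    r"\b([0-9a-fA-F]{2}):([0-9a-fA-F]{2}): rw=(\d+), want=(\d+), limit=(\d+)"
)


def decode_beyond_end(line):
    """Return the decoded fields of one log line, or None if it does not match."""
    m = MSG_RE.search(line)
    if m is None:
        return None
    want, limit = int(m.group(4)), int(m.group(5))
    return {
        "major": int(m.group(1), 16),  # hexadecimal -> decimal
        "minor": int(m.group(2), 16),
        "rw": int(m.group(3)),
        "want": want,
        "limit": limit,
        "over": want - limit,  # how far past the end of the device
    }


if __name__ == "__main__":
    sample = "Oct 1 03:47:48 XXXX kernel: 08:11: rw=0, want=1357438156, limit=1144856128"
    info = decode_beyond_end(sample)
    print(info["major"], info["minor"], info["over"])
```

If this reading is right, the external RAID never moved: the messages refer to the same 8:17 device that /proc/partitions lists as sdb1, and the requests simply point far past its end.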
Did you fsck the filesystem? What did it find?
e2fsck -fn /dev/sdb1
e2fsck 1.32 (09-Nov-2002)
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Inode 32774, i_size is 136314880, should be 144703488. Fix? no
Inode 32774, i_blocks is 266512, should be 282912. Fix? no
Inode 37218 has illegal block(s). Clear? no

I can't reproduce the fsck output now because we made a fresh filesystem (on 22 October).

tune2fs -l /dev/sdb1
tune2fs 1.32 (09-Nov-2002)
Filesystem volume name: MAIN_ARCHIVE
Last mounted on: <not available>
Filesystem UUID: 36e50578-61fa-43c3-a4ff-fd77497a90e9
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal filetype needs_recovery sparse_super large_file
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 143114240
Block count: 286214032
Reserved block count: 14310701
Free blocks: 264838496
Free inodes: 143035334
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16384
Inode blocks per group: 512
Filesystem created: Mon Oct 22 05:12:03 2007
Last mount time: Mon Oct 22 08:21:04 2007
Last write time: Mon Oct 22 08:21:04 2007
Mount count: 2
Maximum mount count: 20
Last checked: Mon Oct 22 05:12:03 2007
Check interval: 864000 (1 week, 3 days)
Next check after: Thu Nov 1 04:12:03 2007
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal UUID: <none>
Journal inode: 8
Journal device: 0x0000
First orphan inode: 0
Usually these messages mean the filesystem thinks it is bigger than the underlying disk and is trying to access a spot outside of the disk. Have you figured out a way to reproduce this on a regular basis? Have you reproduced it on the U9 kernel?
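The "filesystem bigger than the disk" theory can be checked against the numbers already posted in this report. The sketch below is only an illustration: it assumes that want= and limit= are in the same 1 KiB units as the #blocks column of /proc/partitions, and the helper names are made up for this example. Note that the tune2fs output describes the fresh filesystem created on 22 October, so it says nothing about what the corrupted filesystem claimed as its size.

```python
def fs_size_kib(block_count, block_size):
    """Filesystem size in KiB from tune2fs 'Block count' and 'Block size'."""
    return block_count * block_size // 1024


def requests_beyond_device(wants_kib, device_kib):
    """Return the requested offsets (KiB) that lie past the end of the device."""
    return [w for w in wants_kib if w > device_kib]


# Values copied from this report.
DEVICE_KIB = 1144856128                # sdb1 '#blocks' in /proc/partitions
FS_KIB = fs_size_kib(286214032, 4096)  # tune2fs output of the fresh filesystem
WANTS = [1357438156, 1556136140, 1422444748, 1353769156]  # 'want=' values

if __name__ == "__main__":
    print("filesystem fits device:", FS_KIB <= DEVICE_KIB)
    print("requests past the end:", requests_beyond_device(WANTS, DEVICE_KIB))
```

With these figures the fresh filesystem is exactly as large as sdb1, so its superblock does not overstate the size. Every logged request nevertheless lies well past the end of the device, which would be more consistent with bad block numbers in metadata than with a wrong filesystem size, although the report does not contain enough data to confirm that.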
No. Since we switched to the U9 kernel we have not seen any problems with ext3.
Ok I'm going to close this out. Feel free to re-open it if you run into any more issues.